
Understanding generalization error of SGD in nonconvex optimization



The success of deep learning has led to a rising interest in the generalization properties of the stochastic gradient descent (SGD) method, and stability is one popular approach to studying them. Existing generalization bounds based on stability do not incorporate the interplay between the optimization of SGD and the underlying data distribution, and hence cannot even capture the effect of randomized labels on the generalization performance. In this paper, we establish generalization error bounds for SGD by characterizing the corresponding stability in terms of the on-average variance of the stochastic gradients. Such characterizations lead to improved bounds on the generalization error of SGD and experimentally explain the effect of random labels on the generalization performance. We also study the regularized risk minimization problem with strongly convex regularizers, and obtain improved generalization error bounds for proximal SGD.

Introduction

Many machine learning applications can be formulated as risk minimization problems, in which each data sample z ∈ R^p is assumed to be generated by an underlying multivariate distribution D. The loss function l(·; z) : R^d → R measures the performance on the sample z, and its form depends on the specific application, e.g., the square loss for linear regression, the logistic loss for classification, and the cross-entropy loss for training deep neural networks. The goal is to solve the following population risk minimization (PRM) problem over a parameter space Ω ⊂ R^d:

min_{w ∈ Ω} f(w) := E_{z∼D} l(w; z).   (PRM)

Directly solving the PRM problem can be difficult in practice, as either the distribution D is unknown or evaluating the expectation of the loss function incurs a high computational cost. To avoid these difficulties, one usually draws a set of n data samples S := {z_1, ..., z_n} from the distribution D and instead solves the following empirical risk minimization (ERM) problem:

min_{w ∈ Ω} f_S(w) := (1/n) Σ_{i=1}^n l(w; z_i).   (ERM)
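To make the ERM setup above concrete, here is a minimal sketch (not from the paper) that runs plain SGD on the empirical risk f_S for a synthetic linear-regression problem with the square loss, and then estimates the generalization gap f(w) − f_S(w) on a large fresh sample. The synthetic Gaussian distribution D, the function names, and the step size are illustrative assumptions, not the paper's experimental setup.

# Minimal illustrative sketch (not from the paper): plain SGD on the empirical
# risk f_S for a synthetic linear-regression problem with the square loss,
# followed by a Monte Carlo estimate of the generalization gap f(w) - f_S(w).
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 200                              # data dimension and training-set size
w_true = rng.normal(size=p)                 # ground-truth parameter defining D

def sample(m):
    # Draw m i.i.d. samples z = (x, y) from the synthetic distribution D.
    X = rng.normal(size=(m, p))
    y = X @ w_true + 0.5 * rng.normal(size=m)
    return X, y

def risk(w, X, y):
    # Average square loss l(w; z) = 0.5 * (x^T w - y)^2 over the given samples.
    return 0.5 * np.mean((X @ w - y) ** 2)

def sgd(X, y, lr=0.01, epochs=50):
    # Plain SGD on the empirical risk f_S(w) = (1/n) * sum_i l(w; z_i).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # one stochastic gradient per sample
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

X_S, y_S = sample(n)                        # the training set S = {z_1, ..., z_n}
w_hat = sgd(X_S, y_S)
X_D, y_D = sample(100_000)                  # large fresh sample approximates E_{z~D}
print("empirical risk f_S(w):", risk(w_hat, X_S, y_S))
print("population risk f(w), approx.:", risk(w_hat, X_D, y_D))
print("generalization gap:", risk(w_hat, X_D, y_D) - risk(w_hat, X_S, y_S))

Replacing y_S with randomly drawn labels before training is one simple way to probe the random-label effect discussed in the abstract, though this toy sketch is only meant to illustrate the PRM/ERM definitions.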

Keywords: generalization error; generalization; SGD; loss

Journal Title: Machine Learning
Year Published: 2021
