A New Non-Convex Framework to Improve Asymptotical Knowledge on Generic Stochastic Gradient Descent

Stochastic gradient optimization methods are broadly used to minimize non-convex smooth objective functions, for instance when training deep neural networks. However, theoretical guarantees on the asymptotic behaviour of these methods remain scarce. In particular, ensuring almost-sure convergence of the iterates to a stationary point is quite challenging. In this work, we introduce a new Kurdyka-Łojasiewicz theoretical framework to analyze the asymptotic behaviour of stochastic gradient descent (SGD) schemes when minimizing non-convex smooth objectives. Specifically, our framework provides new almost-sure convergence results on iterates generated by any SGD method satisfying mild conditional descent conditions. We illustrate the proposed framework by means of several toy simulation examples. We highlight the role of the considered theoretical assumptions, and investigate how SGD iterates are impacted depending on whether these assumptions are fully or partially satisfied.


I. INTRODUCTION
We consider the unconstrained optimization problem

$$\underset{x \in \mathbb{R}^N}{\text{minimize}} \;\; F(x), \tag{1}$$

where $F \colon \mathbb{R}^N \to \mathbb{R}$ is a continuously differentiable function ($N \geq 1$) that is not necessarily convex. We focus on the challenging situation where the evaluation of the gradient of $F$ is subject to (stochastic) errors during the iterative resolution procedure. This typically arises in important scenarios of supervised machine learning, when $F$ is an expected loss to be minimized through access to data samples [16]. In such a context, the non-convexity of $F$ results from the use of nonlinear models, such as deep neural networks, for mapping the data features [8], [21].
The most popular approach to solve (1) in this context is probably stochastic gradient descent (SGD) [27] and its numerous variants [16], [21], [32]. SGD generates a stochastic sequence $(x_k)_{k \in \mathbb{N}}$ defined as

$$(\forall k \in \mathbb{N}) \quad x_{k+1} = x_k - \alpha_k g_k, \tag{2}$$

where $(g_k)_{k \in \mathbb{N}}$ is typically a random process defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ that approximates the true gradient $(\nabla F(x_k))_{k \in \mathbb{N}}$, and $(\alpha_k)_{k \in \mathbb{N}}$ is a positive stepsize sequence. Practical applications of such SGD schemes to supervised learning can be found for instance in [7], [20].
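As an illustration, the update (2) can be sketched in a few lines of Python for a scalar problem. This is only a minimal sketch: the function names `sgd` and `noisy_quadratic_grad`, the quadratic objective $F(x) = x^2/2$, and the Gaussian noise level are assumptions chosen for the example, not taken from the schemes studied in this paper.

```python
import random

def sgd(grad_oracle, x0, stepsizes, rng):
    """Iterate x_{k+1} = x_k - alpha_k * g_k and return the list of iterates."""
    x = float(x0)
    iterates = [x]
    for k, alpha in enumerate(stepsizes):
        g = grad_oracle(x, k, rng)  # stochastic approximation of F'(x_k)
        x = x - alpha * g
        iterates.append(x)
    return iterates

# Hypothetical oracle (assumption): F(x) = x^2 / 2, so F'(x) = x,
# perturbed by zero-mean Gaussian noise of standard deviation 0.1.
def noisy_quadratic_grad(x, k, rng):
    return x + rng.gauss(0.0, 0.1)

rng = random.Random(0)  # fixed seed for reproducibility
iterates = sgd(noisy_quadratic_grad, x0=2.0, stepsizes=[0.1] * 200, rng=rng)
print(abs(iterates[-1]))  # distance of the last iterate to the minimizer 0
```

Passing the oracle as a function keeps the loop independent of the noise model, so the same loop can be reused with other gradient approximations.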
In general, SGD schemes satisfy conditional descent properties on the sequence $(F(x_k))_{k \in \mathbb{N}}$, with respect to the natural filtration $(\mathcal{F}_k)_{k \in \mathbb{N}} = (\sigma(x_0, \ldots, x_k))_{k \in \mathbb{N}}$. Such properties can be obtained under technical assumptions on the noise on the approximated gradients and on the sequence $(\alpha_k)_{k \in \mathbb{N}}$ [4], [17], [28]. In particular, for most SGD schemes, $(F(x_k))_{k \in \mathbb{N}}$ satisfies an almost-supermartingale condition, and converges to a finite limit [28]. In addition, the gradient process $(\nabla F(x_k))_{k \in \mathbb{N}}$ (or a sub-sequence of it) converges to zero [4].
However, little is usually known about the asymptotic behaviour of the generated process $(x_k)_{k \in \mathbb{N}}$ itself. The machine learning literature typically focuses instead on the search for (fast) convergence rates [14], [15], [19], [29], and is often limited to the convex (or even strictly convex) setting. The lack of studies of stochastic algorithms in a non-convex framework calls for new theory in this context. In particular, it is necessary to leverage existing deterministic approaches, which typically rely on Kurdyka-Łojasiewicz (KL) properties [6], [22]. KL properties were initially introduced to study gradient flow problems in a continuous setting [1], [5]. Nevertheless, they have proven particularly efficient for improving the convergence guarantees of discrete optimization schemes when no convexity assumptions are made [2], [6], [11], [12], [26].
In this work, we introduce a new theoretical framework to prove almost sure (a.s.) convergence of the stochastic process $(x_k)_{k \in \mathbb{N}}$ generated by SGD schemes of the form (2) to a critical point of $F$, when $F$ is not necessarily convex. We show that this result applies to any SGD scheme satisfying mild conditional descent properties, when $F$ satisfies a Kurdyka-Łojasiewicz (KL) property [6], [22]. We further investigate empirically how the considered assumptions control the convergence behaviour of SGD schemes in practice, on some toy simulation examples. The current paper relies on our recent preprint [10]. The originality of the current work is to specialize the study to SGD, and to present a comprehensive numerical illustration of the results.
The remainder of the paper is organized as follows. Section II introduces the required theoretical tools. We present our main theoretical contribution in Section III. Numerical experiments are presented in Section IV, and Section V concludes the work.

II. THEORETICAL BACKGROUND

II-A. KL property
One key mathematical tool for proving convergence of non-convex deterministic optimization schemes is the KL property. It was initially introduced by Łojasiewicz [23] and Kurdyka [22], and has been at the core of major methodological developments in non-convex optimization analysis in the last decades, starting with the seminal papers [3], [5].
The definition of the KL property is given below.

Definition 1: (KL property) A differentiable function $F \colon \mathbb{R}^N \to \mathbb{R}$ satisfies the Kurdyka-Łojasiewicz (KL) property on $E \subset H$ if, for every $x \in E$, there exist a neighbourhood $V$ of $x$, $\zeta > 0$ and $\varphi \in \Phi_\zeta$ such that
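For reference, the KL inequality is usually written as follows in the deterministic literature (see, e.g., [2], [6]). This is the standard form for a differentiable $F$, under the assumption that $\Phi_\zeta$ denotes the usual class of concave continuous functions $\varphi \colon [0, \zeta) \to [0, +\infty)$ with $\varphi(0) = 0$, continuously differentiable on $(0, \zeta)$ with $\varphi' > 0$. The exact variant used in Definition 1 may differ in its details.

```latex
\varphi'\bigl(F(z) - F(x)\bigr)\,\|\nabla F(z)\| \;\geq\; 1
\qquad \text{for every } z \in V \text{ such that } F(x) < F(z) < F(x) + \zeta .
```

Intuitively, the function $\varphi$ reparametrizes the values of $F$ near $x$ so that the reparametrized function has a gradient norm bounded away from zero.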

II-B. Convergence analysis under KL property
Definition 1 was initially motivated in a continuous setting, through gradient flow analysis [1]. Indeed, the KL property promotes gradient trajectories of finite length converging to the origin. As a discrete counterpart of gradient flows, gradient descent (GD) algorithms are thus expected to follow a similar behaviour.
Let us consider Problem (1), where $F$ satisfies the KL property. Let $(x_k)_{k \in \mathbb{N}}$ be the sequence generated by a (deterministic) GD algorithm of the form

$$(\forall k \in \mathbb{N}) \quad x_{k+1} = x_k - \alpha_k \nabla F(x_k),$$

where, for every $k \in \mathbb{N}$, $\alpha_k > 0$ is a stepsize. If $(\nabla F(x_k))_{k \in \mathbb{N}}$ and $(x_{k+1} - x_k)_{k \in \mathbb{N}}$ are proportional, and if $\sum_{k=0}^{+\infty} \|\nabla F(x_k)\|^2 < +\infty$ (e.g., if $\alpha_k$ is small enough and $F$ is Lipschitz smooth), then we can deduce from the KL property that

$$\sum_{k=0}^{+\infty} \|x_{k+1} - x_k\| < +\infty.$$

This result can then be used to deduce that $(x_k)_{k \in \mathbb{N}}$ is a Cauchy sequence, and hence converges. Detailed examples of convergence analysis of gradient-based schemes under the KL property can be found in [2].
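The finite-length behaviour of GD iterates can be observed numerically on a simple scalar example. The sketch below is purely illustrative. It assumes a hypothetical test function $F(x) = x^2/2 + 2\cos(x)$, which is not taken from this paper. This function is non-convex and coercive, and $|F''| \leq 3$, so its derivative is 3-Lipschitz. The stepsize is set to $1/3$. The code does not use the KL inequality itself; it only accumulates the path length $\sum_k |x_{k+1} - x_k|$.

```python
import math

def F_prime(x):
    # Derivative of the hypothetical test function F(x) = x^2/2 + 2*cos(x)
    # (an assumption for this sketch; F'' = 1 - 2*cos(x) lies in [-1, 3]).
    return x - 2.0 * math.sin(x)

def gd_path_length(x0, alpha, n_iter):
    """Run x_{k+1} = x_k - alpha * F'(x_k).

    Returns the last iterate and the accumulated length sum_k |x_{k+1} - x_k|.
    """
    x, length = float(x0), 0.0
    for _ in range(n_iter):
        x_next = x - alpha * F_prime(x)
        length += abs(x_next - x)
        x = x_next
    return x, length

x_last, length = gd_path_length(x0=0.5, alpha=1.0 / 3.0, n_iter=500)
print(x_last, abs(F_prime(x_last)), length)
```

In this run the iterates increase monotonically towards a critical point, so the accumulated length equals the distance between the initial point and the limit.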
In the non-convex case, the KL property thus allows one to show that the limit point of $(x_k)_{k \in \mathbb{N}}$ exists and is a stationary point (i.e., the gradient vanishes there), under mild requirements, such as descent conditions, on $(F(x_k))_{k \in \mathbb{N}}$. Note that local convergence to global solutions can also be obtained when a good initialization (i.e., close enough to a global minimum of $F$) is considered [18].

II-C. Uniformized KL property
Definition 1 can be limiting in practice, as it is too 'local' to be easily manipulated. Recently, [6] introduced an alternative version of the KL property, which we recall in Definition 2 below.
Definition 2: (Uniformized KL property) Let $C$ be a compact set in $H$ and $F \colon H \to \mathbb{R}$ be a differentiable function constant on $C$, satisfying the KL property on $C$. Then, there exist $(\varepsilon, \zeta) \in (0, +\infty)^2$ and $\varphi \in \Phi_\zeta$ such that

A typical usage of Definition 2 is when $(F(x_k))_{k \in \mathbb{N}}$ is decreasing and $F$ is coercive. Then, $(F(x_k))_{k \in \mathbb{N}}$ converges to a finite limit while $(x_k)_{k \in \mathbb{N}}$ is guaranteed to be bounded. As a consequence, taking $C$ in Definition 2 as the set of cluster points of $(x_k)_{k \in \mathbb{N}}$ allows (4) to hold for all iterates from a certain rank onwards.

III. CONVERGENCE OF SGD FOR NON-CONVEX OBJECTIVES

III-A. Generic SGD scheme
Let us consider SGD schemes of the form (2) for solving (1). We denote by $(x_k)_{k \in \mathbb{N}}$ a stochastic process generated by (2). In our study, we assume that the following two conditions hold.
Assumption 1: $F$ is coercive and $\beta$-Lipschitz differentiable on $\mathbb{R}^N$.
Assumption 2: $F$ satisfies the KL property on the set of critical points of $F$. Moreover, this set can be written as a finite union of non-empty disjoint compact subsets.
Assumption 1 is a classical hypothesis in differentiable optimization [25]. In particular, it ensures the existence of a minimal value of $F$, denoted by $F_{\min}$. In contrast, Assumption 2 is specific to our non-convex context, as we do not make any convexity assumption on $F$. Omitting some technical details here, Assumption 2 is essential for us to obtain convergence results directly on the iterates, following a strategy similar to those adopted, e.g., in [2], [6], [11], but generalized to our stochastic framework. Note also that the geometry imposed on the critical set is only mildly restrictive.

III-B. Gradient approximation assumptions
Before establishing our main convergence result, we first introduce technical conditions on the stochastic approximations $(g_k)_{k \in \mathbb{N}}$ of the gradients $(\nabla F(x_k))_{k \in \mathbb{N}}$ involved in the SGD updates.
Assumption 3: There exist three deterministic nonnegative sequences and, for every

Although Assumption 3 may seem quite demanding, it actually gathers several typical scenarios:
• Assumptions 3-iv) and 3-v) relate to the first two moments of the noise on the gradient term. Assumption 3-iv) classically requires the noise to be unbiased, while Assumption 3-v) is a mild condition on the noise variance, generalizing many conditions encountered in the literature [7], [20], [31], [33]. A non-zero sequence $(a_k)_{k \in \mathbb{N}}$ typically models cases where $F$ has a gradient confusion bound [30].
• Assumptions 3-i) and 3-ii) are classical summability and non-zero rules, controlling the terms in Assumption 3-v). Assumption 3-ii) assumes a non-vanishing stepsize in the SGD update, and requires $(b_k)_{k \in \mathbb{N}}$ to be bounded.
• Assumption 3-iii) is a non-increasing condition which naturally appears in our convergence proof.
As shown in [10], Assumption 3 guarantees that the process satisfies some conditional descent properties. As a consequence, $(F(x_k))_{k \in \mathbb{N}}$ converges a.s. to an a.s. finite random variable $F_\infty$, and $(\nabla F(x_k))_{k \in \mathbb{N}}$ converges a.s. to $0_N$ (i.e., the zero vector of $\mathbb{R}^N$).
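The moment conditions discussed above can be checked empirically on a simple noise model. The sketch below is only illustrative and rests on several assumptions that are not taken from this paper: a scalar gradient, a noisy gradient of the form `grad * (1 + e1) + e2` with independent zero-mean Gaussian perturbations, and a second-moment bound of the common form $b\,|F'(x)|^2 + c$. The names `noisy_gradient`, `s1` and `sigma` are hypothetical.

```python
import random

def noisy_gradient(grad, rng, s1, sigma):
    """Return grad * (1 + e1) + e2 with independent zero-mean Gaussian e1, e2."""
    e1 = rng.gauss(0.0, s1)     # multiplicative perturbation (assumed model)
    e2 = rng.gauss(0.0, sigma)  # additive perturbation (assumed model)
    return grad * (1.0 + e1) + e2

rng = random.Random(1)  # fixed seed for reproducibility
grad, s1, sigma, n = 1.5, 2.0, 0.5, 200_000
samples = [noisy_gradient(grad, rng, s1, sigma) for _ in range(n)]

# Unbiasedness: the empirical mean should be close to the true gradient.
mean = sum(samples) / n
# Second moment: for this model it equals (1 + s1^2) * grad^2 + sigma^2.
second_moment = sum(g * g for g in samples) / n
b, c = 1.0 + s1 ** 2, sigma ** 2
print(mean, second_moment, b * grad ** 2 + c)
```

For this particular model the second moment is known in closed form, so the printed empirical value can be compared directly with $b\,|F'(x)|^2 + c$.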

III-C. Proposed KL analysis for stochastic framework
In the context of stochastic optimization, the use of the KL property is challenging. A first idea would be to apply the uniformized KL property (Definition 2) to any trajectory $(x_k(\omega))_{k \in \mathbb{N}}$, for every $\omega \in \Omega$. However, by doing so, $(\varepsilon, \zeta)$ and $\varphi$ would be random objects depending on $\omega$, whose analysis is very delicate. For instance, conditional expectation operations would become tricky, and measurability with respect to $(\mathcal{F}_k)_{k \in \mathbb{N}}$ would not be straightforward.
To overcome this challenge, we proposed in [10] a new extension of Definition 2, better adapted to a stochastic optimization framework.
Proposition 1: Under Assumptions 1-3, there exist a bounded concave function $\varphi$ and an a.s. finite positive discrete random variable $K$ such that

The advantage of Proposition 1 (whose detailed proof is given in [10]) lies in the random variable $K$, which concentrates all the stochastic information. As such, this new tool overcomes some technical obstacles raised in KL-based convergence analysis [10], and allows us to build a new convergence theorem, which we present hereafter in the SGD case.

III-D. Convergence result
We now introduce our main contribution, which is the almost-sure convergence result for the generic SGD scheme (2). Let us denote by $F_\infty$ the almost-sure finite limit of $(F(x_k))_{k \in \mathbb{N}}$ and by $\mathbb{E}_k[\cdot]$ the conditional expectation operator relative to $\mathcal{F}_k$, for $k \in \mathbb{N}$ (i.e., for a given integrable or positive random variable, $\mathbb{E}_k[\cdot]$ corresponds to its best approximation given all the information available on the process from state 0 to state $k$). We introduce, for all $(k, \gamma) \in \mathbb{N} \times (0, 1)$, the event:

Theorem 1: Under Assumptions 1, 2 and 3, if there exists $\gamma_0 \in (0, 1)$ such that $\mathbb{P}(\Xi_{\gamma_0,k}) = 1$ for all $k \in \mathbb{N}$, then $(x_k)_{k \in \mathbb{N}}$ almost surely converges to a critical point $x_\infty$ of $F$.
The equality $\mathbb{P}(\Xi_{\gamma_0,k}) = 1$ (for all $k \in \mathbb{N}$) means that the process $(x_k)_{k \in \mathbb{N}}$ is sufficiently well-behaved to satisfy a suitable descent condition and to approach its limit $F_\infty$ from above. Moreover, it also reflects a predictability condition: the conditional decrease as well as the difference $F_\infty - F_{\min}$ must be controlled with respect to the evolution of the gradient norm.
The complete proof of Theorem 1 can be found in [10]. It relies on the use of Proposition 1 as a cornerstone to establish that $(x_k)_{k \in \mathbb{N}}$ has finite length almost surely. To the best of our knowledge, Theorem 1 is one of the first results ensuring the almost-sure convergence of stochastic-gradient-type iterates in a non-convex setting.
Table I gives a few examples of state-of-the-art schemes directly satisfying our specific Assumption 3. This table is not exhaustive, and Assumption 3 could be satisfied by other algorithms, e.g., [8], [9] (see also [10] for proximal algorithms).

IV. NUMERICAL ILLUSTRATIONS
In this section we conduct experiments on a non-convex scalar problem, so as to illustrate the behaviour of the SGD algorithm when Assumptions 1, 2 and 3 are satisfied by the function $F$ and by the moments of its approximated gradient sequence. To do so, we gradually increase the complexity of the noise structure.

TABLE I: Examples of state-of-the-art algorithms satisfying the conditions of Theorem 1, hence ensuring their convergence in a non-convex setting. For the sake of readability, we use the same notation as the authors in their articles. Recoverable entries (last column: "Ass. 3-i) and ii)?"): SGD [31]: $\alpha$, 0, $B^2$, 0, Yes; SGD [13]: $\lambda$.

Throughout this study, we work with the following function $F$.

(6)

The graph of the function $F$ in (6) is illustrated in Figure 1. This function is non-convex, but Lipschitz-differentiable with Lipschitz constant $\beta = 2$. Hence Assumption 1 holds. Moreover, since the graph of $F$ is semi-algebraic, Assumption 2 is also verified [5].
In our simulations, we first verify numerically that $(x_k)_{k \in \mathbb{N}}$ converges almost surely to a stationary point of $F$, denoted by $x_\infty$. Then we analyse more specifically how $x_\infty$ is approached by $(x_k)_{k \in \mathbb{N}}$, ideally in such a way that there exists $\gamma_0 \in (0, 1)$ such that $\mathbb{P}(\Xi_{\gamma_0,k}) = 1$ for all $k \in \mathbb{N}$. In particular, we run experiments considering different settings for the sequences $(a_k)_{k \in \mathbb{N}}$, $(b_k)_{k \in \mathbb{N}}$ and $(c_k)_{k \in \mathbb{N}}$ appearing in Assumption 3.

IV-A. Experiments under a
In this section we investigate the case where the only non-zero sequence (except for the stepsize) involved in Assumption 3 is $(b_k)_{k \in \mathbb{N}}$. This kind of noise assumption is generally considered as a baseline in the stochastic optimization literature, as it is typically satisfied by the usual constant-stepsize SGD scheme when $F$ satisfies the Strong Growth condition [31]. Assumptions 3-i) to 3-iii) are then satisfied as soon as $(b_k)_{k \in \mathbb{N}}$ is a non-increasing sequence. As the latter has to be bounded to fulfil Assumption 3-ii), it is thus equivalent to take $b_k \equiv b$, for every $k \in \mathbb{N}$, to also satisfy Assumption 3-v).
The approximation sequence is generated empirically so as to satisfy both Assumptions 3-iv) and 3-v). Specifically, we set, for every $k \in \mathbb{N}$,

For the stepsize, we set, for every $k \in \mathbb{N}$, $\alpha_k = \alpha$ such that $\alpha b = \beta^{-1}$. Finally, we introduce two different perturbation levels to test the SGD algorithm in particularly extreme cases: a moderate perturbation $b = 10$, and an excessively high perturbation $b = 10^3$.
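A run in the spirit of this multiplicative-noise setting can be sketched as follows. The sketch relies on explicit assumptions. It does not use the objective (6); instead it uses a hypothetical stand-in $F(x) = x^2/2 + 2\cos(x)$, whose derivative is 3-Lipschitz. The noisy gradient is taken as $g_k = F'(x_k)(1 + e_k)$ with Gaussian $e_k$ of variance $b - 1$, so that $\mathbb{E}[(1 + e_k)^2] = b$. Only the stepsize rule $\alpha b = \beta^{-1}$ and the moderate level $b = 10$ come from the text; everything else is illustrative.

```python
import math
import random

BETA = 3.0  # Lipschitz constant of F' for the stand-in F(x) = x^2/2 + 2*cos(x)

def F_prime(x):
    # Derivative of the hypothetical stand-in objective (not the paper's F).
    return x - 2.0 * math.sin(x)

def run_sgd(x0, b, n_iter, rng):
    """Constant-stepsize SGD with g_k = F'(x_k) * (1 + e_k), E[(1 + e_k)^2] = b."""
    alpha = 1.0 / (BETA * b)        # stepsize rule alpha * b = 1 / beta
    noise_std = math.sqrt(b - 1.0)  # assumed Gaussian multiplicative noise
    x = float(x0)
    for _ in range(n_iter):
        g = F_prime(x) * (1.0 + rng.gauss(0.0, noise_std))
        x -= alpha * g
    return x

rng = random.Random(0)  # fixed seed for reproducibility
x_last = run_sgd(x0=0.5, b=10.0, n_iter=5000, rng=rng)
print(x_last, abs(F_prime(x_last)))
```

Because the noise is proportional to the gradient, it vanishes at critical points, which is why the iterates can settle exactly on a stationary point despite the constant stepsize.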
Table II shows the behaviour of the process $(x_k)_{k \in \mathbb{N}}$ for three different initializations $x_0 \in \{-1/2, 1, 4\pi + \epsilon\}$. The first initialization $x_0 = -1/2$ is located on the left of the saddle point $x = 0$ (see the graph of $F$ in Figure 1). The second initialization $x_0 = 1$ lies between the saddle point and the interval $[\pi, 3\pi]$ of global minimizers. The last one, $x_0 = 4\pi + \epsilon$, is in a small neighbourhood of the local maximizer $x = 4\pi$, taking $\epsilon = 10^{-5}$. In most scenarios, $(x_k)_{k \in \mathbb{N}}$ converges to a stationary point in such a way that $(F(x_k))_{k \in \mathbb{N}}$ remains above its limit. The only tricky case arises for the high perturbation level when $x_0 = 4\pi + \epsilon$, which seems to be too close to the local maximizer $x = 4\pi$.

IV-B. Experiments under $a_k := 0$
In this section, we no longer impose $(c_k)_{k \in \mathbb{N}}$ to be equal to zero. Such a situation is regularly encountered as a first relaxation of the noise model resulting from the Strong Growth condition [31]. One typical situation is when, for $k \in \mathbb{N}$, the difference between $g_k$ and $F'(x_k)$ follows a Gaussian distribution that remains independent of $\mathcal{F}_k$. We adopt such a model to conduct our investigation. More specifically, we choose, for all $k \in \mathbb{N}$,

Here, $e^2_k$ is normally distributed, with zero mean and standard deviation $\sigma_k > 0$, and does not depend on the process $(x_k)_{k \in \mathbb{N}}$, so as to have $c_k = 2\sigma_k^2$. Moreover, $e^1_k$ keeps the same properties as in Section IV-A. In order to easily verify Assumption 3 i)-

In our simulations, we choose $b = 10$ as a moderate level of multiplicative noise, and add $(c_k)_{k \in \mathbb{N}}$ as an additive one. Despite the higher complexity of the uncertainty model, we obtain slightly better results, as the process is able to escape from the saddle point or local minimizer in all runs (see Figure 2).
A summary of the behaviour of the process $(x_k)_{k \in \mathbb{N}}$ under these conditions is reported in Table III.
Here we consider the more generic case where none of the sequences $(a_k)_{k \in \mathbb{N}}$, $(b_k)_{k \in \mathbb{N}}$ and $(c_k)_{k \in \mathbb{N}}$ is identically zero.

V. CONCLUSION
In this article we introduce a new theoretical framework to study almost sure convergence of SGD schemes in a non-convex context. We further give numerical illustrations to investigate the behaviour of SGD processes, and the relevance of the different assumptions necessary to ensure their almost sure convergence. This work is based on the theoretical work we initially conducted in [10], where we introduced a new KL framework to investigate almost sure convergence of stochastic processes, in a smooth but non-convex context.