
On the Hyperlink Between Optimization and Polynomials, Half 5




Six: All of this has happened before.
Baltar: But the question remains, does all of this have to happen again?
Six: This time I bet no.
Baltar: You know, I've never known you to play the optimist. Why the change of heart?
Six: Mathematics. Law of averages. Let a complex system repeat itself long enough and eventually something surprising might occur. That, too, is in God's plan.

Battlestar Galactica *

Momentum with cyclical step-sizes has been shown to sometimes accelerate convergence, but why? We'll take a closer look at this technique, and with the help of polynomials unravel some of its mysteries.

$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\span}{\mathbf{span}}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}

\definecolor{colormomentum}{RGB}{27, 158, 119}
\definecolor{colorstepsize1}{RGB}{217,95,2}
\definecolor{colorstepsize2}{RGB}{117,112,179}
\definecolor{colorstepsize}{RGB}{152,78,163}
\def\mom{{\color{colormomentum}m}}
\def\stepzero{{\color{colorstepsize1}h_0}}
\def\stepone{{\color{colorstepsize2}h_1}}
\def\tildestepzero{{\color{colorstepsize1}\tilde{h}_0}}
\def\tildestepone{{\color{colorstepsize2}\tilde{h}_1}}
\def\momt{{\color{colormomentum}m_t}}
\def\ht{{\color{colorstepsize}h_t}}
\def\stept{{\color{colorstepsize}h_t}}
\def\step{{\color{colorstepsize}h}}
\definecolor{colorexternaleigenvalues}{RGB}{152, 78, 163}
\definecolor{colorinternaleigenvalues}{RGB}{77, 175, 74}
\def\muone{{\color{colorexternaleigenvalues}{\mu}_1}}
\def\Lone{{\color{colorinternaleigenvalues}{L}_1}}
\def\mutwo{{\color{colorinternaleigenvalues}{\mu}_2}}
\def\Ltwo{{\color{colorexternaleigenvalues}{L}_2}}
$$

Cyclical Heavy Ball

The most common variant of stochastic gradient descent uses a constant or decreasing step-size. However, some recent works advocate instead for cyclical step-sizes, where the step-size alternates between two or more values. In the blink of an eye, these methods have gone from esoteric to ubiquitous: they're now available in all major deep learning toolboxes,See for example the TensorFlow, Flax and PyTorch implementations. and have found applications beyond optimization in fields like probabilistic modeling.

The scheme that we'll consider has two step-sizes $\stepzero, \stepone$, so that odd iterations use one step-size while even iterations use a different one.There are subtle differences between the various works published so far. For example, in the original work (Smith, 2017), the step-sizes take a sequence of values between $\stepzero$ and $\stepone$. The variant that we analyze takes instead two discrete step-sizes. The algorithm starts with an initial guess $\xx_0$ and produces successive approximations $\xx_1, \xx_2, \ldots$ to the minimizer of $f$ as follows:

Cyclical Heavy Ball
Input: starting guess $\xx_0$, step-sizes $0 \leq \stepzero \leq \stepone$ and momentum ${\color{colormomentum} m} \in [0, 1)$.
$\xx_1 = \xx_0 - \dfrac{\stepzero}{1 + {\color{colormomentum} m}} \nabla f(\xx_0)$
For $t=1, 2, \ldots$
\begin{align*}\label{eq:momentum_update}
&\text{${\color{colorstepsize}h_t} = \stepzero$ if $t$ is even and ${\color{colorstepsize}h_t} = \stepone$ otherwise}\\
&\xx_{t+1} = \xx_t - {\color{colorstepsize}h_t} \nabla f(\xx_t) + {\color{colormomentum} m}(\xx_{t} - \xx_{t-1})
\end{align*}
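To make the update rule concrete, here is a minimal NumPy sketch of the iteration above (the function name and signature are ours, for illustration); we'll reuse it later to compare parameter settings:

    import numpy as np

    def cyclical_heavy_ball(grad_f, x0, h0, h1, m, n_iter=100):
        # First step uses h0 / (1 + m), as in the algorithm box above.
        x_prev, x = x0, x0 - h0 / (1 + m) * grad_f(x0)
        for t in range(1, n_iter):
            h = h0 if t % 2 == 0 else h1  # even iterations use h0, odd ones h1
            x, x_prev = x - h * grad_f(x) + m * (x - x_prev), x
        return x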

By alternating between a small and a large step-size, the cyclical step-size algorithm has very different dynamics than the classical constant step-size variant:

In both cases the parameters are set to those that optimize the asymptotic worst-case convergence rate, assuming an imperfect lower and upper bound on the Hessian eigenvalues.


In this post, we'll show that not only can the cyclical heavy ball reach the same convergence rate as optimal first-order methods such as Polyak and Nesterov momentum, but it can in fact achieve a faster convergence in some scenarios.Wait, didn't I just say that Polyak heavy ball is optimal? How can you beat an optimal method? Hold on for a bit longer and I'll explain this apparent contradiction. We'll even be able to quantify this speedup in terms of quantities such as the eigengap, a gap in the Hessian's eigenvalues.

As in previous posts, the theory of orthogonal polynomials will be our best ally throughout the journey. We’ll start by expressing the residual polynomial of the cyclical heavy ball method as a convex combination of Chebyshev polynomials of the first and second kind (Section 2), and then we’ll use properties of Chebyshev polynomials to derive the worst-case optimal set of parameters (Section 3).

This post is based on a recent paper with the incredible Baptiste Goujaud, Damien Scieur, Aymeric Dieuleveut, and Adrien Taylor. The blog post only covers a subset of the results in the paper. In particular, the paper also has results on longer cycles and local convergence for non-quadratic objectives which are not covered in this blog post.

Related work

Very related to our paper is the recent work of Oymak. Here, the author derived convergence rates for gradient descent (without momentum) using a scheduler that takes one large step followed by $K-1$ small steps, instead of the alternating scheme that's the subject of this post. He showed improved convergence rates for that scheme under the same eigengap assumption we'll describe later. One important difference is that (Oymak 2021) doesn't achieve the optimal rate on general strongly convex quadratics (see Appendix E.5). In contrast, the method we'll describe here defaults to Polyak heavy ball and achieves the optimal rate on general quadratics.

Chebyshev Meets Cyclical Learning Rates

As in previous blog posts in this polynomials series, we’ll assume our objective function $f$ is a quadratic of the form
\begin{equation}\label{eq:opt}
f(\xx) \defas \frac{1}{2}\xx^\top \HH \xx + \bb^\top \xx + c~,
\end{equation}
where $\HH$ is a $d\times d$ positive semi-definite matrix with eigenvalues in some set $\Lambda$.

As we saw in Part 2 of this series, with any gradient-based method we can associate a residual polynomial $P_t$ that determines its convergence behavior.The notion of residual polynomial was introduced in Part 1 of this series. This association results in the following inequality, which relates the error at iteration $t$ to the maximum value of the polynomial over the Hessian eigenvalue support set $\Lambda$:
\begin{equation}
\|\xx_t - \xx^\star\| \leq {\color{purple}\underbrace{\vphantom{\max_{[L]}}\max_{\lambda \in \Lambda}}_{\text{conditioning}}} \,
{\color{teal}\underbrace{\vphantom{\max_{[L]}}|P_t(\lambda)|}_{\text{algorithm}} } \,\,
{\color{brown}\underbrace{\vphantom{\max_{[L]}}\|\xx_0 - \xx^\star\|}_{\text{initialization}}}~.
\end{equation}
This inequality highlights the importance of being able to bound the residual polynomial,We'll often refer to ${\max_{\lambda \in \Lambda}|P_t(\lambda)|}$ as the convergence rate of the algorithm. as doing so yields a bound on the error norm $\|\xx_t - \xx^\star\|$, which is often the goal in the analysis of optimization algorithms. This and the next two sections are devoted to characterizing and bounding the residual polynomial of the cyclical heavy ball method.

Let $P_{t}$ denote the residual polynomial associated with the cyclical heavy ball method at iteration $t$. By our definition of the residual polynomial in Part 1, the sequence of residual polynomials admits the recurrence
\begin{equation}
\begin{aligned}
&P_{t+1}(\lambda) = (1 + \mom - {\color{colorstepsize}h_{\bmod(t, 2)}} \lambda) P_{t}(\lambda) - \mom P_{t-1}(\lambda)\\
&P_0(\lambda) = 1\,,\quad P_1(\lambda) = 1 - \frac{\stepzero}{1 + \mom}\lambda\,.
\end{aligned}\label{eq:residual_cyclical_recurrence}
\end{equation}
What makes this recurrence different from the others we've seen before is that the terms multiplying the previous polynomials depend on the iteration number $t$ through the step-size ${\color{colorstepsize}h_{\bmod(t, 2)}}$.
This is enough to break the previous proof technique.

Fortunately, we can eliminate the dependency on the iteration number by chaining iterations so that the recurrence jumps two iterations instead of one.
Evaluating the previous equation at $P_{2t + 2}$ and solving for $P_{2t + 1}$ gives
\begin{equation}
P_{2t + 1}(\lambda) = \frac{P_{2t + 2}(\lambda) + \mom P_{2t}(\lambda)}{1 + \mom - \stepone\lambda}\,.
\end{equation}
Using this to replace $P_{2t+1}$ and $P_{2t-1}$ in the original recurrence \eqref{eq:residual_cyclical_recurrence} gives
\begin{equation}
\frac{P_{2t + 2}(\lambda) + \mom P_{2t}(\lambda)}{1 + \mom - \stepone\lambda} = (1 + \mom - \stepzero \lambda) P_{2t}(\lambda) - \mom \left(\frac{P_{2t}(\lambda) + \mom P_{2t-2}(\lambda)}{1 + \mom - \stepone\lambda} \right)\,,
\end{equation}
which, solving for $P_{2t + 2}$, finally gives
\begin{equation}\label{eq:recursive_formula}
P_{2t + 2}(\lambda) = \left( (1 + \mom - \stepzero\lambda)(1 + \mom - \stepone \lambda) - 2 \mom \right) P_{2t}(\lambda) - \mom^2 P_{2t - 2}(\lambda)\,.
\end{equation}
This recurrence expresses $P_{2t+2}$ in terms of $P_{2t}$ and $P_{2t - 2}$. It is therefore a recurrence only over even iterations, but it has one enormous advantage with respect to the previous one: the coefficients that multiply the previous polynomials no longer depend on the iteration number.
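As a quick numerical sanity check (a sketch with arbitrary parameter values), we can verify that the even-iteration recurrence \eqref{eq:recursive_formula} reproduces the polynomials generated by the original recurrence \eqref{eq:residual_cyclical_recurrence}:

    import numpy as np

    h0, h1, m = 0.1, 0.4, 0.5           # arbitrary test values
    lam = np.linspace(0.0, 10.0, 5)     # sample eigenvalues

    # original recurrence, with alternating step-sizes
    P = [np.ones_like(lam), 1 - h0 / (1 + m) * lam]
    for t in range(1, 20):
        h = h0 if t % 2 == 0 else h1
        P.append((1 + m - h * lam) * P[t] - m * P[t - 1])

    # even-iteration recurrence with iteration-independent coefficients
    Q = [P[0], P[2]]
    for t in range(1, 10):
        coef = (1 + m - h0 * lam) * (1 + m - h1 * lam) - 2 * m
        Q.append(coef * Q[t] - m**2 * Q[t - 1])

    assert all(np.allclose(P[2 * t], Q[t]) for t in range(10))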

Once we have this recurrence, following a similar proof to the one we did in Part 2, we can derive an expression for the residual polynomial $P_{2t}$ based on Chebyshev polynomials of the first and second kind. This is useful because it will later make it easy to bound this polynomial, and so develop convergence rates, using known bounds for Chebyshev polynomials.

Consider the cyclical heavy ball method with step-sizes $\stepzero$, $\stepone$, and momentum parameter $\mom$.
Then the residual polynomial $P_{2t}$ of even degree associated with this method
can be written in terms of Chebyshev polynomials of the first ($T_{t}$) and second kind $(U_{t})$
as
\begin{align}\label{eq:theoremP2t}
P_{2t}(\lambda) &= \mom^t \left( \tfrac{2 \mom}{1 + \mom}\,
T_{2t}(\zeta(\lambda))
+ \tfrac{1 - \mom}{1 + \mom} \, U_{2t}(\zeta(\lambda))\right)\,,
\end{align}
with $\zeta(\lambda) = \frac{1+\mom}{2 \sqrt{\mom}}\sqrt{(1 - \tfrac{\stepzero}{1 + \mom}\lambda)(1 - \tfrac{\stepone}{1+\mom} \lambda) }$. Since $\zeta$ enters only through even polynomials, the result does not depend on the branch of the square root function. Note that $\zeta$ can be either a real or a purely imaginary number.

We will prove this by induction. We will use the fact that Chebyshev polynomials of the first kind $T_t$ and of the second kind $U_t$ follow the recurrences
\begin{align}
&T_{2t + 2}(\lambda) = (4 \lambda^2 - 2) T_{2t}(\lambda) - T_{2t - 2}(\lambda) \label{eq:recursion_first_kind}\\
&U_{2t + 2}(\lambda) = (4 \lambda^2 - 2) U_{2t}(\lambda) - U_{2t - 2}(\lambda) \label{eq:recursion_second_kind}\,,
\end{align}
with initial conditions $T_{0}(\lambda) = 1\,,~T_2(\lambda) = 2 \lambda^2 - 1$ and $U_{0}(\lambda) = 1\,,~U_2(\lambda) = 4 \lambda^2 - 1$.

Let $\widetilde{P}_t$ denote the right-hand side of \eqref{eq:theoremP2t}.
We start with the cases $t=0$ and $t=1$. From the initial conditions of the Chebyshev polynomials we see that
\begin{align}
\widetilde{P}_0(\lambda) &= {\small\dfrac{2\mom}{1 + \mom}}
+ {\small\dfrac{1 - \mom}{1 + \mom}} = 1 = P_0(\lambda)\\
\widetilde{P}_2(\lambda) &= \mom \left( {\small\dfrac{2\mom}{1 + \mom}}
(2 \zeta(\lambda)^2 - 1)
+ {\small\dfrac{1 - \mom}{1 + \mom}} (4 \zeta(\lambda)^2 - 1)\right)\\
&= \mom \left( \dfrac{4}{1 + \mom}\zeta(\lambda)^2 - 1 \right)\\
&= 1 - (\stepzero + \stepone)\lambda + {\small\dfrac{\stepzero \stepone }{1+\mom}}\lambda^2 \,.
\end{align}

Let's now compute $P_2$ from its recursive formula \eqref{eq:residual_cyclical_recurrence}, which gives
\begin{align}
{P}_2(\lambda) &= (1 + \mom - \stepone \lambda)P_1(\lambda) - \mom P_0(\lambda)\\
&= (1 + \mom - \stepone \lambda)\big(1 - {\small\dfrac{\stepzero \lambda}{1 + \mom}}\big) - \mom\\
&= 1 - (\stepzero + \stepone)\lambda + {\small\dfrac{\stepzero \stepone }{1+\mom}}\lambda^2 \,,
\end{align}
and so we can conclude that $\widetilde{P}_2 = P_2$.

Now let's assume the claim is true up to $t$. For $t+1$ we have, by the recursive formula \eqref{eq:recursive_formula},
\begin{align}
P_{2t+2}(\lambda) &\overset{\text{Eq. \eqref{eq:recursive_formula}}}{=} && \left( 4 \mom \zeta(\lambda)^2 - 2 \mom \right) P_{2t}(\lambda) - \mom^2 P_{2t - 2}(\lambda)\\
&\overset{\text{induction}}{=} && \mom^{t+1}\left( 4 \zeta(\lambda)^2 - 2 \right) \left( {\small\dfrac{2 \mom}{\mom+1}}\, T_{2t}(\zeta(\lambda))
+ {\small\dfrac{1 - \mom}{1 + \mom}}\, U_{2t}(\zeta(\lambda))\right)\\
& &-& \mom^{t+1} \left( {\small\frac{2 \mom}{\mom+1}}\,
T_{2(t-1)}(\zeta(\lambda))
+ {\small\frac{1 - \mom}{1 + \mom}}\, U_{2(t-1)}(\zeta(\lambda))\right)\\
&\overset{\text{Eq. (\ref{eq:recursion_first_kind}, \ref{eq:recursion_second_kind})}}{=} && \mom^{t+1}\left( {\small\frac{2 \mom}{\mom+1}}\, T_{2(t+1)}(\zeta(\lambda)) + {\small\frac{1 - \mom}{1 + \mom}}\, U_{2(t+1)}(\zeta(\lambda))\right)
\end{align}
where in the second identity we used the induction hypothesis and in the third one the recurrence of even Chebyshev polynomials.

This concludes the proof.
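The theorem is also easy to check numerically. The sketch below (again with arbitrary parameters) evaluates both sides of \eqref{eq:theoremP2t}, computing the Chebyshev polynomials through their three-term recurrences so that complex arguments are handled correctly:

    import numpy as np

    def cheb_T(n, z):
        # first kind: T_0 = 1, T_1 = z, T_{k+1} = 2 z T_k - T_{k-1}
        a, b = np.ones_like(z), z
        for _ in range(n):
            a, b = b, 2 * z * b - a
        return a

    def cheb_U(n, z):
        # second kind: U_0 = 1, U_1 = 2z, same three-term recurrence
        a, b = np.ones_like(z), 2 * z
        for _ in range(n):
            a, b = b, 2 * z * b - a
        return a

    h0, h1, m, t = 0.1, 0.4, 0.5, 6
    lam = np.linspace(0.0, 10.0, 7)
    zeta = (1 + m) / (2 * np.sqrt(m)) * np.sqrt(
        (1 - h0 / (1 + m) * lam) * (1 - h1 / (1 + m) * lam) + 0j)
    rhs = m**t * (2 * m / (1 + m) * cheb_T(2 * t, zeta)
                  + (1 - m) / (1 + m) * cheb_U(2 * t, zeta))

    # compare against P_{2t} computed from the original recurrence
    P = [np.ones_like(lam), 1 - h0 / (1 + m) * lam]
    for s in range(1, 2 * t):
        h = h0 if s % 2 == 0 else h1
        P.append((1 + m - h * lam) * P[s] - m * P[s - 1])
    assert np.allclose(P[2 * t], rhs)   # imaginary parts cancel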

It's All About the Link Function

In Part 2 of this series of blog posts, we derived the residual polynomial $P^{\text{HB}}_{2t}$ of the (constant step-size) heavy ball method. We expressed this polynomial in terms of Chebyshev polynomials as
\begin{equation}
P^{\text{HB}}_{2t}(\lambda) = \mom^{t} \left( {\small\frac{2\mom}{1 + \mom}}\,
T_{2t}(\sigma(\lambda))
+ {\small\frac{1 - \mom}{1 + \mom}}\,
U_{2t}(\sigma(\lambda))\right)\,,
\end{equation}
with $\sigma(\lambda) \defas {\small\dfrac{1}{2\sqrt{\mom}}}(1 + \mom - \step\,\lambda)$.
It turns out that the above is essentially the same polynomial as the one associated with the cyclical heavy ball in the previous theorem \eqref{eq:theoremP2t}, the only difference being the term inside the Chebyshev polynomials (denoted $\sigma$ for the constant step-size and $\zeta$ for the cyclical step-size variant). We'll call these link functions.

In the case of constant step-sizes, the link is the linear function $\sigma(\lambda)$, whereas for the cyclical step-size method it is the more complicated function $\zeta(\lambda)$:
\begin{align}
\sigma(\lambda) &= {\small\dfrac{1}{2\sqrt{\mom}}}(1 + \mom - \step\,\lambda) & \text{(constant step-sizes)}\\
\zeta(\lambda) &= {\small\dfrac{1+\mom}{2 \sqrt{\mom}}}\sqrt{\big(1 - {\small\dfrac{\stepzero}{1 + \mom}}\lambda\big)\big(1 - {\small\dfrac{\stepone}{1+\mom}} \lambda\big) }
& \text{(cyclical step-sizes)}
\end{align}
Understanding the differences between these two link functions is key to understanding cyclical step-sizes.

While the link function of the constant step-size heavy ball is always real-valued, the link function of the cyclical variant is complex-valued. Hence, to give a meaningful bound on the residual polynomial we need to understand the behavior of Chebyshev polynomials in the complex plane.

The two faces of Chebyshev polynomials

It turns out that Chebyshev polynomials always grow exponentially in the degree outside of the real segment $[-1, 1]$. In other words, Chebyshev polynomials have two clear-cut regimes: linearly bounded inside $[-1, 1]$, and exponentially diverging outside.


Magnitude of Chebyshev polynomials of the first kind in the complex plane. The $x$ and $y$ coordinates represent the (complex) input, the magnitude of the output is displayed on the $z$ axis, while the argument is represented by the color.

Zeros of Chebyshev polynomials are concentrated in the real $[-1, 1]$ segment, and the polynomials diverge exponentially fast outside of it.


The following lemma formalizes this for the convex combination of Chebyshev polynomials ${\small\frac{2\mom}{1 + \mom}}\, T_{t}(z)
+ {\small\frac{1 - \mom}{1 + \mom}}\, U_{t}(z)$ that appears in the heavy ball method:

Let $z$ be any complex number. The sequence
$
\left(\big|
{\small\frac{2\mom}{1 + \mom}}\, T_{t}(z)
+ {\small\frac{1 - \mom}{1 + \mom}}\, U_{t}(z)
\big|\right)_{t \geq 0}
$
grows exponentially in $t$ for $z \notin [-1, 1]$, whereas in that interval it is bounded, since $|T_t(z)| \leq 1, |U_t(z)| \leq t+1$.

For this proof we'll use the following explicit expression for Chebyshev polynomials, valid for any complex number $z$ other than $1$ and $-1$:Although we won't use it in this post, these asymptotic bounds hold for a large class of orthogonal polynomials; see the next reference for an example.
\begin{align}
T_t(z) = &~ \frac{\xi^t + \xi^{-t}}{2} \nonumber\\
U_t(z) = &~ \frac{\xi^{t+1} - \xi^{-(t+1)}}{\xi - \xi^{-1}} \label{eq:t_and_u}\,,
\end{align}
with $\xi \defas z + \sqrt{z^2 - 1}$, which implies $\xi^{-1} = z - \sqrt{z^2 - 1}$.Here and in the previous equation $\sqrt{\cdot}$ can denote any branch of the square root function. By symmetry, the output will be the same for whichever branch we choose. By combining the two previous expressions of the polynomials $T_t$ and $U_t$, we obtain
\begin{equation}\label{eq:chebyshevs_xi}
{\small\frac{2\mom}{1 + \mom}}\, T_{t}(z)
+ {\small\frac{1 - \mom}{1 + \mom}}\, U_{t}(z)
=
\frac{(\xi^2 - \mom)\xi^t + (\mom \xi^2 - 1)\xi^{-t}}{(1+\mom)(\xi^2-1)}\,.
\end{equation}

This expression has two exponential terms. Depending on the magnitude of $\xi$, we have the following behavior:Note that Equation \eqref{eq:chebyshevs_xi} is not well defined for $\xi = \pm 1$, which corresponds exactly to $z = \pm 1$, already excluded from Equation \eqref{eq:t_and_u} and treated later.

  • If $|\xi| > 1$, then the first term grows exponentially fast.
  • If $|\xi| < 1$, then the second term grows exponentially fast.
  • If $|\xi| = 1$ and $\xi \neq \pm 1$, then the sequence of interest is bounded.

Therefore,
$
\left(\big|
{\small\frac{2\mom}{1 + \mom}}\, T_{t}(z)
+ {\small\frac{1 - \mom}{1 + \mom}}\, U_{t}(z)
\big|\right)_{t \geq 0}
$
is bounded only when $|\xi|=1$ and grows exponentially fast otherwise. We're still left with the task of finding the complex numbers $z$ for which $\xi$ has unit modulus.

For this, note that when $\xi$ has modulus $1$, its inverse $z - \sqrt{z^2 - 1}$ is also its complex conjugate.
This implies that the real parts of $\xi$ and $\xi^{-1}$ coincide, and so $\sqrt{z^2 - 1}$ is purely imaginary and $z \in (-1, 1)$.
Finally, for the case $z \in \{ -1, +1\}$, which we excluded at the beginning of the proof, we can use the known bounds $|T_t(\pm 1)| \leq 1$, $|U_t(\pm 1)| \leq (t+1)$ to conclude that the sequence does not grow exponentially.
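The identity \eqref{eq:chebyshevs_xi} at the heart of this proof can itself be checked numerically at a random complex argument (a small sketch under the same definitions):

    import numpy as np

    rng = np.random.default_rng(0)
    z = complex(rng.standard_normal(), rng.standard_normal())  # any z != ±1
    xi = z + np.sqrt(complex(z**2 - 1))
    m, t = 0.5, 7

    T = (xi**t + xi**(-t)) / 2
    U = (xi**(t + 1) - xi**(-(t + 1))) / (xi - xi**(-1))
    lhs = 2 * m / (1 + m) * T + (1 - m) / (1 + m) * U
    rhs = ((xi**2 - m) * xi**t + (m * xi**2 - 1) * xi**(-t)) \
        / ((1 + m) * (xi**2 - 1))
    assert np.isclose(lhs, rhs)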

A closer look into the link function


Given the previous lemma, which shows that our combination of Chebyshev polynomials grows exponentially in the degree outside of $[-1, 1]$, we seek to understand when the image of the link function lies in this interval,
since it's in this regime that we expect to achieve the best rates. We'll call the set of parameters that verify this condition the robust region.We explored this region in Part 2 of this series in the context of the classical heavy ball method. The name refers to the step-size stability of the method: inside this region, the rate of convergence is independent of the step-sizes and only depends on the momentum term.

The link function $\sigma$ of the constant step-size heavy ball is a linear function, so its preimage $\sigma^{-1}([- 1, 1])$ is also an interval. This explains why most analyses of the heavy ball assume that the eigenvalues belong to a single interval.

While the preimage $\sigma^{-1}([-1, 1])$ associated with the constant step-size link function is always an interval,
things become more interesting for the cyclical link function $\zeta$. In this case, the preimage is not always an interval:


The preimage $\zeta^{-1}([-1, 1])$ of the cyclical step-size link function is a union of two disjoint intervals. $\zeta(\lambda)$ is larger than 1 to the left of the left interval
and smaller than -1 to the right of the right interval. In between the two intervals,
$\zeta(\lambda)$ is purely imaginary.

For visualization purposes, we choose the branch of the square root that agrees in sign with $(1 + \mom - \stepone \lambda)$. This does not affect the rate, since this term only appears squared in the residual polynomial expression.



The preimage $\zeta^{-1}([-1, 1])$ associated with the cyclical step-size link function is always a union of two intervals.

ChiharaTheodore Seio Chihara is known for his research in the field of orthogonal polynomials and
special functions. He wrote the very influential book An Introduction to Orthogonal Polynomials.
The families of Al-Salam-Chihara and Brenke-Chihara polynomials are named in his honor.
worked out the exact formula for this preimage, which turns out to be a union of two intervals of the same size:
\begin{equation}
\begin{aligned}
\zeta^{-1}([-1, 1]) = & \left[\frac{(1+\mom)(\stepzero + \stepone)-\Delta}{2 \stepzero \stepone}, \frac{1+\mom}{\stepone}\right]\\
& \bigcup \left[\frac{1+\mom}{\stepzero}, \frac{(1+\mom)(\stepzero + \stepone)+\Delta}{2 \stepzero \stepone}\right]\,,
\end{aligned}\label{eq:preimage}
\end{equation}
with $\Delta \defas \sqrt{(1+\mom)^2 (\stepzero + \stepone)^2- 4 (1 - \mom)^2 \stepzero \stepone}$.
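In code, the two intervals of \eqref{eq:preimage}, together with a sanity check that $|\zeta|$ equals $1$ at the outer endpoints and $0$ at the inner ones, look as follows (a sketch with arbitrary parameter values):

    import numpy as np

    def robust_preimage(h0, h1, m):
        # endpoints of the two intervals in the formula above
        delta = np.sqrt((1 + m)**2 * (h0 + h1)**2 - 4 * (1 - m)**2 * h0 * h1)
        return (((1 + m) * (h0 + h1) - delta) / (2 * h0 * h1), (1 + m) / h1,
                (1 + m) / h0, ((1 + m) * (h0 + h1) + delta) / (2 * h0 * h1))

    def zeta(lam, h0, h1, m):
        return (1 + m) / (2 * np.sqrt(m)) * np.sqrt(
            (1 - h0 / (1 + m) * lam) * (1 - h1 / (1 + m) * lam) + 0j)

    h0, h1, m = 0.1, 0.4, 0.5
    a, b, c, d = robust_preimage(h0, h1, m)
    for lam, expected in [(a, 1.0), (b, 0.0), (c, 0.0), (d, 1.0)]:
        assert np.isclose(abs(zeta(lam, h0, h1, m)), expected)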



One of the main differences between polynomials with constant and with cyclical coefficients is the behavior of their zeros. While the zeros of orthogonal polynomials with constant coefficients concentrate in an interval, those of orthogonal polynomials with varying coefficients concentrate instead on a union of two disjoint intervals.

This result has profound implications. First, remember that because of the exponential divergence of Chebyshev polynomials outside the $[-1, 1]$ interval, we would like the image of the Hessian eigenvalues to lie in this $[-1, 1]$ interval. The previous result says that the eigenvalue support set that can guarantee this is not a single interval (as was the case for the constant step-size heavy ball) but a union of two disjoint intervals. We'll explore the implications of this model in the next section.

A Refined Model of the Eigenvalue Support Set

In the following we'll assume that the eigenvalues of $\HH$ lie in a set $\Lambda$ that is a union of two intervals of equal size:
\begin{align}
\Lambda \defas [\muone, \Lone] \cup [\mutwo, \Ltwo]\,,
\end{align}
with $\Lone - \muone = \Ltwo - \mutwo$. Note that this parametrization is strictly more general than the classical assumption that the eigenvalues lie inside an interval, since we can always take $\Lone = \mutwo$, in which case $\Lambda$ becomes the interval $[\muone, \Ltwo]$.

A quantity that will be crucial later on is the relative gap, or eigengap:
\begin{equation}
R \defas \frac{\mutwo - \Lone}{\Ltwo - \muone}\,.
\end{equation}
This quantity measures the size of the gap between the two intervals. $R=0$ means that the two intervals are adjacent ($\Lone = \mutwo$, and the set $\Lambda$ is in practice a single interval), while $R=1$ implies instead that all eigenvalues are concentrated around the largest and smallest values.

It turns out that many deep neural network Hessians fit this model with a non-trivial gap. And we don't have to look far. The venerable MNIST exhibits this behavior with a quadratic objective, which contains an outlier eigenvalue leading to a non-trivial relative gap:

Hessian eigenvalue histogram for a quadratic
objective on MNIST. The outlier eigenvalue at $\Ltwo$ generates a non-zero relative gap $R = 0.77$. In this case,
the 2-cycle heavy ball method has a faster asymptotic
rate than the single-cycle one (see Section 3.2).



More complex eigenvalue structures can similarly be modeled with longer cycle lengths. While we won't go into these methods in this blog post, some are discussed in the paper.

A Growing Robust Region

The set of parameters for which the image of the link function is in the $[-1, 1]$ interval is known as the robust region.We discussed the robust region in detail in Part 3 of this series. Two of its most important properties are $i)$ the optimal convergence rate is achieved in this region, and $ii)$ the step-size doesn't influence the convergence rate in this region, which is fully determined by the momentum parameter. Using the known bounds of Chebyshev polynomials $|T_t(z)| \leq 1$ and $|U_t(z)| \leq t+1$ for $z \in [-1, 1]$, we can compute the convergence rate in this region. In particular, plugging these bounds into the characterization \eqref{eq:theoremP2t} of the cyclical heavy ball's residual polynomial, we have
\begin{equation}
r_{2t} \defas \max_{\lambda \in \Lambda}|P_{2t}(\lambda)| \leq \mom^{t} \left( 1
+ {\small\frac{1 - \mom}{1 + \mom}}\, 2 t\right)\,.
\end{equation}
Asymptotically, the exponential term $\mom^{t}$ dominates, and we have the asymptotic rate
\begin{equation}\label{eq:asymptotic_rate}
\limsup_{t \to \infty} \sqrt[2t]{r_{2t}} = \sqrt{\mom}\,.
\end{equation}

Surprisingly, this $\sqrt{\mom}$ asymptotic rate is the same expression as for the constant step-size heavy ball! It might seem we've done all this work for nothing…

Fortunately, this is not true; our work has not been in vain.
There is a speedup, but the reason is subtle. While it's true that the expression for the convergence rate is the same for constant and cyclical step-sizes, the robust region is not the same. Hence, the minimal momentum that we can plug into this expression is not the same, potentially resulting in a different asymptotic rate.

We now seek the smallest momentum parameter we can take while staying within the robust region, which is to say, while the preimage $\zeta^{-1}([-1, 1])$ contains the eigenvalue set $\Lambda$.

The area of that preimage can be computed from the formula \eqref{eq:preimage}, and it turns out that this area factors as a product of two terms that are both increasing in $\mom$:
\begin{align}
& \frac{\sqrt{(1+\mom)^2 (\stepone+\stepzero)^2- 4 (1 - \mom)^2 \stepzero \stepone}}{\stepzero \stepone} + (1+\mom)\Big(\frac{1}{\stepone} - \frac{1}{\stepzero}\Big)\\
&= \underbrace{\vphantom{\left[ \sqrt{(\stepone + \stepzero)^2 - 4 \left(\frac{1-\mom}{1+\mom}\right)^2 \stepzero \stepone} + \stepzero - \stepone \right]}\left[\frac{1 + \mom}{\stepzero \stepone} \right]}_{\text{positive increasing}}\underbrace{\left[ \sqrt{(\stepone + \stepzero)^2 - 4 \left(\frac{1-\mom}{1+\mom}\right)^2 \stepzero \stepone} + \stepzero - \stepone \right]}_{\text{positive increasing}}\,.
\end{align}

The second term is increasing because $\big(\frac{1-\mom}{1+\mom}\big)^2$ is decreasing, and it has a minimum at $\mom=0$ where it is zero. Since a product of two increasing and non-negative functions is increasing, we conclude that the area is an increasing function of $\mom$. Since the area of the robust region is increasing in $\mom$, the smallest value of $\mom$ while staying in the robust region is achieved when this region coincides exactly with the eigenvalue support $\Lambda$.Making the region exactly coincide with the eigenvalue support set is possible thanks to the assumption that the two intervals are of the same size. Hence, for the optimal momentum, the inequalities that determine the robust region become equalities, leading to the system
\begin{equation}
\begin{aligned}
\zeta(\muone) = 1\,,\quad
\zeta(\Lone) = 0\,,\quad
\zeta(\mutwo) = 0\,,\quad
\zeta(\Ltwo) = -1\,.
\end{aligned}
\end{equation}

From the second and third equations we get the optimal step-size parameters
\begin{equation}
\stepzero = \frac{1+\mom}{\mutwo}~,\quad \stepone = \frac{1+\mom}{\Lone}\,.
\end{equation}
The value of these optimal step-sizes is illustrated in the figure below.

Convergence rate (in color) for each choice of step-size parameters. The optimal step-sizes are highlighted in orange.

We can see how the optimal step-sizes are identical for a trivial eigengap $(R=0)$, and grow further apart as $R$ increases.


Substituting into the first equation (or the fourth, they're redundant)
and moving all momentum terms to one side we have
\begin{align}
\Big({\small\dfrac{1 + \mom}{2 \sqrt{\mom}}}\Big)^2
& = {\small\dfrac{\Lone \mutwo}{(\Lone - \muone)(\mutwo - \muone)}}\,.
\end{align}
Let $\rho \defas \frac{\Ltwo+\muone}{\Ltwo-\muone} = \frac{1 + \kappa}{1-\kappa}$ be the inverse of the linear convergence rate of the gradient descent method.
With this notation, we can rewrite the above equation as
\begin{equation}
\Big({\small\dfrac{1 + \mom}{2 \sqrt{\mom}}}\Big)^2 = \frac{\rho^2 - R^2}{1 - R^2}\,.
\end{equation}
Solving for $\mom$ finally gives
\begin{equation}
\mom = \Big( \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\Big)^2\,.
\end{equation}
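Here is a sketch of how one could compute these optimal parameters from the four eigenvalue bounds (function and variable names are ours):

    import numpy as np

    def optimal_parameters(mu1, L1, mu2, L2):
        # assumes two intervals of equal width: L1 - mu1 == L2 - mu2
        rho = (L2 + mu1) / (L2 - mu1)    # inverse gradient descent rate
        R = (mu2 - L1) / (L2 - mu1)      # relative gap (eigengap)
        m = ((np.sqrt(rho**2 - R**2) - np.sqrt(rho**2 - 1))
             / np.sqrt(1 - R**2))**2
        return (1 + m) / mu2, (1 + m) / L1, m   # h0, h1, momentum

    # example: eigenvalues in [0.1, 1] ∪ [9.1, 10]
    h0, h1, m = optimal_parameters(0.1, 1.0, 9.1, 10.0)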

With these parameters, we can now write down the cyclical heavy ball method with optimal parameters. This is a method that requires knowledge of the smallest and largest Hessian eigenvalues and of the relative gap $R$:

Cyclical Heavy Ball with optimal parameters
Input: starting guess $\xx_0$ and eigenvalue bounds $\muone, \Lone, \mutwo, \Ltwo$.
Set: $\rho = \frac{\Ltwo+\muone}{\Ltwo-\muone}$, $R=\frac{\mutwo - \Lone}{\Ltwo - \muone}$,
$\mom = \big( \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\big)^2$

$\xx_1 = \xx_0 - \frac{1}{\mutwo} \nabla f(\xx_0)$
For $t=1, 2, \ldots$
\begin{align*}\label{eq:momentum_update_optimal}
&\text{${\color{colorstepsize}h_t} = \textstyle\frac{1+\mom}{\mutwo}$ if $t$ is even and ${\color{colorstepsize}h_t} = \textstyle\frac{1+\mom}{\Lone}$ otherwise}\\
&\xx_{t+1} = \xx_t - {\color{colorstepsize}h_t} \nabla f(\xx_t) + {\color{colormomentum} m}(\xx_{t} - \xx_{t-1})
\end{align*}
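Combining the sketches above, a small experiment on a synthetic quadratic whose spectrum follows the two-interval model (all values below are arbitrary choices, for illustration) shows the speedup over Polyak heavy ball:

    import numpy as np

    rng = np.random.default_rng(0)
    eigs = np.concatenate([rng.uniform(0.1, 1.0, 50),
                           rng.uniform(9.1, 10.0, 50)])
    b = rng.standard_normal(100)
    grad_f = lambda x: eigs * x - b   # gradient of ½ xᵀdiag(eigs)x - bᵀx
    x_star = b / eigs

    # cyclical heavy ball with the optimal parameters derived above
    h0, h1, m = optimal_parameters(0.1, 1.0, 9.1, 10.0)
    x_chb = cyclical_heavy_ball(grad_f, np.zeros(100), h0, h1, m, n_iter=50)

    # Polyak heavy ball tuned for the single interval [0.1, 10]
    sk = np.sqrt(0.1 / 10)            # sqrt of the inverse condition number
    m_phb = ((1 - sk) / (1 + sk))**2
    h_phb = (1 + np.sqrt(m_phb))**2 / 10
    x_phb = cyclical_heavy_ball(grad_f, np.zeros(100), h_phb, h_phb,
                                m_phb, n_iter=50)

    # cyclical reaches a much smaller error in the same number of steps
    print(np.linalg.norm(x_chb - x_star), np.linalg.norm(x_phb - x_star))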

Convergence Rates

We've seen in the previous section that, in the robust region, the convergence rate is a simple function of the momentum parameter. The only thing left to do is to plug our optimal parameters into this formula \eqref{eq:asymptotic_rate}. The following corollary provides the resulting bound together with an asymptotic version.

Let $\xx_t$ be the iterate generated by the above algorithm after an even number of steps $t$, and let $\rho \defas \frac{\Ltwo+\muone}{\Ltwo-\muone} = \frac{1 + \kappa}{1-\kappa}$ be the inverse of the linear convergence rate of the gradient descent method. Then we can bound the iterate suboptimality by a product of two terms, the first of which decreases exponentially fast, while the second increases linearly in $t$:
\begin{equation}
\frac{\|\xx_t - \xx^\star\|}{\|\xx_0 - \xx^\star\|} \leq \underbrace{\vphantom{\sum_i}\textstyle\Big(\frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\Big)^{t} }_{\text{exponential decrease}} \, \underbrace{\Big( 1 + t\sqrt{\small\dfrac{\rho^2-1}{\rho^2 - R^2}}\Big)}_{\text{linear increase}}\,.
\end{equation}
Asymptotically, the exponential term dominates, and we have the following asymptotic rate factor:
\begin{equation}\label{eq:cyclic_rate}
\limsup_{t \to \infty} \sqrt[2t]{\frac{\|\xx_{2t} - \xx^\star\|}{\|\xx_0 - \xx^\star\|}} \leq \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\,.
\end{equation}

How do these rates compare with those of the constant step-size heavy ball method with optimal parameters (Polyak heavy ball)?
The asymptotic rate of Polyak heavy ball is $r^{\text{PHB}} \defas \rho - \sqrt{\rho^2 - 1}$, and so we would like to compare
\begin{equation}\label{eq:polyak_rate}
\begin{aligned}
&r^{\text{CHB}} = \frac{\sqrt{\rho^2 - R^2} - \sqrt{\rho^2 - 1}}{\sqrt{1 - R^2}}\\
\text{vs } &r^{\text{PHB}} \defas \rho - \sqrt{\rho^2 - 1}\,.
\end{aligned}
\end{equation}

The first thing we can say is that the two coincide when $R=0$, and that $r^{\text{CHB}}$ is a decreasing function of $R$, so the rate improves with the relative gap. Apart from that, it's not clear how to summarize the improvement with a simpler formula.
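Numerically, though, the comparison is straightforward (a sketch; the values of $\kappa$ and $R$ are arbitrary):

    import numpy as np

    kappa = 1e-4                      # inverse condition number
    rho = (1 + kappa) / (1 - kappa)
    R = np.linspace(0.0, 0.9, 4)      # relative gap

    r_phb = rho - np.sqrt(rho**2 - 1)
    r_chb = (np.sqrt(rho**2 - R**2) - np.sqrt(rho**2 - 1)) / np.sqrt(1 - R**2)
    print(r_phb, r_chb)   # r_chb equals r_phb at R = 0 and decreases with R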

In the ill-conditioned regime, where the inverse condition number goes to zero ($\kappa \defas \tfrac{\mu}{L} \to 0$), which is also the main case of interest, we can provide an approximation that sheds some light. In this regime, we have the following rates (lower is better):The asymptotic rate of gradient descent in this regime is $1 - 2 \kappa + o(\kappa)$.
\begin{align}
&r^{\text{PHB}} \underset{\kappa \rightarrow 0}{=} 1 - 2\sqrt{\kappa} + o(\sqrt{\kappa})\\
&r^{\text{CHB}} \underset{\kappa \rightarrow 0}{=} 1 - 2\frac{\sqrt{\kappa}}{{\color{purple}\sqrt{1-R^2}}} + o(\sqrt{\kappa})\,,\label{eq:approx}
\end{align}
where $o(\sqrt{\kappa})$ denotes terms that go to zero faster than $\sqrt{\kappa}$ as $\kappa \rightarrow 0$.
This sheds some light on the improvement: the cyclical scheduler allows us to divide the complexity of the method by a factor of $\frac{1}{\color{purple}\sqrt{1-R^2}}$!
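Continuing the previous sketch, we can also check the quality of the approximation \eqref{eq:approx} directly; the neglected terms are of order $\kappa$, which is negligible here compared to the $\sqrt{\kappa}$ term:

    r_approx = 1 - 2 * np.sqrt(kappa) / np.sqrt(1 - R**2)
    print(np.abs(r_chb - r_approx).max())   # about 1e-3 for kappa = 1e-4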

The figure below compares the convergence rate of Polyak heavy ball (Polyak), the cyclical heavy ball (Cyclical) and its approximated rate \eqref{eq:approx} (Approx, in dashed lines). We see Cyclical behaving like Polyak for well-conditioned problems (large $\kappa$) and then
approaching the super-accelerated rate (Approx) for ill-conditioned problems. This plot also shows that the approximated rate is an excellent approximation for $\kappa < 0.001$.

Convergence rate comparison between Polyak heavy ball, cyclical heavy ball, and its approximate convergence rate.

We can see how the approximated rate provides a very good fit in the ill-conditioned regime.


Conclusion and Perspectives

In this post, we've shown that cyclical learning rates provide an effective and practical way to leverage a gap or bi-modal structure in the Hessian's eigenvalues. We've also derived optimal parameters and shown that in the ill-conditioned regime, this method achieves an extra rate factor improvement of $\frac{1}{\color{purple}\sqrt{1-R^2}}$ over the accelerated rate of Polyak heavy ball, where $R$ is the relative gap in the Hessian eigenvalues.

The combination of a structured Hessian, momentum and cyclical learning rates is immensely powerful, and here we're only scratching the surface of what's possible.
Many important cases have not yet been studied, such as the non-quadratic setting (strongly convex, convex, and non-convex). In the paper we have empirical results on logistic regression showing a speedup of the cyclical heavy ball method, but no proof beyond local convergence. Equally open is the extension to stochastic algorithms.

Citing

If you find this blog post useful, please consider citing it as

Cyclical Step-sizes, Baptiste Goujaud and Fabian Pedregosa, 2022

with bibtex entry:

  
    @misc{goujaud2022cyclical,
      title={Cyclical Step-sizes},
      author={Goujaud, Baptiste and Pedregosa, Fabian},
      url={http://fa.bianp.net/blog/2022/cyclical/},
      year={2022}
    }
  
  

To cite the accompanying full-length paper, please use:


Super-Acceleration with Cyclical Step-sizes, Baptiste Goujaud, Damien Scieur, Aymeric Dieuleveut, Adrien Taylor, Fabian Pedregosa. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, 2022.

Bibtex entry:

  
    @inproceedings{goujaud2022super,
      title={Super-Acceleration with Cyclical Step-sizes},
      author={Goujaud, Baptiste and Scieur, Damien and Dieuleveut, Aymeric and Taylor,
        Adrien and Pedregosa, Fabian},
      booktitle={Proceedings of The 25th International Conference on Artificial
        Intelligence and Statistics},
      series={Proceedings of Machine Learning Research},
      year={2022}
    }
  
  

References

Leslie N. Smith, "Cyclical Learning Rates for Training Neural Networks", IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.

Samet Oymak, "Provable Super-Convergence With a Large Cyclical Learning Rate", IEEE Signal Processing Letters, 2021.

Theodore S. Chihara, "An Introduction to Orthogonal Polynomials", Gordon and Breach, 1978.
