
On the Link Between Optimization and Polynomials, Part 3


I've seen things you people wouldn't believe.
Valleys sculpted by trigonometric functions.
Rates on fire off the shoulder of divergence.
Beams glitter in the dark near the Polyak gate.
All those landscapes will be lost in time, like tears in rain.
Time to halt.

A momentum optimizer.

$$
\def\aa{\boldsymbol a}
\def\rr{\boldsymbol r}
\def\AA{\boldsymbol A}
\def\HH{\boldsymbol H}
\def\EE{\mathbb E}
\def\II{\boldsymbol I}
\def\CC{\boldsymbol C}
\def\DD{\boldsymbol D}
\def\KK{\boldsymbol K}
\def\eeps{\boldsymbol \varepsilon}
\def\tr{\text{tr}}
\def\LLambda{\boldsymbol \Lambda}
\def\bb{\boldsymbol b}
\def\cc{\boldsymbol c}
\def\xx{\boldsymbol x}
\def\zz{\boldsymbol z}
\def\uu{\boldsymbol u}
\def\vv{\boldsymbol v}
\def\qq{\boldsymbol q}
\def\yy{\boldsymbol y}
\def\ss{\boldsymbol s}
\def\pp{\boldsymbol p}
\def\lmax{L}
\def\lmin{\mu}
\def\RR{\mathbb{R}}
\def\TT{\boldsymbol T}
\def\QQ{\boldsymbol Q}
\def\CC{\boldsymbol C}
\def\Econd{\boldsymbol E}
\DeclareMathOperator*{\argmin}{{arg\,min}}
\DeclareMathOperator*{\argmax}{{arg\,max}}
\DeclareMathOperator*{\minimize}{{minimize}}
\DeclareMathOperator*{\dom}{\mathbf{dom}}
\DeclareMathOperator*{\Fix}{\mathbf{Fix}}
\DeclareMathOperator{\prox}{\mathbf{prox}}
\DeclareMathOperator{\spann}{\mathbf{span}}
\DeclareMathOperator{\Ima}{Im}
\def\defas{\stackrel{\text{def}}{=}}
\def\dif{\mathop{}\!\mathrm{d}}
\definecolor{colormomentum}{RGB}{27, 158, 119}
\definecolor{colorstepsize}{RGB}{217, 95, 2}
\def\mom{{\color{colormomentum}m}}
\def\step{{\color{colorstepsize}h}}
$$

Gradient Descent with Momentum

Gradient descent with momentum, also known as heavy ball or simply momentum, is an optimization method designed to solve unconstrained minimization problems of the form
\begin{equation}
\argmin_{\xx \in \RR^d} f(\xx)\,,
\end{equation}
where the objective function $f$ is differentiable and we have access to its gradient $\nabla f$. In this method the update is a sum of two terms. The first term is the difference between the current and previous iterate $(\xx_{t} - \xx_{t-1})$, also known as the momentum term. The second term is the gradient $\nabla f(\xx_t)$ of the objective function.

Gradient Descent with Momentum
Input: starting guess $\xx_0$, step-size $\step > 0$ and momentum parameter $\mom \in (0, 1)$.
$\xx_1 = \xx_0 - \dfrac{\step}{\mom+1} \nabla f(\xx_0)$
For $t=1, 2, \ldots$ compute
\begin{equation}\label{eq:momentum_update}
\xx_{t+1} = \xx_t + \mom(\xx_{t} - \xx_{t-1}) - \step\nabla f(\xx_t)
\end{equation}
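In code, the update rule above is only a few lines. Below is a minimal NumPy sketch of the method on a small quadratic; the function name and the test problem are my own, not taken from the post's notebooks.

```python
import numpy as np

def momentum(grad, x0, h, m, n_iters=100):
    """Minimal sketch of gradient descent with momentum (heavy ball).

    grad: callable returning the gradient of f, h: step-size,
    m: momentum parameter in (0, 1).
    """
    x_prev = np.asarray(x0, dtype=float)
    # First step, as in the algorithm box above.
    x = x_prev - h / (m + 1) * grad(x_prev)
    for _ in range(n_iters):
        # x_{t+1} = x_t + m (x_t - x_{t-1}) - h grad f(x_t)
        x, x_prev = x + m * (x - x_prev) - h * grad(x), x
    return x

# Example on f(x) = 1/2 x' H x + b' x with eigenvalues 0.1 and 1.
H = np.diag([0.1, 1.0])
b = np.array([1.0, -1.0])
x_final = momentum(lambda x: H @ x + b, np.zeros(2), h=1.0, m=0.5)
print(x_final, -np.linalg.solve(H, b))   # final iterate vs. exact minimizer
```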

Despite its simplicity, gradient descent with momentum exhibits surprisingly rich dynamics that we'll explore in this post. An excellent paper that explores the dynamics of momentum is Gabriel Goh's Why Momentum Really Works.

The landscapes described in the section “The Dynamics of Momentum” are different from the ones described in this post. The ones in Gabriel's paper describe the improvement along the direction of a single eigenvector, and as such give at best a partial view of convergence (in fact, they give the misleading impression that the best convergence is achieved with zero momentum). The techniques used in this post instead allow to compute the global convergence rate, and figures like the one above correctly predict a non-trivial optimal momentum term.

Convergence rates

In the last post we saw that, assuming $f$ is a quadratic objective of the form
\begin{equation}\label{eq:opt}
f(\xx) \defas \frac{1}{2}\xx^\prime \HH \xx + \bb^\prime \xx~,
\end{equation}
then we can analyze gradient descent with momentum using a convex combination of Chebyshev polynomials of the first and second kind.

More precisely, we showed that the error at iteration $t$ can be bounded as
\begin{equation}\label{eq:theorem}
\|\xx_t - \xx^\star\| \leq \underbrace{\max_{\lambda \in [\lmin, L]} |P_t(\lambda)|}_{\defas r_t}\, \|\xx_0 - \xx^\star\|\,,
\end{equation}
where $\lmin$ and $L$ are the smallest and largest eigenvalues of $\HH$ respectively and $P_t$ is a $t$-th degree polynomial that we can write as
\begin{align}
P_t(\lambda) = \mom^{t/2} \left( {\small\frac{2\mom}{1+\mom}}\, T_t(\sigma(\lambda)) + {\small\frac{1 - \mom}{1 + \mom}}\,U_t(\sigma(\lambda))\right)\,,
\end{align}
where $\sigma(\lambda) = {\small\dfrac{1}{2\sqrt{\mom}}}(1 + \mom - \step\,\lambda)$ is a linear function that we will refer to as the link function, and $T_t$ and $U_t$ are the Chebyshev polynomials of the first and second kind respectively.

We'll denote by $r_t$ the maximum of this polynomial over the $[\lmin, L]$ interval, and call the $t$-th root of this quantity, $\sqrt[t]{r_t}$, the convergence rate of the algorithm. We will also often use the limit superior of this quantity, $\limsup_{t \to \infty} \sqrt[t]{r_t}$, which we'll call the asymptotic rate. The asymptotic rate provides a convenient way to compare the (asymptotic) speed of convergence, as it allows to ignore slower terms that are negligible for large $t$.

This identity between gradient descent with momentum and Chebyshev polynomials opens the door to deriving the asymptotic properties of the optimizer from those of Chebyshev polynomials.
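To make the bound concrete, here is a small numerical sketch (my own, under the quadratic assumption above) that evaluates the residual polynomial $P_t$ with SciPy's Chebyshev routines and estimates $r_t$ on a grid of eigenvalues.

```python
import numpy as np
from scipy.special import eval_chebyt, eval_chebyu  # T_t and U_t

def residual_polynomial(lam, t, h, m):
    """P_t(lambda) for gradient descent with momentum."""
    sigma = (1 + m - h * lam) / (2 * np.sqrt(m))     # link function
    return m ** (t / 2) * (2 * m / (1 + m) * eval_chebyt(t, sigma)
                           + (1 - m) / (1 + m) * eval_chebyu(t, sigma))

mu, L, h, m = 0.1, 1.0, 1.0, 0.5
lams = np.linspace(mu, L, 1000)
for t in (10, 50, 200):
    r_t = np.abs(residual_polynomial(lams, t, h, m)).max()
    print(t, r_t ** (1 / t))   # the t-th root approaches sqrt(m) ~ 0.707 here
```

For these particular values of the step-size and momentum the $t$-th root settles near $\sqrt{\mom}$, which is exactly the behavior characterized in the robust region below.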

The Two Faces of Chebyshev Polynomials

Chebyshev polynomials behave very differently inside and outside the interval $[-1, 1]$. Inside this interval (shaded blue region) the magnitude of these polynomials stays close to zero, whereas outside it explodes:

The behavior of Chebyshev polynomials of the first and second kind is drastically different inside and outside the $[-1, 1]$ interval.


Open In Colab


Let's make this observation more precise.

Inside the $[-1, 1]$ interval, Chebyshev polynomials admit the trigonometric definitions $T_t(\cos(\theta)) = \cos(t \theta)$ and $U_{t}(\cos(\theta)) = \sin((t+1)\theta) / \sin(\theta)$, and so they have an oscillatory behavior with values bounded in absolute value by $1$ and $t+1$ respectively.

Outside of this interval, the Chebyshev polynomials admit the explicit form for $|\xi| \geq 1$:
\begin{align}
T_t(\xi) &= \dfrac{1}{2} \Big(\xi-\sqrt{\xi^2-1} \Big)^t + \dfrac{1}{2} \Big(\xi+\sqrt{\xi^2-1} \Big)^t\\
U_t(\xi) &= \frac{(\xi + \sqrt{\xi^2 - 1})^{t+1} - (\xi - \sqrt{\xi^2 - 1})^{t+1}}{2 \sqrt{\xi^2 - 1}}\,.
\end{align}
We're interested in convergence rates, so we'll look into the $t$-th root asymptotics of these quantities. (With a little extra effort it would be possible to derive non-asymptotic convergence rates, although I won't pursue this analysis here.) Luckily, these asymptotics are the same for both polynomials (although we won't use it here, this $t$-th root asymptotic holds for (almost) all orthogonal polynomials, not just Chebyshev polynomials; see for instance the reference below), and taking limits we have that
\begin{equation}
\lim_{t \to \infty} \sqrt[t]{|T_t(\xi)|} = \lim_{t \to \infty} \sqrt[t]{|U_t(\xi)|} = |\xi| + \sqrt{\xi^2 - 1}\,.
\end{equation}
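A quick numerical sanity check of this limit (a sketch of my own, not from the post's notebooks):

```python
import numpy as np
from scipy.special import eval_chebyt, eval_chebyu

xi = 1.5                                   # a point outside [-1, 1]
limit = abs(xi) + np.sqrt(xi ** 2 - 1)     # predicted value of the t-th root
for t in (10, 50, 200):
    print(t,
          eval_chebyt(t, xi) ** (1 / t),   # t-th root of T_t(xi)
          eval_chebyu(t, xi) ** (1 / t),   # t-th root of U_t(xi)
          limit)                           # ~2.618 for xi = 1.5
```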

The Robust Region

Let's start by considering the case in which the Chebyshev polynomials are only evaluated inside the $[-1, 1]$ interval. Since the Chebyshev polynomials are evaluated at $\sigma(\cdot)$, this means that $|\sigma(\lambda)| \leq 1$ for all $\lambda$ in $[\lmin, L]$. We'll call the set of step-size and momentum parameters for which this inequality is verified the robust region.

Let's visualize this region in a map.
Since $\sigma$ is a linear function, its extremal values are reached at the edges:
\begin{equation}
\max_{\lambda \in [\lmin, L]} |\sigma(\lambda)| = \max\big\{|\sigma(\lmin)|, |\sigma(L)|\big\}\,.
\end{equation}
Using $\lmin \leq L$ and the fact that $\sigma(\lambda)$ is decreasing in $\lambda$, we can simplify the condition that the above is smaller than one to $\sigma(\lmin) \leq 1$ and $\sigma(L) \geq -1$, which in terms of the step-size and momentum correspond to:
\begin{equation}\label{eq:robust_region}
\frac{(1 - \sqrt{\mom})^2}{\lmin} \leq \step \leq \frac{(1 + \sqrt{\mom})^2}{L} \,.
\end{equation}
These two conditions give the lower and upper bound of the robust region.
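In code, checking whether a given step-size and momentum pair lies in this region is a one-liner (a sketch in the post's notation; the helper name and the example values are mine):

```python
import numpy as np

def in_robust_region(h, m, mu, L):
    """True if (1 - sqrt(m))^2 / mu <= h <= (1 + sqrt(m))^2 / L."""
    return (1 - np.sqrt(m)) ** 2 / mu <= h <= (1 + np.sqrt(m)) ** 2 / L

print(in_robust_region(1.0, 0.5, mu=0.1, L=1.0))   # True: inside the robust region
print(in_robust_region(0.2, 0.1, mu=0.1, L=1.0))   # False: step-size too small
```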

The robust region (in gray) is upper bounded by $h = \frac{(1 + \sqrt{m})^2}{L}$ and lower bounded by $h = \frac{(1 - \sqrt{m})^2}{\lmin}$.

Asymptotic rate

Let $\sigma(\lambda) = \cos(\theta)$ for some $\theta$, which is always possible since $\sigma(\lambda) \in [-1, 1]$. In this regime, Chebyshev polynomials verify the identities $T_t(\cos(\theta)) = \cos(t \theta)$ and $U_t(\cos(\theta)) = \sin((t+1)\theta)/\sin(\theta)$, which, replaced in the definition of the residual polynomial, give
\begin{equation}
P_t(\sigma^{-1}(\cos(\theta))) = \mom^{t/2} \left[ {\small\frac{2\mom}{1+\mom}}\, \cos(t\theta) + {\small\frac{1 - \mom}{1 + \mom}}\,\frac{\sin((t+1)\theta)}{\sin(\theta)}\right]\,.
\end{equation}

Since the expression inside the square brackets is bounded in absolute value by $t+2$, taking the $t$-th root and then limits gives $\limsup_{t \to \infty} \sqrt[t]{|P_t(\sigma^{-1}(\cos(\theta)))|} = \sqrt{\mom}$ for any $\theta$. This gives our first asymptotic rate:

The asymptotic rate in the robust region is $\sqrt{\mom}$.

This is nothing short of magical. It would seem natural (and this will be the case in other regions) that the speed of convergence should depend on both the step-size and the momentum parameter. Yet this result implies that it's not the case in the robust region: there, the convergence only depends on the momentum parameter $\mom$. Amazing. This insensitivity to the step-size has been leveraged by Zhang et al. 2018 to develop a momentum tuner.

This also illustrates why we call this the robust region: in its interior, perturbing the step-size in a way that keeps us inside the region has no effect on the convergence rate. The next figure displays the asymptotic rate (darker is faster) in the robust region.

In the robust region, the asymptotic rate only depends on the momentum parameter.


Open In Colab
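This insensitivity is easy to observe numerically. The rough check below (my own sketch, on a toy quadratic with eigenvalues $0.1$ and $1$) runs momentum with two very different step-sizes inside the robust region and estimates the empirical per-iteration rate; both estimates come out close to $\sqrt{\mom}$.

```python
import numpy as np

# Quadratic f(x) = 1/2 x' H x with eigenvalues mu = 0.1 and L = 1; minimizer x* = 0.
H = np.diag([0.1, 1.0])
m = 0.5   # robust region for this m: roughly 0.86 <= h <= 2.91

def empirical_rate(h, n_iters=300):
    """Estimate ||x_t - x*||^(1/t) after running momentum for n_iters steps."""
    x_prev = np.array([1.0, 1.0])
    x = x_prev - h / (m + 1) * (H @ x_prev)
    for _ in range(n_iters):
        x, x_prev = x + m * (x - x_prev) - h * (H @ x), x
    return np.linalg.norm(x) ** (1 / n_iters)

print(empirical_rate(0.9), empirical_rate(2.5), np.sqrt(m))  # all close to 0.707
```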


The Lazy Region

Let's now consider what happens outside of the robust region. In this case, the convergence will depend on the largest of $\{|\sigma(\lmin)|, |\sigma(L)|\}$. We'll consider first the case in which the maximum is $|\sigma(\lmin)|$ and leave the other one for the next section.

This region is determined by the inequalities $|\sigma(\lmin)| > 1$ and $|\sigma(\lmin)| \geq |\sigma(L)|$.
Using the definition of $\sigma$ and solving for $\step$ gives the equivalent conditions
\begin{equation}
\step \leq \frac{2(1 + \mom)}{L + \lmin} \quad \text{ and }\quad \step \leq \frac{(1 - \sqrt{\mom})^2}{\lmin}\,.
\end{equation}
Note the second inequality is the same as the one for the robust region \eqref{eq:robust_region} but with the inequality sign reversed, so this region lies on the opposite side of that curve. We'll call this the lazy region, since increasing the momentum will take us out of it and into the robust region.
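The analogous membership test for this region (again my own helper, in the same notation as before):

```python
import numpy as np

def in_lazy_region(h, m, mu, L):
    """True if h <= 2 (1 + m) / (L + mu) and h <= (1 - sqrt(m))^2 / mu."""
    return h <= min(2 * (1 + m) / (L + mu), (1 - np.sqrt(m)) ** 2 / mu)

print(in_lazy_region(0.2, 0.1, mu=0.1, L=1.0))   # True: small step-size and momentum
print(in_lazy_region(1.0, 0.5, mu=0.1, L=1.0))   # False: this point is in the robust region
```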


The lazy region, depicted in gray on the left: lower bounded by zero and upper bounded by the minimum of $h = \frac{(1 - \sqrt{m})^2}{\lmin}$ and $h = \frac{2(1 + m)}{L + \lmin}$.


Open In Colab


Asymptotic rate

As we saw above, outside of the $[-1, 1]$ interval both Chebyshev polynomials have simple $t$-th root asymptotics.
Using this and the fact that both kinds of Chebyshev polynomials agree in sign outside of the $[-1, 1]$ interval, we can compute the asymptotic rate as
\begin{align}
\lim_{t \to \infty} \sqrt[t]{r_t} &= \sqrt{\mom} \lim_{t \to \infty} \sqrt[t]{ {\small\frac{2\mom}{\mom+1}}\, T_t(\sigma(\lmin)) + {\small\frac{1 - \mom}{1 + \mom}}\,U_t(\sigma(\lmin))}\\
&= \sqrt{\mom}\left(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1} \right)\,.
\end{align}
This gives the asymptotic rate for this region:

In the lazy region the asymptotic rate is $\sqrt{\mom}\left(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1} \right)$.

Unlike in the robust region, this rate depends on both the step-size and the momentum parameter, which enter the rate through the link function $\sigma$. This can be observed in the color plot of the asymptotic rate.

In the lazy region (unlike in the robust region), the asymptotic rate depends on both the step-size and the momentum parameter.


Open In Colab


Knife’s edge

The robust and lazy regions occupy most (but not all!) of the area for which momentum converges. There's a small region that sits between the lazy and robust regions and the region where momentum diverges. We call this region the Knife's edge.

For parameters not in the robust or lazy region, we have $|\sigma(L)| > 1$ and $|\sigma(L)| > |\sigma(\lmin)|$. Using the asymptotics of Chebyshev polynomials as we did in the previous section, we have that the asymptotic rate is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1} \right)$. The method will only converge when this asymptotic rate is below 1. Enforcing this results in $\step < 2 (1 + \mom) / L$. Combining this condition with the one of not being in the robust or lazy region gives the characterization:
\begin{equation}
\step < \frac{2 (1 + \mom)}{L} \quad \text{ and } \quad \step \geq \max\Big\{\tfrac{2(1 + \mom)}{L + \lmin}, \tfrac{(1 + \sqrt{\mom})^2}{L}\Big\}\,.
\end{equation}

Knife's edge is upper bounded by $h = \frac{2(1+m)}{L}$ and lower bounded by $h = \max\Big\{\tfrac{2(1 + m)}{L + \lmin}, \tfrac{(1 + \sqrt{m})^2}{L}\Big\}$.


Open In Colab


Asymptotic rate

The asymptotic rate can be computed using the same technique as in the lazy region. The resulting rate is the same as in that region, but with $\sigma(L)$ replacing $\sigma(\lmin)$:

In the Knife's edge region the asymptotic rate is $\sqrt{\mom}\left(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1} \right)$.

Pictorially, this corresponds to

Convergence in Knife's edge. As in the lazy region, the convergence in this region depends on both the step-size and the momentum parameter.


Open In Colab


Putting it all together

This is the end of our journey. We have visited all the regions on which momentum converges. (There's a small convergent region with negative momentum parameter that we haven't visited. Although not typically used for minimization, negative momentum has found applications in smooth games (Gidel et al., 2020).) The only thing left to do is to combine all the asymptotic rates we've gathered along the way.

The asymptotic rate $\limsup_{t \to \infty} \sqrt[t]{r_t}$ of momentum is
\begin{alignat}{2}
&\sqrt{\mom} &&\text{ if }\step \in \big[\tfrac{(1 - \sqrt{\mom})^2}{\lmin}, \tfrac{(1+\sqrt{\mom})^2}{L}\big]\\
&\sqrt{\mom}\big(|\sigma(\lmin)| + \sqrt{\sigma(\lmin)^2 - 1}\big) &&\text{ if } \step \in \big[0, \min\big\{\tfrac{2(1 + \mom)}{L + \lmin}, \tfrac{(1 - \sqrt{\mom})^2}{\lmin}\big\}\big]\\
&\sqrt{\mom}\big(|\sigma(L)| + \sqrt{\sigma(L)^2 - 1}\big)&&\text{ if } \step \in \big[\max\big\{\tfrac{2(1 + \mom)}{L + \lmin}, \tfrac{(1 + \sqrt{\mom})^2}{L}\big\}, \tfrac{2 (1 + \mom) }{L} \big)\\
&\geq 1 \text{ (divergence)} && \text{ otherwise.}
\end{alignat}
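This piecewise formula translates directly into code. The sketch below (my own helper, following the post's notation, and assuming $\mom \in (0,1)$ and $\step \geq 0$) returns the asymptotic rate for a given step-size and momentum pair:

```python
import numpy as np

def asymptotic_rate(h, m, mu, L):
    """Asymptotic rate limsup_t r_t^(1/t) of momentum on a quadratic whose
    Hessian has extreme eigenvalues mu and L; assumes 0 < m < 1 and h >= 0."""
    sigma = lambda lam: (1 + m - h * lam) / (2 * np.sqrt(m))  # link function
    if (1 - np.sqrt(m)) ** 2 / mu <= h <= (1 + np.sqrt(m)) ** 2 / L:
        return np.sqrt(m)                                     # robust region
    if h <= min(2 * (1 + m) / (L + mu), (1 - np.sqrt(m)) ** 2 / mu):
        s = abs(sigma(mu))                                    # lazy region
        return np.sqrt(m) * (s + np.sqrt(s ** 2 - 1))
    if max(2 * (1 + m) / (L + mu), (1 + np.sqrt(m)) ** 2 / L) <= h < 2 * (1 + m) / L:
        s = abs(sigma(L))                                     # knife's edge
        return np.sqrt(m) * (s + np.sqrt(s ** 2 - 1))
    return np.inf   # outside these regions the method diverges (rate >= 1); flag with inf

mu, L = 0.1, 1.0
print(asymptotic_rate(1.0, 0.5, mu, L))   # robust region: sqrt(0.5) ~ 0.71
print(asymptotic_rate(0.2, 0.1, mu, L))   # lazy region: ~ 0.84
print(asymptotic_rate(2.1, 0.1, mu, L))   # knife's edge: ~ 0.89
print(asymptotic_rate(3.0, 0.1, mu, L))   # divergence: inf
```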

Plotting the asymptotic rates for all regions, we can see that Polyak momentum (the method with momentum $\mom = \left(\frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} + \sqrt{\lmin}}\right)^2$ and step-size $\step = \left(\frac{2}{\sqrt{L} + \sqrt{\lmin}}\right)^2$, which is asymptotically optimal among momentum methods with constant coefficients) sits at the intersection of the three regions.
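A quick numerical check of this last claim (my own sketch): at the Polyak parameters the three boundary curves coincide, and the asymptotic rate there is $\sqrt{\mom} = \frac{\sqrt{L} - \sqrt{\lmin}}{\sqrt{L} + \sqrt{\lmin}}$.

```python
import numpy as np

mu, L = 0.1, 1.0
m = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2   # Polyak momentum
h = (2 / (np.sqrt(L) + np.sqrt(mu))) ** 2                            # Polyak step-size

# The three region boundaries, evaluated at the Polyak momentum, all meet at h:
print(h,
      (1 - np.sqrt(m)) ** 2 / mu,    # lower edge of the robust region
      (1 + np.sqrt(m)) ** 2 / L,     # upper edge of the robust region
      2 * (1 + m) / (L + mu))        # lazy / knife's edge boundary
print(np.sqrt(m))                    # asymptotic rate at the Polyak point
```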



Open In Colab

Citing

If you find this blog post useful, please consider citing it as:

A Hitchhiker's Guide to Momentum, Fabian Pedregosa, 2021.

Bibtex entry:

  
    @misc{pedregosa2021residual,
      title={A Hitchhiker's Guide to Momentum},
      author={Pedregosa, Fabian},
      url = {http://fa.bianp.net/blog/2021/hitchhiker/},
      year={2021}
    }
  
  

Acknowledgements.
Thanks to Baptiste Goujaud for proofreading, catching many typos and making excellent clarifying suggestions, to Robert Gower for proofreading this post and providing thoughtful feedback. Thanks also to Courtney Paquette and Adrien Taylor for reporting errors.


References
