Extensions and variants of stochastic gradient descent




1 Extensions and variants

1.1 Momentum
1.2 Averaging
1.3 AdaGrad
1.4 RMSProp
1.5 Adam
1.6 kSGD





Extensions and variants

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge. A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function η_t of the iteration number t, giving a learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on k-means clustering.
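As a concrete illustration, here is a minimal sketch of SGD with a decreasing learning rate schedule, assuming a toy least-squares objective and an η_t = η_0/√t decay; both the objective and this particular schedule are illustrative choices, not the only options.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy design matrix
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
eta0 = 0.1                                     # base step size (illustrative)
for t in range(1, 10001):
    i = rng.integers(len(y))                   # sample one example at random
    grad = (X[i] @ w - y[i]) * X[i]            # gradient of the per-example squared error
    w -= (eta0 / np.sqrt(t)) * grad            # decreasing schedule eta_t = eta0 / sqrt(t)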


Momentum

Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams' seminal paper on backpropagation learning. Stochastic gradient descent with momentum remembers the update Δw at each iteration, and determines the next update as a linear combination of the gradient and the previous update:







{\displaystyle \Delta w:=\alpha \Delta w-\eta \nabla Q_{i}(w)}

{\displaystyle w:=w+\Delta w}



that leads to:







{\displaystyle w:=w-\eta \nabla Q_{i}(w)+\alpha \Delta w}



where the parameter w which minimizes Q(w) is to be estimated, and η is the step size (sometimes called the learning rate in machine learning).


The name momentum stems from an analogy to momentum in physics: the weight vector w, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully for several decades.
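A minimal sketch of the update rule above, assuming the same kind of toy least-squares objective as before; the values of η and α are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
dw = np.zeros(5)              # the remembered update Δw
eta, alpha = 0.01, 0.9        # step size and momentum coefficient (illustrative)
for _ in range(10000):
    i = rng.integers(len(y))
    grad = (X[i] @ w - y[i]) * X[i]   # ∇Q_i(w) for the squared error
    dw = alpha * dw - eta * grad      # Δw := αΔw − η∇Q_i(w)
    w = w + dw                        # w := w + Δw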


Averaging

Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of










{\displaystyle {\bar {w}}={\frac {1}{t}}\sum _{i=0}^{t-1}w_{i}}.

When optimization is done, this averaged parameter vector takes the place of w.
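A minimal sketch, assuming a toy least-squares objective; here the running mean is maintained incrementally rather than by storing all iterates.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
w_bar = np.zeros(5)           # running average of the iterates
eta = 0.01                    # illustrative step size
for t in range(1, 10001):
    i = rng.integers(len(y))
    grad = (X[i] @ w - y[i]) * X[i]
    w -= eta * grad                   # ordinary SGD update
    w_bar += (w - w_bar) / t          # incremental form of w̄ = (1/t) Σ w_i
# after training, w_bar takes the place of w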


AdaGrad

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with a per-parameter learning rate, first published in 2011. Informally, it increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition. It still has a base learning rate η, but this is multiplied with the elements of a vector {G_{j,j}} which is the diagonal of the outer product matrix







{\displaystyle G=\sum _{\tau =1}^{t}g_{\tau }g_{\tau }^{\mathsf {T}}}



where g_τ = ∇Q_i(w) is the gradient at iteration τ. The diagonal is given by








{\displaystyle G_{j,j}=\sum _{\tau =1}^{t}g_{\tau ,j}^{2}}.

This vector is updated after every iteration. The formula for an update is now







{\displaystyle w:=w-\eta \,\mathrm {diag} (G)^{-{\frac {1}{2}}}\circ g}



or, written as per-parameter updates,








{\displaystyle w_{j}:=w_{j}-{\frac {\eta }{\sqrt {G_{j,j}}}}g_{j}.}



Each {G_{(i,i)}} gives rise to a scaling factor for the learning rate that applies to a single parameter w_i. Since the denominator in this factor,






{\displaystyle {\sqrt {G_{i}}}={\sqrt {\sum _{\tau =1}^{t}g_{\tau }^{2}}}}

is the ℓ2 norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.


While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.
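A minimal sketch of the per-parameter form above, assuming a toy least-squares objective; the small constant eps added to the denominator is a common practical safeguard against division by zero and is not part of the formula as stated.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
G_diag = np.zeros(5)          # diagonal of G: running sums of squared gradients
eta, eps = 0.5, 1e-8          # eps is a practical addition, not in the stated formula
for _ in range(10000):
    i = rng.integers(len(y))
    g = (X[i] @ w - y[i]) * X[i]
    G_diag += g**2                           # G_{j,j} = Σ_τ g_{τ,j}²
    w -= eta / (np.sqrt(G_diag) + eps) * g   # per-parameter step η / √G_{j,j}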


RMSProp

RMSProp (for root mean square propagation) is a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. So, first the running average is calculated in terms of the mean square,







{\displaystyle v(w,t):=\gamma v(w,t-1)+(1-\gamma )(\nabla Q_{i}(w))^{2}}



where γ is the forgetting factor.


The parameters are then updated as







{\displaystyle w:=w-{\frac {\eta }{\sqrt {v(w,t)}}}\nabla Q_{i}(w)}



RMSProp has shown excellent adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop, and it is capable of working with mini-batches as opposed to only full batches.
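A minimal sketch of the two updates above, assuming a toy least-squares objective; as in the AdaGrad sketch, the eps term is a common practical safeguard rather than part of the stated formula.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
v = np.zeros(5)                     # running average of squared gradients
eta, gamma, eps = 0.01, 0.9, 1e-8   # illustrative values; gamma is the forgetting factor
for _ in range(10000):
    i = rng.integers(len(y))
    g = (X[i] @ w - y[i]) * X[i]
    v = gamma * v + (1 - gamma) * g**2   # v(w,t) := γ v(w,t−1) + (1−γ)(∇Q_i(w))²
    w -= eta / (np.sqrt(v) + eps) * g    # w := w − (η/√v(w,t)) ∇Q_i(w)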


Adam

Adam (short for adaptive moment estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters w^{(t)} and a loss function L^{(t)}, where t indexes the current training iteration (indexed at 1), Adam's parameter update is given by:








{\displaystyle m_{w}^{(t+1)}\leftarrow \beta _{1}m_{w}^{(t)}+(1-\beta _{1})\nabla _{w}L^{(t)}}

{\displaystyle v_{w}^{(t+1)}\leftarrow \beta _{2}v_{w}^{(t)}+(1-\beta _{2})(\nabla _{w}L^{(t)})^{2}}

{\displaystyle {\hat {m}}_{w}={\frac {m_{w}^{(t+1)}}{1-\beta _{1}^{t}}}}

{\displaystyle {\hat {v}}_{w}={\frac {v_{w}^{(t+1)}}{1-\beta _{2}^{t}}}}

{\displaystyle w^{(t+1)}\leftarrow w^{(t)}-\eta {\frac {{\hat {m}}_{w}}{{\sqrt {{\hat {v}}_{w}}}+\epsilon }}}



where ϵ is a small number used to prevent division by 0, and β_1 and β_2 are the forgetting factors for the gradients and the second moments of the gradients, respectively.
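A minimal sketch of the five updates above, assuming a toy least-squares objective; the hyperparameter values shown are commonly used defaults, not mandated by the text.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
m = np.zeros(5)               # first-moment estimate m_w
v = np.zeros(5)               # second-moment estimate v_w
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
for t in range(1, 10001):     # iterations indexed at 1, as in the text
    i = rng.integers(len(y))
    g = (X[i] @ w - y[i]) * X[i]
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)        # bias-corrected first moment
    v_hat = v / (1 - beta2**t)        # bias-corrected second moment
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)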


kSGD

Kalman-based stochastic gradient descent (kSGD) is an online and offline algorithm for learning parameters for statistical problems from quasi-likelihood models, which include linear models, non-linear models, generalized linear models, and neural networks with squared error loss as special cases. For online learning problems, kSGD is a special case of the Kalman filter for linear regression problems, a special case of the extended Kalman filter for non-linear regression problems, and can be viewed as an incremental Gauss-Newton method. The benefits of kSGD, in comparison to other methods, are that (1) it is not sensitive to the condition number of the problem, (2) it has a robust choice of hyperparameters, and (3) it has a stopping condition. The drawbacks of kSGD are that the algorithm requires storing a dense covariance matrix between iterations, and that it requires a matrix-vector product at each iteration.


To describe the algorithm, suppose Q_i(w), where w ∈ ℝ^p, is defined by an example (y_i, x_i) ∈ ℝ × ℝ^d such that










{\displaystyle \nabla _{w}Q_{i}(w)={\frac {y_{i}-\mu (x_{i},w)}{V(\mu (x_{i},w))}}\nabla _{w}\mu (x_{i},w)}



where μ(x_i, w) is the mean function (i.e., the expected value of y_i given x_i), and V(μ(x_i, w)) is the variance function (i.e., the variance of y_i given x_i). Then, the parameter update, w(t+1), and the covariance matrix update, M(t+1), are given by the following:







{\displaystyle p=\nabla _{w}\mu (x_{t+1},w(t))}

{\displaystyle m=\mu (x_{t+1},w(t))}

{\displaystyle v=M(t)p}

{\displaystyle s=\min \lbrace \gamma _{1},\max \lbrace \gamma _{2},V(m)\rbrace \rbrace +v^{\mathsf {T}}p}

{\displaystyle w(t+1)=w(t)+{\frac {y_{t+1}-m}{s}}v}

{\displaystyle M(t+1)=M(t)-{\frac {1}{s}}vv^{\mathsf {T}}}



where γ_1 and γ_2 are hyperparameters. The M(t) update can result in the covariance matrix becoming indefinite, which can be avoided at the cost of a matrix-matrix multiplication. M(0) can be any positive definite symmetric matrix, but is typically taken to be the identity. As noted by Patel, for all problems besides linear regression, restarts are required to ensure convergence of the algorithm, but no theoretical or implementation details were given. In a closely related, off-line, mini-batch method for non-linear regression analyzed by Bertsekas, a forgetting factor was used in the covariance matrix update to prove convergence.
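To make the updates concrete, here is a minimal sketch for the linear regression special case, assuming a linear mean function μ(x, w) = xᵀw (so ∇_w μ = x) and a constant variance function V ≡ 1; the hyperparameter values for γ_1 and γ_2 are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(1000, d))
y = X @ np.arange(1.0, d + 1.0) + 0.1 * rng.normal(size=1000)

w = np.zeros(d)
M = np.eye(d)                 # covariance matrix M(0), taken to be the identity
gamma1, gamma2 = 10.0, 0.1    # hyperparameters (illustrative values)
for t in range(len(y)):       # one example per iteration
    x_t, y_t = X[t], y[t]
    p = x_t                                     # ∇_w μ(x_{t+1}, w(t)) = x for the linear mean
    m = x_t @ w                                 # predicted mean μ(x_{t+1}, w(t))
    v = M @ p                                   # the matrix-vector product M(t)p
    s = min(gamma1, max(gamma2, 1.0)) + v @ p   # V(m) = 1 for this constant-variance model
    w = w + ((y_t - m) / s) * v                 # parameter update w(t+1)
    M = M - np.outer(v, v) / s                  # covariance matrix update M(t+1)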




