Stochastic gradient descent: Background




Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:







$$q(w) = \frac{1}{n} \sum_{i=1}^{n} q_i(w),$$



where the parameter $w$ that minimizes $q(w)$ is to be estimated. Each summand function $q_i$ is typically associated with the $i$-th observation in the data set (used for training).
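As a minimal illustration (not from the original text), the objective can be written as the mean of per-example terms; the callables `q_i` below are hypothetical placeholders for the summand functions:

```python
import numpy as np

def objective(w, summands):
    """q(w) = (1/n) * sum_i q_i(w), the mean of the per-example terms.

    `summands` is assumed to be a sequence of callables q_i, each
    mapping a parameter vector w to a scalar.
    """
    return np.mean([q_i(w) for q_i in summands])
```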


In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums is called M-estimators. However, it has long been recognized in statistics that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).


The sum-minimization problem also arises for empirical risk minimization. In this case, $q_i(w)$ is the value of the loss function at the $i$-th example, and $q(w)$ is the empirical risk.
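For concreteness, consider least-squares regression, a standard example not spelled out above: with data pairs $(x_i, y_i)$, each summand and its gradient take a simple closed form,

$$q_i(w) = \left(y_i - x_i^{\mathsf T} w\right)^2, \qquad \nabla q_i(w) = -2\left(y_i - x_i^{\mathsf T} w\right) x_i.$$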


When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:







$$w := w - \eta \, \nabla q(w) = w - \eta \sum_{i=1}^{n} \nabla q_i(w) / n,$$



where $\eta$ is a step size (sometimes called the learning rate in machine learning).
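A minimal sketch of this full-batch update in Python/NumPy, assuming the illustrative least-squares loss above (the step size and iteration count are arbitrary example values, not from the original text):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_iters=100):
    """Full-batch gradient descent for least squares.

    Each iteration uses the gradient averaged over ALL n examples:
    w := w - eta * (1/n) * sum_i grad q_i(w).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residuals = X @ w - y            # shape (n,)
        grad = 2 * X.T @ residuals / n   # average gradient over all examples
        w -= eta * grad
    return w
```

Note that every iteration touches all $n$ examples, so the per-step cost grows linearly with the size of the training set.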


In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.


However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.
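For contrast, here is a sketch of the stochastic variant under the same illustrative least-squares setup: each update uses the gradient of a single randomly chosen summand, so the per-step cost no longer depends on $n$ (the hyperparameters are again example values):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=10, rng=None):
    """Stochastic gradient descent for least squares.

    Each update uses the gradient of a single summand q_i, which is
    an unbiased estimate of the full gradient of q.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):            # visit examples in random order
            x_i, y_i = X[i], y[i]
            grad_i = 2 * (x_i @ w - y_i) * x_i  # gradient of q_i alone
            w -= eta * grad_i
    return w
```

In practice the single example is often replaced by a small minibatch, which reduces the variance of the gradient estimate while keeping the per-step cost low.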







