Stochastic gradient descent: Background




Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:







q(w) = \frac{1}{n} \sum_{i=1}^{n} q_i(w),



where the parameter $w$ that minimizes $q(w)$ is to be estimated. Each summand function $q_i$ is typically associated with the $i$-th observation in the data set (used for training).
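As a concrete instance (an illustrative assumption, not stated in the text above): in linear least squares with observations $(x_i, y_i)$, each summand is the squared residual of a single observation,

q_i(w) = \left( y_i - x_i^{\top} w \right)^{2}, \qquad q(w) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i^{\top} w \right)^{2}.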


In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums is called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).


The sum-minimization problem also arises for empirical risk minimization. In that case, $q_i(w)$ is the value of the loss function at the $i$-th example, and $q(w)$ is the empirical risk.
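A minimal sketch in Python of the empirical risk as the average of per-example losses (the function names and the squared-error loss below are illustrative assumptions, not part of the original text):

import numpy as np

def empirical_risk(loss_i, w, X, y):
    # q(w): average of the per-example losses q_i(w) over the n training examples
    return np.mean([loss_i(w, X[i], y[i]) for i in range(len(X))])

# Illustrative per-example loss, the least-squares instance mentioned above:
# q_i(w) = (y_i - x_i^T w)^2
def squared_loss(w, x_i, y_i):
    return (y_i - x_i @ w) ** 2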


When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:







w := w - \eta \, \nabla q(w) = w - \eta \sum_{i=1}^{n} \nabla q_i(w) / n,



where $\eta$ is a step size (sometimes called the learning rate in machine learning).
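A minimal sketch of this batch update in Python (the function names, the least-squares gradient, and the fixed iteration count are illustrative assumptions, not part of the original text):

import numpy as np

def batch_gradient_descent(grad_qi, w0, X, y, eta=0.01, n_iters=100):
    # Full-batch gradient descent: every step uses the gradient
    # averaged over all n training examples.
    w = np.asarray(w0, dtype=float)
    n = len(X)
    for _ in range(n_iters):
        # sum of the per-example gradients grad q_i(w), divided by n
        grad = sum(grad_qi(w, X[i], y[i]) for i in range(n)) / n
        w = w - eta * grad
    return w

# Illustrative per-example gradient for the least-squares loss q_i(w) = (y_i - x_i^T w)^2:
def grad_qi(w, x_i, y_i):
    return -2.0 * (y_i - x_i @ w) * x_i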


In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum-gradient. For example, in statistics, one-parameter exponential families allow economical function evaluations and gradient evaluations.


However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients of all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all of the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of the summand functions at every step. This is very effective in the case of large-scale machine learning problems.
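A minimal stochastic gradient descent sketch in Python, under the same illustrative assumptions as the batch example above (here the sampled subset is a single example per step; small minibatches of several examples are a common variant):

import numpy as np

def sgd(grad_qi, w0, X, y, eta=0.01, n_epochs=10):
    # Stochastic gradient descent: each step uses the gradient of one
    # randomly chosen summand q_i instead of the full sum over all n.
    # grad_qi has the same signature as in the batch sketch above.
    w = np.asarray(w0, dtype=float)
    n = len(X)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(n):  # visit the examples in a random order
            w = w - eta * grad_qi(w, X[i], y[i])
    return w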







