Stochastic gradient descent: Background
Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

{\displaystyle q(w)={\frac {1}{n}}\sum _{i=1}^{n}q_{i}(w),}
where the parameter w that minimizes q(w) is to be estimated. Each summand function q_i is typically associated with the i-th observation in the data set (used for training).
In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).
The sum-minimization problem also arises for empirical risk minimization. In this case, q_i(w) is the value of the loss function at the i-th example, and q(w) is the empirical risk.
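To make the setup concrete, here is a minimal sketch in Python, assuming a linear model with a squared-error loss; the names q_i, q, X, and y are illustrative choices for this sketch, not part of the original text.

```python
import numpy as np

def q_i(w, x_i, y_i):
    """Per-example loss: squared error of a linear model on the i-th observation."""
    return 0.5 * (x_i @ w - y_i) ** 2

def q(w, X, y):
    """Empirical risk: the average of the per-example losses q_i over the data set."""
    n = X.shape[0]
    return sum(q_i(w, X[i], y[i]) for i in range(n)) / n
```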
When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

{\displaystyle w:=w-\eta \nabla q(w)=w-\eta \sum _{i=1}^{n}\nabla q_{i}(w)/n,}
where η is the step size (sometimes called the learning rate in machine learning).
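As a hedged illustration of this batch iteration, the update might look as follows for the same squared-error example above; grad_q_i and the particular step-size and iteration-count values are assumptions made for the sketch.

```python
import numpy as np

def grad_q_i(w, x_i, y_i):
    """Gradient of the per-example squared-error loss for a linear model."""
    return (x_i @ w - y_i) * x_i

def batch_gradient_descent(w, X, y, eta=0.01, num_iters=100):
    """Standard (batch) gradient descent: each update averages all n summand gradients."""
    n = X.shape[0]
    for _ in range(num_iters):
        full_grad = sum(grad_q_i(w, X[i], y[i]) for i in range(n)) / n
        w = w - eta * full_grad  # w := w - eta * grad q(w)
    return w
```

Note that each update here costs n gradient evaluations, which motivates the discussion that follows.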
In many cases, the summand functions have a simple form that enables inexpensive evaluation of the sum-function and the sum-gradient. For example, in statistics, one-parameter exponential families allow economical function evaluations and gradient evaluations.
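For instance, in linear least squares the full gradient depends on the data only through quantities that can be precomputed once. The sketch below assumes the same squared-error setup as above and is illustrative only.

```python
import numpy as np

def precompute_sufficient_statistics(X, y):
    """The least-squares gradient depends on the data only through
    X^T X and X^T y, which can be computed a single time up front."""
    return X.T @ X, X.T @ y

def full_gradient(w, XtX, Xty, n):
    """grad q(w) = (X^T X w - X^T y) / n, with no pass over individual examples."""
    return (XtX @ w - Xty) / n
```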
However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients of all the summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.
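A minimal sketch of that idea, again assuming the squared-error example and taking the simplest choice of subset, a single summand per step; the function name and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def stochastic_gradient_descent(w, X, y, eta=0.01, num_epochs=10, rng=None):
    """At each step, estimate grad q(w) from one randomly chosen summand
    instead of evaluating all n summand gradients."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    for _ in range(num_epochs):
        for i in rng.permutation(n):      # visit summand functions in random order
            g = (X[i] @ w - y[i]) * X[i]  # grad q_i(w) for the squared-error example
            w = w - eta * g               # w := w - eta * grad q_i(w)
    return w
```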