Iterative method Stochastic gradient descent
Figure: Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.
In stochastic (or "on-line") gradient descent, the true gradient of $q(w)$ is approximated by the gradient at a single example:

$$w := w - \eta \nabla q_i(w).$$
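As a concrete illustration of this update, assume (the text above does not fix a particular loss) that each $q_i$ is the squared error of a linear model on a training pair $(x_i, y_i)$. The symbols $x_i$ and $y_i$ are introduced here only for the example. Under that assumption the per-example gradient and the resulting update would read:

```latex
q_i(w) = (w^{\top} x_i - y_i)^2, \qquad
\nabla q_i(w) = 2\,(w^{\top} x_i - y_i)\,x_i, \qquad
w := w - 2\eta\,(w^{\top} x_i - y_i)\,x_i .
```

Each step moves $w$ along the single example's input vector, scaled by how far the current prediction is from its target.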
As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges.
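In pseudocode, stochastic gradient descent can be presented as follows:

Choose an initial vector of parameters $w$ and a learning rate $\eta$.
Repeat until an approximate minimum is obtained:
    Randomly shuffle the examples in the training set.
    For each training example $i$, do:
        $w := w - \eta \nabla q_i(w)$.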
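The sweep described above (one update per training example, with the data reshuffled on every pass) can be sketched in Python with NumPy. This is only a minimal sketch: the function name `sgd`, the callback `grad_qi`, the fixed number of passes in place of a convergence test, and the noiseless least-squares data are all assumptions made for illustration.

```python
import numpy as np

def sgd(grad_qi, w0, n_examples, eta=0.01, n_epochs=50, seed=0):
    """Plain SGD: one update per training example, reshuffled every pass."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(n_epochs):                  # several passes over the data
        for i in rng.permutation(n_examples):  # shuffle to prevent cycles
            w = w - eta * grad_qi(w, i)        # w := w - eta * grad q_i(w)
    return w

# Illustrative problem (assumption): noiseless linear least squares.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true

def grad_qi(w, i):
    # Gradient of q_i(w) = (x_i . w - y_i)^2 for the single example i.
    return 2.0 * (X[i] @ w - y[i]) * X[i]

w_hat = sgd(grad_qi, np.zeros(2), n_examples=len(X))
```

Because the synthetic targets contain no noise, a constant learning rate is enough for `w_hat` to approach `w_true` here; with noisy data a decreasing or adaptive rate would usually be needed.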
A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (called a "mini-batch") at each step. This can perform better than the "true" stochastic gradient descent described above, because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.
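A mini-batch variant might look like the following NumPy sketch, in which the gradient for a whole batch is computed with one matrix product instead of a loop over examples. The function name `minibatch_sgd`, the squared-error loss, the batch size, and the synthetic data are assumptions chosen for illustration, not part of the description above.

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.05, batch_size=16, n_epochs=100, seed=0):
    """Mini-batch SGD for linear least squares (illustrative loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle every pass
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of one mini-batch
            residual = X[idx] @ w - y[idx]         # vectorized over the batch
            # Average of the per-example gradients 2 * (x_i . w - y_i) * x_i.
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w = w - eta * grad
    return w

# Illustrative data (assumption): noiseless linear targets.
data_rng = np.random.default_rng(2)
X = data_rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_hat = minibatch_sgd(X, y)
```

Setting `batch_size=1` reduces this to the single-example update, while `batch_size=len(X)` gives ordinary full-batch gradient descent on the mean loss.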
The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\eta$ decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.
This is in fact a consequence of the Robbins-Siegmund theorem.
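One common way to make "an appropriate rate" precise, taken from the stochastic approximation literature rather than from the text above, is to let the learning rate depend on the step index $t$ and require the classical step-size conditions below. The notation $\eta_t$ and $\eta_0$ is introduced here for illustration.

```latex
\sum_{t=1}^{\infty} \eta_t = \infty, \qquad
\sum_{t=1}^{\infty} \eta_t^{2} < \infty, \qquad
\text{for example } \eta_t = \frac{\eta_0}{t}.
```

The first condition keeps the steps large enough to reach a minimum from any starting point; the second makes them shrink fast enough that the noise from single-example gradients averages out.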