March 13, 2023

2.3 Batch and Momentum

1 Batch

1.1 Review: Optimization with Batch

In mini-batch training, the data set is shuffled and split into batches. Each update computes the loss and gradient on a single batch, and one pass over all batches is an epoch; the batches are typically reshuffled between epochs.
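As a concrete illustration, here is a minimal NumPy sketch of mini-batch gradient descent on a least-squares loss; the loss function, learning rate, and batch size are placeholder assumptions, not the lecture's setup.

```python
import numpy as np

def minibatch_gd(X, y, w, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch gradient descent on a linear least-squares loss (illustrative)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # shuffle once per epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]     # indices of one batch
            err = X[b] @ w - y[b]
            grad = X[b].T @ err / len(b)          # gradient on this batch only
            w -= lr * grad                        # one update per batch
    return w
```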

1.2 Small Batch vs. Large Batch

A large batch computes the loss over many examples before each update, so steps are slow but stable; a small batch updates after only a few examples, so steps are quick but noisy.

1.2.1 Training Time

Parallel computing: on a GPU, the time for one update barely grows with batch size until the batch exceeds the hardware's parallel capacity. A small batch therefore needs far more updates to finish an epoch; e.g., with 60,000 training examples, batch size 1 means 60,000 updates per epoch while batch size 1,000 means only 60, so the larger batch can actually be faster per epoch. See the timing sketch below.
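A minimal CPU sketch of this tradeoff, using a matrix multiply as a stand-in for a forward pass; exact numbers depend on hardware, and on a GPU the flattening of per-update time is far more pronounced.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((60_000, 784)).astype(np.float32)  # MNIST-sized data
W = rng.standard_normal((784, 10)).astype(np.float32)

for batch_size in (1, 10, 100, 1_000, 10_000):
    n_updates = len(X) // batch_size           # updates needed for one epoch
    timed = min(n_updates, 100)                # time at most 100 of them
    start = time.perf_counter()
    for i in range(timed):
        batch = X[i * batch_size:(i + 1) * batch_size]
        _ = batch @ W                          # stand-in for one forward pass
    per_update = (time.perf_counter() - start) / timed
    print(f"batch={batch_size:>6}: {per_update * 1e6:9.1f} us/update, "
          f"{n_updates:>6} updates/epoch")
```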

1.2.2 Performance

Counterintuitively, smaller batches tend to reach better accuracy on both training and validation data. One explanation is that the noisy per-batch gradients help optimization: when the full-batch gradient is stuck near a critical point, the losses of individual batches differ slightly, so some batch still gives a nonzero gradient and training keeps moving.

Flat minima are better than sharp minima: if the testing loss surface is slightly shifted relative to the training loss, a flat minimum still gives a low testing loss, while a sharp minimum can become very bad. Small-batch noise tends to kick the parameters out of sharp minima, so small batches favor flat minima and generalize better.

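A toy illustration of this argument (the two quadratics and the 0.5 shift are made-up numbers): modelling the train/test mismatch as a horizontal shift of the loss surface, the same shift costs far more at a sharp minimum.

```python
import numpy as np

# Two toy loss curves with equally deep minima at x = 0:
flat  = lambda x: 0.1 * x**2        # flat minimum (small curvature)
sharp = lambda x: 10.0 * x**2       # sharp minimum (large curvature)

# Model the train/test mismatch as a small horizontal shift of the surface.
shift = 0.5
print("testing loss at flat  minimum:", flat(0 - shift))   # 0.025
print("testing loss at sharp minimum:", sharp(0 - shift))  # 2.5
```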

1.3 Comparison

| | Small Batch | Large Batch |
| --- | --- | --- |
| Speed for one update (no parallel) | Faster | Slower |
| Speed for one update (with parallel) | Same | Same (if not too large) |
| Time for one epoch | Slower | Faster |
| Gradient | Noisy | Stable |
| Optimization | Better | Worse |
| Generalization | Better | Worse |


2 Momentum

2.1 (Vanilla) Gradient Descent

(Vanilla) gradient descent moves the parameters in the opposite direction of the gradient.
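With $\eta$ the learning rate and $\mathbf{g^i}$ the gradient at step $i$ (writing $\mathbf{\theta^i}$ for the parameters, a symbol assumed here rather than taken from the note), one step is:

$$
\begin{align}
\mathbf{\theta^{i+1}} &= \mathbf{\theta^i} - \eta\mathbf{g^i}
\end{align}
$$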


2.2 Gradient Descent + Momentum

Movement: the movement of the last step minus the gradient at the present step. The update direction is thus not the current gradient alone; it also carries the momentum of all previous steps.

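Concretely, with $\lambda$ the momentum weight (this recurrence is consistent with the expansion below; $\mathbf{\theta^i}$ is again an assumed symbol for the parameters):

$$
\begin{align}
\mathbf{m^{i+1}} &= \lambda\mathbf{m^i} - \eta\mathbf{g^i} \\
\mathbf{\theta^{i+1}} &= \mathbf{\theta^i} + \mathbf{m^{i+1}}
\end{align}
$$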

$\mathbf{m^i}$ is a weighted sum of all the previous gradients $\mathbf{g^0}, \mathbf{g^1}, \dots, \mathbf{g^{i-1}}$:

$$
\begin{align}
\mathbf{m^0} &= \mathbf{0} \\
\mathbf{m^1} &= -\eta\mathbf{g^0} \\
\mathbf{m^2} &= -\lambda\eta\mathbf{g^0} - \eta\mathbf{g^1} \\
&\;\;\vdots
\end{align}
$$
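A minimal NumPy sketch of gradient descent with momentum matching the recurrence above; the quadratic example loss and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def gd_with_momentum(grad, theta0, lr=0.01, lam=0.9, steps=100):
    """Gradient descent with momentum: m <- lam*m - lr*g, theta <- theta + m."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)          # m^0 = 0
    for _ in range(steps):
        g = grad(theta)               # gradient at the present step
        m = lam * m - lr * g          # last step's movement minus scaled gradient
        theta = theta + m
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
print(gd_with_momentum(lambda t: 2 * t, [3.0, -2.0]))  # -> close to [0, 0]
```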
