1 Batch
1.1 Review: Optimization with Batch
1.2 Small Batch v.s. Large Batch
1.2.1 Training Time
Parallel Computing
1.2.2 Performance
- Small Batch size has better performance for training data.
- What’s wrong with large batch size on training data? Optimization Fails.
- Smaller batch size and momentum help escape critical points.
- Small Batch size has better performance on testing data.
- What wrong with large batch size on testing data? Overfitting
Flat Minima is better than Sharp Minima.
1.3 Comparison
1.3.1 Reference
2 Momentum
2.1 (Vanilla) Gradient Descent
Move in the opposite direction of gradient descent.
2.2 Gradient Descent + Momentum
Movement: movement of last step minus gradient at present
is the weighted sum of all the previous gradient: