Tips for training: Adapting Learning Rate
1 Training Stuck ≠ Small Gradient
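Even when the loss curve flattens out, the gradient norm is often still large: with a fixed learning rate the update can keep bouncing between the two walls of a steep valley instead of sliding down to the bottom. A minimal sketch of this effect (the quadratic loss, learning rate, and step count below are illustrative assumptions, not from the notes):

```python
import numpy as np

# Badly conditioned quadratic loss: L(w) = 0.5 * (w0^2 + 100 * w1^2).
# The fixed learning rate is fine for the flat w0 direction but too large
# for the steep w1 direction, so the update bounces between the valley
# walls: the loss stops decreasing while the gradient stays large.
def loss(w):
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 100.0 * w[1]])

w = np.array([1.0, 1.0])
eta = 0.02
for step in range(1, 401):
    g = grad(w)
    w = w - eta * g
    if step % 100 == 0:
        print(f"step {step:3d}  loss {loss(w):7.3f}  grad norm {np.linalg.norm(g):7.3f}")
```

The printed loss plateaus around 50 while the gradient norm stays near 100, so a flat training curve alone does not mean a critical point has been reached.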


2 Different Parameters Need Different Learning Rates

t: the iteration index, i: one of the parameters
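As a sketch with the usual symbols (assuming $\theta$ for the parameters, $L$ for the loss, $\eta$ for the learning rate), the vanilla update and its parameter-dependent variant are

$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta\, g_i^{t}, \qquad g_i^{t} = \left.\frac{\partial L}{\partial \theta_i}\right|_{\theta = \theta^{t}}$$

$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^{t}}\, g_i^{t}$$

where $\sigma_i^{t}$ depends on the gradient history of parameter $i$, so each parameter effectively gets its own learning rate. Adagrad and RMSProp below differ only in how $\sigma_i^{t}$ is computed.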
2.1 Adagrad: Root Mean Square (RMS)
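A sketch of the RMS statistic named in the heading: $\sigma_i^{t}$ is the root mean square of all gradients of parameter $i$ seen so far.

$$\sigma_i^{t} = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}\left(g_i^{k}\right)^{2}}, \qquad \theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^{t}}\, g_i^{t}$$

A parameter with consistently large gradients accumulates a large $\sigma_i^{t}$ and takes smaller effective steps; a parameter with small gradients takes larger ones.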

2.2 RMSProp
Unlike Adagrad, RMSProp lets the learning rate adapt dynamically even along the same parameter direction, because recent gradients count more than old ones.
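A sketch of the RMSProp statistic: an exponentially weighted moving average with hyperparameter $0 < \alpha < 1$ replaces the uniform average used by Adagrad.

$$\sigma_i^{t} = \sqrt{\alpha\left(\sigma_i^{t-1}\right)^{2} + (1-\alpha)\left(g_i^{t}\right)^{2}}, \qquad \sigma_i^{0} = \sqrt{\left(g_i^{0}\right)^{2}}$$

A smaller $\alpha$ makes the most recent gradient dominate, so $\sigma_i^{t}$ reacts quickly when the error surface switches between flat and steep along the same direction.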

2.3 Adam: RMSProp + Momentum
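Adam keeps a momentum term (a moving average of gradients, used in the numerator) and an RMSProp-style term (a moving average of squared gradients, used in the denominator), plus bias correction. A minimal sketch of one update step; the beta and epsilon values follow the defaults in the original Adam paper, while the learning rate, function name, and toy loss are illustrative assumptions:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-2,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) + RMSProp-style scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad       # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSProp: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on L(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    g = theta
    theta, m, v = adam_step(theta, g, m, v, t)
print(theta)  # both coordinates end up close to the minimum at the origin
```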

2.4 Learning Rate Scheduling
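Here the learning rate itself becomes time-dependent, $\eta \rightarrow \eta^{t}$ (a sketch of the scheduled update, using the same notation as above):

$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^{t}}{\sigma_i^{t}}\, g_i^{t}$$

Two common schedules are learning rate decay, where $\eta^{t}$ shrinks as training nears the end, and warm-up, where $\eta^{t}$ first increases and then decreases, giving the $\sigma_i^{t}$ statistics time to become reliable before large steps are taken.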


Warm-up is used, for example, in residual networks (ResNet), the Transformer, and BERT.
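For a concrete example, the warm-up schedule from the Transformer paper ("Attention Is All You Need", Vaswani et al., 2017) increases the rate linearly for `warmup_steps` steps and then decays it with the inverse square root of the step number; the `d_model` and `warmup_steps` values below are the paper's defaults, not from these notes:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Warm-up then inverse-sqrt decay, as in the Transformer paper."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at step == warmup_steps and decays afterwards.
for s in (1, 1000, 4000, 10000, 100000):
    print(s, transformer_lr(s))
```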

2.5 Summary of Optimization
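Putting the pieces together, the optimizers above can be summarized (as sketched here) by one update that combines momentum, the RMS statistic, and a scheduled learning rate:

$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^{t}}{\sigma_i^{t}}\, m_i^{t}$$

where $m_i^{t}$ is the momentum, a weighted sum of past gradients that keeps their signs; $\sigma_i^{t}$ is the root mean square of past gradients, which keeps only their magnitudes; and $\eta^{t}$ follows the learning rate schedule. Momentum and $\sigma_i^{t}$ do not cancel out, precisely because one retains direction information and the other discards it.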
