March 14, 2023

2.4 Error Surface Is Rugged...

Tips for training: Adapting Learning Rate

1 Training stuck ≠ Small Gradient

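The point of the slides can be reproduced in a few lines: on an ill-conditioned toy loss, vanilla gradient descent can get stuck bouncing between the walls of a steep valley, so the loss stops decreasing even though the gradient norm stays large. A minimal sketch (the loss surface and step size are assumed example values):

```python
import numpy as np

# toy loss L(w) = 0.5 * (a*w1^2 + b*w2^2): very steep in w1, flat in w2
a, b = 100.0, 1.0
w = np.array([1.0, 1.0])
lr = 0.02  # exactly the stability limit 2/a for the steep direction

for _ in range(500):
    grad = np.array([a * w[0], b * w[1]])
    w = w - lr * grad  # w1 oscillates between +1 and -1; w2 slowly decays

loss = 0.5 * (a * w[0] ** 2 + b * w[1] ** 2)
grad_norm = np.linalg.norm(np.array([a * w[0], b * w[1]]))
# the loss has plateaued near 50, yet the gradient norm is still about 100
```

Training is "stuck" here, but not at a critical point: the fix is a smaller (or adaptive) learning rate, not escaping a saddle point.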

2 Different Parameters Need Different Learning Rates


$t$: iteration index; $i$: index of one parameter. Each parameter then gets its own step size: $\theta_i^{t+1} \leftarrow \theta_i^{t} - \dfrac{\eta}{\sigma_i^{t}}\, g_i^{t}$, where $g_i^{t}$ is the gradient of the loss w.r.t. $\theta_i$ at step $t$.

2.1 Adagrad: Root Mean Square (RMS)

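In Adagrad (as presented here), $\sigma_i^t$ is the root mean square of all past gradients of parameter $i$. A minimal NumPy sketch, assuming an array parameter and an example $\eta$:

```python
import numpy as np

def adagrad_step(theta, grad, sum_sq, t, eta=1.0, eps=1e-8):
    """One Adagrad step: divide each parameter's update by the
    root mean square (RMS) of all its gradients up to step t."""
    sum_sq = sum_sq + grad ** 2                 # running sum of squared gradients
    sigma = np.sqrt(sum_sq / (t + 1))           # sigma_i^t: RMS over steps 0..t
    theta = theta - eta / (sigma + eps) * grad  # parameter-wise step size eta/sigma
    return theta, sum_sq

# usage: minimize 0.5 * theta^2, where grad = theta
theta, sum_sq = np.array([10.0]), np.zeros(1)
for t in range(20):
    theta, sum_sq = adagrad_step(theta, theta, sum_sq, t)
```

Because $\sigma$ averages over the entire history, a parameter that consistently sees large gradients gets a small step, and vice versa.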

2.2 RMSProp

With RMSProp, the learning rate can adapt dynamically even along the same direction: $\sigma_i^t$ is an exponentially weighted moving average of the squared gradients, so recent gradients matter more than old ones.

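A sketch of the RMSProp update, where $\alpha$ controls how quickly old gradients are forgotten (the $\alpha$ and $\eta$ values are example choices):

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, alpha=0.9, eta=0.1, eps=1e-8):
    """One RMSProp step: unlike Adagrad's all-history RMS, sigma here is an
    exponential moving average, so recent gradients dominate and the
    effective step size can grow again when the surface flattens out."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    sigma = np.sqrt(sq_avg)
    theta = theta - eta / (sigma + eps) * grad
    return theta, sq_avg

# usage: minimize 0.5 * theta^2 (grad = theta)
theta, sq_avg = np.array([5.0]), np.zeros(1)
for _ in range(50):
    theta, sq_avg = rmsprop_step(theta, theta, sq_avg)
```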

2.3 Adam: RMSProp + Momentum

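Adam combines momentum (first moment, in the numerator) with RMSProp's $\sigma$ (second moment, in the denominator), plus bias correction for the zero-initialized moments. A minimal sketch using the Adam paper's default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1): momentum in the numerator,
    an RMSProp-style second moment in the denominator."""
    m = beta1 * m + (1 - beta1) * grad        # momentum (1st moment)
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp term (2nd moment)
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage: minimize 0.5 * theta^2 (grad = theta)
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    theta, m, v = adam_step(theta, theta, m, v, t, eta=0.01)
```

Note that momentum keeps the gradients' signs (directions can cancel), while the second moment uses only their magnitudes, so the two terms do not cancel out.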

2.4 Learning Rate Scheduling

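The first scheduling strategy is learning-rate decay: $\eta^t$ shrinks as training approaches the minimum. A minimal sketch with exponential decay (the base rate and decay factor are assumed example values):

```python
def decayed_lr(step, base_lr=0.1, decay_rate=0.99):
    # exponential decay: eta^t = base_lr * decay_rate**t
    return base_lr * decay_rate ** step
```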

Warm-up is used in residual networks (ResNet), the Transformer, and BERT.

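Warm-up raises $\eta$ from a small value first, then decays it. One published schedule with this shape is the one from the Transformer paper ("Attention Is All You Need"); the `d_model` and `warmup_steps` values below are that paper's base-model defaults:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """LR rises linearly for warmup_steps, then decays ~ 1/sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The intuition: early in training the $\sigma_i^t$ statistics are estimated from very few gradients and are unreliable, so the step size is kept small until they stabilize.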

2.5 Summary of Optimization

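Putting the pieces together, the whole section can be summarized as a single update rule (reconstructed from the lecture's notation; the exact slide content is an assumption):

$$
\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta^{t}}{\sigma_i^{t}}\, m_i^{t}
$$

where $m_i^t$ (momentum) is a weighted sum of past gradients that keeps their directions, $\sigma_i^t$ (Adagrad / RMSProp) depends only on their magnitudes, and $\eta^t$ follows a schedule (decay or warm-up).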

# ML