Tips for training: Adapting Learning Rate
1 Training Stuck ≠ Small Gradient
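When the loss stops falling, the gradient is not necessarily small: the update can bounce back and forth across a steep valley while the loss plateaus. A toy illustration (a hypothetical 2-D quadratic, with the learning rate chosen to sit right at the stability limit of the steep direction):

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: loss(w) = w0^2 + 100 * w1^2.
def loss(w):
    return w[0] ** 2 + 100.0 * w[1] ** 2

def grad(w):
    return np.array([2.0 * w[0], 200.0 * w[1]])

w = np.array([1.0, 1.0])
eta = 0.01  # exactly at the stability limit 2/200 of the steep w1 direction
for step in range(1000):
    w = w - eta * grad(w)

# The loss has stopped decreasing (w1 just flips sign every step),
# yet the gradient norm is ~200: stuck, but not at a critical point.
print(f"loss={loss(w):.1f}  grad_norm={np.linalg.norm(grad(w)):.1f}")
```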
2 Different Parameters Need Different Learning Rates
$\theta_i^{t+1} \leftarrow \theta_i^t - \dfrac{\eta}{\sigma_i^t}\, g_i^t$, where $g_i^t = \left.\dfrac{\partial L}{\partial \theta_i}\right|_{\theta = \theta^t}$; $t$: iteration, $i$: one of the parameters. The scale $\sigma_i^t$ depends on both the parameter and the iteration, giving each parameter its own effective learning rate $\eta / \sigma_i^t$.
2.1 Adagrad: Root Mean Square (RMS)
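Adagrad sets $\sigma_i^t$ to the root mean square of all gradients seen so far, $\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{s=0}^{t}\left(g_i^s\right)^2}$, so parameters with consistently large gradients take smaller steps. A minimal sketch (the function name and signature are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, sq_sum, t, eta=0.1, eps=1e-8):
    """One update with sigma = RMS of all past gradients (element-wise)."""
    sq_sum = sq_sum + grad ** 2                 # running sum of g^2 since t = 0
    sigma = np.sqrt(sq_sum / (t + 1))           # root mean square of past gradients
    theta = theta - eta / (sigma + eps) * grad  # per-parameter step eta / sigma
    return theta, sq_sum
```

Library implementations such as torch.optim.Adagrad accumulate the raw sum of squared gradients rather than their mean; the RMS form above follows the lecture's presentation.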
2.2 RMSProp
Replacing the all-history RMS with an exponentially weighted moving average, $\sigma_i^t = \sqrt{\alpha \left(\sigma_i^{t-1}\right)^2 + (1-\alpha)\left(g_i^t\right)^2}$ with $0 < \alpha < 1$, lets the learning rate adapt dynamically even along the same direction: recent gradients outweigh old ones, so the step size can react when the error surface changes from flat to steep.
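A corresponding sketch for RMSProp; the default `alpha = 0.9` is a common choice, an assumption rather than anything fixed by the notes:

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, eta=1e-3, alpha=0.9, eps=1e-8):
    """One update with sigma = exponentially weighted RMS of gradients."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2  # recent g^2 weighted more
    sigma = np.sqrt(sq_avg)
    theta = theta - eta / (sigma + eps) * grad
    return theta, sq_avg
```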
2.3 Adam: RMSProp + Momentum
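Adam combines the two pieces the heading names: a momentum term (an exponential moving average of gradients, used as the numerator) and the RMSProp denominator, plus bias correction for the zero-initialized averages. A minimal sketch using the default hyperparameters from Kingma & Ba (2015):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```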
2.4 Learning Rate Scheduling
Two common schedules: decay, where the learning rate shrinks over time as training approaches a minimum, and warm-up, where the learning rate first increases and then decreases. Warm-up is used in training Residual Networks (ResNet), the Transformer, and BERT.
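One well-known warm-up schedule is the one from the Transformer paper ("Attention Is All You Need"): the learning rate grows linearly for the first warmup_steps steps, then decays with the inverse square root of the step number. A sketch with the paper's defaults:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid step ** -0.5 blowing up at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached exactly at step == warmup_steps.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(40000))
```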