Optimization
CSE 891: Deep Learning
Vishnu Boddeti
```python
# vanilla gradient descent
w = initialize_weights()
for t in range(num_steps):
    dw = compute_gradients(loss_fn, data, w)
    w -= learning_rate * dw
```
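To make the pseudocode above concrete, here is a minimal runnable sketch of the same loop on a least-squares problem, with the gradient written out in closed form; the data, step size, and step count are made-up illustrative values, not from the course.

```python
import numpy as np

# hypothetical data: 100 samples, 3 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)                      # initialize_weights()
learning_rate, num_steps = 1e-2, 500

for t in range(num_steps):
    # closed-form gradient of the mean squared error loss w.r.t. w
    dw = 2 * X.T @ (X @ w - y) / len(y)
    w -= learning_rate * dw
```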
$x_1$ | $x_2$ | $t$ |
---|---|---|
100.8 | 0.00345 | 5.1 |
340.1 | 0.00181 | 3.2 |
200.2 | 0.00267 | 4.1 |
$\vdots$ | $\vdots$ | $\vdots$ |

$x_1$ | $x_2$ | $t$ |
---|---|---|
1003.2 | 1005.1 | 3.3 |
1001.1 | 1008.2 | 4.8 |
998.3 | 1003.4 | 2.9 |
$\vdots$ | $\vdots$ | $\vdots$ |
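Reading the two tables above as training data, the features are poorly scaled: in the first, $x_1$ is several orders of magnitude larger than $x_2$; in the second, both features sit near 1000 with only small variation. Either way, the loss surface becomes badly conditioned for gradient descent. Below is a minimal sketch of the usual remedy, standardizing each feature to zero mean and unit variance; the variable names are mine, not the course's.

```python
import numpy as np

# feature columns copied from the first table above (targets omitted)
X = np.array([[100.8, 0.00345],
              [340.1, 0.00181],
              [200.2, 0.00267]])

# standardize each column: subtract the mean, divide by the standard deviation
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma
```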
Algorithm | Tracks 1st moments (momentum) | Tracks 2nd moments (adaptive LR) | Leaky 2nd moments | Bias correction for moment estimates |
---|---|---|---|---|
SGD | | | | |
SGD+Momentum | ✓ | | | |
Nesterov | ✓ | | | |
AdaGrad | | ✓ | | |
RMSProp | | ✓ | ✓ | |
Adam | ✓ | ✓ | ✓ | ✓ |
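The columns of the table above map directly onto a few lines of update code. Below is a minimal sketch (not the course's reference implementation) of SGD+Momentum, RMSProp, and Adam, showing which moments each one tracks and where Adam's bias correction enters; the hyperparameter defaults are common conventions, not values from the slides.

```python
import numpy as np

def sgd_momentum(w, dw, v, lr=1e-2, rho=0.9):
    # first moment only: running average of gradients
    v = rho * v + dw
    return w - lr * v, v

def rmsprop(w, dw, s, lr=1e-3, decay=0.99, eps=1e-8):
    # leaky second moment: running average of squared gradients
    s = decay * s + (1 - decay) * dw * dw
    return w - lr * dw / (np.sqrt(s) + eps), s

def adam(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # first and leaky second moments, plus bias correction
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw * dw
    m_hat = m / (1 - beta1 ** t)      # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Each function returns the updated weights together with its running state, which the training loop is expected to carry from step to step.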
- General form (see the sketch after this list): $\mathcal{L}(\mathbf{w})=\mathcal{L}_{data}(\mathbf{w})+\mathcal{L}_{reg}(\mathbf{w})$
- $\mathbf{g}_t = \nabla\mathcal{L}(\mathbf{w}_t)$
- $\mathbf{s}_t = \text{optimizer}(\mathbf{g}_t)$
- $\mathbf{w}_{t+1}=\mathbf{w}_t - \alpha\mathbf{s}_t$
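Read as code, the bullets above say the optimizer only reshapes the gradient into a step direction; everything else is a fixed update skeleton. A minimal sketch with placeholder names (`grad_fn` and `optimizer` are assumptions standing in for the pieces above):

```python
def train_step(w, grad_fn, optimizer, lr):
    g = grad_fn(w)      # g_t = ∇L(w_t), data loss plus regularizer
    s = optimizer(g)    # s_t = optimizer(g_t)
    return w - lr * s   # w_{t+1} = w_t - α s_t

# usage: plain SGD (optimizer is the identity) on a toy quadratic loss w^2
w_next = train_step(w=2.0, grad_fn=lambda w: 2 * w, optimizer=lambda g: g, lr=0.1)
```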
- L2 regularization (see the sketch after this list): $\mathcal{L}(\mathbf{w})=\mathcal{L}_{data}(\mathbf{w})+\lambda\|\mathbf{w}\|_2^2$
- $\mathbf{g}_t = \nabla\mathcal{L}(\mathbf{w}_t)=\nabla\mathcal{L}_{data}(\mathbf{w}_t)+2\lambda\mathbf{w}_t$
- $\mathbf{s}_t = \text{optimizer}(\mathbf{g}_t)$
- $\mathbf{w}_{t+1}=\mathbf{w}_t - \alpha\mathbf{s}_t$
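With an explicit L2 penalty, the only change is that the regularizer's gradient is folded into $\mathbf{g}_t$ before the optimizer ever sees it; a minimal sketch under the same placeholder interface as above:

```python
def train_step_l2(w, data_grad_fn, optimizer, lr, lam):
    g = data_grad_fn(w) + 2 * lam * w   # g_t = ∇L_data(w_t) + 2λ w_t
    s = optimizer(g)                    # adaptive optimizers rescale the penalty too
    return w - lr * s
```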
- Decoupled weight decay (see the sketch after this list): $\mathcal{L}(\mathbf{w})=\mathcal{L}_{data}(\mathbf{w})$
- $\mathbf{g}_t = \nabla\mathcal{L}_{data}(\mathbf{w}_t)$
- $\mathbf{s}_t = \text{optimizer}(\mathbf{g}_t) + 2\lambda\mathbf{w}_t$
- $\mathbf{w}_{t+1}=\mathbf{w}_t - \alpha\mathbf{s}_t$
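Decoupled weight decay instead keeps the optimizer blind to the penalty and adds the decay term directly to the step (the idea behind AdamW); a minimal sketch:

```python
def train_step_decay(w, data_grad_fn, optimizer, lr, lam):
    g = data_grad_fn(w)                 # g_t = ∇L_data(w_t) only
    s = optimizer(g) + 2 * lam * w      # s_t = optimizer(g_t) + 2λ w_t
    return w - lr * s
```

For plain SGD the two variants coincide, but for adaptive optimizers such as Adam they do not, which is why the distinction matters.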