Modeling Long-Term Dependencies
CSE 891: Deep Learning
Vishnu Boddeti
Wednesday October 13, 2021
Overview
Last class, we saw RNNs and how to use backpropagation through time to compute gradients.
Backpropagation, while mathematically correct, might fail due to exploding or vanishing gradients.
In general, learning dependencies over very long terms is a difficult problem.
This lecture: What causes vanishing or exploding gradients? How to overcome these problems?
Why Do Gradients Vanish or Explode?
Consider the machine translation example. The RNN reads an entire sentence in one language before it outputs the translated sentence.
A typical sentence is about 20 words long, so there is a gap of roughly 20 time steps between when the network sees the first piece of information and when it starts using it.
The derivatives need to propagate over this entire path.
Backpropagation Through Time
$$\begin{eqnarray} \mathbf{h}_t &=& \phi(\mathbf{W}\mathbf{h}_{t-1} + \mathbf{U}\mathbf{x}_{t}) \nonumber \\ \mathbf{y}_t &=& \mathbf{V}\mathbf{h}_t \nonumber \end{eqnarray}$$
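For concreteness, here is a minimal NumPy sketch of this forward recurrence; the choice $\phi = \tanh$ and the dimensions are assumptions for illustration, not part of the slide.

```python
import numpy as np

def rnn_forward(X, W, U, V, h0):
    """Run the recurrence h_t = phi(W h_{t-1} + U x_t), y_t = V h_t.

    X:  (T, input_dim) input sequence
    W:  (hidden_dim, hidden_dim) recurrent weights
    U:  (hidden_dim, input_dim) input weights
    V:  (output_dim, hidden_dim) readout weights
    h0: (hidden_dim,) initial hidden state
    """
    h, hs, ys = h0, [], []
    for x_t in X:
        h = np.tanh(W @ h + U @ x_t)   # phi = tanh (assumed)
        hs.append(h)
        ys.append(V @ h)
    return np.stack(hs), np.stack(ys)

# Example usage with random weights
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 20, 3, 5, 2
X = rng.normal(size=(T, d_in))
W, U, V = (rng.normal(size=s) * 0.1 for s in [(d_h, d_h), (d_h, d_in), (d_out, d_h)])
hs, ys = rnn_forward(X, W, U, V, np.zeros(d_h))
```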
RNN Computational Flow Graph
Activations:
$$\begin{equation} \begin{aligned} \overline{\mathcal{L}} &= 1 \\ \overline{y^{(t)}} &= \overline{\mathcal{L}}\frac{\partial \mathcal{L}}{\partial y^{(t)}} \\ \overline{r^{(t)}} &= \overline{y^{(t)}}\phi'(r^{(t)}) \\ \overline{h^{(t)}} &= \overline{r^{(t)}}v+\overline{z^{(t+1)}}w \\ \overline{z^{(t)}} &= \overline{h^{(t)}}\phi'(z^{(t)}) \end{aligned} \end{equation}$$
Parameters:
$$\begin{equation} \begin{aligned} \overline{u} &= \sum_t \overline{z^{(t)}}x^{(t)} \\ \overline{v} &= \sum_t \overline{r^{(t)}}h^{(t)} \\ \overline{w} &= \sum_t \overline{z^{(t+1)}}h^{(t)} \nonumber \end{aligned} \end{equation}$$
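A minimal NumPy sketch of these updates, assuming the univariate forward pass implied by them ($z^{(t)} = ux^{(t)} + wh^{(t-1)}$, $h^{(t)} = \phi(z^{(t)})$, $r^{(t)} = vh^{(t)}$, $y^{(t)} = \phi(r^{(t)})$); the choice $\phi = \tanh$ is also an assumption.

```python
import numpy as np

def bptt_scalar(x, u, v, w, dL_dy, phi=np.tanh, dphi=lambda a: 1 - np.tanh(a) ** 2):
    """Backprop-through-time for the univariate RNN assumed above.

    x:     inputs x^(1..T)
    dL_dy: loss gradients dL/dy^(t), supplied by the loss function
    Returns the parameter gradients (u_bar, v_bar, w_bar).
    """
    T = len(x)
    # Forward pass, storing pre-activations and hidden states.
    z = np.zeros(T); h = np.zeros(T); r = np.zeros(T)
    h_prev = 0.0
    for t in range(T):
        z[t] = u * x[t] + w * h_prev
        h[t] = phi(z[t])
        r[t] = v * h[t]
        h_prev = h[t]

    # Backward pass, implementing the activation updates from the slide.
    u_bar = v_bar = w_bar = 0.0
    z_bar_next = 0.0                      # \bar{z}^{(T+1)} = 0
    for t in reversed(range(T)):
        r_bar = dL_dy[t] * dphi(r[t])     # \bar{r}^{(t)}
        h_bar = r_bar * v + z_bar_next * w
        z_bar = h_bar * dphi(z[t])        # \bar{z}^{(t)}
        u_bar += z_bar * x[t]
        v_bar += r_bar * h[t]
        h_tm1 = h[t - 1] if t > 0 else 0.0
        w_bar += z_bar * h_tm1            # same as summing \bar{z}^{(t+1)} h^{(t)}
        z_bar_next = z_bar
    return u_bar, v_bar, w_bar
```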
Backpropagation Through Time
$$\begin{eqnarray} \frac{\partial L}{\partial \mathbf{W}_{hh}} &=& \sum_{t=1}^T \frac{\partial L_t}{\partial \mathbf{W}_{hh}} \nonumber \\ \frac{\partial L_t}{\partial \mathbf{W}_{hh}} &=& \sum_{k=1}^t \frac{\partial L_t}{\partial \mathbf{y}_t}\frac{\partial \mathbf{y}_t}{\partial \mathbf{h}_t}\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k}\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}} \nonumber \\ \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} &=& \prod_{i=k+1}^t \frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}} = \prod_{i=k+1}^t \mathbf{W}^T_{\mathbf{hh}}diag\left(g'(\mathbf{h}_{i-1})\right) \nonumber \end{eqnarray}$$
Vanishing Gradients
$$\begin{eqnarray} \left\|\frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}}\right\|_2 &\leq& \left\|\mathbf{W}_{\mathbf{hh}}\right\|_2\left\|diag\left(g'(\mathbf{h}_{i-1})\right)\right\|_2 \leq \alpha\beta \nonumber \\ \left\|\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{k}}\right\|_2 &\leq& (\alpha\beta)^{t-k} \nonumber \end{eqnarray}$$
Vanishing/Exploding Gradients
Univariate Encoder Network:
Backpropagation Updates:
$$\begin{eqnarray} \overline{h^{(t)}} &=& \overline{z^{(t+1)}}w \nonumber \\ \overline{z^{(t)}} &=& \overline{h^{(t)}}\phi'(z^{(t)}) \nonumber \end{eqnarray}$$
Recursive Computation:
$$\begin{eqnarray} \overline{h^{(1)}} &=& w^{T-1}\phi'(z^{(2)})\dots\phi'(z^{(T)})\overline{h^{(T)}} \nonumber \end{eqnarray}$$
Assuming Linear Activation:
$$\begin{eqnarray} \frac{\partial h^{(T)}}{\partial h^{(1)}} &=& w^{T-1} \nonumber \end{eqnarray}$$
Exploding:
$$\begin{eqnarray} w = 1.1, T=50 \Rightarrow \frac{\partial h^{(T)}}{\partial h^{(1)}} = 1.1^{49} \approx 106.7 \nonumber \end{eqnarray}$$
Vanishing:
$$\begin{eqnarray} w = 0.9, T=50 \Rightarrow \frac{\partial h^{(T)}}{\partial h^{(1)}} = 0.9^{49} \approx 0.0057 \nonumber \end{eqnarray}$$
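These numbers are easy to verify with a two-line check, assuming the linear activation above so the gradient is exactly $w^{T-1}$:

```python
# The gradient of h^(T) with respect to h^(1) in the linear case is w**(T-1).
for w in (1.1, 0.9):
    T = 50
    grad = w ** (T - 1)
    print(f"w = {w}: dh^(T)/dh^(1) = {grad:.4g}")
# w = 1.1: dh^(T)/dh^(1) = 106.7     (explodes)
# w = 0.9: dh^(T)/dh^(1) = 0.005726  (vanishes)
```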
Vanishing/Exploding Gradients
In the multi-variate case, the Jacobians multiply:
\begin{equation} \frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(1)}} = \frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(T-1)}} \cdots \frac{\partial \mathbf{h}^{(2)}}{\partial \mathbf{h}^{(1)}} \nonumber \end{equation}
Matrices can explode or vanish just like scalar values, but it is more difficult to characterize precisely when this happens.
Consider the data path in the forward and backward pass:
The forward pass has non-linear activation functions that squash the activations and prevent them from exploding.
The backward pass is linear, so it is hard to keep the gradients stable.
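A small NumPy experiment illustrating the point: products of Jacobians whose spectral norm sits slightly above or below 1 explode or vanish just like the scalar case. The matrices here are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 50

for scale in (1.1, 0.9):
    # Random orthogonal matrix rescaled so its spectral norm equals `scale`.
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    J = scale * Q
    # Multiply T-1 Jacobians together, as in the chain rule above.
    prod = np.linalg.multi_dot([J] * (T - 1))
    print(scale, np.linalg.norm(prod, 2))   # ~1.1**49 vs ~0.9**49
```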
Vanishing/Exploding Gradients
We looked at the problem of vanishing or exploding gradients from the perspective of backpropagation. Let us dig a little deeper conceptually.
The Jacobian $\frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(1)}}$ determines how much $\mathbf{h}^{(T)}$ changes when you change $\mathbf{h}^{(1)}$.
Each hidden layer computes a function of the previous hidden layer and the current input:
$$\begin{equation} \mathbf{h}^{(t)} = f(\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}) \nonumber \end{equation}$$
As we propagate through time, the function gets nested iteratively:
$$\begin{equation} \mathbf{h}^{(4)} = f(f(f(\mathbf{h}^{(1)}, \mathbf{x}^{(2)}), \mathbf{x}^{(3)}),\mathbf{x}^{(4)}) \nonumber \end{equation}$$
Studying properties of iterated functions will allow us to understand what RNNs are computing.
Iterated Functions
$$f(x)=3.5x(1-x)$$
#--- { "data" : { "datasets" : [{ "borderColor": "yellowgreen", "backgroundColor": "#333333" }] }, "options": { "responsive": true, "legend": { "labels": { "fontColor": "white" }}, "scales": { "xAxes": [{ "display": false }], "yAxes": [{ "gridLines": { "drawOnChartArea": false }, "ticks": { "fontColor": "#ffffff" }, "display": true }] } } } ---#
#--- { "data" : { "datasets" : [{ "borderColor": "yellow", "backgroundColor": "#333333" }] }, "options": { "responsive": true, "legend": { "labels": { "fontColor": "white" }}, "scales": { "xAxes": [{ "display": false }], "yAxes": [{ "gridLines": { "drawOnChartArea": false }, "ticks": { "fontColor": "#ffffff" }, "display": true }] } } } ---#
#--- { "data" : { "datasets" : [{ "borderColor": "cyan", "backgroundColor": "#333333" }] }, "options": { "responsive": true, "legend": { "labels": { "fontColor": "white" }}, "scales": { "xAxes": [{ "display": false }], "yAxes": [{ "gridLines": { "drawOnChartArea": false }, "ticks": { "fontColor": "#ffffff" }, "display": true }] } } } ---#
#--- { "data" : { "datasets" : [{ "borderColor": "pink", "backgroundColor": "#333333" }] }, "options": { "responsive": true, "legend": { "labels": { "fontColor": "white" }}, "scales": { "xAxes": [{ "display": false }], "yAxes": [{ "gridLines": { "drawOnChartArea": false }, "ticks": { "fontColor": "#ffffff" }, "display": true }] } } } ---#
RNNs as Dynamical Systems
Imagine the RNN as a dynamical system with various attractors.
Within each colored region, gradients vanish. Even if you move a little, you still wind up at the same attractor.
At the boundary, the gradient blows up. Moving slightly moves you from one attractor to the other.
Vanishing/Exploding Gradients
Cliffs make it hard to estimate the true gradient. Here are the loss and cost functions with respect to the bias parameter for the hidden units:
Generally gradients will explode on some examples and vanish on other examples. In expectation, the loss may be fairly smooth.
Keeping Things Stable
A simple solution: clip the gradients
Clip the gradient $\mathbf{g}$ such that it has a norm of at most $\eta$:
$$\begin{equation} \mbox{if } \|\mathbf{g}\| \geq \eta \mbox{ then } \mathbf{g} \leftarrow \frac{\eta \mathbf{g}}{\|\mathbf{g}\|} \nonumber \end{equation}$$
The gradients end up being biased, but they do not explode.
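A minimal sketch of this rule in NumPy; PyTorch users typically reach for torch.nn.utils.clip_grad_norm_, which applies the same rescaling across all parameters.

```python
import numpy as np

def clip_gradient(g, eta):
    """Rescale g so that its norm is at most eta (the rule above)."""
    norm = np.linalg.norm(g)
    if norm >= eta:
        g = eta * g / norm
    return g

g = np.array([3.0, 4.0])          # ||g|| = 5
print(clip_gradient(g, eta=1.0))  # -> [0.6, 0.8], norm 1
```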
Keeping Things Stable
Another Trick: reverse the input sequence.
There is only one time step between the first word of the input and the first word of the output.
The network can first learn short-term dependencies between early words in the sentence and then long-term dependencies between the later words.
Keeping Things Stable
Exploding or vanishing gradients are a conceptual problem with plain RNNs.
Hidden units are a kind of memory; their default behavior should be to remember their previous value.
This means the function computed at each time step should be close to the identity function.
But it is hard to implement the identity function with non-linear activations.
If the function is close to the identity, the gradient computations are stable:
The Jacobians $\frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(1)}}$ are close to the identity matrix, so we can multiply more of them together and keep the gradients stable.
Keeping Things Stable
Identity RNNs:
Use the ReLU activation function
Initialize the recurrent weight matrix to the identity matrix and the biases to zero.
Negative activations are clipped to zero; in the absence of input, positive activations simply retain their value.
This allows learning much longer-term dependencies than vanilla RNNs.
It was able to learn to classify MNIST digits from an input sequence of one pixel at a time.
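A sketch of this initialization in PyTorch (not the exact setup from the paper); the hidden size and the small Gaussian scale for the input weights are assumptions.

```python
import torch
import torch.nn as nn

# Identity RNN: ReLU activation, recurrent weights initialized to I, biases to zero.
hidden = 100
rnn = nn.RNN(input_size=1, hidden_size=hidden, nonlinearity='relu', batch_first=True)
with torch.no_grad():
    nn.init.eye_(rnn.weight_hh_l0)          # recurrent weights = identity
    nn.init.zeros_(rnn.bias_hh_l0)
    nn.init.zeros_(rnn.bias_ih_l0)
    rnn.weight_ih_l0.normal_(0.0, 0.001)    # small Gaussian input weights (assumed scale)

# Pixel-by-pixel MNIST: each 28x28 image becomes a length-784 sequence of scalars.
x = torch.rand(32, 784, 1)                  # (batch, time, feature)
out, h_T = rnn(x)
```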
Long Short Term Memory
Another architecture which makes it easy to remember information for long periods of time.
The network's activations are its short-term memory and the weights are its long-term memory.
The LSTM architecture forces the short-term memory to last for a long time period.
LSTM consists of memory cells which have controllers that decide when to store and when to forget.
Long Short Term Memory
LSTM: Math
LSTM mimics a computer memory.
Input Gate:
controls memory writes
Forget Gate:
controls memory reset
Output Gate:
controls memory read
$$ \begin{eqnarray} \mathbf{i}_t &=& Sigmoid(\mathbf{W}_{\mathbf{xi}}\mathbf{x}_t + \mathbf{W}_{\mathbf{hi}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{i}) \nonumber \\ \mathbf{f}_t &=& Sigmoid(\mathbf{W}_{\mathbf{xf}}\mathbf{x}_t + \mathbf{W}_{\mathbf{hf}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{f}) \nonumber \\ \mathbf{o}_t &=& Sigmoid(\mathbf{W}_{\mathbf{xo}}\mathbf{x}_t + \mathbf{W}_{\mathbf{ho}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{o}) \nonumber \\ \mathbf{g}_t &=& Tanh(\mathbf{W}_{\mathbf{xg}}\mathbf{x}_t + \mathbf{W}_{\mathbf{hg}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{g}) \nonumber \\ \mathbf{c}_t &=& \mathbf{f}_t\odot\mathbf{c}_{t-1} + \mathbf{i}_t\odot\mathbf{g}_t \nonumber \\ \mathbf{h}_t &=& \mathbf{o}_t\odot Tanh(\mathbf{c}_t) \nonumber \end{eqnarray} $$
where $\odot$ is the Hadamard (element-wise) product.
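A minimal NumPy sketch of one time step of these equations; the params dictionary layout and the sigmoid helper are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of the LSTM equations above. params holds the
    W_x*, W_h* matrices and b_* vectors for the i, f, o, g blocks."""
    i = sigmoid(params["Wxi"] @ x_t + params["Whi"] @ h_prev + params["bi"])
    f = sigmoid(params["Wxf"] @ x_t + params["Whf"] @ h_prev + params["bf"])
    o = sigmoid(params["Wxo"] @ x_t + params["Who"] @ h_prev + params["bo"])
    g = np.tanh(params["Wxg"] @ x_t + params["Whg"] @ h_prev + params["bg"])
    c = f * c_prev + i * g          # element-wise (Hadamard) products
    h = o * np.tanh(c)
    return h, c
```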
Hadamard Product Backpropagation
$$\begin{eqnarray} \mathbf{a} &=& \mathbf{x} \odot \mathbf{y}, \qquad \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix}x_1y_1 \\ x_2y_2 \\ \vdots \\ x_ny_n\end{bmatrix} \nonumber \\ \frac{\partial L}{\partial x_k} &=& \sum_i \frac{\partial L}{\partial a_i}\frac{\partial a_i}{\partial x_k} = \frac{\partial L}{\partial a_k}y_k \nonumber \\ \frac{\partial L}{\partial \mathbf{x}} &=& \frac{\partial L}{\partial \mathbf{a}} \odot \mathbf{y} \nonumber \end{eqnarray}$$
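This rule is easy to sanity-check numerically against a finite-difference estimate; the vectors below are a small made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, v = rng.normal(size=(3, 4))          # v plays the role of dL/da
# Analytic gradient from the rule above: dL/dx = dL/da (Hadamard) y
grad_x = v * y

# Finite-difference check on L(x) = v . (x * y)
eps = 1e-6
fd = np.array([(v @ ((x + eps * e) * y) - v @ (x * y)) / eps
               for e in np.eye(4)])
print(np.allclose(grad_x, fd, atol=1e-4))  # True
```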
LSTM
Looks complicated? ML researchers thought so and LSTMs were barely used for about a decade after they were proposed.
In 2013 and 2014, researchers used them to get impressive results on challenging and important problems like speech recognition and machine translation.
Since then, they’ve been one of the most widely used RNN architectures.
There have been many attempts to simplify the architecture, but none has been conclusively shown to be both simpler and better.
You never have to think about the complexity, since frameworks like TensorFlow and PyTorch provide nice black box implementations.
Deep Recurrent Neural Networks