Modeling Long-Term Dependencies


CSE 891: Deep Learning

Vishnu Boddeti

Wednesday October 14, 2020

Overview

  • Last class, we saw RNNs, and how to use backpropagation through time to compute gradients.
  • Backpropagation, while mathematically correct, might fail due to exploding or vanishing gradients.
  • In general, learning dependencies over very long terms is a difficult problem.
  • This lecture: What causes vanishing or exploding gradients? How to overcome these problems?

Why Gradients Vanish or Explode?

  • Consider the machine translation example. The RNN reads an entire sentence in one language before it outputs the translated sentence.
  • A sentence is typically about 20 words long, so there is a gap of 20 time steps between when the network sees the first piece of information and when it starts using it.
  • The derivatives need to propagate over this entire path.

Backpropagation Through Time

$$\begin{eqnarray} \mathbf{h}_t &=& \phi(\mathbf{W}\mathbf{h}_{t-1} + \mathbf{U}\mathbf{x}_{t}) \nonumber \\ \mathbf{y}_t &=& \mathbf{V}\mathbf{h}_t \nonumber \end{eqnarray}$$
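To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass above (taking $\phi$ to be tanh; the particular shapes are assumptions for illustration):

```python
import numpy as np

def rnn_forward(X, W, U, V, h0):
    """Vanilla RNN forward pass: h_t = tanh(W h_{t-1} + U x_t), y_t = V h_t.

    X  : (T, input_dim) input sequence
    W  : (hidden_dim, hidden_dim) recurrent weights
    U  : (hidden_dim, input_dim) input weights
    V  : (output_dim, hidden_dim) output weights
    h0 : (hidden_dim,) initial hidden state
    """
    h, hs, ys = h0, [], []
    for x_t in X:                     # iterate over time steps
        h = np.tanh(W @ h + U @ x_t)  # hidden state update (phi = tanh assumed)
        hs.append(h)
        ys.append(V @ h)              # linear readout
    return np.stack(hs), np.stack(ys)
```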

RNN Computational Flow Graph

Backpropagation Through Time

$$\begin{eqnarray} \frac{\partial L}{\partial \mathbf{W}_{hh}} &=& \sum_{t=1}^T \frac{\partial L_t}{\partial \mathbf{W}_{hh}} \nonumber \\ \frac{\partial L_t}{\partial \mathbf{W}_{hh}} &=& \sum_{k=1}^t \frac{\partial L_t}{\partial \mathbf{y}_t}\frac{\partial \mathbf{y}_t}{\partial \mathbf{h}_t}\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k}\frac{\partial \mathbf{h}_k}{\partial \mathbf{W}_{hh}} \nonumber \\ \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_k} &=& \prod_{i=k+1}^t \frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}} = \prod_{i=k+1}^t \mathbf{W}^T_{\mathbf{hh}}diag\left(\phi'(\mathbf{h}_{i-1})\right) \nonumber \end{eqnarray}$$
  • Vanishing Gradients
$$\begin{eqnarray} \left\|\frac{\partial \mathbf{h}_i}{\partial \mathbf{h}_{i-1}}\right\|_2 &\leq& \left\|\mathbf{W}_{\mathbf{hh}}\right\|_2\left\|diag\left(\phi'(\mathbf{h}_{i-1})\right)\right\|_2 \leq \alpha\beta \nonumber \\ \left\|\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{k}}\right\|_2 &\leq& (\alpha\beta)^{t-k} \nonumber \end{eqnarray}$$
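A quick numerical sketch of the bound above: accumulate the per-step factors $\mathbf{W}^T_{\mathbf{hh}}\,diag(\phi'(\cdot))$ and watch the norm of the product shrink. Using tanh, random weights, and no inputs are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 32, 50

W_hh = rng.normal(size=(d, d))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)   # rescale so alpha = ||W_hh||_2 = 0.9

h = rng.normal(size=d)
J = np.eye(d)                           # accumulated product of per-step Jacobian factors
for t in range(1, T + 1):
    h = np.tanh(W_hh @ h)                      # forward step (no input, for simplicity)
    J = W_hh.T @ np.diag(1.0 - h ** 2) @ J     # W^T diag(phi'(.)), with phi = tanh
    if t % 10 == 0:
        # spectral norm is upper-bounded by (alpha*beta)^t, so it shrinks geometrically here
        print(t, np.linalg.norm(J, 2))
```

Rescaling $\mathbf{W}_{\mathbf{hh}}$ so that $\alpha > 1$ removes the guarantee of decay, and the product can instead blow up.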

Vanishing/Exploding Gradients

  • Univariate Encoder Network:
  • Backpropagation Updates:
  • $$\begin{eqnarray} \overline{h^{(t)}} &=& \overline{z^{(t+1)}}w \nonumber \\ \overline{z^{(t)}} &=& \overline{h^{(t)}}\phi'(z^{(t)}) \nonumber \end{eqnarray}$$
  • Recursive Computation:
  • $$\begin{eqnarray} \overline{h^{(1)}} &=& w^{T-1}\phi'(z^{(2)})\dots\phi'(z^{(T)})\overline{h^{(T)}} \nonumber \end{eqnarray}$$
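A tiny sketch of the univariate case above, taking $\phi$ to be tanh (an assumption): the backpropagated signal picks up a factor $w\,\phi'(z^{(t)})$ per step, so it shrinks or grows geometrically depending on the trajectory.

```python
import numpy as np

# Minimal sketch of the recursion h^(t) = phi(w * h^(t-1)), with phi = tanh assumed.
def grad_h1(w, T, h1):
    """d h^(T) / d h^(1) = w^(T-1) * prod_t phi'(z^(t)), via the chain rule."""
    h, grad = h1, 1.0
    for _ in range(T - 1):
        z = w * h
        h = np.tanh(z)
        grad *= w * (1.0 - np.tanh(z) ** 2)   # one factor w * phi'(z^(t)) per step
    return grad

print(grad_h1(w=0.9, T=30, h1=0.5))   # vanishes: |w| < 1
print(grad_h1(w=1.2, T=30, h1=0.5))   # also vanishes: trajectory settles at a stable attractor
print(grad_h1(w=1.2, T=30, h1=0.0))   # explodes like w^(T-1) near the unstable fixed point at 0
```

Note that even with $w > 1$ the gradient can vanish once the trajectory saturates at a stable attractor, while it explodes near the unstable fixed point; this previews the dynamical-systems picture later in the lecture.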

Vanishing/Exploding Gradients

  • In the multi-variate case, the Jacobians multiply:
  • \begin{equation} \frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(1)}} = \frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(T-1)}} \cdots \frac{\partial \mathbf{h}^{(2)}}{\partial \mathbf{h}^{(1)}} \nonumber \end{equation}
  • Matrices can explode or vanish just like scalar values, but it is harder to determine precisely when this happens.
  • Consider the data path in the forward and backward pass:
    • The forward pass has non-linear activation functions that squash the activations and prevent them from exploding.
    • The backward pass is linear, so it is difficult to keep the gradients stable.

Vanishing/Exploding Gradients

  • We looked at the problem of vanishing or exploding gradients from the perspective of backpropagation. Let us dig a little deeper conceptually.
  • The Jacobian $\frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(1)}}$ determines how much $\mathbf{h}^{(T)}$ changes when you change $\mathbf{h}^{(1)}$.
  • Each hidden layer computes a function of the previous hidden layer and the current input:
  • $$\begin{equation} \mathbf{h}^{(t)} = f(\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}) \nonumber \end{equation}$$
  • As we propagate through time, the function gets nested iteratively:
  • $$\begin{equation} \mathbf{h}^{(4)} = f(f(f(\mathbf{h}^{(1)}, \mathbf{x}^{(2)}), \mathbf{x}^{(3)}),\mathbf{x}^{(4)}) \nonumber \end{equation}$$
  • Studying properties of iterated functions will allow us to understand what RNNs are computing.

Iterated Functions

$$f(x)=3.5x(1-x)$$
#--- { "data" : { "datasets" : [{ "borderColor": "yellowgreen", "backgroundColor": "#333333" }] }, "options": { "responsive": true, "legend": { "labels": { "fontColor": "white" }}, "scales": { "xAxes": [{ "display": false }], "yAxes": [{ "gridLines": { "drawOnChartArea": false }, "ticks": { "fontColor": "#ffffff" }, "display": true }] } } } ---# #--- { "data" : { "datasets" : [{ "borderColor": "yellow", "backgroundColor": "#333333" }] }, "options": { "responsive": true, "legend": { "labels": { "fontColor": "white" }}, "scales": { "xAxes": [{ "display": false }], "yAxes": [{ "gridLines": { "drawOnChartArea": false }, "ticks": { "fontColor": "#ffffff" }, "display": true }] } } } ---#

RNNs as Dynamical Systems

  • Imagine the RNN as a dynamical system with various attractors.
  • Within each colored region, gradients vanish. Even if you move a little, you still wind up at the same attractor.
  • At the boundary, the gradient blows up. Moving slightly moves you from one attractor to the other.

Vanishing/Exploding Gradients

  • Cliffs make it hard to estimate the true gradient. Here are the loss and cost functions with respect to the bias parameter for the hidden units:
  • Generally gradients will explode on some examples and vanish on other examples. In expectation, the loss may be fairly smooth.

Keeping Things Stable

  • A simple solution: clip the gradients
  • Clip the gradient $\mathbf{g}$ such that it has a norm of at most $\eta$:
  • $$\begin{equation} \mbox{if } \|\mathbf{g}\| \geq \eta \mbox{ then } \mathbf{g} \leftarrow \frac{\eta \mathbf{g}}{\|\mathbf{g}\|} \nonumber \end{equation}$$
  • The gradients end up being biased, but they do not explode.
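A minimal sketch of the clipping rule above; the PyTorch equivalent, torch.nn.utils.clip_grad_norm_, rescales the combined gradient norm of a parameter list in place.

```python
import numpy as np

def clip_gradient(g, eta):
    """Rescale g so that ||g|| <= eta (the rule above)."""
    norm = np.linalg.norm(g)
    if norm >= eta:
        g = eta * g / norm
    return g

# In PyTorch, after loss.backward() and before optimizer.step():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=eta)
```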

Keeping Things Stable

  • Another Trick: reverse the input sequence.
  • There is only one time step between the first word of the input and the first word of the output.
  • The network can first learn short-term dependencies between the early words in the sentence and then long-term dependencies between the later words.
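For example, in a sequence-to-sequence setup the source tokens are simply fed to the encoder in reverse order (the variable names below are only illustrative):

```python
# Reverse the source sequence before feeding it to the encoder.
src_tokens = ["the", "cat", "sat", "on", "the", "mat"]
reversed_src = src_tokens[::-1]   # ['mat', 'the', 'on', 'sat', 'cat', 'the']
# The target sentence is generated in the normal order, so its first word is now
# only a few time steps away from the source word it depends on most.
```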

Keeping Things Stable

  • Exploding or vanishing gradients are a conceptual problem with plain RNNs.
  • Hidden units are a kind of memory; their default behavior should be to remember their previous value.
    • the function computed at each time step should be close to the identity
    • it is hard to implement the identity function with non-linear activations
  • If the function is close to the identity, the gradient computations can be stable.
    • The Jacobians $\frac{\partial \mathbf{h}^{(T)}}{\partial \mathbf{h}^{(1)}}$ are close to the identity matrix, so we can multiply more of them together and keep the gradients stable.

Keeping Things Stable

  • Identity RNNs:
    • Use the ReLU activation function
    • Initialize the recurrent weight matrix to the identity matrix (and the biases to zero)
  • Negative activations are clipped to zero, while positive activations simply retain their value in the absence of input.
  • Allows learning much longer-term dependencies than vanilla RNNs.
  • IRNNs were able to learn to classify MNIST digits from an input sequence of one pixel at a time.
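A minimal PyTorch sketch of the Identity RNN (IRNN) recipe: ReLU activation via nn.RNN(nonlinearity='relu') and a recurrent weight matrix manually initialized to the identity (the sequence shapes below are just for illustration):

```python
import torch
import torch.nn as nn

hidden_size = 128
rnn = nn.RNN(input_size=1, hidden_size=hidden_size, nonlinearity='relu', batch_first=True)

with torch.no_grad():
    rnn.weight_hh_l0.copy_(torch.eye(hidden_size))   # recurrent weights = identity
    rnn.bias_hh_l0.zero_()                           # zero biases
    rnn.bias_ih_l0.zero_()

# e.g. pixel-by-pixel MNIST: a batch of 784-step sequences with one pixel per step
x = torch.randn(32, 784, 1)
output, h_n = rnn(x)                                 # output: (32, 784, 128)
```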

Long Short Term Memory

  • Another architecture which makes it easy to remember information for long periods of time.
    • The network's activations are its short-term memory and its weights are its long-term memory.
    • The LSTM architecture forces the short-term memory to last for a long time period.
  • LSTM consists of memory cells which have controllers that decide when to store and when to forget.

Long Short Term Memory

LSTM: Math

  • LSTM mimics a computer memory.
    • Input Gate: controls memory writes
    • Forget Gate: controls memory reset
    • Output Gate: controls memory read
$$ \begin{eqnarray} \mathbf{i}_t &=& Sigmoid(\mathbf{W}_{\mathbf{xi}}\mathbf{x}_t + \mathbf{W}_{\mathbf{hi}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{i}) \nonumber \\ \mathbf{f}_t &=& Sigmoid(\mathbf{W}_{\mathbf{xf}}\mathbf{x}_t + \mathbf{W}_{\mathbf{hf}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{f}) \nonumber \\ \mathbf{o}_t &=& Sigmoid(\mathbf{W}_{\mathbf{xo}}\mathbf{x}_t + \mathbf{W}_{\mathbf{ho}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{o}) \nonumber \\ \mathbf{g}_t &=& Tanh(\mathbf{W}_{\mathbf{xg}}\mathbf{x}_t + \mathbf{W}_{\mathbf{hg}}\mathbf{h}_{t-1} + \mathbf{b}_\mathbf{g}) \nonumber \\ \mathbf{c}_t &=& \mathbf{f}_t\odot\mathbf{c}_{t-1} + \mathbf{i}_t\odot\mathbf{g}_t \nonumber \\ \mathbf{h}_t &=& \mathbf{o}_t\odot Tanh(\mathbf{c}_t) \nonumber \end{eqnarray} $$
  • where $\odot$ is the Hadamard product (element-wise product)
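A direct NumPy transcription of the equations above for a single time step; packing the weights and biases into dictionaries is an assumption made for readability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the equations above.

    W is a dict of weight matrices {'xi','hi','xf','hf','xo','ho','xg','hg'};
    b is a dict of bias vectors {'i','f','o','g'}.
    """
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])   # input gate: memory write
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])   # forget gate: memory reset
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])   # output gate: memory read
    g = np.tanh(W['xg'] @ x_t + W['hg'] @ h_prev + b['g'])   # candidate cell update
    c = f * c_prev + i * g                                   # element-wise (Hadamard) products
    h = o * np.tanh(c)
    return h, c
```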

Hadamard Product Backpropagation

$$\begin{eqnarray} \mathbf{a} &=& \mathbf{x} \odot \mathbf{y} \nonumber \\ \frac{\partial L}{\partial x_k} &=& \frac{\partial L}{\partial \mathbf{a}}\frac{\partial \mathbf{a}}{\partial x_k} \nonumber \\ \frac{\partial L}{\partial x_k} &=& \frac{\partial L}{\partial a_k}y_k \nonumber \\ \frac{\partial L}{\partial \mathbf{x}} &=& \frac{\partial L}{\partial \mathbf{a}} \odot \mathbf{y} \nonumber \end{eqnarray}$$
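The last line is easy to verify numerically, e.g. with PyTorch autograd (a quick sketch using a trivial sum loss):

```python
import torch

x = torch.randn(5, requires_grad=True)
y = torch.randn(5)
L = (x * y).sum()          # a = x * y element-wise, with a simple scalar loss L = sum(a)
L.backward()

# dL/da is all ones here, so dL/dx should equal 1 * y = y (element-wise)
print(torch.allclose(x.grad, y))   # True
```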

LSTM

  • Looks complicated? ML researchers thought so and LSTMs were barely used for about a decade after they were proposed.
  • In 2013 and 2014, researchers used them to get impressive results on challenging and important problems like speech recognition and machine translation.
  • Since then, they’ve been one of the most widely used RNN architectures.
  • There have been many attempts to simplify the architecture, but nothing was conclusively shown to be simpler and better.
  • You never have to think about the complexity, since frameworks like TensorFlow and PyTorch provide nice black box implementations.
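For instance, the entire LSTM above is a single module call in PyTorch:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=1, batch_first=True)
x = torch.randn(8, 20, 64)       # (batch, time, features)
output, (h_n, c_n) = lstm(x)     # output: (8, 20, 128); h_n, c_n: (1, 8, 128)
```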

Deep Recurrent Neural Networks