Modeling Sequences with Neural Networks


CSE 849: Deep Learning

Vishnu Boddeti

Overview

  • Sequence prediction:
    • Speech-to-text and text-to-speech
    • Caption generation for images
    • Machine translation
  • If input is also a sequence, the setting is called sequence-to-sequence prediction.
  • Markov models are memoryless, so they are not well suited to long-range dependencies.
  • $$p(w_i|w_1,\dots,w_{i-1})=p(w_i|w_{i-3},w_{i-2},w_{i-1})$$

Language Modeling As a Probability Model

  • Sequential Probabilistic Model: by the chain rule, $p(x_1,\dots,x_n)=\prod_{i=1}^{n} p(x_i \mid x_1,\dots,x_{i-1})$.
  • Is a Markov model good enough?
    • Need very long-range context.
    • How much context is sufficient?
    • Current-day models: contexts of up to 10 million tokens

Language Modeling: Unigram

  • Assumption: each word is generated independently
  • $p(x_1, x_2, \dots, x_n) = \prod_{i=1}^n q(x_i)$

Language Modeling: Bigram

  • Assumption: each word depends only on the previous word
  • $p(x_1, x_2, \dots, x_n) = \prod_{i=1}^n q(x_i|x_{i-1})$

Language Modeling: N-gram

  • Assumption: each word depends only on the previous $k-1$ words
  • $p(x_1, x_2, \dots, x_n) = \prod_{i=1}^n q(x_i|x_{i-1},x_{i-2},\dots,x_{i-(k-1)})$
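
The standard way to fit such models is to count n-grams in a corpus and normalize (maximum-likelihood estimation). The sketch below is an illustration of that idea for the bigram case; the function name, start/end markers, and toy corpus are assumptions, not part of the lecture.

```python
# Minimal sketch (not from the slides): maximum-likelihood estimation of
# bigram probabilities q(x_i | x_{i-1}) by counting adjacent word pairs.
from collections import Counter, defaultdict

def train_bigram(corpus):
    """corpus: list of tokenized sentences, e.g. [["the", "cat", "sat"], ...]"""
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]          # start/end markers
        for prev, curr in zip(tokens[:-1], tokens[1:]):
            unigram_counts[prev] += 1
            bigram_counts[prev][curr] += 1
    # q(curr | prev) = count(prev, curr) / count(prev)
    return {prev: {curr: c / unigram_counts[prev] for curr, c in nexts.items()}
            for prev, nexts in bigram_counts.items()}

q = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(q["the"])   # {'cat': 0.5, 'dog': 0.5}
print(q["<s>"])   # {'the': 1.0}
```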

Autoregressive Models vs RNN

  • Autoregressive models such as the neural language model are memoryless, so they can only use information from their immediate context (in this figure, context length = 1):
  • If we add connections between the hidden units, it becomes a recurrent neural network (RNN). Having a memory lets an RNN use longer-term dependencies:

Sequences

Sequential Processing of Non-Sequential Data

  • Classify images by taking a series of "glimpses".
  • Generate images one piece at a time.
  • Ba, Mnih, and Kavukcuoglu, "Multiple Object Recognition with Visual Attention", ICLR 2015.
  • Gregor et al, "DRAW: A Recurrent Neural Network For Image Generation", ICML 2015

Sequential Processing of Non-Sequential Data

  • Integrate with an oil-paint simulator: at each time step, output a new stroke.
  • Ganin et al., "Synthesizing Programs for Images using Reinforced Adversarial Learning", ICML 2018.

Recurrent Neural Networks

  • Sequence Modeling with generic Neural Networks (separate weights $\mathbf{W}^t$ at each time step, conditioning on the full history).
  • $$\begin{eqnarray} \mathbf{h}_t &=& f_{\mathbf{h}}\left(\mathbf{x}_t,(\mathbf{h}_{1},\dots,\mathbf{h}_{t-1}); \mathbf{W}^t_{\mathbf{h}}\right)\\ \mathbf{y}_t &=& f_{\mathbf{y}}((\mathbf{h}_{1},\dots,\mathbf{h}_{t}); \mathbf{W}^t_{\mathbf{y}}) \end{eqnarray}$$
  • Sequence Modeling with Recurrent Neural Networks (shared weights $\mathbf{W}$ across time steps).
  • $$\begin{eqnarray} \mathbf{h}_t &=& f_{\mathbf{h}}(\mathbf{x}_t,\mathbf{h}_{t-1}; \mathbf{W}_{\mathbf{h}})\\ \mathbf{y}_t &=& f_{\mathbf{y}}\left(\mathbf{h}_t; \mathbf{W}_{\mathbf{y}}\right) \end{eqnarray}$$

Recurrent Neural Networks

$$\begin{eqnarray} \mathbf{h}_t &=& \tanh(\mathbf{W}_{\mathbf{hh}}\mathbf{h}_{t-1} + \mathbf{W}_{\mathbf{xh}}\mathbf{x}_{t})\\ \mathbf{y}_t &=& \mathbf{W}_{\mathbf{hy}}\mathbf{h}_t \end{eqnarray}$$
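
The update above maps directly to a few lines of code. Below is a minimal NumPy sketch; the dimensions and random initialization are illustrative choices, not taken from the lecture.

```python
# Minimal NumPy sketch of the vanilla RNN update above:
#   h_t = tanh(W_hh h_{t-1} + W_xh x_t),  y_t = W_hy h_t
# Dimensions and initialization are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, y_dim = 3, 5, 2

W_hh = rng.normal(scale=0.1, size=(h_dim, h_dim))
W_xh = rng.normal(scale=0.1, size=(h_dim, x_dim))
W_hy = rng.normal(scale=0.1, size=(y_dim, h_dim))

def rnn_forward(xs):
    """xs: array of shape (T, x_dim). Returns hidden states and outputs."""
    h = np.zeros(h_dim)
    hs, ys = [], []
    for x_t in xs:                          # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
        ys.append(W_hy @ h)
    return np.array(hs), np.array(ys)

hs, ys = rnn_forward(rng.normal(size=(4, x_dim)))
print(hs.shape, ys.shape)   # (4, 5) (4, 2)
```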

RNN Example

  • Moving sum: the hidden unit keeps a running sum of the inputs seen so far.

One More RNN Example

  • Comparing the moving sums of the first and second inputs: which input has the larger running total so far?
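
A minimal sketch of both examples, assuming "moving sum" means the running sum of the inputs seen so far: a linear hidden unit with recurrent weight 1 and input weight 1 accumulates the inputs, and comparing two such sums gives the second example. The specific weights are an assumption, not read off the slides.

```python
# Sketch (assumption: "moving sum" = running sum of the inputs so far).
# A linear hidden unit with w_hh = 1 and w_xh = 1 accumulates the inputs;
# two such units plus a threshold output compare two input streams.
def moving_sum(xs):
    h = 0.0
    out = []
    for x in xs:
        h = 1.0 * h + 1.0 * x     # linear activation: h_t = h_{t-1} + x_t
        out.append(h)
    return out

def first_stream_larger(x1s, x2s):
    # Output 1 whenever the running sum of stream 1 exceeds that of stream 2.
    return [int(s1 > s2) for s1, s2 in zip(moving_sum(x1s), moving_sum(x2s))]

print(moving_sum([1, 0, 2, 1]))                   # [1.0, 1.0, 3.0, 4.0]
print(first_stream_larger([1, 0, 2], [0, 2, 0]))  # [1, 0, 1]
```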

RNN Example: Parity

  • Given a sequence of binary inputs, determine the parity, i.e., whether the number of 1's seen so far is odd or even.
  • Computing parity is a classic example of a problem that is hard for shallow feed-forward networks but easy for an RNN.
  • Incrementally compute parity by keeping track of current parity:
  • Input:        0 1 0 1 1 0 1 0 1 1
    Parity bits:  0 1 1 0 1 1 $\rightarrow$
  • Each parity bit is the XOR of the current bit and the previous parity.

RNN Example: Parity Continued...

  • Find weights and biases for an RNN that computes parity. Assume binary threshold units.
  • Solution:
    • Output unit tracks current parity bit.
    • Hidden units compute the XOR.

RNN Example: Parity Continued...

  • Output unit: computes the XOR of the previous output and the current input bit.
    $y^{(t-1)}$   $x^{(t)}$   $y^{(t)}$
        0             0           0
        0             1           1
        1             0           1
        1             1           0

RNN Example: Parity Continued...

  • Design hidden units to compute XOR
    • Have one unit compute AND and one unit compute OR.
    • Pick weights and biases for these computations.


    $y^{(t-1)}$   $x^{(t)}$   $h_1^{(t)}$   $h_2^{(t)}$   $y^{(t)}$
        0             0            0             0            0
        0             1            0             1            1
        1             0            0             1            1
        1             1            1             1            0

RNN Example: Parity Continued...

  • What about the first time step?
  • The network should behave as if the previous output $y^{(0)}$ was 0.


    $y^{(0)}$   $x^{(1)}$   $h_1^{(1)}$   $h_2^{(1)}$   $y^{(1)}$
        0           0            0             0            0
        0           1            0             1            1
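
Putting the pieces together, here is one possible set of weights and biases for binary threshold units; the specific numbers are an illustrative choice, not necessarily the solution intended in the lecture. $h_1$ computes AND, $h_2$ computes OR, and the output fires when $h_2$ is on but $h_1$ is off, which is exactly XOR; initializing $y^{(0)}=0$ handles the first time step.

```python
# Sketch of one possible weight/bias choice for the parity RNN with binary
# threshold units (these numbers are an assumption, not necessarily the
# slides' solution). h1 = AND(y_prev, x), h2 = OR(y_prev, x), and the
# output fires when h2 is on but h1 is off, i.e. y = XOR(y_prev, x).
step = lambda z: int(z > 0)          # binary threshold unit

def parity_rnn(bits):
    y_prev = 0                       # first time step: behave as if the previous output was 0
    outputs = []
    for x in bits:
        h1 = step(1 * y_prev + 1 * x - 1.5)        # AND
        h2 = step(1 * y_prev + 1 * x - 0.5)        # OR
        y = step(-2 * h1 + 1 * h2 - 0.5)           # OR and not AND -> XOR
        outputs.append(y)
        y_prev = y
    return outputs

print(parity_rnn([0, 1, 0, 1, 1, 0, 1, 0, 1, 1]))
# [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]  -- matches the parity bits above
```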

Language Modeling

  • Learning a language model with RNNs:
    • Represent each word as an indicator (one-hot) vector.
    • The model predicts a probability distribution over the next word.
    • Train with the cross-entropy loss.
  • This model can learn long-range dependencies.

Language Modeling Example

  • Given the characters at time steps $\{1, 2,\dots,t\}$, the model predicts the character at time step $t+1$
  • Training Sequence: "hello"
  • Vocabulary: [h,e,l,o]
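
A small sketch of this setup in PyTorch; the framework choice, hidden size, optimizer, and number of steps are assumptions, not from the lecture. One-hot characters go in, a distribution over [h, e, l, o] comes out, and the model is trained with cross-entropy to predict the next character at every step.

```python
# Small PyTorch sketch (framework choice is an assumption) of the "hello"
# example: one-hot characters in, a softmax over [h, e, l, o] out, trained
# with cross-entropy to predict the next character at every time step.
import torch
import torch.nn as nn

vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

text = "hello"
inputs = torch.tensor([char_to_idx[c] for c in text[:-1]])   # h, e, l, l
targets = torch.tensor([char_to_idx[c] for c in text[1:]])   # e, l, l, o
x = nn.functional.one_hot(inputs, num_classes=4).float().unsqueeze(1)  # (T, 1, 4)

rnn = nn.RNN(input_size=4, hidden_size=8)        # tanh RNN, as in the update above
head = nn.Linear(8, 4)                           # hidden state -> logits over vocab
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(300):
    hs, _ = rnn(x)                               # (T, 1, 8)
    logits = head(hs.squeeze(1))                 # (T, 4)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

pred = logits.argmax(dim=1)
print([vocab[i] for i in pred])                  # expect ['e', 'l', 'l', 'o']
```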

RNN Batching

  • Example: batching the character sequence abcdefghijklmnopqrstuvwxyz
  • The sequence is cut into four chunks of length 6 (columns a-f, g-l, m-r, s-x; the leftover y and z are dropped), and each row below is one time step processed in parallel across the batch of chunks:
    a g m s
    b h n t
    c i o u
    d j p v
    e k q w
    f l r x
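
A short sketch of this batching scheme; the function name `batchify` and the use of NumPy are illustrative assumptions. The sequence is cut into equal-length chunks, leftover characters are dropped, and the chunks are stacked so that each row is one time step processed in parallel.

```python
# Sketch of the batching scheme illustrated above: cut one long sequence into
# equal-length chunks and stack them so each row is one time step processed in
# parallel (leftover characters at the end, here 'y' and 'z', are dropped).
import numpy as np

def batchify(sequence, batch_size):
    chunk_len = len(sequence) // batch_size
    trimmed = sequence[:chunk_len * batch_size]
    # One chunk per column; row t holds the t-th character of every chunk.
    return np.array(list(trimmed)).reshape(batch_size, chunk_len).T

batch = batchify("abcdefghijklmnopqrstuvwxyz", batch_size=4)
print(batch)
# [['a' 'g' 'm' 's']
#  ['b' 'h' 'n' 't']
#  ...
#  ['f' 'l' 'r' 'x']]
```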

Language Modeling Continued...

  • At inference time, the model's output at each step is fed back as its input at the next step.
  • Teacher forcing: at training time, the inputs come from the training sequence rather than from the network's own output.
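
The two regimes differ only in where the next input comes from. The sketch below uses untrained placeholder layers just to show the data flow; the layer sizes, names, and use of PyTorch are assumptions.

```python
# Sketch contrasting teacher forcing with free-running generation for a
# character RNN (layers are untrained placeholders; names are illustrative).
import torch
import torch.nn as nn

vocab = ['h', 'e', 'l', 'o']
rnn_cell = nn.RNNCell(input_size=4, hidden_size=8)
head = nn.Linear(8, 4)
one_hot = lambda i: nn.functional.one_hot(torch.tensor([i]), 4).float()

def training_step_logits(target_ids):
    # Teacher forcing: at every step the *ground-truth* character is fed in,
    # regardless of what the model predicted at the previous step.
    h = torch.zeros(1, 8)
    logits = []
    for t in target_ids[:-1]:
        h = rnn_cell(one_hot(t), h)
        logits.append(head(h))
    return torch.cat(logits)          # compared against target_ids[1:] with cross-entropy

def generate(start_id, steps):
    # Inference: the model's own prediction is fed back as the next input.
    h = torch.zeros(1, 8)
    ids = [start_id]
    for _ in range(steps):
        h = rnn_cell(one_hot(ids[-1]), h)
        ids.append(head(h).argmax(dim=1).item())
    return ''.join(vocab[i] for i in ids)

print(training_step_logits([0, 1, 2, 2, 3]).shape)   # torch.Size([4, 4])
print(generate(start_id=0, steps=4))                 # 'h' + 4 predicted characters
```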

Language Modeling Continued...

  • Some challenges remain:
    • Vocabularies can be very large once you include people, places, etc. It’s computationally difficult to predict distributions over millions of words.
    • How do we deal with words we haven’t seen before?
    • In some languages (e.g. German), it’s hard to define what should be considered a word.

Language Modeling Continued...

  • Another approach is to model text one character at a time.
  • This solves the problem of what to do about previously unseen words.
  • Note that long-term memory is essential at the character level.

Neural Machine Translation

  • Translate English to French, given pairs of translated sentences.
  • What is wrong with the following setup?
  • Sentences may not be the same length, and the words might not align perfectly.
  • Need to resolve ambiguities using information from later in the sentence.

Neural Machine Translation

  • Encoder-Decoder Architecture:
    • The encoder network first reads and memorizes the source sentence.
    • The decoder starts outputting the translation when it sees the end-of-sentence token.
    • Encoder and decoder are two different networks with different weights.
    • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", EMNLP 2014.
    • I. Sutskever, O. Vinyals, and Q. Le, "Sequence to Sequence Learning with Neural Networks", NIPS 2014.

Sequence to Sequence

  • Many to One: encode the input sequence into a single vector.
  • One to Many: produce the output sequence from a single input vector.
  • Sequence to Sequence = (Many to One) + (One to Many)
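
A minimal sketch of this composition, simplified for illustration only; the cited papers use gated RNNs (LSTM/GRU) and many additional details, and all sizes and names below are assumptions. A many-to-one encoder compresses the source sentence into a single vector, and a one-to-many decoder unrolls from that vector until it emits the end token.

```python
# Minimal sketch of (Many to One) + (One to Many): an encoder RNN compresses
# the source sequence into a single vector, and a decoder RNN unrolls from
# that vector until it emits an end token. Simplified illustration only.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, HIDDEN, EOS = 100, 120, 32, 0

embed_src = nn.Embedding(SRC_VOCAB, HIDDEN)
embed_tgt = nn.Embedding(TGT_VOCAB, HIDDEN)
encoder = nn.RNN(HIDDEN, HIDDEN)          # many to one: keep only the final state
decoder = nn.RNNCell(HIDDEN, HIDDEN)      # one to many: unroll from that state
out = nn.Linear(HIDDEN, TGT_VOCAB)

def translate(src_ids, max_len=20):
    src = embed_src(torch.tensor(src_ids)).unsqueeze(1)    # (T_src, 1, HIDDEN)
    _, h_final = encoder(src)                              # (1, 1, HIDDEN)
    h = h_final.squeeze(0)                                 # the single "thought vector"
    token = torch.tensor([EOS])                            # decoding starts at the end token
    result = []
    for _ in range(max_len):
        h = decoder(embed_tgt(token), h)
        token = out(h).argmax(dim=1)
        if token.item() == EOS:
            break
        result.append(token.item())
    return result

print(translate([5, 17, 42, 3]))   # untrained, so the output tokens are arbitrary
```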