Autoregressive models such as the neural language model are memoryless, so they can only use information from their immediate context (here, a context length of 1).
If we add connections between the hidden units, the model becomes a recurrent neural network (RNN). Having a memory lets an RNN use longer-term dependencies.
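As a schematic illustration, here is a minimal numpy sketch of one step of a vanilla RNN, assuming tanh hidden units; the weight names and sizes are illustrative, not taken from the figure.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: the hidden state carries information forward."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # hidden-to-hidden connection = memory
    y_t = W_hy @ h_t + b_y                            # output depends on the whole history via h_t
    return h_t, y_t

# Tiny usage with random (illustrative) weights:
H, D = 4, 3
rng = np.random.default_rng(0)
h, y = rnn_step(rng.normal(size=D), np.zeros(H),
                rng.normal(size=(H, D)), rng.normal(size=(H, H)),
                rng.normal(size=(1, H)), np.zeros(H), np.zeros(1))

# A memoryless model with context length 1 would instead compute y_t from x_t alone,
# with no h_prev carrying information from earlier time steps.
```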
Sequences
Sequential Processing of Non-Sequential Data
Classify images by taking a series of "glimpses".
Generate images one piece at a time.
Ba, Mnih, and Kavukcuoglu, "Multiple Object Recognition with Visual Attention", ICLR 2015.
Gregor et al., "DRAW: A Recurrent Neural Network For Image Generation", ICML 2015.
Sequential Processing of Non-Sequential Data
Integrate with an oil paint simulator: at each time step, output a new stroke.
Ganin et al., "Synthesizing Programs for Images using Reinforced Adversarial Learning", ICML 2018.
Given a sequence of binary inputs, determine the parity, i.e., whether the number of 1's is odd or even.
Computing parity is a classic example of a problem that is hard for shallow feed-forward networks, but easy for an RNN.
Incrementally compute parity by keeping track of current parity:
Input:        0 1 0 1 1 0 1 0 1 1
Parity bits:  0 1 1 0 1 1 $\rightarrow$
Each parity bit is the XOR of the current bit and the previous parity.
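The incremental rule above can be sketched in a few lines of plain Python (no RNN yet), just to make the target computation explicit; the running-parity variable is our own naming.

```python
def parity_bits(bits):
    """Return the running parity of a binary sequence (1 if the count of 1's so far is odd)."""
    parity = 0
    out = []
    for b in bits:
        parity = parity ^ b   # XOR of the current bit and the previous parity
        out.append(parity)
    return out

print(parity_bits([0, 1, 0, 1, 1, 0, 1, 0, 1, 1]))  # [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
```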
RNN Example: Parity Continued...
Find weights and biases for an RNN that computes parity. Assume binary threshold units.
Solution:
The output unit tracks the current parity bit.
The hidden units help compute the XOR.
RNN Example: Parity Continued...
Output Unit: Compute XOR between previous output and current input bit.
$y^{(t-1)}$  $x^{(t)}$  $y^{(t)}$
0            0          0
0            1          1
1            0          1
1            1          0
RNN Example: Parity Continued...
Design hidden units to compute XOR
Have one unit compute AND and one unit compute OR.
Pick weights and biases for these computations.
$y^{(t-1)}$  $x^{(t)}$  $h_1^{(t)}$  $h_2^{(t)}$  $y^{(t)}$
0            0          0            0            0
0            1          0            1            1
1            0          0            1            1
1            1          1            1            0
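One possible set of weights and biases realizing this table with binary threshold units (a unit outputs 1 exactly when its total input is positive). These particular numbers are an illustrative choice, not the only solution.

```python
def threshold(z):
    """Binary threshold unit: outputs 1 if its total input is positive, else 0."""
    return 1 if z > 0 else 0

def parity_step(y_prev, x):
    """One time step of the parity RNN; the weights/biases are one illustrative solution."""
    h1 = threshold(y_prev + x - 1.5)        # AND(y_prev, x)
    h2 = threshold(y_prev + x - 0.5)        # OR(y_prev, x)
    y  = threshold(-2 * h1 + h2 - 0.5)      # h2 AND NOT h1  ==  XOR(y_prev, x)
    return h1, h2, y

# Reproduce the truth table above:
for y_prev in (0, 1):
    for x in (0, 1):
        print(y_prev, x, *parity_step(y_prev, x))
```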
RNN Example: Parity Continued...
What about the first time step?
The network should behave as if the previous output $y^{(0)}$ were 0.
$y^{(0)}$  $x^{(1)}$  $h_1^{(1)}$  $h_2^{(1)}$  $y^{(1)}$
0          0          0            0            0
0          1          0            1            1
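Putting the pieces together, a short sketch that unrolls the same step over the example sequence, initializing the recurrent input as $y^{(0)} = 0$; again, the weights are one illustrative choice.

```python
def threshold(z):
    return 1 if z > 0 else 0

def run_parity_rnn(bits):
    """Unroll the parity RNN over a sequence, initializing y^(0) = 0."""
    y = 0                                    # behaves as if the previous parity were 0
    outputs = []
    for x in bits:
        h1 = threshold(y + x - 1.5)          # AND
        h2 = threshold(y + x - 0.5)          # OR
        y  = threshold(-2 * h1 + h2 - 0.5)   # XOR(previous y, x)
        outputs.append(y)
    return outputs

print(run_parity_rnn([0, 1, 0, 1, 1, 0, 1, 0, 1, 1]))  # [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
```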
Language Modeling
Learning a language model with RNNs
represent each word as an indicator (one-hot) vector
the model predicts a probability distribution over the next word
train with the cross-entropy loss (see the sketch below)
This model can learn long-range dependencies.
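As a rough illustration of this setup, here is a minimal numpy sketch of a single training step: a toy vocabulary, an indicator-vector input, a softmax output distribution, and the cross-entropy loss on the observed next word. The vocabulary, sizes, and weight names are made up for the example.

```python
import numpy as np

vocab = ["the", "cat", "sat", "<eos>"]       # toy vocabulary (illustrative)
V, H = len(vocab), 8                         # vocabulary size, hidden size

rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.1, (H, V))
W_hh = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (V, H))

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0                               # indicator vector for word i
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One step: given "the", predict a distribution over the next word and score "cat".
h = np.zeros(H)
x = one_hot(vocab.index("the"), V)
h = np.tanh(W_xh @ x + W_hh @ h)
p = softmax(W_hy @ h)                        # predicted distribution over the vocabulary
loss = -np.log(p[vocab.index("cat")])        # cross-entropy with the observed next word
print(loss)
```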
Language Modeling Example
Given the characters at time steps $\{1, 2, \dots, t\}$, the model predicts the character at time step $t+1$.
Training Sequence: "hello"
Vocabulary: [h,e,l,o]
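Setting up the data for this example: each character becomes an index into the vocabulary (or equivalently an indicator vector), and the target at step $t$ is the character at step $t+1$. A small sketch:

```python
vocab = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

seq = "hello"
inputs  = [char_to_idx[c] for c in seq[:-1]]   # h, e, l, l  -> [0, 1, 2, 2]
targets = [char_to_idx[c] for c in seq[1:]]    # e, l, l, o  -> [1, 2, 2, 3]
print(inputs, targets)
```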
RNN Batching
The training sequence abcdefghijklmnopqrstuvwxyz is split into chunks of length 6 (the leftover "yz" is dropped); each column below is one chunk, and each row is one time step of the batch:
a g m s
b h n t
c i o u
d j p v
e k q w
f l r x
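A sketch of the split shown above: cut one long training sequence into equal-length chunks and stack them so that each row is one time step processed in parallel across the batch. The chunk length of 6 and the choice to drop the leftover characters are read off the example, not prescribed.

```python
import string

seq = string.ascii_lowercase            # "abcdefghijklmnopqrstuvwxyz"
chunk_len = 6
n_chunks = len(seq) // chunk_len        # 4 chunks; "yz" is dropped

chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
# chunks == ['abcdef', 'ghijkl', 'mnopqr', 'stuvwx']

# Transpose so that row t holds the t-th character of every chunk:
for t in range(chunk_len):
    print(" ".join(chunk[t] for chunk in chunks))
# a g m s
# b h n t
# ...
```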
Language Modeling Continued...
At inference time, the output of the model is fed back in as the next input.
Teacher forcing: at training time, the inputs come from the training set rather than from the network's own output.
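To make the contrast concrete, here is a toy sketch of the two loops with a small random-weight character RNN (the weights are untrained, so the generated text is meaningless; the point is only the data flow). The vocabulary and sizes are illustrative.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
V, H = len(vocab), 8
rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.3, (H, V))
W_hh = rng.normal(0, 0.3, (H, H))
W_hy = rng.normal(0, 0.3, (V, H))

def step(idx, h):
    """One RNN step: take a character index, return next-char distribution and new state."""
    x = np.eye(V)[idx]
    h = np.tanh(W_xh @ x + W_hh @ h)
    z = W_hy @ h
    p = np.exp(z - z.max())
    return p / p.sum(), h

# Teacher forcing (training): inputs come from the training sequence itself.
h, loss = np.zeros(H), 0.0
for inp, tgt in zip([0, 1, 2, 2], [1, 2, 2, 3]):   # inputs "hell", targets "ello"
    p, h = step(inp, h)
    loss += -np.log(p[tgt])                         # score the true next character

# Free-running (inference): the model's own output is fed back in as the next input.
h, idx, generated = np.zeros(H), 0, ["h"]
for _ in range(4):
    p, h = step(idx, h)
    idx = int(np.argmax(p))                         # or sample from p
    generated.append(vocab[idx])
print(loss, "".join(generated))
```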
Language Modeling Continued...
Some challenges remain:
Vocabularies can be very large once you include people, places, etc. It’s computationally difficult to predict distributions over millions of words.
How do we deal with words we haven’t seen before?
In some languages (e.g. German), it’s hard to define what should be considered a word.
Language Modeling Continued...
Another approach is to model text one character at a time.
This solves the problem of what to do about previously unseen words.
Note that long-term memory becomes even more important at the character level, since each dependency now spans many more time steps.
Neural Machine Translation
Translate English to French, given pairs of translated sentences.
What is wrong with the following setup?
Sentences may not be the same length, and the words might not align perfectly.
Need to resolve ambiguities using information from later in the sentence.
Neural Machine Translation
Encoder-Decoder Architecture:
the network first reads and memorizes the sentence
it starts outputting the translation when it sees the end token
The encoder and decoder are two different networks with different weights (sketched below).
Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", EMNLP 2014.
Sutskever, Vinyals, and Le, "Sequence to Sequence Learning with Neural Networks", NIPS 2014.
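A minimal sketch of the encoder-decoder idea with two separately parameterized vanilla RNNs (numpy, random untrained weights, illustrative sizes and token ids): the encoder reads the whole source sentence into its final hidden state, which initializes the decoder; the decoder then emits tokens one at a time until an assumed end-of-sequence token.

```python
import numpy as np

rng = np.random.default_rng(0)
V_src, V_tgt, H = 10, 12, 16                         # illustrative vocabulary / hidden sizes
EOS = 0                                              # assumed end-of-sequence token id

# Encoder and decoder have their own, separate weights.
enc_Wx = rng.normal(0, 0.1, (H, V_src))
enc_Wh = rng.normal(0, 0.1, (H, H))
dec_Wx = rng.normal(0, 0.1, (H, V_tgt))
dec_Wh = rng.normal(0, 0.1, (H, H))
dec_Wy = rng.normal(0, 0.1, (V_tgt, H))

def encode(src_ids):
    """Read the whole source sentence; the final hidden state summarizes it."""
    h = np.zeros(H)
    for i in src_ids:
        h = np.tanh(enc_Wx @ np.eye(V_src)[i] + enc_Wh @ h)
    return h

def decode(h, max_len=10):
    """Generate target tokens one at a time, feeding each output back in, until EOS."""
    out, idx = [], EOS                               # EOS doubles as a start token here (assumption)
    for _ in range(max_len):
        h = np.tanh(dec_Wx @ np.eye(V_tgt)[idx] + dec_Wh @ h)
        idx = int(np.argmax(dec_Wy @ h))             # greedy choice; could sample instead
        if idx == EOS:
            break
        out.append(idx)
    return out

print(decode(encode([3, 1, 4, 1, 5, EOS])))          # untrained weights: output is arbitrary
```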
Sequence to Sequence
Many to One: Encode the input sequence into a single vector.
One to Many: Produce an output sequence from a single input vector.