Attention

CSE 891: Deep Learning

Vishnu Boddeti

Monday October 19, 2020

Today

Attention

The Problem with Long Sequences

Context vector $\mathbf{c}$ is often just $\mathbf{h}_T$
Input sequence is bottlenecked through fixed-sized vector. What if $T=1000$?

Image Captioning with CNN and RNN

Mao et al, "Explain Images with Multimodal Recurrent Neural Networks", NeurIPS 2014 Deep Learning and Representation Workshop
Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015
Vinyals et al, "Show and Tell: A Neural Image Caption Generator", CVPR 2015
Donahue et al, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", CVPR 2015
Chen and Zitnick, "Learning a Recurrent Visual Representation for Image Caption Generation", CVPR 2015

Image Captioning with CNN and RNN

How does it do?

What do we want?

Human Vision

Human Saccades

Attention in Animals

Resource Saving

Only need sensors where relevant bits are (e.g. fovea vs. peripheral vision)
Only compute relevant bits of information (e.g. fovea has many more ‘pixels’ than periphery)

Variable State Manipulation

Manipulate environment (for all grains do: eat)
Learn modular subroutines (not state)

Regression

Watson Nadaraya Estimator (1964)

Data $\{x_1,\dots,x_m\}$ and labels $\{y_1,\dots,y_m\}$.
Estimate label $y$ at a new location $x$.
The world's dumbest estimator: average over all labels

$$\begin{equation} y = \frac{1}{m}\sum_{i=1}^m y_i \end{equation}$$

Better Idea (Watson, Nadaraya, 1964)

Weigh the labels according to location

$$\begin{equation} y = \sum_{i=1^m} \alpha(x, x_i)y_i \end{equation}$$

$\alpha(x,x_i)y_i$

$x$: query
$x_i$: key
$y_i$: value

Weighing The Locations

Unnormalized Weights

Normalized Weights

Watson Nadaraya Estimator

Consistency: Given enough data this algorithm converges to the optimal solution.
Simplicity: No free parameters - information is in the data not weights (or very few if we try to learn the weighting function)
Deep Learning Variant:

Learn weighting function.
Replace averaging (pooling) by weighted averaging.

General Attention Layer

$$\begin{eqnarray} &&\mathbf{z} = \sum_{i=1}^n \alpha(\mathbf{c},\mathbf{y}_i)\mathbf{y}_i \\ &&\alpha(\mathbf{c},\mathbf{y}_i) \geq 0 \\ &&\sum_{i=1}^n\alpha(\mathbf{c},\mathbf{y}_i) = 1 \end{eqnarray}$$

Key-Value Store in associative memory (dictionary/has table/database).

storing (saving)
retreiving (querying)
managing

Keys, Queries and Values

$\mathbf{c}$: query
$\alpha(\mathbf{c},\mathbf{y}_i)$: key
$\mathbf{y}$: values

Seq2Seq with RNNs and Attention

English: English to French translation
Input:"This agreement on the European Economic Area was signed in August 1992."
Output:L'ccord sur la zone économique européenne a été signé en août 1992.

Visualize Attention Weghts $a_{t,i}$

Seq2Seq with RNNs and Attention

The decoder doesn't use the fact that $\mathbf{h}_i$ form an ordered sequence – it just treats them as an unordered set $\{\mathbf{h}_i\}$
Can use similar architecture given any set of input hidden vectors $\{\mathbf{h}_i\}$?
Attention: Deep Learning on Sets

Image Captioning with Attention

Use a CNN to compute a grid of features for an image.
Each timestep of decoder uses a different context vector that looks at different parts of the input image

Image Captioning with Attention

Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

X, Attend, and Y

"Show, attend, and tell" (Xu et al, ICML 2015)

Look at image, attend to image regions, produce question

"Ask, attend, and answer" (Xu and Saenko, ECCV 2016)
"Show, ask, attend, and answer" (Kazemi and Elqursh, 2017)

Read text of question, attend to image regions, produce answer

"Listen, attend, and spell" (Chan et al, ICASSP 2016)

Process raw audio, aEend to audio regions while producing text

"Listen, attend, and walk" (Mei et al, AAAI 2016)

Process text, attend to text regions, output navigation commands

"Listen, attend, and walk" (Mei et al, AAAI 2016)

Process text, attend to text regions, output navigation commands

"Show, attend, and interact" (Qureshi et al, ICRA 2017)

Process image, attend to image regions, output robot control commands

"Show, attend, and read" (Li et al, AAAI 2019)

Process image, attend to image regions, output text

Attention Layer

Inputs:

Query Vector: $\mathbf{q} \in \mathbb{R}^{d}$
Input Vector: $\mathbf{X} \in \mathbb{R}^{n \times d}$
Similarity Function: $f_{att}(\cdot)$

Computation:

Similarities: $e_i=f_{att}(\mathbf{q},\mathbf{x}_i)$, $\mathbf{e}\in\mathbb{R}^{n}$
Attention Weights: $\mathbf{a}=softmax(\mathbf{e})$, $\mathbf{a}\in\mathbb{R}^{n}$
Output Vector: $\mathbf{y}=\sum_{i=1}^n a_i\mathbf{x}_i$, $\mathbf{y}\in\mathbb{R}^{d}$

Attention Layer Variations

Similarity Functions:

Dot Product: $f_{att}(\mathbf{q},\mathbf{x}_i) = \mathbf{q}^T\mathbf{x}_i$
Scaled Dot Product: $f_{att}(\mathbf{q},\mathbf{x}_i) = \frac{\mathbf{q}^T\mathbf{x}_i}{\sqrt{d}}$
Multiple Queries Product: $f_{att}(\mathbf{Q},\mathbf{X}) = \mathbf{Q}^T\mathbf{X}$

Separate Key and Value

$\mathbf{K} = \mathbf{W}_k\mathbf{X}$
$\mathbf{V} = \mathbf{W}_v\mathbf{X}$

Common Attention Layer

Inputs:

Query Vector: $\mathbf{q} \in \mathbb{R}^{d}$
Input Vector: $\mathbf{X} \in \mathbb{R}^{n \times d}$
Similarity Function: $f_{att}(\cdot)$

Computation:

Keys: $\mathbf{K}=\mathbf{W}_k\mathbf{X}$
Similarities: $\mathbf{E}=f_{att}(\mathbf{K},\mathbf{Q})$, $\mathbf{e}\in\mathbb{R}^{n\times n}$
Attention Weights: $\mathbf{A}=softmax(\mathbf{E})$, $\mathbf{A}\in\mathbb{R}^{n \times n}$
Values: $\mathbf{V}=\mathbf{W}_v\mathbf{X}$
Output Vector: $\mathbf{y}_j=\sum_{i=1}^n A_{ij}\mathbf{v}_i$, $\mathbf{y}\in\mathbb{R}^{d}$

Hard Attention Layer

Inputs:

Query Vector: $\mathbf{q} \in \mathbb{R}^{d}$
Input Vector: $\mathbf{X} \in \mathbb{R}^{n \times d}$
Similarity Function: $f_{att}(\mathbf{q},\mathbf{x}_i) = \frac{\mathbf{q}^T\mathbf{x}_i}{\sqrt{d}}$

Computation:

Similarities: $e_i=f_{att}(\mathbf{q},\mathbf{x}_i)$, $\mathbf{e}\in\mathbb{R}^{n}$
Attention Weights: $\mathbf{a}=softmax(\mathbf{e})$, $\mathbf{a}\in\mathbb{R}^{n}$
Output Vector: $\mathbf{y}=sample(\mathbf{x};\mathbf{a})$, $\mathbf{y}\in\mathbb{R}^{d}$

Self-Attention Layer

One query per input
Inputs:

Query Vector: $\mathbf{X} \in \mathbb{R}^{n\times d}$
Input Vector: $\mathbf{X} \in \mathbb{R}^{n \times d}$
Similarity Function: $f_{att}(\mathbf{q},\mathbf{x}_i) = \frac{\mathbf{q}^T\mathbf{x}_i}{\sqrt{d}}$

Computation:

Keys and Queries: $\mathbf{K}=\mathbf{W}_k\mathbf{X}$, $\mathbf{Q}=\mathbf{W}_q\mathbf{X}$
Similarities: $\mathbf{E}=f_{att}(\mathbf{K},\mathbf{Q})$, $\mathbf{e}\in\mathbb{R}^{n\times n}$
Attention Weights: $\mathbf{A}=softmax(\mathbf{E})$, $\mathbf{A}\in\mathbb{R}^{n \times n}$
Values: $\mathbf{V}=\mathbf{W}_v\mathbf{X}$
Output Vector: $\mathbf{y}_j=\sum_{i=1}^n A_{ij}\mathbf{v}_i$, $\mathbf{y}\in\mathbb{R}^{d}$

Self-Attention Layer

Consider permuting the vectors
Queries and Keys will be the same, but permuted
Similarities will be the same, but permuted
Attention will be the same, but permuted
Values will be the same, but permuted
Outputs will be the same, but permuted
Self-Attention layer is Permutation Equivariant $f(s(x)) = s(f(x))$
Self-Attention layer works on sets of vectors

Masked-Attention Layer

One query per input
Inputs:

Query Vector: $\mathbf{X} \in \mathbb{R}^{n\times d}$
Input Vector: $\mathbf{X} \in \mathbb{R}^{n \times d}$
Similarity Function: $f_{att}(\mathbf{q},\mathbf{x}_i) = \frac{\mathbf{q}^T\mathbf{x}_i}{\sqrt{d}}$

Computation:

Keys and Queries: $\mathbf{K}=\mathbf{W}_k\mathbf{X}$, $\mathbf{Q}=\mathbf{W}_q\mathbf{X}$
Similarities: $\mathbf{E}=f_{att}(\mathbf{K},\mathbf{Q})$, $\mathbf{e}\in\mathbb{R}^{n\times n}$

Set future similarity to $-\infty$

Attention Weights: $\mathbf{A}=softmax(\mathbf{E})$, $\mathbf{A}\in\mathbb{R}^{n \times n}$
Values: $\mathbf{V}=\mathbf{W}_v\mathbf{X}$
Output Vector: $\mathbf{y}_j=\sum_{i=1}^n A_{ij}\mathbf{v}_i$, $\mathbf{y}\in\mathbb{R}^{d}$

Example: CNN with Self-Attention

Processing Sequences

RNN

Works on Ordered Sequences

Good at long sequences: After one RNN layer, $h_T$ "see" the whole sequence

Not parallelizable: need to compute hidden states sequentially

CNN

Works on Multidimensional Grids

Bad at long sequences: Need to stack many conv layers for outputs to "see" the whole sequence

Highly parallel: Each output can be computed in parallel

Self-Attention

Works on Sets of Vectors

Good at long sequences: after one self-attention layer, each output "sees" all inputs

Highly parallel: Each output can be computed in parallel

Very memory intensive

Summary

Adding "attention" to RNN models lets them look at different parts of the input at each timestep
Generalized Self-Attention is new, powerful neural network primitive
Lot more in next class