Attention


CSE 891: Deep Learning

Vishnu Boddeti

Monday October 18, 2021

Today

  • Attention

The Problem with Long Sequences

  • Context vector $\mathbf{c}$ is often just $\mathbf{h}_T$
  • The entire input sequence is bottlenecked through a single fixed-size vector. What if $T=1000$?

Image Captioning with CNN and RNN

  • Mao et al, "Explain Images with Multimodal Recurrent Neural Networks", NeurIPS 2014 Deep Learning and Representation Learning Workshop
  • Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", CVPR 2015
  • Vinyals et al, "Show and Tell: A Neural Image Caption Generator", CVPR 2015
  • Donahue et al, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", CVPR 2015
  • Chen and Zitnick, "Learning a Recurrent Visual Representation for Image Caption Generation", CVPR 2015

Image Captioning with CNN and RNN

How does it do?

What do we want?

Human Vision




Human Saccades

Attention in Animals

  • Resource Saving
    • Only need sensors where relevant bits are (e.g. fovea vs. peripheral vision)
    • Only compute relevant bits of information (e.g. fovea has many more ‘pixels’ than periphery)
  • Variable State Manipulation
    • Manipulate environment (for all grains do: eat)
    • Learn modular subroutines (not state)

Regression

Watson Nadaraya Estimator (1964)

  • Data $\{x_1,\dots,x_m\}$ and labels $\{y_1,\dots,y_m\}$.
  • Estimate label $y$ at a new location $x$.
  • The world's dumbest estimator: average over all labels
  • $$\begin{equation} y = \frac{1}{m}\sum_{i=1}^m y_i \end{equation}$$
  • Better Idea (Watson, Nadaraya, 1964)
    • Weigh the labels according to location
    $$\begin{equation} y = \sum_{i=1}^m \alpha(x, x_i)y_i \end{equation}$$

Weighing The Locations

Unnormalized Weights

Watson Nadaraya Estimator

  • Consistency: Given enough data, this estimator converges to the optimal solution.
  • Simplicity: No free parameters; the information is in the data, not in the weights (or very few parameters if we learn the weighting function)
  • Deep Learning Variant:
    • Learn the weighting function.
    • Replace averaging (pooling) with weighted averaging (see the sketch below).
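A minimal NumPy sketch of the estimator, assuming Gaussian kernel weights normalized to sum to one (the bandwidth, the toy data, and the function being fit are illustrative choices, not from the slides):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, sigma=0.5):
    """Kernel-weighted average of the training labels at x_query."""
    # Unnormalized Gaussian kernel weights k(x, x_i).
    k = np.exp(-0.5 * ((x_query - x_train) / sigma) ** 2)
    # Normalize so the weights alpha(x, x_i) are nonnegative and sum to 1.
    alpha = k / k.sum()
    # Weighted average: y = sum_i alpha(x, x_i) * y_i.
    return np.dot(alpha, y_train)

# Toy 1-D regression problem (illustrative data).
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 5, size=40))
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(40)

print(nadaraya_watson(2.0, x_train, y_train))   # close to sin(2.0)
```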

General Attention Layer

$$\begin{eqnarray} &&\mathbf{z} = \sum_{i=1}^n \alpha(\mathbf{c},\mathbf{y}_i)\mathbf{y}_i \\ &&\alpha(\mathbf{c},\mathbf{y}_i) \geq 0 \\ &&\sum_{i=1}^n\alpha(\mathbf{c},\mathbf{y}_i) = 1 \end{eqnarray}$$

Seq2Seq with RNNs and Attention

Seq2Seq with RNNs and Attention



  • Example: English to French translation
  • Input: "This agreement on the European Economic Area was signed in August 1992."
  • Output: "L'accord sur la zone économique européenne a été signé en août 1992."

Seq2Seq with RNNs and Attention

  • The decoder doesn't use the fact that the $\mathbf{h}_i$ form an ordered sequence; it just treats them as an unordered set $\{\mathbf{h}_i\}$
  • Can we use a similar architecture for any set of input hidden vectors $\{\mathbf{h}_i\}$?
  • Attention: Deep Learning on Sets

Image Captioning with Attention

  • Use a CNN to compute a grid of features for an image.
  • Each timestep of the decoder uses a different context vector that attends to a different part of the input image

Image Captioning with Attention

  • Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

X, Attend, and Y

  • "Show, attend, and tell" (Xu et al, ICML 2015)
    • Look at image, attend to image regions, produce question
  • "Ask, attend, and answer" (Xu and Saenko, ECCV 2016)
  • "Show, ask, attend, and answer" (Kazemi and Elqursh, 2017)
    • Read text of question, attend to image regions, produce answer
  • "Listen, attend, and spell" (Chan et al, ICASSP 2016)
    • Process raw audio, aEend to audio regions while producing text

Attention Layer

  • Inputs:
    • Query Vector: $\mathbf{q} \in \mathbb{R}^{d}$
    • Input Vectors: $\mathbf{X} \in \mathbb{R}^{n \times d}$ (rows $\mathbf{x}_i \in \mathbb{R}^d$)
    • Similarity Function: $f_{att}(\cdot)$
  • Computation:
    • Similarities: $e_i=f_{att}(\mathbf{q},\mathbf{x}_i)$, $\mathbf{e}\in\mathbb{R}^{n}$
    • Attention Weights: $\mathbf{a}=softmax(\mathbf{e})$, $\mathbf{a}\in\mathbb{R}^{n}$
    • Output Vector: $\mathbf{y}=\sum_{i=1}^n a_i\mathbf{x}_i$, $\mathbf{y}\in\mathbb{R}^{d}$
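A minimal PyTorch sketch of this layer for a single query, using the scaled dot product from the next slide as the similarity function (the dimension of 64 and the random inputs are just for illustration):

```python
import torch
import torch.nn.functional as F

def attention(q, X):
    """Single-query attention: q is (d,), X is (n, d)."""
    d = q.shape[0]
    # Similarities e_i = f_att(q, x_i) = q^T x_i / sqrt(d), shape (n,).
    e = X @ q / d ** 0.5
    # Attention weights a = softmax(e): nonnegative, summing to 1.
    a = F.softmax(e, dim=0)
    # Output y = sum_i a_i x_i, shape (d,).
    return a @ X

q = torch.randn(64)
X = torch.randn(10, 64)
y = attention(q, X)      # (64,)
```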

Attention Layer Variations

  • Similarity Functions:
    • Dot Product: $f_{att}(\mathbf{q},\mathbf{x}_i) = \mathbf{q}^T\mathbf{x}_i$
    • Scaled Dot Product: $f_{att}(\mathbf{q},\mathbf{x}_i) = \frac{\mathbf{q}^T\mathbf{x}_i}{\sqrt{d}}$ (scaling keeps the softmax from saturating when $d$ is large)
    • Multiple Query Vectors ($\mathbf{Q}\in\mathbb{R}^{m\times d}$): $f_{att}(\mathbf{Q},\mathbf{X}) = \frac{\mathbf{Q}\mathbf{X}^T}{\sqrt{d}}$, computing all $m \times n$ similarities in one matrix product
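With multiple queries, all similarities come from a single matrix product; a small sketch under the row-vector convention used above (the sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

Q = torch.randn(3, 64)    # m = 3 query vectors
X = torch.randn(10, 64)   # n = 10 input vectors

E = Q @ X.T / 64 ** 0.5   # (m, n) scaled dot-product similarities
A = F.softmax(E, dim=1)   # softmax over the n inputs, one row per query
Y = A @ X                 # (m, d): one output vector per query
print(Y.shape)            # torch.Size([3, 64])
```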

Common Attention Layer

  • Inputs:
    • Query Vectors: $\mathbf{Q} \in \mathbb{R}^{n \times d}$
    • Input Vectors: $\mathbf{X} \in \mathbb{R}^{n \times d}$
    • Similarity Function: $f_{att}(\cdot)$
  • Computation:
    • Keys: $\mathbf{K}=\mathbf{X}\mathbf{W}_k$
    • Similarities: $E_{ij}=f_{att}(\mathbf{k}_i,\mathbf{q}_j)$, $\mathbf{E}\in\mathbb{R}^{n\times n}$
    • Attention Weights: $\mathbf{A}=softmax(\mathbf{E})$ (softmax over $i$ for each query $j$), $\mathbf{A}\in\mathbb{R}^{n \times n}$
    • Values: $\mathbf{V}=\mathbf{X}\mathbf{W}_v$
    • Output Vectors: $\mathbf{y}_j=\sum_{i=1}^n A_{ij}\mathbf{v}_i$, $\mathbf{y}_j\in\mathbb{R}^{d}$
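A sketch of this key-value layer in PyTorch, with the queries supplied from outside and the keys and values projected from the inputs. Bias-free `nn.Linear` projections and equal input/output dimensions are illustrative assumptions; the softmax is taken over the keys for each query, which matches the slide's normalization up to a transpose:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.d = d
        self.W_k = nn.Linear(d, d, bias=False)  # keys   K = X W_k
        self.W_v = nn.Linear(d, d, bias=False)  # values V = X W_v

    def forward(self, Q, X):
        K = self.W_k(X)                         # (n, d)
        V = self.W_v(X)                         # (n, d)
        E = Q @ K.T / self.d ** 0.5             # (n_q, n) similarities
        A = F.softmax(E, dim=-1)                # softmax over the keys
        return A @ V                            # (n_q, d) outputs

layer = KeyValueAttention(64)
Y = layer(torch.randn(5, 64), torch.randn(10, 64))  # (5, 64)
```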

Hard Attention Layer

  • Inputs:
    • Query Vector: $\mathbf{q} \in \mathbb{R}^{d}$
    • Input Vectors: $\mathbf{X} \in \mathbb{R}^{n \times d}$
    • Similarity Function: $f_{att}(\mathbf{q},\mathbf{x}_i) = \frac{\mathbf{q}^T\mathbf{x}_i}{\sqrt{d}}$
  • Computation:
    • Similarities: $e_i=f_{att}(\mathbf{q},\mathbf{x}_i)$, $\mathbf{e}\in\mathbb{R}^{n}$
    • Attention Weights: $\mathbf{a}=softmax(\mathbf{e})$, $\mathbf{a}\in\mathbb{R}^{n}$
    • Output Vector: $\mathbf{y}=\mathbf{x}_i$, with index $i$ sampled with probability $a_i$, $\mathbf{y}\in\mathbb{R}^{d}$
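The hard variant replaces the weighted average with a single sampled input. A minimal sketch of the forward pass only (the sampling step is not differentiable, so training this layer needs more than plain backpropagation; that part is not shown):

```python
import torch
import torch.nn.functional as F

def hard_attention(q, X):
    """Return one input vector, chosen with probabilities given by the attention weights."""
    e = X @ q / q.shape[0] ** 0.5              # scaled dot-product similarities (n,)
    a = F.softmax(e, dim=0)                    # attention weights (n,)
    i = torch.multinomial(a, num_samples=1)    # sample an index ~ Categorical(a)
    return X[i.item()]                         # y = x_i

y = hard_attention(torch.randn(64), torch.randn(10, 64))  # (64,)
```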

Self-Attention Layer

  • One query per input vector
  • Inputs:
    • Input Vectors: $\mathbf{X} \in \mathbb{R}^{n \times d}$ (queries, keys, and values are all computed from $\mathbf{X}$)
    • Learned Projections: $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v$
    • Similarity Function: $f_{att}(\mathbf{q}_j,\mathbf{k}_i) = \frac{\mathbf{q}_j^T\mathbf{k}_i}{\sqrt{d}}$
  • Computation:
    • Queries and Keys: $\mathbf{Q}=\mathbf{X}\mathbf{W}_q$, $\mathbf{K}=\mathbf{X}\mathbf{W}_k$
    • Similarities: $E_{ij}=\frac{\mathbf{q}_j^T\mathbf{k}_i}{\sqrt{d}}$, $\mathbf{E}\in\mathbb{R}^{n\times n}$
    • Attention Weights: $\mathbf{A}=softmax(\mathbf{E})$ (softmax over $i$ for each query $j$), $\mathbf{A}\in\mathbb{R}^{n \times n}$
    • Values: $\mathbf{V}=\mathbf{X}\mathbf{W}_v$
    • Output Vectors: $\mathbf{y}_j=\sum_{i=1}^n A_{ij}\mathbf{v}_i$, $\mathbf{y}_j\in\mathbb{R}^{d}$
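A compact PyTorch sketch of the layer, with queries, keys, and values all projected from $\mathbf{X}$ (bias-free projections and a square $d \times d$ shape are assumptions made for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.d = d
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)

    def forward(self, X):                    # X: (n, d)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        E = Q @ K.T / self.d ** 0.5          # (n, n) similarities
        A = F.softmax(E, dim=-1)             # each query's weights sum to 1
        return A @ V                         # (n, d) outputs

Y = SelfAttention(64)(torch.randn(10, 64))  # (10, 64)
```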

Self-Attention Layer

  • Consider permuting the vectors
  • Queries and Keys will be the same, but permuted
  • Similarities will be the same, but permuted
  • Attention will be the same, but permuted
  • Values will be the same, but permuted
  • Outputs will be the same, but permuted
  • Self-Attention layer is Permutation Equivariant: $f(s(x)) = s(f(x))$ for any permutation $s$
  • Self-Attention layer works on sets of vectors (see the numerical check below)
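A quick numerical check of the equivariance claim, using torch.nn.MultiheadAttention as the self-attention layer (one head, batch_first=True, and the toy sizes are arbitrary choices):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=1, batch_first=True)
attn.eval()

x = torch.randn(1, 6, 32)              # one "set" of 6 vectors
perm = torch.randperm(6)

with torch.no_grad():
    y, _ = attn(x, x, x)                              # self-attention, original order
    y_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])  # permuted inputs

# Permuting the inputs permutes the outputs the same way: f(s(x)) = s(f(x)).
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True
```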

Masked-Attention Layer

  • One query per input vector; each output may only attend to earlier inputs
  • Inputs:
    • Input Vectors: $\mathbf{X} \in \mathbb{R}^{n \times d}$ (queries, keys, and values are all computed from $\mathbf{X}$)
    • Learned Projections: $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v$
    • Similarity Function: $f_{att}(\mathbf{q}_j,\mathbf{k}_i) = \frac{\mathbf{q}_j^T\mathbf{k}_i}{\sqrt{d}}$
  • Computation:
    • Queries and Keys: $\mathbf{Q}=\mathbf{X}\mathbf{W}_q$, $\mathbf{K}=\mathbf{X}\mathbf{W}_k$
    • Similarities: $E_{ij}=\frac{\mathbf{q}_j^T\mathbf{k}_i}{\sqrt{d}}$, $\mathbf{E}\in\mathbb{R}^{n\times n}$
      • Mask out the future: set $E_{ij}=-\infty$ for key positions $i > j$
    • Attention Weights: $\mathbf{A}=softmax(\mathbf{E})$ (softmax over $i$ for each query $j$), $\mathbf{A}\in\mathbb{R}^{n \times n}$
    • Values: $\mathbf{V}=\mathbf{X}\mathbf{W}_v$
    • Output Vectors: $\mathbf{y}_j=\sum_{i=1}^n A_{ij}\mathbf{v}_i$, $\mathbf{y}_j\in\mathbb{R}^{d}$
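A sketch of the masking step on top of the self-attention computation above: a boolean causal mask sets the future similarities to $-\infty$ before the softmax (loose tensors instead of a module, and the sizes, are just for brevity):

```python
import torch
import torch.nn.functional as F

n, d = 6, 32
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
E = Q @ K.T / d ** 0.5                     # (n, n); row = query pos, column = key pos

# Causal mask: a query may only attend to keys at the same or earlier positions.
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
E = E.masked_fill(mask, float('-inf'))     # future similarities -> -inf

A = F.softmax(E, dim=-1)                   # A is lower-triangular
Y = A @ V                                  # (n, d) outputs
```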

Example: CNN with Self-Attention
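The figure for this slide is not reproduced here; the design it illustrates, self-attention over the spatial positions of a CNN feature map with 1×1 convolutions producing queries, keys, and values and a residual connection, can be sketched as follows. The channel sizes and the exact residual form are common choices assumed here, not taken from the slide:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSelfAttention(nn.Module):
    """Self-attention over the H*W spatial positions of a CNN feature map."""
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c, kernel_size=1)
        self.k = nn.Conv2d(c, c, kernel_size=1)
        self.v = nn.Conv2d(c, c, kernel_size=1)
        self.proj = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, x):                         # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        k = self.k(x).flatten(2)                  # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        A = F.softmax(q @ k / C ** 0.5, dim=-1)   # (B, HW, HW) attention weights
        y = (A @ v).transpose(1, 2).reshape(B, C, H, W)
        return x + self.proj(y)                   # residual connection

out = CNNSelfAttention(64)(torch.randn(2, 64, 16, 16))  # (2, 64, 16, 16)
```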

Processing Sequences

RNN
  • Works on Ordered Sequences
    • Good at long sequences: after one RNN layer, $h_T$ "sees" the whole sequence
    • Not parallelizable: need to compute hidden states sequentially
CNN
  • Works on Multidimensional Grids
    • Bad at long sequences: Need to stack many conv layers for outputs to "see" the whole sequence
    • Highly parallel: Each output can be computed in parallel
Self-Attention
  • Works on Sets of Vectors
    • Good at long sequences: after one self-attention layer, each output "sees" all inputs
    • Highly parallel: Each output can be computed in parallel
    • Very memory intensive: needs the full $n \times n$ attention matrix

Summary

  • Adding "attention" to RNN models lets them look at different parts of the input at each timestep
  • Generalized self-attention is a new, powerful neural network primitive
  • Lots more in the next class