NLP and Transformers


CSE 891: Deep Learning

Vishnu Boddeti

Wednesday October 20, 2021

Today

  • Attention
  • Transformer
  • NLP

Recap: Attention Layer

  • Inputs:
    • Query Vector: $\mathbf{q} \in \mathbb{R}^{d}$
    • Input Vectors: $\mathbf{X} \in \mathbb{R}^{n \times d}$
    • Similarity Function: $f_{att}(\cdot)$
  • Computation:
    • Keys: $\mathbf{K}=\mathbf{X}\mathbf{W}_k$
    • Similarities: $\mathbf{e}=f_{att}(\mathbf{K},\mathbf{q})$, $\mathbf{e}\in\mathbb{R}^{n}$
    • Attention Weights: $\mathbf{a}=softmax(\mathbf{e})$, $\mathbf{a}\in\mathbb{R}^{n}$
    • Values: $\mathbf{V}=\mathbf{X}\mathbf{W}_v$
    • Output Vector: $\mathbf{y}=\sum_{i=1}^n a_{i}\mathbf{v}_i$, $\mathbf{y}\in\mathbb{R}^{d}$

Recap: Self-Attention Layer

  • Inputs:
    • Input Vectors: $\mathbf{X} \in \mathbb{R}^{n \times d}$ (queries, keys, and values are all derived from the input)
    • Parameters: $\mathbf{W}_q \in \mathbb{R}^{d \times d'}$, $\mathbf{W}_k \in \mathbb{R}^{d \times d'}$, $\mathbf{W}_v \in \mathbb{R}^{d \times d'}$
    • Similarity Function: $f_{att}(\mathbf{q}_j,\mathbf{k}_i) = \frac{\mathbf{q}_j^T\mathbf{k}_i}{\sqrt{d'}}$ (scaled dot product)
  • Computation:
    • Queries and Keys: $\mathbf{Q}=\mathbf{X}\mathbf{W}_q$, $\mathbf{K}=\mathbf{X}\mathbf{W}_k$, $\mathbf{Q},\mathbf{K}\in\mathbb{R}^{n\times d'}$
    • Similarities: $\mathbf{E}=\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d'}}$, $\mathbf{E}\in\mathbb{R}^{n\times n}$
    • Attention Weights: $\mathbf{A}=softmax(\mathbf{E})$ (row-wise), $\mathbf{A}\in\mathbb{R}^{n \times n}$
    • Values: $\mathbf{V}=\mathbf{X}\mathbf{W}_v$, $\mathbf{V}\in\mathbb{R}^{n\times d'}$
    • Output Vectors: $\mathbf{y}_j=\sum_{i=1}^n A_{ji}\mathbf{v}_i$, $\mathbf{y}_j\in\mathbb{R}^{d'}$ (a minimal PyTorch sketch follows below)
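
The matrix form above translates directly into code. Below is a minimal single-head PyTorch sketch of this computation; the class name `SelfAttention` and the argument names are illustrative, not code from the lecture.

    import math
    import torch
    from torch import nn

    class SelfAttention(nn.Module):
        """Minimal single-head scaled dot-product self-attention (illustrative sketch)."""
        def __init__(self, d, d_prime):
            super().__init__()
            self.W_q = nn.Linear(d, d_prime, bias=False)  # W_q in R^{d x d'}
            self.W_k = nn.Linear(d, d_prime, bias=False)  # W_k in R^{d x d'}
            self.W_v = nn.Linear(d, d_prime, bias=False)  # W_v in R^{d x d'}

        def forward(self, X):  # X: (n, d) or (batch, n, d)
            Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)       # each (..., n, d')
            E = Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1])  # similarities, (..., n, n)
            A = E.softmax(dim=-1)                                 # row-wise attention weights
            return A @ V                                          # outputs, (..., n, d')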

Advantages of Attention

  • Significantly improves Neural Machine Translation performance.
    • Allows decoder to focus on certain parts of the source.
  • Solves the bottleneck problem.
    • Allows the decoder to look directly at the source, bypassing the bottleneck.
  • Helps with the vanishing gradient problem.
    • Provides a shortcut between faraway states.
  • Provides some interpretability.
    • By inspecting attention distribution, we can see what the decoder was focusing on.

Attention for Deep Learning

  • The weighted sum is a selective summary of the information contained in the values; the query determines which values to focus on.
  • A way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).

Multihead Self-Attention Layer

  • Scaled Dot-Product Attention attends to one or a few entries in the input key-value pairs.
    • Only one way for a word to interact with others
  • Humans can attend to many things simultaneously.
    • Can we extend attention to achieve the same?
  • Idea: apply Scaled Dot-Product Attention multiple times on the linearly transformed inputs.
    • Split Inputs
    • Use $H$ independent "Attention Heads" in parallel
    • Concatenate Outputs (see the sketch below)
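
A hedged PyTorch sketch of the multi-head idea, splitting the model width across $H$ heads run in parallel and concatenating their outputs; the class and variable names are illustrative.

    import math
    import torch
    from torch import nn

    class MultiHeadSelfAttention(nn.Module):
        """H scaled dot-product attention heads in parallel; outputs concatenated and mixed."""
        def __init__(self, d, num_heads):
            super().__init__()
            assert d % num_heads == 0, "model width must split evenly across heads"
            self.h, self.d_head = num_heads, d // num_heads
            self.W_qkv = nn.Linear(d, 3 * d, bias=False)  # joint projection for Q, K, V
            self.W_o = nn.Linear(d, d, bias=False)        # mixes the concatenated head outputs

        def forward(self, X):                              # X: (batch, n, d)
            b, n, _ = X.shape
            qkv = self.W_qkv(X).reshape(b, n, 3, self.h, self.d_head)
            Q, K, V = qkv.permute(2, 0, 3, 1, 4)           # each: (batch, h, n, d_head)
            E = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)
            A = E.softmax(dim=-1)                          # independent attention weights per head
            Y = (A @ V).transpose(1, 2).reshape(b, n, -1)  # concatenate heads -> (batch, n, d)
            return self.W_o(Y)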

Making Attention Positional Again

  • Unlike RNN and CNN encoders, the attention encoder's outputs do not depend on the order of the inputs.
  • The order of the sequence conveys important information for machine translation and language modeling tasks.
  • Idea: add positional information about each input token's location in the sequence to the input embedding vectors (a code sketch follows this list).
  • $$PE_{pos, 2i} = \sin\left(\frac{pos}{10000^{2i/d_{emb}}}\right) \quad\mbox{and}\quad PE_{pos, 2i+1} = \cos\left(\frac{pos}{10000^{2i/d_{emb}}}\right)$$
  • The final input embedding combines the learnable token embedding with the positional encoding (summed element-wise in the original Transformer).
  • Positional encoding allows same words at different locations to have different overall representations.
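
As a concrete illustration of the formula above, here is a hedged PyTorch sketch that builds the sinusoidal positional-encoding table; the function name is illustrative, and an even embedding dimension is assumed.

    import torch

    def sinusoidal_positional_encoding(max_len, d_emb):
        """PE[pos, 2i] = sin(pos / 10000^(2i/d_emb)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_emb))."""
        assert d_emb % 2 == 0, "assumes an even embedding dimension"
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
        i = torch.arange(0, d_emb, 2, dtype=torch.float32)             # even indices 2i
        angle = pos / (10000 ** (i / d_emb))                           # (max_len, d_emb / 2)
        pe = torch.zeros(max_len, d_emb)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe  # combined with the token embeddings before the first transformer block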

Manual Positional Encoding

Learned Positional Encoding

  • One can learn the positional encoding instead of manually specifying it (a minimal sketch follows).
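
A hedged sketch of the learned alternative: the positional encoding becomes an embedding table indexed by position and trained jointly with the rest of the network (class name illustrative).

    import torch
    from torch import nn

    class LearnedPositionalEncoding(nn.Module):
        def __init__(self, max_len, d_emb):
            super().__init__()
            self.pos_emb = nn.Embedding(max_len, d_emb)  # one learnable vector per position

        def forward(self, token_emb):                    # token_emb: (batch, n, d_emb)
            positions = torch.arange(token_emb.shape[1], device=token_emb.device)
            return token_emb + self.pos_emb(positions)   # inject positional information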

More Attention Variants

  • Dot Product: $f_{att}(\mathbf{q},\mathbf{x}_i) = \mathbf{q}^T\mathbf{x}_i$
    • Assumes the key and query have the same dimensionality.
  • Bilinear: $f_{att}(\mathbf{q},\mathbf{x}_i) = \mathbf{q}^T\mathbf{W}\mathbf{x}_i$
    • $\mathbf{W}\in\mathbb{R}^{d_1\times d_2}$
    • Allows the key and query to have different dimensionalities.
  • Additive: $f_{att}(\mathbf{q},\mathbf{x}_i) = \mathbf{v}^T\mbox{tanh}(\mathbf{W}_1\mathbf{q} + \mathbf{W}_2\mathbf{x}_i)$
    • $\mathbf{W}_1\in\mathbb{R}^{d_3\times d_1}, \mathbf{W}_2\in\mathbb{R}^{d_3\times d_2}, \mathbf{v}\in\mathbb{R}^{d_3}$
    • $d_3$ (the attention dimensionality) is a hyperparameter
    • More flexible similarity function (all three scoring variants are sketched below).
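
The three scoring functions differ only in how the similarity is computed. Below is a hedged PyTorch sketch (function names illustrative), with the query `q` and the inputs $\mathbf{x}_i$ stacked row-wise into a matrix `X`.

    import torch

    def dot_product_score(q, X):             # q: (d,), X: (n, d) -- same dimensionality required
        return X @ q                          # (n,) similarity scores

    def bilinear_score(q, X, W):              # q: (d1,), X: (n, d2), W: (d1, d2)
        return X @ (W.T @ q)                  # q^T W x_i for each row x_i

    def additive_score(q, X, W1, W2, v):          # W1: (d3, d1), W2: (d3, d2), v: (d3,)
        return torch.tanh(q @ W1.T + X @ W2.T) @ v   # v^T tanh(W1 q + W2 x_i)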

Visualizing Attention

  • Self-attention layers learned that "it" can refer to different entities in different contexts.

Visualizing Multi-Head Attention

Computational Cost and Parallelism

  • Computational Cost:
    • Complexity: How many multiply-add operations for the forward and backward pass.
    • Sequential Ops: The computations that cannot be parallelized. (The part of the model that requires a for loop.)
  • Maximum Path Length: the length of the shortest path connecting the first encoder input to the last decoder output; shorter paths make long-range dependencies easier to learn.

Computational Efficiency

| Layer Type | Complexity per Layer | Sequential Ops | Max Path Length |
|---|---|---|---|
| Self-Attention | $\mathcal{O}(n^2 \cdot d)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Recurrent | $\mathcal{O}(n \cdot d^2)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| Convolutional | $\mathcal{O}(k \cdot n \cdot d^2)$ | $\mathcal{O}(1)$ | $\mathcal{O}(\log_k(n))$ |
| Self-Attention (restricted) | $\mathcal{O}(r \cdot n \cdot d)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n/r)$ |

($n$: sequence length, $d$: representation dimension, $k$: convolution kernel size, $r$: neighborhood size in restricted self-attention.)

Attention is All You Need

Vaswani et al, "Attention is all you need", NeurIPS 2017

Why Transformers?

  • We want parallelization but RNNs are inherently sequential.
  • Despite LSTMs, RNNs generally need an attention mechanism to deal with long-range dependencies.
    • path length between states grows with distance otherwise
  • But if attention gives us access to any state… maybe we can just use attention and drop the RNN entirely?

The Transformer Block

  • All vectors interact with each other (self-attention)
  • Residual connection
  • Choice of normalization: Layer normalization
  • MLP applied independently to each vector
  • Residual connection
  • Output (a minimal sketch of the full block follows below)
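
A hedged PyTorch sketch of one such block in the original post-norm ordering (self-attention, residual, LayerNorm, per-token MLP, residual, LayerNorm); the class name and hyperparameters are illustrative, and `nn.MultiheadAttention` stands in for the attention sub-layer.

    import torch
    from torch import nn

    class TransformerBlock(nn.Module):
        def __init__(self, d, num_heads, d_ff):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d)
            self.mlp = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))
            self.norm2 = nn.LayerNorm(d)

        def forward(self, x):                  # x: (batch, n, d)
            a, _ = self.attn(x, x, x)          # self-attention: all vectors interact
            x = self.norm1(x + a)              # residual connection + layer normalization
            x = self.norm2(x + self.mlp(x))    # MLP applied per vector + residual + normalization
            return x

Pre-norm variants apply the LayerNorm before each sub-layer instead; both orderings appear in practice.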

The Transformer

  • Transformer Block:
    • Input: set of vectors $\mathbf{x}$
    • Output: set of vectors $\mathbf{y}$
    • Self-attention is the only interaction between vectors
    • Layer Norm and MLP work independently per vector
    • Highly scalable, highly parallelizable
  • A Transformer is a sequence of transformer blocks.
    • Vaswani et al: 12 blocks (6 encoder + 6 decoder), $d=512$, 8 heads
  • Vaswani et al, "Attention is all you need", NeurIPS 2017

How does it do?

  • Vaswani et al, "Attention is all you need", NeurIPS 2017

Transformer Variants

  • GPT
    • Contextualize words by just using left-to-right language model
    • Pre-train network to predict the next word
  • ELMo
    • Pre-trains two language models: left-to-right and right-to-left
    • Concatenates their final-layer outputs
    • Only a 'shallow' combination of leftward and rightward context
  • BERT
    • Mask 15% of tokens, and predict blanks
    • All words condition on all other words

The Transformer: Transfer Learning

  • "ImageNet Moment for Natural Language Processing"

  • Pre-Training:
    • Download a lot of text from the internet
    • Train a giant Transformer model for language modeling
    • Lots of objectives work well!
    • Crucial to model deep, bidirectional interactions between words
    • Large gains from scaling up pre-training, with no clear limits yet

  • Finetuning:
    • Fine-tune the Transformer on your own NLP task
  • Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL 2019

Scaling Up Transformer

| Model | Layers | Width | Heads | Params | Data | Training |
|---|---|---|---|---|---|---|
| Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hrs) |
| Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days) |
| BERT-Base | 12 | 768 | 12 | 110M | 13GB | |
| BERT-Large | 24 | 1024 | 16 | 340M | 13GB | |
| XLNet-Large | 24 | 1024 | 16 | 340M | 126GB | 512x TPU-v3 (2.5 days) |
| RoBERTa | 24 | 1024 | 16 | 355M | 160GB | 1024x V100 (1 day) |
| GPT-2 | 48 | 1600 | ? | 1.5B | 40GB | |
| Megatron-LM | 72 | 3072 | 32 | 8.3B | 174GB | 512x V100 (9 days) |
| Turing-NLG | 78 | 4256 | 28 | 17B | ? | 256x V100 |
| GPT-3 | 96 | 12288 | 96 | 175B | 694GB | ? |
  • Brown et al, "Language Models are Few-Shot Learners", arXiv 2020

Generated Text from GPT-3

  • Title: United Methodists Agree to Historic Split
  • Subtitle: Those who oppose gay marriage will form their own denomination
  • Article: After two days of intense debate, the United Methodist Church has agreed to a historic split - one that is expected to end in the creation of a new denomination, one that will be "theologically and socially conservative," according to The Washington Post. The majority of delegates attending the church's annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will "discipline" clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the "largest Protestant denomination in the U.S.," but that it has been shrinking in recent decades. The new split will be the second in the church's history. The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church. The Post notes that the proposed split "comes at a critical time for the church, which has been losing members for years," which has been "pushed toward the brink of a schism over the role of LGBTQ people in the church." Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them from serving as clergy, and the South Pacific regional conference voting to allow them.

GPT-3: Programming by Prompt

  • Poor English input: I eated the purple berries.
  • Good English output: I ate the purple berries.
  • Poor English input: Thank you for picking me as your designer. I’d appreciate it.
  • Good English output: Thank you for choosing me as your designer. I appreciate it.
  • Poor English input: The mentioned changes have done. or I did the alteration that you requested. or I changed things you wanted and did the modifications.
  • Good English output: The requested changes have been made. or I made the alteration that you requested. or I changed things you wanted and made the modifications.
  • Poor English input: I’d be more than happy to work with you in another project
  • Good English output: I'd be more than happy to work with you on another project.

GPT-3: Literary Style Transfer

  • Topic: humor: literary parodies and homages, pastiches, style parodies.
  • Parodies of the fantasy novel series Harry Potter in the style of various famous authors:
  • By Ernest Hemingway:
  • "It was a cold day on Privet Drive. A child cried. Harry felt nothing. He was dryer than dust. He had been silent too long. He had not felt love. He had scarcely felt hate. Yet the Dementor’s Kiss killed nothing. Death didn’t leave him less dead than he had been a second before. It wasn’t about living or dying really. It wasn’t about death. It was about the Dark Lord, his creator, his engenderer. He was not resurrected. He was created again. He was whole again in the courtyard of Malfoy Manor."

GPT-3: Literary Style Transfer

  • Topic: humor: literary parodies and homages, pastiches, style parodies.
  • Parodies of the fantasy novel series Harry Potter in the style of various famous authors:
  • By Jane Austen:
  • "It is a truth universally acknowledged, that a broken Harry is in want of a book. This he knows to his cost; pressing that wretched nose against the window of a bookstore in Diagon Alley the last thing before he goes into hiding for several years, where upon, he goes straight to Azkaban."

GPT-3: Literary Style Transfer

  • Topic: humor: literary parodies and homages, pastiches, style parodies.
  • Parodies of the fantasy novel series Harry Potter in the style of various famous authors:
  • By Arthur Conan Doyle:
  • "Harry pushed at the swinging doors of the bookshop hard, and nearly knocked himself unconscious. He staggered in with his ungainly package, his cheeks scarlet with cold and the shame of having chosen the wrong month to go Christmas shopping. The proprietor of the store, however, didn’t cast even a cursory glance at him, being far more interested in an enormous hunk of wood lying in the middle of the floor, which certainly looked like a gallows. Yes, the proprietor said to a reedy old man wearing a bowler hat and a forlorn expression that made Harry want to kick him, I can rent you such a gallows for a small fee."

Attention for Semantic Segmentation

  • Zhang et al, Co-occurrent Features in Semantic Segmentation, CVPR 2019

Transformers for Object Detection

  • Carion et al, End-to-end object detection with Transformers, ECCV 2020

Transformer Variants

  • Lots of focus on reducing the computational complexity of transformer models.

Vision Transformer

  • Adapting transformers for images (a minimal sketch follows below).
  • Matches the best CNN performance when pre-trained on a large-scale external dataset.
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
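
A hedged sketch of the core recipe: split the image into fixed-size patches, linearly embed each patch as a token, prepend a learnable class token, add positional embeddings, and run a standard Transformer encoder. The class name, default hyperparameters, and use of `nn.TransformerEncoder` are illustrative choices, not the reference implementation.

    import torch
    from torch import nn

    class ViT(nn.Module):
        def __init__(self, image_size=224, patch_size=16, channels=3,
                     dim=768, depth=12, heads=12, num_classes=1000):
            super().__init__()
            num_patches = (image_size // patch_size) ** 2
            # patch embedding as a strided convolution: one token per patch_size x patch_size patch
            self.patch_embed = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.pos_emb = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learned positions
            layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                               activation='gelu', batch_first=True, norm_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, num_classes)  # classification head

        def forward(self, img):                                    # img: (batch, channels, H, W)
            x = self.patch_embed(img).flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
            cls = self.cls_token.expand(x.shape[0], -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_emb          # prepend class token, add positions
            x = self.encoder(x)
            return self.head(x[:, 0])                              # classify from the class token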

Vision Transformer Training and Finetuning

  • Training
    • Pre-train on very large dataset.
    • Adam optimizer, batch size of 4096, linear learning rate warmup and decay
  • Fine Tuning
    • Replace the final MLP classification head and tune its parameters (see the sketch below).
    • Fine-tune with SGD + momentum.
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021
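
A hedged sketch of the fine-tuning recipe described above, reusing the illustrative `ViT` class from the earlier sketch; the checkpoint path, class count, and learning rate are placeholders, not the paper's settings.

    import torch
    from torch import nn

    model = ViT(num_classes=1000)                               # pre-trained backbone (illustrative)
    # model.load_state_dict(torch.load('pretrained_vit.pt'))    # hypothetical checkpoint path

    # replace the final classification head for the downstream task
    model.head = nn.Linear(model.head.in_features, 10)

    # fine-tune with SGD + momentum (use model.head.parameters() to tune only the new head)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)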

Vision Transformer

  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021

MLP-Mixer

  • "While convolutions and attention are both sufficient for good performance, neither of them are necessary."
  • Mlp-mixer: An all-mlp architecture for vision, NeurIPS 2021

MLP-Mixer

  • No pooling; operates at the same resolution throughout the network.
  • MLP-Mixing Layers:
    • Token-Mixing MLP: allow communication between different spatial locations (tokens)
    • Channel-Mixing MLP: allow communication between different channels
    • Interleave between the layers.
    • Very simple architecture.
  • "In the extreme situation, our architecture can be seen as a unique CNN, which uses (1×1) convolutions for channel mixing, and single-channel depth-wise convolutions for token mixing. However, the converse is not true as CNNs are not special cases of Mixer."
  • Mlp-mixer: An all-mlp architecture for vision, NeurIPS 2021

MLP-Mixer

  • Mlp-mixer: An all-mlp architecture for vision, NeurIPS 2021

MLP-Mixer


    # MLP-Mixer in PyTorch (einops provides the patch and pooling rearrangement layers)
    from functools import partial

    from torch import nn
    from einops.layers.torch import Rearrange, Reduce

    class PreNormResidual(nn.Module):
        """LayerNorm, then the wrapped module, with a residual connection."""
        def __init__(self, dim, fn):
            super().__init__()
            self.fn = fn
            self.norm = nn.LayerNorm(dim)

        def forward(self, x):
            return self.fn(self.norm(x)) + x

    def FeedForward(dim, expansion_factor=4, dropout=0., dense=nn.Linear):
        """Two-layer MLP; `dense` selects token mixing (Conv1d) or channel mixing (Linear)."""
        return nn.Sequential(
            dense(dim, dim * expansion_factor),
            nn.GELU(),
            nn.Dropout(dropout),
            dense(dim * expansion_factor, dim),
            nn.Dropout(dropout)
        )

    def MLPMixer(*, image_size, channels, patch_size, dim, depth, num_classes, expansion_factor=4, dropout=0.):
        assert (image_size % patch_size) == 0, 'image must be divisible by patch size'
        num_patches = (image_size // patch_size) ** 2
        # a 1x1 Conv1d mixes across patches (tokens); a Linear layer mixes across channels
        chan_first, chan_last = partial(nn.Conv1d, kernel_size=1), nn.Linear

        return nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_size, p2=patch_size),
            nn.Linear((patch_size ** 2) * channels, dim),  # per-patch linear embedding
            *[nn.Sequential(
                PreNormResidual(dim, FeedForward(num_patches, expansion_factor, dropout, chan_first)),  # token mixing
                PreNormResidual(dim, FeedForward(dim, expansion_factor, dropout, chan_last))            # channel mixing
            ) for _ in range(depth)],
            nn.LayerNorm(dim),
            Reduce('b n c -> b c', 'mean'),                # global average pooling over tokens
            nn.Linear(dim, num_classes)
        )
  

MLP-Mixer


    # MLP-Mixer in Flax/JAX
    from typing import Any

    import einops
    import flax.linen as nn
    import jax.numpy as jnp

    class MlpBlock(nn.Module):
      mlp_dim: int
      @nn.compact
      def __call__(self, x):
        y = nn.Dense(self.mlp_dim)(x)
        y = nn.gelu(y)
        return nn.Dense(x.shape[-1])(y)

    class MixerBlock(nn.Module):
      """Mixer block layer."""
      tokens_mlp_dim: int
      channels_mlp_dim: int
      @nn.compact
      def __call__(self, x):
        y = nn.LayerNorm()(x)
        y = jnp.swapaxes(y, 1, 2)
        y = MlpBlock(self.tokens_mlp_dim, name='token_mixing')(y)
        y = jnp.swapaxes(y, 1, 2)
        x = x + y
        y = nn.LayerNorm()(x)
        return x + MlpBlock(self.channels_mlp_dim, name='channel_mixing')(y)

    class MlpMixer(nn.Module):
      """Mixer architecture."""
      patches: Any
      num_classes: int
      num_blocks: int
      hidden_dim: int
      tokens_mlp_dim: int
      channels_mlp_dim: int
      @nn.compact
      def __call__(self, inputs, *, train):
        del train
        x = nn.Conv(self.hidden_dim, self.patches.size, strides=self.patches.size, name='stem')(inputs)
        x = einops.rearrange(x, 'n h w c -> n (h w) c')
        for _ in range(self.num_blocks):
          x = MixerBlock(self.tokens_mlp_dim, self.channels_mlp_dim)(x)
        x = nn.LayerNorm(name='pre_head_layer_norm')(x)
        x = jnp.mean(x, axis=1)
        if self.num_classes:
          x = nn.Dense(self.num_classes, kernel_init=nn.initializers.zeros, name='head')(x)
        return x
  

Summary

  • Transformers are a new class of neural network models that rely solely on attention.
  • However, the models are extremely expensive
  • Improvements (unfortunately) seem to mostly come from even more expensive models and more data
  • If you can afford large data and large compute, transformers are the go-to architecture, instead of CNNs, RNNs, etc.
    • On our way back to fully-connected models, throwing out the inductive bias of CNNs and RNNs.
  • Current research suggests that, perhaps, all we need is MLPs.
    • Lots of exciting research ahead of us.