Attention is All You Need

Front page of "Attention Is All You Need" paper

Introduction and Background

  • Previous works for language modeling and machine translation
    • Recurrent neural networks (RNN)
    • Long short-term memory (LSTM)
    • Gated recurrent units (GRU)
  • Recurrent models
    • Align the positions of the sequence with steps in computation time
    • Generate a sequence of hidden states $h_t$
      • $h_t = f(h_{t-1}, x_t)$: a function of the previous hidden state and the input at position $t$
    • This inherently sequential nature precludes parallelization within training examples (see the sketch below) :(
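
A minimal NumPy sketch of this recurrence; the function and weight names are illustrative assumptions, not code from the paper:

```python
# The recurrence h_t = f(h_{t-1}, x_t) forces a sequential loop over time steps,
# which is what prevents parallelization within a training example.
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """x: (seq_len, d_in); returns hidden states (seq_len, d_h)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:                      # each step depends on the previous h
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

# toy usage
d_in, d_h, seq_len = 4, 8, 5
out = rnn_forward(np.random.randn(seq_len, d_in),
                  np.random.randn(d_h, d_h),
                  np.random.randn(d_h, d_in),
                  np.zeros(d_h))
print(out.shape)  # (5, 8)
```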

RNN-based Encoder-Decoder by Minseok Heo
  • Reducing sequential computation for sequential data!
    • CNN as building block
      • Extended Neural GPU
      • ByteNet
      • ConvS2S
  • Attention model

RNN-based Encoder-Decoder with Attention by Minseok Heo
  • Self-attention
    • a.k.a. intra-attention
    • Attention mechanism relating different positions of a single sequence
    • In order to compute a representation of the sequence

Encoder-Decoder in Transformer with Self-Attention by Minseok Heo
  • Transformer
    • Eschewing recurrence
    • Relying entirely on a self-attention mechanism: no RNN or CNN
    • Draws global dependencies between input and output
    • SOTA translation quality after 12 hours with 8 P100 GPUs
  • Highlight:
    • RNNs could already predict the “next word to come”
    • provided they were given a “well-extracted summary of the previous word sequence”
    • So, our focus is how to efficiently extract a summary of the previous word sequence
    • That’s what the Transformer and self-attention are all about

Model Architecture

  • Encoder-Decoder structure
  • Encoder:
    • Maps an input seq of symbol representations $x = (x_1, x_2, … , x_n)$
    • To a seq of continuous representations $z = (z_1, z_2, … , z_n)$
  • Decoder:
    • Given $z = (z_1, z_2, … , z_n)$, generates output seq $y = (y_1, y_2, … , y_m)$
  • Auto-regressive
    • Consumes previously generated symbols as additional input when generating the next (see the decoding sketch below)
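
A hedged sketch of auto-regressive (greedy) decoding; `next_token_dist` is a hypothetical stand-in for the real decoder, and the greedy strategy is just one illustrative choice:

```python
# Previously generated symbols are fed back as additional decoder input.
import numpy as np

VOCAB, EOS, MAX_LEN = 100, 0, 20

def next_token_dist(z, y_prefix):
    """Dummy decoder: returns a probability distribution over the vocabulary."""
    rng = np.random.default_rng(len(y_prefix))
    p = rng.random(VOCAB)
    return p / p.sum()

def greedy_decode(z):
    y = []                                  # generated symbols so far
    while len(y) < MAX_LEN:
        p = next_token_dist(z, y)           # condition on z and previous outputs
        token = int(np.argmax(p))
        if token == EOS:
            break
        y.append(token)
    return y

print(greedy_decode(z=np.zeros(8)))
```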

Transformer Architecture

Encoder Stack

  • Stack of N=6 identical layers
  • Each layer with two sub-layers
    • Multi-head self-attention mechanism
      • With residual connection (Add)
      • Followed by layer normalization (Norm)
    • Position-wise fully connected feed-forward network
      • With residual connection (Add)
      • Followed by layer normalization (Norm)
  • Output dimension of all sub-layers and embedding layers is the same, $d_{model}$
    • To enable residual connections
    • $d_{model}$ = 512
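
A minimal sketch of the Add & Norm pattern wrapped around each sub-layer above, i.e. LayerNorm(x + Sublayer(x)); the toy sub-layer is an assumption and the learned gain/bias of layer normalization are omitted:

```python
import numpy as np

D_MODEL = 512

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))       # residual connection, then LayerNorm

x = np.random.randn(10, D_MODEL)             # (seq_len, d_model)
toy_sublayer = lambda h: h @ np.random.randn(D_MODEL, D_MODEL) * 0.01
print(add_and_norm(x, toy_sublayer).shape)   # (10, 512)
```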

Encoder Stack in Transformer by Minseok Heo

Single Encoder in Transformer by Minseok Heo

Decoder Stack

  • Stack of N=6 identical layers
  • Each layer with three sub-layers
    • Masked(!) Multi-head self-attention mechanism
      • With residual connection (Add)
      • Followed by layer normalization (Norm)
      • Masking
        • Prevents positions from attending to subsequent positions
        • The predictions for position $i$ can depend only on the known outputs at positions less than $i$ (see the mask sketch after this list)
    • Multi-head encoder-decoder attention over the output of the encoder stack
      • Input
        • $Q$ projected from the output of the previous multi-head self-attention sub-layer
        • $K$ and $V$ projected from the output of the encoder stack
      • With residual connection (Add)
      • Followed by layer normalization (Norm)
    • Position-wise fully connected feed-forward network
      • With residual connection (Add)
      • Followed by layer normalization (Norm)
  • Output dimension of all sub-layers and embedding layers is the same, $d_{model}$
    • To enable residual connections
    • $d_{model}$ = 512
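
A small sketch (assumed NumPy implementation) of the decoder's look-ahead mask: adding $-\infty$ to the attention logits above the diagonal before the softmax zeroes out attention to subsequent positions:

```python
import numpy as np

def causal_mask(seq_len):
    # upper-triangular entries (j > i) are masked out
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(5, 5)               # toy attention logits
weights = softmax(scores + causal_mask(5))
print(np.round(weights, 2))                   # lower-triangular attention weights
```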

Connection between encoders and decoders by Jay Alammar

Decoder Stack in Transformer by Minseok Heo

Single Decoder in Transformer by Minseok Heo

Mask matrix and input/output dependency in decoder by Peter Bloem

The Last Decoder in Transformer by Minseok Heo

Attention

  • Attention function
    • Maps a query and a set of key-value pairs to an output
    • Output = weighted sum of the values
      • The weight assigned to each value is computed by a compatibility function of the query with the corresponding key

Scaled Dot-Product Attention and Multi-Head Attention

Meaning of Query, Key, Value by Jay Alammar

Keys and Values from encoder, Queries from decoder by Minseok Heo

Scaled Dot-Product Attention

  • Input: queries and keys of dimension $d_k$ and values of dimension $d_v$

  • $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

  • Scaling factor $1/\sqrt{d_k}$:

    • For large $d_k$, the dot products grow large in magnitude
    • Pushing the softmax into regions where it has extremely small gradients
    • The scaling factor counteracts this effect (see the sketch below)
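
A minimal NumPy sketch of scaled dot-product attention as defined above; the shapes in the usage example are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (.., n_q, n_k)
    if mask is not None:
        scores = scores + mask                        # e.g. -inf at masked positions
    weights = softmax(scores)
    return weights @ V                                # weighted sum of the values

Q = np.random.randn(3, 64)    # 3 queries, d_k = 64
K = np.random.randn(5, 64)    # 5 keys
V = np.random.randn(5, 64)    # 5 values, d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 64)
```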

Q, K, V generation by Jay Alammar and Minseok Heo

Output of Self-attention by Jay Alammar and Minseok Heo

Multi-Head Attention

  1. Linearly project the Q, K, V $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions
  2. Then perform the attention function on each projection in parallel
  3. Yielding $d_v$-dimensional output values
  4. Concatenate them into an $h \times d_v$-dimensional vector
  5. Linearly project once more to obtain the final values (see the sketch after this list)

Multi-head attention by Jay Alammar and Minseok Heo
  • Benefit:
    • Jointly attend to information from different representation subspaces at different positions
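
A minimal NumPy sketch of multi-head attention following steps 1-5 above, with $h = 8$ and $d_k = d_v = d_{model}/h = 64$; the random matrices stand in for the learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, d_model=512, h=8):
    d_k = d_v = d_model // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_v)) / np.sqrt(d_model)
        heads.append(attention(Q @ W_q, K @ W_k, V @ W_v))   # steps 1-3
    concat = np.concatenate(heads, axis=-1)                   # step 4: (n, h*d_v)
    W_o = rng.standard_normal((h * d_v, d_model)) / np.sqrt(d_model)
    return concat @ W_o                                       # step 5: final projection

x = np.random.randn(10, 512)                 # self-attention: Q = K = V = layer input
print(multi_head_attention(x, x, x).shape)   # (10, 512)
```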

Applications of Attention in our Model

  • Encoder-decoder attention layers
    • Queries ($Q$) come from the previous decoder layer; keys ($K$) and values ($V$) come from the output of the encoder
    • Every position in the decoder can attend over all positions in the input sequence
  • Self-attention layers
    • Encoder:
      • Keys ($K$), values ($V$) and queries ($Q$) come from the output of the previous layer in the encoder.
      • Each position can attend to all positions
    • Decoder:
      • Keys ($K$), values ($V$) and queries ($Q$) come from the output of the previous layer in the decoder.
      • Each position can attend to all positions up to and including that position $\rightarrow$ Need Mask!

Decoding time step 3: Q from previous outputs, K and V from encoder

Decoding time step 4: Q from previous outputs, K and V from encoder

Position-wise Feed-Forward Networks

  • Applied to each position separately and identically
  • $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ (see the sketch below)
  • Input/Output: $d_{model}$ = 512
  • Inner-layer: $d_{ff}$ = 2048
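
A minimal sketch of the position-wise feed-forward network, applied identically at every position; the random weights stand in for learned parameters:

```python
import numpy as np

D_MODEL, D_FF = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((D_MODEL, D_FF)) * 0.02, np.zeros(D_FF)
W2, b2 = rng.standard_normal((D_FF, D_MODEL)) * 0.02, np.zeros(D_MODEL)

def ffn(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear layers

x = np.random.randn(10, D_MODEL)      # (seq_len, d_model)
print(ffn(x).shape)                   # (10, 512)
```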

Embeddings and Softmax

  • Learned embeddings convert the input tokens and output tokens to vectors of dimension $d_{model}$
  • A learned linear transformation and softmax function
    • convert the decoder output to predicted next-token probabilities
  • The two embedding layers and the pre-softmax linear transformation share the same weight matrix
    • In the embedding layers, those weights are multiplied by $\sqrt{d_{model}}$ (see the sketch below)
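
A hedged sketch of the shared weight matrix: the same matrix embeds tokens (scaled by $\sqrt{d_{model}}$) and, transposed, serves as the pre-softmax linear transformation; the toy vocabulary size is an assumption (the paper uses ~37,000 BPE tokens):

```python
import numpy as np

VOCAB, D_MODEL = 1000, 512            # toy vocabulary; paper: ~37,000 BPE tokens
E = np.random.default_rng(0).standard_normal((VOCAB, D_MODEL)) * 0.02

def embed(token_ids):
    return E[token_ids] * np.sqrt(D_MODEL)        # input/output embedding

def output_logits(decoder_out):
    return decoder_out @ E.T                      # pre-softmax projection reuses E

h = np.random.randn(10, D_MODEL)                  # toy decoder output
print(embed(np.array([1, 2, 3])).shape, output_logits(h).shape)  # (3, 512) (10, 1000)
```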

Positional Encoding

  • Must inject some information about
    • the relative or
    • absolute position of the tokens in the sequence
  • Sine and cosine functions of different frequencies
    • $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$
    • $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$
    • Added to the input embeddings at the bottoms of the encoder and decoder stacks (see the sketch below)

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears split in half down the center. That's because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They're then concatenated to form each of the positional encoding vectors. (From Jay Alammar's blog)
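
A minimal sketch of the sinusoidal positional encoding, interleaving sine (even indices) and cosine (odd indices) as in the formulas above:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

print(positional_encoding(20).shape)    # (20, 512), added to the input embeddings
```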

Why Self-Attention

  1. Better complexity
    • Per-layer complexity $O(n^2 \cdot d)$, faster than a recurrent layer's $O(n \cdot d^2)$ when the sequence length $n$ is smaller than the representation dimension $d$
  2. Better parallelism
    • $O(1)$ sequential operations vs. $O(n)$ for a recurrent layer
  3. Better path length between long-range dependencies in the network
    • $O(1)$ maximum path length vs. $O(n)$ for a recurrent layer
  4. More interpretable models
    • Different attention heads learn to perform different tasks


Training

  • Dataset:
    • WMT 2014 English-German dataset
      • 4.5 million sentence pairs
      • Encoded using byte-pair encoding
      • Shared source-target vocabulary of ~37,000 tokens
    • WMT 2014 English-French dataset
      • 36M sentences
      • Tokens split into a 32,000 word-piece vocabulary
  • Batch:
    • a set of sentence pairs containing ~25000 source tokens and 25000 target tokens
  • Hardware:
    • 8 P100 GPUs
    • Base model
      • Each training step ~ 0.4 second
      • 100,000 steps ~ 12 hours
    • Big model
      • ~ 1.0 seconds per step
      • 300,000 steps ~ 3.5 days
  • Optimizer
    • Adam optimizer
    • Learning rate: $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$
    • warmup_steps = 4000 (see the sketch after this list)
  • Regularization
    • Residual dropout ($P_{drop} = 0.1$ for the base model)
    • Label smoothing ($\epsilon_{ls} = 0.1$)
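
A small sketch of the learning-rate schedule used with Adam: linear warmup over the first 4,000 steps, then decay proportional to the inverse square root of the step number:

```python
# lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
def lrate(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)                       # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 4000, 100000):
    print(s, round(lrate(s), 6))
```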

Results

  • Machine translation
    • WMT 2014 English-to-German translation task:
      • 28.4 BLEU, more than 2.0 BLEU above the previous state of the art (including ensembles)
    • WMT 2014 English-to-French translation task
      • 41.0 BLEU, outperforming all previously published single models at less than 1/4 of the training cost
  • Model variations
Variations on the Transformer architecture
