Attention is All You Need
Introduction and Background

Previous work on language modeling and machine translation
 Recurrent neural networks (RNN)
 Long short-term memory (LSTM)
 Gated recurrent neural networks (GRU)
 Recurrent models
 Align the positions with steps in computation time
 Generate a sequence of hidden states $h_t$
 $h_t = f(h_{t-1}, x_t)$: a function of the previous hidden state $h_{t-1}$ and the input for position $t$
 Precludes parallelization within training examples (see the recurrence sketch below) :(
 Reducing sequential computation for sequential data!
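
A minimal numpy sketch of why recurrence is sequential: a toy tanh RNN cell implementing $h_t = f(h_{t-1}, x_t)$, where the loop over positions cannot be parallelized. The cell and the weight shapes are illustrative assumptions, not the specific models cited above.

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b):
    """Toy tanh RNN: h_t = f(h_{t-1}, x_t).
    Each step needs the previous hidden state, so the loop
    over positions cannot be parallelized within an example."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x_seq:                      # sequential over positions
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)                # (seq_len, d_hidden)

# toy usage: 5 positions, input dim 8, hidden dim 16 (arbitrary sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
h = rnn_forward(x, rng.normal(size=(16, 16)) * 0.1,
                rng.normal(size=(16, 8)) * 0.1, np.zeros(16))
print(h.shape)  # (5, 16)
```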

CNN as building block
 Extended Neural GPU
 ByteNet
 ConvS2S

Attention as building block
 Attention model
 Allowing modeling of dependencies without regard to their distance in the input or output sequences
 Used in conjunction with a recurrent network
 Ref: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.“ arXiv preprint arXiv:1409.0473 (2014).
 Self attention
 a.k.a. intra-attention
 Attention mechanism relating different positions of a single sequence
 In order to compute a representation of the sequence
 Transformer
 Eschewing recurrence
 Relying entirely on a self-attention mechanism: no RNN or CNN
 Draw global dependencies between input and output
 SOTA translation quality after 12 hours with 8 P100 GPUs
 Highlight:
 RNNs already had the ability to predict the “next word to come”
 if given a “well-extracted summary of the previous word sequence”
 So, the focus is on how to “efficiently extract a summary of the previous word sequence”
 That’s what the Transformer and self-attention are all about
Model Architecture
 Encoder-decoder structure
 Encoder:
 Maps an input sequence of symbol representations $x = (x_1, x_2, \ldots, x_n)$
 to a sequence of continuous representations $z = (z_1, z_2, \ldots, z_n)$
 Decoder:
 Given $z = (z_1, z_2, \ldots, z_n)$, generates an output sequence $y = (y_1, y_2, \ldots, y_m)$ one element at a time
 Autoregressive
 Consumes the previously generated symbols as additional input when generating the next (see the decoding sketch below)
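
A minimal sketch of the autoregressive decoding loop described above; `encode`, `decode_step`, `bos_id`, and `eos_id` are hypothetical placeholders standing in for the encoder stack and one decoder pass, not functions from the paper or any specific library.

```python
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    """Autoregressive generation: each new symbol is predicted from the
    encoder output z and all previously generated symbols."""
    z = encode(src_tokens)            # z = (z_1, ..., z_n)
    ys = [bos_id]                     # previously generated symbols
    for _ in range(max_len):
        next_id = decode_step(z, ys)  # consumes everything generated so far
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys
```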
Encoder Stack
 Stack of N=6 identical layers
 Each layer with two sub-layers:
  1. Multi-head self-attention mechanism
  2. Position-wise fully connected feed-forward network
 Each sub-layer has a residual connection (Add) followed by layer normalization (Norm), as in the sketch below
 Sub-layers and embedding layers all produce outputs of the same dimension $d_{model}$
  To enable the residual connections
  $d_{model}$ = 512
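
A minimal sketch of the "Add & Norm" pattern, LayerNorm(x + Sublayer(x)), and how an encoder layer stacks its two sub-layers. `self_attn` and `feed_forward` are placeholder callables assumed to map $d_{model}$-dimensional inputs to $d_{model}$-dimensional outputs, which is what makes the residual addition possible.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize over the feature dimension (d_model); gain/bias omitted
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """Residual connection (Add) followed by layer normalization (Norm):
    LayerNorm(x + Sublayer(x)). Requires fn to preserve d_model."""
    return layer_norm(x + fn(x))

def encoder_layer(x, self_attn, feed_forward):
    """One encoder layer = self-attention sub-layer, then feed-forward sub-layer."""
    x = sublayer(x, self_attn)
    return sublayer(x, feed_forward)
```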
Decoder Stack
 Stack of N=6 identical layers
 Each layer with three sub-layers:
  1. Masked(!) multi-head self-attention mechanism
   Masking: prevents positions from attending to subsequent positions (see the mask sketch below)
   The predictions for position $i$ can depend only on the known outputs at positions less than $i$
  2. Multi-head encoder-decoder attention over the output of the encoder stack
   Input: $Q$ projected from the output of the previous masked self-attention sub-layer
   Input: $K$ and $V$ projected from the output of the encoder stack
  3. Position-wise fully connected feed-forward network
 Each sub-layer has a residual connection (Add) followed by layer normalization (Norm)
 Sub-layers and embedding layers all produce outputs of the same dimension $d_{model}$
  To enable the residual connections
  $d_{model}$ = 512
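
A small sketch of the subsequent-position mask used in the decoder's masked self-attention: position $i$ may attend only to positions $\le i$.

```python
import numpy as np

def subsequent_mask(size):
    """Boolean mask where entry (i, j) is True iff position i may attend
    to position j, i.e. j <= i (no attention to subsequent positions)."""
    return np.tril(np.ones((size, size), dtype=bool))

print(subsequent_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```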
Attention
 Attention function
 Maps a query and a set of key-value pairs to an output
 Output = weighted sum of the values, where each weight is computed by a compatibility function of the query with the corresponding key
Scaled Dot-Product Attention

Input: queries and keys of dimension $d_k$, and values of dimension $d_v$

Scaling factor $1/\sqrt{d_k}$:
 $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
 For large $d_k$, the dot products may grow large in magnitude
 Pushing the softmax into regions where its gradients are extremely small
 The scaling factor counteracts this effect
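
A minimal numpy sketch of scaled dot-product attention following the formula above; the optional boolean `mask` argument is an assumption added here so the same function can serve the decoder's masked self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block masked positions
    weights = softmax(scores, axis=-1)               # attention weights
    return weights @ V                               # weighted sum of the values
```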
Multi-Head Attention
 Linearly project $Q$, $K$ and $V$ $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively
 Then, perform the attention function in parallel on each projected version
 Yielding $d_v$-dimensional output values
 Concatenated into an $h \times d_v$-dimensional vector
 Once again linearly projected, resulting in the final values
 Benefit?:
 Jointly attend to information from different representation subspaces at different positions
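
A minimal sketch of multi-head attention, reusing the `scaled_dot_product_attention` function from the previous sketch; passing the per-head projection matrices as Python lists is purely an illustrative choice, not the paper's parameterization.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Project Q, K, V h times, attend in parallel, concatenate, project.
    Assumed shapes: W_q[i], W_k[i] are (d_model, d_k); W_v[i] is
    (d_model, d_v); W_o is (h * d_v, d_model)."""
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(h)]                  # each head: (n, d_v)
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)
```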
Applications of Attention in our Model
 Encoder-decoder attention layers
 The queries ($Q$) come from the previous decoder layer
 The memory keys ($K$) and values ($V$) come from the output of the encoder.
 Mimics the encoder-decoder attention mechanisms of sequence-to-sequence models such as
 Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
 Selfattention layers
 Encoder:
 Keys ($K$), values ($V$) and queries ($Q$) come from the output of the previous layer in the encoder.
 Each position can attend to all positions
 Decoder:
 Keys ($K$), values ($V$) and queries ($Q$) come from the output of the previous layer in the decoder.
 Each position can attend to all positions up to and including that position $\rightarrow$ Need Mask!
Position-wise Feed-Forward Networks
 Input/Output: $d_{model}$ = 512
 Inner layer: $d_{ff}$ = 2048
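
The position-wise feed-forward network is two linear transformations with a ReLU in between, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, applied to each position identically. A minimal sketch; the shapes in the comment are assumptions matching $d_{model} = 512$ and $d_{ff} = 2048$.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied position-wise.
    Assumed shapes: x (n, 512), W1 (512, 2048), b1 (2048,),
    W2 (2048, 512), b2 (512,)."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```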
Embeddings and Softmax
 Learned embeddings convert the input tokens and output tokens to vectors of dimension $d_{model}$
 Used learned linear transformation and softmax function
 to convert the decoder output to predicted nexttoken probabilities
 The two embedding layers and the pre-softmax linear transformation share the same weight matrix
Positional Encoding
 Must inject some information about
 the relative or
 absolute position of the tokens in the sequence
 Sine and cosine functions of different frequencies
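
A minimal sketch of the sinusoidal positional encoding, $PE_{(pos, 2i)} = \sin\!\big(pos / 10000^{2i/d_{model}}\big)$ and $PE_{(pos, 2i+1)} = \cos\!\big(pos / 10000^{2i/d_{model}}\big)$; the encoding is added to the embeddings at the bottoms of the encoder and decoder stacks.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe
```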
Why Self-Attention
 Lower per-layer complexity when the sequence length $n$ is smaller than the representation dimension $d$ ($O(n^2 \cdot d)$ vs. $O(n \cdot d^2)$ for recurrence)
 Better parallelism: a constant number of sequential operations per layer, vs. $O(n)$ for recurrence
 Shorter maximum path length between long-range dependencies in the network
 More interpretable models
 Different heads for different tasks
Training
 Dataset:
 WMT 2014 English-German dataset
  ~4.5 million sentence pairs
  Encoded using byte-pair encoding (BPE)
  Shared source-target vocabulary of ~37,000 tokens
 WMT 2014 English-French dataset
  36M sentence pairs
  Tokens split into a 32,000 word-piece vocabulary
 Batch:
 a set of sentence pairs containing ~25000 source tokens and 25000 target tokens
 Hardware:
 8 P100 GPUs
 Base model
 Each training step ~ 0.4 second
 100,000 steps ~ 12 hours
 Big model
 ~ 1.0 seconds per step
 300,000 steps ~ 3.5 days
 Optimizer
 Adam optimizer
 warmup_steps = 4000 (learning-rate schedule sketched after this list)
 Regularization
 Residual dropout ($P_{drop} = 0.1$ for the base model)
 Label smoothing ($\epsilon_{ls} = 0.1$)
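
A minimal sketch of the warmup-then-decay learning-rate schedule used with Adam, $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$, assuming the base model's $d_{model} = 512$ and warmup_steps = 4000; the function name `transformer_lr` is illustrative.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for the first warmup_steps steps,
    then decay proportional to the inverse square root of the step number."""
    step = max(step, 1)  # avoid step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(4000))  # peak learning rate, reached at the end of warmup
```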
Results
 Machine translation
 WMT 2014 English-to-German translation task:
  28.4 BLEU, more than 2.0 BLEU above the previous state of the art (including ensembles)
 WMT 2014 English-to-French translation task:
  41.0 BLEU, outperforming all previously published single models at less than 1/4 of their training cost
 Model variations (ablations) evaluated on the English-to-German translation task
Reference
 Vaswani, Ashish, et al. “Attention is all you need.“ Advances in neural information processing systems 30 (2017).
 Github repository by the authors
 The Illustrated Transformer by Jay Alammar
 Transformers from Scratch by Peter Bloem
 Transformer (Attention Is All You Need) by Minseok Heo
 GPT3 Architecture
 Efficient Estimation of Word Representations in Vector Space