Attention is All You Need
Introduction and Background

Previous work on language modeling and machine translation
 Recurrent neural networks (RNN)
 Long short-term memory (LSTM)
 Gated recurrent neural networks (GRU)
 Recurrent models
 Align the positions with steps in computation time
 Generate a sequence of hidden states $h_t$
 $h_t = f(h_{t-1}, x_t)$: a function of the previous hidden state $h_{t-1}$ and the input for position $t$
 Precludes parallelization within training examples (see the recurrence sketch below) :(
 Reducing sequential computation for sequential data!
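
A minimal numpy sketch of why recurrence is sequential: a toy tanh RNN cell implementing $h_t = f(h_{t-1}, x_t)$, where the loop over positions cannot be parallelized. The cell and the weight shapes are illustrative assumptions, not the specific models cited above.

```python
import numpy as np

def rnn_forward(x_seq, W_h, W_x, b):
    """Toy tanh RNN: h_t = f(h_{t-1}, x_t).
    Each step needs the previous hidden state, so the loop
    over positions cannot be parallelized within an example."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x_seq:                      # sequential over positions
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)                # (seq_len, d_hidden)

# toy usage: 5 positions, input dim 8, hidden dim 16 (arbitrary sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
h = rnn_forward(x, rng.normal(size=(16, 16)) * 0.1,
                rng.normal(size=(16, 8)) * 0.1, np.zeros(16))
print(h.shape)  # (5, 16)
```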

CNN as building block
 Extended Neural GPU
 ByteNet
 ConvS2S

Attention as building block
 Attention model
 Allowing modeling of dependencies without regard to their distance in the input or output sequences
 Used in conjunction with a recurrent network
 Ref: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.“ arXiv preprint arXiv:1409.0473 (2014).
 Self attention
 a.k.a. intra-attention
 Attention mechanism relating different positions of a single sequence
 In order to compute a representation of the sequence
 Transformer
 Eschewing recurrence
 Relying entirely on a self-attention mechanism: no RNN or CNN
 Draw global dependencies between input and output
 SOTA translation quality after 12 hours with 8 P100 GPUs
 Highlight:
 RNNs already had the ability to predict the “next word to come”
 if given a “well-extracted summary of the previous word sequence”
 So, the focus is on how to “efficiently extract a summary of the previous word sequence”
 That’s what the Transformer and self-attention are all about
Model Architecture
 Encoder-decoder structure
 Encoder:
 Maps an input sequence of symbol representations $x = (x_1, x_2, \ldots, x_n)$
 to a sequence of continuous representations $z = (z_1, z_2, \ldots, z_n)$
 Decoder:
 Given $z = (z_1, z_2, \ldots, z_n)$, generates an output sequence $y = (y_1, y_2, \ldots, y_m)$ one element at a time
 Autoregressive
 Consumes the previously generated symbols as additional input when generating the next (see the decoding sketch below)
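
A minimal sketch of the autoregressive decoding loop described above; `encode`, `decode_step`, `bos_id`, and `eos_id` are hypothetical placeholders standing in for the encoder stack and one decoder pass, not functions from the paper or any specific library.

```python
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    """Autoregressive generation: each new symbol is predicted from the
    encoder output z and all previously generated symbols."""
    z = encode(src_tokens)            # z = (z_1, ..., z_n)
    ys = [bos_id]                     # previously generated symbols
    for _ in range(max_len):
        next_id = decode_step(z, ys)  # consumes everything generated so far
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys
```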
Encoder Stack
 Stack of N=6 identical layers
 Each layer with two sub-layers:
  1. Multi-head self-attention mechanism
  2. Position-wise fully connected feed-forward network
 Each sub-layer has a residual connection (Add) followed by layer normalization (Norm), as in the sketch below
 Sub-layers and embedding layers all produce outputs of the same dimension $d_{model}$
  To enable the residual connections
  $d_{model}$ = 512
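
A minimal sketch of the "Add & Norm" pattern, LayerNorm(x + Sublayer(x)), and how an encoder layer stacks its two sub-layers. `self_attn` and `feed_forward` are placeholder callables assumed to map $d_{model}$-dimensional inputs to $d_{model}$-dimensional outputs, which is what makes the residual addition possible.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize over the feature dimension (d_model); gain/bias omitted
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """Residual connection (Add) followed by layer normalization (Norm):
    LayerNorm(x + Sublayer(x)). Requires fn to preserve d_model."""
    return layer_norm(x + fn(x))

def encoder_layer(x, self_attn, feed_forward):
    """One encoder layer = self-attention sub-layer, then feed-forward sub-layer."""
    x = sublayer(x, self_attn)
    return sublayer(x, feed_forward)
```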
Decoder Stack
 Stack of N=6 identical layers
 Each layer with three sub-layers:
  1. Masked(!) multi-head self-attention mechanism
   Masking: prevents positions from attending to subsequent positions (see the mask sketch below)
   The predictions for position $i$ can depend only on the known outputs at positions less than $i$
  2. Multi-head encoder-decoder attention over the output of the encoder stack
   Input: $Q$ projected from the output of the previous masked self-attention sub-layer
   Input: $K$ and $V$ projected from the output of the encoder stack
  3. Position-wise fully connected feed-forward network
 Each sub-layer has a residual connection (Add) followed by layer normalization (Norm)
 Sub-layers and embedding layers all produce outputs of the same dimension $d_{model}$
  To enable the residual connections
  $d_{model}$ = 512
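
A small sketch of the subsequent-position mask used in the decoder's masked self-attention: position $i$ may attend only to positions $\le i$.

```python
import numpy as np

def subsequent_mask(size):
    """Boolean mask where entry (i, j) is True iff position i may attend
    to position j, i.e. j <= i (no attention to subsequent positions)."""
    return np.tril(np.ones((size, size), dtype=bool))

print(subsequent_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```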
Attention
 Attention function
 Maps a query and a set of key-value pairs to an output
 Output = weighted sum of the values, where each weight is computed by a compatibility function of the query with the corresponding key
Scaled Dot-Product Attention

Input: queries and keys of dimension $d_k$, and values of dimension $d_v$

Scaling factor $1/\sqrt{d_k}$:
 $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
 For large $d_k$, the dot products may grow large in magnitude
 Pushing the softmax into regions where its gradients are extremely small
 The scaling factor counteracts this effect
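
A minimal numpy sketch of scaled dot-product attention following the formula above; the optional boolean `mask` argument is an assumption added here so the same function can serve the decoder's masked self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block masked positions
    weights = softmax(scores, axis=-1)               # attention weights
    return weights @ V                               # weighted sum of the values
```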
Multi-Head Attention
 Linearly project $Q$, $K$ and $V$ $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively
 Then, perform the attention function in parallel on each projected version
 Yielding $d_v$-dimensional output values
 Concatenated into an $h \times d_v$-dimensional vector
 Once again linearly projected, resulting in the final values
 Benefit?:
 Jointly attend to information from different representation subspaces at different positions
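
A minimal sketch of multi-head attention, reusing the `scaled_dot_product_attention` function from the previous sketch; passing the per-head projection matrices as Python lists is purely an illustrative choice, not the paper's parameterization.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Project Q, K, V h times, attend in parallel, concatenate, project.
    Assumed shapes: W_q[i], W_k[i] are (d_model, d_k); W_v[i] is
    (d_model, d_v); W_o is (h * d_v, d_model)."""
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(h)]                  # each head: (n, d_v)
    return np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)
```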
Applications of Attention in our Model
 Encoder-decoder attention layers
 The queries ($Q$) come from the previous decoder layer
 The memory keys ($K$) and values ($V$) come from the output of the encoder.
 Mimics the encoder-decoder attention mechanisms of sequence-to-sequence models such as
 Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
 Selfattention layers
 Encoder:
 Keys ($K$), values ($V$) and queries ($Q$) come from the output of the previous layer in the encoder.
 Each position can attend to all positions
 Decoder:
 Keys ($K$), values ($V$) and queries ($Q$) come from the output of the previous layer in the decoder.
 Each position can attend to all positions up to and including that position $\rightarrow$ Need Mask!
Position-wise Feed-Forward Networks
 Input/Output: $d_{model}$ = 512
 Inner layer: $d_{ff}$ = 2048
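
The position-wise feed-forward network is two linear transformations with a ReLU in between, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, applied to each position identically. A minimal sketch; the shapes in the comment are assumptions matching $d_{model} = 512$ and $d_{ff} = 2048$.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied position-wise.
    Assumed shapes: x (n, 512), W1 (512, 2048), b1 (2048,),
    W2 (2048, 512), b2 (512,)."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```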
Embeddings and Softmax
 Learned embeddings convert the input tokens and output tokens to vectors of dimension $d_{model}$
 Used learned linear transformation and softmax function
 to convert the decoder output to predicted nexttoken probabilities
 The two embedding layers and the pre-softmax linear transformation share the same weight matrix
Positional Encoding
 Must inject some information about
 the relative or
 absolute position of the tokens in the sequence
 Sine and cosine functions of different frequencies
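
A minimal sketch of the sinusoidal positional encoding, $PE_{(pos, 2i)} = \sin\!\big(pos / 10000^{2i/d_{model}}\big)$ and $PE_{(pos, 2i+1)} = \cos\!\big(pos / 10000^{2i/d_{model}}\big)$; the encoding is added to the embeddings at the bottoms of the encoder and decoder stacks.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe
```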
Why Self-Attention
 Lower per-layer complexity when the sequence length $n$ is smaller than the representation dimension $d$ ($O(n^2 \cdot d)$ vs. $O(n \cdot d^2)$ for recurrence)
 Better parallelism: a constant number of sequential operations per layer, vs. $O(n)$ for recurrence
 Shorter maximum path length between long-range dependencies in the network
 More interpretable models
 Different heads for different tasks
Training
 Dataset:
 WMT 2014 English-German dataset
  ~4.5 million sentence pairs
  Encoded using byte-pair encoding (BPE)
  Shared source-target vocabulary of ~37,000 tokens
 WMT 2014 English-French dataset
  36M sentence pairs
  Tokens split into a 32,000 word-piece vocabulary
 Batch:
 a set of sentence pairs containing ~25000 source tokens and 25000 target tokens
 Hardware:
 8 P100 GPUs
 Base model
 Each training step ~ 0.4 second
 100,000 steps ~ 12 hours
 Big model
 ~ 1.0 seconds per step
 300,000 steps ~ 3.5 days
 Optimizer
 Adam optimizer
 warmup_steps = 4000 (learning-rate schedule sketched after this list)
 Regularization
 Residual dropout ($P_{drop} = 0.1$ for the base model)
 Label smoothing ($\epsilon_{ls} = 0.1$)
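
A minimal sketch of the warmup-then-decay learning-rate schedule used with Adam, $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$, assuming the base model's $d_{model} = 512$ and warmup_steps = 4000; the function name `transformer_lr` is illustrative.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for the first warmup_steps steps,
    then decay proportional to the inverse square root of the step number."""
    step = max(step, 1)  # avoid step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(4000))  # peak learning rate, reached at the end of warmup
```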
Results
 Machine translation
 WMT 2014 English-to-German translation task:
  28.4 BLEU, more than 2.0 BLEU above the previous state of the art (including ensembles)
 WMT 2014 English-to-French translation task:
  41.0 BLEU, outperforming all previously published single models at less than 1/4 of their training cost
 Model variations (ablations) evaluated on the English-to-German translation task
Reference
 Vaswani, Ashish, et al. “Attention is all you need.“ Advances in neural information processing systems 30 (2017).
 Github repository by the authors
 The Illustrated Transformer by Jay Alammar
 Transformers from Scratch by Peter Bloem
 Transformer (Attention Is All You Need) by Minseok Heo
 GPT3 Architecture
 Efficient Estimation of Word Representations in Vector Space