GPT-3 Architecture

  • What is GPT-3 and its architecture? Answer from ChatGPT, one of the children of GPT-3:
    • GPT-3 (Generative Pre-trained Transformer 3) is a language model developed by OpenAI, which is designed to generate human-like text based on its training data. It is one of the largest and most powerful language models created to date, with 175 billion parameters.
    • GPT-3 uses a neural network architecture called a transformer, which enables it to process and understand language in a way that is similar to how humans do. This allows it to perform a variety of natural language processing (NLP) tasks, such as language translation, text summarization, and question-answering, among others.
    • GPT-3 has received a lot of attention due to its ability to generate highly convincing and coherent text, which has many potential applications in areas such as content creation, chatbots, and virtual assistants. However, it has also raised concerns about the potential misuse of such powerful language models, particularly in the areas of fake news and propaganda.
  • GPT as a Child of the Transformer?
    • Basically, a hugely scaled-up decoder stack of the Transformer introduced in “Attention Is All You Need”
    • Like traditional language models, GPT outputs one token at a time.

GPT-2 vs BERT vs Transformer-XL, by Jay Alammar

GPT-2 generates one token at a time, by Jay Alammar

High Level Description of GPT3

  • GPT-2 = a stack of 12 decoder blocks, each with 12 attention heads
  • GPT-3 = a stack of 96 decoder blocks, each with 96 attention heads

Architectures of GPT and Transformer

Input/Output

Inputs and outputs of language models
  • Input: A sequence of N words (a.k.a. tokens).
    • Up to 2048 tokens for GPT-3
    • During generation, one new input token at a time
  • Output: A guess for the word most likely to come next after the input sequence.

Sequence of language model outputs
  • For GPT-3, the output is not just a single guess; it’s a sequence (length 2048) of guesses, a probability for each likely word at every position

Input/Output as a sequence in GPT3

Encoding of Word Sequence

  • As an ML algorithm, GPT-3 operates on vectors of numbers
  • How do we convert words into vectors?
  • GPT-3’s vocabulary = 50257 words
  • One-hot encoding: one word = a 50257-dim vector with a single element set to one and all the other elements zero

One-hot encoding
  • 2048 words $\rightarrow$ a 2048x50257 binary matrix

A sentence to a binary matrix using one-hot encoding
  • GPT-3 uses byte-level Byte Pair Encoding (BPE) tokenization for efficiency
    • Word as token $\rightarrow$ Groups of characters (sub-words of a word) as token
    • Check OpenAI’s tokenizer tool
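A quick way to see byte-level BPE in action is OpenAI’s tiktoken package; its “gpt2” encoding carries the same 50257-token vocabulary that GPT-3 uses. This sketch is an illustration, not something from the original notes:

```python
# Sketch: inspecting byte-level BPE with the tiktoken package
# (the "gpt2" encoding shares GPT-3's 50257-token vocabulary).
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                        # 50257
tokens = enc.encode("Tokenization splits rare words into sub-words.")
print(tokens)                             # one integer id per BPE piece
print([enc.decode([t]) for t in tokens])  # the sub-word pieces themselves
```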

Embedding of Encoded Word Sequence

  • A 50257-dim vector is too big!
  • Let’s make it shorter!
  • We need to learn an embedding function

Case when embedding dimension = 2
  • GPT-3 uses 12288 dimensions as its embedding dimension!
  • How to embed?
    • A 2048-word sequence over a vocabulary of 50257 words
    • $\rightarrow$ 2048x50257 sequence-encoding matrix $A_E$
    • $\rightarrow$ Multiply by the 50257x12288 embedding-weights matrix (learned parameters!): $A_E\times W_E$
    • $\rightarrow$ 2048x12288 sequence-embedding matrix
  • About the matrix multiplication $A_E\times W_E = X_{TokenEmbedding}$
    • Not actually a multiplication
    • Just a simple table lookup (see the sketch below)
    • Steps:
      • Get the index $j$ of the non-zero element of the $i$-th row of the encoding matrix $A_E$
      • Find and fetch the $j$-th row of the weight matrix $W_E$
      • Put it in the $i$-th row of the embedding matrix $X_{TokenEmbedding}$
  • Q: What if embeddings are distributed across the system?
  • Q: When is the 50257x12288 Embedding-weights matrix trained? Is it pretrained?

Sequence-encoding matrix to Sequence-embedding matrix
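A minimal numpy sketch of this lookup equivalence; the tiny sizes and names like `W_E` are illustrative stand-ins for GPT-3’s 50257x12288 matrix:

```python
# Toy sketch: multiplying a one-hot row by W_E equals fetching a row of W_E.
# Tiny sizes stand in for GPT-3's vocab=50257, d_model=12288, seq_len=2048.
import numpy as np

vocab_size, d_model, seq_len = 10, 4, 3
W_E = np.random.randn(vocab_size, d_model)   # learned embedding weights
token_ids = np.array([7, 2, 2])              # the sequence as vocabulary indices

# Route 1: explicit one-hot encoding, then the matrix product A_E x W_E
A_E = np.zeros((seq_len, vocab_size))
A_E[np.arange(seq_len), token_ids] = 1.0
X_mul = A_E @ W_E                            # (3x10) @ (10x4) -> (3x4)

# Route 2: the table lookup that implementations actually perform
X_lookup = W_E[token_ids]                    # fetch row j for row i's token

assert np.allclose(X_mul, X_lookup)          # identical results
```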

Positional Encoding

The position information of the $i$-th word in the 2048-word sequence is fed to 12288 sinusoidal functions with different frequencies and converted into a 12288-dimensional position vector
  • 2048 word positions $\rightarrow$ 12288 sinusoidal functions, each with a different frequency
  • Since the number of sine frequencies exceeds the number of words, each position gets a unique position vector that also carries relative positional information, much like a Fourier representation does.
  • So, this vector can represent
    • the unique position of each word
    • periodic relative relations among words
  • Now, we have the positional-encoding matrix $X_{PositionalEncoding}$ of size (2048 $\times$ 12288); a sketch of this scheme follows below

2048 word information to 2048x12288 positional-encoding matrix
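A minimal numpy sketch of the sinusoidal scheme described above, following the original Transformer recipe $\sin(pos/10000^{2i/d})$, $\cos(pos/10000^{2i/d})$. (Note: GPT-2/GPT-3 actually learn their position embeddings; this is the conceptual version the notes describe.)

```python
# Sketch of sinusoidal positional encoding: position pos and channel pair i
# map to sin(pos / 10000**(2i/d_model)) and cos(pos / 10000**(2i/d_model)).
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even channels: sine
    pe[:, 1::2] = np.cos(angles)              # odd channels: cosine
    return pe

pe_toy = positional_encoding(8, 6)            # tiny demo
# In GPT-3 terms: positional_encoding(2048, 12288) -> X_PositionalEncoding
```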

Final Embedding Matrix

Sequence-Embedding matrix + Positional-Encoding matrix = Final embedding matrix
  • Final embedding matrix (2048x12288) $X$
    • = Sequence-Embedding matrix (2048x12288) $X_{TokenEmbedding}$
    • + Positional-Encoding matrix (2048x12288) $X_{PositionalEncoding}$

Attention (Simplified)

  • What is attention?
    • An operation that predicts, for each output in the sequence, which input tokens to focus on and how important each one is
  • Attention toy model
    • A sequence of 3 tokens (2048 in GPT-3) with 512-dimensional embeddings (12288 in GPT-3)

Linear projection of input tokens to Q, K, and V
  • Step 1: Make $Q$, $K$, $V$ using 3 linear projection matrices $W_q, W_k, W_v$
    • Learn 3 linear projections $W_q, W_k, W_v$
      • 512 (embedding dimension; 12288 in GPT-3) x 64 ($QKV$’s columns; 128 in GPT-3)
    • Apply them to the sequence embeddings
    • Generate 3 different matrices $Q$, $K$, and $V$
      • $Q$: Queries = $X\times W_q$
      • $K$: Keys = $X\times W_k$
      • $V$: Values = $X\times W_v$

Importance matrix: $QK^T$ followed by softmax
  • Step2: Importance matrix
    • Multiply queries and keys
    • $Q_{(3\times 64)}\times K^T_{(64\times 3)}$
      • In GPT3: $Q_{(2048\times 128)}\times K^T_{(128\times 2048)}$
    • Apply the softmax function row-wise to the resulting 3x3 matrix
    • $Softmax~(Q_{(3\times 64)}\times K^T_{(64\times 3)})_{(3\times 3)}$
    • Meaning?
      • This is the only place where different tokens interact with each other
      • It tells how important each token is to every other token

Adjust values with the importance matrix
  • Step3: Make weighted values $Y$
    • Multiply the importance matrix to the values
    • $Y = Softmax(Q_{(3\times 64)}\times K^T_{(64\times 3)})_{(3\times 3)}\times V_{(3\times 64)}$
  • Computation steps during generation: cache the previous $K$ and $V$ history!!
    1. The $i$-th row vector $x_i$ of the matrix $X$, with 12288 dimensions, is fed as input
    2. $Q(x_i)$, $K(x_i)$, $V(x_i)$ are calculated and used to make the output vector $x_{i+1}$
    3. Then, you feed $x_{i+1}$ to the decoder stack for the next output $x_{i+2}$.
      1. The role of the query $Q(x_i)$ is completed, so remove $Q(x_i)$
      2. Due to masked self-attention, all the previous $K(x_j)$ and $V(x_j)$ with $j\le i+1$, including $K(x_i)$ and $V(x_i)$, are reused in the next decoding steps
      3. So, all the previous $K(x_j)$ and $V(x_j)$ vectors should be stored in memory until the end of sentence generation!! (See the sketch below.)
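A minimal numpy sketch of Steps 1-3 on the toy sizes above, with the causal mask made explicit. The $\sqrt{d}$ scaling is part of the standard Transformer formula even though the notes omit it:

```python
# Toy masked self-attention: 3 tokens, 512-dim embeddings, 64-dim Q/K/V
# (2048 / 12288 / 128 in GPT-3).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, d_head = 3, 512, 64
X = np.random.randn(n, d_model)

# Step 1: three learned projections give Q, K, V
W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each (3, 64)

# Step 2: importance matrix (the standard formula also scales by sqrt(d_head))
scores = Q @ K.T / np.sqrt(d_head)         # (3, 3)
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[causal_mask] = -np.inf              # masked self-attention: no peeking ahead
importance = softmax(scores)               # row-wise softmax

# Step 3: weighted values
Y = importance @ V                         # (3, 64)

# During generation only the newest query row is new; all previous K and V
# rows are reused -- exactly what the K/V cache in the steps above stores.
```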

Multi-Head Attention

  • GPT-3 uses 96 attention heads, each with different $W_q, W_k, W_v$ projection weights
    • So, 96 x 3 x (embedding dim) x ($QKV$’s columns) parameters from the linear projections

Multi-Head Attention in GPT-3
  • Use the results $Y$ of multi-head attention as input again!
    • Each $Y$ = 2048 (tokens) $\times$ 128 ($QKV$’s columns)
    • Concatenate the 96 $Y$s: 96 $\times$ 128 = 12288 columns
    • So, now you have 2048 (tokens) $\times$ 12288 (embedding dim)
  • Does GPT-3 use sparse attention? (The GPT-3 paper says it alternates dense and locally banded sparse attention patterns.)
  • Is $QKV$ calculated with one big weight matrix and then split? (See the sketch below.)
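A minimal numpy sketch of both points under assumed toy sizes: the per-head outputs concatenate back to the embedding width, and $Q$, $K$, $V$ for all heads can come from one big weight matrix that is split afterwards. The big-matrix trick is a common implementation choice, not something the notes confirm for GPT-3:

```python
# Sketch: n_heads x d_head columns concatenate back to the embedding width,
# and one big projection can produce Q, K, V for all heads at once.
import numpy as np

n_heads, d_head, seq = 4, 8, 5                 # tiny stand-ins for 96, 128, 2048
d_model = n_heads * d_head                     # 12288 in GPT-3

X = np.random.randn(seq, d_model)
W_qkv = np.random.randn(d_model, 3 * d_model)  # one big weight matrix
Q, K, V = np.split(X @ W_qkv, 3, axis=-1)      # split into (seq, d_model) each
Qh = Q.reshape(seq, n_heads, d_head).transpose(1, 0, 2)  # per-head (heads, seq, d_head)

# ...each head attends independently, yielding Y_h of shape (seq, d_head)...
Ys = [np.random.randn(seq, d_head) for _ in range(n_heads)]  # stand-in head outputs
Y = np.concatenate(Ys, axis=-1)                # (seq, heads*d_head): 2048 x 12288 in GPT-3
assert Y.shape == (seq, d_model)
```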

Feed Forward

  • 1 Hidden layer
Feed Forward Layer
  • Input and output = (2048 $\times$ 12288), i.e., 12288 input/output neurons per token
  • Hidden layer = 49152 hidden neurons (= 4 $\times$ 12288); see the sketch below
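A minimal sketch of this position-wise block under toy sizes. GPT-family models use a GELU activation here, approximated below with the usual tanh formula:

```python
# Position-wise feed-forward block: d_model -> 4*d_model -> d_model,
# applied to every token row independently.
import numpy as np

d_model, d_hidden = 8, 32        # tiny stand-ins for 12288 and 49152

W1, b1 = np.random.randn(d_model, d_hidden), np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model), np.zeros(d_model)

def gelu(z):                     # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def feed_forward(x):             # x: (seq, d_model)
    return gelu(x @ W1 + b1) @ W2 + b2

out = feed_forward(np.random.randn(5, d_model))   # shape preserved: (5, 8)
```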

Add & Norm

  • After the Multi-Head Attention and the Feed Forward blocks
  • Input + Output $\rightarrow$ Normalize (residual connection, as in ResNet)
  • From the GPT-2 paper: “Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network and an additional layer normalization was added after the final self-attention block”
Add and Norm

Normalization methods: Layer-norm normalizes across all neurons per sample
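A minimal numpy sketch of the pre-LN residual pattern quoted above: layer-norm averages over the 12288 features of each token, and the sublayer output is added back onto its input (learned gain and bias omitted for brevity):

```python
# Pre-LN residual block: normalize per token across features, run the
# sublayer (attention or feed-forward), then add the result back.
import numpy as np

def layer_norm(x, eps=1e-5):                 # x: (seq, d_model)
    mu = x.mean(axis=-1, keepdims=True)      # per-token mean over all features
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, sublayer):
    return x + sublayer(layer_norm(x))       # residual "Add" around the sublayer

x = np.random.randn(5, 8)
y = pre_ln_block(x, lambda h: h * 0.5)       # toy sublayer stand-in
```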

Decoding

  • The output from the multi-head attention layer $\rightarrow$ Input for decoding
  • Decoding inverts
    • the 12288-dim embedding vector back
    • to a 50257-dim word vector (i.e., the vocabulary size)
    • filled with the likelihood of each word in the vocabulary
  • Given the probability matrix, top-$k$ sampling synthesizes the final sentence output! (See the sketch below.)
Simplified illustration of decoding
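A minimal numpy sketch of this step under toy sizes. Reusing the transposed embedding matrix $W_E$ for the projection (weight tying) is a common GPT-family choice, assumed here rather than stated in the notes:

```python
# Decoding: project d_model-dim outputs back to vocabulary-sized logits,
# then sample from the top-k most likely tokens.
import numpy as np

d_model, vocab = 8, 20                       # tiny stand-ins for 12288 / 50257
W_E = np.random.randn(vocab, d_model)        # embedding weights, reused transposed

def decode_next_token(h_last, k=5, rng=np.random.default_rng()):
    logits = h_last @ W_E.T                  # (vocab,) likelihood scores
    top = np.argsort(logits)[-k:]            # keep the k best candidates
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()                             # softmax over the top-k only
    return rng.choice(top, p=p)              # sampled token id

next_id = decode_next_token(np.random.randn(d_model))
```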

Final results!

Full Architecture

Sequence Encoding and Embedding
  • 96 Multi-heads
Multi-head Attention

After multi-head attention, Add/Norm, Feed Forward, and Add/Norm
  • Decoding
Decoding of the final result from the decoder stack
  • Full architecture in one picture
Full architecture of GPT-3

Application: Summarization of an Article

Machine translation

  • From English to French
Machine translation training example, by Jay Alammar

Summarization

  • If you train GPT on Wikipedia pages (which contain both articles and their summaries),
Wikipedia summary example, by Jay Alammar
  • the model can generate a “summary of a given article”!
Summarization training example, by Jay Alammar

