GPT for All
A fresh look

Caution: This article uses a stream-of-consciousness approach that emphasizes intuition and repeated explanation. Written pieces often leave subtle, lingering questions unaddressed; I try to clear those up through repetition. For a quick overview, each section concludes with a "Recap" and a GPT-4-generated summary. Though GPT-4 is better than most humans at dense summarization, I recommend not relying on it alone for grasping new concepts. I've verified the accuracy of each summary.

Even though the transformer is used to generate images in VQ-GAN, it's helpful to understand it on its own. Transformers have by far the most quality material online in the NLP domain, so presenting them first in this context is a worthwhile move. It is much more difficult to understand transformers in the image domain and transfer that understanding to the NLP domain than vice versa.

Basics of Language Modeling

You should first understand the concept of language modeling generally rather than just in the GPT context. Although no other language models have come close to GPTs in terms of performing well at a large scale, it's helpful to understand that the theoretical language modeling framework is more general than just the transformer. Thinking of architectures at different levels of abstraction depending on the context is helpful. For now, don't think about GPTs, think language models.

Auto-regressive Generation

During generation, language models take in a sequence of words as context and use that to auto-regressively generate a convincing continuation of the sentence. Auto-regressive means that we have some initial context in the form of a sequence of words and we use that context to predict the next word. We add that word back onto the original sentence and feed it back into the language model and have it try to generate the next word with the new sentence as input. This is the "generating" side in the image below. GPTs are performing a mapping from `sequence of words -> word`. You can think of it like any other function `f` that takes in an input `x` and maps it to `y`, but a much, much longer formula.
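The auto-regressive loop can be sketched in a few lines of Python. The `next_word` function here is a stand-in for a language model (a hard-coded lookup I've made up, not a real model):

```python
def next_word(context):
    # Stand-in for a language model: a hard-coded lookup instead of a
    # learned `sequence of words -> word` mapping.
    table = {"I": "am", "am": "self", "self": "aware"}
    return table.get(context[-1], "<end>")

def generate(context, num_steps):
    # Auto-regressive generation: predict a word, append it to the
    # context, and feed the extended context back in.
    for _ in range(num_steps):
        context = context + [next_word(context)]
    return context

print(generate(["I"], 3))  # ['I', 'am', 'self', 'aware']
```

A real model replaces the lookup table with a learned function, but the loop around it is exactly this.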

Training


But, if we don't want random outputs, we first need to train the model. When the model is learning to generate sentences (training), it performs a `sequence of words(x) -> sequence of words(y)` mapping, not just `sequence of words(x) -> word(y)`. Let's say we have the sentence `I am self aware`. When we break it into a sequence to be fed into the language model, the string is split into `["I", "am", "self", "aware"]`. The input sequence for one training sample is `["I", "am", "self"]` and the true output sequence is `["am", "self", "aware"]`. The predicted sequence of words `y` is compared to the true output sequence to see how well the model did. The loss used to compare the predictions to the truth is known as cross-entropy loss. I will give the details of this later; for now, just know that the model is trying to minimize this loss number through backpropagation. Models like ChatGPT get their intelligence from a huge amount of this training, along with some supervised fine-tuning and RLHF as a cherry on top.

Now I need to make an important observation. I think this fact goes over many people's heads when they first learn about GPTs, but it is critical to understanding how they work. The input `x` and output `y` are the same length, so it's really like we're doing a one to one mapping. Going back to the previous example, `I` is being mapped to `am`, `am` is being mapped to `self`, and so on. GPTs, at their core, are performing a one-to-one mapping.
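The one-to-one mapping is easy to see if you line the shifted sequences up:

```python
tokens = ["I", "am", "self", "aware"]
x = tokens[:-1]          # input sequence
y = tokens[1:]           # target: the input shifted one step left
pairs = list(zip(x, y))  # the per-position one-to-one mapping
print(pairs)  # [('I', 'am'), ('am', 'self'), ('self', 'aware')]
```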

I was confused by this part of GPTs at first because it seemed to me that the model would just be able to peek ahead at the answer. For predicting `am` from `I`, couldn't the model simply take the value `am` from ahead of it in the input and trivially find the answer? In that case, the only place where the model would actually have to make a prediction is the last word in `x`. The one-to-one mapping is the reason why the model can't do this, which leads me to a fundamentally new way of looking at GPTs. They're very different from typical deep models, where the entire input sequence is processed with information from the entire sequence at once. In GPTs, each token in the sequence is truly processed independently, but it has the ability to perform simple read/write operations between itself and the previous tokens in the sequence. There's a lot of parallel, similar computation happening across tokens individually, with reading and writing happening between tokens. Surprisingly, this simple operation allows for huge scalability, which people didn't expect before OpenAI's GPT series. You'll soon see how the reading/writing is done (answer: attention), but for now just keep that picture clear in your mind.

A nice consequence of the model predicting token-by-token is that training is extremely efficient. Every time we pass in an input sequence with `T` tokens, we get `T` predictions and `T` opportunities for the model to learn.

When I said previously that, during generation, language models do a `sequence of words -> word` mapping, that wasn't entirely true. The model is still doing `sequence of words(x) -> sequence of words(y)`; we just only care about the last word in the output sequence `y`. The rest are discarded, because the model's predictions for the earlier positions aren't needed in the context of auto-regressive sampling. This is an unfortunate consequence for the efficiency of generation, but it's a worthy tradeoff for the efficiency of training.

Recap


The transformer has two different phases: training and generation. When training, the model maps `sequence -> sequence` and compares its prediction at every single position to the true, shifted-over version of the input. In order to make these predictions for each word in the sequence, the model uses that word and the words before it. At the position `self` in the input `I am self aware`, the model is mapping `self` to `aware` using the context of `I am self`. The context is mixed in using simple reading and writing operations between tokens, with all processing happening independently token by token.

I've left a lot of details intentionally vague so far, but most of the questions you have will hopefully be answered as we tackle the following two questions:

  1. How is text represented as numbers by the model?
  2. How does the model include previous words (not just the current one) as context?

I'll start with the first and tackle the next one after, since the first is more fundamental to language modeling generally, while the second is more GPT-specific.

Text to numbers and back


I've mentioned mapping words to other words so far, but obviously this is a big abstraction. No deep learning model can actually take in a string directly as input and process that through a neural network. Instead, we need an internal representation for language. The solution? Tokenization and embeddings.

Tokenization

Tokenization refers to transforming an input string into a discrete sequence of word-tokens. Typically, tokenization is a two-step process:

  1. Build a vocabulary from the initial training corpus
  2. Use that vocabulary to break up the input string into a sequence of tokens

Before breaking up a sentence into a sequence of tokens, we need to identify the list of possible tokens, or our "vocabulary". There are many ways to do this:

  • Character level - get all unique characters from the training corpus and assign each an integer (`I am self aware -> [I, , a, m, , s, e, l, f, , a, w, a, r, e]`)
  • Sub-word level - use an algorithm like Byte-Pair Encoding (BPE) to identify common sub-words (`I am self aware -> [I, am, self, aware]`)
  • Sentence level - split entire sentences into tokens using another pre-trained model (`I am self aware -> [I am self aware]`)

In practice, sub-word tokenization is the most common, and for good reason. Characters are too granular for a good discrete representation of language. Think about it this way: `a` can be used in so many different contexts, whereas `apple` has a much more definite, atomic meaning. On the other end, sentence-level tokenization is too coarse-grained for language modeling tasks.

I'll use a character-level tokenizer in the beginning (I'll explain why later) and then switch to BPE later on.

Embedding

Graph of word embeddings versus dimensions

Tokenization converts strings to integers, yet these integers are not semantically meaningful. There's no information about the meaning of the word encoded in the integer (i.e. `flower=123` and `cybertruck=124`). We could let the model operate on single integers, but then we only have one number to represent each word, which is not enough information to work with. Embeddings address this by assigning each integer in the vocabulary (each referring to a sub-word token) an n-dimensional vector known as its embedding. The data structure that handles this assignment is called a lookup table. You take a string, map each sub-word to an integer using tokenization, and then map that integer to its n-dimensional vector.
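In PyTorch, the lookup table is typically an `nn.Embedding`. A minimal sketch, with made-up sizes:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10, 4
lookup = nn.Embedding(vocab_size, embed_dim)  # vocab_size rows, each an embed_dim-dim vector

token_ids = torch.tensor([3, 7, 3])  # e.g. a tokenized string
vectors = lookup(token_ids)          # one vector per token id
print(vectors.shape)                 # torch.Size([3, 4])

# The same token id always maps to the same vector
assert torch.equal(vectors[0], vectors[2])
```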

The vectors in the lookup table are adjusted during training. As a result, their positions in vector space begin to reflect their semantic meaning (look at the image above). `Germany` is 10 away from `Berlin` and `Tokyo` is 10 away from `Japan`. `Germany-Berlin=Capital` and `Japan-Tokyo=Capital`. Woah.

Unembedding

Unembedding is a fundamentally different process than embedding. There's no way to do a reverse lookup, so we need a more flexible vector-to-integer model: a position-wise linear layer. If "position-wise" is unfamiliar, here's what it means. Our embedded sequence is a list of n-dimensional vectors, each referring to an independent token in the sequence. A position-wise linear layer means that every token vector in the sequence is processed by the same linear layer. Therefore, the linear layer isn't a `sequence -> sequence` linear map, but a `word -> word` linear map applied individually across tokens. Remember when I said that when a GPT does a `sequence(x) -> sequence(y)` mapping, it's really like each token in `x` is mapped to the `y` in the same position? The position-wise layer is the reason for this. To loop back to the question of "why can't the model look forward to cheat the answer?": it's simply because there is no mechanism to do so. The embedding operates on each token independently, and the unembedding acts on each token vector independently. The ability to use context comes from the internal components of the GPT, which we'll get to soon.
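You can verify the position-wise behavior directly: `nn.Linear` applied to a 3-D tensor acts on the last dimension only, so each position is mapped independently (the sizes here are made up):

```python
import torch
import torch.nn as nn

B, T, C, vocab_size = 2, 5, 8, 65
unembed = nn.Linear(C, vocab_size)  # one word->word map shared across positions

x = torch.randn(B, T, C)
logits = unembed(x)   # applied independently at each position
print(logits.shape)   # torch.Size([2, 5, 65])

# Changing position 0 leaves every other position's output untouched:
x2 = x.clone()
x2[:, 0, :] = 0.0
assert torch.allclose(unembed(x2)[:, 1:, :], logits[:, 1:, :])
```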

At this point, assuming you understand everything explained to this point, you've got a really good foundational intuition. With that groundwork laid, I now feel comfortable integrating some math notation into it.

Understanding the flow of data in terms of changes of shapes

In this section, I'll introduce you to a style of mathematical notation that makes reasoning about/debugging deep learning models drastically easier.

The input to our model is not actually a single sequence of text; it's a batch made of several independent sequences of text. Each sequence in the batch is processed separately by the model, but batching is much more computationally efficient. I'll call the number of sequences in each batch `B`, although this is sometimes called `batch_size`. Each sequence in the batch is made up of a certain number of tokens. I'll call the number of tokens in each sequence `T`, although you might see it as `context_length`. When the input has only been tokenized (not yet embedded), the input is a batch of sequences of integers with shape `(B,T)`. We could also convert this representation to an equivalent version that uses one-hot encoding instead: each of the `T` integers in the sequence would become a vector of length `vocab_size`, giving shape `(B,T,vocab_size)`. You can think of the non-one-hot-encoded version as `(B,T,1)` (the 1 is just a single integer) if that helps make the comparison clearer.

When we embed it, we map each of those integers to a `C`-dimensional vector. `C` is also sometimes known as the `embed_dim`. Once embedded, the shape is `(B,T,C)`.
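The whole shape flow can be checked with toy values (the sizes below are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, vocab_size, C = 4, 8, 65, 16
tokens = torch.randint(0, vocab_size, (B, T))        # (B, T) integers
one_hot = F.one_hot(tokens, num_classes=vocab_size)  # (B, T, vocab_size)
embedded = nn.Embedding(vocab_size, C)(tokens)       # (B, T, C)
print(tokens.shape, one_hot.shape, embedded.shape)
```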

In code, I will use `embed_dim` and `context_length`, but I think the `(B,T,C)` convention is very good shorthand for comments and einsum notation, which I'll use later.


Recap

We have a bunch of text in our training dataset. We get text in batches and tokenize, resulting in an input with shape `(B,T)`. Converting this to a one-hot encoded representation yields a `(B,T,vocab_size)` input. After embedding, the input becomes `(B,T,C)`. Then comes the unembedding, which maps the `C`-dimensional vector back to `vocab_size`. Since this is the only functionality needed to go from input to output, we can start by building a simpler language model that goes from tokenization to embedding to unembedding.

Coding it up

First let's build the character-level tokenizer and load some data in (tiny-shakespeare). I'll explain each line in comments next to the code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

with open("../data/shake.txt", "r") as f:
    data = f.read()  # read in data

len(data), data[:50]
```
(1115394, 'First Citizen:\nBefore we proceed any further, hear')

Get all the unique characters in data with `set`

```python
# set returns only the unique chars; convert to a list so we can sort
vocab = sorted(list(set(data)))
vocab_size = len(vocab)
len(vocab), vocab[:10]
```
(65, ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3'])

Create tokenization mapping from integer to character and character to integer.

```python
# mapping from integers to chars -> decode
itoc = {i: c for i, c in enumerate(vocab)}
# mapping from chars to integers -> encode
ctoi = {c: i for i, c in enumerate(vocab)}
# map each char in input string x to its integer
encode = lambda x: [ctoi[c] for c in x]
# map each integer in input list x back to its char
decode = lambda x: [itoc[i] for i in x]
encode("hello"), decode(encode("hello"))
```
([46, 43, 50, 50, 53], ['h', 'e', 'l', 'l', 'o'])
```python
tokenized_data = encode(data)  # run encoding function on entire dataset
n = int(len(tokenized_data) * 0.9)  # define training slice as 90%
train_data = tokenized_data[:n]  # first 90% of data
test_data = tokenized_data[n:]  # last 10% of data
```

Create a dataloader function from the tokenized data:

  1. Get `batch_size` random start indices
  2. `x` runs from `index` to `index+context_length`
  3. `y` runs from `index+1` to `index+1+context_length` -> the shift

```python
def get_batch(data, context_length, batch_size):
    # random start index for each sequence in the batch
    ix = torch.randint(0, len(data) - context_length - 1, (batch_size,))
    x = torch.stack(
        [torch.tensor(data[i : i + context_length], dtype=torch.long) for i in ix]
    )
    # y is the one-step right-shifted version of x, same size
    y = torch.stack(
        [torch.tensor(data[i + 1 : i + 1 + context_length], dtype=torch.long) for i in ix]
    )
    return x, y

get_batch(train_data, 64, 32)[0].shape  # (B,T)
```

Tiniest Language Model

Now that we've got the necessary ingredients for the simplest possible embed to unembed language model, let's build it...

```python
class LanguageModel(nn.Module):
    def __init__(self, embed_dim, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # position-wise linear layer mapping each C-dim vector back to vocab_size
        self.unembedding = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        return self.unembedding(self.embedding(x))
```

How does it learn?

At this point, I need to explain the training process of the model. Remember how our output is the predicted shifted sequence, and we compare it to the actual one? The output has shape `(B,T,vocab_size)` and the ground truth has shape `(B,T)`. To get a performance metric from this, we compare them using cross-entropy loss, a way of measuring the difference between two probability distributions.
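PyTorch's `nn.CrossEntropyLoss` expects `(N, num_classes)` logits against `(N,)` integer targets, which is why the `(B,T,...)` tensors get flattened before the comparison. A sketch with toy shapes:

```python
import torch
import torch.nn as nn

B, T, vocab_size = 2, 4, 65
pred_logits = torch.randn(B, T, vocab_size)  # model output
y = torch.randint(0, vocab_size, (B, T))     # ground-truth token ids

criterion = nn.CrossEntropyLoss()
# Flatten batch and time so every position counts as one prediction
loss = criterion(pred_logits.view(-1, vocab_size), y.view(-1))
print(loss.shape)  # scalar: torch.Size([])
```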

```python
lm = LanguageModel(2, vocab_size)
optimizer = torch.optim.Adam(lm.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(10000):
    x, y = get_batch(train_data, 1, 64)
    pred_logits = lm(x)
    # flatten (B,T,vocab_size) predictions vs (B,T) targets
    loss = criterion(pred_logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    if step % 1000 == 0:
        print(f"Step {step}, Loss {loss.item()}")
```
Step 0, Loss 4.480882167816162
Step 1000, Loss 3.8772220611572266
Step 2000, Loss 3.561326742172241
Step 3000, Loss 3.372966766357422
Step 4000, Loss 2.8575117588043213
Step 5000, Loss 3.0783464908599854
Step 6000, Loss 3.2430574893951416
Step 7000, Loss 3.1417269706726074
Step 8000, Loss 3.090763568878174
Step 9000, Loss 3.076702117919922

... and generate

```python
def generate_sequence(
    lm,
    train_data,
    batch_size=1,
    seq_length=256,
    context_length=256,
    temperature=1.0,
):
    # run generation on whatever device the model lives on
    device = next(lm.parameters()).device
    # Get initial context
    initial_context, _ = get_batch(train_data, 256, batch_size)
    initial_context = initial_context.to(device)
    generated_tokens = initial_context[:, -1].unsqueeze(-1)
    context = initial_context
    for _ in range(seq_length):
        if context_length and context.size(1) > context_length:
            context = context[:, -context_length:]
        # logits for the last position only, scaled by temperature
        logits = lm(context)[:, -1, :] / temperature
        probs = F.softmax(logits, dim=-1)
        # sample a token from the predicted distribution
        sampled_token = torch.multinomial(probs, 1)
        # append the token and slide the context window
        generated_tokens = torch.cat((generated_tokens, sampled_token), dim=-1)
        context = torch.cat((context[:, 1:], sampled_token), dim=-1)
    # convert tokens back to a string
    return "".join(
        decode(list(generated_tokens[0].flatten().cpu().detach().numpy()))
    )

# Usage
generated_text = generate_sequence(lm, train_data, temperature=1)
print(generated_text)
```
fesn!atbRFhed wfrnt o soaq:A Wle he,oI Wud deis MVS

An interesting property well-known in mechanistic interpretability is that our simple language model (`embed -> unembed`) approximates bigram statistics. Since the model is simply mapping one word to one word with no other context, the best it can theoretically do is emulate bigram statistics: the literal frequency counts of each one-word-to-one-word mapping present in the dataset. Of course, it doesn't behave exactly the same; we intentionally introduce some variability by probabilistically sampling instead of always taking the highest-probability component.
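Bigram statistics are cheap to compute directly; a character-level sketch on a made-up string:

```python
from collections import Counter

text = "abab abc"
# frequency of each char -> next-char pair
bigram_counts = Counter(zip(text, text[1:]))
print(bigram_counts[("a", "b")])  # 3
```

A model limited to one-token context can at best reproduce these (normalized) counts as its predicted distribution.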

What's a GPT?

What we've just built is not a GPT, but it's close. The unique GPT components sit between the embedding and unembedding layers. We've got the outer shell of a language model, but the inner components (what makes GPT special) will allow for reading prior context. Once embedded, the `(B,T,C)` input is fed through a series of residually stacked "transformer layers". Each transformer layer is identical. These transformer layers are "residual" because they do not replace the signal like `new_x = transformer_layer(x)`, but instead compute `x = x + transformer_layer(x)`.

Inside the transformer layer are two components, an attention layer and an MLP layer. First, the attention output is added onto the original, `x = x + attention(x)`, and then `x = x + mlp(x)`. If you expand this out, it looks like `x = x + attention(x) + mlp(x + attention(x))`. The original input persists through each transformer layer but has information read from and written to it in each layer. You can think of this as a sort of data highway. The embedded input `(B,T,C)` is fed into the transformer layers and its values are repeatedly read and written until it reaches the unembedding. Recall how, in our simple `embedding -> unembedding` model, each token is directly mapped from current to next token. Transformer layers add the ability to mix in context from previous words as well. Specifically, the attention layer is the only operation that allows information to mix in from previous words, because the MLP is another position-wise network. The concept I've just described is the residual stream, a key insight of mechanistic interpretability. With this insight, the behavior and interpretability of the model become much clearer. All operations on the residual stream are linear (direct addition), so information can persist in certain "linear subspaces" of the model throughout many layers. Consequently, layers can communicate and perform functions across each other.
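You can verify that expansion numerically with stand-in sublayers (plain linear layers here, not real attention or MLP blocks):

```python
import torch
import torch.nn as nn

C = 8
x = torch.randn(3, C)
attention = nn.Linear(C, C)  # stand-in for the attention sublayer
mlp = nn.Linear(C, C)        # stand-in for the MLP sublayer

# Sequential residual updates...
h = x + attention(x)
out = h + mlp(h)
# ...equal the expanded form x + attention(x) + mlp(x + attention(x))
expanded = x + attention(x) + mlp(x + attention(x))
assert torch.allclose(out, expanded)
```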


Attention

In GPTs, attention acts as the primary mechanism for contextual understanding. It facilitates the token-to-token information exchange crucial for predicting subsequent tokens. There are two functionalities we want when moving context between tokens to improve understanding throughout the residual stream:

  1. Read information from previous tokens(without also looking ahead to cheat)
  2. Write information from previous tokens into the residual stream

How to read

Reading information is similar to a data-retrieval process. We want each token to look to its left and consider which previous words are relevant to the understanding of the current token in some way. Mathematically, this should look like a probability distribution (all values between 0 and 1, summing to 1) over all the tokens up to and including the current token. These probabilities reflect what proportion of the relevant information from each other token we want to copy into the current one.

To do so, we want every token in the sequence, including the current one, to "broadcast" a vector representing the information it holds that might be relevant to the current token: this is the key, `K`. The current token also broadcasts its own vector establishing what information it's looking for: the query, `Q`. At every position behind (and including) the current token, every token is deciding how relevant it is to the current token. Attention uses the dot-product between the broadcasted `Q` vector and all corresponding `K` vectors to obtain a relevancy score, then softmaxes to get probabilities. The entire reading process is summed up in the attention table: each row corresponds to how much of each other token's vector we want to copy into the current one.
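Ignoring batching and masking for a moment, the read step is just a scaled dot product followed by a row-wise softmax (toy sizes below):

```python
import torch
import torch.nn.functional as F

T, C = 4, 8
q = torch.randn(T, C)  # one query vector per token
k = torch.randn(T, C)  # one key vector per token

scores = q @ k.T / C**0.5      # (T, T) relevancy scores
A = F.softmax(scores, dim=-1)  # each row becomes a probability distribution
print(A.shape)                 # torch.Size([4, 4])
assert torch.allclose(A.sum(dim=-1), torch.ones(T))
```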


So the square grid that I referred to is a real thing: the attention pattern. The attention pattern, `A`, is responsible for all the reading operations from other tokens. There are no other components in a GPT that allow for information movement between tokens. As a result, this is where we implement the masking operation so our model can only look at the past. All that's needed is a lower-triangular mask, meaning we set all the values representing future tokens to `-inf` before the softmax. The way you get that single relevancy number between the `Q` and `K` vectors is a dot-product comparison between the current token's query and the broadcasted key vectors of all the other tokens in the sequence. Both `Q` and `K` have shape `(B,T,C)`, so to multiply each query vector against each key vector, we take the matrix product with the transpose of `K`: `Q(B,T,C) @ K(B,C,T) -> A(B,T,T)`. This is the square grid of tokens I was referring to. PyTorch's batched matrix multiplication leaves the batch dimension alone, so each query token vector is multiplied by every key token vector. The `Q` and `K` linear layers learn to align with whatever role they're placed in: here, the `Q` vector comes to represent what information the current token wants from the other tokens, and the `K` vector comes to represent what information each other token has. Dot-producting them along the feature dimension gives you the similarity, which we use as our token-reading operation.
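The key detail is that the mask must be applied *before* the softmax, so the `-inf` entries become exactly zero probability. A standalone sketch:

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)
# True above the diagonal, i.e. at every future position
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
A = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

# Every weight on a future token is exactly zero...
assert torch.all(A[mask] == 0)
# ...and each row still sums to 1 over the allowed (past + current) tokens.
assert torch.allclose(A.sum(dim=-1), torch.ones(T))
```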

We implement it in code form here

```python
class SelfAttention(nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        self.embedding_dim = embedding_dim
        # one linear layer producing Q, K, and V at once
        self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim)
        self.scaling_dim = torch.sqrt(
            torch.tensor([self.embedding_dim], dtype=torch.float32)
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        A = q @ k.transpose(-1, -2) / self.scaling_dim  # (B,T,T)
        # upper-triangular mask: hide future positions BEFORE the softmax
        mask = torch.triu(
            torch.ones((T, T), device=x.device), diagonal=1
        ).bool()
        A = A.masked_fill(mask, float("-inf"))
        A = F.softmax(A, dim=-1)
        return A @ v
```

From One to Multiple Heads

In this setup, each attention layer is responsible for one read and write operation. This is an inefficient use of compute because you can reduce the dimensionality of the `QKV` matrices and still achieve good performance. The way to resolve this is to split each of the `QKV` matrices into `num_heads` heads, and apply the attention operation separately for each head. `Q` has shape `(B,T,C)`, so if we want four heads, we make the shape `(B,T,4,C//4)`. I'll call `C//4` the `head_dim` and write `num_heads` instead of `4` from now on. Then, we can just apply the same attention operation on each head separately and combine them at the end. For the attention pattern `A`, this means rearranging `Q` and `K` (which have shape `(B,T,num_heads,head_dim)`) into `(B,num_heads,T,head_dim)`. To perform the scaled dot-product operation, you just perform a matrix multiplication along the `head_dim` in the same way we did it along the `embed_dim` before. The output is `(B,num_heads,T,T)`.

```python
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        assert embedding_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embedding_dim // num_heads
        self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim)
        self.scaling_dim = torch.sqrt(
            torch.tensor([self.head_dim], dtype=torch.float32)
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B,T,C) -> (B,num_heads,T,head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        A = q @ k.transpose(-1, -2) / self.scaling_dim  # (B,num_heads,T,T)
        # hide future positions BEFORE the softmax
        mask = torch.triu(
            torch.ones((T, T), device=x.device), diagonal=1
        ).bool()
        A = A.masked_fill(mask, float("-inf"))
        A = F.softmax(A, dim=-1)
        out = A @ v  # (B,num_heads,T,head_dim)
        # merge the heads back into a single (B,T,C) tensor
        return out.transpose(1, 2).contiguous().view(B, T, C)
```

MLP

From a mechanistic interpretability perspective, the MLP in the transformer is still somewhat mystifying. Though powerful, its reason for existence is less certain than the attention mechanism's. MLP layers clearly improve the performance of the model, but they're not as direly necessary as attention. Regardless, they're important to understand and implement. Like everything else in the transformer outside of attention, the MLP transformation is applied position-wise to each token: there is no information movement between tokens, only a better understanding of the current token. The MLP takes an input `(B,T,C)`, expands it through a linear layer to `N` times the embedding dimensionality, applies an activation function, and then shrinks it back down from `N*embed_dim` to `embed_dim`. The wider hidden layer allows the model to learn more complex behavior.

```python
class MLP(nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc1 = nn.Linear(embedding_dim, embedding_dim * 4)  # expand
        self.fc2 = nn.Linear(embedding_dim * 4, embedding_dim)  # shrink back
        self.dropout = nn.Dropout()

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(F.gelu(x))  # activation between the two linear layers
        x = self.dropout(x)
        return x
```

TransformerLayer

To further abstract the transformer layer so it's simpler in the actual GPT class, I put all the logic for the residual (attention + MLP) in here. That way, I can just run the input through a list of TransformerLayers.

```python
class TransformerLayer(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        self.self_attention = MultiHeadSelfAttention(embedding_dim, num_heads)
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.mlp = MLP(embedding_dim)
        self.norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        # residual connections with "pre-norm"
        x = x + self.self_attention(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```

Positional Embedding


Since we're including prior context, we need some way of telling the model *where* it's copying information from when doing attention; attention has no inherent notion of position. There are more complex solutions out there, but learned positional embeddings will do for our purposes. Similar to how we assigned an integer to each word in our vocabulary, we can assign an integer to each position from 0 up to the max context length. We do this by creating an `nn.Embedding` whose number of embeddings equals the max context length and whose embedding dimension equals the main `embedding_dim`. Then, you just add the positional embeddings to the token embeddings.

```python
self.pos_embedding = nn.Embedding(max_context_length, embedding_dim)
```

The long awaited GPT

Everything is finally built. Embedding, unembedding, MLP, multi-head attention, positional embedding. All that's left is to actually put it into this class. This should be pretty simple if you've followed up until this point.

```python
class GPT(LanguageModel):
    def __init__(
        self, embedding_dim, num_heads, num_layers, max_context_length, vocab_size
    ):
        super().__init__(embedding_dim, vocab_size)
        self.transformer_layers = nn.ModuleList(
            TransformerLayer(embedding_dim, num_heads) for _ in range(num_layers)
        )
        self.pos_embedding = nn.Embedding(max_context_length, embedding_dim)
        self.init_drop = nn.Dropout(0.05)
        self.final_norm = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        emb_x = self.embedding(x)
        B, T, C = emb_x.shape
        # one positional embedding per position, added to the token embeddings
        pos_emb_idx = torch.arange(0, T, device=x.device)
        pos_emb_x = self.pos_embedding(pos_emb_idx)
        x = self.init_drop(pos_emb_x + emb_x)
        for transformer_layer in self.transformer_layers:
            x = transformer_layer(x)
        x = self.final_norm(x)
        logits = self.unembedding(x)
        return logits
```

Let's use the same training loop as before but use a small GPT.

```python
embedding_dim = 16
num_heads = 4
num_transformer_layers = 5
context_length = 256

lm = GPT(
    embedding_dim, num_heads, num_transformer_layers, context_length, vocab_size
).to("mps")
optimizer = torch.optim.Adam(lm.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for step in range(1000):
    x, y = get_batch(train_data, 256, 64)
    x = x.to("mps")
    y = y.to("mps")
    pred_logits = lm(x)
    loss = criterion(pred_logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step}, Loss {loss.item()}")
```
```python
generate_sequence(lm, train_data, temperature=1)
```
ntusr oh alr?hrw ernrre Tretstb sbd T i e o e unOe otelN hlme udWwWorte he m,eTigr t f ? ooe vdo A?bdoiSomnuedor tnrFlodA oepo uy,loc e vneoy i mi trleuord onooc.unl,taoTroeld nod;cTaacdduo aa Nrog cwilloaa Cerr reai hrtlVdtoib:Wgtr N!?orrrmlrhdo edenilo

It might not look great, but compare that to the output of an untrained model. It's clearly picking up on things like the rarity of certain special characters and commonality of newlines.

oOEBEBAjrF;sw?OBruxLh33$AIvBvO;jVw.mnCzB.$:ns'OuQNcaYvZMbDl?jtABd

Boom

That's all there is to GPTs! With the intuition and technical knowledge you've got now, you should be able to understand new modifications to the architecture with relative ease. I'll go into some of these paradigms in another post.