A deep understanding of AI large language model mechanisms

| Section name | Video lecture title | Python code file name | Udemy video # | Key take-home points |
|---|---|---|---|---|
| Introductions | [IMPORTANT] Prerequisites and how to succeed in this course | - | 1 | LLM architecture, training, and mechanisms are advanced topics. Have a positive attitude and embrace the challenge. Taking notes by hand helps you learn more and remember better. Lecture notes are not available. Having experience with coding, linear algebra, machine learning, and deep learning will help you excel in this course, although these topics are introduced as they become necessary. You can choose which lectures to watch and which to skip, but keep in mind that knowledge and skills are cumulative. |
| | Using the Udemy platform | - | 2 | - |
| | Getting the course code, and the detailed overview | - | 3 | - |
| | Do you need a Colab Pro subscription? | - | 4 | You can access GPUs for free on Google Colab, though compute time and RAM are limited. Most of this course can be done on the CPU and limited GPU (free Colab plan), but it will be slower. Paying to upgrade to Colab Pro will be convenient for many lectures but is not necessary. You can upgrade to Pro and downgrade after the course. |
| | About the "CodeChallenge" videos | - | 5 | You are the master of your education, and you should engage with this course in a way that best suits your skills and goals. Use different difficulty levels for different videos. Watch the code demos, even if you aren't coding yourself. |
| Part 1: Tokenizations and embeddings | | | | |
| Words to tokens to numbers | Why text needs to be numbered | - | 7 | Text must be transformed into numbers before it can be fed into an LLM. A chunk of text is a "token" and can be a character, subword, or full word. Embeddings are dense representations of tokens. Tokenization and embeddings are learned from data, and there are many ways to create these schemes. |
| | Parsing text to numbered tokens | part1_text2num_text2numbers | 8 | Text can be split into words via spaces, although this is not done in real tokenization. Encoder and decoder functions are simple look-up tables. Tokenization (encoding text using integers) is conceptually straightforward. |
| | CodeChallenge: Create and visualize tokens (part 1) | part1_text2num_CCmakeATokenizer | 9 | Encoders and decoders are created using dictionary comprehensions and functions. The "context" of a token is its neighbors (before and possibly after); the "context window" is the number of neighbors. |
| | CodeChallenge: Create and visualize tokens (part 2) | part1_text2num_CCmakeATokenizer | 10 | "One-hot encoding" is a sparse tokenization, with one row per token and one column per vocab item. |
| | Preparing text for tokenization | part1_text2num_preparingText4tokens | 11 | Real text from the web is easy to import but a pain to clean… Creating a tokenizing scheme is tricky and involves many choices with few clear optimal decisions. Encoders and decoders are easy to create and use. |
| | CodeChallenge: Tokenizing The Time Machine | part1_text2num_CCtimeMachineTokens | 12 | Randomly generated tokens are usually nonsensical. There is a mathematical relationship between word length and frequency (more on this later!), which has implications for LLM performance. Tokenizers need special characters to deal with unknown tokens. |
| | Tokenizing characters vs. subwords vs. words | - | 13 | Every tokenization scheme has advantages and limitations. Current best tokenizers use a combination of characters, subwords, and words. The vocab is learned based on statistical characteristics of human-written text. |
| | Byte-pair encoding algorithm | part1_text2num_bytePairEncoding | 14 | The BPE algorithm is based on replacing frequent token sequences with new tokens (see the BPE sketch after this table). The basic BPE algorithm is simple and easy to implement. Production-level tokenizers add several more steps to ensure accuracy, efficiency, and speed. |
| | CodeChallenge: Byte-pair encoding to a desired vocab size | part1_text2num_CCbytePairEncodingLoop | 15 | Even simple byte-pair encoding on a tiny dataset creates tokens with preceding spaces, just like professional tokenizers. |
| | Exploring ChatGPT4's tokenizer | part1_text2num_GPT4tokenizer | 16 | OpenAI's tokenizer is available, but is model-specific (e.g., GPT2 vs. GPT4). Tokenizers use character-, subword-, and word-level tokens. Preceding spaces are part of tokens. No text preprocessing is necessary! Just feed all the text into the tokenizer (see the tokenizer sketch after this table). |
| | CodeChallenge: Token count by subword length (part 1) | part1_text2num_CCtokenEfficiency | 17 | Words and tokens differ in several ways, though they can overlap. Words vary in their encoding efficiency, which is partly related to how often they appear in texts. |
| | CodeChallenge: Token count by subword length (part 2) | part1_text2num_CCtokenEfficiency | 18 | Words and tokens differ in several ways, though they can overlap. Words vary in their encoding efficiency, which is partly related to how often they appear in texts. |
| | How many "r"s in strawberry? | part1_text2num_strawberry | 19 | ChatGPT has difficulties with letter-based calculations because it represents words as tokens. Asking ChatGPT to implement its calculations in Python increases accuracy (the same holds for math problems). |
| | CodeChallenge: Create your algorithmic rapper name :) | part1_text2num_CCrapperName | 20 | Tokenization can be fun! Tokenizers take lists, not ints, as input. |
| | Tokenization in BERT | part1_text2num_BERT | 21 | Different tokenizers are optimized for different purposes. The BERT tokenizer has more words (vs. subwords) than GPT, and is therefore more human-interpretable. The BERT tokenizer by default adds special tokens before and after the text. Don't forget about this! |
| | CodeChallenge: Character counts in BERT tokens | part1_text2num_CCbertChars | 22 | Professional tokenizers are easy to work with once you get used to them. Tokenizers contain "special tokens" that you might want to filter out of analyses. |
| | Translating between tokenizers | part1_text2num_tokenTranslation | 23 | Write your code for one specific tokenizer. Choose the tokenizer (and LLM!) based on your goals. |
| | CodeChallenge: More on token translation | part1_text2num_CCtranslatorFuns | 24 | The more familiar you are with tokenization, the easier it will be to understand embeddings and LLM mechanisms. |
| | CodeChallenge: Tokenization compression ratios | part1_text2num_CCtokenCompression | 25 | Importing text data from the web is really easy. Tokenization = compression? The primary goal of a tokenizer is to make text more efficient for LMs, but compression is a common byproduct due to redundancies in (some) written languages. Token compression ratios are stable across different texts, with higher variability for text that includes code. |
| | Tokenization in different languages | part1_text2num_languages | 26 | Tokenization ≠ compression (but it often is). Languages with more complex written forms (e.g., morphemes in Chinese, richer morphology in Tamil) may require more tokens than characters. Tokenization is less effective for languages with less training data. |
| | CodeChallenge: Zipf's law in characters and tokens | part1_text2num_CCzipf | 27 | Zipf's law, a.k.a. power-law scaling, a.k.a. scale-free organization, a.k.a. fractal-like, a.k.a. self-similarity, is a pervasive characteristic of biological and physical systems, and is taken as evidence of complex systems. Modern computing tools and accessible digitized datasets allow you to explore nature in ways that were unthinkable until very recently. |
| | Word variations in Claude tokenizer | part1_text2num_ClaudeVariations | 28 | Spaces are meaningful to humans, but are treated just like any other character by tokenizers. Lots of subwords and words have preceding spaces. Language models need a huge amount of data to learn all the ambiguities, errors, and varieties in human text. |
| Embedding spaces | Word2Vec vs. GloVe vs. GPT vs. BERT... oh my! | - | 29 | There are several word embeddings matrices that have different goals and applications, and are created in different ways. It is not trivial to compare them (more on this in the mech. interp. section). The embeddings used in LLMs are not fixed, but instead are adjusted by the model based on context. |
| | Exploring GloVe pretrained embeddings | part1_embed_GloVe | 30 | The GloVe embeddings can be used to study texts and relations between words. The embeddings vectors are fixed once trained. GloVe is not used in LLMs, partly because of its large, word-based vocabularies. Cosine similarity has many applications in language modeling and computational linguistics. |
| | CodeChallenge: Wikipedia vs. Twitter embeddings (part 1) | part1_embed_CCwikiVsTwitter | 31 | There is a lot of diversity across word embeddings matrices. It is difficult or impossible to compare different embeddings matrices directly, but sets of relationships can be compared (see RSA in the next section!). Visualizing embeddings vectors often helps with interpretation. |
| | CodeChallenge: Wikipedia vs. Twitter embeddings (part 2) | part1_embed_CCwikiVsTwitter | 32 | There is a lot of diversity across word embeddings matrices. It is difficult or impossible to compare different embeddings matrices directly, but sets of relationships can be compared (see RSA in the next section!). Visualizing embeddings vectors often helps with interpretation. |
| | Exploring GPT2 and BERT embeddings | part1_embed_GPT2BERT | 33 | All model parameters of publicly available LLMs are easily accessible. GPT2 and BERT embeddings are not directly comparable, though model comparisons are possible (see next section). LLM embeddings are not fixed! They are adjusted by the attention and MLP layers as they pass through the LM. |
| | CodeChallenge: Math with tokens and embeddings | part1_embed_CCmathWithTokens | 34 | Token indices are arbitrarily mapped to meaningful subwords (numbers). "Token math" makes no sense (but can be fun :P ). Embeddings vectors, however, can be manipulated mathematically, which is how LLM attention works! Use LLMs for math theory and explanations, and have them solve math problems by writing code to solve the problems. |
| | Cosine similarity (and relation to correlation) | part1_embed_cosineSimilarity | 35 | Cosine similarity is one of the most commonly used measures of a relationship between two variables when working with embeddings and LLMs. Cosine similarity is related to correlation: both variance-normalize; Pearson additionally mean-centers. Use correlation to quantify a linear relationship in data that have different scales or mean offsets; when the variables are on the same scale and offsets are meaningful, use cosine similarity. This is often the case in LLM investigations. When learning a new technical topic, try to code the math yourself; in applications, use established libraries if available and if it's easier (see the cosine-similarity sketch after this table). |
| | CodeChallenge: GPT2 cosine similarities | part1_embed_CCcosineSimilaritiesGPT2 | 36 | Multitoken words present several challenges in LLM investigations (hint: give the model context and only analyze the final token). Understanding the math of analyses makes you a better and more flexible data scientist. |
| | CodeChallenge: Unembeddings (vectors to tokens) | part1_embed_CCunembedding | 37 | An "unembeddings matrix" is the conceptual inverse of the embeddings matrix (but not a literal inverse matrix). The token with the largest unembeddings value is the next token in a generated sequence. Generated text can quickly lose meaning, which was a major hurdle for language models to overcome. The broad-strokes mechanism of next-token generation is simple, both conceptually and mathematically (see the unembedding sketch after this table). |
| | Position embeddings | part1_embed_positionEmbeddings | 38 | Language models use position embeddings to focus on tokens from various locations ("time points") in the token sequence. Predefined position embeddings are probably good enough for smaller models; modern architectures use learned embeddings. Cosine similarity matrices can be tricky to interpret, but convey a lot of information (see the position-embeddings sketch after this table). |
| | CodeChallenge: Exploring position embeddings | part1_embed_CCpositionExplorations | 39 | The position embeddings matrix is complicated and impacts token processing in ways that are difficult to predict a priori. Visual appearances (especially of apparent null effects) should be statistically evaluated using shuffled data before making strong interpretations. The shuffling method here was too liberal; circular shifting and spectral phase scrambling are better. |
| | Training embeddings from scratch | - | 40 | Embeddings matrices in LLMs are trained with the rest of the model. In the next several videos, however, we will train only an embeddings layer, to focus on the mechanisms of learning embeddings. |
| | Create a data loader to train a model | part1_embed_learnEmbeddings | 41 | Preparing data to train models can become complicated (a common experience in many data fields…). Simpler data organization methods are possible, but can be suboptimal for professional-grade model training. |
| | Build a model to learn the embeddings | part1_embed_learnEmbeddings | 42 | Embeddings matrices are learned from text data using gradient descent and dimension-squeezing deep learning models. The embedding dimension is preserved throughout the entire language model. Language models generate text by concatenating one new token onto an existing token sequence. |
| | Loss function to train the embeddings | part1_embed_lossfunction | 43 | Negative log likelihood is the main loss function used in language model training. Log softmax increases sensitivity at small probabilities, and gives a stronger penalty for errors. Most loss functions (and other mathematical bases of DL) are simple and well-defined. Difficulties arise from the explosive dimensionality of models (see the loss-function sketch after this table). |
| | Train and evaluate the model | part1_embed_learnEmbeddings | 44 | Language models are trained on a small number of epochs, because each epoch has a huge amount of data (batches). GPU access will be increasingly important as the course progresses… |
| | CodeChallenge: How the embeddings change | part1_embed_learnEmbeddings | 45 | The embeddings vectors expanded to fill up more of the embeddings space, reflecting encoding of high-dimensional information in a lower-dimensional space. Interpretable semantic relationships (as measured through Sc) emerge even with a small amount of targeted training data, though the effects are weak. |
| | CodeChallenge: How stable are embeddings? | part1_embed_learnEmbeddings | 46 | Embeddings vectors of the same token appear to be completely unrelated across repeated training runs. Cosine similarity appears to be more consistent for related tokens. Relative embeddings within a matrix are more interpretable than absolute vectors or comparisons across matrices. The loss profile is relevant, but is just a tiny glimpse at what happens during training. |
| Part 2: Large language models | | | | |
| Build a GPT | Why build when you can download? | - | 48 | Building a model from scratch is a fantastic way to learn how LLMs are created and trained. Please don't ever build an LLM from scratch again (except for more education). |
| | Model 1: Embedding (input) and unembedding (output) | part2_build_model1 | 49 | Saying that language models "use only the final token" for next-token prediction is not accurate. They use all tokens; the final token contains the most information. Next-token selection is probabilistic. LLMs get complicated quickly; it's good to learn about them one step at a time. |
| | Understanding nn.Embedding and nn.Linear | part2_build_embeddingVlinear | 50 | nn.Embedding is a convenient wrapper for nn.Parameter. Use it to create embeddings matrices. |
| | CodeChallenge: GELU vs. ReLU | part2_build_CCreluVgelu | 51 | GELU is the most common activation function in LLMs. It is more complicated than ReLU but smoother. Many PyTorch operations are available as both functions and classes; using one or the other is sometimes a necessity and sometimes a personal choice. Measuring computation time is not trivial, because of optimized implementations, GPU fusing, overhead, etc. |
| | Softmax (and temperature): math, numpy, and pytorch | part2_build_softmax | 52 | Softmax transforms LLM outputs from "raw" values (logits) into a probability distribution. Temperature increases stochasticity, which helps language models produce more "creative" text. Typical temperatures are 0.5 to 1.5. You'll learn more about nuances of softmax theory and implementation in later videos (see the softmax-and-sampling sketch after this table). |
| | Randomly sampling words with torch.multinomial | part2_build_multinomial | 53 | torch.multinomial directly maps token probability values onto selection probabilities. Many PyTorch functions are sensitive to input data type; this is something to check if (when) you get errors! NumPy and PyTorch functions can sometimes produce identical results given proper input arguments, but often not by default. Be careful when translating between libraries. |
| | Other token sampling methods: greedy, top-k, and top-p | - | 54 | There is no right or wrong token selection method. Random sampling can increase response variability, which might be good for social chatting but bad for coding or legal documents. The sampling method interacts with softmax temperature: higher temperatures have stronger boosts of fewer tokens. |
| | CodeChallenge: More softmax explorations | part2_build_CCsoftmaxExtreme | 55 | Softmax is mathematically simple, but some aspects of the transformation are apparent only for some numerical ranges. The number of data values (e.g., vocab size) impacts the probability values, because they must all sum to 1. |
| | What, why, when, and how to layernorm | part2_build_layernorm | 56 | Layernorm is simple and critical. "Set it and forget it": the learned parameters have little theoretical relevance other than preserving numerical stability throughout the model (see the layernorm sketch after this table). |
| | Model 2: Position embedding, layernorm, tied output, temperature | part2_build_model2 | 57 | Position embeddings are added to the token embeddings, and help the model learn temporal patterns. Embeddings are trained inside the model, not separately as with tokenization (cf. previous section). Tokens in a sequence are processed simultaneously, not in a for-loop (causality comes from the attention mechanism). |
| | Temporal causality via linear algebra (theory) | - | 58 | Time-causal attention can be implemented using a time vector in which the future is weighted zero while the past is weighted non-zero. Causality can be implemented using matrices and softmax probabilities, which is very computationally efficient on GPUs. Causal attention is not necessary for LLMs, but improves next-token generation (good for chatbots). |
| | Averaging the past while ignoring the future (code) | part2_build_pastWithLinalg | 59 | Matrix multiplication with masks is an efficient way to avoid for-loops (see the causal-mask sketch after this table). PyTorch has optimized functions that fuse the attention algorithm, including the causal mask (next lecture!). |
| | The "attention" algorithm (theory) | - | 60 | The "attention" mechanism is a clever way of pooling information across different embeddings vectors that allows the surrounding context to modify the current token transformation. All analogies break down; some are useful (or at least entertaining). The full LLM Transformer architecture is more complicated than just attention, but attention is a key component. |
| | CodeChallenge: Code Attention manually and in Pytorch | part2_build_CCattentionAlgo | 61 | The Q, K, and V weights matrices are trainable parameters that are not changed during inference (applications); the Q, K, and V activations result from multiplying those weights with the token embeddings vectors (see the attention sketch after this table). "Highly optimized" functions can be hardware-specific. Very powerful LLMs require specialized hardware, the sale of which is regulated by governments. |
| | Model 3: One attention head | part2_build_model3 | 62 | Attention adjusts (not replaces) the embeddings vectors as they pass through the model. Different tokenizers (and also different pretrained models) use different terms and variable names. Understanding how LLMs work will help you identify features in a model. |
| | The Transformer block (theory) | - | 63 | The Transformer block contains an attention sublayer and an MLP sublayer. Together, they calculate an adjustment to the token embedding to point it towards an appropriate next-token embedding. Expansion-nonlinearity-contraction is a typical MLP architecture for feature extraction and linear separability. LLMs comprise dozens of Transformer blocks that learn different features and timescales of texts. |
| | The Transformer block (code) | part2_build_transformer | 64 | You can now implement a GPT-style Transformer :) Separating modules into callable classes helps keep code neat and organized. Both Transformer sublayers comprise the operations copy → normalize → adjust → add back to the copy. |
| | Model 4: Multiple Transformer blocks | part2_build_model4 | 65 | Use nn.Sequential to create repeated instances of the same model component (with different weights matrices). Specialization of Transformer blocks is not imposed by the architecture, but is thought to be an emergent property. We'll discuss this more in the mechanistic interpretability sections. |
| | Multihead attention: theory and implementation | part2_build_multiheadAttention | 66 | Multihead attention (MHA) involves applying the attention equation to different slices of the Q, K, and V matrices. MHA is thought to increase the richness and complexity of feature isolation and context-sensitivity, without increasing the number of trainable parameters. Data from all heads are linearly combined via W0. |
| | Working on the GPU | part2_build_GPU | 67 | GPUs are great at number-crunching, and can save a lot of time in LLM training and applications. On the other hand, the CPU can be sufficient for a lot of explorations and investigations of LLMs. Accessing and using a GPU might be expensive and requires additional code, so use it only when it's really beneficial. |
| | Model 5: Complete GPT2 on the GPU | part2_build_model5 | 68 | We just built a GPT2-small :) although the weights are all random, so it's not functional. Commercial models, e.g., GPT4, have the same architecture and computations, but have more layers and parameters. But quantitative increases can have qualitative impacts. |
| | CodeChallenge: Time model5 on CPU and GPU | part2_build_CCgpuVsCpu | 69 | LLMs are basically worthless without a large number of dedicated high-end GPUs. Regulating GPU access/sales is part of AI safety. |
| | Inspecting OpenAI's GPT2 | part2_build_OpenAIGPT2 | 70 | Publicly available pretrained models are really easy to use, explore, experiment with, and learn from. |
| | Summarizing GPT using equations | - | 71 | There are many ways to understand a deep learning model; a combination of perspectives (diagrams, explanations, code, equations) is most beneficial. Rewriting equations and checking matrix sizes can help you understand the flow and order of calculations. Neither Q nor K directly contributes to the embedding vector adjustment; their combination determines which vectors in V guide next-token selection. |
| | Visualizing nano-GPT | https://bbycroft.net/llm | 72 | A neat website where you can visualize different GPT variants. |
| | CodeChallenge: How many parameters? (part 1) | part2_build_CCparameterCounts | 73 | Biases are a tiny fraction of model parameters, which is why many deep learning models (especially models with layernorm) ignore them. MLP layers are dense and can have 2-3x as many parameters as attention layers. The number of parameters does not equate to importance in the model (cf. layernorm). Examining and dissecting models are useful skills. |
| | CodeChallenge: How many parameters? (part 2) | part2_build_CCparameterCounts | 74 | Biases are a tiny fraction of model parameters, which is why many deep learning models (especially models with layernorm) ignore them. MLP layers are dense and can have 2-3x as many parameters as attention layers. The number of parameters does not equate to importance in the model (cf. layernorm). Examining and dissecting models are useful skills. |
| | CodeChallenge: GPT2 trained weights distributions | part2_build_CCweightsDists | 75 | Characteristics of model weights often have smooth transitions across layers, reflecting shifts in representations and calculations. In histograms, use counts for equal sample sizes, and use density (or other scaling) for unequal sample sizes. |
| | CodeChallenge: Do we really need Q? | part2_build_CClobotomizeQ | 76 | Causal manipulations are easy to implement, although knowing what to manipulate is challenging and non-trivial. Model evaluations are tricky. There are quantitative evaluation methods, but many evaluations are based on qualitative inspection of generated text. |
| Pretrain LLMs | What is "pretraining" and is it necessary? | - | 77 | Pretraining is a necessary first step for any LLM. A pretrained model understands the structure and patterns of written language, and can generate text. Pretraining to create a useful modern base model is prohibitively expensive for most individuals and companies. You should learn how pretraining works, but don't try it at home ;) |
| | Introducing huggingface.co | - | 78 | HuggingFace provides resources for downloading pretrained LLMs (and other models), training datasets, and more. We will use some of the free resources in this course. You do not need a HuggingFace login to access materials for this course. |
| | The AdamW optimizer | - | 79 | L2 regularization in Adam involves updating and regularizing in one step (i.e., regularizing the update). AdamW updates the weights first, then regularizes (i.e., regularizing the weights). Without regularization, Adam == AdamW. AdamW implements "constant shrinkage" instead of "adaptive shrinkage," and is empirically better in large models (see the AdamW sketch after this table). |
| | CodeChallenge: SGD vs. Adam vs. AdamW | part2_pretrain_CCsgdVsAdams | 80 | Simple mechanisms work better in simple models. Adam is adaptive and more likely to stabilize. Gradient accumulation speeds learning, but risks over-generalizing by pooling across more training data. Gradient accumulation is only used for training very large models, and you will probably always want to reset the gradients. |
| | Train model 1 | part2_pretrain_model1 | 81 | Code to train LLMs has the same basic organization as code to train any deep learning model. Even very simple models with limited training quickly learn text structure such as punctuation and line breaks. |
| | CodeChallenge: Add a test set | part2_pretrain_CCmodel1test | 82 | Creating and evaluating a test set is not so difficult, but requires extra code. Train/test splits are less important for pretraining LLMs, but are still good practice. Additional subtleties about devsets, model vs. researcher overfitting, etc., are not discussed here. |
| | CodeChallenge: Train model 1 with GPT2's embeddings | part2_pretrain_CCmodel1withEmbeds | 83 | Freezing weights is a common technique in deep learning for transfer learning or when re-training on limited data. Freezing weights is not necessarily advantageous, especially in simple models or if the new training data differ from the previously trained data. |
| | CodeChallenge: Train model 5 with modifications | part2_pretrain_model5WithMods | 84 | There are several ways to sample data, depending on how meticulous you want the procedure to be. Some overlap or skipping is less consequential with limitless training data. Models learn to produce "language-looking" text very quickly. |
| | Create a custom loss function | part2_pretrain_customLoss | 85 | Loss functions are extremely important for training deep learning models. Loss functions should be as simple as possible, both mathematically and in code implementation. Knowing how to create your own loss function gives you more control and flexibility over precise model designs and outcomes. |
| | CodeChallenge: Train a model to like "X" | part2_pretrain_CCtrainXbias | 86 | KL divergence is a useful loss function for training on distributions instead of individual tokens (see the KL-divergence sketch after this table). It is easy to train biases into models. This has major implications for fairness, misuse, persuasion or manipulation, cultural or political biases, marketing, and other AI safety topics. |
| | CodeChallenge: Numerical scaling issues in DL models | part2_pretrain_CCscaling | 87 | Each matrix multiplication (dot products) increases the variance and numerical range of the numbers. This can have a negative impact on softmax probabilities by flattening the distribution. Repeated normalization is very important for the stability of deep learning models, including LLMs. |
| | Weight initializations | part2_pretrain_weightsInits | 88 | Weight initialization is not important in small models, but is crucial for training large models, including LLMs. Weights should be initialized to small values, often proportional to the matrix sizes. Bias terms are typically ignored in initializations, because there are so few of them. Having some sensible initialization is important; the exact details and distribution shape seem to be less relevant (see the weight-initialization sketch after this table). |
| | CodeChallenge: Train model 5 with weight inits | part2_pretrain_CCmodel5weightInits | 89 | Now you know how to initialize weights :) Examining changes in the model during learning is an approach used in mechanistic interpretability. Weights distributions tend to widen as the models learn more diverse patterns and representations. |
| | Dropout in theory and in Pytorch | part2_pretrain_dropout | 90 | Dropout involves "switching off" units during training, and is thought to promote distributed representations. LLM pretraining sets are so large and diverse that dropout is less important compared to, e.g., CNNs. Dropout is more useful in fine-tuning, when datasets are smaller and the overfitting risk is higher. |
| | Should you output logits or log-softmax(logits)? | - | 91 | Calculating log-softmax inside the LLM is often fine and convenient during training and classification, but outputting the raw values allows for more flexibility in subsequent applications. |
| | The FineWeb dataset | part2_pretrain_FineWeb | 92 | HuggingFace provides several high-quality datasets for LLM training, including FineWeb (and several more specific variants). |
| | CodeChallenge: Fine dropout in model 5 (part 1) | part2_pretrain_CCmodel5dropout | 93 | Dropout is conceptually simple, but adding it to code can be tricky. There are many moving parts and parameters in an LLM; make sure to track variables and transformations. Models can learn very fast on small homogeneous datasets, but take longer to train on datasets with more variability. |
| | CodeChallenge: Fine dropout in model 5 (part 2) | part2_pretrain_CCmodel5dropout | 94 | Dropout is conceptually simple, but adding it to code can be tricky. There are many moving parts and parameters in an LLM; make sure to track variables and transformations. Models can learn very fast on small homogeneous datasets, but take longer to train on datasets with more variability. |
| | CodeChallenge: What happens to unused tokens? | part2_pretrain_CCunusedTokens | 95 | Because softmax affects all token logits, not just the ones currently being processed, all token embeddings are trained inside the model, not just the ones that appear in the current token sequence. This has implications for model output coherence and accuracy for tokens that are less commonly used in training datasets (e.g., obscure topics, new coding languages). |
| | Optimization options | - | 96 | There are many ways to decrease computation time during LLM training. Different strategies focus on the data, the model, or the hardware. LLMs take so long to pretrain that even a few milliseconds of improvement per batch can be significant. |
| Fine-tune pretrained models | What does "fine-tuning" mean? | - | 97 | Fine-tuning means making targeted adjustments to an existing pretrained model. It takes days or weeks instead of months, and has much smaller computational requirements. There are myriad choices in fine-tuning; the challenge comes from knowing what you want to do with the model. |
| | Fine-tune a pretrained GPT2 | part2_finetune_GPT2gulliver | 98 | Shortcuts are great, but make sure you understand the mechanisms before oversimplifying your code. Fine-tuning is easy to implement; the challenge comes from choosing the appropriate datasets. The learning rate should generally be lower than during pretraining. Model evaluation should rely on multiple qualitative and quantitative metrics. |
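
The sketches below expand on some of the take-home points above. They are minimal, self-contained illustrations, not the course's own code files.

BPE sketch (Byte-pair encoding algorithm row): a toy version of the merge loop that repeatedly replaces the most frequent adjacent token pair with a new, longer token. The corpus, number of merges, and word-internal-only merging are simplifications, not how a production tokenizer is built.

```python
from collections import Counter

# Toy BPE: start from characters and repeatedly merge the most frequent adjacent pair.
corpus = "low lower lowest low low"
words  = [tuple(w) for w in corpus.split()]   # each word as a tuple of characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])   # create the new, longer token
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    return merged

for _ in range(5):                            # five merges for illustration
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    print(f"merged {pair} -> new token '{pair[0] + pair[1]}'")

print(words)
```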
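
Tokenizer sketch (Exploring ChatGPT4's tokenizer row): a quick look at an OpenAI tokenizer via the tiktoken library. The encoding name "cl100k_base" is an example choice here; the course file part1_text2num_GPT4tokenizer may use a different model or encoding name.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # example encoding; GPT2 uses "gpt2"

text   = "Tokenizers keep preceding spaces as part of tokens."
ids    = enc.encode(text)                    # text -> list of integer token ids
pieces = [enc.decode([i]) for i in ids]      # decode each id to see the subword strings

print(ids)
print(pieces)                                # note tokens like ' preceding' that start with a space
print(enc.decode(ids) == text)               # round-trip check
```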
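
Cosine-similarity sketch (Cosine similarity row): a numeric illustration of the point that both measures divide by the vector norms, and Pearson correlation additionally mean-centers. The two vectors are made-up numbers.

```python
import numpy as np

a = np.array([0.2, 1.0, 3.1, 4.0, 5.2])
b = np.array([1.1, 2.0, 2.9, 4.2, 4.8])

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_r(x, y):
    # Pearson r = cosine similarity of the mean-centered vectors
    return cosine_sim(x - x.mean(), y - y.mean())

print("cosine :", cosine_sim(a, b))
print("pearson:", pearson_r(a, b))
print("corrcoef:", np.corrcoef(a, b)[0, 1])   # matches pearson_r
```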
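
Unembedding sketch (Unembeddings row): multiply a final hidden-state vector by the transpose of the (tied) embeddings matrix to get one logit per vocab item, then pick the largest (greedy) or sample. The sizes and the random "hidden state" are placeholders for illustration.

```python
import torch

vocab_size, emb_dim = 1000, 64
W_E    = torch.randn(vocab_size, emb_dim)   # embeddings matrix (one row per token)
hidden = torch.randn(emb_dim)               # final-position hidden state from the model

logits     = hidden @ W_E.T                 # "unembedding": one score per vocab token
next_token = torch.argmax(logits).item()    # greedy choice of the next token id
print(logits.shape, next_token)
```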
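
Position-embeddings sketch (Position embeddings row): learned absolute position embeddings added to token embeddings, as in GPT2-style models. The dimensions are arbitrary examples.

```python
import torch
import torch.nn as nn

vocab_size, context_len, emb_dim = 1000, 32, 64
tok_emb = nn.Embedding(vocab_size, emb_dim)      # token embeddings
pos_emb = nn.Embedding(context_len, emb_dim)     # learned position embeddings

token_ids = torch.randint(0, vocab_size, (1, 10))   # batch of 1, sequence of 10 tokens
positions = torch.arange(token_ids.shape[1])        # 0, 1, ..., 9
x = tok_emb(token_ids) + pos_emb(positions)         # broadcast add -> (1, 10, emb_dim)
print(x.shape)
```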
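
Loss-function sketch (Loss function to train the embeddings row): negative log likelihood on next-token logits. nn.CrossEntropyLoss combines log-softmax and NLL in one step, which is the usual choice when the model outputs raw logits; the logits and targets here are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, batch = 1000, 8
logits  = torch.randn(batch, vocab_size)             # model outputs (raw scores)
targets = torch.randint(0, vocab_size, (batch,))     # the "correct" next-token ids

# option 1: log-softmax followed by NLL
loss1 = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# option 2: cross-entropy does both internally (numerically stabler, more common)
loss2 = nn.CrossEntropyLoss()(logits, targets)

print(loss1.item(), loss2.item())   # identical values
```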
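
Softmax-and-sampling sketch (Softmax/temperature and torch.multinomial rows): softmax with a temperature divisor, then probabilistic next-token selection with torch.multinomial. The logits are made-up numbers for a tiny vocab.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.2, -1.0])   # made-up raw scores for a 4-token vocab

def softmax_with_temperature(logits, temperature=1.0):
    # higher temperature flattens the distribution; lower temperature sharpens it
    return torch.softmax(logits / temperature, dim=-1)

for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    # torch.multinomial maps probability values directly onto selection probabilities
    sample = torch.multinomial(probs, num_samples=1).item()
    print(f"T={T}: probs={probs.numpy().round(3)}, sampled token index={sample}")
```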
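
Layernorm sketch (What, why, when, and how to layernorm row): what nn.LayerNorm does to each embedding vector (zero mean, unit variance over the embedding dimension, then a learned gain and bias). The manual version reproduces PyTorch's formula with its default eps.

```python
import torch
import torch.nn as nn

emb_dim = 64
x  = torch.randn(2, 10, emb_dim) * 5 + 3          # (batch, tokens, embedding), badly scaled
ln = nn.LayerNorm(emb_dim)                        # learned gain and bias, initialized to 1 and 0

# manual layernorm over the embedding dimension (same formula PyTorch uses, eps = 1e-5)
mu  = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_manual = (x - mu) / torch.sqrt(var + 1e-5)

print(torch.allclose(ln(x), x_manual, atol=1e-5))  # True: same normalization
```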
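
Causal-mask sketch (Temporal causality and averaging-the-past rows): a lower-triangular matrix that averages the current and past tokens while zeroing out the future, plus the equivalent -inf/softmax formulation used inside attention. A 5-token toy sequence stands in for real embeddings.

```python
import torch

T, emb_dim = 5, 4
x = torch.randn(T, emb_dim)                      # toy sequence of 5 token embeddings

mask = torch.tril(torch.ones(T, T))              # lower-triangular: future positions are 0
weights = mask / mask.sum(dim=1, keepdim=True)   # each row averages the current and past tokens

past_avg = weights @ x                           # row t = mean of embeddings 0..t
print(torch.allclose(past_avg[2], x[:3].mean(dim=0)))   # True

# equivalent formulation with -inf masking and softmax (the form used inside attention)
scores   = torch.zeros(T, T).masked_fill(mask == 0, float("-inf"))
weights2 = torch.softmax(scores, dim=1)
print(torch.allclose(weights, weights2))         # True
```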
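
Attention sketch (attention theory and CodeChallenge rows): a single-head causal self-attention in the standard scaled dot-product form. Variable names and sizes are illustrative, not the course's part2_build code.

```python
import math
import torch
import torch.nn as nn

T, emb_dim, head_dim = 6, 32, 16
x = torch.randn(1, T, emb_dim)              # (batch, tokens, embedding)

# trainable weights (fixed during inference); Q, K, V are the resulting activations
W_q = nn.Linear(emb_dim, head_dim, bias=False)
W_k = nn.Linear(emb_dim, head_dim, bias=False)
W_v = nn.Linear(emb_dim, head_dim, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)            # each (1, T, head_dim)

scores = Q @ K.transpose(-2, -1) / math.sqrt(head_dim)   # (1, T, T) similarity scores
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))      # block attention to the future

attn = torch.softmax(scores, dim=-1)        # each row is a probability distribution
out  = attn @ V                             # context-weighted mixture of value vectors
print(out.shape)                            # (1, T, head_dim)
```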
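
AdamW sketch (The AdamW optimizer row): in PyTorch the Adam-vs-AdamW distinction is just a choice of optimizer class. Adam with weight_decay folds the L2 penalty into the gradient (so it is rescaled by the adaptive step), whereas AdamW shrinks the weights directly. The model and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # placeholder model

# Adam + L2: the penalty lambda*theta is added to the gradient, then adaptively rescaled
opt_adam  = torch.optim.Adam(model.parameters(),  lr=1e-3, weight_decay=0.01)

# AdamW: the adaptive step uses the raw gradient, and the weights are shrunk separately:
#   theta <- theta - lr * lambda * theta   (decoupled, "constant shrinkage")
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# with weight_decay=0, the two optimizers perform identical updates
```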
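
KL-divergence sketch (Train a model to like "X" row): KL divergence as a loss over a next-token probability distribution rather than a single target token. The target distribution here is an arbitrary example that boosts one token index.

```python
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(1, vocab_size, requires_grad=True)    # model output for one position

# an arbitrary target distribution (must sum to 1), e.g., one that boosts token 3
target = torch.full((1, vocab_size), 0.5 / (vocab_size - 1))
target[0, 3] = 0.5

# F.kl_div expects log-probabilities as input and probabilities as target
log_probs = F.log_softmax(logits, dim=-1)
loss = F.kl_div(log_probs, target, reduction="batchmean")

loss.backward()       # gradients push the model's distribution toward the target
print(loss.item())
```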
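
Weight-initialization sketch (Weight initializations row): small-valued initialization with zeroed biases. The fixed std of 0.02 follows the GPT2 convention; scaling the std by matrix size is noted as an alternative. Check the course file part2_pretrain_weightsInits for the exact scheme used in the lectures.

```python
import torch.nn as nn

def init_weights(module):
    # small-valued initializations; bias terms set to zero
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)   # GPT2-style fixed small std
        # alternative: scale by matrix size, e.g., std = module.in_features ** -0.5
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32), nn.GELU(), nn.Linear(32, 100))
model.apply(init_weights)    # .apply() visits every submodule recursively
```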