A deep understanding of AI large language model mechanisms

| Section name | Video lecture title | Python code file name | Udemy video # | Key take-home points |
|---|---|---|---|---|
| Introductions | [IMPORTANT] Prerequisites and how to succeed in this course | - | 1 | LLM architecture, training, and mechanisms are advanced topics. Have a positive attitude and embrace the challenge. Taking notes by hand helps you learn more and remember better. Lecture notes are not available. Having experience with coding, linear algebra, machine learning, and deep learning will help you excel in this course, although these topics are introduced as they become necessary. You can choose which lectures to watch and which to skip, but keep in mind that knowledge and skills are cumulative. |
| | Using the Udemy platform | - | 2 | - |
| | Getting the course code, and the detailed overview | - | 3 | - |
| | Do you need a Colab Pro subscription? | - | 4 | You can access GPUs for free on Google Colab, though compute time and RAM are limited. Most of this course can be done on the CPU and limited GPU (free Colab plan), but it will be slower. Paying to upgrade to Colab Pro will be convenient for many lectures but is not necessary. You can upgrade to Pro and downgrade after the course. |
| | About the "CodeChallenge" videos | - | 5 | You are the master of your education, and you should engage with this course in a way that best suits your skills and goals. Use different difficulty levels for different videos. Watch the code demos, even if you aren't coding yourself. |
| Part 1: Tokenizations and embeddings | | | | |
| Words to tokens to numbers | Why text needs to be numbered | - | 7 | Text must be transformed into numbers before it can be fed into an LLM. A chunk of text is a "token" and can be a character, subword, or full word. Embeddings are dense representations of tokens. Tokenization and embeddings are learned from data, and there are many ways to create these schemes. |
| | Parsing text to numbered tokens | part1_text2num_text2numbers | 8 | Text can be split into words via spaces, although this is not done in real tokenization. Encoder and decoder functions are simple look-up tables. Tokenization (encoding text using integers) is conceptually straightforward. |
| | CodeChallenge: Create and visualize tokens (part 1) | part1_text2num_CCmakeATokenizer | 9 | Encoders and decoders are created using dictionary comprehensions and functions. The "context" of a token is its neighbors (before and possibly after); the "context window" is the number of neighbors. |
| | CodeChallenge: Create and visualize tokens (part 2) | part1_text2num_CCmakeATokenizer | 10 | "One-hot encoding" is a sparse tokenization, with one row per token and one column per vocab item. |
| | Preparing text for tokenization | part1_text2num_preparingText4tokens | 11 | Real text from the web is easy to import but a pain to clean… Creating a tokenizing scheme is tricky and involves many choices with few clear optimal decisions. Encoders and decoders are easy to create and use. |
| | CodeChallenge: Tokenizing The Time Machine | part1_text2num_CCtimeMachineTokens | 12 | Randomly generated tokens are usually nonsensical. There is a mathematical relationship between word length and frequency (more on this later!), which has implications for LLM performance. Tokenizers need special characters to deal with unknown tokens. |
| | Tokenizing characters vs. subwords vs. words | - | 13 | Every tokenization scheme has advantages and limitations. Current best tokenizers use a combination of characters, subwords, and words. The vocab is learned based on statistical characteristics of human-written text. |
| | Byte-pair encoding algorithm | part1_text2num_bytePairEncoding | 14 | The BPE algorithm is based on replacing frequent token sequences with new tokens (see the BPE sketch after this table). The basic BPE algorithm is simple and easy to implement. Production-level tokenizers add several more steps to ensure accuracy, efficiency, and speed. |
| | CodeChallenge: Byte-pair encoding to a desired vocab size | part1_text2num_CCbytePairEncodingLoop | 15 | Even simple byte-pair encoding on a tiny dataset creates tokens with preceding spaces, just like professional tokenizers. |
| | Exploring ChatGPT4's tokenizer | part1_text2num_GPT4tokenizer | 16 | OpenAI's tokenizer is available, but is model-specific (e.g., GPT2 vs. GPT4). Tokenizers use character-, subword-, and word-level tokens. Preceding spaces are part of tokens. No text preprocessing is necessary! Just feed all the text into the tokenizer (see the tokenizer sketch after this table). |
| | CodeChallenge: Token count by subword length (part 1) | part1_text2num_CCtokenEfficiency | 17 | Words and tokens differ in several ways, though they can overlap. Words vary in their encoding efficiency, which is partly related to how often they appear in texts. |
| | CodeChallenge: Token count by subword length (part 2) | part1_text2num_CCtokenEfficiency | 18 | Words and tokens differ in several ways, though they can overlap. Words vary in their encoding efficiency, which is partly related to how often they appear in texts. |
| | How many "r"s in strawberry? | part1_text2num_strawberry | 19 | ChatGPT has difficulties with letter-based calculations because it represents words as tokens. Asking ChatGPT to implement its calculations in Python increases accuracy (the same holds for math problems). |
| | CodeChallenge: Create your algorithmic rapper name :) | part1_text2num_CCrapperName | 20 | Tokenization can be fun! Tokenizers take lists, not ints, as input. |
| | Tokenization in BERT | part1_text2num_BERT | 21 | Different tokenizers are optimized for different purposes. The BERT tokenizer has more words (vs. subwords) than GPT, and is therefore more human-interpretable. The BERT tokenizer by default adds special tokens before and after the text. Don't forget about this! |
| | CodeChallenge: Character counts in BERT tokens | part1_text2num_CCbertChars | 22 | Professional tokenizers are easy to work with once you get used to them. Tokenizers contain "special tokens" that you might want to filter out of analyses. |
| | Translating between tokenizers | part1_text2num_tokenTranslation | 23 | Write your code for one specific tokenizer. Choose the tokenizer (and LLM!) based on your goals. |
| | CodeChallenge: More on token translation | part1_text2num_CCtranslatorFuns | 24 | The more familiar you are with tokenization, the easier it will be to understand embeddings and LLM mechanisms. |
| | CodeChallenge: Tokenization compression ratios | part1_text2num_CCtokenCompression | 25 | Importing text data from the web is really easy. Tokenization = compression? The primary goal of a tokenizer is to make text more efficient for LMs, but compression is a common byproduct due to redundancies in (some) written languages. Token compression ratios are stable across different texts, with higher variability for text that includes code. |
| | Tokenization in different languages | part1_text2num_languages | 26 | Tokenization ≠ compression (but it often is). Languages with more complex written forms (e.g., morphemes in Chinese, richer morphology in Tamil) may require more tokens than characters. Tokenization is less effective for languages with less training data. |
| | CodeChallenge: Zipf's law in characters and tokens | part1_text2num_CCzipf | 27 | Zipf's law, a.k.a. power-law scaling, a.k.a. scale-free organization, a.k.a. fractal-like, a.k.a. self-similarity, is a pervasive characteristic of biological and physical systems, and is taken as evidence of complex systems. Modern computing tools and accessible digitized datasets allow you to explore nature in ways that were unthinkable until very recently. |
| | Word variations in Claude tokenizer | part1_text2num_ClaudeVariations | 28 | Spaces are meaningful to humans, but are treated just like any other character by tokenizers. Lots of subwords and words have preceding spaces. Language models need a huge amount of data to learn all the ambiguities, errors, and varieties in human text. |
| Embedding spaces | Word2Vec vs. GloVe vs. GPT vs. BERT... oh my! | - | 29 | There are several word embeddings matrices that have different goals and applications, and are created in different ways. It is not trivial to compare them (more on this in the mech. interp. section). The embeddings used in LLMs are not fixed, but instead are adjusted by the model based on context. |
| | Exploring GloVe pretrained embeddings | part1_embed_GloVe | 30 | The GloVe embeddings can be used to study texts and relations between words. The embeddings vectors are fixed once trained. GloVe is not used in LLMs, partly because of its large, word-based vocabularies. Cosine similarity has many applications in language modeling and computational linguistics. |
| | CodeChallenge: Wikipedia vs. Twitter embeddings (part 1) | part1_embed_CCwikiVsTwitter | 31 | There is a lot of diversity across word embeddings matrices. It is difficult or impossible to compare different embeddings matrices directly, but sets of relationships can be compared (see RSA in the next section!). Visualizing embeddings vectors often helps with interpretation. |
| | CodeChallenge: Wikipedia vs. Twitter embeddings (part 2) | part1_embed_CCwikiVsTwitter | 32 | There is a lot of diversity across word embeddings matrices. It is difficult or impossible to compare different embeddings matrices directly, but sets of relationships can be compared (see RSA in the next section!). Visualizing embeddings vectors often helps with interpretation. |
| | Exploring GPT2 and BERT embeddings | part1_embed_GPT2BERT | 33 | All model parameters of publicly available LLMs are easily accessible. GPT2 and BERT embeddings are not directly comparable, though model comparisons are possible (see next section). LLM embeddings are not fixed! They are adjusted by the attention and MLP layers as they pass through the LM. |
| | CodeChallenge: Math with tokens and embeddings | part1_embed_CCmathWithTokens | 34 | Token indices are arbitrarily mapped to meaningful subwords (numbers). "Token math" makes no sense (but can be fun :P ). Embeddings vectors, however, can be manipulated mathematically, which is how LLM attention works! Use LLMs for math theory and explanations, and have them solve math problems by writing code to solve the problems. |
| | Cosine similarity (and relation to correlation) | part1_embed_cosineSimilarity | 35 | Cosine similarity is one of the most commonly used measures of a relationship between two variables when working with embeddings and LLMs. Cosine similarity is related to correlation: both variance-normalize; Pearson additionally mean-centers. Use correlation to quantify a linear relationship in data that have different scales or mean offsets; when the variables are on the same scale and offsets are meaningful, use cosine similarity. This is often the case in LLM investigations. When learning a new technical topic, try to code the math yourself; in applications, use established libraries if available and if it's easier (see the cosine-similarity sketch after this table). |
| | CodeChallenge: GPT2 cosine similarities | part1_embed_CCcosineSimilaritiesGPT2 | 36 | Multitoken words present several challenges in LLM investigations (hint: give the model context and only analyze the final token). Understanding the math of analyses makes you a better and more flexible data scientist. |
| | CodeChallenge: Unembeddings (vectors to tokens) | part1_embed_CCunembedding | 37 | An "unembeddings matrix" is the conceptual inverse of the embeddings matrix (but not a literal inverse matrix). The token with the largest unembeddings value is the next token in a generated sequence. Generated text can quickly lose meaning, which was a major hurdle for language models to overcome. The broad-strokes mechanism of next-token generation is simple, both conceptually and mathematically (see the unembedding sketch after this table). |
| | Position embeddings | part1_embed_positionEmbeddings | 38 | Language models use position embeddings to focus on tokens from various locations ("time points") in the token sequence. Predefined position embeddings are probably good enough for smaller models; modern architectures use learned embeddings. Cosine similarity matrices can be tricky to interpret, but convey a lot of information (see the position-embeddings sketch after this table). |
| | CodeChallenge: Exploring position embeddings | part1_embed_CCpositionExplorations | 39 | The position embeddings matrix is complicated and impacts token processing in ways that are difficult to predict a priori. Visual appearances (especially of apparent null effects) should be statistically evaluated using shuffled data before making strong interpretations. The shuffling method here was too liberal; circular shifting and spectral phase scrambling are better. |
| | Training embeddings from scratch | - | 40 | Embeddings matrices in LLMs are trained with the rest of the model. In the next several videos, however, we will train only an embeddings layer, to focus on the mechanisms of learning embeddings. |
| | Create a data loader to train a model | part1_embed_learnEmbeddings | 41 | Preparing data to train models can become complicated (a common experience in many data fields…). Simpler data organization methods are possible, but can be suboptimal for professional-grade model training. |
| | Build a model to learn the embeddings | part1_embed_learnEmbeddings | 42 | Embeddings matrices are learned from text data using gradient descent and dimension-squeezing deep learning models. The embedding dimension is preserved throughout the entire language model. Language models generate text by concatenating one new token onto an existing token sequence. |
| | Loss function to train the embeddings | part1_embed_lossfunction | 43 | Negative log likelihood is the main loss function used in language model training. Log softmax increases sensitivity at small probabilities, and gives a stronger penalty for errors. Most loss functions (and other mathematical bases of DL) are simple and well-defined. Difficulties arise from the explosive dimensionality of models (see the loss-function sketch after this table). |
| | Train and evaluate the model | part1_embed_learnEmbeddings | 44 | Language models are trained on a small number of epochs, because each epoch has a huge amount of data (batches). GPU access will be increasingly important as the course progresses… |
| | CodeChallenge: How the embeddings change | part1_embed_learnEmbeddings | 45 | The embeddings vectors expanded to fill up more of the embeddings space, reflecting encoding of high-dimensional information in a lower-dimensional space. Interpretable semantic relationships (as measured through Sc) emerge even with a small amount of targeted training data, though the effects are weak. |
| | CodeChallenge: How stable are embeddings? | part1_embed_learnEmbeddings | 46 | Embeddings vectors of the same token appear to be completely unrelated across repeated training runs. Cosine similarity appears to be more consistent for related tokens. Relative embeddings within a matrix are more interpretable than absolute vectors or comparisons across matrices. The loss profile is relevant, but is just a tiny glimpse at what happens during training. |
| Part 2: Large language models | | | | |
| Build a GPT | Why build when you can download? | - | 48 | Building a model from scratch is a fantastic way to learn how LLMs are created and trained. Please don't ever build an LLM from scratch again (except for more education). |
| | Model 1: Embedding (input) and unembedding (output) | part2_build_model1 | 49 | Saying that language models "use only the final token" for next-token prediction is not accurate. They use all tokens; the final token contains the most information. Next-token selection is probabilistic. LLMs get complicated quickly; it's good to learn about them one step at a time. |
| | Understanding nn.Embedding and nn.Linear | part2_build_embeddingVlinear | 50 | nn.Embedding is a convenient wrapper for nn.Parameter. Use it to create embeddings matrices. |
| | CodeChallenge: GELU vs. ReLU | part2_build_CCreluVgelu | 51 | GELU is the most common activation function in LLMs. It is more complicated than ReLU but smoother. Many PyTorch operations are available as both functions and classes; using one or the other is sometimes a necessity and sometimes a personal choice. Measuring computation time is not trivial, because of optimized implementations, GPU fusing, overhead, etc. |
| | Softmax (and temperature): math, numpy, and pytorch | part2_build_softmax | 52 | Softmax transforms LLM outputs from "raw" values (logits) into a probability distribution. Temperature increases stochasticity, which helps language models produce more "creative" text. Typical temperatures are 0.5 to 1.5. You'll learn more about nuances of softmax theory and implementation in later videos (see the softmax-and-sampling sketch after this table). |
| | Randomly sampling words with torch.multinomial | part2_build_multinomial | 53 | torch.multinomial directly maps token probability values onto selection probabilities. Many PyTorch functions are sensitive to input data type; this is something to check if (when) you get errors! NumPy and PyTorch functions can sometimes produce identical results given proper input arguments, but often not by default. Be careful when translating between libraries. |
| | Other token sampling methods: greedy, top-k, and top-p | - | 54 | There is no right or wrong token selection method. Random sampling can increase response variability, which might be good for social chatting but bad for coding or legal documents. The sampling method interacts with softmax temperature: higher temperatures have stronger boosts of fewer tokens. |
| | CodeChallenge: More softmax explorations | part2_build_CCsoftmaxExtreme | 55 | Softmax is mathematically simple, but some aspects of the transformation are apparent only for some numerical ranges. The number of data values (e.g., vocab size) impacts the probability values, because they must all sum to 1. |
| | What, why, when, and how to layernorm | part2_build_layernorm | 56 | Layernorm is simple and critical. "Set it and forget it": the learned parameters have little theoretical relevance other than preserving numerical stability throughout the model (see the layernorm sketch after this table). |
| | Model 2: Position embedding, layernorm, tied output, temperature | part2_build_model2 | 57 | Position embeddings are added to the token embeddings, and help the model learn temporal patterns. Embeddings are trained inside the model, not separately as with tokenization (cf. previous section). Tokens in a sequence are processed simultaneously, not in a for-loop (causality comes from the attention mechanism). |
| | Temporal causality via linear algebra (theory) | - | 58 | Time-causal attention can be implemented using a time vector in which the future is weighted zero while the past is weighted non-zero. Causality can be implemented using matrices and softmax probabilities, which is very computationally efficient on GPUs. Causal attention is not necessary for LLMs, but improves next-token generation (good for chatbots). |
| | Averaging the past while ignoring the future (code) | part2_build_pastWithLinalg | 59 | Matrix multiplication with masks is an efficient way to avoid for-loops (see the causal-mask sketch after this table). PyTorch has optimized functions that fuse the attention algorithm, including the causal mask (next lecture!). |
| | The "attention" algorithm (theory) | - | 60 | The "attention" mechanism is a clever way of pooling information across different embeddings vectors that allows the surrounding context to modify the current token transformation. All analogies break down; some are useful (or at least entertaining). The full LLM Transformer architecture is more complicated than just attention, but attention is a key component. |
| | CodeChallenge: Code Attention manually and in Pytorch | part2_build_CCattentionAlgo | 61 | The Q, K, and V weights matrices are trainable parameters that are not changed during inference (applications); the Q, K, and V activations result from multiplying those weights with the token embeddings vectors (see the attention sketch after this table). "Highly optimized" functions can be hardware-specific. Very powerful LLMs require specialized hardware, the sale of which is regulated by governments. |
| | Model 3: One attention head | part2_build_model3 | 62 | Attention adjusts (not replaces) the embeddings vectors as they pass through the model. Different tokenizers (and also different pretrained models) use different terms and variable names. Understanding how LLMs work will help you identify features in a model. |
| | The Transformer block (theory) | - | 63 | The Transformer block contains an attention sublayer and an MLP sublayer. Together, they calculate an adjustment to the token embedding to point it towards an appropriate next-token embedding. Expansion-nonlinearity-contraction is a typical MLP architecture for feature extraction and linear separability. LLMs comprise dozens of Transformer blocks that learn different features and timescales of texts. |
| | The Transformer block (code) | part2_build_transformer | 64 | You can now implement a GPT-style Transformer :) Separating modules into callable classes helps keep code neat and organized. Both Transformer sublayers comprise the operations copy → normalize → adjust → add back to the copy. |
| | Model 4: Multiple Transformer blocks | part2_build_model4 | 65 | Use nn.Sequential to create repeated instances of the same model component (with different weights matrices). Specialization of Transformer blocks is not imposed by the architecture, but is thought to be an emergent property. We'll discuss this more in the mechanistic interpretability sections. |
| | Multihead attention: theory and implementation | part2_build_multiheadAttention | 66 | Multihead attention (MHA) involves applying the attention equation to different slices of the Q, K, and V matrices. MHA is thought to increase the richness and complexity of feature isolation and context-sensitivity, without increasing the number of trainable parameters. Data from all heads are linearly combined via W0. |
| | Working on the GPU | part2_build_GPU | 67 | GPUs are great at number-crunching, and can save a lot of time in LLM training and applications. On the other hand, the CPU can be sufficient for a lot of explorations and investigations of LLMs. Accessing and using a GPU might be expensive and requires additional code, so use it only when it's really beneficial. |
| | Model 5: Complete GPT2 on the GPU | part2_build_model5 | 68 | We just built a GPT2-small :) although the weights are all random, so it's not functional. Commercial models, e.g., GPT4, have the same architecture and computations, but have more layers and parameters. But quantitative increases can have qualitative impacts. |
| | CodeChallenge: Time model5 on CPU and GPU | part2_build_CCgpuVsCpu | 69 | LLMs are basically worthless without a large number of dedicated high-end GPUs. Regulating GPU access/sales is part of AI safety. |
| | Inspecting OpenAI's GPT2 | part2_build_OpenAIGPT2 | 70 | Publicly available pretrained models are really easy to use, explore, experiment with, and learn from. |
| | Summarizing GPT using equations | - | 71 | There are many ways to understand a deep learning model; a combination of perspectives (diagrams, explanations, code, equations) is most beneficial. Rewriting equations and checking matrix sizes can help you understand the flow and order of calculations. Neither Q nor K directly contributes to the embedding vector adjustment; their combination determines which vectors in V guide next-token selection. |
| | Visualizing nano-GPT | https://bbycroft.net/llm | 72 | A neat website where you can visualize different GPT variants. |
| | CodeChallenge: How many parameters? (part 1) | part2_build_CCparameterCounts | 73 | Biases are a tiny fraction of model parameters, which is why many deep learning models (especially models with layernorm) ignore them. MLP layers are dense and can have 2-3x as many parameters as attention layers. The number of parameters does not equate to importance in the model (cf. layernorm). Examining and dissecting models are useful skills. |
| | CodeChallenge: How many parameters? (part 2) | part2_build_CCparameterCounts | 74 | Biases are a tiny fraction of model parameters, which is why many deep learning models (especially models with layernorm) ignore them. MLP layers are dense and can have 2-3x as many parameters as attention layers. The number of parameters does not equate to importance in the model (cf. layernorm). Examining and dissecting models are useful skills. |
| | CodeChallenge: GPT2 trained weights distributions | part2_build_CCweightsDists | 75 | Characteristics of model weights often have smooth transitions across layers, reflecting shifts in representations and calculations. In histograms, use counts for equal sample sizes, and use density (or other scaling) for unequal sample sizes. |
| | CodeChallenge: Do we really need Q? | part2_build_CClobotomizeQ | 76 | Causal manipulations are easy to implement, although knowing what to manipulate is challenging and non-trivial. Model evaluations are tricky. There are quantitative evaluation methods, but many evaluations are based on qualitative inspection of generated text. |
| Pretrain LLMs | What is "pretraining" and is it necessary? | - | 77 | Pretraining is a necessary first step for any LLM. A pretrained model understands the structure and patterns of written language, and can generate text. Pretraining to create a useful modern base model is prohibitively expensive for most individuals and companies. You should learn how pretraining works, but don't try it at home ;) |
| | Introducing huggingface.co | - | 78 | HuggingFace provides resources for downloading pretrained LLMs (and other models), training datasets, and more. We will use some of the free resources in this course. You do not need a HuggingFace login to access materials for this course. |
| | The AdamW optimizer | - | 79 | L2 regularization in Adam involves updating and regularizing in one step (i.e., regularizing the update). AdamW updates the weights first, then regularizes (i.e., regularizing the weights). Without regularization, Adam == AdamW. AdamW implements "constant shrinkage" instead of "adaptive shrinkage," and is empirically better in large models (see the AdamW sketch after this table). |
| | CodeChallenge: SGD vs. Adam vs. AdamW | part2_pretrain_CCsgdVsAdams | 80 | Simple mechanisms work better in simple models. Adam is adaptive and more likely to stabilize. Gradient accumulation speeds learning, but risks over-generalizing by pooling across more training data. Gradient accumulation is only used for training very large models, and you will probably always want to reset the gradients. |
| | Train model 1 | part2_pretrain_model1 | 81 | Code to train LLMs has the same basic organization as code to train any deep learning model. Even very simple models with limited training quickly learn text structure such as punctuation and line breaks. |
| | CodeChallenge: Add a test set | part2_pretrain_CCmodel1test | 82 | Creating and evaluating a test set is not so difficult, but requires extra code. Train/test splits are less important for pretraining LLMs, but are still good practice. Additional subtleties about devsets, model vs. researcher overfitting, etc., are not discussed here. |
| | CodeChallenge: Train model 1 with GPT2's embeddings | part2_pretrain_CCmodel1withEmbeds | 83 | Freezing weights is a common technique in deep learning for transfer learning or when re-training on limited data. Freezing weights is not necessarily advantageous, especially in simple models or if the new training data differ from the previously trained data. |
| | CodeChallenge: Train model 5 with modifications | part2_pretrain_model5WithMods | 84 | There are several ways to sample data, depending on how meticulous you want the procedure to be. Some overlap or skipping is less consequential with limitless training data. Models learn to produce "language-looking" text very quickly. |
| | Create a custom loss function | part2_pretrain_customLoss | 85 | Loss functions are extremely important for training deep learning models. Loss functions should be as simple as possible, both mathematically and in code implementation. Knowing how to create your own loss function gives you more control and flexibility over precise model designs and outcomes. |
| | CodeChallenge: Train a model to like "X" | part2_pretrain_CCtrainXbias | 86 | KL divergence is a useful loss function for training on distributions instead of individual tokens (see the KL-divergence sketch after this table). It is easy to train biases into models. This has major implications for fairness, misuse, persuasion or manipulation, cultural or political biases, marketing, and other AI safety topics. |
| | CodeChallenge: Numerical scaling issues in DL models | part2_pretrain_CCscaling | 87 | Each matrix multiplication (dot products) increases the variance and numerical range of the numbers. This can have a negative impact on softmax probabilities by flattening the distribution. Repeated normalization is very important for the stability of deep learning models, including LLMs. |
| | Weight initializations | part2_pretrain_weightsInits | 88 | Weight initialization is not important in small models, but is crucial for training large models, including LLMs. Weights should be initialized to small values, often proportional to the matrix sizes. Bias terms are typically ignored in initializations, because there are so few of them. Having some sensible initialization is important; the exact details and distribution shape seem to be less relevant (see the weight-initialization sketch after this table). |
| | CodeChallenge: Train model 5 with weight inits | part2_pretrain_CCmodel5weightInits | 89 | Now you know how to initialize weights :) Examining changes in the model during learning is an approach used in mechanistic interpretability. Weights distributions tend to widen as the models learn more diverse patterns and representations. |
| | Dropout in theory and in Pytorch | part2_pretrain_dropout | 90 | Dropout involves "switching off" units during training, and is thought to promote distributed representations. LLM pretraining sets are so large and diverse that dropout is less important compared to, e.g., CNNs. Dropout is more useful in fine-tuning, when datasets are smaller and the overfitting risk is higher. |
| | Should you output logits or log-softmax(logits)? | - | 91 | Calculating log-softmax inside the LLM is often fine and convenient during training and classification, but outputting the raw values allows for more flexibility in subsequent applications. |
| | The FineWeb dataset | part2_pretrain_FineWeb | 92 | HuggingFace provides several high-quality datasets for LLM training, including FineWeb (and several more specific variants). |
| | CodeChallenge: Fine dropout in model 5 (part 1) | part2_pretrain_CCmodel5dropout | 93 | Dropout is conceptually simple, but adding it to code can be tricky. There are many moving parts and parameters in an LLM; make sure to track variables and transformations. Models can learn very fast on small homogeneous datasets, but take longer to train on datasets with more variability. |
| | CodeChallenge: Fine dropout in model 5 (part 2) | part2_pretrain_CCmodel5dropout | 94 | Dropout is conceptually simple, but adding it to code can be tricky. There are many moving parts and parameters in an LLM; make sure to track variables and transformations. Models can learn very fast on small homogeneous datasets, but take longer to train on datasets with more variability. |
| | CodeChallenge: What happens to unused tokens? | part2_pretrain_CCunusedTokens | 95 | Because softmax affects all token logits, not just the ones currently being processed, all token embeddings are trained inside the model, not just the ones that appear in the current token sequence. This has implications for model output coherence and accuracy for tokens that are less commonly used in training datasets (e.g., obscure topics, new coding languages). |
| | Optimization options | - | 96 | There are many ways to decrease computation time during LLM training. Different strategies focus on the data, the model, or the hardware. LLMs take so long to pretrain that even a few milliseconds of improvement per batch can be significant. |
| Fine-tune pretrained models | What does "fine-tuning" mean? | - | 97 | Fine-tuning means making targeted adjustments to an existing pretrained model. It takes days or weeks instead of months, and has much smaller computational requirements. There are myriad choices in fine-tuning; the challenge comes from knowing what you want to do with the model. |
| | Fine-tune a pretrained GPT2 | part2_finetune_GPT2gulliver | 98 | Shortcuts are great, but make sure you understand the mechanisms before oversimplifying your code. Fine-tuning is easy to implement; the challenge comes from choosing the appropriate datasets. The learning rate should generally be lower than during pretraining. Model evaluation should rely on multiple qualitative and quantitative metrics. |
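
The sketches below expand on some of the take-home points above. They are minimal, self-contained illustrations, not the course's own code files.

BPE sketch (Byte-pair encoding algorithm row): a toy version of the merge loop that repeatedly replaces the most frequent adjacent token pair with a new, longer token. The corpus, number of merges, and word-internal-only merging are simplifications, not how a production tokenizer is built.

```python
from collections import Counter

# Toy BPE: start from characters and repeatedly merge the most frequent adjacent pair.
corpus = "low lower lowest low low"
words  = [tuple(w) for w in corpus.split()]   # each word as a tuple of characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])   # create the new, longer token
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    return merged

for _ in range(5):                            # five merges for illustration
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
    print(f"merged {pair} -> new token '{pair[0] + pair[1]}'")

print(words)
```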
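
Tokenizer sketch (Exploring ChatGPT4's tokenizer row): a quick look at an OpenAI tokenizer via the tiktoken library. The encoding name "cl100k_base" is an example choice here; the course file part1_text2num_GPT4tokenizer may use a different model or encoding name.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # example encoding; GPT2 uses "gpt2"

text   = "Tokenizers keep preceding spaces as part of tokens."
ids    = enc.encode(text)                    # text -> list of integer token ids
pieces = [enc.decode([i]) for i in ids]      # decode each id to see the subword strings

print(ids)
print(pieces)                                # note tokens like ' preceding' that start with a space
print(enc.decode(ids) == text)               # round-trip check
```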
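
Cosine-similarity sketch (Cosine similarity row): a numeric illustration of the point that both measures divide by the vector norms, and Pearson correlation additionally mean-centers. The two vectors are made-up numbers.

```python
import numpy as np

a = np.array([0.2, 1.0, 3.1, 4.0, 5.2])
b = np.array([1.1, 2.0, 2.9, 4.2, 4.8])

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson_r(x, y):
    # Pearson r = cosine similarity of the mean-centered vectors
    return cosine_sim(x - x.mean(), y - y.mean())

print("cosine :", cosine_sim(a, b))
print("pearson:", pearson_r(a, b))
print("corrcoef:", np.corrcoef(a, b)[0, 1])   # matches pearson_r
```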
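
Unembedding sketch (Unembeddings row): multiply a final hidden-state vector by the transpose of the (tied) embeddings matrix to get one logit per vocab item, then pick the largest (greedy) or sample. The sizes and the random "hidden state" are placeholders for illustration.

```python
import torch

vocab_size, emb_dim = 1000, 64
W_E    = torch.randn(vocab_size, emb_dim)   # embeddings matrix (one row per token)
hidden = torch.randn(emb_dim)               # final-position hidden state from the model

logits     = hidden @ W_E.T                 # "unembedding": one score per vocab token
next_token = torch.argmax(logits).item()    # greedy choice of the next token id
print(logits.shape, next_token)
```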
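
Position-embeddings sketch (Position embeddings row): learned absolute position embeddings added to token embeddings, as in GPT2-style models. The dimensions are arbitrary examples.

```python
import torch
import torch.nn as nn

vocab_size, context_len, emb_dim = 1000, 32, 64
tok_emb = nn.Embedding(vocab_size, emb_dim)      # token embeddings
pos_emb = nn.Embedding(context_len, emb_dim)     # learned position embeddings

token_ids = torch.randint(0, vocab_size, (1, 10))   # batch of 1, sequence of 10 tokens
positions = torch.arange(token_ids.shape[1])        # 0, 1, ..., 9
x = tok_emb(token_ids) + pos_emb(positions)         # broadcast add -> (1, 10, emb_dim)
print(x.shape)
```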
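
Loss-function sketch (Loss function to train the embeddings row): negative log likelihood on next-token logits. nn.CrossEntropyLoss combines log-softmax and NLL in one step, which is the usual choice when the model outputs raw logits; the logits and targets here are random placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, batch = 1000, 8
logits  = torch.randn(batch, vocab_size)             # model outputs (raw scores)
targets = torch.randint(0, vocab_size, (batch,))     # the "correct" next-token ids

# option 1: log-softmax followed by NLL
loss1 = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# option 2: cross-entropy does both internally (numerically stabler, more common)
loss2 = nn.CrossEntropyLoss()(logits, targets)

print(loss1.item(), loss2.item())   # identical values
```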
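
Softmax-and-sampling sketch (Softmax/temperature and torch.multinomial rows): softmax with a temperature divisor, then probabilistic next-token selection with torch.multinomial. The logits are made-up numbers for a tiny vocab.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.2, -1.0])   # made-up raw scores for a 4-token vocab

def softmax_with_temperature(logits, temperature=1.0):
    # higher temperature flattens the distribution; lower temperature sharpens it
    return torch.softmax(logits / temperature, dim=-1)

for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    # torch.multinomial maps probability values directly onto selection probabilities
    sample = torch.multinomial(probs, num_samples=1).item()
    print(f"T={T}: probs={probs.numpy().round(3)}, sampled token index={sample}")
```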
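
Layernorm sketch (What, why, when, and how to layernorm row): what nn.LayerNorm does to each embedding vector (zero mean, unit variance over the embedding dimension, then a learned gain and bias). The manual version reproduces PyTorch's formula with its default eps.

```python
import torch
import torch.nn as nn

emb_dim = 64
x  = torch.randn(2, 10, emb_dim) * 5 + 3          # (batch, tokens, embedding), badly scaled
ln = nn.LayerNorm(emb_dim)                        # learned gain and bias, initialized to 1 and 0

# manual layernorm over the embedding dimension (same formula PyTorch uses, eps = 1e-5)
mu  = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_manual = (x - mu) / torch.sqrt(var + 1e-5)

print(torch.allclose(ln(x), x_manual, atol=1e-5))  # True: same normalization
```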
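
Causal-mask sketch (Temporal causality and averaging-the-past rows): a lower-triangular matrix that averages the current and past tokens while zeroing out the future, plus the equivalent -inf/softmax formulation used inside attention. A 5-token toy sequence stands in for real embeddings.

```python
import torch

T, emb_dim = 5, 4
x = torch.randn(T, emb_dim)                      # toy sequence of 5 token embeddings

mask = torch.tril(torch.ones(T, T))              # lower-triangular: future positions are 0
weights = mask / mask.sum(dim=1, keepdim=True)   # each row averages the current and past tokens

past_avg = weights @ x                           # row t = mean of embeddings 0..t
print(torch.allclose(past_avg[2], x[:3].mean(dim=0)))   # True

# equivalent formulation with -inf masking and softmax (the form used inside attention)
scores   = torch.zeros(T, T).masked_fill(mask == 0, float("-inf"))
weights2 = torch.softmax(scores, dim=1)
print(torch.allclose(weights, weights2))         # True
```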
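
Attention sketch (attention theory and CodeChallenge rows): a single-head causal self-attention in the standard scaled dot-product form. Variable names and sizes are illustrative, not the course's part2_build code.

```python
import math
import torch
import torch.nn as nn

T, emb_dim, head_dim = 6, 32, 16
x = torch.randn(1, T, emb_dim)              # (batch, tokens, embedding)

# trainable weights (fixed during inference); Q, K, V are the resulting activations
W_q = nn.Linear(emb_dim, head_dim, bias=False)
W_k = nn.Linear(emb_dim, head_dim, bias=False)
W_v = nn.Linear(emb_dim, head_dim, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)            # each (1, T, head_dim)

scores = Q @ K.transpose(-2, -1) / math.sqrt(head_dim)   # (1, T, T) similarity scores
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))      # block attention to the future

attn = torch.softmax(scores, dim=-1)        # each row is a probability distribution
out  = attn @ V                             # context-weighted mixture of value vectors
print(out.shape)                            # (1, T, head_dim)
```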
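
AdamW sketch (The AdamW optimizer row): in PyTorch the Adam-vs-AdamW distinction is just a choice of optimizer class. Adam with weight_decay folds the L2 penalty into the gradient (so it is rescaled by the adaptive step), whereas AdamW shrinks the weights directly. The model and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # placeholder model

# Adam + L2: the penalty lambda*theta is added to the gradient, then adaptively rescaled
opt_adam  = torch.optim.Adam(model.parameters(),  lr=1e-3, weight_decay=0.01)

# AdamW: the adaptive step uses the raw gradient, and the weights are shrunk separately:
#   theta <- theta - lr * lambda * theta   (decoupled, "constant shrinkage")
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# with weight_decay=0, the two optimizers perform identical updates
```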
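
KL-divergence sketch (Train a model to like "X" row): KL divergence as a loss over a next-token probability distribution rather than a single target token. The target distribution here is an arbitrary example that boosts one token index.

```python
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(1, vocab_size, requires_grad=True)    # model output for one position

# an arbitrary target distribution (must sum to 1), e.g., one that boosts token 3
target = torch.full((1, vocab_size), 0.5 / (vocab_size - 1))
target[0, 3] = 0.5

# F.kl_div expects log-probabilities as input and probabilities as target
log_probs = F.log_softmax(logits, dim=-1)
loss = F.kl_div(log_probs, target, reduction="batchmean")

loss.backward()       # gradients push the model's distribution toward the target
print(loss.item())
```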
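
Weight-initialization sketch (Weight initializations row): small-valued initialization with zeroed biases. The fixed std of 0.02 follows the GPT2 convention; scaling the std by matrix size is noted as an alternative. Check the course file part2_pretrain_weightsInits for the exact scheme used in the lectures.

```python
import torch.nn as nn

def init_weights(module):
    # small-valued initializations; bias terms set to zero
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)   # GPT2-style fixed small std
        # alternative: scale by matrix size, e.g., std = module.in_features ** -0.5
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

model = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32), nn.GELU(), nn.Linear(32, 100))
model.apply(init_weights)    # .apply() visits every submodule recursively
```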