1 of 120

Machine Learning: Week 14 – Large Language Models

Seungtaek Choi

Division of Language & AI at HUFS

seungtaek.choi@hufs.ac.kr

2 of 120

Deep Learning

Attention

3 of 120

Attention

4 of 120

Attention in Cognitive Science

  • Behavioral and cognitive process of selectively concentrating on a discrete aspect of information while ignoring other perceivable information
  • State of arousal
  • It is the taking possession by the mind in clear and vivid form of one out of what seem several simultaneous objects or trains of thought.
  • Focalization, the concentration of consciousness
  • The allocation of limited cognitive processing resources

5 of 120

Issues with Recurrent Models (1)

  • Linear interaction distance
    • RNNs are unrolled “left-to-right”.
    • This encodes linear locality: a useful heuristic!
      • Nearby words often affect each other’s meanings

    • Problem: RNNs take O(sequence length) steps for distant word pairs to interact.
      • Hard to learn long-distance dependencies (because gradient problems!)

short-term dependency: enough with RNN

long-term dependency: NOT enough with RNN

O(sequence length)

6 of 120

Issues with Recurrent Models (2)

  • Lack of parallelizability
    • Forward and backward passes have O(sequence length) unparallelizable operations
      • GPUs can perform a bunch of independent computations at once!
      • But future RNN hidden states can’t be computed in full before past RNN hidden states have been computed.
      • Inhibits training on very large datasets!

[Figure: numbers 0, 1, 2, …, n indicate the minimum number of steps before each state can be computed]

7 of 120

Issues with Recurrent Models (3)

  • Encoding bottleneck
    • It’s like a one-shot compression (sequence of features 🡪 a vector).
    • RNN + classification is like judging a book by its summary.

BiLSTM Encoder

LSTM Decoder

8 of 120

Solution: Attention

  •  

 

Encoder-Decoder with Attention

9 of 120

Solution: Attention

  • We can think of attention as performing fuzzy lookup in a key-value store.

In a lookup table, we have a table of keys that map to values. The query matches one of the keys, returning its value.

In attention, the query matches all keys softly, to a weight between 0 and 1. The keys’ values are multiplied by the weights and summed.

[Diagram: in a lookup table, query d matches key d exactly among keys a–e, returning its value v4. In attention, query q matches all keys k1–k5 softly; the output is the weighted sum of values v1–v5.]
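The fuzzy lookup described above can be sketched in plain Python. The scalar keys/values and the distance-based score are toy choices for illustration (real attention uses dot products between vectors):

```python
import math

def soft_lookup(query, keys, values):
    # Score each key against the query (toy choice: negative squared
    # distance; real attention uses a dot product between vectors).
    scores = [-(query - k) ** 2 for k in keys]
    # Softmax: turn scores into weights between 0 and 1 that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output: weighted sum of the values (a "soft" version of returning v4).
    return sum(w * v for w, v in zip(weights, values))

# Query 4 matches key 4 most strongly, so the output lands near value 40,
# pulled slightly toward the neighboring values.
out = soft_lookup(4, keys=[1, 2, 3, 4, 5], values=[10, 20, 30, 40, 50])
```

An exact lookup would return 40; the soft lookup returns a value close to 40 because the neighboring keys also receive small weights.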

10 of 120

Attention for Machine Translation

  • Sequence-to-sequence (seq2seq)
    • Encode source sequence to latent representation
    • Decode to target sequence one character at a time
    • Need memory for long sequences

https://alex.smola.org/talks/ICML19-attention.pdf

Sequence to Sequence Learning with Neural Networks (https://arxiv.org/pdf/1409.3215)

11 of 120

Attention for Machine Translation

  •  

https://alex.smola.org/talks/ICML19-attention.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

Effective Approaches to Attention-based Neural Machine Translation (https://arxiv.org/pdf/1508.04025)

Image generated by Gemini 3 Pro

12 of 120

Attention for Machine Translation

  • With attention, translation quality remains much more stable for long sentences.

https://alex.smola.org/talks/ICML19-attention.pdf

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

13 of 120

Attention for Machine Translation

  •  

Neural Machine Translation by Jointly Learning to Align and Translate (https://arxiv.org/pdf/1409.0473)

14 of 120

Attention for Image Captioning

  • In computer vision, attention learns alignment between image region and word.

Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/pdf/1502.03044)

15 of 120

Self-Attention as a New Building Block

  • Prev: Attention on recurrent models
  • Today: Attention only
    • Number of unparallelizable operations does not increase with sequence length.
    • Maximum interaction distance: O(1), since all words interact at every layer!
    • How? Treat each word’s representation as a query to access and incorporate information from a set of values.

Encoder-Decoder with Attention

All words attend to all words in previous layer

16 of 120

Self-Attention as a New Building Block

  •  

17 of 120

Limitations of Self-Attention (1)

  • 1st challenge: self-attention doesn’t have an inherent notion of order!

[Diagram: each token of “went to HUFS LAI at 2025 and learned” contributes a key and a value, and a query attends over all of them.]

No order information! There is just a summation over the set.

18 of 120

Position Representation Vectors

  •  

 

In deep self-attention networks, we do this at the first layer! You could concatenate them as well, but people mostly just add…
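The sinusoidal position vectors of "Attention Is All You Need" (cited on the next slide) can be sketched as follows; d_model = 8 and the toy word embedding are illustrative values:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal position vector: even dims use sin, odd dims use cos,
    each pair at a different frequency, so every position gets a
    distinct vector."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# The position vector is simply ADDED to the word embedding
# (at the first layer, as noted above).
word_embedding = [0.1] * 8                      # toy embedding, d_model = 8
pe = positional_encoding(pos=3, d_model=8)
input_vector = [w + p for w, p in zip(word_embedding, pe)]
```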

19 of 120

Position Representation Vectors

Attention Is All You Need (https://arxiv.org/pdf/1706.03762)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (https://arxiv.org/pdf/1810.04805)

20 of 120

Limitations of Self-Attention (2)

  • 2nd challenge: there are no non-linearities; it’s all just weighted averages.

[Diagram: query q, keys k1–k5, values v1–v5; the output of each layer is again just a weighted sum of the values.]

Even stacking multiple layers cannot introduce any non-linearity, because the result is still a summation of value vectors.

21 of 120

Adding nonlinearities in self-attention

  •  
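A sketch of the standard fix, assuming the usual position-wise feed-forward network (linear → ReLU → linear) applied to each attention output independently; the dimensions and weights below are toy values:

```python
def relu(x):
    return max(0.0, x)

def ffn(vector, w1, b1, w2, b2):
    """Position-wise feed-forward net: linear -> ReLU -> linear.
    Applied to EACH position's attention output independently;
    the ReLU is what injects the non-linearity."""
    hidden = [relu(sum(w * x for w, x in zip(row, vector)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

# Toy 2-dim model with a 3-dim hidden layer (weights chosen arbitrarily).
w1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]; b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]];     b2 = [0.0, 0.0]
out = ffn([2.0, 1.0], w1, b1, w2, b2)
```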

22 of 120

Limitations of Self-Attention (3)

  • 3rd challenge: need to ensure we don’t “look at the future” when predicting a sequence

23 of 120

Masking the Future in Self-Attention

  •  
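A minimal sketch of the usual masking trick: before the softmax, scores for future positions are set to −∞ so their weights become exactly 0, while all positions can still be processed in parallel:

```python
import math

def masked_softmax_rows(scores):
    """Causal mask: position i may attend only to positions <= i.
    Future scores become -inf, so exp(-inf) = 0 weight."""
    n = len(scores)
    masked = [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
              for i in range(n)]
    weights = []
    for row in masked:
        exps = [math.exp(s) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# 3 tokens, all raw scores equal: each position splits attention
# uniformly over itself and the PAST only.
w = masked_softmax_rows([[0.0] * 3 for _ in range(3)])
# w[0] attends only to token 0; w[2] spreads weight over tokens 0..2.
```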

24 of 120

“Transformer”

  • Self-attention:
    • The basis of the method.
  • Position representations:
    • Specify the sequence order, since self-attention is an unordered function of its inputs.
  • Nonlinearities:
    • At the output of the self-attention block
    • Frequently implemented as a simple feed-forward network.
  • Masking:
    • In order to parallelize operations while not looking at the future.
    • Keeps information about the future from “leaking” to the past.

25 of 120

Understanding Transformer Step-by-Step

26 of 120

Transformer Decoder

  • A Transformer decoder is how we’ll build systems like language models.
  • It’s a lot like our minimal self-attention architecture, but with a few more components.
  • The embeddings and position embeddings are identical.
  • We’ll next replace our self-attention with multi-head self-attention.

27 of 120

Multi-Head Attention

  •  

Attention head 1 attends to entities

Attention head 2 attends to syntactically relevant words

28 of 120

[Diagram: query/key/value vectors for each token of “went to HUFS LAI at 2025 and learned”.]

29 of 120

[Diagram: the same tokens, with a different attention head attending to different positions.]

30 of 120

Multi-Head Attention

  •  

Multi-head attention = multiple self-attention heads in parallel
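The idea can be sketched as: split each vector into per-head slices, run the same attention independently in each slice, then concatenate the head outputs. All dimensions and vectors below are toy values:

```python
import math

def attend(q, keys, values):
    # Single-head attention: dot-product scores -> softmax -> weighted sum.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def multi_head(q, keys, values, n_heads):
    # Split every vector into n_heads slices, attend per slice, concatenate.
    d = len(q) // n_heads
    out = []
    for h in range(n_heads):
        s = slice(h * d, (h + 1) * d)
        out += attend(q[s], [k[s] for k in keys], [v[s] for v in values])
    return out

# 4-dim toy vectors, 2 heads of size 2: head 0 favors token 0 and
# head 1 favors token 1, which one head cannot do with a single
# weight per token.
q = [1.0, 0.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
values = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
out = multi_head(q, keys, values, n_heads=2)
```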

31 of 120

Transformer …

  • Today, we will focus more on “attention”, rather than “Transformer”.
  • However, it’s recommended to study the full details of Transformer, such as …
    • Low-rank computation in multi-head attention
    • Scaled Dot-Product Attention (dot products grow with vector dimension and saturate the softmax)
    • Residual connections
    • Layer normalization
    • Cross-attention (decoder attends to encoder states)
    • Computation growing quadratically with sequence length

32 of 120

Analysis of Transformer Attention

What Does BERT Look At? An Analysis of BERT’s Attention (https://arxiv.org/pdf/1906.04341)

33 of 120

Could We Trust Attention? (1)

  • Is attention an explanation for its prediction?
    • If attention were a faithful explanation for its prediction, then perturbing attention should significantly change the prediction.
    • In that sense, (Jain & Wallace, 2019) claims attention weights are NOT reliably faithful explanations.
    • They constructed shuffled / adversarial attention distributions on a fixed model and found that the predictions stay almost the same.

Attention is not Explanation (https://arxiv.org/pdf/1902.10186)

34 of 120

Could We Trust Attention? (2)

  • Is a token with high attention really one that matters?
    • (Serrano & Smith, 2019) remove words in order of attention vs. gradients, and find that deleting high-attention words often barely affects the prediction.
    • => Attention weights are not a reliable importance ranking, so we should be careful using them as explanations.

Is Attention Interpretable? (https://arxiv.org/pdf/1906.03731)

35 of 120

Advanced Topics

Large Language Models

36 of 120

Today’s Topic

  • What is a language model?
  • What is a “large” language model?
  • How to train a large language model?
  • How to evaluate a large language model?
  • How to use a large language model?

37 of 120

What is a Language Model?

  • A language model (LM) is a probabilistic model over sequences of symbols (words, subwords, characters).
    • It assigns a probability to each possible sentence or text sequence.
    • It captures what “sounds natural” in a language.
  • Core capability: predicting the next token given previous tokens.
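Concretely, a language model scores a whole sequence with the chain rule, multiplying next-token probabilities. The tiny hand-written probability table below is purely illustrative:

```python
# Toy next-token model: P(next word | previous word).
# The numbers are made up for illustration only.
probs = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.4, "dog": 0.3, "car": 0.3},
    "cat": {"sleeps": 0.6, "runs": 0.4},
}

def sequence_probability(tokens):
    """Chain rule: P(w1..wn) = prod_t P(w_t | prefix).
    (Toy bigram case: the 'prefix' is just the previous word.)"""
    p = 1.0
    prev = "<s>"
    for tok in tokens:
        p *= probs[prev][tok]
        prev = tok
    return p

p = sequence_probability(["the", "cat", "sleeps"])  # 0.5 * 0.4 * 0.6
```

A sequence that "sounds natural" under the table gets a higher probability than one built from rarer continuations.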

38 of 120

What is a Language Model?

  •  

39 of 120

Example: Smartphone Keyboard

  •  

40 of 120

What is a Language Model?

  •  

 

41 of 120

What is a Language Model?

  •  

 

 

42 of 120

Autoregressive Language Models

  •  

 

 

43 of 120

Autoregressive Language Models

  • Predict next word -> predict next word -> predict next word -> …

44 of 120

Autoregressive Language Models

  • A generic language model is thus a next-word predictor.
    • Input: a sequence of tokens
    • Output: the most likely next token

Google Cloud Tech – Introduction to large language models (https://www.youtube.com/watch?v=zizonToFXDs)

45 of 120

Autoregressive Language Models

  • Transformer decoders take the token embeddings and positional encodings.

46 of 120

Autoregressive Language Models

  • In the decoder, self-attention with causal attention mask computes the (left-to-right) representations.

47 of 120

Autoregressive Language Models

  • In the decoder, we obtain the probability distribution over the vocabulary by multiplying the last hidden representation with the token embedding matrix.

48 of 120

Autoregressive Language Models

  • Each row in the embedding matrix corresponds to the embedding of a word in the model’s vocabulary. The result of this multiplication is interpreted as a score for each word in the model’s vocabulary.

49 of 120

Autoregressive Language Models

  • We can simply select the token with the highest score (top_k = 1).

50 of 120

Autoregressive Language Models

  • Example of machine translation with decoder-only transformer:

51 of 120

Autoregressive Language Models

  • Example of summarization with decoder-only transformer:

52 of 120

How to Train Language Models?

  •  

53 of 120

How to Train Language Models?

  •  

54 of 120

How to Train Language Models?

  •  

55 of 120

How to Train Language Models?

  • Maximum likelihood training
    • Language model defines:

      p(x_1, …, x_T) = ∏_t p(x_t | x_1, …, x_{t-1})

    • Maximum likelihood objective:

      max_θ ∑_t log p_θ(x_t | x_1, …, x_{t-1})

    • Equivalent minimization of cross-entropy loss:

      L(θ) = − ∑_t log p_θ(x_t | x_1, …, x_{t-1})

56 of 120

How to Train Language Models?

  •  

57 of 120

How to Train Language Models?

  • What is teacher forcing?
    • At every step, the context tokens come from the dataset, not from the model’s own predictions.

Generated by Gemini 3 Pro (Nano Banana Pro)

58 of 120

How to Train Language Models?

  • Training loop (w/ teacher forcing)
    • input_tokens are always ground-truth tokens from the dataset.
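A framework-free sketch of that loop, with a stand-in toy_model instead of a real network: the prefix at every step comes from the dataset (never from the model's own output), and the loss is the cross-entropy of the true next token:

```python
import math

def toy_model(prefix, vocab):
    # Stand-in for a neural LM: returns a probability for every vocab token.
    # Here: uniform, slightly boosted for tokens the prefix already contains.
    scores = [2.0 if tok in prefix else 1.0 for tok in vocab]
    total = sum(scores)
    return {tok: s / total for tok, s in zip(vocab, scores)}

def teacher_forced_loss(sentence, vocab):
    """Sum of -log P(true next token | ground-truth prefix) over positions.
    The prefix is ALWAYS taken from the data: teacher forcing."""
    loss = 0.0
    for t in range(1, len(sentence)):
        ground_truth_prefix = sentence[:t]      # from the dataset
        target = sentence[t]                    # true next token
        p = toy_model(ground_truth_prefix, vocab)
        loss += -math.log(p[target])
    return loss

vocab = ["the", "cat", "sat", "mat"]
loss = teacher_forced_loss(["the", "cat", "sat"], vocab)
```

In real training the loss would be averaged over a batch and minimized by gradient descent; the key point is only where the prefix comes from.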

59 of 120

How to Train Language Models?

  •  

60 of 120

How to Train Language Models?

  • Exposure bias
    • During training, the model is only exposed to correct prefixes from the dataset.
    • During inference, the model is exposed to its own mistakes:
      • If it predicts a wrong token early, all future contexts are “off-distribution”.
    • Consequences:
      • The model may behave well when the prefix is correct, but poorly when earlier errors happen.
      • Errors can snowball and produce unnatural text.

Generated by Gemini 3 Pro (Nano Banana Pro)

61 of 120

Rethinking Next Token Prediction

  • Training objective:
    • Next token prediction
  • But we can also view it as:
    • Learning to solve (multiple) tasks
    • Injecting knowledge into the model
  • Same objective, different perspectives

Generated by Gemini 3 Pro (Nano Banana Pro)

62 of 120

Next Token Prediction = Learning to Solve Tasks

  • Many texts already contain implicit tasks:
    • “Translate to French: The cat is sleeping. 🡪 Le chat dort.”
    • “Question: Where is the Eiffel Tower? Answer: In Paris, France.”
    • “Summary: … In conclusion, …”
    • “def add(a, b): return a + b”
  • For the model, each of these is just one long token sequence.
  • To predict the next token correctly, the model must:
    • Understand the instruction in the prefix.
    • Understand the input (sentence, question, code, …).
    • Produce the correct output tokens.

63 of 120

Multitask Learning Emerges from Raw Text

  • A single next-token objective is trained on:
    • Dialogues (conversation modeling)
    • Articles and headlines (summarization)
    • Q&A forums (question answering)
    • Code repositories (program synthesis, debugging)
    • Tables, lists, bullet points (information extraction)
  • The model sees many task formats mixed together.
  • Result:
    • The model becomes a general task solver conditioned on the input text (the prompt).

64 of 120

Next Token Prediction = Knowledge Injection

  • To assign high probability to real text, the model must capture:
    • Linguistic knowledge:
      • Grammar, syntax, collocations, style.
    • World knowledge:
      • “Paris is the capital of France.”
      • “Water boils at 100°C at sea level.”
    • Procedural knowledge:
      • “2 + 2 = 4”
      • “for i in range(n): …”
  • Each gradient update injects tiny pieces of knowledge into the parameters.
  • After large-scale training, the model stores a compressed representation of the knowledge in the corpus.

65 of 120

Pre-training with Next Token Prediction

  • Pre-training:
    • Train a large neural network on massive text corpora.
    • Use the next-token prediction objective on billions or trillions of tokens.
  • Objective:

  • No explicit task labels are needed:
    • The supervision signal comes from the text itself.
  • Result:
    • A general-purpose pre-trained language model.

 

66 of 120

From Language Model to “Large” Language Model

  • Language model (LM):
    • Any model trained with the next-token objective.
  • Large language model (LLM):
    • Same objective, but:
      • Much larger model (billions of parameters).
      • Much larger and more diverse data (web, books, code, dialogues, …).
      • Longer training, often on specialized hardware.
    • Often followed by:
      • Instruction fine-tuning.
      • Preference optimization (e.g., RLHF).
  • The core remains the same:
    • Next-token prediction as task learning and knowledge injection.

Generated by Gemini 3 Pro (Nano Banana Pro)

67 of 120

What Makes a Language Model “Large”?

  • Model size has exploded
    • Models grew from hundreds of millions to hundreds of billions of parameters.
    • Larger capacity allows storing more patterns, tasks, and knowledge.
    • But it requires much more data, compute, and engineering.

68 of 120

What Makes a Language Model “Large”?

  • Max effective context length
    • Same next-token objective, but Transformers can attend to thousands of tokens at once.
    • This enables long-range dependencies and complex prompts that describe a task, give examples, and provide background.

69 of 120

What Makes a Language Model “Large”?

  •  

Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)

70 of 120

What Makes a Language Model “Large”?

  • Zero-shot, one-shot, and few-shot in-context learning
    • Zero-shot
      • Input: natural language task description only.
      • The model must infer the task from the description.
    • One-shot
      • Input: Task description + one example.
    • Few-shot
      • Input: Task description + several examples.
    • No gradient updates; the model uses in-context information only.
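In code, in-context learning is nothing more than prompt assembly; no parameters change. The task description and examples below are hypothetical:

```python
def build_few_shot_prompt(description, examples, query):
    """Few-shot prompt: task description + k solved examples + the new input.
    The model 'learns' the task only from this context; no gradient step."""
    lines = [description, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of the review as positive or negative.",
    [("Great movie!", "positive"), ("Terrible plot.", "negative")],
    "I loved every minute.",
)
# Zero-shot would be the same prompt with an empty example list;
# one-shot, with a single example.
```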

Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)

71 of 120

What Makes a Language Model “Large”?

  • Larger models use context more efficiently.
    • All models improve with more in-context examples.
    • Larger models show steeper in-context learning curves.
    • With a natural language prompt, large models can often perform tasks zero-shot or few-shot.
    • Scaling up improves the model’s ability to read, understand, and exploit the prompt as training data.

Language Models are Few-Shot Learners (https://arxiv.org/pdf/2005.14165)

72 of 120

What Makes a Language Model “Large”?

  • More recently: increased context size x more examples (shots)

Many-Shot In-Context Learning (https://arxiv.org/pdf/2404.11018)

73 of 120

What Makes a Language Model “Large”?

  • Scaling law

74 of 120

What Makes a Language Model “Large”?

  • Scaling law

75 of 120

From Pre-trained LLMs to Practical Systems

  • Pre-trained LLM:
    • Trained with next-token prediction at scale.
    • Already capable of many tasks via prompting.
  • To build real applications, we often add:
    • Instruction fine-tuning on human-written prompts and responses.
    • Preference optimization (e.g., RLHF) for helpful and safe behavior.
    • Retrieval-augmented generation (RAG) to access up-to-date external knowledge.
    • Tool use and APIs (search, code execution, databases, …).
  • These layers turn a general LLM into:
    • A practical assistant, coding partner, or domain-specific agent.

76 of 120

How to Train Large Language Models?

  • Several paradigms exist
    • General pre-training -> Domain pre-training -> Task fine-tuning
    • Pre-training -> Mid-training -> Post-training
    • Pre-training -> Instruction tuning -> Alignment tuning -> Reinforcement learning

Pre-training

Post-training

General Pre-training

Domain Pre-training

Mid-training

Instruction Tuning

Preference Tuning

Reinforcement Learning

77 of 120

Pre-training

  • Pre-training is usually considered a step for injecting knowledge.
    • Usually, LLMs are trained in order of specificity (general to specific).
    • Depending on the specificity of the corpus, the model can be specialized.
      • General corpus (news, wiki, books, papers, code, …)
      • Specific corpus (finance, medical, legal, game, …)

78 of 120

Pre-training

  • Foundation model in Korea

79 of 120

Pre-training

  • Domain-specific foundation model in Korea

80 of 120

Post-training

  • Language modeling != assisting users
    • Language models are not aligned with user intent.
    • Post-training is required as a step for making LLMs follow human instructions.

81 of 120

Post-training

  • Instruction tuning
    • Collect examples of (instruction, output) pairs across many tasks and finetune an LM
    • (and evaluate on unseen tasks)

Scaling Instruction-Finetuned Language Models (https://arxiv.org/pdf/2210.11416)

82 of 120

Post-training

  • Instruction tuning with chat (conversational) template.

User view

Data view

Model view

83 of 120

Post-training

  • However, instruction tuning is not enough to make LLMs aligned, because the model only sees good examples.
    • The training objective is still next-token prediction (w/ cross-entropy).
    • It aims to increase the probability of the correct response, rather than calibrating it (over-/under-confidence issue).
    • Thus, instruction-tuned LLMs never learn what NOT to generate, and often fail to suppress wrong responses.
      • Unsafe responses, …

Iterative Reasoning Preference Optimization (https://arxiv.org/pdf/2404.19733)

84 of 120

Post-training

  • Preference tuning
    • LLMs learn to generate chosen (preferred) response over rejected response.
      • Higher likelihood on the chosen response, lower likelihood on the rejected response.

85 of 120

Post-training

  • A recipe of ChatGPT: Reinforcement Learning from Human Feedback (RLHF)

Training language models to follow instructions with human feedback (https://arxiv.org/pdf/2203.02155)

86 of 120

Post-training

  • However, collecting labels from human experts is very expensive.
  • Alternative: Reinforcement Learning from AI Feedback (RLAIF)

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (https://arxiv.org/pdf/2309.00267)

87 of 120

Post-training

  • Also, handling multiple models (policy, reward, reference) is very complex and difficult to optimize.
  • Alternative: Direct Preference Optimization (DPO)

MIT 6.S191 (Liquid AI): Large Language Models (https://www.youtube.com/watch?v=_HfdncCbMOE)

Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/pdf/2305.18290)
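The DPO objective from the paper cited above can be written down directly. Given sequence-level log-probabilities under the policy and the frozen reference model, the loss is a logistic loss on the implicit reward margin (beta = 0.1 is an illustrative value):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1):
    """DPO: -log sigmoid(beta * margin), where the margin is how much MORE
    the policy prefers the chosen response over the rejected one, relative
    to the reference model. Inputs are log p(response | prompt) values."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already prefers the chosen response more than the
# reference does, the margin is positive and the loss is small.
low = dpo_loss(policy_chosen=-5.0, policy_rejected=-9.0,
               ref_chosen=-6.0, ref_rejected=-6.0)
high = dpo_loss(policy_chosen=-9.0, policy_rejected=-5.0,
                ref_chosen=-6.0, ref_rejected=-6.0)
```

No separate reward model or RL loop is needed: the single loss pushes the chosen response up and the rejected one down, which is why DPO is simpler to optimize than RLHF.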

88 of 120

LLM-related Framework

89 of 120

Masked Language Models

  •  

 

 

90 of 120

Masked Language Models

91 of 120

Conditional Language Models

  • Sometimes we want to model output text given an input:
    • Example tasks: translation, summarization, question answering
  • A conditional language model defines a distribution p(y | x), where x is some input (source sentence, document, prompt, etc.)
  • With the chain rule:

    p(y | x) = ∏_t p(y_t | y_1, …, y_{t-1}, x)

  • Many modern “sequence-to-sequence” models are conditional autoregressive language models.

 

 

92 of 120

How to Use Large Language Models?

  • Prompt engineering
    • Write your task as a natural language description as clearly as possible.
    • You’ve learned few-shot & in-context learning. So, add more examples in the prompt!
    • Ultimately, it’s about thinking “which texts are more likely under the LLM?”

93 of 120

How to Use Large Language Models?

  • Chain-of-thought prompting
    • Add reasoning steps before answering the question.
    • Trigger words: “Let’s think step by step.”

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://arxiv.org/pdf/2201.11903)

Large Language Models are Zero-Shot Reasoners (https://arxiv.org/pdf/2205.11916)

94 of 120

How to Use Large Language Models?

  • Retrieval-Augmented Generation (RAG)
    • Augment the input context with relevant documents.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (https://arxiv.org/pdf/2005.11401)

95 of 120

How to Use Large Language Models?

  • Context engineering
    • Ultimately, it’s about engineering the input “context”.
      • system prompt, dialogue history, relevant documents, tools, personal memory, …

96 of 120

How to Use Large Language Models?

  • But, please set up your own evaluation first.
    • Generation involves randomness (sampling).
    • It’s not guaranteed that your prompt works for every scenario you want.

Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.

97 of 120


98 of 120

How to Use Large Language Models?

  • Be aware that LLMs are not perfect.
    • Sometimes they give incorrect answers.
    • Sometimes they cite incorrect references.
    • Sometimes they give unsafe answers.

Stanford CS230 | Autumn 2025 | Lecture 7: Agents, Prompts, and RAG.

99 of 120

How to Use Large Language Models?

  • 1. Local LLM
    • transformers & generate()
    • serving engine (vLLM)
  • 2. Cloud LLM
    • OpenAI API
    • Structured Outputs
      • Classification example

100 of 120

Local LLM – Basics

  • 1. Load the tokenizer and model from Hugging Face.

https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct

🡪 MODEL_NAME = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"

101 of 120

Local LLM – Basics

  • 2. Tokenize the prompt

102 of 120

Local LLM – Basics

  • 3. Compute logits over vocabulary

103 of 120

Local LLM – Basics

  • 4. Select the most likely token among the vocabulary (greedy decoding)

104 of 120

Local LLM – Basics

  • 5. Select the most likely token among the vocabulary (greedy decoding)

105 of 120

Local LLM – Basics

  • 6. Append the token to the inputs (generated_ids = "Hello, my name is Alex")

106 of 120

Local LLM – Basics

  • 7. Repeat until generating {max_new_tokens} tokens
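Steps 2–7 form one loop. A framework-free sketch, where the toy next_token_logits table stands in for the model's real forward pass (which would condition on the whole prefix, not just the last token):

```python
# Toy stand-in for the forward pass: logits for the next token given
# only the last token. The numbers are made up for illustration.
next_token_logits = {
    "<s>": {"Hello": 2.0, ",": 0.1, "<eos>": 0.0},
    "Hello": {",": 3.0, "world": 1.0, "<eos>": 0.0},
    ",": {"world": 2.5, "Hello": 0.2, "<eos>": 0.0},
    "world": {"<eos>": 4.0, ",": 0.1, "Hello": 0.0},
}

def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Repeat: run the model, take the argmax over the vocabulary
    (greedy decoding), append the token, and stop at <eos> or after
    max_new_tokens tokens."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits[tokens[-1]]   # step 3: compute logits
        next_tok = max(logits, key=logits.get)   # steps 4-5: greedy argmax
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)                  # step 6: append to inputs
    return tokens

out = greedy_generate(["<s>", "Hello"])  # -> ["<s>", "Hello", ",", "world"]
```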

107 of 120

Local LLM – Basics

  • 7. Repeat until generating {max_new_tokens} tokens

108 of 120

Local LLM – Basics

  • 7. Repeat until generating {max_new_tokens} tokens

109 of 120

Local LLM – Basics

  • 7. Repeat until generating {max_new_tokens} tokens

110 of 120

Local LLM – Basics

  • Limitation
    • The model must recompute attention over all tokens (input + generated) every time it generates a new token.

KV Caching Explained: Optimizing Transformer Inference Efficiency (https://huggingface.co/blog/not-lain/kv-caching)

111 of 120

Local LLM – Basics

  • Solution
    • In transformers, the key-value vectors are cached.

KV Caching Explained: Optimizing Transformer Inference Efficiency (https://huggingface.co/blog/not-lain/kv-caching)

112 of 120

Local LLM – Basics

  • Solution
    • In transformers, the key-value vectors are cached.

KV Caching Explained: Optimizing Transformer Inference Efficiency (https://huggingface.co/blog/not-lain/kv-caching)

KV Caching

Standard Inference

113 of 120

Local LLM – transformers & generate()

  • Generation with KV caching
    • Pass the “past_key_values”
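A toy sketch of what the cache buys; in the transformers library this corresponds to passing past_key_values (as the slide notes), while make_kv below is a stand-in for the key/value projections of one layer:

```python
def make_kv(token_embedding):
    # Stand-in for the key/value projections of one transformer layer.
    return ([x * 0.5 for x in token_embedding],   # key
            [x * 2.0 for x in token_embedding])   # value

def generate_step(new_embedding, kv_cache):
    """With KV caching: project ONLY the new token, append it to the
    cache, and attend over the cached keys/values. Without the cache,
    make_kv would be re-run for every past token at every step."""
    kv_cache.append(make_kv(new_embedding))
    keys = [k for k, _ in kv_cache]
    values = [v for _, v in kv_cache]
    return keys, values, kv_cache

cache = []
for emb in [[1.0], [2.0], [3.0]]:
    keys, values, cache = generate_step(emb, cache)
# After 3 steps the cache holds one (key, value) pair per token:
# make_kv ran 3 times in total, instead of 1 + 2 + 3 = 6 times.
```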

By Me!

114 of 120

Next

  • How to use LLMs (in short)
  • Project Presentation Day

115 of 120

LLMs

  • tokenization
  • decoding
  • training
  • vLLM
  • API usage
  • json response
  • structured outputs (schema-guided decoding)

116 of 120

117 of 120

118 of 120

Today’s Topic

  • Large Language Models (LLMs)
    • What is a Language Model?
    • Data preparation
      • Tokenization
      • Data Deduplication
      • Quality Filtering
    • Neural Architecture
      • Encoder-only (BERT)
      • Encoder-Decoder (T5)
      • Decoder-only (GPT)
      • Scaling law
    • Training Paradigm
      • General PT -> Domain PT -> Task Fine-tuning
      • Pre-training -> Mid-training -> Post-training
      • Reinforcement Learning
    • Training Objective
      • Maximum likelihood estimation (MLE)
      • Attention mask for supervised fine-tuning (SFT), a.k.a. instruction fine-tuning
        • Chat template
      • Reinforcement Learning from Human Feedback (RLHF) PPO
      • Direct preference optimization (DPO)
      • Group Relative Preference Optimization (GRPO)
  • Large Language Models (LLMs)
    • Evaluation
      • MMLU
      • LMArena
      • LLM-as-a-Judge
    • Inference
      • Decoding
        • Greedy decoding
        • Beam search
        • Schema-guided decoding (a.k.a. structured outputs)
        • Speculative decoding
      • Test-time scaling
        • Prompt engineering
        • Majority voting
        • Chain-of-thought
        • Reasoning
        • Tool use (a.k.a. function calling)
        • Retrieval-Augmented Generation (RAG)
        • Context Engineering
          • Memory
      • API
        • OpenAI-compatible API

119 of 120

Further Topics

  • Architectures: Encoder-only, Encoder-Decoder, Decoder-only LMs
  • Data preparation: Tokenization, Dataset Construction
  • Scaling Law
  • Training paradigm: Scaling Law, Domain pre-training, Task fine-tuning, Post-training, Preference Learning, Reinforcement learning, …
  • Evaluation: Metrics, Benchmarks, LLM-as-a-Judge, …
  • Inference: Decoding, Test-time scaling, Retrieval-augmented generation (RAG), Reasoning, Tool Use, …

120 of 120

References