10 of 48

Generative models

Teaching a computer to read: Classification

Teaching a computer to write: Generative

11 of 48

LLMs

Generative models for language

12 of 48

LLMs: A simplified setting

14 of 48

Challenge 1: Words aren’t enough!

Typos: What if a paragraph contains “ardvarrk”?
Different versions of words:
Listen, listened, listening, listener, …
Biology, biologist, biological, biologically, …

Remedy: Use tokens = short character subsequences instead of words
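
A minimal sketch of why tokens help, assuming two tiny made-up vocabularies (`WORD_VOCAB`, `TOKEN_VOCAB`) and a simple greedy longest-match splitter (`greedy_tokenize`). All of these names and vocabulary entries are hypothetical and chosen only for illustration; real tokenizers may split text differently.

```python
# Hypothetical toy vocabularies, invented for this example.
WORD_VOCAB = {"listen", "listened", "biology", "aardvark"}
TOKEN_VOCAB = {"listen", "ed", "ing", "er", "bio", "log", "ist", "ical", "ly"}
# Single characters are always available as a fallback.
TOKEN_VOCAB |= set("abcdefghijklmnopqrstuvwxyz")


def word_lookup(word):
    """Word-level view: anything outside the vocabulary becomes <unk>."""
    return word if word in WORD_VOCAB else "<unk>"


def greedy_tokenize(word):
    """Split a word by repeatedly taking the longest matching token."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for end in range(len(word), i, -1):
            if word[i:end] in TOKEN_VOCAB:
                tokens.append(word[i:end])
                i = end
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return tokens


print(word_lookup("ardvarrk"))          # the typo is unknown at word level
print(greedy_tokenize("ardvarrk"))      # falls back to single characters
print(greedy_tokenize("biologically"))  # reuses pieces shared by related words
```

With words, the typo and every unseen word form collapse to the same unknown symbol. With tokens, the typo is still representable character by character, and related word forms share pieces such as "listen" or "bio".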

16 of 48

Tokenization

To tokenize, we start with one token for each character, then keep adding the most common next-shortest character subsequences as new tokens.

single-char tokens

two-char tokens

3+ char tokens
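
The progression from single-character to longer tokens can be sketched as follows. This is one possible reading of the procedure, in the style of byte-pair encoding: at each step the most frequent adjacent pair of tokens is merged into a new token. The function name `train_tokens`, the tie-breaking rule, and the example words are assumptions made for illustration, not the method used by any particular model.

```python
from collections import Counter


def train_tokens(words, num_merges):
    """Learn tokens by repeatedly merging the most frequent adjacent pair."""
    # Start with every word split into single-character tokens.
    corpus = [list(w) for w in words]
    vocab = sorted({ch for toks in corpus for ch in toks})
    for _ in range(num_merges):
        # Count every adjacent token pair in the current corpus.
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically (an arbitrary choice).
        (a, b), _count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = a + b
        vocab.append(merged)
        # Rewrite the corpus so the new token replaces the pair.
        new_corpus = []
        for toks in corpus:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return vocab, corpus


words = ["listen", "listened", "listening", "listener"]
vocab, corpus = train_tokens(words, num_merges=5)
print(vocab)
print(corpus)
```

After five merges on this toy input, the shared stem "listen" has become a single token, while the endings are still spelled out in single characters.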

23 of 48

Challenge 2: Language is not fixed length

28 of 48

LLMs

Tokenize language
Predict the next token based on the previous tokens
Use encodings / embeddings of tokens as vectors
Large neural network (“transformer” architecture) as generative model
Extremely successful! (clearly!)

But: still largely based on patterns of language
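
To make "predict the next token" and "patterns of language" concrete, here is a deliberately tiny stand-in for an LLM: a bigram model that looks only at the single previous token and counts what followed it in the training text. This is an assumption-laden simplification; a real LLM uses token embeddings and a transformer over many previous tokens. The function names and the example sentence are hypothetical.

```python
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


def generate(counts, start, max_len=10):
    """Greedy generation: always append the most frequent next token."""
    out = [start]
    while len(out) < max_len and counts.get(out[-1]):
        out.append(counts[out[-1]].most_common(1)[0][0])
    return out


tokens = "the cat sat on the mat the cat ran".split()
counts = fit_bigram(tokens)
print(next_token_probs(counts, "the"))
print(generate(counts, "the", max_len=2))
```

The model has no notion of what a cat is. It continues "the" with "cat" only because that pairing was the most frequent one in its training text.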

31 of 48

Why is this wrong?

“Given that a ball is red, is it more likely to be in the left jar or in the right jar?”
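
A worked sketch of the jar question. The ball counts below are invented for illustration (the original figure is not reproduced here), and the calculation assumes every ball across both jars is equally likely to be the one drawn. The function names `posterior` and `red_rate` are hypothetical.

```python
from fractions import Fraction


def posterior(jars):
    """Return P(jar | red) for each jar, given (total, red) ball counts.

    Assumes every ball, across all jars, is equally likely to be drawn.
    """
    total_red = sum(red for _total, red in jars.values())
    return {name: Fraction(red, total_red) for name, (_total, red) in jars.items()}


def red_rate(jars):
    """Return P(red | jar): the strength of the jar-to-red association."""
    return {name: Fraction(red, total) for name, (total, red) in jars.items()}


# Hypothetical counts, invented for illustration: (total balls, red balls).
jars = {"left": (100, 20), "right": (10, 8)}
print(red_rate(jars))   # the right jar is "redder"
print(posterior(jars))  # but a red ball more likely came from the left jar
```

In this made-up setup the right jar is mostly red, yet most red balls sit in the left jar because it holds far more balls overall. Answering with the "redder" jar confuses P(red | jar) with P(jar | red).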

33 of 48

Why is this wrong?

“Given that a person is Republican, are they more likely to be from California or Louisiana?”

34 of 48

Consequences of Language Models

The association between the tokens “Louisiana” and “Republican” is very strong!

And the association between “California” and “Democrat” is very strong!

So the LLM answers Republican = Louisiana

But: This is also a counterintuitive question that people get wrong ☺
And: “Reasoning” / “Thinking” models get the question right

35 of 48

Consequences of Language Models

Be aware of these when using LLMs!

36 of 48