1 of 17

Programming Practice 01

In Colab, write a function to do the following:

Given two integer arguments, rows and cols, create a list of lists of size (rows, cols) filled with 0s and return it.

print(create_matrix(2, 2))

> [[0, 0], [0, 0]]

2 of 17

Programming Practice Solution 01
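The solution code isn't captured in this export; a minimal sketch that satisfies the practice problem might look like this:

def create_matrix(rows, cols):
    # build one fresh row of cols zeros per row
    matrix = []
    for _ in range(rows):
        matrix.append([0] * cols)
    return matrix

print(create_matrix(2, 2))

> [[0, 0], [0, 0]]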

3 of 17

programming: list comprehensions

list comprehensions are a compact way to create a new list, and are frequently used to show off
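For example, create_matrix() from the practice problem collapses into a single expression (a sketch; note we build a fresh inner list per row, since [[0] * cols] * rows would make every row the same list object):

def create_matrix(rows, cols):
    return [[0] * cols for _ in range(rows)]

print(create_matrix(2, 2))

> [[0, 0], [0, 0]]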

4 of 17

CSE 10124 - Tokenization 03

5 of 17

Where are we?

[Pipeline diagram: the prompt “Please help me with my homework!!” flows through the Tokenizer and then the Embeddings on its way to the model’s response: “Certainly, as a large language model I’d be happy to help you cheat on your homework assignment”.]

6 of 17

Last Class - Tokenization 02

1.) Convert input to bytes

string_bytes = list('rug pug hug pun bun hugs run gun'.encode('utf-8'))

> [114, 117, 103, 32, 112, 117, 103, 32, 104, 117, 103, 32, 112, 117, 110, 32, 98, 117, 110, 32, 104, 117, 103, 115, 32, 114, 117, 110, 32, 103, 117, 110]

2.) Find and count pairs

counts = count_pairs(string_bytes)

> {(114, 117): 2, (117, 103): 4, (103, 32): 3, (32, 112): 2, (112, 117): 2, (32, 104): 2, (104, 117): 2, (117, 110): 4, (110, 32): 3, (32, 98): 1, (98, 117): 1, (103, 115): 1, (115, 32): 1, (32, 114): 1, (32, 103): 1, (103, 117): 1}

3.) Replace the most common pair with a new token

merged_string = replace_pairs(string_bytes, counts.most_common()[0][0], 257)

> [114, 257, 32, 112, 257, 32, 104, 257, 32, 112, 117, 110, 32, 98, 117, 110, 32, 104, 257, 115, 32, 114, 117, 110, 32, 103, 117, 110]

7 of 17

programming: count_pairs()

Two classes ago we wrote a function we called count_consecutive_pairs(); we can use it to find tokens that frequently appear next to each other!
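The original code isn't reproduced on this slide; a minimal sketch consistent with the slide-6 output (where the same helper is called count_pairs()) might be:

from collections import Counter

def count_consecutive_pairs(ids):
    # zip the list against itself shifted by one to get every adjacent pair;
    # returning a Counter is what lets slide 6 call counts.most_common()
    return Counter(zip(ids, ids[1:]))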

8 of 17

programming: merge_pair()

Last class we wrote a function we called replace_pairs(), which we can use to merge the most frequent pairs into new, larger tokens!
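Again the exact code isn't in this export; a replace_pairs() sketch that matches the slide-6 behavior:

def replace_pairs(ids, pair, new_id):
    # walk the list; every time the target pair appears, emit new_id instead
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out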

9 of 17

programming: Classes

A class is a data type that defines both data and the functions used to manipulate that data. Specific instances of a class are called objects.

NOTE: We can use dir() on a class or object to print out the attributes and functions it has
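A toy example (hypothetical names, just to show the syntax):

class Dog:
    def __init__(self, name):
        self.name = name              # data: an attribute

    def speak(self):                  # a function (method) that uses the data
        return self.name + ' says woof!'

rex = Dog('Rex')                      # rex is an object: one instance of Dog
print(rex.speak())

> Rex says woof!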

10 of 17

programming: Simple_Tokenizer()

We can put all of this together in a class to create a tokenizer we can re-use!
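The class itself isn't captured in this export; one way to wire the pieces together (a sketch, assuming the count_consecutive_pairs() and replace_pairs() helpers above) is:

class Simple_Tokenizer:
    def __init__(self):
        self.merges = {}                          # (left, right) -> new token id

    def train(self, text, num_merges):
        ids = list(text.encode('utf-8'))
        for i in range(num_merges):
            counts = count_consecutive_pairs(ids)
            if not counts:
                break
            pair = counts.most_common()[0][0]     # most frequent adjacent pair
            new_id = 256 + i                      # bytes use 0-255 (slide 6 starts at 257)
            ids = replace_pairs(ids, pair, new_id)
            self.merges[pair] = new_id

    def encode(self, text):
        ids = list(text.encode('utf-8'))
        for pair, new_id in self.merges.items():  # replay merges in training order
            ids = replace_pairs(ids, pair, new_id)
        return ids

    def decode(self, ids):
        # expand each token id back to raw bytes, then bytes back to a string
        vocab = {i: bytes([i]) for i in range(256)}
        for (a, b), new_id in self.merges.items():
            vocab[new_id] = vocab[a] + vocab[b]
        return b''.join(vocab[t] for t in ids).decode('utf-8', errors='replace')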

11 of 17

“tonight?” vs. “tonight”

This simple tokenizer works, but it seems very silly that “tonight?” is a different token than “tonight”.

BPE is excellent at breaking down words statistically, but it can make suboptimal merges like combining punctuation with words.

12 of 17

programming: Regex

Regex allows us to define patterns that we can use to filter strings

NOTE: Regex is super confusing and almost a language unto itself, only sickos have it memorized. Just use a cheat sheet or ask chat.

NOTE: regex is short for “regular expression” (or “rational expression”) and I actually met one of the creators while interviewing at Princeton!
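A quick taste of the standard-library re module (a generic example, not course code):

import re

# \w+ matches runs of letters/digits/underscores, skipping everything else
print(re.findall(r'\w+', 'No cap, are u rolling?'))

> ['No', 'cap', 'are', 'u', 'rolling']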

13 of 17

Regex in Tokenization

What we can do is pre-split our text on linguistic “landmarks” like whitespace and punctuation using regex.

The regex acts as a “pre-filter” that prevents these issues by ensuring letters, numbers, punctuation, and whitespace land in separate chunks.

print(r_t.chunk("No cap are u rolling the party tonight?"))

> '(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+

> ['No', ' cap', ' are', ' u', ' rolling', ' the', ' party', ' tonight', '?']
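The printed pattern appears to be the GPT-4 split pattern; it relies on \p{...} character classes and possessive quantifiers (?+, ++), so it needs the third-party regex package rather than the standard re module. A chunk() sketch under that assumption:

import regex  # pip install regex

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

def chunk(text):
    # pre-split into word-ish chunks; BPE then runs inside each chunk,
    # so a merge can never glue '?' onto 'tonight'
    return regex.findall(GPT4_SPLIT_PATTERN, text)

print(chunk('No cap are u rolling the party tonight?'))

> ['No', ' cap', ' are', ' u', ' rolling', ' the', ' party', ' tonight', '?']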

14 of 17

Code Copy → Lab 01

Lab 01 will be a guided walkthrough of programming a RegexTokenizer() class, the first piece of code for the LLM we’re building this semester.

[Pipeline diagram revisited: the same prompt-to-response pipeline, now annotated with Labs 01-05; the Tokenizer stage is Lab 01.]

15 of 17

nanochat Tokenizer

16 of 17

Tokenization

Are we actually gaining anything from this?
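One quick sanity check using the slide-6 numbers: the raw byte encoding of that string was 32 ids, and after a single merge (‘ug’ → 257) it was 28, so each merge buys some compression.

text = 'rug pug hug pun bun hugs run gun'
print(len(text.encode('utf-8')), len(merged_string))  # merged_string from slide 6

> 32 28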

17 of 17

Tokenization