1 of 17

Programming Practice 01

In Colab, write a function to do the following:

Given two integer arguments, rows and cols, create a list of lists of size (rows, cols) filled with 0s and return it.

print(create_matrix(2, 2))

> [[0, 0], [0, 0]]

2 of 17

Programming Practice Solution 01
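The solution code isn't captured in this export; a minimal sketch that satisfies the practice problem might look like this:

def create_matrix(rows, cols):
    # build one fresh row of cols zeros per row
    matrix = []
    for _ in range(rows):
        matrix.append([0] * cols)
    return matrix

print(create_matrix(2, 2))

> [[0, 0], [0, 0]]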

3 of 17

programming: list comprehensions

list comprehensions are a compact way to create a new list, and are frequently used to show off
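For example, create_matrix() from the practice problem collapses into a single expression (a sketch; note we build a fresh inner list per row, since [[0] * cols] * rows would make every row the same list object):

def create_matrix(rows, cols):
    return [[0] * cols for _ in range(rows)]

print(create_matrix(2, 2))

> [[0, 0], [0, 0]]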

4 of 17

CSE 10124 - Tokenization 03

5 of 17

Where are we?

[Pipeline diagram: the prompt “Please help me with my homework!!” flows through the Tokenizer and then the Embeddings on its way to the model’s response: “Certainly, as a large language model I’d be happy to help you cheat on your homework assignment”.]

6 of 17

Last Class - Tokenization 02

1.) Convert input to bytes

string_bytes = list('rug pug hug pun bun hugs run gun'.encode('utf-8'))

> [114, 117, 103, 32, 112, 117, 103, 32, 104, 117, 103, 32, 112, 117, 110, 32, 98, 117, 110, 32, 104, 117, 103, 115, 32, 114, 117, 110, 32, 103, 117, 110]

2.) Find and count pairs

counts = count_pairs(string_bytes)

> {(114, 117): 2, (117, 103): 4, (103, 32): 3, (32, 112): 2, (112, 117): 2, (32, 104): 2, (104, 117): 2, (117, 110): 4, (110, 32): 3, (32, 98): 1, (98, 117): 1, (103, 115): 1, (115, 32): 1, (32, 114): 1, (32, 103): 1, (103, 117): 1}

3.) Replace the most common pair with a new token

merged_string = replace_pairs(string_bytes, counts.most_common()[0][0], 257)

> [114, 257, 32, 112, 257, 32, 104, 257, 32, 112, 117, 110, 32, 98, 117, 110, 32, 104, 257, 115, 32, 114, 117, 110, 32, 103, 117, 110]

7 of 17

programming: count_pairs()

Two classes ago we wrote a function we called count_consecutive_pairs(); we can use it to find tokens that frequently appear next to each other!
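The original code isn't reproduced on this slide; a minimal sketch consistent with the slide-6 output (where the same helper is called count_pairs()) might be:

from collections import Counter

def count_consecutive_pairs(ids):
    # zip the list against itself shifted by one to get every adjacent pair;
    # returning a Counter is what lets slide 6 call counts.most_common()
    return Counter(zip(ids, ids[1:]))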

8 of 17

programming: merge_pair()

Last class we wrote a function we called replace_pairs(), which we can use to merge the most frequent pairs into new, larger tokens!
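Again the exact code isn't in this export; a replace_pairs() sketch that matches the slide-6 behavior:

def replace_pairs(ids, pair, new_id):
    # walk the list; every time the target pair appears, emit new_id instead
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out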

9 of 17

programming: Classes

A class is a data type that defines both data and the functions used to manipulate that data. Specific instances of a class are called objects.

NOTE: We can use dir() on a class or object to print out the attributes and functions it has
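A toy example (hypothetical names, just to show the syntax):

class Dog:
    def __init__(self, name):
        self.name = name              # data: an attribute

    def speak(self):                  # a function (method) that uses the data
        return self.name + ' says woof!'

rex = Dog('Rex')                      # rex is an object: one instance of Dog
print(rex.speak())

> Rex says woof!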

10 of 17

programming: Simple_Tokenizer()

We can put all of this together in a class to create a tokenizer we can re-use!
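The class itself isn't captured in this export; one way to wire the pieces together (a sketch, assuming the count_consecutive_pairs() and replace_pairs() helpers above) is:

class Simple_Tokenizer:
    def __init__(self):
        self.merges = {}                          # (left, right) -> new token id

    def train(self, text, num_merges):
        ids = list(text.encode('utf-8'))
        for i in range(num_merges):
            counts = count_consecutive_pairs(ids)
            if not counts:
                break
            pair = counts.most_common()[0][0]     # most frequent adjacent pair
            new_id = 256 + i                      # bytes use 0-255 (slide 6 starts at 257)
            ids = replace_pairs(ids, pair, new_id)
            self.merges[pair] = new_id

    def encode(self, text):
        ids = list(text.encode('utf-8'))
        for pair, new_id in self.merges.items():  # replay merges in training order
            ids = replace_pairs(ids, pair, new_id)
        return ids

    def decode(self, ids):
        # expand each token id back to raw bytes, then bytes back to a string
        vocab = {i: bytes([i]) for i in range(256)}
        for (a, b), new_id in self.merges.items():
            vocab[new_id] = vocab[a] + vocab[b]
        return b''.join(vocab[t] for t in ids).decode('utf-8', errors='replace')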

11 of 17

“tonight?” vs. “tonight”

This simple tokenizer works, but it seems very silly that “tonight?” is a different token than “tonight”.

BPE is excellent at breaking down words statistically, but it can make suboptimal merges like combining punctuation with words.

12 of 17

programming: Regex

Regex allows us to define patterns that we can use to filter strings

NOTE: Regex is super confusing and almost a language unto itself, only sickos have it memorized. Just use a cheat sheet or ask chat.

NOTE: regex is short for “regular expression” (or “rational expression”) and I actually met one of the creators while interviewing at Princeton!
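A quick taste of the standard-library re module (a generic example, not course code):

import re

# \w+ matches runs of letters/digits/underscores, skipping everything else
print(re.findall(r'\w+', 'No cap, are u rolling?'))

> ['No', 'cap', 'are', 'u', 'rolling']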

13 of 17

Regex in Tokenization

What we can do is pre-split our text on linguistic “landmarks” like whitespace and punctuation using regex.

The regex acts as a “pre-filter” that prevents these issues by ensuring letters, numbers, punctuation, and whitespace land in separate chunks.

print(r_t.chunk("No cap are u rolling the party tonight?"))

> '(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+

> ['No', ' cap', ' are', ' u', ' rolling', ' the', ' party', ' tonight', '?']
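The printed pattern appears to be the GPT-4 split pattern; it relies on \p{...} character classes and possessive quantifiers (?+, ++), so it needs the third-party regex package rather than the standard re module. A chunk() sketch under that assumption:

import regex  # pip install regex

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

def chunk(text):
    # pre-split into word-ish chunks; BPE then runs inside each chunk,
    # so a merge can never glue '?' onto 'tonight'
    return regex.findall(GPT4_SPLIT_PATTERN, text)

print(chunk('No cap are u rolling the party tonight?'))

> ['No', ' cap', ' are', ' u', ' rolling', ' the', ' party', ' tonight', '?']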

14 of 17

Code Copy → Lab 01

Lab 01 will be a guided walkthrough of programming a RegexTokenizer() class, the first piece of code for the LLM we’re building this semester.

[Pipeline diagram revisited: the same prompt-to-response pipeline, now annotated with Labs 01-05; the Tokenizer stage is Lab 01.]

15 of 17

nanochat Tokenizer

16 of 17

Tokenization

Are we actually gaining anything from this?
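One quick sanity check using the slide-6 numbers: the raw byte encoding of that string was 32 ids, and after a single merge (‘ug’ → 257) it was 28, so each merge buys some compression.

text = 'rug pug hug pun bun hugs run gun'
print(len(text.encode('utf-8')), len(merged_string))  # merged_string from slide 6

> 32 28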

17 of 17

Tokenization