Programming Practice 01
In Colab, write a function to do the following:
Given two integer arguments, rows and cols, create a list of lists of size (rows, cols) filled with 0s and return it.
print(create_matrix(2, 2))
> [[0, 0], [0, 0]]
Programming Practice Solution 01
programming: list comprehensions
List comprehensions are a compact way to create a new list, and are frequently used to show off.
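A minimal sketch of one solution using a list comprehension (the inner comprehension builds a fresh row each time, so the rows don't share memory the way [[0] * cols] * rows would):

def create_matrix(rows, cols):
    # one fresh row of `cols` zeros for each of the `rows` rows
    return [[0 for _ in range(cols)] for _ in range(rows)]

print(create_matrix(2, 2))
> [[0, 0], [0, 0]]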
CSE 10124: Tokenization 03
Where are we?
Embeddings
Tokenizer
“Please help me with my homework!!”
Response: “Certainly, as a large language model I’d be happy to help you cheat on your homework assignment”
Last Class - Tokenization 02
1.) Convert input to bytes
string_bytes = list('rug pug hug pun bun hugs run gun'.encode('utf-8'))
> [114, 117, 103, 32, 112, 117, 103, 32, 104, 117, 103, 32, 112, 117, 110, 32, 98, 117, 110, 32, 104, 117, 103, 115, 32, 114, 117, 110, 32, 103, 117, 110]
2.) Find and count pairs
counts = count_pairs(string_bytes)
> {(114, 117): 2, (117, 103): 4, (103, 32): 3, (32, 112): 2, (112, 117): 2, (32, 104): 2, (104, 117): 2, (117, 110): 4, (110, 32): 3, (32, 98): 1, (98, 117): 1, (103, 115): 1, (115, 32): 1, (32, 114): 1, (32, 103): 1, (103, 117): 1}
3.) Replace the most common pair with a new token
merged_string = replace_pairs(string_bytes, counts.most_common()[0][0], 257)
> [114, 257, 32, 112, 257, 32, 104, 257, 32, 112, 117, 110, 32, 98, 117, 110, 32, 104, 257, 115, 32, 114, 117, 110, 32, 103, 117, 110]
programming: count_pairs()
Two classes ago we wrote a function we called count_consecutive_pairs(); we can use it to find tokens that frequently appear next to each other!
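For reference, a minimal version of count_pairs() might look like this (returning a Counter so we can call most_common() on it, as in step 2 above):

from collections import Counter

def count_pairs(tokens):
    # zip(tokens, tokens[1:]) walks the list one consecutive pair at a time
    return Counter(zip(tokens, tokens[1:]))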
programming: merge_pair()
Last class we wrote a function we called replace_pairs(), which we can use to merge a frequent pair into a new, larger token!
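A minimal sketch of replace_pairs(): walk the token list and, wherever the target pair appears back-to-back, emit the new token instead of the two old ones:

def replace_pairs(tokens, pair, new_token):
    merged = []
    i = 0
    while i < len(tokens):
        # match the pair only when there are two elements left to compare
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2  # skip over both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged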
programming: Classes
A class is a data type that defines both data and the functions used to manipulate that data. Specific instances of a class are called objects.
NOTE: We can use dir(class) to print out the attributes and functions a class has
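As a toy example (the Dog class here is purely illustrative):

class Dog:
    def __init__(self, name):
        self.name = name     # data: an attribute

    def speak(self):         # a function (method) that uses that data
        return f"{self.name} says woof!"

rex = Dog('Rex')             # rex is an object: an instance of Dog
print(rex.speak())
> Rex says woof!
print(dir(rex))              # lists the attributes and methods rex has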
programming: Simple_Tokenizer()
We can put all of this together in a class to create a tokenizer we can re-use!
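A minimal sketch of what such a class might look like, built from count_pairs() and replace_pairs() above (the method names and the choice to start new token ids at 256, just past the 0-255 byte values, are conventions, not the only way to do it):

class Simple_Tokenizer:
    def __init__(self):
        self.merges = {}  # (pair) -> new token id, in the order learned

    def train(self, text, num_merges):
        tokens = list(text.encode('utf-8'))
        for i in range(num_merges):
            counts = count_pairs(tokens)
            if not counts:
                break
            pair = counts.most_common()[0][0]        # most frequent pair
            self.merges[pair] = 256 + i              # mint a new token id
            tokens = replace_pairs(tokens, pair, 256 + i)

    def encode(self, text):
        tokens = list(text.encode('utf-8'))
        # replay the learned merges in the order they were learned
        for pair, new_token in self.merges.items():
            tokens = replace_pairs(tokens, pair, new_token)
        return tokens

A decode() method would do the reverse: expand each merged token back into its pair until only raw bytes remain, then decode those bytes as UTF-8.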
“tonight?” vs. “tonight”
This simple tokenizer works, but it seems very silly that “tonight?” is a different token than “tonight”.
BPE is excellent at breaking down words statistically, but it can make suboptimal merges like combining punctuation with words.
programming: Regex
Regex allows us to define patterns that we can use to filter strings
NOTE: Regex is super confusing and almost a language unto itself; only sickos have it memorized. Just use a cheat sheet or ask chat.
NOTE: regex is short for “regular expression” (or “rational expression”) and I actually met one of the creators while interviewing at Princeton!
Regex in Tokenization
What we can do is pre-split our text based on linguistic “landmarks” like whitespace and punctuation using regex.
The regex acts as a “pre-filter” that prevents these bad merges, ensuring that letters, numbers, punctuation, and whitespace are split into separate chunks.
print(r_t.chunk("No cap are u rolling the party tonight?"))
> '(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+
> ['No', ' cap', ' are', ' u', ' rolling', ' the', ' party', ' tonight', '?']
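The long pattern shown above is (a version of) GPT-4’s split pattern. A minimal standalone sketch of the chunk() step, assuming the third-party regex module (the \p{L} / \p{N} character classes aren’t supported by the standard re module):

import regex  # third-party: pip install regex

GPT4_SPLIT_PATTERN = r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"

def chunk(text):
    # findall returns every non-overlapping match, so words, numbers,
    # punctuation, and whitespace land in separate chunks
    return regex.findall(GPT4_SPLIT_PATTERN, text)

print(chunk("No cap are u rolling the party tonight?"))
> ['No', ' cap', ' are', ' u', ' rolling', ' the', ' party', ' tonight', '?']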
Code Copy → Lab 01
Lab 01 will be a guided walkthrough of programming a RegexTokenizer() class, the first piece of code for the LLM we’re building this semester.
[Lab roadmap: Lab 01 → Lab 02 → Lab 03 → Lab 04 → Lab 05 → nanochat Tokenizer]
Tokenization
Are we actually gaining anything from this?