LLM & Structure workshop
Uri Goren
Chief Data Scientist @ Argmax
Agenda
1. About us
2. The Data: AdVec
3. LLM: Tokenizer
4. LLM: Transformer
5. LLM: Decoder
6. Workshop begins
About Uri
BestPractix
e-Commerce
Why does Argmax care about search?
We develop recommendation systems and personalized search solutions
Creator Economy
Ad Tech
AdVec: The dataset
AdVec: Ask any question about an app
AdVec: Challenges
How are LLMs built?
A language model is “autocomplete”
But what is a word?
Tokenization: Best-seller, R.I.P.
Phrases: "New York", "as is", "state of the art"
Portmanteaus: overrated, pancake, manicure
Stemming: deserialization
But what is a word? A token!
Character level
Word level
WordPieces are the middle ground
“It was overall unremarkable”
['it', 'was', 'overall', 'un', '##rem', '##ark', '##able']
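You can reproduce this split yourself; a minimal sketch using Hugging Face's `transformers`, assuming the public `bert-base-uncased` checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits rare words into subword units prefixed with "##".
print(tokenizer.tokenize("It was overall unremarkable"))
# -> ['it', 'was', 'overall', 'un', '##rem', '##ark', '##able']
```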
BERT's WordPieces
Deriving word pieces
Transformers usually reserve special tokens for fine-tuning
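A quick way to inspect these reserved tokens; this sketch again assumes the `bert-base-uncased` checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# BERT reserves special tokens such as [CLS], [SEP], [MASK], [PAD], [UNK].
print(tokenizer.all_special_tokens)
# e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

# encode() wraps the input with [CLS] ... [SEP] automatically.
print(tokenizer.decode(tokenizer.encode("hello world")))
# -> '[CLS] hello world [SEP]'
```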
The transformer
Transformer: Masked vs. Causal
Causal (e.g. GPT): used for generation
Masked (e.g. BERT): used for token/sequence classification
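The difference boils down to the attention mask; a minimal illustrative sketch (not the slide's original code):

```python
import torch

seq_len = 4  # toy sequence of 4 tokens

# Masked (BERT-style): every token may attend to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Causal (GPT-style): token i may only attend to positions <= i,
# so generation never peeks at the future.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(causal_mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```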
The Decoding algorithm
Example next-token probabilities:
Assistant: 0.5
Machine: 0.1
Salami: 1e-100
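This is what the model actually exposes: a score for every vocabulary token, turned into probabilities by a softmax. A minimal sketch; the prompt, model, and printed numbers here are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The helpful AI", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token
probs = torch.softmax(logits, dim=-1)        # logits -> probabilities

top = torch.topk(probs, k=3)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}: {p:.3f}")
```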
Constraining the decoding algo
[Figure: given the code prefix "for(int i=0; i" and a [MASK] to fill, the decoder masks out invalid next tokens such as "Salami" and keeps only syntactically valid candidates like <, <=, >, >=, ==; the raw next-token scores are a.k.a. logits]
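The mechanics are simple: before sampling, set the logits of every token outside the allowed set to negative infinity, so invalid continuations get zero probability. A minimal sketch with a made-up toy vocabulary and scores:

```python
import torch

vocab = ["<=", "<", "==", ">=", ">", "Salami", "for", ";"]
logits = torch.tensor([2.0, 1.5, 0.3, 0.9, 0.8, 3.0, -1.0, -2.0])

allowed = {"<=", "<", "==", ">=", ">"}       # valid after "for(int i=0; i"
mask = torch.tensor([tok in allowed for tok in vocab])
logits[~mask] = float("-inf")                # forbid everything else

probs = torch.softmax(logits, dim=-1)
print({tok: round(p.item(), 3) for tok, p in zip(vocab, probs)})
# "Salami" now has probability 0, even though its raw logit was highest.
```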
Verbalizers: Classifying with LLMs
[Figure: fill-mask prompt "Candy Crush is an app for [MASK]"; candidate verbalizer tokens (Productivity, Dating, Game, …) compete on their logits, while unrelated tokens such as "Salami" are ignored]
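A minimal sketch of a verbalizer classifier via Hugging Face's fill-mask pipeline; the model choice and label words are assumptions for illustration:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

labels = ["productivity", "dating", "games"]
results = fill_mask(
    "Candy Crush is an app for [MASK].",
    targets=labels,          # score only the verbalizer tokens
)
for r in results:
    print(r["token_str"], round(r["score"], 4))
# The label with the highest score is the predicted class.
```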
LLMs: Summary
Decoder
The search algorithm that combines the transformer's token probabilities with any decoding constraints.
Tokenizer
Splits strings into tokens; tokens are usually shorter than words and longer than one character.
Transformer
The heart of the LLM: the deep learning model that predicts the probability of the next token.
Our Focus Today
Ready to get started?
The workshop consists of 3 parts:
THANK YOU!
Uri Goren
Chief Data Scientist @ Argmax