1 of 21

LLM & Structure workshop

Uri Goren

Chief Data Scientist @ Argmax

2 of 21

Agenda

1. About us
2. The Data: AdVec
3. LLM: Tokenizer
4. LLM: Transformer
5. LLM: Decoder
6. Workshop begins

3 of 21

About Uri

BestPractix

  • Leading argmax.ml
  • Our expertise:
    • NLP
    • RecSys
  • ML Community
    • Podcast
    • Meetup
    • Conferences

4 of 21

Why does argmax care about search?

We develop recommendation systems and personalized search solutions for:

  • e-Commerce
  • Creator Economy
  • Ad Tech

5 of 21

AdVec: The dataset

6 of 21

AdVec: Ask any question about an app

7 of 21

AdVec: Challenges

  • Answers are constrained to “yes” or “no”
  • Needs to run at scale (millions of apps)
  • Cost-effectiveness

8 of 21

How are LLMs built?

9 of 21

A language model is “autocomplete”

  • Estimate probability of each word given prior context
    • P(phone | Please turn off your cell)
  • An N-gram model uses only N−1 words of prior context (Markov assumption)
    • unigram: P(phone)
    • bigram: P(phone | cell)
    • trigram: P(phone | your cell)
  • bigram approximation: P(w_n | w_1 … w_{n−1}) ≈ P(w_n | w_{n−1})

  • N-gram approximation: P(w_n | w_1 … w_{n−1}) ≈ P(w_n | w_{n−N+1} … w_{n−1})
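
A minimal sketch of the bigram idea in Python (the toy corpus below is made up for illustration, not from the workshop data):

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a much larger text collection.
corpus = "please turn off your cell phone please turn off your laptop".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("phone", "cell"))  # P(phone | cell) = 1.0 in this toy corpus
```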

10 of 21

But what is a word?

  • Tokenization: www.google.com, Best-seller, R.I.P
  • Phrases: “New York”, “as is”, “State of the art”
  • Portmanteau: Overrated, Pancake, manicure
  • Stemming: deserialization

11 of 21

But what is a word? A token!

Character level

  • Small vocab
  • Static vocab
  • Low semantic value
  • Requires more constraints

Word level

  • Big Vocab
  • New words appear
  • Very descriptive
  • More likely to generate sentences
  • Words share morphological meaning

12 of 21

Wordpieces are the middle ground

“It was overall unremarkable”

['it', 'was', 'overall', 'un', '##rem', '##ark', '##able']

BERT’s wordpieces
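
A minimal sketch of reproducing this split with the Hugging Face transformers library (assuming the bert-base-uncased checkpoint):

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (downloads the vocabulary on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("It was overall unremarkable"))
# Expected, per the slide: ['it', 'was', 'overall', 'un', '##rem', '##ark', '##able']
```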

13 of 21

Deriving word pieces

Usually, transformers have special tokens reserved for fine-tuning.
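
A minimal sketch of deriving your own wordpieces with the Hugging Face tokenizers library; the corpus path and vocabulary size are placeholders, and the special tokens listed are the ones BERT reserves:

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary from scratch on a plain-text corpus.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],   # placeholder path to your training text
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # reserved for fine-tuning
)
tokenizer.save_model(".")   # writes vocab.txt to the current directory
```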

14 of 21

The transformer

  • “Attention is all you need”
  • More efficient than RNNs
  • Uses self-attention over every pair of tokens in the sequence
  • Powers most state-of-the-art NLP models today

15 of 21

Transformer: Masked vs Causal

  • Causal (left-to-right, e.g. GPT): used for generation

  • Masked (bidirectional, e.g. BERT): used for token/sequence classification

16 of 21

The Decoding algorithm

Example next-token probabilities the decoder chooses among:

  • Assistant: 0.5
  • Machine: 0.1
  • Salami: 1e-100
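
A minimal sketch of where such numbers come from: take a causal LM's logits for the next position and softmax them into probabilities (the gpt2 checkpoint and the prompt are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I asked my helpful AI"          # illustrative prompt, not from the slides
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token
probs = torch.softmax(logits, dim=-1)

for word in [" assistant", " machine", " salami"]:
    token_id = tokenizer.encode(word)[0]     # first sub-token of each candidate
    print(word, float(probs[token_id]))
```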

17 of 21

Constraining the decoding algo

Example: given the code prefix “For(int i=0; i”, the next-token scores (a.k.a. logits) at the MASK position are restricted to valid continuations such as <, <=, >, >=, ==, while unrelated tokens like “Salami” are masked out.
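
A minimal sketch of this constraint with Hugging Face generate, using prefix_allowed_tokens_fn to whitelist comparison operators (the gpt2 checkpoint and the whitelist are illustrative assumptions, not the workshop's exact setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "for(int i=0; i"
allowed = ["<", "<=", ">", ">=", "=="]                        # valid continuations
allowed_ids = [tokenizer.encode(tok)[0] for tok in allowed]   # first sub-token of each

def allow_only_operators(batch_id, input_ids):
    # Called at every decoding step; every other token's logit is masked out.
    return allowed_ids

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=1,                          # constrain just the next token here
    prefix_allowed_tokens_fn=allow_only_operators,
)
print(tokenizer.decode(out[0]))
```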

18 of 21

Verbalizers: Classifying with LLMs

Example: prompt the model with “Candy crush is an app for [MASK]” and compare the logits of the label words Productivity, Dating, and Game; unrelated tokens such as “Salami” are ignored.
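
A minimal sketch of a verbalizer with a masked LM: fill the [MASK] slot and compare the probabilities of one word per class (the bert-base-uncased checkpoint and the label words are illustrative assumptions):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

prompt = "candy crush is an app for [MASK]"
labels = ["productivity", "dating", "game"]   # verbalizer: one word per class

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
probs = torch.softmax(logits, dim=-1)

for label in labels:
    token_id = tokenizer.convert_tokens_to_ids(label)
    print(label, float(probs[token_id]))
```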

19 of 21

LLMs: Summary

Decoder

The search algorithm that combines the transformer’s probabilities with any constraints we impose

Tokenizer

Splits strings into tokens; tokens are usually shorter than words but longer than a single character.

Transformer

The heart of the LLM, the transformer is the deep learning model that predicts the probability of the next word.

Our focus today

20 of 21

Ready to get started?

The workshop consists of 3 parts:

  • Verbalizers
  • OpenAI functions (a sketch follows this list)
  • Constrained search in HuggingFace
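
Before we dive in, a minimal sketch of the “OpenAI functions” idea: forcing a yes/no answer through a function schema (the model name, the record_answer function, and the pre-1.0 openai client call are assumptions for illustration):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# Force the model to call a function whose schema only allows "yes" or "no".
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Is Candy Crush a game? Answer yes or no."}],
    functions=[{
        "name": "record_answer",   # hypothetical function name
        "description": "Record a yes/no answer about an app",
        "parameters": {
            "type": "object",
            "properties": {"answer": {"type": "string", "enum": ["yes", "no"]}},
            "required": ["answer"],
        },
    }],
    function_call={"name": "record_answer"},   # force this function to be called
)
print(response["choices"][0]["message"]["function_call"]["arguments"])
```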

21 of 21

THANK YOU!

Uri Goren

Chief Data Scientist @ Argmax