1 of 21

LLM & Structure workshop

Uri Goren

Chief Data Scientist @ Argmax

2 of 21

Agenda

1. About us
2. The Data: AdVec
3. LLM: Tokenizer
4. LLM: Transformer
5. LLM: Decoder
6. Workshop begins

3 of 21

About Uri

BestPractix

  • Leading argmax.ml
  • Our expertise:
    • NLP
    • RecSys
  • ML Community
    • Podcast
    • Meetup
    • Conferences

4 of 21

Why does argmax care about search?

We develop recommendation systems and personalized search solutions for:

  • e-Commerce
  • Creator Economy
  • Ad Tech

5 of 21

AdVec: The dataset

6 of 21

AdVec: Ask any question about an app

7 of 21

AdVec: Challenges

  • Answers are constrained to “yes” or “no”
  • Needs to run at scale (millions of apps)
  • Cost-effectiveness

8 of 21

How are LLMs built?

9 of 21

A language model is “autocomplete”

  • Estimate probability of each word given prior context
    • P(phone | Please turn off your cell)
  • An N-gram model uses only N−1 words of prior context (Markov assumption)
    • unigram: P(phone)
    • bigram: P(phone | cell)
    • trigram: P(phone | your cell)
  • bigram approximation: P(w_n | w_1 … w_{n−1}) ≈ P(w_n | w_{n−1})

  • N-gram approximation: P(w_n | w_1 … w_{n−1}) ≈ P(w_n | w_{n−N+1} … w_{n−1})
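
A minimal sketch of the bigram idea in Python (the toy corpus below is made up for illustration, not from the workshop data):

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a much larger text collection.
corpus = "please turn off your cell phone please turn off your laptop".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("phone", "cell"))  # P(phone | cell) = 1.0 in this toy corpus
```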

10 of 21

But what is a word?

  • Tokenization: www.google.com, Best-seller, R.I.P
  • Phrases: “New York”, “as is”, “State of the art”
  • Portmanteau: Overrated, Pancake, manicure
  • Stemming: deserialization

11 of 21

But what is a word? A token!

Character level

  • Small vocab
  • Static vocab
  • Low semantic value
  • Requires more constraints

Word level

  • Big Vocab
  • New words appear
  • Very descriptive
  • More likely to generate sentences
  • Words share morphological meaning

12 of 21

Wordpieces are the middle ground

“It was overall unremarkable”

['it', 'was', 'overall', 'un', '##rem', '##ark', '##able']

BERT’s wordpieces
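
A minimal sketch of reproducing this split with the Hugging Face transformers library (assuming the bert-base-uncased checkpoint):

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (downloads the vocabulary on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("It was overall unremarkable"))
# Expected, per the slide: ['it', 'was', 'overall', 'un', '##rem', '##ark', '##able']
```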

13 of 21

Deriving word pieces

Usually, transformers have special tokens reserved for fine-tuning.
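
A minimal sketch of deriving your own wordpieces with the Hugging Face tokenizers library; the corpus path and vocabulary size are placeholders, and the special tokens listed are the ones BERT reserves:

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary from scratch on a plain-text corpus.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],   # placeholder path to your training text
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # reserved for fine-tuning
)
tokenizer.save_model(".")   # writes vocab.txt to the current directory
```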

14 of 21

The transformer

  • “Attention is all you need”
  • More efficient than RNNs
  • Uses self-attention over every pair of tokens in the sequence
  • Powers most state-of-the-art NLP models today

15 of 21

Transformer: Masked vs Causal

  • Causal (left-to-right, e.g. GPT): used for generation

  • Masked (bidirectional, e.g. BERT): used for token/sequence classification

16 of 21

The Decoding algorithm

Example next-token probabilities the decoder chooses among:

  • Assistant: 0.5
  • Machine: 0.1
  • Salami: 1e-100
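
A minimal sketch of where such numbers come from: take a causal LM's logits for the next position and softmax them into probabilities (the gpt2 checkpoint and the prompt are illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I asked my helpful AI"          # illustrative prompt, not from the slides
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores for the next token
probs = torch.softmax(logits, dim=-1)

for word in [" assistant", " machine", " salami"]:
    token_id = tokenizer.encode(word)[0]     # first sub-token of each candidate
    print(word, float(probs[token_id]))
```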

17 of 21

Constraining the decoding algo

Example: given the code prefix “For(int i=0; i”, the next-token scores (a.k.a. logits) at the MASK position are restricted to valid continuations such as <, <=, >, >=, ==, while unrelated tokens like “Salami” are masked out.
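
A minimal sketch of this constraint with Hugging Face generate, using prefix_allowed_tokens_fn to whitelist comparison operators (the gpt2 checkpoint and the whitelist are illustrative assumptions, not the workshop's exact setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "for(int i=0; i"
allowed = ["<", "<=", ">", ">=", "=="]                        # valid continuations
allowed_ids = [tokenizer.encode(tok)[0] for tok in allowed]   # first sub-token of each

def allow_only_operators(batch_id, input_ids):
    # Called at every decoding step; every other token's logit is masked out.
    return allowed_ids

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=1,                          # constrain just the next token here
    prefix_allowed_tokens_fn=allow_only_operators,
)
print(tokenizer.decode(out[0]))
```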

18 of 21

Verbalizers: Classifying with LLMs

Example: prompt the model with “Candy crush is an app for [MASK]” and compare the logits of the label words Productivity, Dating, and Game; unrelated tokens such as “Salami” are ignored.
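
A minimal sketch of a verbalizer with a masked LM: fill the [MASK] slot and compare the probabilities of one word per class (the bert-base-uncased checkpoint and the label words are illustrative assumptions):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

prompt = "candy crush is an app for [MASK]"
labels = ["productivity", "dating", "game"]   # verbalizer: one word per class

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
probs = torch.softmax(logits, dim=-1)

for label in labels:
    token_id = tokenizer.convert_tokens_to_ids(label)
    print(label, float(probs[token_id]))
```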

19 of 21

LLMs: Summary

Decoder

The search algorithm that combines the transformer’s probabilities with any constraints we impose

Tokenizer

Splits strings into tokens; tokens are usually shorter than words but longer than a single character.

Transformer

The heart of the LLM, the transformer is the deep learning model that predicts the probability of the next word.

Our focus today

20 of 21

Ready to get started?

The workshop consists of 3 parts:

  • Verbalizers
  • OpenAI functions (a sketch follows this list)
  • Constrained search in HuggingFace
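
Before we dive in, a minimal sketch of the “OpenAI functions” idea: forcing a yes/no answer through a function schema (the model name, the record_answer function, and the pre-1.0 openai client call are assumptions for illustration):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# Force the model to call a function whose schema only allows "yes" or "no".
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Is Candy Crush a game? Answer yes or no."}],
    functions=[{
        "name": "record_answer",   # hypothetical function name
        "description": "Record a yes/no answer about an app",
        "parameters": {
            "type": "object",
            "properties": {"answer": {"type": "string", "enum": ["yes", "no"]}},
            "required": ["answer"],
        },
    }],
    function_call={"name": "record_answer"},   # force this function to be called
)
print(response["choices"][0]["message"]["function_call"]["arguments"])
```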

21 of 21

THANK YOU!

Uri Goren

Chief Data Scientist @ Argmax