1 of 51

Simplicity

Joel Grus

Southern Data Science Conference 2022

2 of 51

3 of 51

Simplicity

4 of 51

What's the simplest thing that might possibly work?

5 of 51

What's the simplest thing that might possibly work? And why didn't you try that first?

6 of 51

Simplicity

7 of 51

"There's a lot to love about the ML community, but one thing I don't love is the cult of complexity where you get respect by using big equations, big words, and big models."

https://twitter.com/_brohrer_/status/1553811938886520842

8 of 51

"Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better."

– Edsger W. Dijkstra

9 of 51

"Simple solutions are easier to implement and scale than complex solutions. .... We need to start acknowledging the power of simple."

https://twitter.com/bryanl/status/1555619528096235525

10 of 51

11 of 51

12 of 51

13 of 51

14 of 51

15 of 51

Complexity?

16 of 51

Model Complexity

17 of 51

18 of 51

"Seek simplicity, and distrust it."

– Alfred North Whitehead

19 of 51

Text Classification

20 of 51

Naive Bayes model

  • chop email into words
  • P(spam|words) ∝ P(words|spam) P(spam)
  • = P(word₁|spam) … P(wordₙ|spam) P(spam)

  • 2n + 2 parameters
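The recipe above (the 2n + 2 parameters are two class priors plus a per-word probability for each class) can be sketched in a few lines of plain Python. The toy training data is hypothetical, and the probabilities are combined in log space to avoid underflow:

```python
from collections import Counter
from math import log, exp

# Hypothetical toy training data: (words, is_spam) pairs.
train = [
    (["win", "money", "now"], True),
    (["cheap", "money", "offer"], True),
    (["meeting", "tomorrow", "agenda"], False),
    (["lunch", "tomorrow"], False),
]

spam_counts, ham_counts = Counter(), Counter()
n_spam = sum(1 for _, is_spam in train if is_spam)
n_ham = len(train) - n_spam
for words, is_spam in train:
    (spam_counts if is_spam else ham_counts).update(words)

def p_spam(words, k=1.0):
    """P(spam | words) via Bayes' rule, with add-k smoothing."""
    log_spam = log(n_spam / len(train))   # log P(spam)
    log_ham = log(n_ham / len(train))     # log P(ham)
    for w in words:
        log_spam += log((spam_counts[w] + k) / (n_spam + 2 * k))
        log_ham += log((ham_counts[w] + k) / (n_ham + 2 * k))
    return exp(log_spam) / (exp(log_spam) + exp(log_ham))

print(p_spam(["money", "now"]))         # well above 0.5
print(p_spam(["meeting", "tomorrow"]))  # well below 0.5
```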

21 of 51

Logistic Regression model

  • convert email to feature vector x1 … xn
  • fit model
  • log (p/(1-p)) = b0 + b1 x1 + … + bn xn

  • (n + 1) parameters
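As a sketch of the (n + 1)-parameter model above, here is a tiny from-scratch fit by stochastic gradient ascent. The binary features and data are hypothetical (think x = [contains "money", contains "meeting"]); in practice you would reach for a library like scikit-learn:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit b0 + b1*x1 + ... + bn*xn by stochastic gradient ascent."""
    n = len(xs[0])
    b = [0.0] * (n + 1)  # (n + 1) parameters: intercept plus one weight per feature
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = sigmoid(b[0] + sum(bi * xi for bi, xi in zip(b[1:], x)))
            err = y - p                 # gradient of the log-likelihood
            b[0] += lr * err
            for i in range(n):
                b[i + 1] += lr * err * x[i]
    return b

def predict(b, x):
    return sigmoid(b[0] + sum(bi * xi for bi, xi in zip(b[1:], x)))

# Hypothetical features: x = [contains "money", contains "meeting"]
xs = [[1, 0], [1, 0], [0, 1], [0, 1]]
ys = [1, 1, 0, 0]
b = fit_logistic(xs, ys)
print(predict(b, [1, 0]))  # high spam probability
print(predict(b, [0, 1]))  # low spam probability
```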

22 of 51

LSTM Model

  • tokenize email and convert to a sequence of embeddings
  • run the embeddings through an LSTM and get a final "state"
  • learn to classify the final hidden state

  • thousands of parameters*
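To make the starred "thousands of parameters" concrete, one can count the weights in a single-layer LSTM classifier. The embedding and hidden sizes below are hypothetical illustrations, and the embedding table itself is excluded (more on that shortly):

```python
def lstm_param_count(embedding_dim, hidden_dim, num_classes=2):
    """Parameters in a single-layer LSTM classifier, excluding the embedding table.

    Each of the four gates (input, forget, cell, output) has an
    input-to-hidden weight matrix, a hidden-to-hidden weight matrix,
    and a bias vector.
    """
    gates = 4 * (hidden_dim * embedding_dim   # input-to-hidden weights
                 + hidden_dim * hidden_dim    # hidden-to-hidden weights
                 + hidden_dim)                # biases
    classifier = hidden_dim * num_classes + num_classes  # final linear layer
    return gates + classifier

# Hypothetical sizes: 50-dim embeddings, 64-dim hidden state.
print(lstm_param_count(50, 64))  # tens of thousands
```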

23 of 51

BERT model

  • convert email to
    • wordpiece embeddings
    • segment embeddings
    • positional embeddings
  • feed to transformer model
  • use pretrained final [CLS] embedding as input to classifier
  • fine-tune

  • 110M parameters

24 of 51

Which is simplest?

  1. Naive Bayes
  2. Logistic Regression
  3. LSTM
  4. BERT

25 of 51

Which would you recommend?

  • Naive Bayes
  • Logistic Regression
  • LSTM
  • BERT

26 of 51

Which would you recommend?

  • Naive Bayes
  • Logistic Regression
  • LSTM
  • BERT

27 of 51

Free Riding

LSTM Model

GloVe vectors

28 of 51

BERT model

  • convert email to
    • wordpiece embeddings
    • segment embeddings
    • positional embeddings
  • feed to transformer model
  • use pretrained final [CLS] embedding as input to classifier
  • fine-tune

  • 110M parameters

29 of 51

BERT model

  • convert email to
    • wordpiece embeddings
    • segment embeddings
    • positional embeddings
  • feed to transformer model
  • use pretrained final [CLS] embedding as input to classifier
  • fine-tune

  • 110M parameters

compute [CLS] embedding

thousands of parameters?

30 of 51

Hidden Complexity

31 of 51

"Out of clutter, find simplicity."

– Albert Einstein

32 of 51

A digression:

Woodworking

33 of 51

34 of 51

35 of 51

36 of 51

37 of 51

38 of 51

BERT in 2018

39 of 51

BERT 2022

from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

training_args = TrainingArguments(...)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

40 of 51

Another example:

Sorting
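Sorting makes the same point: what was once a classic exercise in algorithmic complexity is now, with better tools, a single call to the standard library (the example strings are hypothetical):

```python
data = [5, 3, 1, 4, 2]

# Decades of algorithmic research (and a well-engineered standard
# library) hide behind this one call.
print(sorted(data))

# The same call handles richer cases via a key function.
emails = ["Re: lunch", "WIN MONEY NOW", "agenda"]
print(sorted(emails, key=str.lower))  # case-insensitive order
```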

41 of 51

As our tools get better, the boundary between "simple" and "complex" changes.

42 of 51

"Simple solutions are easier to implement and scale than complex solutions. .... We need to start acknowledging the power of simple."

https://twitter.com/bryanl/status/1555619528096235525

43 of 51

"I’m often asked: Why use AngelList to run a fund?

My answer: AngelList abstracts away *all* the complexity associated w/ starting & running a fund"

https://twitter.com/avlok/status/1564350808002478080

44 of 51

Abstract Away the Complexity

45 of 51

Create Jigs to Abstract Away the Complexity

  • pretrained models
  • modeling libraries
  • project templates
  • clean APIs
  • best practices
  • shared processes

46 of 51

"Manifest plainness,

Embrace simplicity,

Reduce selfishness,

Have few desires."

– Lao-tzu

47 of 51

Manifest plainness

48 of 51

Reduce selfishness

49 of 51

Have few desires

50 of 51

Embrace simplicity

51 of 51

Thanks!