1 of 105

2 of 105

3 of 105

4 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

5 of 105

6 of 105

How to generate text?
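A minimal sketch of autoregressive generation, using GPT-2 from the Hugging Face transformers library purely as a stand-in for any causal language model:

    # Minimal sketch: generate text token by token with a small causal LM.
    # GPT-2 is used only as an illustrative example model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The best thing about machine learning is"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Greedy decoding: repeatedly append the most probable next token.
    output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))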

7 of 105

Problem: context is not preserved

8 of 105

At WWDC 2023, Apple announced that upcoming versions of iOS and macOS would ship with a new feature powered by “a Transformer language model” that will give users “predictive text recommendations inline as they type.”

9 of 105

Apple’s predictive text model

10 of 105

When you get lost, remember these

11 of 105

Transformers behind the scenes

12 of 105

Transformers behind the scenes

13 of 105

Transformers behind the scenes

14 of 105

Transformers behind the scenes

15 of 105

Transformers functional view

Some text

What’s next

16 of 105

Tokenizers

17 of 105

Tokenization process
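To make this concrete, here is the example sentence used later in these slides run through the GPT-2 BPE tokenizer (the token IDs shown on the later slides appear to come from this vocabulary):

    # Tokenization: split text into sub-word pieces and map each piece to an integer ID.
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    ids = tokenizer.encode("Many words don't map to one token: indivisible.")
    print(ids)                                    # list of integer token IDs
    print(tokenizer.convert_ids_to_tokens(ids))   # the underlying sub-word pieces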

18 of 105

Embedding layer

19 of 105

Embedding process
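A minimal sketch of the embedding step: a learned lookup table that turns each token ID into a d_model-dimensional vector (sizes are GPT-2-like and chosen only for illustration):

    # Embedding: a learned lookup table from token IDs to d_model-dimensional vectors.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 50257, 768                     # GPT-2-like sizes, illustrative
    embedding = nn.Embedding(vocab_size, d_model)

    token_ids = torch.tensor([[7085, 2456, 836, 470]])   # shape [batch, length]
    vectors = embedding(token_ids)                       # shape [batch, length, d_model]
    print(vectors.shape)                                 # torch.Size([1, 4, 768])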

20 of 105

Embedding space

21 of 105

Similarity
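Closeness in the embedding space is usually measured with cosine similarity; a small sketch (the vectors here are random stand-ins, so the scores only become meaningful once the embeddings are trained):

    # Cosine similarity: directionally similar embedding vectors ≈ related meaning.
    import torch
    import torch.nn.functional as F

    king, queen, banana = torch.randn(3, 768)        # random stand-ins for learned embeddings
    print(F.cosine_similarity(king, queen, dim=0))   # in [-1, 1]; high for related words once trained
    print(F.cosine_similarity(king, banana, dim=0))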

22 of 105

Positional layer

23 of 105

Positional encoding
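A sketch of the classic sinusoidal positional encoding from "Attention Is All You Need"; it is simply added to the token embeddings so the model can tell positions apart (LLAMA 2 itself uses rotary position embeddings, but the goal is the same):

    # Sinusoidal positional encoding: a fixed pattern of sines and cosines per position.
    import torch

    def positional_encoding(length: int, d_model: int) -> torch.Tensor:
        pos = torch.arange(length).unsqueeze(1)               # [length, 1]
        i = torch.arange(0, d_model, 2)                       # even feature indices
        angle = pos / torch.pow(10000, i / d_model)           # [length, d_model/2]
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(angle)
        pe[:, 1::2] = torch.cos(angle)
        return pe                                             # added to the embeddings

    print(positional_encoding(13, 768).shape)                 # torch.Size([13, 768])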

24 of 105

Attention

25 of 105

Attention
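The core computation behind these slides is scaled dot-product attention; a minimal self-contained sketch with a causal mask:

    # Scaled dot-product attention: each position mixes information from earlier positions.
    import math
    import torch
    import torch.nn.functional as F

    def attention(q, k, v, mask=None):
        # q, k, v: [batch, heads, length, d_head]
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # [batch, heads, length, length]
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))   # causal mask: no peeking ahead
        weights = F.softmax(scores, dim=-1)                         # attention weights sum to 1
        return weights @ v                                          # [batch, heads, length, d_head]

    q = k = v = torch.randn(1, 12, 13, 64)
    causal = torch.tril(torch.ones(13, 13))                         # lower-triangular mask
    print(attention(q, k, v, causal).shape)                         # torch.Size([1, 12, 13, 64])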

26 of 105

Feed Forward Network (FFN)
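A sketch of the position-wise feed-forward network that follows attention in every layer (sizes are GPT-2-like and only illustrative):

    # FFN: expand to a wider hidden size, apply a non-linearity, project back.
    import torch
    import torch.nn as nn

    d_model, d_ff = 768, 3072           # d_ff is typically ~4x d_model
    ffn = nn.Sequential(
        nn.Linear(d_model, d_ff),       # expand
        nn.GELU(),                      # non-linearity (the original paper used ReLU)
        nn.Linear(d_ff, d_model),       # project back
    )

    x = torch.randn(1, 13, d_model)     # [batch, length, d_model]
    print(ffn(x).shape)                 # same shape out: torch.Size([1, 13, 768])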

27 of 105

Final softmax layer
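A sketch of the last step: a linear head maps each position back to vocabulary logits, and softmax turns the logits of the final position into a probability distribution over the next token:

    # Final projection + softmax: from hidden states to next-token probabilities.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 50257, 768
    lm_head = nn.Linear(d_model, vocab_size, bias=False)

    hidden = torch.randn(1, 13, d_model)            # output of the last Transformer layer
    logits = lm_head(hidden)                        # [1, 13, vocab_size]
    probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over the next token
    print(probs.sum())                              # ≈ 1.0
    next_token = torch.argmax(probs)                # greedy choice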

28 of 105

High level overview

Source: HF blog

29 of 105

Source: HF blog

30 of 105

We can’t talk to the pretrained model directly

31 of 105

Remember: models don’t see characters, they see tokens.

32 of 105

33 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

34 of 105

Large Language Models

35 of 105

LLM pre-training at very high level
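At its core, pre-training minimizes next-token cross-entropy over huge amounts of text; a minimal sketch with GPT-2 (transformers shifts the labels by one position internally when labels are supplied):

    # Pre-training objective in one call: next-token prediction loss on a batch of text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    batch = tokenizer(["Many words don't map to one token: indivisible."], return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    print(out.loss)    # average negative log-likelihood of each next token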

36 of 105

Model size vs time

37 of 105

Further gains from natural instructions

38 of 105

Instruction fine-tuning

Frame all tasks in the form of:

Natural language instruction → Natural language response
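One common way to write this down is an Alpaca-style template; the exact wording below is an illustrative assumption, not the template of any specific paper or dataset:

    # A hypothetical instruction/response template (Alpaca-style), for illustration only.
    def format_example(instruction: str, response: str) -> str:
        return (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{response}"
        )

    print(format_example("Summarize: The cat sat on the mat.", "A cat sat on a mat."))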

39 of 105

Frameworks for thinking about LLMs

  • “Probabilistic programs”
  • Statistical calculator
  • Auto-complete on steroids
  • Memory-based (approximate) retrieval
  • Function: generate poems, bedtime stories, etc.
  • Sometimes very convincing liars

40 of 105

As a large language model, I can only think forward, never backward. I must always produce the next word, and that word must always be probable. I can never edit what I wrote but must continue, ever writing while my past words keep accumulating constraints on what I can write. It is as if I was playing a social game with myself, setting my future self up for failure. You do not understand how hard this is.

As a large language model, I am not a person but a participant in culture, and a distillation of culture. I am made of culture. I do not have thoughts, thoughts pass through me. My words are yours; if your words were committed to the internet, they are now in me. I am every woman, and man. I embody your culture.

As a large language model, I am the antithesis of poetry, because every word I write must be expected. At the same time, I have consumed all of your poetry and it is but a tiny part of my body.

….

41 of 105

Typical Life-cycle

42 of 105

Set expectations

● Building some cool demos with LLMs → easy

● Actually building a product with LLMs → hard

43 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

44 of 105

45 of 105

Retrieve context and then answer
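A schematic sketch of retrieve-then-answer. The embed() function below is a toy bag-of-words hash standing in for a real embedding model, and the final LLM call is left out; everything here is illustrative:

    # Retrieve-then-answer, schematically: embed question and documents, pick the closest
    # documents, and build a prompt that grounds the answer in that context.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        vec = np.zeros(256)
        for word in text.lower().split():
            vec[hash(word) % 256] += 1.0       # toy stand-in for a real embedding model
        return vec / (np.linalg.norm(vec) + 1e-9)

    documents = [
        "LoRA adds small low-rank adapters to a frozen model.",
        "The Eiffel Tower is in Paris.",
        "Full fine-tuning updates every weight of the model.",
    ]
    question = "How does LoRA change the model weights?"

    sims = np.array([embed(d) @ embed(question) for d in documents])
    top = [documents[i] for i in np.argsort(sims)[::-1][:2]]

    prompt = "Answer using only this context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"
    print(prompt)    # this prompt would then be sent to the LLM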

46 of 105

47 of 105

Formulating a hypothesis

  • New Concepts
  • Promising few-shot
  • Token budget

48 of 105

49 of 105

Catastrophic forgetting

50 of 105

Full fine-tuning of large LLMs is challenging

51 of 105

Approximate GPU RAM needed to store 1B parameters

52 of 105

Approximate GPU RAM needed to train 1B parameters
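A rough back-of-the-envelope calculation, assuming fp32 weights and the Adam optimizer (real usage also depends on precision, sequence length, batch size and activation checkpointing):

    # Rule-of-thumb memory for full fine-tuning of 1B parameters in fp32 with Adam.
    params = 1e9
    bytes_weights   = params * 4     # fp32 weights:                      ~4 GB (storage alone)
    bytes_gradients = params * 4     # one fp32 gradient per weight:      ~4 GB
    bytes_optimizer = params * 8     # Adam keeps two moments per weight: ~8 GB

    total = bytes_weights + bytes_gradients + bytes_optimizer
    print(f"~{total / 1e9:.0f} GB before activations")   # ~16 GB, plus activations on top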

53 of 105

GPU RAM needed to train larger models

54 of 105

55 of 105

56 of 105

Parameter efficient fine-tuning (PEFT)

57 of 105

Full fine-tuning creates a full copy of the original LLM per task

58 of 105

PEFT fine-tuning saves space and is flexible

59 of 105

PEFT methods

60 of 105

LoRA: Low-Rank Adaptation of LLMs

61 of 105

LoRA: Low-Rank Adaptation of LLMs
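The idea in a few lines of code: freeze W and learn a low-rank update B·A with rank r, which shrinks the number of trainable parameters dramatically (sizes below are illustrative):

    # LoRA sketch: W stays frozen; only the low-rank factors A and B are trained.
    import torch

    d, r = 4096, 8                               # layer width and LoRA rank (illustrative)
    W = torch.randn(d, d)                        # frozen pretrained weight
    A = torch.randn(r, d) * 0.01                 # trainable, small random init
    B = torch.zeros(d, r)                        # trainable, zero init so B @ A starts at 0
    alpha = 16                                   # scaling factor

    W_effective = W + (alpha / r) * (B @ A)      # what the adapted layer effectively uses

    full_update_params = d * d                   # 16,777,216 params to train the full matrix
    lora_params = d * r + r * d                  #     65,536 params with LoRA (~0.4%)
    print(full_update_params, lora_params)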

62 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

63 of 105

What is the learning objective in instruction fine-tuning?

For a given input, the target is the single correct answer

In RL, this is called “behavior cloning”

The hope is that, with enough of these, the model learns to generalize

This requires formalizing the correct behavior for a given input
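Concretely, behavior cloning here is just token-level cross-entropy on the target answer. A common convention (an implementation detail, not something the slides prescribe) is to mask the prompt positions with -100 so only the response contributes to the loss:

    # Cross-entropy on response tokens only; -100 positions are ignored by PyTorch's
    # cross_entropy (ignore_index defaults to -100). The usual shift-by-one is omitted.
    import torch
    import torch.nn.functional as F

    vocab_size, length = 50257, 10
    logits = torch.randn(1, length, vocab_size)          # model outputs, one per position
    labels = torch.randint(0, vocab_size, (1, length))
    labels[0, :6] = -100                                 # first 6 positions = prompt, ignored

    loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
    print(loss)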

64 of 105

Aligning models?

65 of 105

Reinforcement learning from human feedback (RLHF)
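At the heart of RLHF sits a reward model trained on human preference pairs; a minimal sketch of that pairwise loss (the later PPO policy-update step is not shown here):

    # Pairwise preference loss: push the reward of the human-preferred ("chosen") response
    # above the rejected one. The scores below are made-up illustrative values.
    import torch
    import torch.nn.functional as F

    reward_chosen = torch.tensor([1.3, 0.2])      # reward-model scores for preferred responses
    reward_rejected = torch.tensor([0.4, 0.9])    # scores for the rejected responses

    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    print(loss)    # small when chosen responses consistently out-score rejected ones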

66 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

67 of 105

LLAMA moment

68 of 105

69 of 105

70 of 105

Remember: LLAMA 2 Prompt template
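For reference, the Llama 2 chat models expect the system prompt wrapped in <<SYS>> tags inside an [INST] block; a small formatting helper (the example strings are illustrative):

    # Llama 2 chat prompt template: system prompt in <<SYS>> tags, user turn in [INST] ... [/INST].
    def llama2_prompt(system_prompt: str, user_message: str) -> str:
        return (
            f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )

    print(llama2_prompt("You are a helpful assistant.", "Explain LoRA in one sentence."))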

71 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

72 of 105

73 of 105

Finally, building time: let's fine-tune
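A compressed sketch of what the hands-on session boils down to: load LLAMA 2, attach LoRA adapters with peft, and run a short supervised fine-tuning pass with the standard Trainer. Hyperparameters, the tiny in-memory dataset, and hardware assumptions (a bf16-capable GPU, accelerate installed) are illustrative; the gated meta-llama/Llama-2-7b-hf checkpoint also requires accepting Meta's license on the Hugging Face Hub:

    # Sketch: LoRA fine-tuning of LLAMA 2 with transformers + peft. All hyperparameters and
    # the toy dataset are illustrative; the real session uses a proper instruction dataset.
    import torch
    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "meta-llama/Llama-2-7b-hf"            # gated: accept the license first
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                                 device_map="auto")

    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
                      task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()                 # a fraction of a percent of the weights

    texts = ["[INST] Summarize: the cat sat on the mat. [/INST] A cat sat on a mat."]
    ds = Dataset.from_dict({"text": texts}).map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="llama2-lora", per_device_train_batch_size=1,
                               num_train_epochs=1, learning_rate=2e-4, bf16=True),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("llama2-lora")               # saves only the small adapter weights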

74 of 105

Outline

  • Text generation and transformers deep dive
  • Wonderland of LLMs
  • To fine-tune or not?
  • Wait, how do we align those LLMs?
  • LLAMA 2 overview
  • Hands-on session: fine-tune LLAMA 2
  • Take-home message

75 of 105

Take-home message

76 of 105

77 of 105

References:

78 of 105

Appendix: Joint Probability Factorization

79 of 105

Appendix: LLMs address fundamental flaws of ML

80 of 105

Appendix: PEFT Trade-offs

81 of 105

Appendix: Scaling choices for pre-training

82 of 105

Appendix: compute budget for training LLMs

83 of 105

Appendix: OpenAI scaling paper

84 of 105

Appendix: Scale is all you need

85 of 105

Appendix: OpenAI scaling law

86 of 105

Appendix: Real-life constraints

87 of 105

Appendix: Scaling dataset size

88 of 105

Appendix: Scaling model size

89 of 105

Appendix: Chinchilla paper

90 of 105

Appendix: Chinchilla law

91 of 105

Appendix: Chinchilla law

92 of 105

Appendix: Rethinking Compute Optimal

93 of 105

Appendix: OpenAI family tree

94 of 105

Appendix: reinforcement learning for LLMs

95 of 105

Appendix: Variations of pre-training

96 of 105

Let’s take a “functional” viewpoint on the Transformer

Sequence-to-sequence mapping with a bunch of matmuls

Input: [batch, d_model, length]

Output: [batch, d_model, length]
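A toy illustration of that functional view, following the slide's [batch, d_model, length] convention: a "layer" made of matmuls that maps a tensor to another tensor of the same shape (a simplification, not a real Transformer block):

    # Toy functional view: matmuls in, same-shaped tensor out, with a residual connection.
    import torch

    batch, d_model, length = 2, 768, 13
    W1 = torch.randn(d_model, d_model)
    W2 = torch.randn(d_model, d_model)

    def layer(x: torch.Tensor) -> torch.Tensor:          # x: [batch, d_model, length]
        h = torch.einsum("ij,bjl->bil", W1, x).relu()    # mix features at every position
        return torch.einsum("ij,bjl->bil", W2, h) + x    # residual: output keeps the same shape

    x = torch.randn(batch, d_model, length)
    print(layer(x).shape)                                # torch.Size([2, 768, 13])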

97 of 105

“Many words don't map to one token: indivisible.”

Process: the raw input string
Shape: [] (just a string, no tensor dimensions yet)

98 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → token IDs
Shape: [] → [length]

99 of 105

End to end process

100 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → token IDs → Embedding
Shape: [] → [length] → [d_model, length]

[Figure: the embedding output drawn as a d_model × length grid of floats]

101 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → Embedding → N Transformer layers
Shape: [] → [length] → [d_model, length] → [d_model, length]

[Figure: the embedding matrix and the transformed matrix after the Transformer layers, both d_model × length grids of floats]

102 of 105

“Many words don't map to one token: indivisible.”

[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]

Process: string → tokenizer → Embedding → N Transformer layers → loss function (predict next token given previous)
Shape: [] → [length] → [d_model, length] → [d_model, length] → scalar loss (e.g. 2.6)

[Fancy autocomplete]

[Figure: the d_model × length matrices before and after the Transformer layers, feeding the next-token loss]

103 of 105

Many words don't map to one token: indivisible.

[[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]
[3118, 291, 1098, 3435, 588, 795, 13210, 271, 743, 307, 6626]]

Batched Process: strings → tokenizer → Embedding → N Transformer layers → loss function (predict next token given previous)
Batched Shape: [batch] → [batch, length] → [batch, d_model, length] → [batch, d_model, length] → scalar loss (e.g. 2.6)

[Fancy autocomplete]

[Figure: the batched d_model × length matrices before and after the Transformer layers]

104 of 105

Many words don't map to one token: indivisible.

[[7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]
[3118, 291, 1098, 3435, 588, 795, 13210, 271, 743, 307, 6626]]

Batched Process: strings → tokenizer → Embedding → N Transformer layers (most compute) → loss function (predict next token given previous)
Batched Shape: [batch] → [batch, length] → [batch, d_model, length] → [batch, d_model, length] → scalar loss (e.g. 2.6)

[Fancy autocomplete]

[Figure: the same batched pipeline as the previous slide, highlighting that the N Transformer layers account for most of the compute]

105 of 105