Inference Considerations for LLMs

May 2024

Overview

Topics we’ll cover

  • Background:
    • The Transformer architecture
    • Decoders
  • Decoding Methods: How LLMs generate text
  • Inference Parameters
  • Efficient Inference Techniques
    • Lower Precision
    • Flash Attention
    • KV-cache
    • Dynamic Batching
    • Tensor Parallelism
  • Text Generation Inference (TGI)

Background: The Transformer Architecture

Self-attention - the key to it all

Self-attention provides:

  • The ability to learn the relevance and context of all the words in a sentence…not just each word's relationship to its neighbors
  • But each word's relationship to every other word in the sentence
  • And it applies attention weights to those relationships, so the model learns how each word relates to every other word in the sentence

Consider the word “bank” in these two sentences:

  1. He sat by the river bank.
  2. I went to the bank to deposit some cash.

The Transformer

  • Attention is All You Need [2017]
  • Initially developed for machine translation tasks
  • Consists of an encoder and a decoder that are built from the same self-attention mechanism
  • Words are converted to a numerical representation via tokenization
  • An embedding (dense vector) is looked up for each token; these vectors serve as the input to self-attention
  • Self-attention builds a contextual representation of each token as it relates to every other token
  • Each block also contains a feed-forward network
  • Outputs are normalized into a probability score for each word in the vocabulary

Translation Example

[Figure: encoder-decoder translation walkthrough. The 3-token source sentence is embedded as a (3 x 512) matrix and encoded into a (3 x 512) representation. The decoder input grows one token at a time: (1 x 512) for <s>, then (2 x 512) for [<s>, 297], and so on, with each step predicting the next target token.]

Let’s recap

Encoder

  • Builds up bi-directional, contextual understanding of the input sequence
  • Produces one vector per input token

Decoder

  • Utilizes the encoder’s built-up representation
  • Accepts input token/sequence
  • Auto-regressively predicts what token comes next

Different models use different parts

A rapidly evolving ecosystem

Decoder Models

Decoders

  • Remove the encoder altogether
  • Auto-regressive generation
  • Causal (masked) attention
  • One forward pass per generated token
  • Originally used for open-ended text generation (i.e. continue generating from a given starting sequence)
  • Examples: GPT, BLOOM, OPT, etc.

LLMs

[Figure: animated walkthrough of auto-regressive generation. Given the input prompt “$ recite the first law”, the LLM produces one token per forward pass - first “A”, then “robot”, then “may”, then “not” - appending each newly generated token to the input before the next pass.]

Decoding Methods: How LLMs Generate Text

Recall the decoder’s final output layer

… a probability distribution over all tokens in the vocabulary

Example input: “The dog [MASK]” → a probability for every possible next word

Greedy Search

Simplest decoding approach - at each step, select the word with the highest probability as the next word.

[Figure: word-probability tree starting from “The”. First-step candidates: nice (0.5), dog (0.4), car (0.1). Second-step candidates include woman (0.4), house (0.3), guy (0.3), has (0.9), drives (0.5), is (0.3), turns (0.2), and (0.05), runs (0.05). Greedy search follows “The” → “nice” (0.5) → “woman” (0.4).]

Misses high-probability sequences hidden behind low-probability words: “The dog has” (0.4 × 0.9 = 0.36) is more likely overall than the greedy choice “The nice woman” (0.5 × 0.4 = 0.2).
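
A minimal sketch of greedy decoding using the Hugging Face transformers generate() API (gpt2 is just a small stand-in model, not one from the slides):

# Greedy decoding: do_sample=False (with the default num_beams=1) picks the
# argmax token at every step.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))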

Beam Search

Keep the num_beams most likely hypotheses at each time step, then choose the hypothesis with the highest overall probability.

With the same word-probability tree as above and num_beams = 2:

  • @ t == 1: “The dog” (0.4), “The nice” (0.5)
  • @ t == 2: “The dog has” (0.4 × 0.9 = 0.36), “The nice woman” (0.5 × 0.4 = 0.2)

Beam search suffers from repetitive generation, and human language doesn’t follow a distribution of high-probability next words, so the output tends to sound boring.
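
The same sketch with beam search enabled (again with gpt2 purely as a stand-in):

# Beam search: track the num_beams best hypotheses at each step and return
# the one with the highest overall probability.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, num_beams=2, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))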

Sampling

Randomly pick the next word according to its conditional probability distribution

Reduces the likelihood that words are repeated, but increases the chance that the model is “too creative” - i.e. it wanders off to words or topics that don’t make sense.
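
A minimal sampling sketch (gpt2 as a stand-in; the output changes from run to run unless the seed is fixed):

# Pure sampling: draw the next token at random from the model's conditional
# distribution instead of always taking the most likely one.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)                                        # sampling is stochastic
tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=True, top_k=0)  # top_k=0 disables top-k filtering
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))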

Inference Parameters

Top-K Sampling

Keep only the k most likely next words, redistribute the probability mass among them, and sample the next word from that reduced set.

One limitation of a fixed k: some next-word distributions are very sharp (only a few plausible candidates), whereas others are much flatter (many plausible candidates), so the same k can be either too permissive or too restrictive.

Top-P (nucleus) Sampling

Choose from the smallest possible set of words whose cumulative probability exceeds the probability p.

The size of this candidate set (the number of words in it) grows and shrinks dynamically according to the next word's probability distribution.

Example: p = 0.92 samples from the smallest set of words whose cumulative probability exceeds 92%.

Temperature

A scaling factor applied to the logits before the softmax, which changes the shape of the next-token probability distribution.

Low temperature → less random (sharper distribution); high temperature → more random (flatter distribution).
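
A sketch combining the three parameters above in a single generate() call; the values are illustrative rather than recommendations, and gpt2 is a stand-in model:

# Top-k, top-p (nucleus) and temperature combined.
# Temperature divides the logits before the softmax: <1 sharpens, >1 flattens.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dog", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    top_k=50,         # keep only the 50 most likely tokens...
    top_p=0.92,       # ...then the smallest set whose cumulative probability exceeds 0.92
    temperature=0.7,  # < 1.0 = less random, > 1.0 = more random
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))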

What generation params should I use??

  • Top-p and top-k sampling sound the most human, but require experimentation
  • You probably shouldn’t use greedy decoding - it produces repetitive word sequences
  • Beam search is decent for tasks where the length of the desired generation is more or less predictable (translation, summarization), but it is not well suited for open-ended generation

For more decoding strategies, including newer ones, check out this guide

Efficient Inference Techniques

Why the need?

Challenges:

  • LLMs are large → high compute and memory demands

  • LLMs require accelerated hardware → expensive

  • For real-world tasks, LLMs often need contextual info → long sequence lengths
    • The self-attention mechanism scales quadratically in memory and compute with respect to sequence length!

Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision:

  • GPT3 requires 2 * 175B = 350 GB VRAM
  • Llama-2-70B requires 2 * 70B = 140 GB VRAM
  • Falcon-40B requires 2 * 40B = 80 GB VRAM

For reference, common GPU hardware (VRAM and approximate on-demand cost):

  • NVIDIA Tesla T4 - 16 GB - ~$0.50/hr (~$336/month)
  • NVIDIA A10 - 24 GB - ~$1.00/hr (~$672/month)
  • NVIDIA A100 - 40 GB or 80 GB - ~$6.00/hr (~$4,032/month)
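
The weight-memory rule of thumb above as a quick calculation; the int8/int4 columns preview the savings from the quantization section later in the deck:

# Rough VRAM needed just to hold the model weights: #params x bytes-per-parameter.
BYTES_PER_PARAM = {"fp16/bf16": 2, "int8": 1, "int4": 0.5}
MODELS = {"GPT-3 (175B)": 175e9, "Llama-2-70B": 70e9, "Falcon-40B": 40e9}

for name, n_params in MODELS.items():
    row = ", ".join(f"{dtype}: ~{n_params * nbytes / 1e9:.0f} GB"
                    for dtype, nbytes in BYTES_PER_PARAM.items())
    print(f"{name}: {row}")
# e.g. GPT-3 (175B): fp16/bf16: ~350 GB, int8: ~175 GB, int4: ~88 GB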

1. Lower Precision

Lower Precision via Quantization

Operating at reduced numerical precision, namely 8-bit and 4-bit, can achieve computational advantages without a considerable decline in model performance.

The key is to reduce precision without compromising model expressivity / accuracy

[Figure: weight-only quantization inside one Transformer layer]

  1. Load the model weights (16-bit)
  2. Quantize the model weights (down to 8-bit)
  3. Dequantize the weights back to 16-bit on-the-fly and compute
  4. Quantize the weights again (back to 8-bit)

We dynamically de-quantize weights on-the-fly to perform matrix multiplications in 16-bit, and then re-quantize.

Inference time is not reduced (often increases), but memory overhead is.

Memory savings:

  • 8-bit = 2x
  • 4-bit = 4x

Popular Quantization Schemes

Two quantization integrations are natively supported in transformers: bitsandbytes and auto-gptq.

bitsandbytes

  • Easy - does not require calibrating the quantized model with input data (zero-shot quantization)
  • Cross-modality - any model can be quantized as long as it contains torch.nn.Linear layers
  • No degradation when merging adapters - PEFT can be used natively for fine-tuning on top of a quantized base model

auto-gptq

  • Fast for text generation - faster than bitsandbytes for generating text
  • N-bit support - possible to quantize down to 2 bits (though this degrades quality)
  • Easily serializable - bitsandbytes only supports saving serialized models in 8-bit
  • Drawbacks: works only for language models (today) and requires calibration on a dataset

Today we also have: AWQ, EETQ, HQQ, AQLM, Quanto, gguf
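
A minimal sketch of 4-bit loading with bitsandbytes through transformers; the model id is a placeholder, and the call assumes the bitsandbytes and accelerate packages plus a CUDA GPU are available:

# Weight-only 4-bit quantization: weights are stored in 4-bit and de-quantized
# to 16-bit on the fly for each matmul (see the mechanism on the previous slide).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"        # placeholder; any causal LM on the Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16 after de-quantization
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)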

2. Flash Attention

Regular Self-Attention

The traditional self-attention mechanism scales quadratically in memory and compute with respect to sequence length.

The Q and K matrices each consist of N vectors, so QKᵀ is an N x N matrix.

Assuming the LLM has 40 attention heads and runs in bf16, the memory required to store QKᵀ is 40 x 2 x N² bytes.

So for:

  • N = 1,000 → ~80 MB VRAM
  • N = 16k → ~19 GB VRAM
  • N = 100k → ~1 TB VRAM

Output O of a single self-attention layer for input X of length N:

  O = softmax(QKᵀ / √d) · V, where Q = X·W_Q, K = X·W_K, V = X·W_V

Self-attention is prohibitively memory-expensive for long input contexts.

How can we reduce this memory requirement?


Flash Attention

A variation of the attention algorithm that is both more memory-efficient and faster, thanks to optimized use of the GPU memory hierarchy.

LLM inference is memory-IO bound - i.e. it takes longer to move 1 MB of data to the GPU compute cores than it does to perform the actual computation on it.

Standard attention transfers intermediate values back and forth between HBM and SRAM multiple times during the computation.

Flash attention loads the data just once, using kernel fusion and tiling to partition the inputs for parallel processing.

With flash attention, VRAM scales linearly with sequence length, and computation is up to ~3x faster.

(HBM = high-bandwidth memory, the off-chip GPU VRAM; SRAM = static random access memory, on-chip.)
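
In recent versions of transformers, Flash Attention 2 can typically be requested at load time; this sketch assumes the flash-attn package and a supported GPU, and the model id is a placeholder:

# Flash Attention 2: same attention outputs, but computed with tiling and kernel
# fusion so the full N x N attention matrix is never materialized in GPU HBM.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # placeholder model id
    torch_dtype=torch.bfloat16,               # flash-attn kernels need fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)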

3. KV-cache

KV-Cache

Autoregressive generation works by iteratively generating tokens and adding them to the input.

KV-cache saves compute resources by reusing previously calculated self-attention key-value pairs, instead of recalculating them for each generated token.

Without kv-cache

Generation steps for input of length = 20 tokens:

shape of input_ids torch.Size([1, 21])

shape of input_ids torch.Size([1, 22])

shape of input_ids torch.Size([1, 23])

shape of input_ids torch.Size([1, 24])

shape of input_ids torch.Size([1, 25])

[' Here is a Python function']

With kv-cache

Generation steps for input of length = 20 tokens:

shape of input_ids torch.Size([1, 1])

length of key-value cache 20

shape of input_ids torch.Size([1, 1])

length of key-value cache 21

shape of input_ids torch.Size([1, 1])

length of key-value cache 22

shape of input_ids torch.Size([1, 1])

length of key-value cache 23

shape of input_ids torch.Size([1, 1])

length of key-value cache 24

[' Here', ' is', ' a', ' Python', ' function']

Using the key-value cache has two advantages:

  • A significant increase in computational efficiency, since far fewer computations are performed than when recomputing the full QKᵀ matrix at every step. This leads to faster inference.

  • The maximum required memory does not grow quadratically with the number of generated tokens, but only linearly.

Extremely useful for chat and RAG applications as we don’t need to re-encode the entire history of tokens on each forward pass
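
A sketch of a manual decode loop that reuses the key-value cache, similar to the loop behind the logs above (gpt2 is a stand-in; the exact type of the returned cache object varies across transformers versions):

# Each step feeds only the newest token; the K/V of all previous tokens come
# from the cache instead of being recomputed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids
past_key_values = None
generated = []

with torch.no_grad():
    for _ in range(5):
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values                            # cache grows by one position per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        generated.append(next_token.item())
        input_ids = next_token                                           # only the new token is fed next time

print(tokenizer.decode(generated))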

4. Dynamic Batching

Naive/Static Batching

Batch size remains constant until inference is complete for each sequence in the batch

  • LLM inference is memory-IO bound - i.e. it takes longer to load 1 MB of data onto the GPU compute cores than it does to perform the actual computation
  • In text generation, sequences (both input and output) have variable lengths
  • This means GPUs are underutilized when some sequences in a batch finish generating before others

Dynamic/Continuous Batching

Also called “iteration-level scheduling” where sequences in a batch are swapped in and out per iteration to make best use of GPU memory

  • Adjust batches per iteration instead of waiting until every sequence in a batch has completed generation
  • Once a sequence in a batch has completed, we need logic to determine if it makes sense to insert a new sequence in its place
  • Achieves higher GPU utilization, since the GPU does not sit idle waiting for every sequence in a batch to complete before new ones are started (see the toy sketch below)
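
A toy, model-free sketch of iteration-level scheduling; model_forward and the request format are made up for illustration and are not part of any real serving engine:

# Toy continuous-batching loop: after every decode step, finished sequences are
# evicted and waiting requests are admitted, so batch slots never sit idle.
import random
from collections import deque

MAX_BATCH_SIZE = 8

def model_forward(seq):
    """Dummy stand-in for one decode step of a real LLM on one sequence."""
    return "<eos>" if random.random() < 0.2 else "tok"

def finished(seq):
    """Stopping check: EOS emitted or max_new_tokens reached."""
    return (seq["tokens"] and seq["tokens"][-1] == "<eos>") or \
           len(seq["tokens"]) >= seq["max_new_tokens"]

def serve(requests):
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        # Admit waiting requests into free batch slots (per-iteration scheduling).
        while waiting and len(active) < MAX_BATCH_SIZE:
            active.append(waiting.popleft())
        # One decode iteration for every active sequence.
        for seq in active:
            seq["tokens"].append(model_forward(seq))
        # Evict completed sequences immediately instead of waiting for the whole batch.
        done.extend(seq for seq in active if finished(seq))
        active = [seq for seq in active if not finished(seq)]
    return done

requests = [{"prompt": f"req-{i}", "tokens": [], "max_new_tokens": 16} for i in range(20)]
print(len(serve(requests)), "requests completed")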

5. Tensor Parallelism

Tensor Parallelism

What happens when you have Llama-3-70B, which needs 140 GB of VRAM, but the largest GPU card only has 80 GB?

  • A technique used to fit a large model across multiple GPUs: each GPU processes only a slice of each tensor, and the full tensor is aggregated only for operations that require the whole thing (sketched below)
  • Allows large models that don’t fit in a single GPU’s RAM to run efficiently across multiple GPUs
  • More available GPU RAM also means the ability to handle larger batches, i.e. more throughput
  • Requires rewriting/modifying the model architecture
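
A single-process PyTorch sketch of the core idea: shard a weight matrix column-wise, compute partial outputs per shard, then gather. In a real system each shard lives on its own GPU and the gather is a collective communication op:

# Core idea of tensor parallelism: each "device" holds one column shard of the
# weight matrix, computes its slice of the output, and the slices are gathered.
import torch

torch.manual_seed(0)
d_model, d_ff, n_shards = 512, 2048, 2

x = torch.randn(4, d_model)        # a batch of 4 token embeddings
W = torch.randn(d_model, d_ff)     # full weight matrix of one linear layer

# Split W column-wise; in real tensor parallelism each shard lives on its own GPU.
shards = torch.chunk(W, n_shards, dim=1)

# Each "device" computes only its slice of the output...
partials = [x @ W_i for W_i in shards]

# ...and the slices are gathered (all-gather / concat) into the full output.
y_parallel = torch.cat(partials, dim=1)

assert torch.allclose(y_parallel, x @ W, atol=1e-4)   # same result as the unsharded layer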

Text Generation Inference (TGI)

Serving framework is important for performance

TGI: A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power LLMs

  • Continuous batching of incoming requests for increased total throughput
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Optimized transformers code for inference using flash-attention and Paged Attention
  • Quantization with bitsandbytes and GPTQ
  • Guidance / constrained decoding / function calling / json-mode

Others include: vLLM, Triton, Seldon, etc.
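
Once a TGI server is running (for example via its Docker image), it exposes an HTTP endpoint; a minimal sketch of querying it with the huggingface_hub client, where the URL and parameters are illustrative:

# Query a running TGI server; token streaming uses Server-Sent Events under the hood.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")   # assumed local TGI endpoint

# Non-streaming request
print(client.text_generation("Recite the first law of robotics:", max_new_tokens=50))

# Streaming request: tokens arrive one at a time as they are generated
for token in client.text_generation(
    "Recite the first law of robotics:", max_new_tokens=50, stream=True
):
    print(token, end="", flush=True)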

Thank you!