Inference Considerations for LLMs

May 2024

Overview

Topics we’ll cover

  • Background:
    • The Transformer architecture
    • Decoders
  • Decoding Methods: How LLMs generate text
  • Inference Parameters
  • Efficient Inference Techniques
    • Lower Precision
    • Flash Attention
    • KV-cache
    • Dynamic Batching
    • Tensor Parallelism
  • Text Generation Inference (TGI)

Background: The Transformer Architecture

Self-attention - the key to it all

Self-attention provides:

  • The ability to learn the relevance and context of all the words in a sentence…not just each word's relationship to its neighbors
  • But each word's relationship to every other word in the sentence
  • And it applies attention weights to those relationships, so the model learns how each word relates to every other word in the sentence

Consider the word “bank” in these two sentences:

  1. He sat by the river bank.
  2. I went to the bank to deposit some cash.

The Transformer

  • Attention is All You Need [2017]
  • Initially developed for machine translation tasks
  • Consists of an encoder and a decoder that are built from the same self-attention mechanism
  • Words are converted to a numerical representation via tokenization
  • An embedding (dense vector) is looked up for each token; these vectors serve as the input to self-attention
  • Self-attention builds a contextual representation of each token as it relates to every other token
  • Each block also contains a feed-forward network
  • Outputs are normalized into a probability score for each word in the vocabulary

Translation Example

[Figure: encoder-decoder translation walkthrough. The 3-token source sentence is embedded as a (3 x 512) matrix and encoded into a (3 x 512) representation. The decoder input grows one token at a time: (1 x 512) for <s>, then (2 x 512) for [<s>, 297], and so on, with each step predicting the next target token.]

Let’s recap

Encoder

  • Builds up bi-directional, contextual understanding of the input sequence
  • Produces one vector per input token

Decoder

  • Utilizes the encoder’s built-up representation
  • Accepts input token/sequence
  • Auto-regressively predicts what token comes next

Different models use different parts

A rapidly evolving ecosystem

Decoder Models

Decoders

  • Remove the encoder altogether
  • Auto-regressive generation
  • Causal (masked) attention
  • One forward pass per generated token
  • Originally used for open-ended text generation (i.e. continue generating from a given starting sequence)
  • Examples: GPT, BLOOM, OPT, etc.

LLMs

[Figure: animated walkthrough of auto-regressive generation. Given the input prompt “$ recite the first law”, the LLM produces one token per forward pass - first “A”, then “robot”, then “may”, then “not” - appending each newly generated token to the input before the next pass.]

Decoding Methods: How LLMs Generate Text

Recall the decoder’s final output layer

… a probability distribution over all tokens in the vocabulary

Example input: “The dog [MASK]” → a probability for every possible next word

Greedy Search

Simplest decoding approach - at each step, select the word with the highest probability as the next word.

[Figure: word-probability tree starting from “The”. First-step candidates: nice (0.5), dog (0.4), car (0.1). Second-step candidates include woman (0.4), house (0.3), guy (0.3), has (0.9), drives (0.5), is (0.3), turns (0.2), and (0.05), runs (0.05). Greedy search follows “The” → “nice” (0.5) → “woman” (0.4).]

Misses high-probability sequences hidden behind low-probability words: “The dog has” (0.4 × 0.9 = 0.36) is more likely overall than the greedy choice “The nice woman” (0.5 × 0.4 = 0.2).
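
A minimal sketch of greedy decoding using the Hugging Face transformers generate() API (gpt2 is just a small stand-in model, not one from the slides):

# Greedy decoding: do_sample=False (with the default num_beams=1) picks the
# argmax token at every step.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))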

Beam Search

Keep the num_beams most likely hypotheses at each time step, then choose the hypothesis with the highest overall probability.

With the same word-probability tree as above and num_beams = 2:

  • @ t == 1: “The dog” (0.4), “The nice” (0.5)
  • @ t == 2: “The dog has” (0.4 × 0.9 = 0.36), “The nice woman” (0.5 × 0.4 = 0.2)

Beam search suffers from repetitive generation, and human language doesn’t follow a distribution of high-probability next words, so the output tends to sound boring.
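
The same sketch with beam search enabled (again with gpt2 purely as a stand-in):

# Beam search: track the num_beams best hypotheses at each step and return
# the one with the highest overall probability.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, num_beams=2, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))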

Sampling

Randomly pick the next word according to its conditional probability distribution

Reduces the likelihood that words are repeated, but increases the chance that the model is “too creative” - i.e. it wanders off to words or topics that don’t make sense.
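
A minimal sampling sketch (gpt2 as a stand-in; the output changes from run to run unless the seed is fixed):

# Pure sampling: draw the next token at random from the model's conditional
# distribution instead of always taking the most likely one.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)                                        # sampling is stochastic
tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=True, top_k=0)  # top_k=0 disables top-k filtering
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))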

Inference Parameters

Top-K Sampling

Keep only the k most likely next words, redistribute the probability mass among them, and sample the next word from that reduced set.

One limitation of a fixed k: some next-word distributions are very sharp (only a few plausible candidates), whereas others are much flatter (many plausible candidates), so the same k can be either too permissive or too restrictive.

Top-P (nucleus) Sampling

Choose from the smallest possible set of words whose cumulative probability exceeds the probability p.

The size of this candidate set (the number of words in it) grows and shrinks dynamically according to the next word's probability distribution.

Example: p = 0.92 samples from the smallest set of words whose cumulative probability exceeds 92%.

Temperature

A scaling factor applied to the logits before the softmax, which changes the shape of the next-token probability distribution.

Low temperature → less random (sharper distribution); high temperature → more random (flatter distribution).
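
A sketch combining the three parameters above in a single generate() call; the values are illustrative rather than recommendations, and gpt2 is a stand-in model:

# Top-k, top-p (nucleus) and temperature combined.
# Temperature divides the logits before the softmax: <1 sharpens, >1 flattens.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dog", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    top_k=50,         # keep only the 50 most likely tokens...
    top_p=0.92,       # ...then the smallest set whose cumulative probability exceeds 0.92
    temperature=0.7,  # < 1.0 = less random, > 1.0 = more random
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))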

What generation params should I use??

  • Top-p and top-k sampling sound the most human, but require experimentation
  • You probably shouldn’t use greedy decoding - it produces repetitive word sequences
  • Beam search is decent for tasks where the length of the desired generation is more or less predictable (translation, summarization), but it is not well suited for open-ended generation

For more decoding strategies, including newer ones, check out this guide

Efficient Inference Techniques

Why the need?

Challenges:

  • LLMs are large → high compute and memory demands

  • LLMs require accelerated hardware → expensive

  • For real-world tasks, LLMs often need contextual info → long sequence lengths
    • The self-attention mechanism scales quadratically in memory and compute with respect to sequence length!

Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision:

  • GPT3 requires 2 * 175B = 350 GB VRAM
  • Llama-2-70B requires 2 * 70B = 140 GB VRAM
  • Falcon-40B requires 2 * 40B = 80 GB VRAM

For reference, common GPU hardware (VRAM and approximate on-demand cost):

  • NVIDIA Tesla T4 - 16 GB - ~$0.50/hr (~$336/month)
  • NVIDIA A10 - 24 GB - ~$1.00/hr (~$672/month)
  • NVIDIA A100 - 40 GB or 80 GB - ~$6.00/hr (~$4,032/month)
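
The weight-memory rule of thumb above as a quick calculation; the int8/int4 columns preview the savings from the quantization section later in the deck:

# Rough VRAM needed just to hold the model weights: #params x bytes-per-parameter.
BYTES_PER_PARAM = {"fp16/bf16": 2, "int8": 1, "int4": 0.5}
MODELS = {"GPT-3 (175B)": 175e9, "Llama-2-70B": 70e9, "Falcon-40B": 40e9}

for name, n_params in MODELS.items():
    row = ", ".join(f"{dtype}: ~{n_params * nbytes / 1e9:.0f} GB"
                    for dtype, nbytes in BYTES_PER_PARAM.items())
    print(f"{name}: {row}")
# e.g. GPT-3 (175B): fp16/bf16: ~350 GB, int8: ~175 GB, int4: ~88 GB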

1. Lower Precision

Lower Precision via Quantization

Operating at reduced numerical precision, namely 8-bit and 4-bit, can achieve computational advantages without a considerable decline in model performance.

The key is to reduce precision without compromising model expressivity / accuracy

[Figure: weight-only quantization inside one Transformer layer]

  1. Load the model weights (16-bit)
  2. Quantize the model weights (down to 8-bit)
  3. Dequantize the weights back to 16-bit on-the-fly and compute
  4. Quantize the weights again (back to 8-bit)

We dynamically de-quantize weights on-the-fly to perform matrix multiplications in 16-bit, and then re-quantize.

Inference time is not reduced (often increases), but memory overhead is.

Memory savings:

  • 8-bit = 2x
  • 4-bit = 4x

Popular Quantization Schemes

Two quantization integrations are natively supported in transformers: bitsandbytes and auto-gptq.

bitsandbytes

  • Easy - does not require calibrating the quantized model with input data (zero-shot quantization)
  • Cross-modality - any model can be quantized as long as it contains torch.nn.Linear layers
  • No degradation when merging adapters - PEFT can be used natively for fine-tuning on top of a quantized base model

auto-gptq

  • Fast for text generation - faster than bitsandbytes for generating text
  • N-bit support - possible to quantize down to 2 bits (though this degrades quality)
  • Easily serializable - bitsandbytes only supports saving serialized models in 8-bit
  • Drawbacks: works only for language models (today) and requires calibration on a dataset

Today we also have: AWQ, EETQ, HQQ, AQLM, Quanto, gguf
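
A minimal sketch of 4-bit loading with bitsandbytes through transformers; the model id is a placeholder, and the call assumes the bitsandbytes and accelerate packages plus a CUDA GPU are available:

# Weight-only 4-bit quantization: weights are stored in 4-bit and de-quantized
# to 16-bit on the fly for each matmul (see the mechanism on the previous slide).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"        # placeholder; any causal LM on the Hub

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16 after de-quantization
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)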

2. Flash Attention

Regular Self-Attention

The traditional self-attention mechanism scales quadratically in memory and compute with respect to sequence length.

The Q and K matrices each consist of N vectors, so QKᵀ is an N x N matrix.

Assuming the LLM has 40 attention heads and runs in bf16, the memory required to store QKᵀ is 40 x 2 x N² bytes.

So for:

  • N = 1,000 → ~80 MB VRAM
  • N = 16k → ~19 GB VRAM
  • N = 100k → ~1 TB VRAM

Output O of a single self-attention layer for input X of length N:

  O = softmax(QKᵀ / √d) · V, where Q = X·W_Q, K = X·W_K, V = X·W_V

Self-attention is prohibitively memory-expensive for long input contexts.

How can we reduce this memory requirement?


Flash Attention

A variation of the attention algorithm that is both more memory-efficient and faster, thanks to optimized use of the GPU memory hierarchy.

LLM inference is memory-IO bound - i.e. it takes longer to move 1 MB of data to the GPU compute cores than it does to perform the actual computation on it.

Standard attention transfers intermediate values back and forth between HBM and SRAM multiple times during the computation.

Flash attention loads the data just once, using kernel fusion and tiling to partition the inputs for parallel processing.

With flash attention, VRAM scales linearly with sequence length, and computation is up to ~3x faster.

(HBM = high-bandwidth memory, the off-chip GPU VRAM; SRAM = static random access memory, on-chip.)
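
In recent versions of transformers, Flash Attention 2 can typically be requested at load time; this sketch assumes the flash-attn package and a supported GPU, and the model id is a placeholder:

# Flash Attention 2: same attention outputs, but computed with tiling and kernel
# fusion so the full N x N attention matrix is never materialized in GPU HBM.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # placeholder model id
    torch_dtype=torch.bfloat16,               # flash-attn kernels need fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)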

3. KV-cache

KV-Cache

Autoregressive generation works by iteratively generating tokens and adding them to the input.

KV-cache saves compute resources by reusing previously calculated self-attention key-value pairs, instead of recalculating them for each generated token.

Without kv-cache

Generation steps for input of length = 20 tokens:

shape of input_ids torch.Size([1, 21])

shape of input_ids torch.Size([1, 22])

shape of input_ids torch.Size([1, 23])

shape of input_ids torch.Size([1, 24])

shape of input_ids torch.Size([1, 25])

[' Here is a Python function']

With kv-cache

Generation steps for input of length = 20 tokens:

shape of input_ids torch.Size([1, 1])

length of key-value cache 20

shape of input_ids torch.Size([1, 1])

length of key-value cache 21

shape of input_ids torch.Size([1, 1])

length of key-value cache 22

shape of input_ids torch.Size([1, 1])

length of key-value cache 23

shape of input_ids torch.Size([1, 1])

length of key-value cache 24

[' Here', ' is', ' a', ' Python', ' function']

Using the key-value cache has two advantages:

  • A significant increase in computational efficiency, since far fewer computations are performed than when recomputing the full QKᵀ matrix at every step. This leads to faster inference.

  • The maximum required memory does not grow quadratically with the number of generated tokens, but only linearly.

Extremely useful for chat and RAG applications as we don’t need to re-encode the entire history of tokens on each forward pass
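
A sketch of a manual decode loop that reuses the key-value cache, similar to the loop behind the logs above (gpt2 is a stand-in; the exact type of the returned cache object varies across transformers versions):

# Each step feeds only the newest token; the K/V of all previous tokens come
# from the cache instead of being recomputed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids
past_key_values = None
generated = []

with torch.no_grad():
    for _ in range(5):
        out = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values                            # cache grows by one position per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        generated.append(next_token.item())
        input_ids = next_token                                           # only the new token is fed next time

print(tokenizer.decode(generated))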

4. Dynamic Batching

Naive/Static Batching

Batch size remains constant until inference is complete for each sequence in the batch

  • LLM inference is memory-IO bound - i.e. it takes longer to load 1 MB of data onto the GPU compute cores than it does to perform the actual computation
  • In text generation, sequences (both input and output) have variable lengths
  • This means GPUs are underutilized when some sequences in a batch finish generating before others

Dynamic/Continuous Batching

Also called “iteration-level scheduling” where sequences in a batch are swapped in and out per iteration to make best use of GPU memory

  • Adjust batches per iteration instead of waiting until every sequence in a batch has completed generation
  • Once a sequence in a batch has completed, we need logic to determine if it makes sense to insert a new sequence in its place
  • Achieves higher GPU utilization, since the GPU does not sit idle waiting for every sequence in a batch to complete before new ones are started (see the toy sketch below)
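
A toy, model-free sketch of iteration-level scheduling; model_forward and the request format are made up for illustration and are not part of any real serving engine:

# Toy continuous-batching loop: after every decode step, finished sequences are
# evicted and waiting requests are admitted, so batch slots never sit idle.
import random
from collections import deque

MAX_BATCH_SIZE = 8

def model_forward(seq):
    """Dummy stand-in for one decode step of a real LLM on one sequence."""
    return "<eos>" if random.random() < 0.2 else "tok"

def finished(seq):
    """Stopping check: EOS emitted or max_new_tokens reached."""
    return (seq["tokens"] and seq["tokens"][-1] == "<eos>") or \
           len(seq["tokens"]) >= seq["max_new_tokens"]

def serve(requests):
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        # Admit waiting requests into free batch slots (per-iteration scheduling).
        while waiting and len(active) < MAX_BATCH_SIZE:
            active.append(waiting.popleft())
        # One decode iteration for every active sequence.
        for seq in active:
            seq["tokens"].append(model_forward(seq))
        # Evict completed sequences immediately instead of waiting for the whole batch.
        done.extend(seq for seq in active if finished(seq))
        active = [seq for seq in active if not finished(seq)]
    return done

requests = [{"prompt": f"req-{i}", "tokens": [], "max_new_tokens": 16} for i in range(20)]
print(len(serve(requests)), "requests completed")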

5. Tensor Parallelism

Tensor Parallelism

What happens when you have Llama-3-70B, which needs 140 GB of VRAM, but the largest GPU card only has 80 GB?

  • A technique used to fit a large model across multiple GPUs: each GPU processes only a slice of each tensor, and the full tensor is aggregated only for operations that require the whole thing (sketched below)
  • Allows large models that don’t fit in a single GPU’s RAM to run efficiently across multiple GPUs
  • More available GPU RAM also means the ability to handle larger batches, i.e. more throughput
  • Requires rewriting/modifying the model architecture
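
A single-process PyTorch sketch of the core idea: shard a weight matrix column-wise, compute partial outputs per shard, then gather. In a real system each shard lives on its own GPU and the gather is a collective communication op:

# Core idea of tensor parallelism: each "device" holds one column shard of the
# weight matrix, computes its slice of the output, and the slices are gathered.
import torch

torch.manual_seed(0)
d_model, d_ff, n_shards = 512, 2048, 2

x = torch.randn(4, d_model)        # a batch of 4 token embeddings
W = torch.randn(d_model, d_ff)     # full weight matrix of one linear layer

# Split W column-wise; in real tensor parallelism each shard lives on its own GPU.
shards = torch.chunk(W, n_shards, dim=1)

# Each "device" computes only its slice of the output...
partials = [x @ W_i for W_i in shards]

# ...and the slices are gathered (all-gather / concat) into the full output.
y_parallel = torch.cat(partials, dim=1)

assert torch.allclose(y_parallel, x @ W, atol=1e-4)   # same result as the unsharded layer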

Text Generation Inference (TGI)

Serving framework is important for performance

TGI: A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power LLMs

  • Continuous batching of incoming requests for increased total throughput
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Optimized transformers code for inference using flash-attention and Paged Attention
  • Quantization with bitsandbytes and GPTQ
  • Guidance / constrained decoding / function calling / json-mode

Others include: vLLM, Triton, Seldon, etc.
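
Once a TGI server is running (for example via its Docker image), it exposes an HTTP endpoint; a minimal sketch of querying it with the huggingface_hub client, where the URL and parameters are illustrative:

# Query a running TGI server; token streaming uses Server-Sent Events under the hood.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")   # assumed local TGI endpoint

# Non-streaming request
print(client.text_generation("Recite the first law of robotics:", max_new_tokens=50))

# Streaming request: tokens arrive one at a time as they are generated
for token in client.text_generation(
    "Recite the first law of robotics:", max_new_tokens=50, stream=True
):
    print(token, end="", flush=True)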

Thank you!