1 of 32

Speeding up LLM responses

Taivo Pungas

2 of 32

Will your user wait 30 seconds for an LLM call?

  1. No.
  2. But you can reduce latency (sometimes drastically).
  3. How?

3 of 32

About me

  • Built an ML team from 0 → 25 people at a unicorn
  • CS MSc, ETH Zürich
  • Computational neuroscience in 2013
  • …startups

4 of 32

“Theory”

5 of 32

T = const. + k N

Response time is linear in output token count.

6 of 32

T = const. + k N

T: response time

const.: DNS, proxies, queueing, input token processing, …

N: output token count

7 of 32

T = const. + k N

T: response time

k: how long it takes to generate one token

N: output token count

8 of 32

T = const. + k N

T: response time; N: output token count

k depends on:

  • model: size, quantization, optimizations, …
  • execution: hardware, interconnect, CUDA drivers, …

9 of 32

T = const. + k N

For 500 tokens from OpenAI gpt-4: const. ≈ 1 s and k·N ≈ 47 s, i.e. k ≈ 94 ms per token.

10 of 32

Empirically:

[Scatter plots: response time vs. # output tokens]

15 of 32

Implications

16 of 32

Output fewer tokens.

17 of 32

💡 Output 0 tokens

Don’t use an LLM.

  • Code
  • Regex
  • Heuristics
  • Classical NLP (via API?)
  • Fallback to LLM? (see the sketch below)

Run fewer agent loop iterations and tool executions.
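A minimal sketch of the fallback pattern, with a hypothetical llm_call() helper: the cheap path (a regex) handles the common case with zero output tokens, and only the hard cases pay LLM latency.

```python
import re

def llm_call(prompt: str) -> str:
    raise NotImplementedError("hypothetical helper: wire up your LLM client")

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_email(text: str) -> str:
    # Cheap path: regex answers in microseconds, zero output tokens.
    match = EMAIL_RE.search(text)
    if match:
        return match.group(0)
    # Hard cases only: fall back to the LLM.
    return llm_call(f"Extract the email address from: {text!r}")
```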

18 of 32

💡 Output 1 token

One output token can select among up to ~100,000 classes (the cl100k_base vocabulary).
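A sketch of single-token classification, assuming the OpenAI Python SDK and tiktoken; the model name and labels are illustrative. max_tokens=1 caps generation at one token, and logit_bias restricts the answer to the allowed label tokens.

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

labels = {"0": "positive", "1": "negative", "2": "neutral"}
# Single digits are single tokens in cl100k_base.
allowed = {str(enc.encode(k)[0]): 100 for k in labels}

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content":
               "Sentiment of 'great talk!'? "
               "Answer 0=positive, 1=negative, 2=neutral."}],
    max_tokens=1,        # exactly one generated token
    logit_bias=allowed,  # keep the answer inside the label set
)
print(labels[resp.choices[0].message.content])
```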

19 of 32

💡 Prompting: “be brief” / “in 5-10 words”

~1/3 improvement in response time

20 of 32

💡 Prompting: trim your JSON

  • long keys
  • whitespace
  • unneeded keys
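For example, with Python's json module (the payload is made up). The trimming pays off twice if the model must echo JSON back: short keys save output tokens too.

```python
import json

# Verbose: long keys, pretty-printing, keys the model never uses.
verbose = {
    "customer_full_name": "Ada Lovelace",
    "customer_account_identifier": 42,
    "internal_audit_trail": [],  # unneeded for this prompt
}

# Trimmed: short keys, only what the task needs, no whitespace at all.
trimmed = {"name": "Ada Lovelace", "id": 42}
payload = json.dumps(trimmed, separators=(",", ":"))
```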

21 of 32

💡 Prompting: can you avoid chain-of-thought?

  • Prompt instead of CoT

=> Replaces output tokens (expensive) with input tokens (cheap)
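A sketch of the swap; both prompts are illustrative. Instead of asking the model to reason out loud, encode the decision rules in the prompt and ask for just the answer.

```python
# Chain-of-thought: the model generates a long reasoning trace as output.
cot_prompt = "Is this ticket urgent? Think step by step, then answer."

# Rules-in-prompt: the reasoning rides along as input tokens, which are
# processed far faster than output tokens are generated.
direct_prompt = (
    "A ticket is urgent if it mentions an outage, data loss, or a "
    "security issue. Is this ticket urgent? Answer only 'yes' or 'no'."
)
```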

22 of 32

💡 Re-factor into multiple calls (“guidance acceleration”)

Implied trade-off: fewer output tokens vs. more roundtrips.

Makes sense for local models, where roundtrips are cheap.
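A sketch of the idea, with a hypothetical llm_call(prompt, max_tokens) helper: the fixed template text is emitted by our code for free, and the model only generates the short blanks.

```python
def llm_call(prompt: str, max_tokens: int) -> str:
    raise NotImplementedError  # hypothetical single-completion helper

def profile_card(bio: str) -> str:
    # Two short generations instead of one long free-form answer.
    name = llm_call(f"Bio: {bio}\nThe person's name is:", max_tokens=5)
    role = llm_call(f"Bio: {bio}\nThe person's job title is:", max_tokens=5)
    # Template text costs zero generated tokens.
    return f"Name: {name}\nRole: {role}"
```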

23 of 32

💡 Stream-and-stop / stop sequence
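A sketch assuming the OpenAI Python SDK: a server-side stop sequence, and a client-side stream we abandon once we have enough.

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "List some project risks."}]

# Stop sequence: the server stops generating at the first blank line.
resp = client.chat.completions.create(
    model="gpt-4", messages=messages, stop=["\n\n"],
)

# Stream-and-stop: read tokens as they arrive and bail out early.
stream = client.chat.completions.create(
    model="gpt-4", messages=messages, stream=True,
)
text = ""
for chunk in stream:
    text += chunk.choices[0].delta.content or ""
    if text.count("\n") >= 3:  # three lines is all we need
        break
stream.close()  # drop the connection instead of waiting for the rest
```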

24 of 32

💡 Switch providers or models

Compare k (ms per token) across providers and models.
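One way to put numbers on it: a crude benchmark sketch, assuming the OpenAI SDK (subtract the constant term for a fairer estimate of k).

```python
import time
from openai import OpenAI

client = OpenAI()

def ms_per_token(model: str, n_tokens: int = 200) -> float:
    # Time one call and divide by the tokens actually generated.
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Count upward from 1."}],
        max_tokens=n_tokens,
    )
    elapsed = time.perf_counter() - t0
    return 1000 * elapsed / resp.usage.completion_tokens

for m in ("gpt-4", "gpt-3.5-turbo"):
    print(m, round(ms_per_token(m), 1), "ms/token")
```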

25 of 32

💡 Parallelize

What’s the critical path?

Document
 ├─ Paragraph → § summary ─┐
 ├─ Paragraph → § summary ─┼→ Doc summary
 └─ Paragraph → § summary ─┘
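A sketch with asyncio and a stubbed summarize(): the paragraph summaries run concurrently, so the critical path is one paragraph summary plus the final combine.

```python
import asyncio

async def summarize(text: str) -> str:
    # Stand-in for one async LLM call.
    await asyncio.sleep(1.0)
    return text[:40]

async def summarize_doc(paragraphs: list[str]) -> str:
    # All § summaries in flight at once, not one after another.
    sections = await asyncio.gather(*(summarize(p) for p in paragraphs))
    # Final call over the short summaries.
    return await summarize(" ".join(sections))

# Three paragraphs: ~2 s total instead of ~4 s sequential.
print(asyncio.run(summarize_doc(["p1...", "p2...", "p3..."])))
```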

26 of 32

💡 Fake it: streaming
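Total generation time is unchanged, but perceived latency drops to the time-to-first-token. A minimal sketch with the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)
for chunk in stream:
    # Render each token the moment it arrives.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```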

27 of 32

💡 Fake it: smaller steps

“Write an email to Heiki to cancel our dinner”

> Hey Heiki,

> I unfortunately cannot make it on Sunday.

> I’d still love to catch up though.

(each sentence appears as soon as its step finishes)
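A sketch with hypothetical llm_call() and render() helpers: each short call produces one visible sentence, so the user sees progress after the first step instead of after the whole email.

```python
def llm_call(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM helper

def render(line: str) -> None:
    print(line)  # stand-in for updating the UI

steps = [
    "Write only the greeting of an email to Heiki.",
    "Write one sentence cancelling Sunday's dinner.",
    "Write one sentence suggesting to reschedule.",
]
email: list[str] = []
for step in steps:
    context = "\n".join(email)
    line = llm_call(f"Email so far:\n{context}\n\n{step}")
    email.append(line)
    render(line)  # visible immediately, before later steps finish
```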

30 of 32

Output fewer tokens.

31 of 32

“FastAPI for copilots”

opencopilotdev · taivo@opencopilot.dev
