Speeding up LLM responses
Taivo Pungas
Will your user wait 30 seconds for an LLM call?
About me
“Theory”
Response time is linear in output token count:

T = const. + k N

T — response time
N — output token count
const. — DNS, proxies, queueing, input token processing, …
k — how long it takes to generate 1 token; depends on
    model: size, quantization, optimizations, …
    execution: hardware, interconnect, CUDA drivers, …
Empirically: 500 tokens from OpenAI gpt-4
T = const. + k N
const. ≈ 1 s, k N ≈ 47 s
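A back-of-the-envelope estimate of k from those numbers (assuming the ≈ 1 s is fixed overhead and the ≈ 47 s is the time spent generating the 500 output tokens):

```python
# From the slide: ~1 s of fixed overhead, ~47 s spent generating 500 output tokens.
gen_seconds = 47.0
n_tokens = 500
k = gen_seconds / n_tokens
print(f"k ≈ {k * 1000:.0f} ms per output token")   # ≈ 94 ms/token
```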
[Chart: response time vs. # output tokens]
Implications
Output fewer tokens.
💡 Output 0 tokens
Don’t use an LLM.
Run fewer agent loop iterations and tool executions.
💡 Output 1 token
(up to 100,000-class classification with cl100k_base)
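A minimal sketch of the single-output-token trick (assumes the openai and tiktoken packages; the labels and prompt are illustrative): map each class to one cl100k_base token, cap the response at one token, and bias generation towards the label tokens.

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

# Illustrative label set; each label must encode to a single cl100k_base token.
labels = ["A", "B", "C"]   # e.g. A = positive, B = negative, C = neutral
label_token_ids = [enc.encode(label)[0] for label in labels]

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content":
        "Classify the sentiment of 'Great talk!'. "
        "Answer with exactly one letter: A (positive), B (negative) or C (neutral)."}],
    max_tokens=1,                                      # pay for exactly one output token
    logit_bias={tid: 100 for tid in label_token_ids},  # restrict output to the label tokens
)
print(resp.choices[0].message.content)                 # "A", "B" or "C"
```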
💡 Prompting: “be brief” / “in 5-10 words”
~1/3 improvement
💡 Prompting: trim your JSON
long keys
whitespace
unneeded keys
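For example, the same JSON with short keys, no unused fields, and no whitespace (a sketch; field names are illustrative):

```python
import json

# Verbose shape: long keys, pretty-printing, keys the consumer never reads.
verbose = {
    "customer_full_name": "Heiki Tamm",
    "customer_sentiment_classification": "positive",
    "internal_debug_trace_identifier": "abc-123",
}
# Trimmed shape: short keys, only what's needed.
compact = {"name": "Heiki Tamm", "sentiment": "positive"}

print(len(json.dumps(verbose, indent=2)))               # far more characters to generate
print(len(json.dumps(compact, separators=(",", ":"))))  # => fewer output tokens
```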
💡 Prompting: can you avoid chain-of-thought?
=> Replaces output tokens (expensive) with input tokens (cheap)
💡 Re-factor into multiple calls (“guidance acceleration”)
Implied trade-off: tokens or roundtrip?
Makes sense for local models
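A rough sketch of the idea using plain OpenAI calls (illustrative prompts; not the guidance library's actual API): the fixed template stays as input text, and the model only ever generates the short values.

```python
from openai import OpenAI

client = OpenAI()

def fill(prompt: str) -> str:
    # One short call per template slot; stop at the first newline.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=20,
        stop=["\n"],
    )
    return resp.choices[0].message.content.strip()

article = "..."  # source text
summary = {
    "title": fill(f"Article:\n{article}\n\nA short title for this article:"),
    "sentiment": fill(f"Article:\n{article}\n\nOne-word sentiment of this article:"),
}
```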
💡 Stream-and-stop / stop sequence
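A sketch of the stop-sequence variant (prompt and marker are illustrative): have the model emit a known marker when it is done and pass it as `stop`, so generation ends right there; with `stream=True` you can do the same client-side by abandoning the stream once you have what you need.

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Answer in one short paragraph, then write ### on its own line."}],
    stop=["###"],   # generation ends at the marker -> no extra output tokens to wait for
)
print(resp.choices[0].message.content)
```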
💡 Switch providers or models
Compare ms per output token across providers and models.
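One way to get comparable numbers yourself (a sketch; model name and prompt are placeholders): time one call and divide by the reported completion token count.

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-4",   # swap in each candidate model/provider
    messages=[{"role": "user", "content": "Write about 200 words on any topic."}],
)
elapsed = time.perf_counter() - start
n_out = resp.usage.completion_tokens
print(f"{elapsed:.1f} s total, {1000 * elapsed / n_out:.0f} ms per output token")
```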
💡 Parallelize
What’s the critical path?
Document → [Paragraph → § summary] ×3, in parallel → Doc summary
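A sketch of the parallel version with asyncio (model and prompts are illustrative): the per-section summaries don't depend on each other, so only the slowest section plus the final combine call sit on the critical path.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def summarize(text: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Summarize in one sentence:\n\n{text}"}],
    )
    return resp.choices[0].message.content

async def summarize_doc(paragraphs: list[str]) -> str:
    # Fan out: section summaries run concurrently instead of one after another.
    sections = await asyncio.gather(*(summarize(p) for p in paragraphs))
    # Fan in: one final call combines them into the doc summary.
    return await summarize("\n".join(sections))

# doc_summary = asyncio.run(summarize_doc(["paragraph 1 ...", "paragraph 2 ...", "paragraph 3 ..."]))
```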
💡 Fake it: streaming
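A minimal streaming sketch: total generation time is unchanged, but the user starts reading after the first token instead of after the last one.

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write an email to Heiki to cancel our dinner."}],
    stream=True,
)
for chunk in stream:
    # Show each piece as soon as it arrives instead of waiting for the full reply.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```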
💡 Fake it: smaller steps
“Write an email to Heiki to cancel our dinner”
> Hey Heiki,
> I unfortunately cannot make it on Sunday.
> I’d still love to catch up though.
>
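A sketch of the smaller-steps approach (prompts and step names are illustrative): several short calls, each appending one piece, so the user sees the email grow instead of staring at a spinner.

```python
from openai import OpenAI

client = OpenAI()
task = "Write an email to Heiki to cancel our dinner."
email = ""
for step in ["greeting line", "sentence cancelling the dinner", "friendly closing line"]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"{task}\nEmail so far:\n{email}\nWrite only the {step}."}],
        max_tokens=40,
    )
    email += resp.choices[0].message.content.strip() + "\n"
    print(email)   # update the UI after each small step
```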
Output fewer tokens.
“FastAPI for copilots”
opencopilotdev taivo@opencopilot.dev