1 of 7

ALL-Sort

Improved Retrieval-Augmented Generation

Trelis Research

2 of 7

ALL-Sort

(Assisted Large Language Sorting)

[Diagram: text chunks flow from the database to the helper LLM (Smaug 34B), which rates the relevance of each chunk and returns sorted chunks.]

Sorted Chunks

  Chunk         Relevance
  Lorem ipsum   5
  Eval          4
  Total         3
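The rate-and-sort flow on this slide can be sketched as below. This is a minimal illustration, not the deck's actual code: `rate_relevance` is a hypothetical stand-in for a real call to the helper model (e.g. Smaug 34B behind an inference endpoint), replaced here by a toy word-overlap heuristic so the sketch runs.

```python
def rate_relevance(chunk: str, query: str) -> int:
    """Stub: in ALL-Sort this would be a helper-LLM call returning 1-5.

    Toy heuristic (assumption, for illustration only): count words the
    chunk shares with the query, clamped to the 1-5 rating scale.
    """
    overlap = len(set(chunk.lower().split()) & set(query.lower().split()))
    return max(1, min(5, overlap))


def all_sort(chunks: list[str], query: str) -> list[tuple[str, int]]:
    """Rate every chunk against the query, then sort highest-relevance first."""
    rated = [(chunk, rate_relevance(chunk, query)) for chunk in chunks]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


chunks = [
    "Lorem ipsum dolor sit amet",
    "GPU pricing and hosting costs",
    "GPU pricing for hosted inference endpoints",
]
ranked = all_sort(chunks, "GPU pricing for inference")
```

Swapping the stub for a real LLM call leaves the sorting logic unchanged.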

3 of 7

ALL-Sort

Prompt = { context = high-relevance chunks } + { Query }

Sorted Chunks

  Chunk         Relevance
  Lorem ipsum   5
  Eval          4
  Total         3
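The prompt formula above can be sketched as follows. The relevance threshold of 4 and the exact prompt wording are assumptions for illustration; the deck only specifies that high-relevance chunks plus the query form the prompt.

```python
def build_prompt(sorted_chunks: list[tuple[str, int]], query: str,
                 min_relevance: int = 4) -> str:
    """Assemble Prompt = {context = high-relevance chunks} + {Query}.

    `min_relevance` is an assumed cutoff; chunks rated below it are dropped.
    """
    context = "\n\n".join(
        chunk for chunk, rating in sorted_chunks if rating >= min_relevance
    )
    return f"Context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt(
    [("Lorem ipsum", 5), ("Eval notes", 4), ("Budget totals", 3)],
    "What does the document say?",
)
```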

4 of 7

Overview

  1. Preparing the test questions.
  2. Full-context prompt setup.
  3. "Standard" RAG setup.
  4. ALL-Sort setup:
    1. Enforcing a regex for responses.
    2. Classifying chunks.
    3. Sorting chunks by relevance.
    4. Deploying an API endpoint (Smaug 34B).
  5. Live demo running ALL-Sort.
  6. Costing + latency.
  7. Helper-model ablations: OpenChat 3.5 7B, Mixtral, Yi.

5 of 7

Costing

Assumptions:

  • 100k tokens of data.
  • GPT-4-Turbo costs $1 per 100k input tokens.
  • ALL-Sort, self-hosted GPU (e.g. Runpod):
    • An A100 ADA to run Yi costs $2.25/hr.
    • Takes ~75 seconds for prompt prep (unoptimised).
    • => ~$0.05 for prompt prep.
  • ALL-Sort, hosted GPU:
    • 100k tokens of input × $0.5/M-tokens = $0.05.
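The prompt-prep arithmetic from the assumptions above works out as follows (all figures are the slide's own; the 75-second prep time is explicitly unoptimised):

```python
# Self-hosted: GPU billed per hour, prep takes ~75 seconds.
gpu_rate_per_hr = 2.25
prep_seconds = 75
self_hosted_prep = gpu_rate_per_hr * prep_seconds / 3600  # ≈ $0.047, i.e. ~$0.05

# Hosted: priced per input token.
tokens = 100_000
price_per_million_tokens = 0.5
hosted_prep = tokens / 1_000_000 * price_per_million_tokens  # $0.05
```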

6 of 7

Costing

           Full context   RAG     ALL-Sort (self)   ALL-Sort (hosted)
  Prep     n/a            $0      $0.05             $0.05
  Eval     $1.00          $0.02   $0.02             $0.02
  Total    $1.00          $0.02   $0.07             $0.07
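The per-column totals can be sanity-checked as Prep + Eval = Total (figures taken directly from the slide; "n/a" prep is treated as $0):

```python
# Cost table from the slide: prep and eval cost per setup, in dollars.
table = {
    "Full context":      {"prep": 0.00, "eval": 1.00, "total": 1.00},
    "RAG":               {"prep": 0.00, "eval": 0.02, "total": 0.02},
    "ALL-Sort (self)":   {"prep": 0.05, "eval": 0.02, "total": 0.07},
    "ALL-Sort (hosted)": {"prep": 0.05, "eval": 0.02, "total": 0.07},
}
# Every column's total should equal its prep cost plus its eval cost.
totals_consistent = all(
    round(col["prep"] + col["eval"], 2) == col["total"]
    for col in table.values()
)
```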

7 of 7

Latency (very unoptimised)

           Full context   RAG    ALL-Sort (self)   ALL-Sort (hosted)
  Eval     25 s           40 s   75 s              ? 30-40 s ?