1 of 7

ALL-Sort

Improved Retrieval-Augmented Generation

Trelis Research

2 of 7

ALL-Sort

(Assisted Large Language Sorting)

[Diagram: text chunks flow from the database to the helper LLM (Smaug 34B), which rates the relevance of each chunk and returns sorted chunks.]

Sorted Chunks

  Chunk         Relevance
  Lorem ipsum   5
  Eval          4
  Total         3
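The rate-and-sort flow on this slide can be sketched as below. This is a minimal illustration, not the deck's actual code: `rate_relevance` is a hypothetical stand-in for a real call to the helper model (e.g. Smaug 34B behind an inference endpoint), replaced here by a toy word-overlap heuristic so the sketch runs.

```python
def rate_relevance(chunk: str, query: str) -> int:
    """Stub: in ALL-Sort this would be a helper-LLM call returning 1-5.

    Toy heuristic (assumption, for illustration only): count words the
    chunk shares with the query, clamped to the 1-5 rating scale.
    """
    overlap = len(set(chunk.lower().split()) & set(query.lower().split()))
    return max(1, min(5, overlap))


def all_sort(chunks: list[str], query: str) -> list[tuple[str, int]]:
    """Rate every chunk against the query, then sort highest-relevance first."""
    rated = [(chunk, rate_relevance(chunk, query)) for chunk in chunks]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


chunks = [
    "Lorem ipsum dolor sit amet",
    "GPU pricing and hosting costs",
    "GPU pricing for hosted inference endpoints",
]
ranked = all_sort(chunks, "GPU pricing for inference")
```

Swapping the stub for a real LLM call leaves the sorting logic unchanged.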

3 of 7

ALL-Sort

Prompt = { context = high-relevance chunks } + { Query }

Sorted Chunks

  Chunk         Relevance
  Lorem ipsum   5
  Eval          4
  Total         3
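The prompt formula above can be sketched as follows. The relevance threshold of 4 and the exact prompt wording are assumptions for illustration; the deck only specifies that high-relevance chunks plus the query form the prompt.

```python
def build_prompt(sorted_chunks: list[tuple[str, int]], query: str,
                 min_relevance: int = 4) -> str:
    """Assemble Prompt = {context = high-relevance chunks} + {Query}.

    `min_relevance` is an assumed cutoff; chunks rated below it are dropped.
    """
    context = "\n\n".join(
        chunk for chunk, rating in sorted_chunks if rating >= min_relevance
    )
    return f"Context:\n{context}\n\nQuestion: {query}"


prompt = build_prompt(
    [("Lorem ipsum", 5), ("Eval notes", 4), ("Budget totals", 3)],
    "What does the document say?",
)
```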

4 of 7

Overview

  1. Preparing the test questions.
  2. Full-context prompt setup.
  3. "Standard" RAG setup.
  4. ALL-Sort setup:
    1. Enforcing a regex for responses.
    2. Classifying chunks.
    3. Sorting chunks by relevance.
    4. Deploying an API endpoint (Smaug 34B).
  5. Live demo running ALL-Sort.
  6. Costing + latency.
  7. Helper-model ablations: OpenChat 3.5 7B, Mixtral, Yi.

5 of 7

Costing

Assumptions:

  • 100k tokens of data.
  • GPT-4-Turbo costs $1 per 100k input tokens.
  • ALL-Sort, self-hosted GPU (e.g. Runpod):
    • An A100 ADA to run Yi costs $2.25/hr.
    • Takes ~75 seconds for prompt prep (unoptimised).
    • => ~$0.05 for prompt prep.
  • ALL-Sort, hosted GPU:
    • 100k tokens of input × $0.5/M-tokens = $0.05.
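The prompt-prep arithmetic from the assumptions above works out as follows (all figures are the slide's own; the 75-second prep time is explicitly unoptimised):

```python
# Self-hosted: GPU billed per hour, prep takes ~75 seconds.
gpu_rate_per_hr = 2.25
prep_seconds = 75
self_hosted_prep = gpu_rate_per_hr * prep_seconds / 3600  # ≈ $0.047, i.e. ~$0.05

# Hosted: priced per input token.
tokens = 100_000
price_per_million_tokens = 0.5
hosted_prep = tokens / 1_000_000 * price_per_million_tokens  # $0.05
```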

6 of 7

Costing

           Full context   RAG     ALL-Sort (self)   ALL-Sort (hosted)
  Prep     n/a            $0      $0.05             $0.05
  Eval     $1.00          $0.02   $0.02             $0.02
  Total    $1.00          $0.02   $0.07             $0.07
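The per-column totals can be sanity-checked as Prep + Eval = Total (figures taken directly from the slide; "n/a" prep is treated as $0):

```python
# Cost table from the slide: prep and eval cost per setup, in dollars.
table = {
    "Full context":      {"prep": 0.00, "eval": 1.00, "total": 1.00},
    "RAG":               {"prep": 0.00, "eval": 0.02, "total": 0.02},
    "ALL-Sort (self)":   {"prep": 0.05, "eval": 0.02, "total": 0.07},
    "ALL-Sort (hosted)": {"prep": 0.05, "eval": 0.02, "total": 0.07},
}
# Every column's total should equal its prep cost plus its eval cost.
totals_consistent = all(
    round(col["prep"] + col["eval"], 2) == col["total"]
    for col in table.values()
)
```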

7 of 7

Latency (very unoptimised)

           Full context   RAG    ALL-Sort (self)   ALL-Sort (hosted)
  Eval     25 s           40 s   75 s              ? 30-40 s ?