LLM Context Length Testing
[Figure: Pressure Testing GPT-4 128K via "Needle In A HayStack". Asking GPT-4 to do fact retrieval across context lengths and document depth. Heatmap of Accuracy of Retrieval (0% to 100%) by Context Length (1K to 128K tokens) and Placed Fact Document Depth (Top of Document to Bottom of Document).]
Goal: Test GPT-4's ability to retrieve information from large context windows.

A fact was placed within a document, and GPT-4 (1106-preview) was then asked to retrieve it. The output was evaluated for accuracy. The test was run at 15 different document depths (top to bottom) and 15 different context lengths (1K to 128K tokens), and tests at larger context lengths were run twice for a larger sample size.

GPT-4's recall ability started to degrade at large context lengths when the placed fact sat between 10% and 50% document depth.
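The procedure is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the exact harness behind these charts: the needle text, the question, the filler source, and the 4-characters-per-token shortcut are all stand-ins, and the OpenAI v1 SDK is assumed.

```python
# Minimal needle-in-a-haystack sketch (illustrative stand-ins throughout).
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_haystack(filler: str, context_tokens: int, depth_pct: float) -> str:
    """Insert the needle into filler text at depth_pct (0 = top, 100 = bottom)."""
    # Approximate tokens as 4 characters each for brevity; a real harness
    # would measure length with the model's tokenizer.
    body = filler[: context_tokens * 4]
    cut = int(len(body) * depth_pct / 100)
    return body[:cut] + " " + NEEDLE + " " + body[cut:]

def run_trial(filler: str, context_tokens: int, depth_pct: float) -> str:
    haystack = build_haystack(filler, context_tokens, depth_pct)
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": haystack + "\n\n" + QUESTION},
        ],
    )
    # The reply is then scored against NEEDLE to produce the accuracy value
    # for this (context length, depth) cell of the heatmap.
    return resp.choices[0].message.content
```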
[Figure: Pressure Testing Claude 2.1 200K via "Needle In A HayStack". Asking Claude 2.1 to do fact retrieval across context lengths and document depth. Heatmap of Accuracy of Retrieval (0% to 100%) by Context Length (1K to 200K tokens) and Placed Fact Document Depth (Top of Document to Bottom of Document).]
Goal: Test Claude 2.1's ability to retrieve information from large context windows.

A fact was placed within a document, and Claude 2.1 (200K) was then asked to retrieve it. The output was evaluated for accuracy with GPT-4. The test was run at 35 different document depths (top to bottom) and 35 different context lengths (1K to 200K tokens). Document depths followed a sigmoid distribution, clustering samples near the top and bottom of the document; a sketch of that spacing follows.
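Assuming the depths come from a logistic curve over evenly spaced inputs (the exact spacing is not given, so the `spread` value below is an assumption tuned by eye to the chart's roughly 0.9% and 99.1% endpoint ticks), the distribution can be reproduced as follows:

```python
import math

def sigmoid_depths(n: int = 35, spread: float = 9.6) -> list[float]:
    """Return n document depths (in percent) clustered toward 0% and 100%.

    Evenly spaced points on [-spread/2, spread/2] are passed through a
    logistic function, which concentrates test points near the top and
    bottom of the document where retrieval behavior changes fastest.
    """
    xs = [-spread / 2 + spread * i / (n - 1) for i in range(n)]
    return [100 / (1 + math.exp(-x)) for x in xs]

# Endpoints land near 0.8% and 99.2% doc depth; the midpoint is 50%.
print([round(d, 2) for d in sigmoid_depths()])
```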
Claude 2.1 (200K) retrieval accuracy progressively decreased as context length increased.