LLM Context Length Testing

Pressure Testing GPT-4 128K via "Needle In A HayStack"

Asking GPT-4 To Do Fact Retrieval Across Context Lengths & Document Depth

[Figure: heatmap of retrieval accuracy (0% to 100%) across 15 context lengths (1K to 128K tokens, x-axis) and the placed fact's document depth (top of document to bottom of document, y-axis).]

Goal: Test GPT-4's Ability To Retrieve Information From Large Context Windows

A fact was placed within a document. GPT-4 (1106-preview) was then asked to retrieve it. The output was evaluated for accuracy.

This test was run at 15 different document depths (top to bottom) and 15 different context lengths (1K to 128K tokens).

Tests were run twice at the larger context lengths to increase the sample size.

GPT-4's recall ability started to degrade at large context lengths when the placed fact was located between 10% and 50% document depth.
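To make the setup concrete, the sketch below (Python) runs one trial of the test for a single context-length / depth cell. It assumes the OpenAI Python client and tiktoken for token counting; the needle text, question, prompt wording, filler source, and model string are illustrative placeholders, not necessarily the exact ones used in the original runs.

import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

# Illustrative needle and question; the originals may differ.
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_context(filler_text, context_length, depth_fraction):
    # Trim the filler text to the token budget and splice the needle in at
    # depth_fraction (0.0 = top of document, 1.0 = bottom of document).
    needle_tokens = enc.encode(NEEDLE)
    filler_tokens = enc.encode(filler_text)[: context_length - len(needle_tokens)]
    insert_at = int(len(filler_tokens) * depth_fraction)
    tokens = filler_tokens[:insert_at] + needle_tokens + filler_tokens[insert_at:]
    return enc.decode(tokens)

def run_trial(filler_text, context_length, depth_fraction):
    # Build the haystack, ask the model under test, and return its answer.
    context = build_context(filler_text, context_length, depth_fraction)
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # model under test
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": context + "\n\n" + QUESTION},
        ],
    )
    return response.choices[0].message.content

Sweeping context_length over the 15 tested lengths and depth_fraction over the 15 tested depths, then grading each answer against the needle for accuracy, produces the grid summarized in the chart above.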

Pressure Testing Claude 2.1 200K via "Needle In A HayStack"

Asking Claude 2.1 To Do Fact Retrieval Across Context Lengths & Document Depth

[Figure: heatmap of retrieval accuracy (0% to 100%) across 35 context lengths (1K to 200K tokens, x-axis) and the placed fact's document depth (top of document to bottom of document, y-axis).]

Goal: Test Claude 2.1's Ability To Retrieve Information From Large Context Windows

A fact was placed within a document. Claude 2.1 (200K) was then asked to retrieve it. The output was evaluated (with GPT-4) for accuracy.

This test was run at 35 different document depths (top to bottom) and 35 different context lengths (1K to 200K tokens).

Document depths followed a sigmoid distribution.
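A sigmoid schedule spaces the depths densely near the top and bottom of the document and more sparsely around the middle. The sketch below (Python) generates such a schedule with a logistic curve; the midpoint and steepness values are illustrative guesses, not taken from the original harness.

import math

def sigmoid_depths(n=35, midpoint=50.0, steepness=0.1):
    # Return n document-depth percentages on a logistic curve: evenly
    # spaced inputs from 0 to 100 are squashed so that the depths cluster
    # near 0% and 100% and spread out around 50%.
    xs = [100 * i / (n - 1) for i in range(n)]
    depths = [100 / (1 + math.exp(-steepness * (x - midpoint))) for x in xs]
    depths[0], depths[-1] = 0.0, 100.0  # pin the endpoints to the document edges
    return [round(d, 2) for d in depths]

# Example: sigmoid_depths()[:5] -> roughly [0.0, 0.9, 1.2, 1.6, 2.1]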

Claude 2.1 (200K) retrieval accuracy progressively decreased as context length increased.