LMMs-Eval: LLaVA-1.5 / LLaVA-1.6 results (update: Mar. 8th, 2024)

Env Info: torch 2.2.1 + cuda 12.1 (all lmms-eval runs below).

Checkpoints evaluated: liuhaotian/llava-v1.5-7b, liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.6-mistral-7b, liuhaotian/llava-v1.6-vicuna-7b, liuhaotian/llava-v1.6-vicuna-13b, liuhaotian/llava-v1.6-34b. The "(report)" columns give the numbers published in the papers; "-" means not reported, and blank cells mark runs not recorded in this sheet.

| Datasets | Split | Metric | #Num | 1.5-7B (report) | 1.5-7B (lmms-eval) | 1.5-13B (report) | 1.5-13B (lmms-eval) | 1.6-mistral-7B (lmms-eval) | 1.6-vicuna-7B (lmms-eval) | 1.6-vicuna-13B (lmms-eval) | 1.6-34B (lmms-eval) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AI2D | test | Acc | 3,088 | - | 54.8 | - | 59.5 | 60.8 | 66.6 | 70.0 | 74.9 | |
| ChartQA | test | RelaxedAcc | 2,500 | - | 18.2 | - | 18.2 | 38.8 | 54.8 | 62.2 | 68.7 | |
| CMMMU | val | Acc | 900 | - | 21.8 | - | 26.3 | 22.7 | 24.0 | 23.2 | 39.9 | |
| ClothoAQA | test | GPT-Eval-Avg | | | | | | | | | | |
| COCO-Cap | cococap_val_2014 | CIDEr | 40,504 | - | 108.7 | - | 113.9 | 107.7 | 97.0 | 99.5 | 103.2 | |
| COCO-Cap | cococap_val_2017 | CIDEr | 5,000 | - | 110.4 | - | 115.6 | 109.2 | 99.9 | 102.0 | 105.9 | |
| DocVQA | val | ANLS | 5,349 | - | 28.1 | - | 30.3 | 72.2 | 74.4 | 77.5 | 84.0 | |
| Flickr | - | CIDEr | 31,784 | - | 74.9 | - | 79.6 | 73.1 | 68.4 | 66.7 | 68.5 | |
| GQA | gqa_eval | Acc | 12,578 | 62.00 | 62.0 | 63.3 | 63.2 | 55.0 | 64.2 | 65.4 | 67.1 | |
| Hallusion-Bench | test | All Acc. | 951 | | 44.9 | | 42.3 | 41.7 | 41.5 | 44.5 | | |
| InfoVQA | val | ANLS | 2,801 | - | 25.8 | - | 29.4 | 43.8 | 37.1 | 41.3 | 51.5 | |
| LLaVA-W | test | GPT-Eval-Avg | 60 | 63.40 | 65.3 (0314) / 59.6 (0613) | - | 72.8 (0314) / 66.1 (0613) | 71.7 (0613) | 72.3 (0613) | 72.3 (0613) | | LLaVA-1.5 used GPT-4-0314 as the judge, but that model has been deprecated; we use GPT-4-0613, which yields lower scores for all model versions. |
| MathVista | testmini | Acc | 1,000 | 27.40 | 26.7 | 27.6 | 26.4 | 37.4 | 34.4 | 35.1 | | |
| MMBench | dev | Acc | 4,377 (dev) | 64.30 | 64.8 | 67.7 | 68.7 | | | | | |
| MMBench-Ch | dev | Acc | 4,329 (dev) | 58.30 | 57.6 | 63.6 | 62.5 | | | | | |
| MME-Cognition | test | total score | 2,374 | - | 348.2 | - | 295.4 | 323.9 | 322.5 | 316.8 | 397.1 | |
| MME-Perception | test | total score | 2,374 | 1510.70 | 1510.8 | - | 1522.6 | 1500.9 | 1519.3 | 1575.1 | 1633.2 | |
| MMMU | val | Acc | 900 | - | 35.3 | 36.4 | 34.8 | 33.4 | 35.1 | 35.9 | 46.7 | Implementation needs improvement: LLaVA-NeXT reports results using multiple images, while lmms-eval currently considers only a single image. |
| MMVet | test | GPT-Eval-Avg | 218 | 30.50 | 30.6 | - | 35.3 | 47.8 | 44.1 | 49.1 | | |
| MultidocVQA | val | ANLS/Acc | 5,187 | | 16.65/7.21 | | 18.25/8.02 | 41.4/27.89 | 44.42/31.32 | 46.28/32.56 | 50.16/34.93 | |
| NoCaps | nocaps_eval | CIDEr | 4,500 | - | 105.5 | - | 109.3 | 96.1 | 88.3 | 88.3 | 91.9 | |
| OKVQA | val | Acc | 5,046 | - | 53.4 | - | 58.2 | 54.8 | 44.3 | 46.3 | 46.8 | |
| POPE | test | F1 Score | 9,000 | 85.90 | 85.9 | - | 85.9 | 86.8 | 86.4 | 86.3 | 87.8 | |
| ScienceQA | scienceqa-full | Acc | 4,114 | - | 70.4 | - | 75.0 | 0.2 | 73.2 | 75.9 | 85.8 | |
| ScienceQA | scienceqa-img | Acc | 2,017 | 66.80 | 70.4 | 71.6 | 72.9 | 0.0 | 70.2 | 73.6 | | |
| SEED-Bench | Seed-1 | Image-Acc | 17,990 | total: 58.6 | total: 60.49 | image: 66.92 | image: 67.06 | 66.0 | 64.7 | 65.6 | 69.6 | |
| SEED-Bench-2 | Seed-2 | Acc | 24,371 | | total: 57.89 | | total: 59.88 | 60.8 | 59.9 | 60.7 | 65.0 | |
| Refcoco | all | CIDEr | | | 29.8 | | 34.3 | 9.5 | 34.2 | 34.8 | | |
| Refcoco | bbox-test | CIDEr | 5,000 | | 32.5 | | 34.3 | 9.6 | 36.2 | 38.2 | | |
| Refcoco | bbox-testA | CIDEr | 1,975 | | 16.0 | | 16.7 | 5.9 | 18.5 | 18.6 | | |
| Refcoco | bbox-testB | CIDEr | 1,810 | | 42.0 | | 45.2 | 12.5 | 49.9 | 51.0 | | |
| Refcoco | bbox-val | CIDEr | 8,811 | | 30.4 | | 33.1 | 9.9 | 36.3 | 37.3 | | |
| Refcoco | seg-test | CIDEr | 5,000 | | 30.4 | | 32.0 | 9.4 | 33.8 | 33.5 | | |
| Refcoco | seg-testA | CIDEr | 1,975 | | 14.4 | | 15.5 | 5.3 | 15.4 | 14.7 | | |
| Refcoco | seg-testB | CIDEr | 1,810 | | 40.2 | | 43.5 | 12.9 | 47.2 | 47.0 | | |
| Refcoco | seg-val | CIDEr | 8,811 | | 29.1 | | 31.5 | 9.4 | 33.1 | 33.2 | | |
| Refcoco+ | all | CIDEr | | | 28.9 | | 31.0 | 9.1 | 31.8 | 32.0 | | |
| Refcoco+ | bbox-testA | CIDEr | 1,975 | | 20.3 | | 19.8 | 6.6 | 22.1 | 21.6 | | |
| Refcoco+ | bbox-testB | CIDEr | 1,798 | | 39.1 | | 41.6 | 11.2 | 43.9 | 44.9 | | |
| Refcoco+ | bbox-val | CIDEr | 3,805 | | 30.2 | | 33.4 | 9.6 | 34.5 | 35.6 | | |
| Refcoco+ | seg-testA | CIDEr | 1,975 | | 18.0 | | 18.3 | 6.1 | 18.1 | 17.9 | | |
| Refcoco+ | seg-testB | CIDEr | 1,798 | | 37.5 | | 40.0 | 11.6 | 41.2 | 41.7 | | |
| Refcoco+ | seg-val | CIDEr | 3,805 | | 21.5 | | 31.8 | 9.1 | 31.2 | 30.5 | | |
| Refcocog | all | CIDEr | | | 57.8 | | 59.2 | 19.4 | 52.2 | 58.0 | | |
| Refcocog | bbox-test | CIDEr | 5,023 | | 58.9 | | 59.9 | 20.2 | 53.3 | 61.8 | | |
| Refcocog | bbox-val | CIDEr | 7,573 | | 60.5 | | 61.6 | 19.8 | 55.3 | 61.2 | | |
| Refcocog | seg-test | CIDEr | 5,023 | | 55.8 | | 57.3 | 18.8 | 49.4 | 54.3 | | |
| Refcocog | seg-val | CIDEr | 7,573 | | 55.6 | | 57.7 | 18.7 | 50.2 | 54.8 | | |
| TextCaps | val | CIDEr | 3,166 | - | 98.2 | - | 103.9 | 70.4 | 71.8 | 67.4 | 67.1 | |
| TextVQA | val | exact_match | 5,000 | - | 46.1 | - | 48.7 | 65.8 | 64.9 | 66.9 | 69.3 | In the LLaVA paper, the OCR tokens were included in the prompt for TextVQA evaluation; see the related issue for reference. |
| VizWiz | val | Acc | 4,319 | - | 54.4 | - | 56.7 | 63.8 | 60.6 | 63.6 | 66.6 | |
| VQAv2 | val | Acc | 214,354 | - | 76.6 | - | 78.3 | 80.3 | 80.1 | 80.9 | 82.1 | |
| VQAv2 | test | Acc | - | 78.50 | | 80.0 | 80.0 | | | | | |
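
Several rows above use string-similarity metrics rather than plain accuracy. As a sanity-check aid for the ANLS column (DocVQA, InfoVQA, MultidocVQA), here is a minimal Python sketch of Average Normalized Levenshtein Similarity as commonly defined for DocVQA, with the standard threshold τ = 0.5. This is a reference sketch, not lmms-eval's own code; the library's answer normalization (casing, whitespace) may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def anls(prediction: str, gold_answers: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best similarity against any gold answer,
    zeroed out when the normalized distance reaches the threshold tau
    (tau = 0.5 is the standard DocVQA setting)."""
    best = 0.0
    for gold in gold_answers:
        p, g = prediction.strip().lower(), gold.strip().lower()
        if not p and not g:
            return 1.0
        nl = levenshtein(p, g) / max(len(p), len(g))
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best


# Dataset-level ANLS is the mean over questions:
# score = sum(anls(pred, golds) for pred, golds in results) / len(results)
```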
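
Likewise for the ChartQA row: RelaxedAcc, per the ChartQA paper, accepts a numeric prediction within 5% relative error of the gold value and otherwise falls back to exact string match. A minimal sketch under those assumptions (again, lmms-eval's exact parsing and normalization may differ):

```python
def relaxed_accuracy(prediction: str, gold: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed match: <=5% relative error for numeric
    answers, exact string match otherwise."""
    pred, tgt = prediction.strip().lower(), gold.strip().lower()
    try:
        p = float(pred.rstrip('%'))
        t = float(tgt.rstrip('%'))
    except ValueError:
        return pred == tgt            # non-numeric answers: exact match
    if t == 0.0:
        return p == 0.0               # avoid division by zero
    return abs(p - t) / abs(t) <= tolerance


# Example: relaxed_accuracy("41.9", "42") -> True (~0.24% relative error)
```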