LMMs-Eval results: LLaVA-1.5 vs. LLaVA-1.6 (updated Mar. 8th, 2024)

Env info: all runs below use torch 2.2.1 + CUDA 12.1.

In the table, "(report)" columns quote the numbers published by the LLaVA authors, "(lmms-eval)" columns are reproduced with LMMs-Eval, and "-" marks values that were not reported. The "(lmms-eval)" columns correspond to the following checkpoints:

- 1.5-7B: liuhaotian/llava-v1.5-7b
- 1.5-13B: liuhaotian/llava-v1.5-13b
- 1.6-mistral-7B: liuhaotian/llava-v1.6-mistral-7b
- 1.6-vicuna-7B: liuhaotian/llava-v1.6-vicuna-7b
- 1.6-vicuna-13B: liuhaotian/llava-v1.6-vicuna-13b
- 1.6-34B: liuhaotian/llava-v1.6-34b
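Each "(lmms-eval)" cell corresponds to one LMMs-Eval run per model/benchmark pair. The sketch below shows one way to script such a run; the CLI flags follow the lm-evaluation-harness convention that LMMs-Eval inherits, but treat the exact flag names and task identifiers as assumptions to verify against your installed version.

```python
# Reproduction sketch for one (model, task) cell of the table below.
# Flag names follow the lm-evaluation-harness-style CLI that lmms-eval
# inherits; check them against your installed lmms-eval version.
import subprocess

CHECKPOINTS = {
    "1.5-7b": "liuhaotian/llava-v1.5-7b",
    "1.5-13b": "liuhaotian/llava-v1.5-13b",
    "1.6-mistral-7b": "liuhaotian/llava-v1.6-mistral-7b",
    "1.6-vicuna-7b": "liuhaotian/llava-v1.6-vicuna-7b",
    "1.6-vicuna-13b": "liuhaotian/llava-v1.6-vicuna-13b",
    "1.6-34b": "liuhaotian/llava-v1.6-34b",
}

def run_eval(model_key: str, task: str) -> None:
    """Launch a single evaluation run, e.g. run_eval("1.5-7b", "mme").

    Task names must match lmms-eval's task registry.
    """
    cmd = [
        "python", "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", f"pretrained={CHECKPOINTS[model_key]}",
        "--tasks", task,
        "--batch_size", "1",
        "--log_samples",  # keep per-sample outputs for auditing
        "--output_path", f"./logs/{model_key}/{task}",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_eval("1.5-7b", "mme")  # one of the MME rows for LLaVA-1.5-7B
```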
| Dataset | Split | Metric | #Num | 1.5-7B (report) | 1.5-7B (lmms-eval) | 1.5-13B (report) | 1.5-13B (lmms-eval) | 1.6-mistral-7B (lmms-eval) | 1.6-vicuna-7B (lmms-eval) | 1.6-vicuna-13B (lmms-eval) | 1.6-34B (lmms-eval) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AI2D | test | Acc | 3,088 | - | 54.79 | - | 59.49 | 60.75 | 66.58 | 70.04 | 74.94 | |
| ChartQA | test | RelaxedAcc | 2,500 | - | 18.24 | - | 18.20 | 38.76 | 54.84 | 62.2 | 68.72 | |
| CMMMU | val | Acc | 900 | - | 21.80 | - | 26.30 | 22.7 | 24 | 23.2 | 39.9 | |
| COCO-Cap | cococap_val_2014 | CIDEr | 40,504 | - | 108.66 | - | 113.88 | 107.66 | 96.98 | 99.45 | 103.16 | |
| COCO-Cap | cococap_val_2017 | CIDEr | 5,000 | - | 110.38 | - | 115.61 | 109.22 | 99.93 | 101.99 | 105.89 | |
| DocVQA | val | ANLS | 5,349 | - | 28.08 | - | 30.29 | 72.16 | 74.35 | 77.45 | 83.98 | |
| Flickr | - | CIDEr | 31,784 | - | 74.93 | - | 79.59 | 73.14 | 68.44 | 66.7 | 68.48 | |
| GQA | gqa_eval | Acc | 12,578 | 62.00 | 61.97 | 63.30 | 63.24 | 54.98 | 64.23 | 65.36 | 67.08 | |
| Hallusion-Bench | test | All Acc. | 951 | - | 44.90 | - | 42.27 | 41.74 | 41.53 | 44.47 | | |
| InfoVQA | val | ANLS | 2,801 | - | 25.81 | - | 29.35 | 43.77 | 37.09 | 41.34 | 51.45 | |
| LLaVA-W | test | GPT-Eval-Avg | 60 | 63.40 | 65.3 (0314) / 59.6 (0613) | - | 72.8 (0314) / 66.1 (0613) | 71.7 (0613) | 72.3 (0613) | 72.3 (0613) | | LLaVA-1.5 used GPT-4-0314 as the judge, which has since been deprecated; we use GPT-4-0613, which gives lower scores across all model versions. |
| MathVista | testmini | Acc | 1,000 | 27.40 | 26.70 | 27.60 | 26.40 | 37.4 | 34.4 | 35.1 | | |
| MMBench | dev | Acc | 4,377 (dev) | 64.30 | 64.80 | 67.70 | 68.73 | | | | | |
| MMBench-CN | dev | Acc | 4,329 (dev) | 58.30 | 57.62 | 63.60 | 62.54 | | | | | |
| MME-Cognition | test | total score | 2,374 | - | 348.21 | - | 295.35 | 323.92 | 322.5 | 316.78 | 397.14 | |
| MME-Perception | test | total score | 2,374 | 1510.70 | 1510.75 | - | 1522.59 | 1500.85 | 1519.29 | 1575.07 | 1633.24 | |
| MMMU | val | Acc | 900 | - | 35.30 | 36.40 | 34.80 | 33.4 | 35.1 | 35.9 | 46.7 | Implementation needs improvement: LLaVA-NeXT reports results using multiple images, while lmms-eval currently considers only a single image. |
| MMVet | test | GPT-Eval-Avg | 218 | 30.50 | 30.55 | - | 35.25 | 47.75 | 44.08 | 49.12 | | |
| MultidocVQA | val | ANLS / Acc | 5,187 | - | 16.65 / 7.21 | - | 18.25 / 8.02 | 41.4 / 27.89 | 44.42 / 31.32 | 46.28 / 32.56 | 50.16 / 34.93 | |
| NoCaps | nocaps_eval | CIDEr | 4,500 | - | 105.54 | - | 109.28 | 96.14 | 88.29 | 88.27 | 91.94 | |
| OKVQA | val | Acc | 5,046 | - | 53.44 | - | 58.22 | 54.77 | 44.25 | 46.27 | 46.84 | |
| POPE | test | F1 Score | 9,000 | 85.90 | 85.87 | - | 85.92 | 86.79 | 86.4 | 86.26 | 87.77 | |
| ScienceQA | scienceqa-full | Acc | 4,114 | - | 70.41 | - | 74.9 | 60.23 | 73.21 | 75.85 | 85.81 | |
| ScienceQA | scienceqa-img | Acc | 2,017 | 66.80 | 70.43 | 71.60 | 72.88 | 0 | 70.15 | 73.57 | | |
| SEED-Bench | Seed-1 | Image-Acc | 17,990 | total: 58.6 | total: 60.49 | image: 66.92 | image: 67.06 | 65.97 | 64.74 | 65.64 | 69.55 | |
| SEED-Bench-2 | Seed-2 | Acc | 24,371 | | total: 57.89 | | total: 59.88 | 60.83 | 59.88 | 60.72 | 64.98 | |
| Refcoco | all | CIDEr | | | 29.76 | | 34.26 | 9.47 | 34.2 | 34.75 | | |
| Refcoco | bbox-test | CIDEr | 5,000 | | 32.45 | | 34.26 | 9.63 | 36.17 | 38.2 | | |
| Refcoco | bbox-testA | CIDEr | 1,975 | | 15.98 | | 16.68 | 5.9 | 18.47 | 18.63 | | |
| Refcoco | bbox-testB | CIDEr | 1,810 | | 41.99 | | 45.15 | 12.5 | 49.91 | 51.01 | | |
| Refcoco | bbox-val | CIDEr | 8,811 | | 30.35 | | 33.12 | 9.88 | 36.28 | 37.27 | | |
| Refcoco | seg-test | CIDEr | 5,000 | | 30.44 | | 32.03 | 9.42 | 33.79 | 33.52 | | |
| Refcoco | seg-testA | CIDEr | 1,975 | | 14.44 | | 15.49 | 5.26 | 15.43 | 14.74 | | |
| Refcoco | seg-testB | CIDEr | 1,810 | | 40.19 | | 43.47 | 12.9 | 47.18 | 46.97 | | |
| Refcoco | seg-val | CIDEr | 8,811 | | 29.12 | | 31.54 | 9.42 | 33.1 | 33.23 | | |
| Refcoco+ | all | CIDEr | | | 28.92 | | 31.01 | 9.05 | 31.82 | 32 | | |
| Refcoco+ | bbox-testA | CIDEr | 1,975 | | 20.34 | | 19.78 | 6.61 | 22.1 | 21.62 | | |
| Refcoco+ | bbox-testB | CIDEr | 1,798 | | 39.09 | | 41.61 | 11.18 | 43.85 | 44.93 | | |
| Refcoco+ | bbox-val | CIDEr | 3,805 | | 30.16 | | 33.36 | 9.56 | 34.53 | 35.56 | | |
| Refcoco+ | seg-testA | CIDEr | 1,975 | | 17.98 | | 18.34 | 6.05 | 18.1 | 17.85 | | |
| Refcoco+ | seg-testB | CIDEr | 1,798 | | 37.46 | | 40.02 | 11.64 | 41.15 | 41.68 | | |
| Refcoco+ | seg-val | CIDEr | 3,805 | | 21.50 | | 31.81 | 9.12 | 31.19 | 30.47 | | |
| Refcocog | all | CIDEr | | | 57.76 | | 59.23 | 19.35 | 52.18 | 58.02 | | |
| Refcocog | bbox-test | CIDEr | 5,023 | | 58.90 | | 59.86 | 20.2 | 53.31 | 61.83 | | |
| Refcocog | bbox-val | CIDEr | 7,573 | | 60.45 | | 61.61 | 19.77 | 55 | 61 | | |
| Refcocog | seg-test | CIDEr | 5,023 | | 55.78 | | 57.34 | 18.82 | 49.36 | 54.28 | | |
| Refcocog | seg-val | CIDEr | 7,573 | | 55.63 | | 57.70 | 18.71 | 50.19 | 54.78 | | |
| TextCaps | val | CIDEr | 3,166 | - | 98.15 | - | 103.92 | 70.39 | 71.79 | 67.39 | 67.11 | |
| TextVQA | val | exact_match | 5,000 | - | 46.07 | - | 48.73 | 65.76 | 64.85 | 66.92 | 69.31 | In the LLaVA paper, OCR tokens were included in the prompt for TextVQA evaluation; see the related issue for reference. |
| VizWiz | val | Acc | 4,319 | - | 54.39 | - | 56.65 | 63.79 | 60.64 | 63.56 | 66.61 | |
| VQAv2 | val | Acc | 214,354 | - | 76.64 | - | 78.26 | 80.32 | 80.06 | 80.92 | 82.07 | |
| VQAv2 | test | Acc | | - | 78.50 | 80.00 | 79.99 | | | | | |
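Several rows above (DocVQA, InfoVQA, MultidocVQA) report ANLS, the Average Normalized Levenshtein Similarity. As a reading aid, here is a minimal self-contained sketch of the standard DocVQA-style definition (threshold 0.5); the helper names are our own, not LMMs-Eval's actual implementation.

```python
# Minimal ANLS sketch (DocVQA-style): per question, take the best similarity
# against any ground-truth answer; similarities below the 0.5 threshold score 0.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(predictions: list[str], answers: list[list[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over all questions."""
    total = 0.0
    for pred, golds in zip(predictions, answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / max(len(predictions), 1)

# e.g. anls(["42"], [["42", "forty-two"]]) == 1.0
```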
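Similarly, the ChartQA row reports RelaxedAcc. Below is a minimal sketch of the commonly used definition: exact string match for textual answers, and a 5% relative-error tolerance for numeric ones. Again, the function names are illustrative rather than LMMs-Eval's actual implementation.

```python
# Minimal relaxed-accuracy sketch (ChartQA-style): numeric predictions pass
# within a 5% relative tolerance of the gold value; other answers must match
# exactly (case-insensitive).
def _to_float(s: str) -> float | None:
    try:
        return float(s.strip().rstrip("%"))
    except ValueError:
        return None

def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    p, g = _to_float(pred), _to_float(gold)
    if p is not None and g is not None:
        if g == 0:
            return p == 0
        return abs(p - g) / abs(g) <= tol
    return pred.strip().lower() == gold.strip().lower()

def relaxed_accuracy(preds: list[str], golds: list[str]) -> float:
    return sum(relaxed_match(p, g) for p, g in zip(preds, golds)) / max(len(preds), 1)

# e.g. relaxed_accuracy(["10.3", "cat"], ["10", "Cat"]) == 1.0
```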