LMMs-Eval results: LLaVA-1.5 vs. LLaVA-1.6 (updated Mar. 8th, 2024)

Env info: all runs below use torch 2.2.1 + CUDA 12.1.

In the table, "(report)" columns quote the numbers published by the LLaVA authors, "(lmms-eval)" columns are reproduced with LMMs-Eval, and "-" marks values that were not reported. The "(lmms-eval)" columns correspond to the following checkpoints:

- 1.5-7B: liuhaotian/llava-v1.5-7b
- 1.5-13B: liuhaotian/llava-v1.5-13b
- 1.6-mistral-7B: liuhaotian/llava-v1.6-mistral-7b
- 1.6-vicuna-7B: liuhaotian/llava-v1.6-vicuna-7b
- 1.6-vicuna-13B: liuhaotian/llava-v1.6-vicuna-13b
- 1.6-34B: liuhaotian/llava-v1.6-34b
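Each "(lmms-eval)" cell corresponds to one LMMs-Eval run per model/benchmark pair. The sketch below shows one way to script such a run; the CLI flags follow the lm-evaluation-harness convention that LMMs-Eval inherits, but treat the exact flag names and task identifiers as assumptions to verify against your installed version.

```python
# Reproduction sketch for one (model, task) cell of the table below.
# Flag names follow the lm-evaluation-harness-style CLI that lmms-eval
# inherits; check them against your installed lmms-eval version.
import subprocess

CHECKPOINTS = {
    "1.5-7b": "liuhaotian/llava-v1.5-7b",
    "1.5-13b": "liuhaotian/llava-v1.5-13b",
    "1.6-mistral-7b": "liuhaotian/llava-v1.6-mistral-7b",
    "1.6-vicuna-7b": "liuhaotian/llava-v1.6-vicuna-7b",
    "1.6-vicuna-13b": "liuhaotian/llava-v1.6-vicuna-13b",
    "1.6-34b": "liuhaotian/llava-v1.6-34b",
}

def run_eval(model_key: str, task: str) -> None:
    """Launch a single evaluation run, e.g. run_eval("1.5-7b", "mme").

    Task names must match lmms-eval's task registry.
    """
    cmd = [
        "python", "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", f"pretrained={CHECKPOINTS[model_key]}",
        "--tasks", task,
        "--batch_size", "1",
        "--log_samples",  # keep per-sample outputs for auditing
        "--output_path", f"./logs/{model_key}/{task}",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_eval("1.5-7b", "mme")  # one of the MME rows for LLaVA-1.5-7B
```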
| Dataset | Split | Metric | #Num | 1.5-7B (report) | 1.5-7B (lmms-eval) | 1.5-13B (report) | 1.5-13B (lmms-eval) | 1.6-mistral-7B (lmms-eval) | 1.6-vicuna-7B (lmms-eval) | 1.6-vicuna-13B (lmms-eval) | 1.6-34B (lmms-eval) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AI2D | test | Acc | 3,088 | - | 54.79 | - | 59.49 | 60.75 | 66.58 | 70.04 | 74.94 | |
| ChartQA | test | RelaxedAcc | 2,500 | - | 18.24 | - | 18.20 | 38.76 | 54.84 | 62.2 | 68.72 | |
| CMMMU | val | Acc | 900 | - | 21.80 | - | 26.30 | 22.7 | 24 | 23.2 | 39.9 | |
| COCO-Cap | cococap_val_2014 | CIDEr | 40,504 | - | 108.66 | - | 113.88 | 107.66 | 96.98 | 99.45 | 103.16 | |
| COCO-Cap | cococap_val_2017 | CIDEr | 5,000 | - | 110.38 | - | 115.61 | 109.22 | 99.93 | 101.99 | 105.89 | |
| DocVQA | val | ANLS | 5,349 | - | 28.08 | - | 30.29 | 72.16 | 74.35 | 77.45 | 83.98 | |
| Flickr | - | CIDEr | 31,784 | - | 74.93 | - | 79.59 | 73.14 | 68.44 | 66.7 | 68.48 | |
| GQA | gqa_eval | Acc | 12,578 | 62.00 | 61.97 | 63.30 | 63.24 | 54.98 | 64.23 | 65.36 | 67.08 | |
| Hallusion-Bench | test | All Acc. | 951 | - | 44.90 | - | 42.27 | 41.74 | 41.53 | 44.47 | | |
| InfoVQA | val | ANLS | 2,801 | - | 25.81 | - | 29.35 | 43.77 | 37.09 | 41.34 | 51.45 | |
| LLaVA-W | test | GPT-Eval-Avg | 60 | 63.40 | 65.3 (0314) / 59.6 (0613) | - | 72.8 (0314) / 66.1 (0613) | 71.7 (0613) | 72.3 (0613) | 72.3 (0613) | | LLaVA-1.5 used GPT-4-0314 as the judge, which has since been deprecated; we use GPT-4-0613, which gives lower scores across all model versions. |
| MathVista | testmini | Acc | 1,000 | 27.40 | 26.70 | 27.60 | 26.40 | 37.4 | 34.4 | 35.1 | | |
| MMBench | dev | Acc | 4,377 (dev) | 64.30 | 64.80 | 67.70 | 68.73 | | | | | |
| MMBench-CN | dev | Acc | 4,329 (dev) | 58.30 | 57.62 | 63.60 | 62.54 | | | | | |
| MME-Cognition | test | total score | 2,374 | - | 348.21 | - | 295.35 | 323.92 | 322.5 | 316.78 | 397.14 | |
| MME-Perception | test | total score | 2,374 | 1510.70 | 1510.75 | - | 1522.59 | 1500.85 | 1519.29 | 1575.07 | 1633.24 | |
| MMMU | val | Acc | 900 | - | 35.30 | 36.40 | 34.80 | 33.4 | 35.1 | 35.9 | 46.7 | Implementation needs improvement: LLaVA-NeXT reports results using multiple images, while lmms-eval currently considers only a single image. |
| MMVet | test | GPT-Eval-Avg | 218 | 30.50 | 30.55 | - | 35.25 | 47.75 | 44.08 | 49.12 | | |
| MultidocVQA | val | ANLS / Acc | 5,187 | - | 16.65 / 7.21 | - | 18.25 / 8.02 | 41.4 / 27.89 | 44.42 / 31.32 | 46.28 / 32.56 | 50.16 / 34.93 | |
| NoCaps | nocaps_eval | CIDEr | 4,500 | - | 105.54 | - | 109.28 | 96.14 | 88.29 | 88.27 | 91.94 | |
| OKVQA | val | Acc | 5,046 | - | 53.44 | - | 58.22 | 54.77 | 44.25 | 46.27 | 46.84 | |
| POPE | test | F1 Score | 9,000 | 85.90 | 85.87 | - | 85.92 | 86.79 | 86.4 | 86.26 | 87.77 | |
| ScienceQA | scienceqa-full | Acc | 4,114 | - | 70.41 | - | 74.9 | 60.23 | 73.21 | 75.85 | 85.81 | |
| ScienceQA | scienceqa-img | Acc | 2,017 | 66.80 | 70.43 | 71.60 | 72.88 | 0 | 70.15 | 73.57 | | |
| SEED-Bench | Seed-1 | Image-Acc | 17,990 | total: 58.6 | total: 60.49 | image: 66.92 | image: 67.06 | 65.97 | 64.74 | 65.64 | 69.55 | |
| SEED-Bench-2 | Seed-2 | Acc | 24,371 | | total: 57.89 | | total: 59.88 | 60.83 | 59.88 | 60.72 | 64.98 | |
| Refcoco | all | CIDEr | | | 29.76 | | 34.26 | 9.47 | 34.2 | 34.75 | | |
| Refcoco | bbox-test | CIDEr | 5,000 | | 32.45 | | 34.26 | 9.63 | 36.17 | 38.2 | | |
| Refcoco | bbox-testA | CIDEr | 1,975 | | 15.98 | | 16.68 | 5.9 | 18.47 | 18.63 | | |
| Refcoco | bbox-testB | CIDEr | 1,810 | | 41.99 | | 45.15 | 12.5 | 49.91 | 51.01 | | |
| Refcoco | bbox-val | CIDEr | 8,811 | | 30.35 | | 33.12 | 9.88 | 36.28 | 37.27 | | |
| Refcoco | seg-test | CIDEr | 5,000 | | 30.44 | | 32.03 | 9.42 | 33.79 | 33.52 | | |
| Refcoco | seg-testA | CIDEr | 1,975 | | 14.44 | | 15.49 | 5.26 | 15.43 | 14.74 | | |
| Refcoco | seg-testB | CIDEr | 1,810 | | 40.19 | | 43.47 | 12.9 | 47.18 | 46.97 | | |
| Refcoco | seg-val | CIDEr | 8,811 | | 29.12 | | 31.54 | 9.42 | 33.1 | 33.23 | | |
| Refcoco+ | all | CIDEr | | | 28.92 | | 31.01 | 9.05 | 31.82 | 32 | | |
| Refcoco+ | bbox-testA | CIDEr | 1,975 | | 20.34 | | 19.78 | 6.61 | 22.1 | 21.62 | | |
| Refcoco+ | bbox-testB | CIDEr | 1,798 | | 39.09 | | 41.61 | 11.18 | 43.85 | 44.93 | | |
| Refcoco+ | bbox-val | CIDEr | 3,805 | | 30.16 | | 33.36 | 9.56 | 34.53 | 35.56 | | |
| Refcoco+ | seg-testA | CIDEr | 1,975 | | 17.98 | | 18.34 | 6.05 | 18.1 | 17.85 | | |
| Refcoco+ | seg-testB | CIDEr | 1,798 | | 37.46 | | 40.02 | 11.64 | 41.15 | 41.68 | | |
| Refcoco+ | seg-val | CIDEr | 3,805 | | 21.50 | | 31.81 | 9.12 | 31.19 | 30.47 | | |
| Refcocog | all | CIDEr | | | 57.76 | | 59.23 | 19.35 | 52.18 | 58.02 | | |
| Refcocog | bbox-test | CIDEr | 5,023 | | 58.90 | | 59.86 | 20.2 | 53.31 | 61.83 | | |
| Refcocog | bbox-val | CIDEr | 7,573 | | 60.45 | | 61.61 | 19.77 | 55 | 61 | | |
| Refcocog | seg-test | CIDEr | 5,023 | | 55.78 | | 57.34 | 18.82 | 49.36 | 54.28 | | |
| Refcocog | seg-val | CIDEr | 7,573 | | 55.63 | | 57.70 | 18.71 | 50.19 | 54.78 | | |
| TextCaps | val | CIDEr | 3,166 | - | 98.15 | - | 103.92 | 70.39 | 71.79 | 67.39 | 67.11 | |
| TextVQA | val | exact_match | 5,000 | - | 46.07 | - | 48.73 | 65.76 | 64.85 | 66.92 | 69.31 | In the LLaVA paper, OCR tokens were included in the prompt for TextVQA evaluation; see the related issue for reference. |
| VizWiz | val | Acc | 4,319 | - | 54.39 | - | 56.65 | 63.79 | 60.64 | 63.56 | 66.61 | |
| VQAv2 | val | Acc | 214,354 | - | 76.64 | - | 78.26 | 80.32 | 80.06 | 80.92 | 82.07 | |
| VQAv2 | test | Acc | | - | 78.50 | 80.00 | 79.99 | | | | | |
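Several rows above (DocVQA, InfoVQA, MultidocVQA) report ANLS, the Average Normalized Levenshtein Similarity. As a reading aid, here is a minimal self-contained sketch of the standard DocVQA-style definition (threshold 0.5); the helper names are our own, not LMMs-Eval's actual implementation.

```python
# Minimal ANLS sketch (DocVQA-style): per question, take the best similarity
# against any ground-truth answer; similarities below the 0.5 threshold score 0.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(predictions: list[str], answers: list[list[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over all questions."""
    total = 0.0
    for pred, golds in zip(predictions, answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / max(len(predictions), 1)

# e.g. anls(["42"], [["42", "forty-two"]]) == 1.0
```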
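Similarly, the ChartQA row reports RelaxedAcc. Below is a minimal sketch of the commonly used definition: exact string match for textual answers, and a 5% relative-error tolerance for numeric ones. Again, the function names are illustrative rather than LMMs-Eval's actual implementation.

```python
# Minimal relaxed-accuracy sketch (ChartQA-style): numeric predictions pass
# within a 5% relative tolerance of the gold value; other answers must match
# exactly (case-insensitive).
def _to_float(s: str) -> float | None:
    try:
        return float(s.strip().rstrip("%"))
    except ValueError:
        return None

def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    p, g = _to_float(pred), _to_float(gold)
    if p is not None and g is not None:
        if g == 0:
            return p == 0
        return abs(p - g) / abs(g) <= tol
    return pred.strip().lower() == gold.strip().lower()

def relaxed_accuracy(preds: list[str], golds: list[str]) -> float:
    return sum(relaxed_match(p, g) for p, g in zip(preds, golds)) / max(len(preds), 1)

# e.g. relaxed_accuracy(["10.3", "cat"], ["10", "Cat"]) == 1.0
```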