Metric,VLM,ocr,celebrity_recognition,object_localization,attribute_recognition,action_recognition,attribute_comparison,nature_relation,physical_relation,social_relation,identity_reasoning,function_reasoning,physical_property_reasoning,structuralized_imagetext_understanding,future_prediction,image_topic,image_emotion,image_scene,image_style
Standard Acc.,LLaVA-1.5,62.9,84,50,90,95.9,59.1,46.7,40,87.5,100,87,10.3,18.2,52,90.3,85.7,97.8,69.6
Standard Acc.,CogVLM,74.3,87.2,29,82,91.8,43.2,57.8,13.3,90.6,100,91.3,24.1,6.8,16,80.6,83.3,97.8,87
Standard Acc.,Qwen-VL-Chat,65.7,94.7,32.3,78,98,9.1,35.6,33.3,34.4,97.4,87,17.2,9.1,20,83.9,85.7,95.7,78.3
Standard Acc.,Gemini-Pro,74.3,89.4,29,72,91.8,22.7,53.3,13.3,78.1,97.4,89.1,24.1,47.7,28,93.5,83.3,95.7,69.6
Standard Acc.,LLaVA-Next-13B,71.4,86.2,54.8,86,93.9,59.1,57.8,46.7,65.6,100,76.1,6.9,25,28,93.5,85.7,97.8,80.4
Standard Acc.,LLaVA-Next-34B,82.9,86.2,66.1,76,83.7,56.8,82.2,40,59.4,92.1,87,17.2,38.6,24,100,71.4,97.8,78.3
Standard Acc.,GPT-4V,97.1,56.4,45.2,82,91.8,45.5,88.9,20,62.5,97.4,93.5,24.1,72.7,28,100,88.1,96.8,93.5
AAD Acc.,LLaVA-1.5,0,0,0,0,2,0,2.2,0,0,0,4.3,3.4,2.3,4,0,0,0,0
AAD Acc.,CogVLM,0,0,1.6,2,0,0,0,0,0,5.3,0,0,0,0,0,0,0,0
AAD Acc.,Qwen-VL-Chat,11.4,38.3,12.9,20,34.7,0,11.1,0,46.9,44.7,32.6,6.9,2.3,40,6.5,38.1,17.2,0
AAD Acc.,Gemini-Pro,20,61.7,19.4,16,71.4,0,4.4,0,12.5,50,30.4,10.3,0,16,32.3,26.2,45.2,0
AAD Acc.,LLaVA-Next-13B,22.9,38.3,3.2,18,10.2,0,57.8,0,71.9,34.2,34.8,27.6,22.7,44,0,31,2.2,28.3
AAD Acc.,LLaVA-Next-34B,65.7,74.5,24.2,68,67.3,25,64.4,26.7,87.5,92.1,76.1,41.4,31.8,64,48.4,90.5,67.7,60.9
AAD Acc.,GPT-4V,88.6,89.4,16.1,50,87.8,4.5,57.8,46.7,37.5,94.7,43.5,51.7,31.8,60,71,47.6,91.4,47.8
Dual Acc.,LLaVA-1.5,0,0,0,0,2,0,2.2,0,0,0,4.3,0,0,4,0,0,0,0
Dual Acc.,CogVLM,0,0,1.6,2,0,0,0,0,0,5.3,0,0,0,0,0,0,0,0
Dual Acc.,Qwen-VL-Chat,11.4,38.3,4.8,20,34.7,0,8.9,0,25,42.1,28.3,0,0,4,6.5,38.1,17.2,0
Dual Acc.,Gemini-Pro,17.1,59.6,12.9,16,67.3,0,2.2,0,6.2,47.4,28.3,3.4,0,8,32.3,26.2,45.2,0
Dual Acc.,LLaVA-Next-13B,22.9,37.2,3.2,18,10.2,0,42.2,0,40.6,34.2,30.4,3.4,11.4,4,0,28.6,2.2,23.9
Dual Acc.,LLaVA-Next-34B,65.7,72.3,21,62,61.2,25,60,20,46.9,86.8,71.7,6.9,22.7,24,48.4,71.4,67.7,50
Dual Acc.,GPT-4V,88.6,48.9,11.3,48,83.7,2.3,51.1,6.7,25,92.1,43.5,10.3,27.3,16,71,47.6,90.3,43.5