1 of 23

TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing

Presented at AAAI 2026 (Main Technical Track)

Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim


Korea University

MLV Lab

NeurIPS 2024

2 of 23

Table Image Understanding with MLLMs

MLLMs are widely adopted for table image understanding tasks (e.g., question-answering).

[1] Zheng et al., "Multimodal Table Understanding", ACL 2024 (main)


3 of 23

Unique Challenges in Table Image Understanding

However, previous MLLMs do not consider the unique characteristics of table images.

Problem 1: Only a small region within a large table image is relevant to the question.


4 of 23

Unique Challenges in Table Image Understanding

However, previous MLLMs do not consider the unique characteristics of table images.

Problem 1: Only a small region within a large table image is relevant to the question.

Question: What is the time difference between the runners in lane 4 and lane 9?


5 of 23

Unique Challenges in Table Image Understanding

However, previous MLLMs do not consider the unique characteristics of table images.

Problem 1: Only a small region within a large table image is relevant to the question.

Question: What is the time difference between the runners in lane 4 and lane 9?

Answer: 0.81 (20.57 – 19.76)

As a table contains a large amount of information, the question is crucial for effective understanding. However, the vision encoders of previous MLLMs are not aware of the question.


6 of 23

Unique Challenges in Table Image Understanding

However, previous MLLMs do not consider the unique characteristics of table images.

Problem 2: A large portion of a table image is redundant.


7 of 23

Unique Challenges in Table Image Understanding

However, previous MLLMs do not consider the unique characteristics of table images.

Problem 2: A large portion of a table image is redundant.

White background areas make up a large portion of a table image yet contain no information. These backgrounds are nonetheless converted to visual tokens, resulting in excessive computational cost in the LLM.


8 of 23

Problem: Uninformative and redundant visual tokens

Visual tokens generated by previous MLLMs are uninformative (question-agnostic) and redundant (empty tokens), resulting in low performance and high computational cost.


9 of 23

TabFlash utilizes informative and compact visual tokens!

As a solution, we propose the TabFlash architecture, which generates informative (question-aware) and compact (content tokens only) visual tokens, achieving state-of-the-art performance with less computation.


10 of 23

Method 1: Progressive Question Conditioning

To generate question-aware visual tokens, we inject question information into the vision encoder (i.e., the ViT).

Question injection strategy (red box)


11 of 23

Method 1: Progressive Question Conditioning

As the early layers of a ViT are unstable while the later layers are more stable, we progressively increase the question-conditioning frequency, considering each layer's capacity to handle additional information.

Progressive Question Conditioning (red box)
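The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the sparse-to-dense layer schedule and the additive injection of the question embedding are hypothetical choices made for clarity.

```python
import numpy as np

def conditioning_schedule(num_layers):
    """Hypothetical progressive schedule: inject the question rarely in
    early (unstable) ViT layers and at every layer near the end."""
    chosen = []
    for i in range(num_layers):
        frac = i / num_layers
        step = 4 if frac < 1 / 3 else (2 if frac < 2 / 3 else 1)
        if i % step == 0:
            chosen.append(i)
    return chosen

def forward_with_conditioning(tokens, question_emb, layers):
    """Run ViT-like layers, adding a question embedding (already projected
    to the visual dimension) before the layers selected by the schedule.
    Additive injection is one simple choice; the actual mechanism may differ."""
    schedule = set(conditioning_schedule(len(layers)))
    for i, layer in enumerate(layers):
        if i in schedule:
            tokens = tokens + question_emb  # inject question information
        tokens = layer(tokens)
    return tokens
```

For a 12-layer ViT this schedule conditions layers 0, 4, 6, 8, 9, 10, 11 — once in the first third, twice in the middle third, and at every layer in the final third.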


12 of 23

Method 1: Progressive Question Conditioning

As the early layers of a ViT are unstable while the later layers are more stable, we progressively increase the question-conditioning frequency, considering each layer's capacity to handle additional information.

As question information is injected into the ViT, the resulting visual tokens are more informative!


13 of 23

Method 2: Norm-based Pruning with Token Focusing

We observe that the L2 norm of visual tokens effectively detects 'background' tokens. In other words, tokens with low norms typically correspond to background regions, which are pruned for efficiency.

Norm-based token pruning (red box)
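Norm-based pruning can be sketched as follows; `keep_ratio` is a hypothetical hyperparameter for illustration, and the paper's actual selection rule may differ.

```python
import numpy as np

def norm_based_prune(tokens, keep_ratio=0.5):
    """Prune the visual tokens with the lowest L2 norms (sketch).

    tokens: (num_tokens, dim) array of visual token features.
    Returns (retained_tokens, retained_idx, pruned_idx)."""
    norms = np.linalg.norm(tokens, axis=-1)  # L2 norm per token
    k = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-norms)               # indices sorted by descending norm
    retained_idx = np.sort(order[:k])        # high-norm (content) tokens
    pruned_idx = np.sort(order[k:])          # low-norm (background) tokens
    return tokens[retained_idx], retained_idx, pruned_idx
```

Low-norm tokens are dropped before they reach the LLM, so the quadratic attention cost is paid only for the retained content tokens.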


14 of 23

Method 2: Norm-based Pruning with Token Focusing

However, we observe that crucial information is stored in the 'background' tokens that are to be pruned.

Results of norm-based token pruning (top) and inference results with retained/pruned tokens

Although background tokens are pruned as shown in the figure, an accuracy of 13.5 is still achieved with those pruned tokens alone, indicating the presence of crucial information in them.


15 of 23

Method 2: Norm-based Pruning with Token Focusing

To concentrate crucial information into the retained tokens, token focusing suppresses correct predictions made with pruned tokens and promotes correct predictions made with retained tokens.

Token Focusing (red box)
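The token-focusing objective can be sketched as a simple scalar loss. The function below is an illustration under assumptions: `alpha` is a hypothetical weighting term, and the two inputs are the log-likelihoods of the correct answer under the retained-only and pruned-only token sets.

```python
def token_focusing_loss(logp_retained, logp_pruned, alpha=1.0):
    """Sketch of a token-focusing objective.

    logp_retained: log-likelihood of the correct answer when the LLM
                   sees only retained tokens (promoted).
    logp_pruned:   log-likelihood of the correct answer when the LLM
                   sees only pruned tokens (suppressed).
    """
    promote = -logp_retained        # standard NLL on retained tokens
    suppress = alpha * logp_pruned  # penalize high likelihood under pruned tokens
    return promote + suppress
```

Minimizing this loss pushes the answer likelihood up for the retained tokens and down for the pruned tokens, so the information needed to answer migrates into the tokens that survive pruning.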


16 of 23

Method 2: Norm-based Pruning with Token Focusing

To concentrate crucial information into the retained tokens, token focusing suppresses correct predictions made with pruned tokens and promotes correct predictions made with retained tokens.

Token Focusing (red box)

With norm-based pruning and token focusing, a compact set of visual tokens with minimal information loss is retained.


17 of 23

Results on Table Question-Answering Tasks

TabFlash (3B) achieves state-of-the-art results, outperforming both proprietary and open-source MLLMs.

Comparison of performance on seven table QA benchmarks


18 of 23

Cost efficiency of TabFlash

TabFlash (3B) achieves the best performance with 27% lower FLOPs than the second-best model (InternVL-2.5 3B).

TabFlash (1B) outperforms most MLLMs with exceptionally low FLOPs and memory usage.

Performance-cost tradeoff table (left) and figure (right)


19 of 23

Progressive Question Conditioning Successfully Injects Question Information into the ViT

Improper conditioning can even degrade performance; the proposed conditioning achieves the best results.

Performance by conditioning strategy


20 of 23

Token Focusing Concentrates Crucial Information to Retained Tokens

Without token focusing, crucial information remains stored in pruned tokens. Token focusing effectively concentrates this information into the retained tokens by suppressing correct predictions made with the pruned ones.

Performance by inference token set with (top) and without (bottom) token focusing loss


21 of 23

Results on other table understanding tasks

TabFlash also shows superior performance on other table understanding tasks (i.e., table fact verification and table-to-text generation).

Comparison on table fact verification and table-to-text generation tasks


22 of 23

Qualitative Example

Attention visualization results show that TabFlash allocates high attention scores to question-relevant regions.

Attention visualization of TabFlash and the baseline. Red denotes high attention scores.


23 of 23

Conclusion

  • Previous MLLMs utilize uninformative and redundant visual tokens, resulting in low performance and high computational cost on table understanding tasks.
  • We propose progressive question conditioning, which injects question information into ViT.

  • We also propose norm-based pruning with token focusing, which discards background tokens while concentrating crucial information into the retained tokens.

  • Combining both methods, we propose TabFlash, which produces informative and compact visual tokens, thereby achieving state-of-the-art performance with less computation.

GitHub

Paper
