TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing
Presented at AAAI 2026 (Main Technical Track)
Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim
Korea University
MLV Lab
NeurIPS 2024
Table Image Understanding with MLLMs
MLLMs are widely adopted for table image understanding tasks (e.g., question-answering).
[1] Zheng et al., "Multimodal Table Understanding", ACL 2024 (main)
Unique Challenges in Table Image Understanding
However, previous MLLMs do not consider the unique characteristics of table images.
Problem 1: Only a small region within a large table image is relevant to the question.
Question: What is the time difference between the runners in lane 4 and lane 9?
Answer: 0.81 (20.57 − 19.76)
As a table contains a lot of information, the question is crucial for effective understanding. However, the vision encoders of previous MLLMs are not aware of the question.
Unique Challenges in Table Image Understanding
However, previous MLLMs do not consider the unique characteristics of table images.
Problem 2: A large portion of a table image is redundant.
White background areas, which contain no information, make up a large portion of a table image. These backgrounds are nevertheless converted to visual tokens, resulting in excessive computational cost in the LLM.
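To get a feel for the cost, a back-of-the-envelope token count (the patch size, image size, and blank ratio below are illustrative assumptions, not TabFlash's actual settings):

```python
# Back-of-the-envelope: why blank background patches are costly.
# All numbers here are illustrative assumptions.
patch = 14                       # assumed ViT patch size
side = 1008                      # assumed (padded) image side length
tokens = (side // patch) ** 2    # visual tokens fed to the LLM
print(tokens)                    # 5184

# If, say, 60% of patches were blank background, pruning them would
# drop the visual token count (and the LLM's per-token cost) by 60%.
blank_ratio = 0.6
remaining = round(tokens * (1 - blank_ratio))
print(remaining)
```

Since LLM self-attention cost grows quadratically with sequence length, removing background tokens pays off more than linearly.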
Problem: Uninformative and redundant visual tokens
Visual tokens generated by previous MLLMs are uninformative (question-agnostic) and redundant (empty tokens), resulting in low performance and high computational cost.
TabFlash utilizes informative and compact visual tokens!
As a solution, we propose the TabFlash architecture, which generates informative (question-aware) and compact (content tokens only) visual tokens, achieving state-of-the-art performance with less computation.
Method 1: Progressive Question Conditioning
To generate question-aware visual tokens, we inject question information into the vision encoder (i.e., a ViT).
Question injection strategy (red box)
Method 1: Progressive Question Conditioning
As the early layers of a ViT are unstable while the later layers are more stable, we progressively increase the question conditioning frequency, considering each layer's capacity to handle additional information.
Progressive Question Conditioning (red box)
As question information is injected into the ViT, the resulting visual tokens are more informative!
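The progressive schedule can be sketched as a function that picks which ViT layers receive the question embedding. The doubling-gap rule below is a hypothetical schedule chosen only to illustrate "sparse early, dense late"; it is not the exact recipe from the paper.

```python
def conditioning_layers(num_layers: int) -> list[int]:
    """Pick ViT layer indices that receive question conditioning.

    Walk backwards from the last layer, doubling the gap each step,
    so conditioning is dense in the stable later layers and sparse
    in the unstable early ones. (Hypothetical schedule.)
    """
    layers, idx, gap = [], num_layers - 1, 1
    while idx >= 0:
        layers.append(idx)
        gap *= 2
        idx -= gap
    return sorted(layers)

# For a 24-layer ViT, conditioning lands on layers 9, 17, 21, 23:
print(conditioning_layers(24))   # [9, 17, 21, 23]
```

The gaps between conditioned layers (8, 4, 2) shrink with depth, which is the "progressive" behavior the text describes.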
Method 2: Norm-based Pruning with Token Focusing
We observe that the L2 norm of visual tokens effectively detects 'background' tokens. In other words, tokens with low norms typically correspond to background regions, which are pruned for efficiency.
Norm-based token pruning (red box)
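A minimal sketch of the pruning step, assuming visual tokens arrive as an (N, d) array and a fixed keep ratio (the actual criterion and ratio used in TabFlash may differ):

```python
import numpy as np

def prune_by_norm(tokens: np.ndarray, keep_ratio: float = 0.5):
    """Keep the tokens with the largest L2 norms.

    tokens: (N, d) array of visual tokens.
    Returns (retained, pruned) index arrays; low-norm tokens,
    which typically correspond to blank background, are pruned.
    """
    norms = np.linalg.norm(tokens, axis=1)   # (N,) per-token L2 norms
    k = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(-norms)               # indices, descending by norm
    return np.sort(order[:k]), np.sort(order[k:])

# Toy example: two near-zero "background" tokens, two content tokens.
toks = np.array([[0.01, 0.0], [3.0, 4.0], [0.0, 0.02], [1.0, 2.0]])
kept, pruned = prune_by_norm(toks, keep_ratio=0.5)
print(kept, pruned)   # [1 3] [0 2]
```

Only the retained token indices need to be passed on to the LLM, which is where the compute savings come from.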
Method 2: Norm-based Pruning with Token Focusing
However, we observe that crucial information is stored in the 'background' tokens to be pruned.
Results of norm-based token pruning (top) and inference results with retained/pruned tokens (bottom)
Although the background tokens are pruned as shown in the figure, an accuracy of 13.5 is still achieved with those pruned tokens alone, indicating the presence of crucial information in them.
Method 2: Norm-based Pruning with Token Focusing
To concentrate crucial information into the retained tokens, token focusing suppresses correct predictions made with the pruned tokens and promotes correct predictions made with the retained tokens.
Token Focusing (red box)
With norm-based pruning and token focusing, a compact set of visual tokens with minimal information loss is retained.
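One plausible way to write such an objective (a hypothetical formulation for illustration; the paper's exact loss may differ) is cross-entropy on the retained-token prediction minus a down-weighted cross-entropy on the pruned-token prediction, so correct answers from retained tokens are promoted while correct answers from pruned tokens are suppressed:

```python
import numpy as np

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """Standard softmax cross-entropy for a single example."""
    z = logits - logits.max()                # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def token_focusing_loss(logits_retained, logits_pruned, label, lam=0.1):
    """Promote correct prediction from retained tokens and suppress it
    from pruned tokens (hypothetical formulation, lam is illustrative)."""
    return (cross_entropy(logits_retained, label)
            - lam * cross_entropy(logits_pruned, label))

# Toy check: loss is lower when the retained tokens predict the label well.
good = token_focusing_loss(np.array([5.0, 0.0]), np.array([0.0, 0.0]), label=0)
bad  = token_focusing_loss(np.array([0.0, 5.0]), np.array([0.0, 0.0]), label=0)
print(good < bad)   # True
```

Minimizing this pushes question-relevant information out of soon-to-be-pruned tokens and into the retained set.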
Results on Table Question-Answering Tasks
TabFlash (3B) achieves state-of-the-art results, outperforming both proprietary and open-source MLLMs.
Comparison of task performance on seven table QA benchmarks
Cost efficiency of TabFlash
TabFlash (3B) achieves the best performance with 27% lower FLOPs than the second-best model (InternVL-2.5 3B).
TabFlash (1B) outperforms most MLLMs with exceptionally low FLOPs and memory usage.
Performance-cost tradeoff table (left) and figure (right)
Progressive Question Conditioning Successfully Injects Question Information to ViT
Improper conditioning can even degrade performance; the proposed conditioning achieves the best results.
Performance by conditioning strategy
Token Focusing Concentrates Crucial Information to Retained Tokens
Without token focusing, crucial information is still stored in the pruned tokens. Token focusing effectively concentrates that information into the retained tokens by suppressing correct predictions with the pruned ones.
Performance by inference token set with (top) and without (bottom) token focusing loss
Results on other table understanding tasks
TabFlash also shows superior performance on other table understanding tasks (e.g., table fact verification and table-to-text generation).
Comparison on table fact verification and table-to-text generation tasks
Qualitative Example
Attention visualization shows that TabFlash allocates high attention scores to question-relevant regions.
Attention visualization of TabFlash and the baseline. Red denotes high attention scores.
Conclusion
Github
Paper