X-ViT: High Performance Linear Vision Transformer without Softmax
Jeonggeun Song, Heung-Chang Lee (Kakao Enterprise)
Abstract
We propose X-ViT, a vision transformer (ViT) with a novel self-attention (SA) mechanism that has linear complexity. The main idea of this work is to eliminate the softmax nonlinearity from the original SA, which lets us factorize the matrix multiplication of the SA mechanism without a complicated linear approximation. By modifying only a few lines of code in the original SA, the proposed models outperform most transformer-based models on image classification and dense prediction tasks across most capacity regimes.
Comparison with the existing SA
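As a rough illustration of how the existing softmax SA and the proposed softmax-free SA differ in cost, the PyTorch sketch below contrasts the two association orders. The function names and the L2 normalization of queries and keys are illustrative assumptions rather than the paper's exact formulation; the point is only that, once the softmax is removed, (QK^T)V can be evaluated as Q(K^T V), whose cost is linear in the number of tokens.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    """Standard SA: softmax(QK^T / sqrt(d)) V.
    Materializes an (N x N) attention map, so time and memory grow quadratically in N."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale    # (B, H, N, N)
    attn = attn.softmax(dim=-1)
    return attn @ v                             # (B, H, N, d)

def softmax_free_attention(q, k, v):
    """Softmax-free SA: with the nonlinearity gone, the product is re-associated
    as Q (K^T V). A (d x d) context matrix replaces the (N x N) map, so the cost
    is linear in N. The L2 normalization here is an illustrative assumption,
    not necessarily the normalization used in the paper."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    context = k.transpose(-2, -1) @ v           # (B, H, d, d)
    return q @ context                          # (B, H, N, d)

# Shapes: batch 2, 4 heads, 196 tokens (14x14 patches), head dimension 64.
q, k, v = (torch.randn(2, 4, 196, 64) for _ in range(3))
print(softmax_free_attention(q, k, v).shape)    # torch.Size([2, 4, 196, 64])
```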
Conclusion
In this paper, we proposed a simple method that gives SA linear complexity without loss of performance. By replacing the softmax function, we removed the quadratic operation through the associative law of matrix multiplication. Factorizations of this kind have typically caused performance degradation in earlier studies, yet our X-ViT models perform competitively with or better than earlier models on general tasks. With structures further optimized for dense prediction, we expect our models to become even more efficient and to perform better.
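For reference, the factorization mentioned above is simply the associativity of matrix products, which becomes applicable once the softmax is removed; with N tokens and head dimension d, the two association orders differ only in cost:

\[
(QK^\top)V = Q(K^\top V), \qquad O(N^2 d) \;\longrightarrow\; O(N d^2).
\]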
Figure 2: Comparison of computational resource consumption at high input resolutions. (a) Peak allocated GPU memory vs. input resolution. (b) GPU throughput vs. input resolution. The GPU throughput axis is log2-scaled.
Table 1: Results of fine-tuning at higher resolutions.
Figure 3: Comparison with transformer-based vision models. Our models show superior performance across most FLOPs and parameter-size regimes.