MAT: Mask-Aware Transformer for Large Hole Image Inpainting
Wenbo Li1, Zhe Lin2, Kun Zhou3, Lu Qi1, Yi Wang4, Jiaya Jia1
1The Chinese University of Hong Kong
2Adobe Inc.
3The Chinese University of Hong Kong (Shenzhen)
4Shanghai AI Laboratory
Image Inpainting
MAT: Mask-aware Transformer for (1) Large holes, (2) High-resolution, (3) Diverse outputs
Introduction
Contributions
Framework
Multi-Head Contextual Attention
U: mask updating (a token is marked valid once it aggregates information from at least one valid token)
S: window shifting (shifted windows let information propagate across window boundaries)
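The masked attention and the mask-updating rule above can be sketched as follows. This is a minimal, non-windowed simplification in PyTorch, not the released code: function names are illustrative, and with global attention the updated mask becomes all-valid in one step whenever any valid token exists (the paper applies the rule per shifted window).

```python
import torch

def masked_attention(q, k, v, mask, neg_value=-1e4):
    """Attention that aggregates only from valid (unmasked) tokens.

    q, k, v: (B, N, C) token features; mask: (B, N) with 1 = valid, 0 = hole.
    Keys at invalid positions get a large negative score so softmax ignores them.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale           # (B, N, N) scores
    attn = attn.masked_fill(mask[:, None, :] == 0, neg_value)
    attn = attn.softmax(dim=-1)
    return attn @ v                                    # (B, N, C)

def update_mask(mask):
    """Mark every token valid once it has attended to at least one valid token.

    With this global (non-windowed) sketch, a single valid token anywhere
    validates all queries in that sample after one attention step.
    """
    has_valid = (mask.sum(dim=1, keepdim=True) > 0).float()  # (B, 1)
    return torch.maximum(mask, has_valid.expand_as(mask))
```

In the windowed setting, the same rule is applied within each (shifted) window, so validity spreads outward from known regions block by block instead of in a single step.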
Adjusted Transformer Block
Compared with the conventional transformer block[1], we remove layer normalization and replace the residual shortcut with feature concatenation followed by a fully connected layer:
The conventional transformer block[1]
Our transformer stage and block
[1] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." In ICLR, 2021.
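A minimal sketch of such an adjusted block, assuming the two changes named above (no layer normalization; concatenation + FC fusion in place of the attention residual). The class and layer names are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AdjustedBlock(nn.Module):
    """Transformer block without LayerNorm, where the attention residual
    shortcut is replaced by concatenation + fully connected fusion."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # fusion layer: combines input and attention output instead of x + attn(x)
        self.fuse = nn.Linear(2 * dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):            # x: (B, N, C)
        a, _ = self.attn(x, x, x)
        # no LayerNorm, no additive residual: concatenate and fuse
        x = self.fuse(torch.cat([x, a], dim=-1))
        return x + self.mlp(x)
```

The motivation in the paper is stability when training with large masks, where normalization statistics and residual paths dominated by invalid tokens can destabilize optimization.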
Style Manipulation Module
We use a style code to manipulate the output by modulating and demodulating the convolution weights[1-3]:
[1] Chen, Ting, et al. "On self modulation for generative adversarial networks." In ICLR, 2019.
[2] Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." In CVPR, 2019.
[3] Karras, Tero, et al. "Analyzing and improving the image quality of stylegan." In CVPR, 2020.
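The weight modulation and demodulation from [3] can be sketched as follows. The style code scales each input channel of the convolution weight, then per-sample demodulation rescales each output filter to unit norm (function name and shapes here are illustrative):

```python
import torch

def modulate_conv_weight(weight, style, demodulate=True, eps=1e-8):
    """StyleGAN2-style weight (de)modulation.

    weight: (out_ch, in_ch, kh, kw) shared conv weight.
    style:  (B, in_ch) per-sample scales predicted from the style code.
    Returns per-sample weights of shape (B, out_ch, in_ch, kh, kw).
    """
    # modulate: scale each input channel by the style code
    w = weight[None] * style[:, None, :, None, None]
    if demodulate:
        # demodulate: normalize each output filter to unit L2 norm per sample
        d = torch.rsqrt(w.pow(2).sum(dim=[2, 3, 4]) + eps)   # (B, out_ch)
        w = w * d[:, :, None, None, None]
    return w
```

In practice the per-sample weights are applied with a grouped convolution over the flattened batch, which is how StyleGAN2 implements this efficiently.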
SOTA performance on both small and large masks on Places and CelebA-HQ
When trained on the full Places dataset (8M images), MAT achieves further significant improvements.
Thanks for listening!
https://github.com/fenglinglwb/MAT