1 of 18

CV-8: Multi-task models. Examples of solving the visual question answering (VQA) task.

Dmitry Yudin

Candidate of Technical Sciences (PhD),

Senior Researcher at AIRI,

Head of the Intelligent Transport Laboratory at MIPT

@yuddim


2 of 18

Multi-task learning in neural network models: concepts

A. Training a separate output module ("head") for each task (Task-specific head)

B. Learned task embeddings (Learned Task Embedding)

C. Mixed methods (Mixed Methods)

Source: A. Petyushko. Efficient multimodal multi-task models, 2021

3 of 18

Task-specific output modules

(Task-specific head)
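The head-per-task idea (concept A) can be sketched as follows. The backbone, head shapes, and task names are illustrative placeholders, and plain NumPy stands in for a real deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared backbone: a single linear feature extractor used by every task.
W_backbone = rng.normal(size=(16, 8))

# Task-specific heads: one small output layer ("head") per task.
heads = {
    "classification": rng.normal(size=(8, 10)),  # 10 class logits
    "regression": rng.normal(size=(8, 1)),       # 1 scalar output
}

def forward(x, task):
    """Run the shared backbone, then the head selected by the task name."""
    features = np.tanh(x @ W_backbone)  # shared representation
    return features @ heads[task]       # task-specific output

x = rng.normal(size=(4, 16))               # batch of 4 inputs
print(forward(x, "classification").shape)  # (4, 10)
print(forward(x, "regression").shape)      # (4, 1)
```

During training, gradients from every task's loss flow into the shared backbone, while each head is updated only by its own task.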

4 of 18

Current trends: TAG [1]: task grouping via similar gradient updates

[1] Fifty, Christopher, et al. "Efficiently identifying task groupings for multi-task learning." 2021 (Google)

Source: A. Petyushko. Efficient multimodal multi-task models, 2021
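TAG itself measures inter-task affinity by how a gradient step on one task changes the loss of another; as a rough illustration of the underlying intuition, the sketch below groups tasks by cosine similarity of their (hypothetical, hand-made) gradients on shared parameters:

```python
import numpy as np

def cosine(g1, g2):
    """Cosine similarity between two gradient vectors."""
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

# Hypothetical per-task gradients of the shared parameters.
grads = {
    "detection": np.array([0.9, 0.1, -0.2]),
    "segmentation": np.array([0.8, 0.2, -0.1]),
    "captioning": np.array([-0.7, 0.5, 0.4]),
}

# Pairwise affinity: tasks with consistently similar gradient directions
# are candidates for training together in one multi-task group.
tasks = list(grads)
for i, a in enumerate(tasks):
    for b in tasks[i + 1:]:
        print(a, b, round(cosine(grads[a], grads[b]), 2))
```

Here detection and segmentation have nearly aligned gradients (a good grouping candidate), while captioning pulls the shared weights in an opposing direction.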

5 of 18

TraDeS: a multi-task model for 3D object tracking in video

Wu, Jialian, et al. "Track to detect and segment: An online multi-object tracker." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

6 of 18

TraDeS: a multi-task model for 3D object tracking in video

7 of 18

Learned task embeddings

(Learned Task Embedding)

8 of 18


Easy example

Source: Verlyn Fischer. The Utility of Task Embeddings. Training Adaptable Neural Networks, 2021

https://towardsdatascience.com/the-utility-of-task-embeddings-e00a18133f77

Neural networks learn to map points in one space to points in another. A task embedding allows a single network to learn several different mappings.

Only one output neuron (rather than N, one per category)

Classification accuracy on ten digits for a network trained on only digits 0 to 6.

Task embedding for the MNIST example:

  • choose a randomly selected image for each task,
  • compute a compact representation of the selected image for each task using PCA.

Concatenation
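The MNIST construction above (task embedding concatenated with the input, a single output neuron) can be sketched like this; the feature and embedding sizes are arbitrary, and random vectors stand in for the PCA-compressed class images:

```python
import numpy as np

rng = np.random.default_rng(1)

FEAT, EMB = 64, 8  # hypothetical input-feature and task-embedding sizes

# One embedding per task ("is this digit a 3?", "is this digit a 7?", ...).
# In the blog example these come from PCA of a randomly chosen class image;
# here they are random placeholders.
task_embeddings = {d: rng.normal(size=EMB) for d in range(10)}

# A single-output network: weights for [features ; task embedding] -> 1 score.
W = rng.normal(size=FEAT + EMB)

def score(x, digit):
    """Concatenate input features with the task embedding; one output neuron."""
    z = np.concatenate([x, task_embeddings[digit]])
    return float(1 / (1 + np.exp(-(z @ W))))  # sigmoid "is it this digit?"

x = rng.normal(size=FEAT)
print(score(x, 3))  # a probability in (0, 1)
```

Because the task is an input rather than a choice of output neuron, the same network can in principle be queried for tasks it saw rarely or never during training, which is what the digits-7-to-9 experiment probes.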

9 of 18


More complicated example: UniT

UniT [1]: task fusion via multi-head self-attention and cross-attention

[1] Hu, Ronghang, and Amanpreet Singh. "UniT: Multimodal Multitask Learning with a Unified Transformer." 2021 (Facebook).

Performance of UniT with multi-task training on object detection and VQA. "Shared" denotes joint training on COCO + VG + VQAv2.

The effect of the task embedding is small, but the approach is promising.
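A minimal sketch of the UniT-style decoding idea: learned task-specific query embeddings attend over shared encoder features via cross-attention. Single-head NumPy attention replaces the real multi-head transformer decoder, and all dimensions and task names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 16  # hypothetical model dimension

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Encoded image/text features from the shared encoder (8 tokens).
memory = rng.normal(size=(8, D))

# Learned task-specific query embeddings: detection and VQA each
# get their own query set, as in UniT's decoder.
queries = {"detection": rng.normal(size=(4, D)),
           "vqa": rng.normal(size=(2, D))}

def cross_attend(task):
    """Single-head cross-attention of task queries over encoder memory."""
    Q = queries[task]
    attn = softmax(Q @ memory.T / np.sqrt(D))  # (num_queries, num_tokens)
    return attn @ memory                       # task-conditioned features

print(cross_attend("detection").shape)  # (4, 16)
print(cross_attend("vqa").shape)        # (2, 16)
```

The task-conditioned outputs then go to task-specific heads (box prediction for detection, an answer classifier for VQA).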

10 of 18


Adapters for task embedding fusion: Hyperformer

Mahabadi, Rabeeh Karimi, et al. "Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks." arXiv preprint arXiv:2106.04489 (2021).

  • Hyperformer adapter architecture. The authors insert adapter modules after the two feed-forward layers. The adapter hypernetwork h_A^l produces the weights (U_τ^l and D_τ^l) for the task-specific adapter modules, conditioned on an input task embedding I_τ. Similarly, the layer-normalization hypernetwork h_LN^l generates the conditional layer-normalization parameters (β_τ and γ_τ).
  • During training, the authors update only the layer normalizations in T5, the hypernetworks, and the task embeddings.
  • The compact Hyperformer++ shares the same hypernetworks across all layers and tasks and computes the task embedding from the task, the layer id, and the position of the adapter module.
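The hypernetwork idea can be sketched as follows: a task embedding I_τ is mapped to the adapter's down- and up-projection weights D_τ and U_τ. The sizes, task names, and purely linear hypernetworks are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
D, B, E = 32, 4, 8  # hidden size, adapter bottleneck, task-embedding size

# Shared hypernetworks: map a task embedding I_tau to the flattened
# down-projection D_tau (D x B) and up-projection U_tau (B x D).
H_down = rng.normal(size=(E, D * B)) * 0.1
H_up = rng.normal(size=(E, B * D)) * 0.1

task_embeddings = {"sst2": rng.normal(size=E), "mnli": rng.normal(size=E)}

def adapter(h, task):
    """Task-conditioned residual adapter: h + relu(h @ D_tau) @ U_tau."""
    I = task_embeddings[task]
    D_tau = (I @ H_down).reshape(D, B)  # per-task weights, generated on the fly
    U_tau = (I @ H_up).reshape(B, D)
    return h + np.maximum(h @ D_tau, 0) @ U_tau

h = rng.normal(size=(2, D))
print(adapter(h, "sst2").shape)  # (2, 32)
```

Only the hypernetworks and task embeddings need to be trained, so adding a new task costs one new embedding vector instead of a full set of adapter weights.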

11 of 18


Some Ideas: Task embedding for image segmentation speedup

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers https://arxiv.org/pdf/2105.15203.pdf

Problems:

  • For common datasets (e.g. the full COCO-Stuff dataset, whose images cover 172 categories/classes), the output tensor is large!
  • The output mask resolution is lower than that of the input image.

Problem solution:
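One possible reading of the proposed solution, sketched under assumptions: instead of materializing an H × W × 172 output tensor, keep one learned embedding per class and decode a mask only for the classes actually queried. All shapes are placeholders, and the dot-product decoder is a stand-in for a real segmentation head:

```python
import numpy as np

rng = np.random.default_rng(4)
C, H, W, E = 172, 16, 16, 8  # classes, mask size, embedding size (hypothetical)

# Per-pixel features from a segmentation backbone (e.g. a SegFormer encoder).
pixel_features = rng.normal(size=(H, W, E))

# One learned embedding per class; the decoder is queried with the
# embeddings of only the classes we actually need.
class_embeddings = rng.normal(size=(C, E))

def masks_for(class_ids):
    """Dot-product decoding: one H x W logit map per requested class."""
    q = class_embeddings[class_ids]  # (k, E)
    return np.einsum("hwe,ke->khw", pixel_features, q)

# Query only 3 of the 172 classes instead of materializing H x W x 172.
print(masks_for([0, 5, 41]).shape)  # (3, 16, 16)
```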

12 of 18


Some Ideas: Task embedding for skill fusion of robot with manipulator

Problems:

  • A separate model for each task,
  • No possibility of synchronization between tasks

Task 1: OpenCabinetDoor

Task 2: OpenCabinetDrawer

13 of 18


Some Ideas: Task embedding for skill fusion of robot with manipulator

Task ID: OpenCabinetDoor, OpenCabinetDrawer

Problem solution (we avoid model duplication and add synchronization):
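A minimal sketch of this solution: one shared policy conditioned on a task-ID vector replaces the two per-task models. The observation/action sizes, the one-hot task encoding, and the single linear policy layer are all simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

TASKS = ["OpenCabinetDoor", "OpenCabinetDrawer"]
OBS, ACT = 24, 7  # hypothetical observation and action dimensions

# One-hot task IDs appended to the observation: a single policy
# replaces two separate per-task models.
task_id = {t: np.eye(len(TASKS))[i] for i, t in enumerate(TASKS)}
W_policy = rng.normal(size=(OBS + len(TASKS), ACT)) * 0.1

def act(obs, task):
    """Shared policy conditioned on the task ID."""
    z = np.concatenate([obs, task_id[task]])
    return np.tanh(z @ W_policy)  # bounded joint action

obs = rng.normal(size=OBS)
print(act(obs, "OpenCabinetDoor").shape)  # (7,)
```

Because both skills live in one model, experience from one task can shape the shared weights used by the other.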

14 of 18


Some Ideas: Task embedding for visual question answering

Problem: poor quality for some question types, for example counting (e.g. "How many people are in the image?")

Task ID: Question type: counting, yes/no, category, relations

Problem solution:
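One way to sketch this idea: embed the question type and concatenate it with the fused image-question features before the answer head. All sizes and the 100-entry answer vocabulary are placeholders:

```python
import numpy as np

rng = np.random.default_rng(6)

QUESTION_TYPES = ["counting", "yes/no", "category", "relations"]
FUSED, E = 32, 4  # fused image-question feature size, type-embedding size

# One learned embedding per question type (the "task ID" on this slide).
type_embeddings = {t: rng.normal(size=E) for t in QUESTION_TYPES}
W_answer = rng.normal(size=(FUSED + E, 100))  # 100 hypothetical answers

def answer_logits(fused_features, question_type):
    """Condition the answer head on the question-type embedding."""
    z = np.concatenate([fused_features, type_embeddings[question_type]])
    return z @ W_answer

fused = rng.normal(size=FUSED)
print(answer_logits(fused, "counting").shape)  # (100,)
```

The hope is that the type embedding lets the shared head specialize its behavior for hard question types such as counting without training a separate model per type.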

15 of 18


Some Ideas: Task embedding for interpretable visual question answering

Problem: in the answering procedure, each detected object is checked by an independent estimator for every property type (color, size, etc.). Hence, the number of estimators equals the number of property types.

Problem solution: Use a single estimator (CLIP, ViLBERT) with property type embedding

(Figure: a graph over the detected objects is used to produce the answer)
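The single-estimator idea can be sketched as follows; the shared estimator here is a toy logistic scorer rather than CLIP or ViLBERT, and all sizes are placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)
F, E = 16, 4  # object-feature and property-embedding sizes (hypothetical)

PROPERTY_TYPES = ["color", "size", "shape", "material"]
prop_embeddings = {p: rng.normal(size=E) for p in PROPERTY_TYPES}

# One shared estimator instead of one independent estimator per property type.
W_est = rng.normal(size=F + E) * 0.1

def property_score(obj_feat, prop_type):
    """Score that a detected object matches the queried property type."""
    z = np.concatenate([obj_feat, prop_embeddings[prop_type]])
    return float(1 / (1 + np.exp(-(z @ W_est))))

obj = rng.normal(size=F)
scores = {p: property_score(obj, p) for p in PROPERTY_TYPES}
print(scores)  # four scores, all from the same shared estimator
```

Adding a new property type then only requires a new embedding, not a new estimator network.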

16 of 18


Some Ideas: Task embedding usage with Vector Symbolic Architecture (VSA)

Peer Neubert. Using VSA/HDC for designing image descriptors. VSAONLINE workshop, 2022. https://youtu.be/qM6ql1nxql8
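A small self-contained illustration of the VSA mechanics that make task embeddings attractive here: binding (elementwise multiplication of bipolar hypervectors) produces a vector nearly orthogonal to its inputs, and unbinding with the same task vector recovers the descriptor exactly:

```python
import numpy as np

rng = np.random.default_rng(8)
DIM = 10_000  # high-dimensional bipolar hypervectors

def hv():
    """Random bipolar (+1/-1) hypervector."""
    return rng.choice([-1, 1], size=DIM)

def sim(a, b):
    """Normalized dot-product similarity."""
    return float(a @ b) / DIM

task_door, task_drawer = hv(), hv()
descriptor = hv()

# Binding (elementwise multiply) yields a vector dissimilar to both inputs;
# multiplying by the same task vector again undoes the binding exactly,
# because every component of a bipolar vector squares to 1.
bound = task_door * descriptor
print(round(sim(bound, descriptor), 2))              # near 0
print(round(sim(task_door * bound, descriptor), 2))  # exactly 1.0
```

This lets a single vector hold several task-conditioned descriptors superimposed, with the task hypervector acting as the key for retrieval.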

17 of 18

Examples of solving the visual question answering (VQA) task

Alexander Korchemny, intern at the MIPT Center for Cognitive Modeling

https://github.com/Yessense/airi_cv_2022/tree/master/seminar_8

18 of 18

Artificial Intelligence Research Institute

airi.net