1 of 35

Which Tasks Should Be Learned Together in Multi-task Learning?

Trevor Standley, Amir Zamir, Dawn Chen,

Leonidas Guibas, Jitendra Malik, Silvio Savarese

http://taskgrouping.stanford.edu


2 of 35

Multi-Task Learning

  • Decreased training time
  • Decreased inference time
  • More compact models
  • Increased prediction accuracy
  • Increased sample efficiency
  • Better learned representations


[Figure: an input image with example labels for semantic segmentation, depth estimation, surface normals, SURF keypoints, and Canny edges]


3 of 35

Encoder Decoder Architectures

[Diagram: single-task encoder-decoder networks vs. a multi-task encoder-decoder network]
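To make the contrast concrete, below is a minimal PyTorch-style sketch of the multi-task encoder-decoder idea: one shared encoder whose cost is paid once, plus a lightweight decoder per task. The layer sizes, task names, and output channel counts are illustrative only, not the architecture used in the paper; a single-task network is the special case with exactly one decoder.

```python
import torch.nn as nn

# A minimal sketch (not the paper's exact architecture): shared encoder,
# one decoder head per task.
class MultiTaskNet(nn.Module):
    def __init__(self, task_out_channels):
        super().__init__()
        # Shared encoder: its cost is paid once no matter how many tasks
        # are attached, which is where the multi-task inference savings come from.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One lightweight decoder per task, e.g. {"segmentation": 20, "depth": 1}.
        self.decoders = nn.ModuleDict({
            task: nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
            )
            for task, out_ch in task_out_channels.items()
        })

    def forward(self, x):
        features = self.encoder(x)
        return {task: dec(features) for task, dec in self.decoders.items()}

# A single-task network is just the special case with one decoder:
# MultiTaskNet({"depth": 1}).
```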


4 of 35

Negative Transfer

  • Often, training each task with a separate network works best
    • Even much smaller independent networks often outperform a large multi-task network


  • Why?
    • Tasks may learn at different rates
    • One task may dominate learning
    • Gradients may interfere
    • The optimization landscape may be more difficult

  • Intuitively, the relationship between learned tasks is one of the factors that determine how well a multi-task network performs


5 of 35

Prior Work

  • Task Relationships
    • For transfer learning [1] (compared in paper) [1] Zamir'18.
    • For natural language processing [2] [2] Bingel'17.
    • Cross-Task Consistency [3] [3] Zamir'20.
  • Multi-task loss weighting
    • To align gradients [4] (compared in paper) [4] Sener'18.
    • Based on uncertainty [5] [5] Kendall'18.
    • To balance gradient influence [6] (compared in paper) [6] Chen'18.
    • Comparison of loss weighting strategies [7,8] (We confirm results) [7] Gong'19. [8] Leang'20.
  • Architectural approaches to MTL
    • [9,10,11,12,13, and others]


[9] Duong'15. [10] Kokkinos'17. [11] Misra'16. [12] Liu'19. [13] Maninis'19.


6 of 35

Contributions

  1. We provide an empirical study of a number of factors that influence multi-task learning
    1. Network size
    2. Dataset size
    3. How tasks influence one another when learned together
  2. We introduce a task grouping framework that, when given a set of tasks and a computational budget, empirically discovers which tasks to group together to minimize negative transfer.


7 of 35

Large-Scale Multi-Task Dataset

  • Most datasets used in Multi-Task studies are overly simplistic.
    • Small: often < 200k datapoints
    • Few tasks: < 4 tasks
    • Tasks are often artificial: e.g., MultiMNIST, with only classification "tasks", etc.
  • We found MTL experimental trends on such datasets unrepresentative.
  • We use the Taskonomy dataset
    • Large: 4.5 million images of indoor scenes from roughly 600 buildings
    • 26 tasks; every image is labeled for all of them
    • Diverse vision tasks: 2D tasks, 3D tasks, semantics, etc.


8 of 35

Task Sets

Task Set 1: Semantic Segmentation, Depth Estimation, Surface Normals, SURF Keypoints, Canny Edges

Task Set 2: Surface Normals, Autoencoder, Occlusion Edges, Reshading, Principal Curvature

[Figure: example labels for each task on a common input image]


9 of 35

Experimental Settings

  • We have 4 experimental settings

  • We exhaustively train a network for all 2^n - 1 = 31 non-empty subsets of the five tasks in each of the four settings (a small enumeration sketch follows the settings table below)


  • Setting 1: small network (2.3 billion multiply-adds), 4 million training instances, Task Set 1
  • Setting 2: large network (6.4 billion multiply-adds), 4 million training instances, Task Set 1
  • Setting 3: large network (6.4 billion multiply-adds), 200 thousand training instances, Task Set 1
  • Setting 4: large network (6.4 billion multiply-adds), 4 million training instances, Task Set 2
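As a quick sanity check on the count, the snippet below enumerates the non-empty subsets of the five Task Set 1 tasks in plain Python; each subset corresponds to one trained network per setting.

```python
from itertools import combinations

# Enumerate every non-empty subset of the five tasks; each subset
# corresponds to one multi-task network trained in a given setting.
tasks = ["SemSeg", "Depth", "Normals", "Keypoints", "Edges"]
subsets = [combo
           for r in range(1, len(tasks) + 1)
           for combo in combinations(tasks, r)]
assert len(subsets) == 2 ** len(tasks) - 1  # 31 networks per setting
```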


10 of 35

Part 1: Study of Task Interactions

Part 2: Task Grouping Framework

11 of 35

Performance vs Number of Tasks (setting 1: small network)


12 of 35

Smaller/Larger networks

  • By varying the number of channels in each layer of the encoder, we can train smaller or larger capacity networks for comparison (a minimal sketch follows)
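Here is a minimal sketch of what varying the channel count can look like, assuming a simple convolutional encoder; the base widths and multiplier values are illustrative and not the paper's exact configurations.

```python
import torch.nn as nn

# Scale encoder capacity with a channel-width multiplier.
def make_encoder(width=1.0, base_channels=(32, 64, 128)):
    layers, in_ch = [], 3
    for base in base_channels:
        out_ch = max(1, int(round(base * width)))  # scaled channel count
        layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU()]
        in_ch = out_ch
    return nn.Sequential(*layers)

small_encoder = make_encoder(width=0.5)  # fewer channels, smaller capacity
large_encoder = make_encoder(width=1.0)
```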


13 of 35

Performance vs Number of Tasks (setting 1: small network)


14 of 35

Two-Task Networks (setting 1: small network)


15 of 35

Negative Correlation with Transfer Learning (setting 1: small network)

[Scatter plot: average relative task performance vs. Taskonomy transfer affinity]


16 of 35

Two-Task Networks (setting 2: control)

[Figure: pairwise task relationships in setting 1 vs. setting 2]


17 of 35

Two-Task Networks (setting 3: small data)

[Figure: pairwise task relationships in settings 1, 2, and 3]


18 of 35

Two-Task Networks (setting 4: task set 2)

[Figure: pairwise task relationships in settings 1, 2, 3, and 4]


19 of 35

Key Takeaways

  • Many common assumptions do not seem to be true
    • More similar tasks don't necessarily work better together
      • There seems to be no a priori way to tell which tasks will work well together
    • MTL doesn't necessarily work better when you have less data
    • Task relationships are not the same between settings
  • Task relationships are sensitive to
    • Dataset size
    • Network capacity
    • … and probably other variables
  • In 15 of 16 two-task models, adding surface normals improved the other task's performance
    • Even though the normals task's own performance was usually poor
    • Normals was also usually the most helpful partner task to train with (13 out of 15 times)


20 of 35

Part 1: Study of Task Interactions

Part 2: Task Grouping Framework

21 of 35

Inference-Time Budget

  • Many computer vision applications require solving multiple tasks in real time
  • Multi-task learning can help by training networks that solve multiple tasks simultaneously


22 of 35

[Figure (Kokkinos 2016): a spectrum of approaches ranging from individual per-task prediction to a single all-in-one network, with hybrid approaches in between]


23 of 35

Task Grouping

[Diagram: five tasks (Task 1 through Task 5) assigned among three networks (Network 1 through Network 3)]




26 of 35

Task Grouping Framework

  • The Goal
    • Assign tasks to networks
    • Maximize overall performance
    • Keep inference time within a given fixed budget (formalized in the sketch below)
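One way to write the goal down formally, using notation introduced here rather than taken from the slides: let N be the set of candidate networks, tasks(n) the tasks network n predicts, perf_t(n) its performance on task t, c_n its inference cost, and b the budget.

```latex
% Hedged formalization of the grouping objective (notation is ours, not the slides').
\begin{aligned}
\max_{S \subseteq N} \quad & \sum_{t \in T} \; \max_{\substack{n \in S \\ t \in \mathrm{tasks}(n)}} \mathrm{perf}_t(n) \\
\text{s.t.} \quad & \sum_{n \in S} c_n \le b, \\
& \text{every task } t \in T \text{ is predicted by at least one } n \in S .
\end{aligned}
```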


27 of 35

Task Grouping Framework

  • Create a set of candidate networks
    • We train a network for each of the 31 subsets of our 5 tasks
    • We also include 5 half-sized networks, one per task
  • Select the best combination of networks for our budget
  • The selection problem is NP-hard
  • A branch-and-bound-like algorithm nevertheless finds the optimal solution quickly (< 1 sec in practice); a minimal sketch follows below


  • Any other algorithm that produces optimal solutions would work equally well.
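The sketch below is a small branch-and-bound-style search in that spirit, not the paper's exact algorithm: it assumes each candidate network is summarized by the tasks it predicts, its inference cost, and a measured per-task validation performance (the Candidate class and any concrete numbers a caller supplies are illustrative).

```python
# Minimal branch-and-bound sketch (illustrative, not the paper's exact algorithm).
class Candidate:
    def __init__(self, tasks, cost, perf):
        self.tasks = frozenset(tasks)  # tasks this network predicts
        self.cost = cost               # inference cost (e.g., multiply-adds)
        self.perf = perf               # dict: task -> validation performance

def best_grouping(candidates, all_tasks, budget):
    """Pick a subset of candidates that covers every task, stays within the
    inference budget, and maximizes the sum over tasks of the best performance
    among chosen networks that predict that task."""
    NEG_INF = float("-inf")
    # Optimistic per-task score, ignoring cost; used to bound unexplored branches.
    optimistic = {t: max([c.perf[t] for c in candidates if t in c.tasks] or [NEG_INF])
                  for t in all_tasks}
    best = {"score": NEG_INF, "solution": None}

    def total_score(chosen):
        score = 0.0
        for t in all_tasks:
            vals = [c.perf[t] for c in chosen if t in c.tasks]
            if not vals:
                return None            # some task is left unpredicted
            score += max(vals)
        return score

    def upper_bound(chosen, remaining):
        reachable = set().union(*[c.tasks for c in remaining]) if remaining else set()
        bound = 0.0
        for t in all_tasks:
            vals = [c.perf[t] for c in chosen if t in c.tasks]
            current = max(vals) if vals else NEG_INF
            bound += max(current, optimistic[t]) if t in reachable else current
        return bound

    def search(i, chosen, cost):
        if cost > budget:
            return                     # over budget: prune
        if i == len(candidates):
            score = total_score(chosen)
            if score is not None and score > best["score"]:
                best["score"], best["solution"] = score, list(chosen)
            return
        if upper_bound(chosen, candidates[i:]) <= best["score"]:
            return                     # even the optimistic bound cannot win: prune
        search(i + 1, chosen + [candidates[i]], cost + candidates[i].cost)  # include
        search(i + 1, chosen, cost)                                         # exclude

    search(0, [], 0.0)
    return best["solution"], best["score"]
```

With only five tasks and a few dozen candidate networks the search space is small, which matches the slide's point that an optimal solution can be found in well under a second.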


28 of 35

Chosen Solutions (setting 1)

[Figure: chosen solutions for setting 1. Legend: S = SemSeg, D = Depth, N = Normals, K = Keypoints, E = Edges]


29 of 35

Approximation

  • The above strategy is expensive
  • Early Stopping Approximation
    • Instead of fully training the candidate networks, stop after one pass through 20% of the data (a minimal sketch follows this list)
    • This speeds the training phase up by roughly 20x
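A minimal sketch of this approximation, assuming a hypothetical trainer interface (training_step and evaluate are placeholder names, not a real library API): each candidate network is trained for a single pass over a fraction of the data, and its validation score serves as a cheap proxy for its fully trained performance.

```python
# Hypothetical trainer interface: `net.training_step(batch)` performs one
# optimization step and `net.evaluate(val_set)` returns a validation score.
def approximate_performance(net, train_batches, val_set, fraction=0.2):
    budget = int(fraction * len(train_batches))
    for step, batch in enumerate(train_batches):
        if step >= budget:
            break                      # stop after one pass through 20% of the data
        net.training_step(batch)
    return net.evaluate(val_set)       # proxy score fed to the grouping search
```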


30 of 35

Chosen Solutions (setting 1)

[Figure: chosen solutions for setting 1, including those found with the early-stopping approximation. Legend: S = SemSeg, D = Depth, N = Normals, K = Keypoints, E = Edges]


31 of 35

Qualitative Results

[Figure (setting 1): qualitative results on an input image for semantic segmentation, depth estimation, surface normals, SURF keypoints, and Canny edges; rows compare ground truth, our optimal grouping, our approximation, the all-in-one network, individual networks, and Sener et al., with error maps]



35 of 35

Which Tasks Should Be Learned Together in Multi-task Learning?

Trevor Standley, Amir Zamir, Dawn Chen,

Leonidas Guibas, Jitendra Malik, Silvio Savarese

http://taskgrouping.stanford.edu


Swiss Federal Institute of Technology