1 of 35

Which Tasks Should Be Learned Together in Multi-task Learning?

Trevor Standley, Amir Zamir, Dawn Chen,

Leonidas Guibas, Jitendra Malik, Silvio Savarese

http://taskgrouping.stanford.edu


2 of 35

Multi-Task Learning

  • Decreased training time
  • Decreased inference time
  • More compact models
  • Increased prediction accuracy
  • Increased sample efficiency
  • Better learned representations


[Figure: an input image with example labels for semantic segmentation, depth estimation, surface normals, SURF keypoints, and Canny edges]


3 of 35

Encoder Decoder Architectures

[Diagram: single-task encoder-decoder networks vs. a multi-task encoder-decoder network]
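To make the contrast concrete, below is a minimal PyTorch-style sketch of the multi-task encoder-decoder idea: one shared encoder whose cost is paid once, plus a lightweight decoder per task. The layer sizes, task names, and output channel counts are illustrative only, not the architecture used in the paper; a single-task network is the special case with exactly one decoder.

```python
import torch.nn as nn

# A minimal sketch (not the paper's exact architecture): shared encoder,
# one decoder head per task.
class MultiTaskNet(nn.Module):
    def __init__(self, task_out_channels):
        super().__init__()
        # Shared encoder: its cost is paid once no matter how many tasks
        # are attached, which is where the multi-task inference savings come from.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One lightweight decoder per task, e.g. {"segmentation": 20, "depth": 1}.
        self.decoders = nn.ModuleDict({
            task: nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
            )
            for task, out_ch in task_out_channels.items()
        })

    def forward(self, x):
        features = self.encoder(x)
        return {task: dec(features) for task, dec in self.decoders.items()}

# A single-task network is just the special case with one decoder:
# MultiTaskNet({"depth": 1}).
```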


4 of 35

Negative Transfer

  • Often, training each task with a separate network works best
    • Even much smaller independent networks often outperform a large multi-task network


  • Why?
    • Tasks may learn at different rates
    • One task may dominate learning
    • Gradients may interfere
    • The optimization landscape may be more difficult

  • Intuitively, the relationship between learned tasks is one of the factors that determine how well a multi-task network performs


5 of 35

Prior Work

  • Task Relationships
    • For transfer learning [1] (compared in paper) [1] Zamir'18.
    • For natural language processing [2] [2] Bingel'17.
    • Cross-Task Consistency [3] [3] Zamir'20.
  • Multi-task loss weighting
    • To align gradients [4] (compared in paper) [4] Sener'18.
    • Based on uncertainty [5] [5] Kendall'18.
    • To balance gradient influence [6] (compared in paper) [6] Chen'18.
    • Comparison of loss weighting strategies [7,8] (We confirm results) [7] Gong'19. [8] Leang'20.
  • Architectural approaches to MTL
    • [9,10,11,12,13, and others]


[9] Duong'15. [10] Kokkinos'17. [11] Misra'16. [12] Liu'19. [13] Maninis'19.


6 of 35

Contributions

  1. We provide an empirical study of a number of factors that influence multi-task learning
    1. Network size
    2. Dataset size
    3. How tasks influence one another when learned together
  2. We introduce a task grouping framework that, when given a set of tasks and a computational budget, empirically discovers which tasks to group together to minimize negative transfer.


7 of 35

Large-Scale Multi-Task Dataset

  • Most datasets used in Multi-Task studies are overly simplistic.
    • Small: often < 200k datapoints
    • Few tasks: < 4 tasks
    • Tasks are often artificial: e.g., MultiMNIST, with only classification "tasks", etc.
  • We found MTL experimental trends on such datasets unrepresentative.
  • We use the Taskonomy dataset
    • Large: 4.5 million images of indoor scenes from roughly 600 buildings
    • 26 tasks; every image is labeled for all of them
    • Diverse vision tasks: 2D tasks, 3D tasks, semantics, etc.


8 of 35

Task Sets

Task Set 1: Semantic Segmentation, Depth Estimation, Surface Normals, SURF Keypoints, Canny Edges

Task Set 2: Surface Normals, Autoencoder, Occlusion Edges, Reshading, Principal Curvature

[Figure: example labels for each task on a common input image]


9 of 35

Experimental Settings

  • We have 4 experimental settings

  • We exhaustively train a network for all 2^n - 1 = 31 non-empty subsets of the five tasks in each of the four settings (a small enumeration sketch follows the settings table below)


  • Setting 1: small network (2.3 billion multiply-adds), 4 million training instances, Task Set 1
  • Setting 2: large network (6.4 billion multiply-adds), 4 million training instances, Task Set 1
  • Setting 3: large network (6.4 billion multiply-adds), 200 thousand training instances, Task Set 1
  • Setting 4: large network (6.4 billion multiply-adds), 4 million training instances, Task Set 2
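As a quick sanity check on the count, the snippet below enumerates the non-empty subsets of the five Task Set 1 tasks in plain Python; each subset corresponds to one trained network per setting.

```python
from itertools import combinations

# Enumerate every non-empty subset of the five tasks; each subset
# corresponds to one multi-task network trained in a given setting.
tasks = ["SemSeg", "Depth", "Normals", "Keypoints", "Edges"]
subsets = [combo
           for r in range(1, len(tasks) + 1)
           for combo in combinations(tasks, r)]
assert len(subsets) == 2 ** len(tasks) - 1  # 31 networks per setting
```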


10 of 35

Part 1: Study of Task Interactions

Part 2: Task Grouping Framework

11 of 35

Performance vs Number of Tasks (setting 1: small network)


12 of 35

Smaller/Larger networks

  • By varying the number of channels in each layer of the encoder, we can train smaller or larger capacity networks for comparison (a minimal sketch follows)
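Here is a minimal sketch of what varying the channel count can look like, assuming a simple convolutional encoder; the base widths and multiplier values are illustrative and not the paper's exact configurations.

```python
import torch.nn as nn

# Scale encoder capacity with a channel-width multiplier.
def make_encoder(width=1.0, base_channels=(32, 64, 128)):
    layers, in_ch = [], 3
    for base in base_channels:
        out_ch = max(1, int(round(base * width)))  # scaled channel count
        layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU()]
        in_ch = out_ch
    return nn.Sequential(*layers)

small_encoder = make_encoder(width=0.5)  # fewer channels, smaller capacity
large_encoder = make_encoder(width=1.0)
```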


13 of 35

Performance vs Number of Tasks (setting 1: small network)


14 of 35

Two-Task Networks (setting 1: small network)


15 of 35

Negative Correlation with Transfer Learning (setting 1: small network)

[Scatter plot: average relative task performance vs. Taskonomy transfer affinity]


16 of 35

Two-Task Networks (setting 2: control)

[Figure: pairwise task relationships in setting 1 vs. setting 2]


17 of 35

Two-Task Networks (setting 3: small data)

[Figure: pairwise task relationships in settings 1, 2, and 3]


18 of 35

Two-Task Networks (setting 4: task set 2)

[Figure: pairwise task relationships in settings 1, 2, 3, and 4]


19 of 35

Key Takeaways

  • Many common assumptions do not seem to be true
    • More similar tasks don't necessarily work better together
      • There seems to be no a priori way to tell which tasks will work well together
    • MTL doesn't necessarily work better when you have less data
    • Task relationships are not the same between settings
  • Task relationships are sensitive to
    • Dataset size
    • Network capacity
    • … and probably other variables
  • In 15 of 16 two-task models, adding surface normals improved the other task's performance
    • Even though the normals task's own performance was usually poor
    • Normals was also usually the most helpful partner task to train with (13 out of 15 times)


20 of 35

Part 1: Study of Task Interactions

Part 2: Task Grouping Framework

21 of 35

Inference-Time Budget

  • Many computer vision applications require solving multiple tasks in real time
  • Multi-task learning can help by training networks that solve multiple tasks simultaneously


22 of 35

[Figure (Kokkinos 2016): a spectrum of approaches ranging from individual per-task prediction to a single all-in-one network, with hybrid approaches in between]


23 of 35

Task Grouping

[Diagram: five tasks (Task 1 through Task 5) assigned among three networks (Network 1 through Network 3)]




26 of 35

Task Grouping Framework

  • The Goal
    • Assign tasks to networks
    • Maximize overall performance
    • Keep inference time within a given fixed budget (formalized in the sketch below)
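One way to write the goal down formally, using notation introduced here rather than taken from the slides: let N be the set of candidate networks, tasks(n) the tasks network n predicts, perf_t(n) its performance on task t, c_n its inference cost, and b the budget.

```latex
% Hedged formalization of the grouping objective (notation is ours, not the slides').
\begin{aligned}
\max_{S \subseteq N} \quad & \sum_{t \in T} \; \max_{\substack{n \in S \\ t \in \mathrm{tasks}(n)}} \mathrm{perf}_t(n) \\
\text{s.t.} \quad & \sum_{n \in S} c_n \le b, \\
& \text{every task } t \in T \text{ is predicted by at least one } n \in S .
\end{aligned}
```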


27 of 35

Task Grouping Framework

  • Create a set of candidate networks
    • We train a network for each of the 31 subsets of our 5 tasks
    • We also include 5 half-sized networks, one per task
  • Select the best combination of networks for our budget
  • The selection problem is NP-hard
  • A branch-and-bound-like algorithm nevertheless finds the optimal solution quickly (< 1 sec in practice); a minimal sketch follows below


  • Any other algorithm that produces optimal solutions would work equally well.
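The sketch below is a small branch-and-bound-style search in that spirit, not the paper's exact algorithm: it assumes each candidate network is summarized by the tasks it predicts, its inference cost, and a measured per-task validation performance (the Candidate class and any concrete numbers a caller supplies are illustrative).

```python
# Minimal branch-and-bound sketch (illustrative, not the paper's exact algorithm).
class Candidate:
    def __init__(self, tasks, cost, perf):
        self.tasks = frozenset(tasks)  # tasks this network predicts
        self.cost = cost               # inference cost (e.g., multiply-adds)
        self.perf = perf               # dict: task -> validation performance

def best_grouping(candidates, all_tasks, budget):
    """Pick a subset of candidates that covers every task, stays within the
    inference budget, and maximizes the sum over tasks of the best performance
    among chosen networks that predict that task."""
    NEG_INF = float("-inf")
    # Optimistic per-task score, ignoring cost; used to bound unexplored branches.
    optimistic = {t: max([c.perf[t] for c in candidates if t in c.tasks] or [NEG_INF])
                  for t in all_tasks}
    best = {"score": NEG_INF, "solution": None}

    def total_score(chosen):
        score = 0.0
        for t in all_tasks:
            vals = [c.perf[t] for c in chosen if t in c.tasks]
            if not vals:
                return None            # some task is left unpredicted
            score += max(vals)
        return score

    def upper_bound(chosen, remaining):
        reachable = set().union(*[c.tasks for c in remaining]) if remaining else set()
        bound = 0.0
        for t in all_tasks:
            vals = [c.perf[t] for c in chosen if t in c.tasks]
            current = max(vals) if vals else NEG_INF
            bound += max(current, optimistic[t]) if t in reachable else current
        return bound

    def search(i, chosen, cost):
        if cost > budget:
            return                     # over budget: prune
        if i == len(candidates):
            score = total_score(chosen)
            if score is not None and score > best["score"]:
                best["score"], best["solution"] = score, list(chosen)
            return
        if upper_bound(chosen, candidates[i:]) <= best["score"]:
            return                     # even the optimistic bound cannot win: prune
        search(i + 1, chosen + [candidates[i]], cost + candidates[i].cost)  # include
        search(i + 1, chosen, cost)                                         # exclude

    search(0, [], 0.0)
    return best["solution"], best["score"]
```

With only five tasks and a few dozen candidate networks the search space is small, which matches the slide's point that an optimal solution can be found in well under a second.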


28 of 35

Chosen Solutions (setting 1)

[Figure: chosen solutions for setting 1. Legend: S = SemSeg, D = Depth, N = Normals, K = Keypoints, E = Edges]


29 of 35

Approximation

  • The above strategy is expensive
  • Early Stopping Approximation
    • Instead of fully training the candidate networks, stop after one pass through 20% of the data (a minimal sketch follows this list)
    • This speeds the training phase up by roughly 20x
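A minimal sketch of this approximation, assuming a hypothetical trainer interface (training_step and evaluate are placeholder names, not a real library API): each candidate network is trained for a single pass over a fraction of the data, and its validation score serves as a cheap proxy for its fully trained performance.

```python
# Hypothetical trainer interface: `net.training_step(batch)` performs one
# optimization step and `net.evaluate(val_set)` returns a validation score.
def approximate_performance(net, train_batches, val_set, fraction=0.2):
    budget = int(fraction * len(train_batches))
    for step, batch in enumerate(train_batches):
        if step >= budget:
            break                      # stop after one pass through 20% of the data
        net.training_step(batch)
    return net.evaluate(val_set)       # proxy score fed to the grouping search
```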


30 of 35

Chosen Solutions (setting 1)

[Figure: chosen solutions for setting 1, including those found with the early-stopping approximation. Legend: S = SemSeg, D = Depth, N = Normals, K = Keypoints, E = Edges]


31 of 35

Qualitative Results

[Figure (setting 1): qualitative results on an input image for semantic segmentation, depth estimation, surface normals, SURF keypoints, and Canny edges; rows compare ground truth, our optimal grouping, our approximation, the all-in-one network, individual networks, and Sener et al., with error maps]



35 of 35

Which Tasks Should Be Learned Together in Multi-task Learning?

Trevor Standley, Amir Zamir, Dawn Chen,

Leonidas Guibas, Jitendra Malik, Silvio Savarese

http://taskgrouping.stanford.edu


Swiss Federal Institute of Technology