1 of 52

Part 1: Multi-Task Learning

Iddo Drori Joaquin Vanschoren

MIT TU Eindhoven

AAAI 2021

https://sites.google.com/mit.edu/aaai2021metalearningtutorial

Meta Learning Tutorial

2 of 52

Multi-Task Learning (MTL) Agenda

  • Motivation: learning multiple games and fields
  • Multi-task learning examples: autonomous vehicles and edge devices
  • Problem formulation
  • MTL architectures: hard, soft, hybrid sharing
  • Multi-objective optimization
  • Combinatorial optimization
  • Applications

3 of 52

Multi-Task Learning (MTL) Progress and Motivation

4 of 52

Learning 57 Atari Games

Source: Human-level control through deep reinforcement learning, Mnih et al, Nature 2015

5 of 52

Progress in Atari Games

[Figure: Atari performance in 2015 vs. 2018]

Montezuma's Revenge and Pitfall were at random performance in 2015 and at superhuman performance in 2018; by 2020, all 57 games were at superhuman performance.

6 of 52

Learning 57 Fields

Source: Measuring Massive Multitask Language Understanding, Hendrycks et al, 9.7.2020

7 of 52

Expected Progress in Learning 57 Fields

2020 vs. 2023 (expected)

2020: US Foreign Policy performance is at about 70%.

College Chemistry and College Physics are the hardest fields, only slightly above random performance with GPT-3.

Machine Learning performance is only slightly better.

Expected progress:

College Chemistry and Physics will be superhuman in 2023. All fields will be superhuman in 2025.

Learning to learn courses is already happening.

8 of 52

EE

"In an SR latch built from NOR gates, which condition is not allowed","S=0, R=0","S=0, R=1","S=1, R=0","S=1, R=1",D

"In a 2 pole lap winding dc machine , the resistance of one conductor is 2Ω and total number of conductors is 100. Find the total resistance",200Ω,100Ω,50Ω,10Ω,C

"The coil of a moving coil meter has 100 turns, is 40 mm long and 30 mm wide. The control torque is 240*10-6 N-m on full scale. If magnetic flux density is 1Wb/m2 range of meter is",1 mA.,2 mA.,3 mA.,4 mA.,B

"Two long parallel conductors carry 100 A. If the conductors are separated by 20 mm, the force per meter of length of each conductor will be",100 N.,0.1 N.,1 N.,0.01 N.,B

A point pole has a strength of 4π * 10^-4 weber. The force in newtons on a point pole of 4π * 1.5 * 10^-4 weber placed at a distance of 10 cm from it will be,15 N.,20 N.,7.5 N.,3.75 N.,A

Source: Measuring Massive Multitask Language Understanding, Hendrycks et al, 9.7.2020

9 of 52

Multi-Task Learning (MTL)

10 of 52

Multi-Task Learning

[Diagram: for each of Task A, Task B, and Task C, a (task, data) pair is fed to a multi-task learning algorithm, which outputs one predictor per task]

11 of 52

Multi-Task Learning: Self Driving Cars

  • Multiple tasks: detect cars, pedestrians, signs, lights, curbs, lanes, cross walks, etc.
    • On the order of 100 tasks
    • Each task has sub-tasks

Source: Tesla AutoPilot

12 of 52

Multi-Task Learning: Edge Devices

  • High performance, prediction accuracy
  • Efficient computation, training and inference time
  • Compact models

13 of 52

Multi-Task Learning (MTL) Architectures

  • One neural network for learning multiple tasks: all-in-one
  • Separate networks for each task: individual prediction
  • Hybrid approach

  • Combinatorial optimization problem:
    • Bipartite matching of tasks to networks

14 of 52

Multi-Task Learning (MTL) Questions

  • How do tasks influence one another? Does each task help the others, or is there negative transfer?
  • How should weights be shared between different tasks?
  • How does network size influence MTL?
  • How do dataset size and the distribution of samples per task influence MTL?
  • Are the tasks similar or heterogeneous?

15 of 52

Multi-Task Learning

  • Multiple heterogeneous tasks: different importance, difficulty, number of samples, noise level

16 of 52

Shared backbone for multiple tasks with multiple heads

  • Input example: 𝒙i
  • Tasks: 𝒕 = 1..T
  • Output of task 𝒕 on example i: 𝒚i𝒕
  • Number of data points: N
  • Dataset of iid data points {𝒙i, 𝒚i1,...,𝒚iT} for i = 1...N

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]

17 of 52

Shared backbone for multiple tasks with multiple heads

  • Shared backbone network 𝒇
  • Shared backbone parameters 𝜽s
  • Task-specific decoder network 𝒈𝒕 with task-specific parameters 𝜽𝒕
  • Task-specific loss: ℒ𝒕(𝜽) := ℒ𝒕(𝜽s, 𝜽𝒕) := 1/N ∑i ℓ𝒕(𝒈𝒕(𝒇(𝒙i; 𝜽s); 𝜽𝒕), 𝒚i𝒕)
  • Linear scalarization total multi-task loss: ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽) (sketched below)

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]
18 of 52

Linear Scalarization for MTL

  • Total multi-task loss

min𝜽 ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽)

  • Advantages?
  • Disadvantages?

19 of 52

Linear Scalarization

  • Total multi-task loss is a linear weighted combination

min𝜽 ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽)

  • Advantages? Simple
  • Disadvantages?
    • How to select the weights? Requires loss-weighting strategies
    • Performance depends on the chosen weights
    • Only recovers the convex part of the Pareto front

20 of 52

Linear Scalarization

  • Total multi-task loss is a linear weighted combination

min𝜽 ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽)

  • Justification for scalarization: without it, solutions may not be comparable. For example, solution 𝜽 may be better for task 𝒕1 whereas solution 𝜽' is better for task 𝒕2:

two solutions 𝜽 and 𝜽' s.t. ℒ𝒕1(𝜽s, 𝜽𝒕1) < ℒ𝒕1(𝜽's, 𝜽'𝒕1) and ℒ𝒕2(𝜽s, 𝜽𝒕2) > ℒ𝒕2(𝜽's, 𝜽'𝒕2) for tasks 𝒕1 and 𝒕2

21 of 52

Shared backbone for multiple tasks with multiple heads

  • Sharing weights in early layers, coupled
  • Split network into backbone and task-specific layers
  • Advantages?
  • Disadvantages?

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]

22 of 52

Shared backbone for multiple tasks with multiple heads

  • Sharing weights in early layers, coupled
  • Split network into backbone and task-specific layers, where to split?
  • Advantages? efficient runtime
  • Disadvantages? over-sharing, negative transfer

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]

23 of 52

Individual network for each task

  • No sharing weights
  • Decoupled functionality
  • Advantages?
  • Disadvantages?


24 of 52

Individual network for each task

  • No sharing weights
  • Decoupled functionality
  • Advantages? no negative transfer
  • Disadvantages? inefficient runtime, does not scale well with number of tasks


25 of 52

Negative Transfer

  • Why does training individual networks often work better than a shared network?

26 of 52

Negative Transfer

  • The relationships between tasks determine whether a shared architecture works
  • One task may dominate training
  • Tasks may learn at different rates
  • Gradients may conflict (see the sketch below)
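A hedged sketch of one way to see the last point: compute each task's gradient with respect to a shared layer and check their cosine similarity; a negative value suggests conflicting gradients. The toy layer, data, and losses below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
shared = torch.nn.Linear(8, 8)                     # stand-in for a shared backbone layer
x = torch.randn(16, 8)                             # toy batch
y1, y2 = torch.randn(16, 8), torch.randn(16, 8)    # toy targets for two tasks

def flat_grad(loss):
    # Gradient of one task's loss w.r.t. the shared parameters, flattened to a vector.
    grads = torch.autograd.grad(loss, list(shared.parameters()), retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

z = shared(x)                                      # shared representation used by both tasks
g1 = flat_grad(F.mse_loss(z, y1))
g2 = flat_grad(F.mse_loss(z, y2))
cos = F.cosine_similarity(g1, g2, dim=0)
print(f"gradient cosine similarity: {cos.item():.3f}  (negative suggests conflict)")
```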

27 of 52

Multi-Task Learning and Adversarial Attacks

  • Models trained on multiple tasks at once are more robust to adversarial attacks on individual tasks

28 of 52

MTL Architectures

29 of 52

Architectures

  • Hard parameter sharing

  • Soft parameter sharing

  • Ad-hoc sharing

  • Learning to route, branch

30 of 52

Hard Parameter Sharing

  • Sharing information in early layers; risk of over-sharing
  • Split the network into shared and task-specific layers; where to split?
  • Define the loss function

[Diagram: hard-sharing backbone layers feeding task-specific layers for Tasks A, B, and C]

31 of 52

Multi-Objective Optimization

  • Optimize collection of possibly conflicting objectives:

min𝜽s,𝜽1,...,𝜽T ℒ(𝜽s,𝜽1,...,𝜽T) = min𝜽s,𝜽1,...,𝜽T (ℒ1(𝜽s,𝜽1),...,ℒT(𝜽s,𝜽T))

32 of 52

Multi-Objective Optimization

  • Tasks 𝒕 = 1..T
  • Neural network parameters 𝒙 in ℝn

  • Multi-objective function 𝒇(𝒙): ℝn -> ℝT

𝒇(𝒙) = (𝒇1(𝒙),...,𝒇T(𝒙))

  • Objective function of task 𝒕 is the task-specific loss:

𝒇𝒕(𝒙): ℝn -> ℝ

33 of 52

Pareto Optimal

  • For any 𝒙, 𝒚 in ℝn, 𝒙 dominates 𝒚 iff 𝒇(𝒙) <= 𝒇(𝒚) componentwise and 𝒇(𝒙) ≠ 𝒇(𝒚) (see the sketch below)
  • A point 𝒙 is Pareto optimal if it is not dominated by any other point
  • A point 𝒙 is locally Pareto optimal if it is not dominated by any point in a neighborhood of 𝒙
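A minimal numpy sketch of the dominance check and a brute-force filter that keeps only the Pareto-optimal points of a finite candidate set; the candidate loss vectors are made-up toy values.

```python
import numpy as np

def dominates(fx, fy):
    # x dominates y iff f(x) <= f(y) in every component and strictly better in at least one.
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def pareto_optimal(points):
    # Brute force: keep the points that no other point dominates.
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Toy 2-task loss vectors; [3.0, 3.0] is dominated, the other three form the Pareto front.
losses = [np.array(v) for v in ([1.0, 3.0], [2.0, 2.0], [3.0, 1.0], [3.0, 3.0])]
print(pareto_optimal(losses))
```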

34 of 52

Pareto Frontier

  • Point C is not on the Pareto frontier because it is dominated by points A and B
  • Points A and B are not dominated by any other point, and are therefore on the Pareto frontier.

Source: Wikipedia

35 of 52

Pareto Stationary

  • If each 𝒇𝒕(𝒙) is continuously differentiable, a point 𝒙 is Pareto stationary if there exists 𝜶 in ℝT such that

𝜶𝒕 >= 0, ∑𝒕 𝜶𝒕 = 1 and ∑𝒕 𝜶𝒕 ∇𝒇𝒕(𝒙) = 0

  • All Pareto optimal points are Pareto stationary.

36 of 52

MTL Algorithm

  • Gradient descent on the task-specific parameters
  • Solve the following optimization problem for the shared parameters (two-task closed form sketched below):

min𝜶1,...,𝜶T { ||∑𝒕 𝜶𝒕 ∇𝒇𝒕(𝒙)|| : ∑𝒕 𝜶𝒕 = 1, 𝜶𝒕 >= 0 for all 𝒕 }

min𝜶1,...,𝜶T { ||∑𝒕 𝜶𝒕 ∇𝜽s ℒ𝒕(𝜽s, 𝜽𝒕)|| : ∑𝒕 𝜶𝒕 = 1, 𝜶𝒕 >= 0 for all 𝒕 }

Source: Multi-task learning as multiobjective optimization, Sener and Koltun, 2018
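For two tasks, the min-norm problem above has a closed-form solution, as used in Sener and Koltun (2018). A hedged numpy sketch, with toy gradients standing in for ∇𝜽s ℒ𝒕:

```python
import numpy as np

def min_norm_two_tasks(g1, g2):
    # alpha* = clip( (g2 - g1) . g2 / ||g1 - g2||^2 , 0, 1 ), which minimizes
    # ||alpha * g1 + (1 - alpha) * g2|| over alpha in [0, 1].
    denom = float(np.dot(g1 - g2, g1 - g2))
    if denom == 0.0:
        return 0.5                        # gradients already coincide; any alpha works
    alpha = float(np.dot(g2 - g1, g2)) / denom
    return float(np.clip(alpha, 0.0, 1.0))

# Toy task gradients on the shared parameters.
g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
alpha = min_norm_two_tasks(g1, g2)
update_direction = alpha * g1 + (1.0 - alpha) * g2   # combined direction for the shared update
print(alpha, update_direction)                       # 0.5 [0.5 0.5]
```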

37 of 52

Soft Parameter Sharing

  • Sharing information in early layers
  • Does not scale well with number of tasks

[Diagram: soft-sharing backbone feeding task-specific layers for Tasks A, B, and C]

38 of 52

Ad-hoc Sharing

  • Compute task relatedness
  • Iteratively group network
  • Better performance than soft or hard sharing

[Diagram: shared backbone feeding task-specific layers for Tasks A, B, and C]

39 of 52

Learn Shared Architecture

  • Directed acyclic graph
  • Nodes represent computational operations
  • Edges represent data flows
  • Differentiable branching operations

Source: Learning to branch for multi-task learning, Guo et al, 2020

40 of 52

Layer Routing

  • Learn separate execution paths for different tasks

Source: AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning, Sun et al, 2019

41 of 52

Taskonomy Dataset

  • 4.5 million indoor scenes from 600 buildings
  • 26 diverse tasks, every image is labeled for all tasks

Source: Taskonomy: Disentangling Task Transfer Learning, Zamir et al, 2018

42 of 52

Transfer Relationships between tasks

  • Taskonomy

Source: Taskonomy: Disentangling Task Transfer Learning, Zamir et al, 2018

43 of 52

Multi-Task Learning

  • Which tasks should and should not be learned together in one network when employing multi-task learning?

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

44 of 52

Multi-Task Learning

  • Transfer relationships may not predict multi-task relationships

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

[Figure: transfer learning affinities vs. MTL affinities]

45 of 52

MTL: Combinatorial Optimization Problem

46 of 52

MTL: Combinatorial Optimization Problem

  • Bipartite matching of tasks to networks given budget

[Diagram: bipartite matching of tasks t1,...,t5 to networks 1, 2, 3]

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

47 of 52

MTL: Combinatorial Optimization Problem

  • NP-hard


Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

48 of 52

MTL: Combinatorial Optimization Problem

  • Approximate solution


Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

49 of 52

MTL: Combinatorial Optimization Problem

  • Tasks T = {t1,...,tk}
  • Inference budget b: total time allowed to complete all tasks
  • Neural network n with inference time cost cn
  • Loss of network n on each task: L(n, ti), defined as infinity if the network does not solve the task
  • A solution S is a set of networks that together solve all tasks

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

50 of 52

MTL: Combinatorial Optimization Problem

  • Computation cost of a solution: cost(S) = ∑n in S cn
  • The loss of a solution on a task is the lowest loss on that task among the networks in S:

L(S, ti) = minn in S L(n, ti)

  • Overall performance of a solution:

L(S) = ∑ti in T L(S, ti)

  • Find the solution with the lowest overall loss whose cost is within the budget (see the sketch below):

S* = argminS: cost(S) < b L(S)

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020
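A hedged brute-force sketch of this budgeted selection problem. The candidate networks, costs, and per-task losses are made-up toy numbers, and exhaustive search over subsets is only feasible for a handful of candidates, since the general problem is NP-hard.

```python
from itertools import combinations
import math

tasks = ["t1", "t2", "t3"]
# Candidate networks as (inference cost c_n, {task: loss L(n, t)});
# infinity marks a task the network does not solve. All numbers are made up.
candidates = [
    (1.0, {"t1": 0.30, "t2": 0.50, "t3": math.inf}),
    (1.0, {"t1": math.inf, "t2": 0.45, "t3": 0.40}),
    (2.0, {"t1": 0.25, "t2": 0.35, "t3": 0.30}),
]
budget = 2.5

def cost(S):
    return sum(c for c, _ in S)

def total_loss(S):
    # L(S) = sum over tasks of the best loss any network in S achieves on that task.
    return sum(min(losses[t] for _, losses in S) for t in tasks)

best = None
for k in range(1, len(candidates) + 1):
    for S in combinations(candidates, k):
        feasible = cost(S) < budget and total_loss(S) < math.inf
        if feasible and (best is None or total_loss(S) < total_loss(best)):
            best = S

print(total_loss(best), cost(best))   # the all-in-one network wins here: ~0.90 total loss, cost 2.0
```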

51 of 52

Multi-Task Learning

  • Combinatorial optimization problem:
    • Bipartite matching of tasks to networks given budget
    • NP-hard problem
    • Approximation

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

52 of 52

Meta Learning Tutorial

Iddo Drori Joaquin Vanschoren

MIT TU Eindhoven

AAAI 2021

https://sites.google.com/mit.edu/aaai2021metalearningtutorial