1 of 52

Part 1: Multi-Task Learning

Iddo Drori Joaquin Vanschoren

MIT TU Eindhoven

AAAI 2021

https://sites.google.com/mit.edu/aaai2021metalearningtutorial

Meta Learning Tutorial

2 of 52

Multi-Task Learning (MTL) Agenda

  • Motivation: learning multiple games and fields
  • Multi-task learning examples: autonomous vehicles and edge devices
  • Problem formulation
  • MTL architectures: hard, soft, hybrid sharing
  • Multi-objective optimization
  • Combinatorial optimization
  • Applications

3 of 52

Multi-Task Learning (MTL) Progress and Motivation

4 of 52

Learning 57 Atari Games

Source: Human-level control through deep reinforcement learning, Mnih et al, Nature 2015

5 of 52

Progress in Atari Games

[Figure: Atari performance in 2015 vs. 2018]

Montezuma's Revenge and Pitfall were at random performance in 2015 and at superhuman performance in 2018; by 2020, all 57 games were at superhuman performance.

6 of 52

Learning 57 Fields

Source: Measuring Massive Multitask Language Understanding, Hendrycks et al, 9.7.2020

7 of 52

Expected Progress in Learning 57 Fields

2020 vs. 2023 (expected)

2020: US Foreign Policy performance is at about 70%.

College Chemistry and College Physics are the hardest fields, only slightly above random performance with GPT-3.

Machine Learning performance is only slightly better.

Expected progress:

College Chemistry and Physics will be superhuman in 2023. All fields will be superhuman in 2025.

Learning to learn courses is already happening.

8 of 52

EE

"In an SR latch built from NOR gates, which condition is not allowed","S=0, R=0","S=0, R=1","S=1, R=0","S=1, R=1",D

"In a 2 pole lap winding dc machine , the resistance of one conductor is 2Ω and total number of conductors is 100. Find the total resistance",200Ω,100Ω,50Ω,10Ω,C

"The coil of a moving coil meter has 100 turns, is 40 mm long and 30 mm wide. The control torque is 240*10-6 N-m on full scale. If magnetic flux density is 1Wb/m2 range of meter is",1 mA.,2 mA.,3 mA.,4 mA.,B

"Two long parallel conductors carry 100 A. If the conductors are separated by 20 mm, the force per meter of length of each conductor will be",100 N.,0.1 N.,1 N.,0.01 N.,B

A point pole has a strength of 4π * 10^-4 weber. The force in newtons on a point pole of 4π * 1.5 * 10^-4 weber placed at a distance of 10 cm from it will be,15 N.,20 N.,7.5 N.,3.75 N.,A

Source: Measuring Massive Multitask Language Understanding, Hendrycks et al, 9.7.2020

9 of 52

Multi-Task Learning (MTL)

10 of 52

Multi-Task Learning

[Diagram: for each of Task A, Task B, and Task C, a (task, data) pair is fed to a multi-task learning algorithm, which outputs one predictor per task]

11 of 52

Multi-Task Learning: Self Driving Cars

  • Multiple tasks: detect cars, pedestrians, signs, lights, curbs, lanes, cross walks, etc.
    • On the order of 100 tasks
    • Each task has sub-tasks

Source: Tesla AutoPilot

12 of 52

Multi-Task Learning: Edge Devices

  • High performance, prediction accuracy
  • Efficient computation, training and inference time
  • Compact models

13 of 52

Multi-Task Learning (MTL) Architectures

  • One neural network for learning multiple tasks: all-in-one
  • Separate networks for each task: individual prediction
  • Hybrid approach

  • Combinatorial optimization problem:
    • Bipartite matching of tasks to networks

14 of 52

Multi-Task Learning (MTL) Questions

  • How do tasks influence one another? Does each task help the others, or is there negative transfer?
  • How should weights be shared between different tasks?
  • How does network size influence MTL?
  • How do dataset size and the distribution of samples per task influence MTL?
  • Are the tasks similar or heterogeneous?

15 of 52

Multi-Task Learning

  • Multiple heterogeneous tasks: different importance, difficulty, number of samples, noise level

16 of 52

Shared backbone for multiple tasks with multiple heads

  • Input example: 𝒙i
  • Tasks: 𝒕 = 1..T
  • Output of task 𝒕 on example i: 𝒚i𝒕
  • Number of data points: N
  • Dataset of iid data points {𝒙i, 𝒚i1,...,𝒚iT} for i = 1...N

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]

17 of 52

Shared backbone for multiple tasks with multiple heads

  • Shared backbone network 𝒇
  • Shared backbone parameters 𝜽s
  • Task-specific decoder network 𝒈𝒕 with task-specific parameters 𝜽𝒕
  • Task-specific loss: ℒ𝒕(𝜽) := ℒ𝒕(𝜽s, 𝜽𝒕) := 1/N ∑i ℓ𝒕(𝒈𝒕(𝒇(𝒙i; 𝜽s); 𝜽𝒕), 𝒚i𝒕)
  • Linear scalarization total multi-task loss: ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽) (sketched below)

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]
18 of 52

Linear Scalarization for MTL

  • Total multi-task loss

min𝜽 ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽)

  • Advantages?
  • Disadvantages?

19 of 52

Linear Scalarization

  • Total multi-task loss is a linear weighted combination

min𝜽 ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽)

  • Advantages? Simple
  • Disadvantages?
    • How to select the weights? Requires loss-weighting strategies
    • Performance depends on the chosen weights
    • Only recovers the convex part of the Pareto front

20 of 52

Linear Scalarization

  • Total multi-task loss is a linear weighted combination

min𝜽 ℒ(𝜽) = ∑𝒕 𝜶𝒕 ℒ𝒕(𝜽)

  • Justification for scalarization: without it, solutions may not be comparable. For example, solution 𝜽 may be better for task 𝒕1 whereas solution 𝜽' is better for task 𝒕2:

two solutions 𝜽 and 𝜽' s.t. ℒ𝒕1(𝜽s, 𝜽𝒕1) < ℒ𝒕1(𝜽's, 𝜽'𝒕1) and ℒ𝒕2(𝜽s, 𝜽𝒕2) > ℒ𝒕2(𝜽's, 𝜽'𝒕2) for tasks 𝒕1 and 𝒕2

21 of 52

Shared backbone for multiple tasks with multiple heads

  • Sharing weights in early layers, coupled
  • Split network into backbone and task-specific layers
  • Advantages?
  • Disadvantages?

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]

22 of 52

Shared backbone for multiple tasks with multiple heads

  • Sharing weights in early layers, coupled
  • Split network into backbone and task-specific layers, where to split?
  • Advantages? efficient runtime
  • Disadvantages? over-sharing, negative transfer

[Diagram: a shared backbone network feeding task-specific layers for Tasks A, B, and C]

23 of 52

Individual network for each task

  • No sharing weights
  • Decoupled functionality
  • Advantages?
  • Disadvantages?


24 of 52

Individual network for each task

  • No sharing weights
  • Decoupled functionality
  • Advantages? no negative transfer
  • Disadvantages? inefficient runtime, does not scale well with number of tasks


25 of 52

Negative Transfer

  • Why does training individual networks often work better than a shared network?

26 of 52

Negative Transfer

  • The relationships between tasks determine whether a shared architecture works
  • One task may dominate training
  • Tasks may learn at different rates
  • Gradients may conflict (see the sketch below)
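A hedged sketch of one way to see the last point: compute each task's gradient with respect to a shared layer and check their cosine similarity; a negative value suggests conflicting gradients. The toy layer, data, and losses below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
shared = torch.nn.Linear(8, 8)                     # stand-in for a shared backbone layer
x = torch.randn(16, 8)                             # toy batch
y1, y2 = torch.randn(16, 8), torch.randn(16, 8)    # toy targets for two tasks

def flat_grad(loss):
    # Gradient of one task's loss w.r.t. the shared parameters, flattened to a vector.
    grads = torch.autograd.grad(loss, list(shared.parameters()), retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

z = shared(x)                                      # shared representation used by both tasks
g1 = flat_grad(F.mse_loss(z, y1))
g2 = flat_grad(F.mse_loss(z, y2))
cos = F.cosine_similarity(g1, g2, dim=0)
print(f"gradient cosine similarity: {cos.item():.3f}  (negative suggests conflict)")
```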

27 of 52

Multi-Task Learning and Adversarial Attacks

  • Models trained on multiple tasks at once are more robust to adversarial attacks on individual tasks

28 of 52

MTL Architectures

29 of 52

Architectures

  • Hard parameter sharing

  • Soft parameter sharing

  • Ad-hoc sharing

  • Learning to route, branch

30 of 52

Hard Parameter Sharing

  • Sharing information in early layers; risk of over-sharing
  • Split the network into shared and task-specific layers; where to split?
  • Define the loss function

[Diagram: hard-sharing backbone layers feeding task-specific layers for Tasks A, B, and C]

31 of 52

Multi-Objective Optimization

  • Optimize collection of possibly conflicting objectives:

min𝜽s,𝜽1,...,𝜽T ℒ(𝜽s,𝜽1,...,𝜽T) = min𝜽s,𝜽1,...,𝜽T (ℒ1(𝜽s,𝜽1),...,ℒT(𝜽s,𝜽T))

32 of 52

Multi-Objective Optimization

  • Tasks 𝒕 = 1..T
  • Neural network parameters 𝒙 in ℝn

  • Multi-objective function 𝒇(𝒙): ℝn -> ℝT

𝒇(𝒙) = (𝒇1(𝒙),...,𝒇T(𝒙))

  • Objective function of task 𝒕 is the task-specific loss:

𝒇𝒕(𝒙): ℝn -> ℝ

33 of 52

Pareto Optimal

  • For any 𝒙, 𝒚 in ℝn, 𝒙 dominates 𝒚 iff 𝒇(𝒙) <= 𝒇(𝒚) componentwise and 𝒇(𝒙) ≠ 𝒇(𝒚) (see the sketch below)
  • A point 𝒙 is Pareto optimal if it is not dominated by any other point
  • A point 𝒙 is locally Pareto optimal if it is not dominated by any point in a neighborhood of 𝒙
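A minimal numpy sketch of the dominance check and a brute-force filter that keeps only the Pareto-optimal points of a finite candidate set; the candidate loss vectors are made-up toy values.

```python
import numpy as np

def dominates(fx, fy):
    # x dominates y iff f(x) <= f(y) in every component and strictly better in at least one.
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def pareto_optimal(points):
    # Brute force: keep the points that no other point dominates.
    return [p for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Toy 2-task loss vectors; [3.0, 3.0] is dominated, the other three form the Pareto front.
losses = [np.array(v) for v in ([1.0, 3.0], [2.0, 2.0], [3.0, 1.0], [3.0, 3.0])]
print(pareto_optimal(losses))
```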

34 of 52

Pareto Frontier

  • Point C is not on the Pareto frontier because it is dominated by points A and B
  • Points A and B are not dominated by any other point, and are therefore on the Pareto frontier.

Source: Wikipedia

35 of 52

Pareto Stationary

  • If each 𝒇𝒕(𝒙) is continuously differentiable, a point 𝒙 is Pareto stationary if there exists 𝜶 in ℝT such that

𝜶𝒕 >= 0, ∑𝒕 𝜶𝒕 = 1 and ∑𝒕 𝜶𝒕 ∇𝒇𝒕(𝒙) = 0

  • All Pareto optimal points are Pareto stationary.

36 of 52

MTL Algorithm

  • Gradient descent on the task-specific parameters
  • Solve the following optimization problem for the shared parameters (two-task closed form sketched below):

min𝜶1,...,𝜶T { ||∑𝒕 𝜶𝒕 ∇𝒇𝒕(𝒙)|| : ∑𝒕 𝜶𝒕 = 1, 𝜶𝒕 >= 0 for all 𝒕 }

min𝜶1,...,𝜶T { ||∑𝒕 𝜶𝒕 ∇𝜽s ℒ𝒕(𝜽s, 𝜽𝒕)|| : ∑𝒕 𝜶𝒕 = 1, 𝜶𝒕 >= 0 for all 𝒕 }

Source: Multi-task learning as multiobjective optimization, Sener and Koltun, 2018
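For two tasks, the min-norm problem above has a closed-form solution, as used in Sener and Koltun (2018). A hedged numpy sketch, with toy gradients standing in for ∇𝜽s ℒ𝒕:

```python
import numpy as np

def min_norm_two_tasks(g1, g2):
    # alpha* = clip( (g2 - g1) . g2 / ||g1 - g2||^2 , 0, 1 ), which minimizes
    # ||alpha * g1 + (1 - alpha) * g2|| over alpha in [0, 1].
    denom = float(np.dot(g1 - g2, g1 - g2))
    if denom == 0.0:
        return 0.5                        # gradients already coincide; any alpha works
    alpha = float(np.dot(g2 - g1, g2)) / denom
    return float(np.clip(alpha, 0.0, 1.0))

# Toy task gradients on the shared parameters.
g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
alpha = min_norm_two_tasks(g1, g2)
update_direction = alpha * g1 + (1.0 - alpha) * g2   # combined direction for the shared update
print(alpha, update_direction)                       # 0.5 [0.5 0.5]
```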

37 of 52

Soft Parameter Sharing

  • Sharing information in early layers
  • Does not scale well with number of tasks

[Diagram: soft-sharing backbone feeding task-specific layers for Tasks A, B, and C]

38 of 52

Ad-hoc Sharing

  • Compute task relatedness
  • Iteratively group network
  • Better performance than soft or hard sharing

[Diagram: shared backbone feeding task-specific layers for Tasks A, B, and C]

39 of 52

Learn Shared Architecture

  • Directed acyclic graph
  • Nodes represent computational operations
  • Edges represent data flows
  • Differentiable branching operations

Source: Learning to branch for multi-task learning, Guo et al, 2020

40 of 52

Layer Routing

  • Learn separate execution paths for different tasks

Source: AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning, Sun et al, 2019

41 of 52

Taskonomy Dataset

  • 4.5 million indoor scenes from 600 buildings
  • 26 diverse tasks, every image is labeled for all tasks

Source: Taskonomy: Disentangling Task Transfer Learning, Zamir et al, 2018

42 of 52

Transfer Relationships between tasks

  • Taskonomy

Source: Taskonomy: Disentangling Task Transfer Learning, Zamir et al, 2018

43 of 52

Multi-Task Learning

  • Which tasks should and should not be learned together in one network when employing multi-task learning?

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

44 of 52

Multi-Task Learning

  • Transfer relationships may not predict multi-task relationships

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

[Figure: transfer learning affinities vs. MTL affinities]

45 of 52

MTL: Combinatorial Optimization Problem

46 of 52

MTL: Combinatorial Optimization Problem

  • Bipartite matching of tasks to networks given budget

[Diagram: bipartite matching of tasks t1,...,t5 to networks 1, 2, 3]

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

47 of 52

MTL: Combinatorial Optimization Problem

  • NP-hard


Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

48 of 52

MTL: Combinatorial Optimization Problem

  • Approximate solution


Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

49 of 52

MTL: Combinatorial Optimization Problem

  • Tasks T = {t1,...,tk}
  • Inference budget b: total time allowed to complete all tasks
  • Neural network n with inference time cost cn
  • Loss of network n on each task: L(n, ti), defined as infinity if the network does not solve the task
  • A solution S is a set of networks that together solve all tasks

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

50 of 52

MTL: Combinatorial Optimization Problem

  • Computation cost of a solution: cost(S) = ∑n in S cn
  • The loss of a solution on a task is the lowest loss on that task among the networks in S:

L(S, ti) = minn in S L(n, ti)

  • Overall performance of a solution:

L(S) = ∑ti in T L(S, ti)

  • Find the solution with the lowest overall loss whose cost is within the budget (see the sketch below):

S* = argminS: cost(S) < b L(S)

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020
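A hedged brute-force sketch of this budgeted selection problem. The candidate networks, costs, and per-task losses are made-up toy numbers, and exhaustive search over subsets is only feasible for a handful of candidates, since the general problem is NP-hard.

```python
from itertools import combinations
import math

tasks = ["t1", "t2", "t3"]
# Candidate networks as (inference cost c_n, {task: loss L(n, t)});
# infinity marks a task the network does not solve. All numbers are made up.
candidates = [
    (1.0, {"t1": 0.30, "t2": 0.50, "t3": math.inf}),
    (1.0, {"t1": math.inf, "t2": 0.45, "t3": 0.40}),
    (2.0, {"t1": 0.25, "t2": 0.35, "t3": 0.30}),
]
budget = 2.5

def cost(S):
    return sum(c for c, _ in S)

def total_loss(S):
    # L(S) = sum over tasks of the best loss any network in S achieves on that task.
    return sum(min(losses[t] for _, losses in S) for t in tasks)

best = None
for k in range(1, len(candidates) + 1):
    for S in combinations(candidates, k):
        feasible = cost(S) < budget and total_loss(S) < math.inf
        if feasible and (best is None or total_loss(S) < total_loss(best)):
            best = S

print(total_loss(best), cost(best))   # the all-in-one network wins here: ~0.90 total loss, cost 2.0
```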

51 of 52

Multi-Task Learning

  • Combinatorial optimization problem:
    • Bipartite matching of tasks to networks given budget
    • NP-hard problem
    • Approximation

Source: Which Tasks Should Be Learned Together in Multi-task Learning? Standley et al, 2020

52 of 52

Meta Learning Tutorial

Iddo Drori Joaquin Vanschoren

MIT TU Eindhoven

AAAI 2021

https://sites.google.com/mit.edu/aaai2021metalearningtutorial