Transfer of Learning
“You need a lot of data if you want to train/use CNNs”
Fei-Fei Li & Andrej Karpathy & Justin Johnson
Lecture 7 -
27 Jan 2016
Transfer Learning
“You need a lot of data if you want to train/use CNNs”
NOT ALWAYS
The Unreasonable Effectiveness of Deep Features
Classes separate in the deep representations and transfer to many tasks.
[DeCAF] [Zeiler-Fergus]
Can be used as a generic feature
(“CNN code” = 4096-D vector before classifier)
query image
nearest neighbors in the “code” space
Transfer of Learning
Psychological point of view
– How individuals transfer what they learned in one context to another context that shares similar characteristics.
Transfer Learning
Machine learning community
A Motivating Example
Task 𝑻1 in environment 𝑬1
A Motivating Example (cont.)
To train the robot from scratch?
Expensive & time consuming!
Task 𝑻1
?
Environment changes 𝑬2
?
New robot
?
Task 𝑻2
Transfer Learning (cont.)
Source Task
Target Task
𝒇𝑻
𝒇𝑺
Directly apply
Adaptively Transfer
What if machines have transfer learning ability?
Transfer Learning (cont.)
Traditional Machine Learning
Transfer Learning
training domains
test domains
training domains
test domains
domain A
domain B
domain C
Transfer Learning (cont.)
Source Domain
/ Task Data
Target Domain
/ Task Data
Predictive Models
Sufficient labeled training data
Unlabeled training data, or only a few labeled examples
Transfer Learning Algorithms
Target Domain
/ Task Data
Testing
Other Motivating Examples (cont.)
Sentiment classifier
~ 82 %
Classification Accuracy
Sentiment classifier w/o transfer learning
~ 70%
Sentiment classifier w/ transfer learning
~ 77 %
Electronics | Video Games |
(1) Compact; easy to operate; very good picture quality; looks sharp! | (2) A very good game! It is action packed and full of excitement. I am very much hooked on this game. |
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp. | (4) Very realistic shooting action and good plots. We played this and were hooked. |
(5) It is also quite blurry in very dark settings. I will never buy HP again. | (6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again. |
Product reviews on different domains
Cross-Project Defect Prediction
Program with defect information
Predictive Model (Machine learning)
Future defects
OR
Cross-Project Defect Prediction
– “We ran 622 cross-project predictions and found only 3.4% actually worked.”
Worked, 3.4%
Not worked, 96.6%
Difference between Projects
Defect Prediction Process
OR
Using the same set of metrics
Instance | Metric 1 | Metric 2 | Metric 3 | … | Metric 𝑚 | Label |
instance 𝑖 | 2 | 3 | 40 | … | 19 | Y or N |
Distributions of feature values (data distributions) are different
Transfer Learning Settings
Transfer Learning
Feature Space
Heterogeneous Transfer Learning (different feature spaces)
Homogeneous Transfer Learning (same feature space)
Unsupervised Transfer Learning
Semi-Supervised Transfer Learning
Supervised Transfer Learning
Transfer Learning (cont.)
National Institute of Technology Raipur
20-03-2023
Transfer Learning Approaches
Instance-based Approaches
Feature-based Approaches
Parameter-based Approaches
Relational Approaches
Homogeneous transfer learning approaches handle situations where the source and target domains share the same feature space.
In homogeneous transfer learning, the domains differ only slightly in their marginal distributions. These approaches adapt the domains by correcting for sample selection bias or covariate shift.
Covariate shift is a change in the distribution of the input variables between the training data and the test data. It is one of the most commonly studied types of distribution shift.
Instance transfer
It covers a simple scenario in which there is a large amount of labeled data in the source domain and only a limited amount in the target domain. The two domains share a feature space and differ only in their marginal distributions. In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of the values of the variables in the subset without reference to the values of the other variables.
For example, suppose we need to build a model to diagnose cancer in a specific region where the elderly are the majority. Limited target-domain instances are given, and relevant data are available from another region where young people are the majority. Directly transferring all the data from another region may be unsuccessful since the marginal distribution difference exists, and the elderly have a higher risk of cancer than younger people.
In this scenario, it is natural to consider adapting the marginal distributions. Instance-based Transfer learning reassigns weights to the source domain instances in the loss function.
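As a rough illustration of the reweighting idea, the sketch below estimates importance weights w(x) = p_target(x) / p_source(x) for a single feature (age) and uses them in a weighted average loss. The histogram-based ratio estimate, the synthetic age distributions, and the names `importance_weights`, `source_age`, and `target_age` are all assumptions made for this example; practical methods typically estimate the ratio in other ways.

```python
import numpy as np

def importance_weights(source_x, target_x, bins=10):
    """Estimate w(x) = p_target(x) / p_source(x) on shared histogram bins."""
    lo = min(source_x.min(), target_x.min())
    hi = max(source_x.max(), target_x.max())
    edges = np.linspace(lo, hi, bins + 1)

    def bin_index(x):
        # Map each value to a bin; clip so the maximum lands in the last bin.
        return np.clip(np.digitize(x, edges) - 1, 0, bins - 1)

    src_idx = bin_index(source_x)
    p_src = np.bincount(src_idx, minlength=bins) / len(source_x)
    p_tgt = np.bincount(bin_index(target_x), minlength=bins) / len(target_x)
    # Every source instance sits in a bin with p_src > 0, so the ratio is finite.
    return p_tgt[src_idx] / p_src[src_idx]

rng = np.random.default_rng(0)
source_age = rng.normal(35, 10, size=5000)   # hypothetical "young" region
target_age = rng.normal(65, 10, size=500)    # hypothetical "elderly" region

weights = importance_weights(source_age, target_age)

# Stand-in per-instance losses; a real model would supply these.
per_instance_loss = rng.random(source_age.size)
weighted_loss = np.average(per_instance_loss, weights=weights)

# Sanity check: the reweighted source sample looks more like the target.
reweighted_mean_age = np.average(source_age, weights=weights)
```

Source instances that resemble the target population (older patients here) receive larger weights, so they contribute more to the training loss.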
The parameter-based transfer learning approaches transfer the knowledge at the model/parameter level.
This approach involves transferring knowledge through the shared parameters of the source and target domain learner models. One way to transfer the learned knowledge can be by creating multiple source learner models and optimally combining the re-weighted learners similar to ensemble learners to form an improved target learner.
The idea behind parameter-based methods is that a well-trained model on the source domain has learned a well-defined structure, and if two tasks are related, this structure can be transferred to the target model. In general, there are two ways to share the weights in deep learning models: soft weight sharing and hard weight sharing.
In soft weight sharing, the model is expected to be close to the already learned features and is usually penalized if its weights deviate significantly from a given set of weights.
In hard weight sharing, we share the exact weights among different models.
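The difference between the two sharing schemes can be sketched on a linear model. Here soft sharing is written as least squares with a penalty that pulls the target weights toward the source weights, and hard sharing simply copies them. The linear model, the closed-form solver `fit_soft_shared`, and all numeric values are assumptions chosen for illustration, not a method taken from the slides.

```python
import numpy as np

def fit_soft_shared(X, y, w_source, lam):
    """Minimize ||X w - y||^2 + lam * ||w - w_source||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_source)

rng = np.random.default_rng(0)
w_source = np.array([1.0, -2.0, 0.5])              # hypothetical source weights
X_target = rng.normal(size=(20, 3))                # small target dataset
y_target = X_target @ np.array([1.2, -1.8, 0.4])   # related target task

w_hard = w_source.copy()                                          # hard sharing: exact copy
w_soft = fit_soft_shared(X_target, y_target, w_source, lam=5.0)   # soft sharing: penalized
w_free = fit_soft_shared(X_target, y_target, w_source, lam=0.0)   # no sharing at all
```

Larger values of `lam` keep the target weights closer to the source weights; `lam=0` ignores the source model entirely.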
Cross-Project Defect Prediction
“Training data is often not available in early phases, either because a company do not track or it is the first release of a product”
[Singh et al., Cross-project Defect Prediction, -12]
“For many new projects we may not have enough historical data to train prediction models.”
[Rahman, Posnett, and Devanbu, Recalling the “Imprecision” of Cross-project Defect Prediction, ICSE-12]
Future Direction
Deep RL
Transfer Learning for Deep RL
Transfer Learning with CNNs
1. Train on ImageNet
2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier
i.e. swap the Softmax layer at the end
3. If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization, then train the full network or only some of the higher layers
i.e. retrain a bigger portion of the network, or even all of it
tip: when finetuning, use only ~1/10th of the original learning rate on the top layer, and ~1/100th on intermediate layers
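The learning-rate tip can be written down as a small, framework-free helper. This is only a sketch: the function name `finetune_learning_rates`, the layer names, and the `n_frozen` parameter are hypothetical, and a real training framework would apply these rates through its own optimizer settings.

```python
def finetune_learning_rates(layer_names, original_lr, n_frozen=0):
    """Map each layer (ordered input to output) to a finetuning learning rate."""
    rates = {}
    top = len(layer_names) - 1
    for i, name in enumerate(layer_names):
        if i < n_frozen:
            rates[name] = 0.0                 # fixed weights, not updated
        elif i == top:
            rates[name] = original_lr / 10    # top layer: ~1/10th
        else:
            rates[name] = original_lr / 100   # intermediate layers: ~1/100th
    return rates

# Hypothetical layer list, ordered from input to output.
layers = ["conv1", "conv2", "conv3", "fc6", "fc7", "classifier"]
rates = finetune_learning_rates(layers, original_lr=0.01, n_frozen=2)
```

Setting `n_frozen` to all layers except the last corresponds to the small-dataset case, where only the classifier is retrained.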
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
best model
Only 3x3 CONV stride 1, pad 1
and 2x2 MAX POOL stride 2
11.2% top 5 error in ILSVRC 2013 -> 7.3% top 5 error
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
(not counting biases)
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Note:
Most memory is in early CONV
Most params are in late FC
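The per-layer arithmetic in the table can be reproduced with a short script. The sketch below assumes the layer configuration exactly as listed (3x3 convolutions, 2x2 pooling that halves the spatial size, biases ignored); the names `VGG16_CFG` and `vgg16_counts` are made up for this example. Summing the listed figures gives about 138.3M parameters, matching the stated total, and about 15.2M activations, which is lower than the 24M stated above; the script only reproduces the arithmetic and does not try to explain that difference.

```python
# Channel counts for CONV3 layers; "M" marks a 2x2 max pool with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_counts(input_size=224, in_channels=3, fc_sizes=(4096, 4096, 1000)):
    """Return (activation count, parameter count), not counting biases."""
    size, channels = input_size, in_channels
    activations = size * size * channels      # the input volume itself
    params = 0
    for v in VGG16_CFG:
        if v == "M":
            size //= 2                         # pooling halves height and width
        else:
            params += 3 * 3 * channels * v     # 3x3 filters over all input channels
            channels = v
        activations += size * size * channels
    features = size * size * channels          # 7*7*512 going into the first FC
    for out in fc_sizes:
        params += features * out
        activations += out
        features = out
    return activations, params

total_activations, total_params = vgg16_counts()
forward_megabytes = total_activations * 4 / 1e6   # 4 bytes per float32 value
```

The same loop also makes the note easy to check: the two CONV3-64 layers alone hold over 6M activations, while the first FC layer alone holds over 100M parameters.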
One-shot Learning
One-shot learning is a classification task where we have one or a few examples to learn from and classify many new examples in the future.
This is the case of face recognition, where people's faces must be classified correctly with different facial expressions, lighting conditions, accessories, and hairstyles, and the model has one or a few template photos as input.
For one-shot learning, we rely almost entirely on knowledge transferred from a base model, because only one or a few examples are available for each class.
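One common way to use transferred knowledge in this setting is to compare feature vectors ("codes") produced by a pretrained model. The sketch below assumes such codes are already available and classifies a query by cosine similarity to a single template per person; the tiny 4-D vectors, the names, and the function `one_shot_classify` are hypothetical stand-ins.

```python
import numpy as np

def one_shot_classify(query_code, templates):
    """Return the identity whose template code is most similar to the query."""
    names = list(templates)
    T = np.stack([np.asarray(templates[n], dtype=float) for n in names])
    T /= np.linalg.norm(T, axis=1, keepdims=True)    # unit-length templates
    q = np.asarray(query_code, dtype=float)
    q = q / np.linalg.norm(q)                        # unit-length query
    return names[int(np.argmax(T @ q))]              # highest cosine similarity

# One template code per person (hypothetical stand-ins for CNN codes).
templates = {
    "alice": [0.9, 0.1, 0.0, 0.2],
    "bob":   [0.1, 0.8, 0.3, 0.0],
}
query = [0.8, 0.2, 0.1, 0.3]   # a new photo, e.g. different lighting
predicted = one_shot_classify(query, templates)
```

No classifier is trained on the new identities; adding a person only requires storing one more template code.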
Zero-shot Learning
When transfer learning is taken to the extreme, with zero labeled instances of a class available, the strategy is called zero-shot learning.
Zero-shot learning needs additional information during the training phase (for example, descriptions or attributes of the classes) to make sense of the unseen data.
Zero-shot learning involves three variables: the traditional input variable, x, the traditional output variable, y, and an additional random variable that describes the task. It comes in handy in scenarios such as machine translation, where we may not have labels in the target language.
Outline
these scenarios is clearly infeasible or impractical.
- How to learn classifiers for new classes for which you have no (training) data?
Supervised (conventional) Learning
Classifier f(x)
class: horse class: elephant
??
x
y
Zero-Shot Learning
– Traditional concept makes no sense
horse
– Training Data (x,y)
Classifier f(x)
elephant
??
Zebra has not been seen before: how do we minimize error for classes never seen during training?
Airport Security Context
– Fine-grained classification
Key Idea: Leverage structure in descriptions
Source domain: seen classes
Target domain: unseen classes (?)
What if we are given thematic information during training? Can we recognize new class from thematic information?
Key Idea: Reduction to Standard Binary Classification
– Predict whether or not they are associated
Classifiers
f(x,d)
𝒙𝑖
A zebra is an animal that looks like a horse.
It has stripes like a tiger does. It has black and white stripes on its body.
𝒅𝑗
No match to image
Yes, description matches image
With thematic info we can pose it as conventional learning with unconventional outputs for classifiers.
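To make the reduction concrete, the sketch below scores an (image, description) pair with a function f(x, d) and answers "match" or "no match"; an unseen class is then recognized by picking the description with the highest score. The bilinear form with a sigmoid, the identity matrix `W`, the three hand-written attributes, and the feature values are all assumptions for illustration; in practice the scoring function would be learned from seen classes.

```python
import numpy as np

def match_score(x, d, W):
    """Compatibility f(x, d) in (0, 1): bilinear form followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ W @ d)))

def zero_shot_classify(x, descriptions, W):
    """Pick the class whose description best matches the image features."""
    names = list(descriptions)
    scores = [match_score(x, descriptions[n], W) for n in names]
    return names[int(np.argmax(scores))]

# Hypothetical attribute encoding: [horse-like shape, stripes, trunk], +1 / -1.
descriptions = {
    "horse":    np.array([1.0, -1.0, -1.0]),
    "elephant": np.array([-1.0, -1.0, 1.0]),
    "zebra":    np.array([1.0, 1.0, -1.0]),   # class with no training images
}
W = np.eye(3)                       # stand-in for a learned compatibility matrix
x = np.array([0.9, 0.8, -0.7])      # hypothetical image features of a zebra photo

is_match = match_score(x, descriptions["zebra"], W) > 0.5
predicted = zero_shot_classify(x, descriptions, W)
```

Because the classifier takes the description as an input, a new class only needs a description, not labeled images.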
Key Idea 2: Latent Topic Model
12/19/2016
What if themes/attributes are unknown?
Can we infer these themes from generic information about other classes?
Experiments: Benchmark datasets
Dataset | # instances | # attributes | # seen/unseen classes |
aP&Y | 15,339 | 64 (continuous) | 20 / 12 |
AwA | 30,475 | 85 (continuous) | 40 / 10 |
CUB-200-2011 | 11,788 | 312 (binary) | 150 / 50 |
SUN Attribute | 14,340 | 102 (binary) | 707 / 10 |
Performance Comparison
Method | aP&Y | AwA | CUB-200-2011 | SUN Attribute | Average |
Akata et al. CVPR’15 | - | 61.9 | 40.3 | - | - |
Lampert et al. PAMI’14 | 38.16 | 57.23 | - | 72.00 | - |
R.-Paredes and Torr ICML’15 | 24.22±2.89 | 75.32±2.28 | - | 82.10±0.32 | - |
SSE, ICCV’15 | 46.23±0.53 | 76.33±0.83 | 30.41±0.20 | 82.50±1.32 | 58.87 |
SDL, arXiv’15 | 50.35±2.97 | 79.12±0.53 | 41.78±0.52 | 83.83±0.29 | 63.77 |
Zero Shot Inference
Represent archive
Detection and tracking create probabilistic archive graph
Problem reduced to subgraph matching
Semantic Query Graph
Semantic Gap
User @RealUser ∙ 10h
Going to give Tom his backpack
Outline
- How to learn classifiers for new classes for which you have no (training) data?
threats. Match/Identify thematic properties of new classes.
Summary
- ConvNets stack CONV,POOL,FC layers
- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
- Typical architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
where N is usually up to ~5, M is large, 0 <= K <= 2.
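The pattern can be expanded mechanically into a flat list of layer names, which makes it easy to see what a given choice of N, M, and K produces. This is a sketch that follows the pattern literally; the function name `convnet_layers` and the `pool` flag (standing in for the optional "POOL?") are assumptions.

```python
def convnet_layers(n, m, k, pool=True):
    """Expand [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX into layer names."""
    layers = []
    for _ in range(m):
        layers += ["CONV", "RELU"] * n     # N conv/ReLU pairs per block
        if pool:
            layers.append("POOL")          # optional pooling after each block
    layers += ["FC", "RELU"] * k           # K fully connected/ReLU pairs
    layers.append("SOFTMAX")
    return layers

small_net = convnet_layers(n=2, m=1, k=1)
```

Calling `convnet_layers(n, m, k=0, pool=False)` gives the all-CONV style mentioned in the trend above.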
How to Calculate Marginal Distribution Probability
Example question: Calculate the marginal distribution of pet preference among men and women:
Solution:
Step 1: Count the total number of people. In this case the total is given in the right hand column (22 people).
Step 2: Count the number of people who prefer each pet type and then turn the ratio into a probability:
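The two steps can be sketched in a few lines. The original table of counts is not reproduced here, so the counts below are hypothetical; they were chosen only so that the total is 22 people, and the three unnamed pet-type columns are placeholders.

```python
import numpy as np

# Hypothetical counts (NOT the original table): rows are men and women,
# columns are three pet types. The only constraint kept is a total of 22.
counts = np.array([
    [2, 3, 5],   # men
    [4, 6, 2],   # women
])

total = counts.sum()                           # Step 1: total number of people
pet_marginal = counts.sum(axis=0) / total      # Step 2: per-pet counts as probabilities
gender_marginal = counts.sum(axis=1) / total   # the other marginal, for comparison
```

Each marginal sums over the other variable, so `pet_marginal` ignores gender and `gender_marginal` ignores pet type; both sum to 1.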