1 of 62

Transfer of Learning

2 of 62

“You need a lot of data if you want to train/use CNNs”

Fei-Fei Li & Andrej Karpathy & Justin Johnson

Lecture 7 -

27 Jan 2016

3 of 62

Transfer Learning

“You need a lot of data if you want to train/use CNNs”

NOT ALWAYS


4 of 62

The Unreasonable Effectiveness of Deep Features

Classes separate in the deep representations and transfer to many tasks. [DeCAF] [Zeiler-Fergus]

5 of 62

Can be used as a generic feature

(“CNN code” = 4096-D vector before classifier)

query image

nearest neighbors in the “code” space


6 of 62

Transfer of Learning

Psychological point of view

The study of how human conduct, learning, or performance depends on prior experience.

– How individuals transfer what they learned in one context to another context that shares similar characteristics.

7 of 62

Transfer Learning

Machine learning community

Inspired by the human ability to transfer learning
The ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks that share some commonality

8 of 62

A Motivating Example

Goal: to train a robot to accomplish Task 𝑻₁ in an indoor environment 𝑬₁ using machine learning techniques:

Sufficient training data required: sensor readings to measure the environment as well as human supervision, i.e., labels
A predictive model can be learned, and used in the same environment

Task 𝑻₁ in environment 𝑬₁

9 of 62

A Motivating Example (cont.)

Limitations of traditional machine learning techniques:

Performance highly relies on whether sufficient labeled data is available to build a predictive model
When environment changes (e.g., new domain or new task), the learned predictive model performs poorly

A new predictive model has to be rebuilt from scratch

10 of 62

To train the robot from scratch?

Expensive & time consuming!

Task 𝑻₁

?

Environment changes 𝑬₂

?

New robot

?

Task 𝑻₂

11 of 62

Transfer Learning (cont.)

Source Task

Target Task

𝒇_𝑻

Assumption: training and test data are assumed to be

Represented in the same feature space, AND
Follow the same data distribution

In practice: training and test data come from different domains

Represented in different feature spaces, OR
Follow different data distributions

𝒇_𝑺

Directly apply

Adaptively Transfer

What if machines have transfer learning ability?

12 of 62

Transfer Learning (cont.)

Traditional Machine Learning

Transfer Learning

training domains

test domains

training domains

test domains

domain A

domain B

domain C

13 of 62

Transfer Learning (cont.)

Given a target domain/task, transfer learning aims to

identify the commonality between the target domain/task and previous domains/tasks
transfer knowledge from the previous domains/tasks to the target one such that human supervision on the target domain/task can be dramatically reduced.

Source Domain

/ Task Data

Target Domain

/ Task Data

Predictive Models

Sufficient labeled training data

Unlabeled training data, possibly with a few labeled examples

Transfer Learning Algorithms

Target Domain

/ Task Data

Testing

14 of 62

Other Motivating Examples (cont.)

Sentiment analysis: users may use different sentiment words across different domains.

Sentiment classifier

~ 82 %

Classification Accuracy

Sentiment classifier w/o transfer learning

~ 70%

Sentiment classifier w/ transfer learning

~ 77 %

Electronics	Video Games
(1) Compact; easy to operate; very good picture quality; looks sharp!	(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.	(4) Very realistic shooting action and good plots. We played this and were hooked.
(5) It is also quite blurry in very dark settings. I will never buy HP again.	(6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

Product reviews on different domains

15 of 62

Cross-Project Defect Prediction

Program with defect information

Predictive Model (Machine learning)

Future defects

OR

16 of 62

Cross-Project Defect Prediction

[Zimmermann et al., FSE-09]

– “We ran 622 cross-project predictions and found only 3.4% actually worked.”

Worked, 3.4%

Not worked, 96.6%

17 of 62

Difference between Projects

Development processes can be very different [Zimmermann et al., FSE-09]

The way the systems are being developed
Operating systems and environments
Tools and IDEs used for development
Coding styles
etc.

18 of 62

Defect Prediction Process

OR

Using the same set of metrics

	Metric 1	Metric 2	Metric 3	…	Metric 𝑚	Label
instance 𝑖:	2	3	40	…	19	Y or N
…

Distributions of feature values (data distributions) are different

19 of 62

Transfer Learning Settings

Transfer Learning

Feature Space: Heterogeneous or Homogeneous

Heterogeneous Transfer Learning

Homogeneous Transfer Learning

Unsupervised Transfer Learning

Semi-Supervised Transfer Learning

Supervised Transfer Learning

20 of 62

Transfer Learning (cont.)

Transfer learning improves learning by training a model on a source domain with some predefined features and then predicting data from another domain, known as the target domain, which may have a different feature space
In TL, the domain of the training data is not the same as the domain of the test data
The data in one domain does not have the same distribution or the same feature space as the data in the other domain
Different elements can be transferred from the source domain to the target domain, e.g., instances, feature representations, model parameters, and relational knowledge

National Institute of Technology Raipur

20

20-03-2023

21 of 62

Transfer Learning (cont.)

TL can reduce class imbalance by transferring instances from other, diverse domains to better balance the classes in the target domain
Provides a solution to the data heterogeneity challenge
Transferring instances from different domains can ease the handling of dirty, noisy, and uncertain data


22 of 62

Transfer Learning Approaches

Instance-based Approaches

Feature-based Approaches

Parameter-based Approaches

Relational Approaches

23 of 62

Homogeneous Transfer Learning

Homogeneous Transfer learning approaches are developed and proposed to handle situations where the domains are of the same feature space.

In Homogeneous Transfer learning, domains have only a slight difference in marginal distributions. These approaches adapt the domains by correcting the sample selection bias or covariate shift.

Covariate shift refers to a change in the distribution of the input variables between the training data and the test data. It is the most common type of shift.

Instance transfer

It covers a simple scenario in which there is a large amount of labeled data in the source domain and only a limited amount in the target domain. The two domains share a feature space and differ only in their marginal distributions. In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables.

For example, suppose we need to build a model to diagnose cancer in a specific region where the elderly are the majority. Limited target-domain instances are given, and relevant data are available from another region where young people are the majority. Directly transferring all the data from another region may be unsuccessful since the marginal distribution difference exists, and the elderly have a higher risk of cancer than younger people.

In this scenario, it is natural to consider adapting the marginal distributions. Instance-based Transfer learning reassigns weights to the source domain instances in the loss function.
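The reweighting idea can be sketched as follows. This is a minimal illustration, not a specific published method: the two age distributions, their parameters, and the helper names `importance_weight` and `weighted_average` are assumptions made for the example, and the densities are treated as known even though in practice they would have to be estimated.

```python
from statistics import NormalDist

# Hypothetical age distributions (assumed for illustration):
# the source region is mostly young, the target region mostly elderly.
source_dist = NormalDist(mu=40, sigma=12)
target_dist = NormalDist(mu=65, sigma=10)

def importance_weight(age):
    """w(x) = p_target(x) / p_source(x); both densities are assumed known here."""
    return target_dist.pdf(age) / source_dist.pdf(age)

def weighted_average(values, weights):
    """Importance-weighted average; in training, `values` would be per-instance losses."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Draw source-domain instances and weight each one by how "target-like" it is.
source_ages = source_dist.samples(2000, seed=0)
weights = [importance_weight(age) for age in source_ages]

plain_mean = sum(source_ages) / len(source_ages)
reweighted_mean = weighted_average(source_ages, weights)
```

After reweighting, source instances from older patients dominate the average, so statistics (and losses) computed on source data better reflect the elderly target population.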

24 of 62

Parameter transfer

The parameter-based transfer learning approaches transfer the knowledge at the model/parameter level.

This approach involves transferring knowledge through the shared parameters of the source and target domain learner models. One way to transfer the learned knowledge can be by creating multiple source learner models and optimally combining the re-weighted learners similar to ensemble learners to form an improved target learner.

The idea behind parameter-based methods is that a well-trained model on the source domain has learned a well-defined structure, and if two tasks are related, this structure can be transferred to the target model. In general, there are two ways to share the weights in deep learning models: soft weight sharing and hard weight sharing.

In soft weight sharing, the model is expected to be close to the already learned features and is usually penalized if its weights deviate significantly from a given set of weights.

In hard weight sharing, we share the exact weights among different models.
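The difference between the two sharing schemes can be sketched in a few lines. This is only an illustration under assumed names: `soft_sharing_penalty`, `hard_share`, the `strength` coefficient, and the example weight values are hypothetical, and the squared L2 distance is one common choice of penalty rather than the only one.

```python
def soft_sharing_penalty(target_weights, source_weights, strength=0.1):
    """Squared-distance penalty that grows as target weights drift from the source weights."""
    return strength * sum((t - s) ** 2 for t, s in zip(target_weights, source_weights))

def hard_share(source_weights):
    """Hard sharing: the target model reuses an exact copy of the source weights."""
    return list(source_weights)

source_w = [0.5, -1.2, 0.3]   # weights learned on the source task (made-up values)
target_w = [0.6, -1.0, 0.3]   # current weights of the target model (made-up values)

task_loss = 0.42              # placeholder standing in for the target-task loss
total_loss = task_loss + soft_sharing_penalty(target_w, source_w)
```

With soft sharing the target model may move away from the source weights but pays for it in the loss; with hard sharing there is nothing to penalize because the weights are identical.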

25 of 62

Feature-representation transfer
Feature-based approaches transform the original features to create a new feature representation. This approach can further be divided into two subcategories, i.e., asymmetric and symmetric Feature-based Transfer Learning.
Asymmetric approaches transform the source features to match the target ones. In other words, we take the features from the source domain and fit them into the target feature space. There can be some information loss in this process due to the marginal difference in the feature distribution.
Symmetric approaches find a common latent feature space and then transform both the source and the target features into this new feature representation.
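As a rough illustration of the asymmetric case, the sketch below shifts and scales a single source feature so that its mean and spread match the target's. Moment matching is used here only as a simple stand-in for "transforming source features to match the target ones"; the function name `align_to_target` and the sample values are assumptions, and the sketch requires the source feature to have non-zero spread.

```python
from statistics import fmean, pstdev

def align_to_target(source_values, target_values):
    """Shift and scale one source feature so its mean and spread match the target's."""
    s_mean, s_std = fmean(source_values), pstdev(source_values)
    t_mean, t_std = fmean(target_values), pstdev(target_values)
    # Standardize with source statistics, then re-express in target statistics.
    return [(v - s_mean) / s_std * t_std + t_mean for v in source_values]

source_feature = [1.0, 2.0, 3.0, 4.0, 5.0]        # made-up source measurements
target_feature = [10.0, 14.0, 18.0, 22.0, 26.0]   # made-up target measurements
aligned = align_to_target(source_feature, target_feature)
```

A symmetric method would instead map both `source_feature` and `target_feature` into a third, shared representation.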
Relational-knowledge transfer
Relational-based transfer learning approaches focus on learning the relations within the source domain and reusing that knowledge in the target domain.
Such approaches transfer the logical relationships or rules learned in the source domain to the target domain.
For example, if we learn the relationships between different elements of speech in a male voice, this can help significantly in analyzing sentences spoken in another voice.

26 of 62

Heterogeneous Transfer Learning
Transfer learning involves deriving representations from a previous network to extract meaningful features from new samples for an inter-related task. However, such approaches often fail to account for the difference in feature spaces between the source and target domains.
It is often challenging to collect labeled source domain data with the same feature space as the target domain, and Heterogeneous Transfer learning methods are developed to address such limitations.
This technique aims to solve the issue of source and target domains having differing feature spaces and other concerns like differing data distributions and label spaces. Heterogeneous Transfer Learning is applied in cross-domain tasks such as cross-language text categorization, text-to-image classification, and many others.

27 of 62

Cross-Project Defect Prediction

“Training data is often not available in early phases, either because a company do not track or it is the first release of a product”

[Singh et al., Cross-project Defect Prediction, -12]

“For many new projects we may not have enough historical data to train prediction models.”

[Rahman, Posnett, and Devanbu, Recalling the “Imprecision” of Cross-project Defect Prediction, ICSE-12]

28 of 62

Future Direction

Transfer learning for deep reinforcement learning

Deep RL

Transfer Learning for Deep RL

29 of 62

Transfer Learning for Deep Learning
Domains like natural language processing and image recognition are considered to be the hot areas of research for transfer learning. There are also many models that achieved state-of-the-art performance.
These pre-trained neural networks/models form the basis of transfer learning in the context of deep learning, which is referred to as deep transfer learning.
Off-the-shelf pre-trained models as feature extractors
To understand the flow of deep learning models, it's essential to understand what they are made up of.
Deep learning systems are layered architectures that learn different features at different layers. Initial layers learn low-level, generic features (such as edges and textures), and deeper layers combine them into increasingly high-level, task-specific features.

30 of 62

These layers are finally connected to the last layer (usually a fully connected layer, in the case of supervised learning) to get the final output. This opens the scope of using popular pre-trained networks (such as the Oxford VGG model, the Google Inception model, or the Microsoft ResNet model) without their final layer as fixed feature extractors for other tasks.

31 of 62

32 of 62

The key idea here is to leverage the pre-trained model's weighted layers to extract features, but not update the model's weights during training with new data for the new task.
The pre-trained models are trained on a large and general enough dataset and will effectively serve as a generic model of the visual world.
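The "fixed feature extractor" idea can be illustrated with a toy sketch. Everything here is a stand-in: `FROZEN_BACKBONE` plays the role of a pre-trained network with made-up weights, the logistic head is the newly added classifier, and `toy_data` is invented. The point is only that training touches the head and never the backbone.

```python
import math

# Hypothetical stand-in for a pre-trained backbone: these weights are never updated.
FROZEN_BACKBONE = [[0.8, -0.3], [0.1, 0.9], [-0.5, 0.4]]

def extract_features(x):
    """Frozen feature extractor: a linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_BACKBONE]

def predict(head, x):
    """Probability of class 1 from the trainable logistic head."""
    weights, bias = head
    z = sum(w * f for w, f in zip(weights, extract_features(x))) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, epochs=200, lr=0.5):
    """Stochastic gradient descent on the head only; the backbone stays fixed."""
    weights, bias = [0.0] * len(FROZEN_BACKBONE), 0.0
    for _ in range(epochs):
        for x, y in data:
            error = predict((weights, bias), x) - y
            feats = extract_features(x)
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

# Made-up target-task data: (input, label) pairs.
toy_data = [([1.0, 0.0], 1), ([2.0, 0.5], 1), ([0.0, 1.0], 0), ([0.5, 2.0], 0)]
head = train_head(toy_data)
```

In a real setting the backbone would be a network such as VGG with its final layer removed, and only the replacement classifier would receive gradient updates.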

33 of 62

Freezing vs. Fine-tuning
One logical way to increase the model's performance even further is to re-train (or "fine-tune") the weights of the top layers of the pre-trained model alongside the training of the classifier you added.
This will force the weights to be updated from generic feature maps the model has learned from the source task. Fine-tuning will allow the model to apply past knowledge in the target domain and re-learn some things again.
Moreover, one should try to fine-tune a small number of top layers rather than the entire model. The first few layers learn elementary and generic features that generalize to almost all types of data.
Therefore, it's wise to freeze these layers and reuse the basic knowledge derived from the past training. As we go higher up, the features are increasingly more specific to the dataset on which the model was trained. Fine-tuning aims to adapt these specialized features to work with the new dataset, rather than overwrite the generic learning.

34 of 62

35 of 62

Transfer Learning with CNNs

1. Train on ImageNet


36 of 62


37 of 62


38 of 62

Transfer Learning with CNNs

1. Train on ImageNet

2. If small dataset: fix all weights (treat CNN as fixed feature extractor), retrain only the classifier, i.e. swap the Softmax layer at the end

3. If you have a medium-sized dataset, “finetune” instead: use the old weights as initialization, and train the full network or only some of the higher layers (retrain a bigger portion of the network, or even all of it)

tip: use only ~1/10th of the original learning rate when finetuning the top layer, and ~1/100th on intermediate layers
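The learning-rate tip can be written down as a small schedule. This is a sketch under assumptions: the group names (`top_layer`, `intermediate_layers`, `early_layers`), the choice to freeze the early layers with a rate of zero, and the helper `sgd_step` are all hypothetical and not part of the lecture.

```python
def finetune_learning_rates(original_lr):
    """Per-group learning rates following the ~1/10 and ~1/100 rule of thumb."""
    return {
        "top_layer": original_lr / 10,
        "intermediate_layers": original_lr / 100,
        "early_layers": 0.0,  # assumption: early layers are kept frozen
    }

def sgd_step(params, grads, rates):
    """One plain SGD update where each parameter group uses its own learning rate."""
    return {
        group: [p - rates[group] * g for p, g in zip(params[group], grads[group])]
        for group in params
    }

rates = finetune_learning_rates(original_lr=0.01)
params = {"top_layer": [1.0], "intermediate_layers": [1.0], "early_layers": [1.0]}
grads = {"top_layer": [1.0], "intermediate_layers": [1.0], "early_layers": [1.0]}
updated = sgd_step(params, grads, rates)
```

With identical gradients, the top layer moves ten times further than the intermediate layers, and the frozen early layers do not move at all.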


39 of 62

Case Study: VGGNet

[Simonyan and Zisserman, 2014]

best model

Only 3x3 CONV stride 1, pad 1

and 2x2 MAX POOL stride 2

11.2% top 5 error in ILSVRC 2013 -> 7.3% top 5 error


40 of 62


41 of 62


42 of 62

INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

(not counting biases)

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)

TOTAL params: 138M parameters

Note: Most memory is in early CONV. Most params are in late FC.
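The parameter total can be checked with a short calculation. The sketch below assumes the standard VGG-16 layout (thirteen 3x3 convolutions in blocks of 2, 2, 3, 3, 3, each block followed by a 2x2 max pool, then three fully connected layers); `VGG16_CONFIG` and `count_vgg16_params` are names chosen for this example, and biases are ignored as in the slide.

```python
# Assumed VGG-16 configuration: output channels per 3x3 conv layer; "P" marks a 2x2 max pool.
VGG16_CONFIG = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
                512, 512, 512, "P", 512, 512, 512, "P"]

def count_vgg16_params(input_channels=3, input_size=224, fc_sizes=(4096, 4096, 1000)):
    """Count weights (no biases) layer by layer; returns (conv_params, fc_params)."""
    channels, size = input_channels, input_size
    conv_params = 0
    for layer in VGG16_CONFIG:
        if layer == "P":
            size //= 2                                  # pooling halves the spatial size
        else:
            conv_params += 3 * 3 * channels * layer     # 3x3 kernel per input/output pair
            channels = layer
    fc_params = 0
    fan_in = size * size * channels                     # flattened 7x7x512 volume
    for width in fc_sizes:
        fc_params += fan_in * width
        fan_in = width
    return conv_params, fc_params

conv_params, fc_params = count_vgg16_params()
total_params = conv_params + fc_params
```

Under these assumptions the convolutional layers contribute roughly 14.7M weights and the fully connected layers roughly 123.6M, which is why most parameters sit in the late FC layers.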


43 of 62

44 of 62

45 of 62

46 of 62

47 of 62

One-shot Learning

One-shot learning is a classification task where we have one or a few examples to learn from and classify many new examples in the future.

This is the case of face recognition, where people's faces must be classified correctly with different facial expressions, lighting conditions, accessories, and hairstyles, and the model has one or a few template photos as input.

For one-shot learning, we need to fully rely on the knowledge transfer from the base model trained on a few examples we have for a class.
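One common way to realize this is to compare embeddings produced by the pre-trained base model. The sketch below is hypothetical: the three-dimensional embeddings and the names `alice` and `bob` are invented, and cosine similarity to a single stored template per person is just one possible matching rule.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify_one_shot(query_embedding, templates):
    """Return the identity whose single template embedding is most similar to the query."""
    return max(templates, key=lambda name: cosine_similarity(query_embedding, templates[name]))

# One made-up template embedding per person (in practice, output of a pre-trained model).
templates = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
prediction = classify_one_shot([0.8, 0.2, 0.1], templates)
```

No classifier is retrained when a new person is enrolled; adding one template embedding to `templates` is enough.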

Zero-shot Learning

When transfer learning is taken to the extreme, with zero labeled instances of a class available, the strategy is called zero-shot learning.

Zero-shot learning needs additional data during the training phase to understand the unseen data.

Zero-shot learning involves the traditional input variable, x, the traditional output variable, y, and an additional random variable that describes the task. Zero-shot learning comes in handy in scenarios such as machine translation, where we may not have labels in the target language.

48 of 62

Outline

Conventional (Supervised) Machine Learning:

Large amount of training data required to train high accuracy classifiers.

Challenge

Diverse range of objects, object attributes (size, materials, chemistry, composition).
Very few (or negligible) positive examples for many scenarios. Data collection for all these scenarios is clearly infeasible or impractical.

Approach: Zero-Shot Learning

- How to learn classifiers for new classes for which you have no (training) data?

Relevance to TSA:

Luggage inspection: homemade explosives

New classes of threats for which we don’t have parametric models/samples
Variations: chemical formula, concentration, processes
Discovery of new explosive classes and how to relate them to what has been seen before

Video forensics: suspicious activity detection…

How does it work? Identify latent structural thematic properties of known classes

Predict classifiers for new classes based on how threats manifest in latent space

49 of 62

Supervised (conventional) Learning

Conventional Learning

Training Data

Images 🡪 Class-Labels
Xray images 🡪 Threat/non-threat
Video 🡪 what activity

Learning Problem

Train classifier with training data
Accurate prediction of class-labels for new images during test-time

Classifier f(x)

class: horse class: elephant

??

x

y

50 of 62

Zero-Shot Learning

Labeled images of Horses, elephants
Existing Explosive/Non-Explosive data
Video: Existing Activity Classes

Learning Problem:

Learn a classifier for new classes that are not seen in the training data.
Zebra class, New Explosives, New suspicious activity…

– Traditional concept makes no sense

horse

Zero-Shot Learning

– Training Data (x,y)

Classifier f(x)

elephant

??

Zebra has not been seen before: how do we minimize error for things not seen before?

51 of 62

Airport Security Context

Millions of types of homemade threats:

– Fine grained

classification

Myriad Scanner Outputs

52 of 62

Key Idea: Leverage structure in descriptions

Source domain Target domain

Seen

classes

Unseen

classes

?

What if we are given thematic information during training? Can we recognize new class from thematic information?

53 of 62

Key Idea: Reduction to Standard Binary Classification

View attributes/themes (d) and image (x) as two pieces of puzzle

– Predict whether or not they are associated


Classifiers

f(x,d)

𝒙_𝑖

A zebra is an animal that looks like a horse.

It has stripes like a tiger does. It has black and white stripes on its body.

𝒅_𝑗

No match to image

Yes, description matches image

With thematic info we can pose it as conventional learning with unconventional outputs for classifiers.
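The matching idea can be sketched with a toy compatibility function f(x, d). This is an illustration only: the attribute order, the class description vectors, and the fixed negative-squared-distance score are all assumptions, whereas in the approach described on the slide the match function would be learned from seen classes.

```python
# Hypothetical attribute order: [horse-like shape, stripes, trunk]
CLASS_DESCRIPTIONS = {
    "horse": [1.0, 0.0, 0.0],
    "elephant": [0.0, 0.0, 1.0],
    "zebra": [1.0, 1.0, 0.0],   # unseen class, known only through its description
}

def compatibility(image_attributes, description):
    """f(x, d): higher when the image's attribute scores agree with the description."""
    return -sum((a - d) ** 2 for a, d in zip(image_attributes, description))

def zero_shot_predict(image_attributes):
    """Pick the class whose description best matches the image."""
    return max(
        CLASS_DESCRIPTIONS,
        key=lambda name: compatibility(image_attributes, CLASS_DESCRIPTIONS[name]),
    )

# Made-up attribute scores for an image of a zebra: horse-like and striped, no trunk.
zebra_image = [0.9, 0.8, 0.1]
prediction = zero_shot_predict(zebra_image)
```

The zebra class is recognized without any zebra training images because its description combines attributes already learned from horses and other striped animals.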

54 of 62

Key Idea 2: Latent Topic Model

12/19/2016

What if themes/attributes are unknown?

Can we infer these themes from generic information about other classes?

55 of 62

Experiments: Benchmark datasets

Dataset	# instances	# attributes	# seen/unseen classes
aP&Y	15,339	64 (continuous)	20 / 12
AwA	30,475	85 (continuous)	40 / 10
CUB-200-2011	11,788	312 (binary)	150 / 50
SUN Attribute	14,340	102 (binary)	707 / 10

56 of 62

Performance Comparison

Method	aP&Y	AwA	CUB-200-2011	SUN Attribute	Average
Akata et al. CVPR’15	-	61.9	40.3	-	-
Lampert et al. PAMI’14	38.16	57.23	-	72.00	-
R.-Paredes and Torr ICML’15	24.22±2.89	75.32±2.28	-	82.10±0.32	-
SSE, ICCV’15	46.23±0.53	76.33±0.83	30.41±0.20	82.50±1.32	58.87
SDL, arXiv’15	50.35±2.97	79.12±0.53	41.78±0.52	83.83±0.29	63.77

57 of 62

Zero Shot Inference

Represent archive

Detection and tracking create probabilistic archive graph

Problem reduced to subgraph matching

Semantic Query Graph

Semantic

Gap

User @RealUser ∙ 10h

Going to give Tom his backpack

58 of 62

Outline

Conventional (Supervised) Machine Learning:

Large amount of training data required to train high accuracy classifiers.

Challenge

Diverse range of objects, object attributes (size, materials, chemistry, composition).
Very few (or negligible) positive examples for many scenarios. Data collection for all these scenarios is clearly infeasible or impractical.

Approach: Zero-Shot Learning

- How to learn classifiers for new classes for which you have no (training) data?

Intuition:

Leverage known classes to identify latent structural thematic properties of threats/non-threats. Match/Identify thematic properties of new classes.

Relevance to TSA:

Luggage inspection: homemade explosives

New classes of threats for which we don’t have parametric models/samples
Variations: chemical formula, concentration, processes
Discovery of new explosive classes and how to relate them to what has been seen before

Video forensics: suspicious activity detection…

59 of 62

Deep Transfer Learning applications
Transfer learning helps data scientists to learn from the knowledge gained from a previously used machine learning model for a similar task.
This is why the technique is now applied in several fields, listed below.
NLP
NLP is one of the most attractive applications of transfer learning. Transfer learning uses the knowledge of pre-trained AI models that can understand linguistic structures to solve cross-domain tasks. Everyday NLP tasks like next word prediction, question-answering, machine translation use deep learning models like BERT, XLNet, Albert, TF Universal Model, etc.
Computer Vision
Transfer learning is also applied in Image Processing.
Deep Neural Networks are used to solve image-related tasks as they can work well identifying complex features of the image. The dense layers contain the logic for detecting the image; thus, tuning the higher layers will not affect the base logic. Image Recognition, Object Detection, noise removal from images, etc., are typical application areas of Transfer learning because all image-related tasks require basic knowledge and pattern detection of familiar images.
Audio/Speech
Transfer learning algorithms are used to solve Audio/Speech related tasks like speech recognition or speech-to-text translation.
When we say "Siri" or "Hey Google!", the primary AI model developed for English speech recognition is busy processing our commands at the backend.
Interestingly, a pre-trained AI model developed for English speech recognition forms the basis for a French speech recognition model.
Transfer Learning in a nutshell: Key takeaways
Finally, let's do a quick recap of everything we've learned today. Here's a bullet-point summary of the things we've covered:
Transfer learning models focus on storing knowledge gained while solving one problem and applying it to a different but related problem.
Instead of training a neural network from scratch, many pre-trained models can serve as the starting point for training. These pre-trained models give a more reliable architecture and save time and resources.
Transfer learning is used in scenarios where there is not enough data for training or when we want better results in a short amount of time.
Transfer learning involves selecting a source model trained on a task similar to the target domain, adapting that model to the target task, and then training it further on the target data.
It is common to fine-tune the higher-level layers of the model while freezing the lower levels, because the basic knowledge they hold is shared between the source task and the target task of the same domain.
In tasks with a small amount of data, if the source model is too similar to the target model, there might be an issue of overfitting. Tuning the learning rate, freezing some layers from the source model, or adding linear classifiers while training the target model can help avoid this issue.

60 of 62

Summary

- ConvNets stack CONV,POOL,FC layers

- Trend towards smaller filters and deeper architectures

- Trend towards getting rid of POOL/FC layers (just CONV)

- Typical architectures look like

[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX

where N is usually up to ~5, M is large, 0 <= K <= 2.

but recent advances such as ResNet/GoogLeNet challenge this paradigm


61 of 62

62 of 62

How to Calculate Marginal Distribution Probability

Example question: Calculate the marginal distribution of pet preference among men and women.
Solution:
Step 1: Count the total number of people. In this case the total is given in the right-hand column (22 people).
Step 2: Count the number of people who prefer each pet type and then turn the ratio into a probability:

People who prefer cats: 7/22 = .32
People who prefer fish: 7/22 = .32
People who prefer dogs: 8/22 = .36
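The same calculation can be done from a table of joint counts. Only the column totals (7 cats, 7 fish, 8 dogs) and the grand total (22) come from the example above; the split between men and women in `counts` is hypothetical, chosen so that the totals match.

```python
# Hypothetical joint counts; only the column totals (7, 7, 8) and the
# grand total (22) are taken from the worked example.
counts = {
    "men":   {"cats": 2, "fish": 4, "dogs": 6},
    "women": {"cats": 5, "fish": 3, "dogs": 2},
}

def marginal_distribution(table):
    """Sum each pet's count over all rows and divide by the grand total."""
    total = sum(sum(row.values()) for row in table.values())
    pets = list(next(iter(table.values())).keys())
    return {pet: sum(row[pet] for row in table.values()) / total for pet in pets}

marginals = marginal_distribution(counts)
rounded = {pet: round(p, 2) for pet, p in marginals.items()}
```

Summing over the rows removes the dependence on gender, which is exactly what "without reference to the values of the other variables" means for a marginal distribution.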