Data Engineering (AI5308/AI4005)
Apr 6: Training Data: Data Imbalance and Data Augmentation (Ch. 4)
Sundong Kim
Course website: https://sundong.kim/courses/dataeng23sp/
Contents from CS 329S (Chip Huyen, 2022) | cs329s.stanford.edu | Eugene Yan’s page
Pop Quiz Results
Task: You want to build a model to classify whether a tweet spreads misinformation.
Question 1
Suppose you receive a continuous flow of tweets with an unknown quantity, and you don’t have enough memory to store all of them. How can you sample 10 million tweets in such a way that each tweet has an equal probability of being chosen?
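This is the classic reservoir-sampling setup. A minimal sketch of Algorithm R (the function name is illustrative), which keeps a uniform random sample of k items in O(k) memory:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep a uniform random sample of k items from a stream of
    unknown length using O(k) memory; each item survives with probability k/n."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)      # slot chosen uniformly from [0, i]
            if j < k:
                reservoir[j] = item       # replace with probability k/(i+1)
    return reservoir
```

For 10 million tweets, `reservoir_sample(tweet_stream, 10_000_000)` needs memory only for the reservoir, never for the full stream.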
Question 2
Now you have a set of 10 million tweets, and they are from 10,000 different users over a period of 24 months. However, all the tweets are unlabeled, and you want to label a portion of them to train a classifier. How would you select a sample of 100,000 tweets to label?
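One reasonable answer is stratified sampling, e.g., by user, so that a few prolific accounts cannot dominate the labeled set. A rough stdlib sketch (the names and the even-allocation policy are illustrative choices, not the only valid design):

```python
import random
from collections import defaultdict

def stratified_sample(tweets, key, n_total, seed=0):
    """Stratified sampling: split tweets into strata (e.g., by user) and
    spread the labeling budget roughly evenly across strata, so a few
    prolific accounts cannot dominate the labeled set."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for t in tweets:
        by_stratum[key(t)].append(t)
    per_stratum = max(1, n_total // len(by_stratum))   # even allocation (rough)
    sample = []
    for group in by_stratum.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

One could stratify by user, by month, or by (user, month) pairs, depending on which dimension must be covered evenly.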
Question 3
You have 100K labels from 20 annotators and need to estimate their quality. What is the appropriate number of labels to examine, and how should they be sampled?
Ch. 4: Training Data
Class Imbalance
Class imbalance is the norm
People are more interested in unusual/potentially catastrophic events
Image from PyImageSearch
Class Imbalance
Why is class imbalance hard?
Asymmetric cost of errors: regression
Thanks Eugene Yan for this example!
(Chart: two forecasts, each with 100% error but in opposite directions; one direction is OK, the other is not, because the cost of errors is asymmetric.)
How to deal with class imbalance
Model A vs. Model B confusion matrices
Model A | Actual CANCER | Actual NORMAL |
Predicted CANCER | 10 | 10 |
Predicted NORMAL | 90 | 890 |
Model B | Actual CANCER | Actual NORMAL |
Predicted CANCER | 90 | 90 |
Predicted NORMAL | 10 | 810 |
Poll:
Which model would you choose?
1. Choose the right metrics
Model B has a better chance of telling if you have cancer
Both have the same accuracy: 90%
Symmetric metrics vs. asymmetric metrics
Symmetric metrics | Asymmetric metrics |
Treat all classes the same | Measure a model’s performance w.r.t. a specific class |
Accuracy | F1, recall, precision, ROC |
Class imbalance: asymmetric metrics
| CANCER (1) | NORMAL (0) | Accuracy | Precision | Recall | F1 |
Model A | 10/100 | 890/900 | 0.9 | 0.5 | 0.1 | 0.17 |
Model B | 90/100 | 810/900 | 0.9 | 0.5 | 0.9 | 0.64 |
⚠ The F1 score computed with CANCER as the positive class differs from the F1 score computed with NORMAL as the positive class ⚠
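The table values follow directly from the confusion matrices; a quick check in plain Python:

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, F1, treating the given class as positive."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# CANCER as positive class:
# Model A: TP=10, FN=90, FP=10, TN=890; Model B: TP=90, FN=10, FP=90, TN=810
a = metrics(10, 90, 10, 890)   # -> (0.9, 0.5, 0.1, ~0.17)
b = metrics(90, 10, 90, 810)   # -> (0.9, 0.5, 0.9, ~0.64)
```

Identical accuracy and precision, but recall and F1 separate the two models clearly.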
2. Data-level methods: Resampling
Undersampling | Oversampling |
Remove samples from the majority class | Add more examples to the minority class |
Can cause loss of information | Can cause overfitting |
https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets#t1
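Random under- and oversampling can be sketched in a few lines of plain Python (function names are illustrative; libraries such as imbalanced-learn provide production versions):

```python
import random

def _group_by_class(X, y):
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    return by_class

def random_undersample(X, y, seed=0):
    """Shrink every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    n_min = min(len(v) for v in by_class.values())
    Xr, yr = [], []
    for label, samples in by_class.items():
        for xi in rng.sample(samples, n_min):
            Xr.append(xi)
            yr.append(label)
    return Xr, yr

def random_oversample(X, y, seed=0):
    """Resample every class (with replacement) up to the largest class size."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    n_max = max(len(v) for v in by_class.values())
    Xr, yr = [], []
    for label, samples in by_class.items():
        for _ in range(n_max):
            Xr.append(rng.choice(samples))
            yr.append(label)
    return Xr, yr
```

Undersampling discards majority-class information; oversampling repeats minority samples, which is where the overfitting risk comes from.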
Undersampling: Tomek Links
Image from https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Oversampling: SMOTE
Image from Analytics Vidhya
Both SMOTE and Tomek Links only work well on low-dimensional data!
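SMOTE's core idea is to interpolate between a minority sample and one of its nearest minority-class neighbors. A toy sketch of generating one synthetic point (not the full algorithm, which generates many points and handles edge cases):

```python
import math
import random

def smote_sample(minority, k=3, seed=0):
    """Generate one synthetic minority-class point, SMOTE-style: pick a real
    minority point, pick one of its k nearest minority neighbors, and
    interpolate a random fraction of the way between them."""
    rng = random.Random(seed)
    x = rng.choice(minority)
    neighbors = sorted((p for p in minority if p is not x),
                       key=lambda p: math.dist(x, p))[:k]
    nb = rng.choice(neighbors)
    lam = rng.random()                      # interpolation weight in [0, 1)
    return tuple(xi + lam * (ni - xi) for xi, ni in zip(x, nb))
```

Because the method relies on nearest neighbors and linear interpolation, it degrades in high-dimensional spaces, which is exactly the caveat above.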
3. Algorithm-level methods
Cost-sensitive learning
Class-balance loss
Non-weighted loss
Weighted loss
model.fit(features, labels, epochs=10, batch_size=32, class_weight={0: 0.1, 1: 0.9})  # Keras expects integer class indices, e.g., 1 = fraud, 0 = normal
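Under the hood, `class_weight` amounts to scaling each example's loss by its class's weight. A stdlib sketch of weighted binary cross-entropy (the function name is illustrative):

```python
import math

def weighted_bce(y_true, p_pred, class_weight):
    """Binary cross-entropy where each example's loss is scaled by the
    weight of its true class, so mistakes on the rare class cost more."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        w = class_weight[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

With `class_weight={0: 0.1, 1: 0.9}`, a missed fraud example contributes nine times the gradient of a missed normal one.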
Focal loss
Focal Loss for Dense Object Detection (Lin et al., 2017)
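Focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), down-weights well-classified examples so training focuses on the hard ones. A one-example sketch:

```python
import math

def focal_loss(y, p, gamma=2.0, alpha=0.25):
    """Focal loss for one binary example (Lin et al., 2017).
    The (1 - p_t)**gamma factor shrinks the loss of well-classified
    examples, focusing training on the hard ones."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(1, 0.9)   # confidently correct -> heavily down-weighted
hard = focal_loss(1, 0.1)   # confidently wrong -> nearly full loss
```

With γ = 0 and α_t = 1 this reduces to ordinary cross-entropy.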
Data Augmentation
“Data augmentation is the new feature engineering”
- Josh Wills, former Director of Data Engineering @ Slack
Data augmentation: Goals
Data augmentation
Label-preserving: Computer Vision
Random cropping, flipping, erasing, etc.
Image from An Efficient Multi-Scale Focusing Attention Network for Person Re-Identification (Huang et al., 2021)
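Random flipping and cropping can be sketched without any imaging library, treating an image as a 2-D list of pixel values (the function name is illustrative; torchvision and similar libraries provide real transforms):

```python
import random

def random_flip_crop(img, crop_h, crop_w, rng=None):
    """Label-preserving augmentation on an image stored as a 2-D list of
    pixel values: random horizontal flip, then a random crop."""
    rng = rng or random.Random()
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]               # horizontal flip
    top = rng.randint(0, len(img) - crop_h)            # crop offsets
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]
```

The label stays the same: a flipped or cropped cat is still a cat.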
Label-preserving: NLP
Original sentences | I’m so happy to see you. |
Generated sentences | I’m so glad to see you. I’m so happy to see y’all. I’m very happy to see you. |
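A toy synonym-replacement augmenter in the spirit of the table above (the synonym table is illustrative; real systems draw on resources like WordNet):

```python
import random

# Illustrative synonym table; real systems draw on resources like WordNet.
SYNONYMS = {"happy": ["glad", "delighted"], "see": ["meet"]}

def synonym_augment(sentence, seed=0):
    """Label-preserving NLP augmentation: swap words for random synonyms."""
    rng = random.Random(seed)
    words = sentence.split()
    return " ".join(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
                    for w in words)
```

Each call yields a paraphrase with the same sentiment label as the original.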
Perturbation: Neural networks can be sensitive to noise
An image can be misclassified by changing just one pixel (Su et al., 2017)
Perturbation: Computer Vision
(Figure: a whale image is misclassified as a turtle after adding adversarial noise generated by DeepFool or by the fast gradient sign method.)
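The fast gradient sign method perturbs each input feature by ε in the direction that increases the loss. A toy sketch on a logistic model (names and the toy model are illustrative):

```python
import math

def fgsm_perturb(x, w, b, y, eps=0.1):
    """Fast gradient sign method on a toy logistic model p = sigmoid(w.x + b):
    move every feature eps in the direction that increases the loss."""
    p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    grad = [(p - y) * wi for wi in w]        # d(cross-entropy loss)/dx
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]
```

On a real network, the gradient with respect to the input image plays the role of `grad` here; training on such perturbed examples is one form of adversarial augmentation.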
Perturbation: NLP
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)
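BERT-style random masking can double as a perturbation-style augmentation for text. A toy sketch (the 15% rate follows the BERT paper; the helper name is illustrative):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style perturbation: replace a random ~15% of tokens with [MASK]."""
    rng = random.Random(seed)
    return ["[MASK]" if rng.random() < mask_prob else t for t in tokens]
```

(BERT's full scheme also sometimes substitutes random tokens or leaves the chosen token unchanged; this sketch keeps only the masking step.)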
Data Synthesis: NLP
Template | Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION]. |
Generated queries | Sentences produced by filling the [CUISINE], [NUMBER], and [LOCATION] slots with concrete values |
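Template-based synthesis just enumerates slot-value combinations. A sketch with illustrative slot values:

```python
import itertools

# Illustrative slot values; any realistic value lists would do.
SLOTS = {
    "CUISINE": ["Vietnamese", "Italian"],
    "NUMBER": ["2", "10"],
    "LOCATION": ["my office", "downtown"],
}
TEMPLATE = "Find me a [CUISINE] restaurant within [NUMBER] miles of [LOCATION]."

def fill_template(template, slots):
    """Data synthesis: expand one template into every combination of slot values."""
    keys = list(slots)
    for combo in itertools.product(*(slots[k] for k in keys)):
        query = template
        for k, v in zip(keys, combo):
            query = query.replace(f"[{k}]", v)
        yield query
```

With realistic slot lists, one template can yield thousands of labeled training queries.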
Data Synthesis: Computer Vision
mixup: Beyond Empirical Risk Minimization (Zhang et al., 2017)
https://forums.fast.ai/t/mixup-data-augmentation/22764
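mixup blends two examples and their one-hot labels with a weight drawn from Beta(α, α). A minimal sketch of one mixed pair:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, seed=0):
    """mixup (Zhang et al., 2017): blend two feature vectors and their
    one-hot labels with a weight drawn from Beta(alpha, alpha)."""
    rng = random.Random(seed)
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

With small α the Beta draw is usually near 0 or 1, so most mixed examples stay close to one of the originals.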
Data Augmentation: GAN
Example: kidney segmentation with data augmentation by CycleGAN
Data Augmentation: Tabular Dataset
Customs Import Declaration Datasets → CTGAN + Maintaining Correlations
WCO BACUDA Conference 2022
Group of 4, 20 minutes
A survey on Image Data Augmentation for Deep Learning (Connor Shorten & Taghi M. Khoshgoftaar, 2019)
MLOps at Naver Shopping (Apr 6, 16:00, S7, 2F)
Data Engineering
Next class: Feature Engineering (Ch.5)