Large datasets often contain unreliable labels: wrongly tagged instances introduced while the data is collected and recorded. This poor label quality hurts the accuracy of supervised learning algorithms.
In this presentation we'll walk through our approach to identifying and eliminating mislabeled training instances.
Mislabeled Data
<Example of mislabeled data>
Random errors
Labels flipped independently at random; deep learning algorithms are quite robust to this kind of noise.
Systematic errors
Labels corrupted in a consistent pattern (e.g., one class repeatedly tagged as another); algorithms are much less robust to these. Both noise types are simulated in the sketch below.
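To make the distinction concrete, here is a minimal sketch that injects both kinds of noise into a toy label array (the class count and noise rate are illustrative choices, not values from the talk):

import numpy as np

rng = np.random.default_rng(0)
num_classes = 5
y = rng.integers(0, num_classes, size=1000)  # toy ground-truth labels

# Random errors: flip 20% of labels to a uniformly chosen *other* class.
noise_rate = 0.2
flip = rng.random(y.shape) < noise_rate
y_random = y.copy()
y_random[flip] = (y[flip] + rng.integers(1, num_classes, size=flip.sum())) % num_classes

# Systematic errors: every instance of class 3 is consistently tagged as class 1.
y_systematic = y.copy()
y_systematic[y == 3] = 1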
Existing Solutions
Manually inspect all the data and relabel the incorrect instances.
Time-consuming
Error prone
Infeasible for very large datasets
Rule-based cleaning
Write rules that map known incorrect labels to the correct ones (see the sketch below)
Works for systematic errors, but not for random errors
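For example, rule-based cleaning can be as simple as a lookup table; the label names below are hypothetical, not from our dataset:

# Hypothetical mapping rules from known-incorrect to correct labels.
LABEL_FIXES = {
    "cancelled_order": "order_cancelled",
    "refund_req": "refund_request",
}

def clean_label(label: str) -> str:
    """Return the corrected label if a rule exists, otherwise keep the original."""
    return LABEL_FIXES.get(label, label)

labels = ["refund_req", "greeting", "cancelled_order"]
print([clean_label(l) for l in labels])
# ['refund_request', 'greeting', 'order_cancelled']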
Our Approach
Manually create a small subset of cleaned labels.
Train the network on both the clean and the noisy labels.
Use an additional network to learn the noise distribution and map noisy labels to clean ones, as sketched below.
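One way to realize this idea, sketched in PyTorch: a learned label-noise transition matrix sits on top of the base classifier. This is our illustrative formulation in the spirit of noise-adaptation layers; the class and parameter names are ours, not necessarily the exact architecture from the talk.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyChannelModel(nn.Module):
    """Base classifier predicts clean-label probabilities; an extra learned
    layer (a transition matrix) maps them to noisy-label probabilities."""

    def __init__(self, base: nn.Module, num_classes: int):
        super().__init__()
        self.base = base
        # Start near the identity matrix: assume most labels are correct.
        self.noise_logits = nn.Parameter(5.0 * torch.eye(num_classes))

    def forward(self, x):
        clean_probs = F.softmax(self.base(x), dim=-1)      # p(clean class | x)
        transition = F.softmax(self.noise_logits, dim=-1)  # row k: p(noisy | clean=k)
        noisy_probs = clean_probs @ transition             # p(noisy class | x)
        return clean_probs, noisy_probs

# Training idea: on noisy batches, fit noisy_probs against the observed labels;
# on the small clean subset, fit clean_probs directly.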
Detailed approach
<Distribution of incorrectly labeled data>
<Add image/animation>
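While the figure here is a placeholder, a simple way to picture the noise distribution is to estimate it empirically from the manually cleaned subset. A hedged sketch; the helper function below is ours:

import numpy as np

def estimate_noise_matrix(y_clean, y_noisy, num_classes):
    """Estimate p(noisy label | true label) by counting disagreements
    between the clean and noisy labels on the cleaned subset."""
    counts = np.zeros((num_classes, num_classes))
    for true, noisy in zip(y_clean, y_noisy):
        counts[true, noisy] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no clean examples fall back to the identity (assume no noise).
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), np.eye(num_classes))

# Toy example: class 2 is mislabeled as class 0 half of the time.
print(estimate_noise_matrix([0, 1, 2, 2], [0, 1, 0, 2], num_classes=3))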
Working Example
<Link to notebook with code and example>
Performance Improvement
F1 score increased from 0.75 to 0.83
<Examples of correctly labeled data using our approach>
<Examples of correctly classified utterances after cleaning the data>
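For reference, the comparison can be measured like this. The arrays below are toy placeholders showing only how the metric is computed; they are not the real predictions behind the 0.75 to 0.83 scores, which come from our held-out evaluation set.

import numpy as np
from sklearn.metrics import f1_score

y_test = np.array([0, 1, 2, 1, 0, 2, 1, 0])
preds_noisy_model = np.array([0, 2, 2, 1, 1, 2, 1, 0])    # trained on raw labels
preds_cleaned_model = np.array([0, 1, 2, 1, 0, 2, 1, 0])  # trained after cleaning

print("F1 before:", f1_score(y_test, preds_noisy_model, average="macro"))
print("F1 after:", f1_score(y_test, preds_cleaned_model, average="macro"))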
Conclusion
When building practical systems, we should invest more time in manual error analysis.
Our approach reduces noise in a large dataset using only a small subset of cleaned labels.