1 of 10

Dataset Denoising:

Improving accuracy of NLP Classifier

Khaleeque Ansari

2 of 10

Problem Statement

  • Large datasets often contain unreliable labels: wrongly tagged examples introduced while the data is being collected and recorded. Poor dataset quality hampers the accuracy of supervised learning algorithms.
  • In this presentation we walk through our approach to identifying and eliminating mislabeled training instances.

3 of 10

Mislabeled Data

  • <Example of mislabeled data >
  • Random errors
    • Deep learning algorithms are quite robust to random labeling errors
  • Systematic errors
    • They are much less robust to systematic errors
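The distinction between the two error types can be illustrated with a toy sketch (the function names and label scheme here are hypothetical, not from the deck): random noise flips a fraction of labels to arbitrary wrong classes, while systematic noise consistently maps one class to another.

```python
import random

def add_random_noise(labels, rate, n_classes, seed=0):
    """Flip a fraction `rate` of labels to a uniformly random wrong class."""
    rng = random.Random(seed)
    noisy = list(labels)
    for i in range(len(noisy)):
        if rng.random() < rate:
            wrong = [c for c in range(n_classes) if c != noisy[i]]
            noisy[i] = rng.choice(wrong)
    return noisy

def add_systematic_noise(labels, mapping):
    """Consistently relabel classes according to a fixed (wrong) mapping."""
    return [mapping.get(y, y) for y in labels]

clean = [0, 1, 2, 0, 1, 2]
print(add_systematic_noise(clean, {2: 1}))  # → [0, 1, 1, 0, 1, 1]
```

Because systematic noise is a deterministic shift of whole classes, a model can latch onto it as if it were signal, which is why it is the more damaging kind.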

4 of 10

Existing Solutions

  • Manually inspect all the data and relabel it after fixing errors.
    • Time-consuming
    • Error-prone
    • Infeasible for very large datasets
  • Rule-based cleaning
    • Write rules that map incorrect labels to correct ones
    • Works for systematic errors, but not for random errors
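A rule-based cleaner is essentially a lookup table from known-bad labels to corrected ones. A minimal sketch (the label names below are hypothetical examples, not from the actual dataset):

```python
# Hypothetical relabeling rules: known-incorrect label -> corrected label.
RULES = {
    "bookflight": "book_flight",
    "Book Flight": "book_flight",
}

def apply_rules(label):
    """Return the corrected label if a rule exists, else keep the original."""
    return RULES.get(label, label)

labels = ["bookflight", "cancel", "Book Flight"]
print([apply_rules(y) for y in labels])  # → ['book_flight', 'cancel', 'book_flight']
```

This works only when the error pattern is systematic enough to be written down as a rule; random flips have no such pattern to encode.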

5 of 10

Our Approach

  • Manually create a small subset of cleaned labels.
  • Train the network on both the clean and the noisy labels.
  • Use an additional network to learn the noise distribution and create a mapping from noisy to clean labels.
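The deck's approach learns the noise distribution with an additional network; as a minimal counting-based stand-in for the same idea, the small cleaned subset can be used to estimate a noise-transition matrix P(noisy label | true label), which can then be inverted to guess the most likely clean label. This is an illustrative sketch assuming integer class ids, not the deck's actual model:

```python
import numpy as np

def estimate_transition(noisy, clean, n_classes):
    """Count how often each true class was recorded as each noisy class
    in the cleaned subset, then normalize rows to get P(noisy | true)."""
    T = np.zeros((n_classes, n_classes))
    for y_true, y_noisy in zip(clean, noisy):
        T[y_true, y_noisy] += 1
    return T / T.sum(axis=1, keepdims=True)

def most_likely_clean(y_noisy, T, prior):
    """Invert the noise by Bayes' rule: argmax_c P(c) * P(y_noisy | c)."""
    return int(np.argmax(prior * T[:, y_noisy]))

# Toy cleaned subset: true class 0 is sometimes mis-recorded as class 1.
clean = [0, 0, 0, 1, 1, 1, 2, 2, 2]
noisy = [0, 0, 1, 1, 1, 1, 2, 2, 2]
T = estimate_transition(noisy, clean, n_classes=3)
prior = np.ones(3) / 3
print(most_likely_clean(0, T, prior))  # → 0
```

A learned network generalizes this idea: instead of a single global transition matrix, it can condition the noise estimate on the input itself.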

6 of 10

Detailed Approach

  • <Distribution of incorrectly Labeled dataset>
  • <Add image/animation>

7 of 10

Working Example

  • <Link to notebook with code and example >

8 of 10

Performance Improvement

  • F1 score increased from 0.75 to 0.83
  • <Examples of correctly labeled data using our approach>
  • <Examples of correctly classified utterance after cleaning data>

9 of 10

Conclusion

  • When building practical systems, we should invest more time in manual error analysis.
  • Our approach reduces noise in a large dataset using only a small subset of cleaned labels.

10 of 10

Thank You

Q & A