1 of 10

Dataset Denoising:

Improving accuracy of NLP Classifier

Khaleeque Ansari

2 of 10

Problem Statement

  • Large datasets often contain unreliable labels: wrongly tagged examples introduced while the data is being collected and recorded. Poor dataset quality hampers the accuracy of supervised learning algorithms.
  • In this presentation we walk through our approach to identifying and eliminating mislabeled training instances.

3 of 10

Mislabeled Data

  • <Example of mislabeled data >
  • Random errors
    • Deep learning algorithms are quite robust to random labeling errors
  • Systematic errors
    • They are much less robust to systematic errors
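The distinction between the two error types can be illustrated with a toy sketch (the function names and label scheme here are hypothetical, not from the deck): random noise flips a fraction of labels to arbitrary wrong classes, while systematic noise consistently maps one class to another.

```python
import random

def add_random_noise(labels, rate, n_classes, seed=0):
    """Flip a fraction `rate` of labels to a uniformly random wrong class."""
    rng = random.Random(seed)
    noisy = list(labels)
    for i in range(len(noisy)):
        if rng.random() < rate:
            wrong = [c for c in range(n_classes) if c != noisy[i]]
            noisy[i] = rng.choice(wrong)
    return noisy

def add_systematic_noise(labels, mapping):
    """Consistently relabel classes according to a fixed (wrong) mapping."""
    return [mapping.get(y, y) for y in labels]

clean = [0, 1, 2, 0, 1, 2]
print(add_systematic_noise(clean, {2: 1}))  # → [0, 1, 1, 0, 1, 1]
```

Because systematic noise is a deterministic shift of whole classes, a model can latch onto it as if it were signal, which is why it is the more damaging kind.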

4 of 10

Existing Solutions

  • Manually inspect all the data and relabel it after fixing errors.
    • Time-consuming
    • Error-prone
    • Infeasible for very large datasets
  • Rule-based cleaning
    • Write rules that map incorrect labels to correct ones
    • Works for systematic errors, but not for random errors
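A rule-based cleaner is essentially a lookup table from known-bad labels to corrected ones. A minimal sketch (the label names below are hypothetical examples, not from the actual dataset):

```python
# Hypothetical relabeling rules: known-incorrect label -> corrected label.
RULES = {
    "bookflight": "book_flight",
    "Book Flight": "book_flight",
}

def apply_rules(label):
    """Return the corrected label if a rule exists, else keep the original."""
    return RULES.get(label, label)

labels = ["bookflight", "cancel", "Book Flight"]
print([apply_rules(y) for y in labels])  # → ['book_flight', 'cancel', 'book_flight']
```

This works only when the error pattern is systematic enough to be written down as a rule; random flips have no such pattern to encode.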

5 of 10

Our Approach

  • Manually create a small subset of cleaned labels.
  • Train the network on both the clean and the noisy labels.
  • Use an additional network to learn the noise distribution and create a mapping from noisy to clean labels.
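The deck's approach learns the noise distribution with an additional network; as a minimal counting-based stand-in for the same idea, the small cleaned subset can be used to estimate a noise-transition matrix P(noisy label | true label), which can then be inverted to guess the most likely clean label. This is an illustrative sketch assuming integer class ids, not the deck's actual model:

```python
import numpy as np

def estimate_transition(noisy, clean, n_classes):
    """Count how often each true class was recorded as each noisy class
    in the cleaned subset, then normalize rows to get P(noisy | true)."""
    T = np.zeros((n_classes, n_classes))
    for y_true, y_noisy in zip(clean, noisy):
        T[y_true, y_noisy] += 1
    return T / T.sum(axis=1, keepdims=True)

def most_likely_clean(y_noisy, T, prior):
    """Invert the noise by Bayes' rule: argmax_c P(c) * P(y_noisy | c)."""
    return int(np.argmax(prior * T[:, y_noisy]))

# Toy cleaned subset: true class 0 is sometimes mis-recorded as class 1.
clean = [0, 0, 0, 1, 1, 1, 2, 2, 2]
noisy = [0, 0, 1, 1, 1, 1, 2, 2, 2]
T = estimate_transition(noisy, clean, n_classes=3)
prior = np.ones(3) / 3
print(most_likely_clean(0, T, prior))  # → 0
```

A learned network generalizes this idea: instead of a single global transition matrix, it can condition the noise estimate on the input itself.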

6 of 10

Detailed Approach

  • <Distribution of incorrectly Labeled dataset>
  • <Add image/animation>

7 of 10

Working Example

  • <Link to notebook with code and example >

8 of 10

Performance Improvement

  • F1 score increased from 0.75 to 0.83
  • <Examples of correctly labeled data using our approach>
  • <Examples of correctly classified utterance after cleaning data>

9 of 10

Conclusion

  • When building practical systems, we should invest more time in manual error analysis.
  • Our approach reduces noise in a large dataset using only a small subset of cleaned labels.

10 of 10

Thank You

Q & A