1 of 11

Defectors: A Large, Diverse Python Dataset for Defect Prediction

Parvez Mahbub

Dalhousie University

parvezmrobin@dal.ca

Ohiduzzaman Shuvo

Dalhousie University

oh599627@dal.ca

Mohammad Masudur Rahman

Dalhousie University

masud.rahman@dal.ca

2 of 11

WHs

Why

  • Most existing defect prediction datasets are not large enough
  • They also suffer from class imbalance, containing only 5%-26% defective instances

What

  • Defectors — a large-scale dataset, containing both source code and their changes
  • 24 popular Python projects
  • 18 domains
  • ≈ 213K source code documents
  • ≈ 93K defective
  • ≈ 120K defect-free
  • Near 1:1 ratio in training set

3 of 11

How do we do it?

Dataset Construction

4 of 11

Project Selection

    • 25 mature, maintained repos
    • 11 by labeled PR
    • 12 by labeled issue
    • 2 by PR title pattern

Bug-fix Commit Collection

    • ≈ 51K bug-fix commits

Bug Inducing Commit Collection

    • Using the SZZ algorithm
    • Implementation provided by the PyDriller tool

Bug Inducing Commits Filtration

    • By The Number of Linked Bug Inducing Commits
    • By The Number of Linked Bug-fix Commits
    • By The Size of Changed Code
    • By File Type
    • By The Nature of Change

Sampling Defect-free Commits

    • Sample defect-free commits with a 95% confidence level and a 5% margin of error
    • Take at least as many as the defective commits
    • Apply the filtration

Formalizing The Dataset

    • Two variants – just-in-time and line-level defect prediction
    • For each, two types of splits – random and timewise
    • Training splits maintain near 1:1 ratio, and validation and test splits maintain the original distribution

5 of 11

Takeaway

  • Size: To the best of our knowledge, Defectors is the largest defect prediction dataset
  • Class Balance: Maintains a near 1:1 ratio
  • Diversity in Application: 24 projects from 18 application domains and 24 organizations
  • Diversity in Platform: Based on Python projects

6 of 11

Thank You!

Questions?

7 of 11

Dataset Construction

Project Selection

  • Ensure > 2000 PRs
  • Find bug-related labels (e.g., bug)
  • Find PRs with such labels
    • 11 repos with > 100 PRs
  • Find issues with such labels
    • 12 repos with > 100 issues
  • Find bug-fix patterns in the latest 200 PR titles
    • 2 repos
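One way to read the selection steps above is as a cascade: a repo is first checked for labeled PRs, then for labeled issues, then for a PR title pattern. The sketch below encodes that reading. The cascade order, the function name `selection_strategy`, and its parameters are assumptions made for illustration; only the thresholds (2000 PRs, 100 labeled PRs, 100 labeled issues) come from the slide.

```python
from typing import Optional


def selection_strategy(
    total_prs: int,
    labeled_prs: int,
    labeled_issues: int,
    has_title_pattern: bool,
) -> Optional[str]:
    """Return how bug-fix PRs would be identified for a repo, or None to skip it.

    Hypothetical helper: the order of the checks is an assumed reading
    of the slide, not a confirmed procedure.
    """
    if total_prs <= 2000:          # repo must have more than 2000 PRs
        return None
    if labeled_prs > 100:          # enough PRs carry a bug-related label
        return "labeled_pr"
    if labeled_issues > 100:       # enough issues carry a bug-related label
        return "labeled_issue"
    if has_title_pattern:          # pattern found in the latest 200 PR titles
        return "pr_title_pattern"
    return None
```

For example, a repo with 5000 PRs, 40 labeled PRs, and 300 labeled issues would fall into the labeled-issue group under this reading.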

Bug-fix Commit Collection

  • Collect the PRs' merge commits
  • 51K bug-fix commits from 25 projects

8 of 11

Dataset Construction Contd.

Bug Inducing Commit Collection

  • Collect the bug-inducing commits using SZZ algorithm [1]
  • use an implementation of the SZZ algorithm by the PyDriller tool [2]
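A minimal sketch of this step, assuming PyDriller 2.x is installed. `Git.get_commit` and `Git.get_commits_last_modified_lines` are PyDriller calls; the helper names `collect_blame` and `link_inducing_to_fixes`, and the shape of the intermediate dictionaries, are illustrative assumptions rather than the authors' code.

```python
from collections import defaultdict


def collect_blame(repo_path: str, fix_hashes: list[str]) -> dict[str, dict[str, set[str]]]:
    """For each bug-fix commit, ask PyDriller which earlier commits last
    touched the lines that the fix modified (the core SZZ step)."""
    from pydriller import Git  # third-party; imported lazily

    git = Git(repo_path)
    blame_by_fix = {}
    for fix_hash in fix_hashes:
        fix_commit = git.get_commit(fix_hash)
        # Maps each modified file path to the set of commit hashes blamed for it
        blame_by_fix[fix_hash] = git.get_commits_last_modified_lines(fix_commit)
    return blame_by_fix


def link_inducing_to_fixes(
    blame_by_fix: dict[str, dict[str, set[str]]]
) -> dict[str, set[str]]:
    """Invert the blame result: candidate bug-inducing commit -> fixes linked to it."""
    links: dict[str, set[str]] = defaultdict(set)
    for fix_hash, blamed_files in blame_by_fix.items():
        for inducing_hashes in blamed_files.values():
            for inducing_hash in inducing_hashes:
                links[inducing_hash].add(fix_hash)
    return dict(links)
```

The inverted mapping is convenient for later filtering, since it exposes how many bug-fix commits each candidate bug-inducing commit is linked to.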

Bug Inducing Commits Filtration

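  • By the number of linked bug-inducing commits
  • By the number of linked bug-fix commits
  • By the size of changed code
  • By file type
  • By the nature of change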

[1] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–5, 2005.

[2] D. Spadini, M. Aniche, and A. Bacchelli, “PyDriller: Python framework for mining software repositories,” in Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2018, pp. 908–911.

9 of 11

Dataset Construction Contd.

Collecting and Sampling Defect-free Commits

  • For each project, sample defect-free commits with a 95% confidence level and a 5% margin of error
  • If this sample size is less than the number of defective commits, then increase the sample size to achieve parity
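The sample size for a 95% confidence level and a 5% margin of error can be sketched with Cochran's formula plus a finite population correction. Using that particular formula, the conservative proportion p = 0.5, and the function names below are assumptions; the slide states only the confidence level, the margin of error, and the parity rule.

```python
import math


def sample_size(population: int, z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction.

    z = 1.96 corresponds to a 95% confidence level; p = 0.5 is the
    most conservative choice (an assumption here).
    """
    if population <= 0:
        return 0
    n0 = (z ** 2) * p * (1 - p) / margin ** 2      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)           # finite population correction
    return min(population, math.ceil(n))


def defect_free_sample_size(n_defect_free: int, n_defective: int) -> int:
    """Grow the statistical sample to match the defective count when needed,
    capped by the number of defect-free commits available."""
    return min(n_defect_free, max(sample_size(n_defect_free), n_defective))
```

With 10,000 defect-free commits the statistical sample is 370; if the project has 1,200 defective commits, the sample grows to 1,200 to reach parity.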

Formalizing The Dataset

  • Two variants – just-in-time and line-level defect prediction
  • For each, two types of splits – random and timewise
  • Training splits maintain near 1:1 ratio, and validation and test splits maintain the original distribution
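The two split types can be sketched as follows. The `Commit` record, the holdout size parameter, and balancing the training pool by downsampling the larger class are assumptions for illustration; the slide does not say how the near 1:1 ratio is obtained or how large each split is.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Commit:
    """Hypothetical minimal record for one labeled commit."""
    sha: str
    date: datetime
    defective: bool


def timewise_split(commits: list[Commit], n_holdout: int) -> tuple[list[Commit], list[Commit]]:
    """The latest n_holdout commits are held out; earlier ones form the training pool."""
    ordered = sorted(commits, key=lambda c: c.date)
    cut = len(ordered) - n_holdout
    return ordered[:cut], ordered[cut:]


def random_split(commits: list[Commit], n_holdout: int, seed: int = 0) -> tuple[list[Commit], list[Commit]]:
    """Hold out n_holdout commits chosen uniformly at random."""
    shuffled = list(commits)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def balance(pool: list[Commit], seed: int = 0) -> list[Commit]:
    """Downsample the larger class so the training pool is 1:1 (assumed strategy)."""
    defective = [c for c in pool if c.defective]
    clean = [c for c in pool if not c.defective]
    k = min(len(defective), len(clean))
    rng = random.Random(seed)
    return rng.sample(defective, k) + rng.sample(clean, k)
```

Only the training pool is passed through `balance`; the held-out commits are left untouched so that validation and test keep the original class distribution.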
