1 of 11

Defectors: A Large, Diverse Python Dataset for Defect Prediction

Parvez Mahbub, Dalhousie University, parvezmrobin@dal.ca
Ohiduzzaman Shuvo, Dalhousie University, oh599627@dal.ca
Mohammad Masudur Rahman, Dalhousie University, masud.rahman@dal.ca

2 of 11

WHs

Why

Most existing datasets used in defect prediction are not large enough
Datasets also suffer from class imbalance, containing only 5%-26% defective instances

What

Defectors: a large-scale dataset containing both source code and its changes
24 popular Python projects
18 domains
≈ 213K source code documents
≈ 93K defective
≈ 120K defect-free
Near 1:1 ratio in training set

3 of 11

How do we do it?

Dataset Construction

4 of 11

Project Selection

25 mature and maintained repos
11 by labeled PR
12 by labeled issue
2 by PR title pattern

Bug-fix Commit Collection

≈ 51K bug-fix commits

Bug Inducing Commit Collection

using SZZ algorithm
implementation by the PyDriller tool

Bug Inducing Commits Filtration

By The Number of Linked Bug Inducing Commits
By The Number of Linked Bug-fix Commits
By The Size of Changed Code
By File Type
By The Nature of Change
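Three of the filters above can be sketched as a single predicate. This is only an illustration: the slides do not give thresholds, so `max_linked_fixes` and `max_changed_lines` are hypothetical parameters the caller must choose, the `Candidate` record is an assumed schema, and treating the limits as upper bounds is an assumption. The filters on linked bug-inducing commits and on the nature of the change are left out because the slides do not describe their criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical record for one candidate bug-inducing commit."""
    sha: str
    linked_fix_count: int   # number of bug-fix commits that point to this commit
    changed_lines: int      # total added + deleted lines
    files: List[str]        # paths touched by the commit


def keep_candidate(c: Candidate, *, max_linked_fixes: int, max_changed_lines: int) -> bool:
    """Return True if the candidate passes the three sketched filters."""
    # Filter by the number of linked bug-fix commits (assumed upper bound).
    if c.linked_fix_count > max_linked_fixes:
        return False
    # Filter by the size of the changed code (assumed upper bound).
    if c.changed_lines > max_changed_lines:
        return False
    # Filter by file type: keep the commit only if it touches Python source.
    return any(path.endswith(".py") for path in c.files)
```

Keeping the thresholds as required keyword arguments avoids implying any particular value was used to build the dataset.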

Sampling Defect-free Commits

Sample defect-free commits with a 95% confidence level and a 5% margin of error
Take at least as many defect-free commits as bug-inducing commits
Apply the filtration
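The sample size implied by a 95% confidence level and a 5% margin of error can be worked out as follows. The slides do not name the formula, so this sketch assumes Cochran's formula with a finite population correction and the most conservative proportion `p = 0.5`; the function name and the `population` argument are illustrative.

```python
import math


def sample_size(population: int, z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Minimum sample size for a finite population (assumed: Cochran's formula).

    z = 1.96 corresponds to a 95% confidence level; margin = 0.05 is the 5% margin of error.
    """
    # Sample size for an infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)
```

For example, `sample_size(1000)` gives 278, and the result levels off at 385 as the population grows, which is why the step that takes at least as many defect-free commits as defective ones matters for large projects.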

Formalizing The Dataset

Two variants – just-in-time and line-level defect prediction
For each, two types of splits – random and timewise
Training splits maintain near 1:1 ratio, and validation and test splits maintain the original distribution
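A timewise split with a balanced training set might look like the sketch below. The record schema (`time`, `defective`), the 80/10/10 fractions, and balancing by downsampling the larger class are all assumptions made for illustration; the slides only state that training splits are near 1:1 and that validation and test splits keep the original distribution.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, object]  # hypothetical schema: {"time": int, "defective": bool}


def timewise_split(
    records: List[Record], train_frac: float = 0.8, val_frac: float = 0.1
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split chronologically so that training data precedes validation and test data."""
    ordered = sorted(records, key=lambda r: r["time"])
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_val],
        ordered[n_train + n_val:],
    )


def balance(train: List[Record], seed: int = 0) -> List[Record]:
    """Downsample the larger class so the training split is 1:1 (assumed strategy)."""
    rng = random.Random(seed)
    defective = [r for r in train if r["defective"]]
    clean = [r for r in train if not r["defective"]]
    k = min(len(defective), len(clean))
    balanced = rng.sample(defective, k) + rng.sample(clean, k)
    rng.shuffle(balanced)
    return balanced
```

Only the training split is passed through `balance`; the validation and test splits are used as returned so that evaluation reflects the original class distribution. A random split would replace the sort with a shuffle.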

5 of 11

Takeaway

Size: To the best of our knowledge, Defectors is the largest defect prediction dataset
Class Balance: Maintains a near 1:1 ratio
Diversity in Application: 24 projects from 18 application domains and 24 organizations
Diversity in Platform: Based on Python projects

6 of 11

Thank You!
Questions?

https://parvezmrobin.dev

7 of 11

Dataset Construction

Project Selection

Ensure > 2000 PRs
Find bug-related labels (e.g., bug)
Find PRs with such labels

11 repos with > 100 PRs

Find issues with such labels

12 repos with > 100 issues

Find patterns in the latest 200 PR titles

2 repos
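The selection checks above can be sketched as a small classifier over pull requests. The PR dictionary shape and the title regex are hypothetical; the slides only mention bug-related labels (e.g., bug), title patterns, and the threshold of more than 100 matching PRs or issues.

```python
import re
from typing import Dict, List

# "bug" is the example label from the slides; matched as a whole word.
BUG_LABEL = re.compile(r"\bbug\b", re.IGNORECASE)
# Hypothetical title pattern; real projects follow their own conventions.
FIX_TITLE = re.compile(r"^\s*(fix(es|ed)?|bugfix)\b", re.IGNORECASE)


def is_bug_fix_pr(pr: Dict) -> bool:
    """Return True if a PR looks like a bug fix by label or by title pattern."""
    if any(BUG_LABEL.search(label) for label in pr.get("labels", [])):
        return True
    return bool(FIX_TITLE.search(pr.get("title", "")))


def qualifies(prs: List[Dict], min_bug_prs: int = 100) -> bool:
    """A repo qualifies when it has more than `min_bug_prs` bug-fix PRs."""
    return sum(is_bug_fix_pr(pr) for pr in prs) > min_bug_prs
```

The same predicate could be applied to issues instead of PRs for the repos selected by labeled issue.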

Bug-fix Commit Collection

Collect the PRs' merge commits
≈ 51K bug-fix commits from 25 projects

8 of 11

Dataset Construction Contd.

Bug Inducing Commit Collection

Collect the bug-inducing commits using the SZZ algorithm [1]
Use the implementation of the SZZ algorithm provided by the PyDriller tool [2]
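A minimal sketch of this step, assuming PyDriller 2.x, where `Git.get_commits_last_modified_lines` returns, for each file changed by a commit, the hashes of the commits that last touched the modified lines. The function names, `repo_path`, and `fix_hash` are illustrative and not taken from the authors' pipeline.

```python
from typing import Dict, Set


def merge_candidates(last_modified: Dict[str, Set[str]]) -> Set[str]:
    """Flatten the per-file mapping into one set of candidate commit hashes."""
    candidates: Set[str] = set()
    for hashes in last_modified.values():
        candidates.update(hashes)
    return candidates


def bug_inducing_commits(repo_path: str, fix_hash: str) -> Set[str]:
    """Return candidate bug-inducing commits for one bug-fix commit."""
    # Imported lazily so merge_candidates works without PyDriller installed.
    from pydriller import Git  # assumes PyDriller 2.x

    repo = Git(repo_path)
    fix_commit = repo.get_commit(fix_hash)
    # Maps each modified file path to the commits that last changed the affected lines.
    last_modified = repo.get_commits_last_modified_lines(fix_commit)
    return merge_candidates(last_modified)
```

Calling `bug_inducing_commits` once per bug-fix commit yields the raw candidates, which still need the filtration that follows.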

Bug Inducing Commits Filtration

[1] J. Śliwerski, T. Zimmermann, and A. Zeller, “When do changes induce fixes?” ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–5, 2005.

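[2] D. Spadini, M. Aniche, and A. Bacchelli, “PyDriller: Python framework for mining software repositories,” in Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2018, pp. 908–911.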

9 of 11