1
Estimating Source Code Defectiveness: Linking defect reports with programming constructs usage metrics.
Dr. Ritu Kapur
Prof. Balwinder Sodhi
Applied and Contemporary Software Engineering (ACSE) Lab
Indian Institute of Technology (IIT) Ropar
Published in ACM Transactions on Software Engineering and Methodology (TOSEM) 2020.
Problem Scenario
[1]. Malheiros, Yuri, et al. "A source code recommender system to support newcomers." Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual. IEEE, 2012.
[2]. Leonard‐Barton, Dorothy. "Core capabilities and core rigidities: A paradox in managing new product development." Strategic management journal 13.S1 (1992): 111-125.
[3]. Dehaghani, Sayed Mehdi Hejazi, and Nafiseh Hajrahimi. "Which factors affect software projects maintenance cost more?." Acta Informatica Medica 21.1 (2013): 63.
2
Defectiveness: Defining the concept in source code
3
Poor or inefficient coding style
Better or efficient coding style
Defects related to non-functional aspects: Examples
4
Hypothesis
5
PROCON Metrics
6
Program 1 and Program 2 columns list the per-fragment measurements; max, min, avg, and stdDev are the resulting PROCON metrics' values.
Programming Construct | Lexical Property | Program 1 | Program 2 | max | min | avg | stdDev |
if | Count | 2 | 1 | 2 | 0 | 1.5 | 0.71 |
if | Depth | 9, 11 | 9 | 11 | 0 | 9.66 | 1.155 |
if | Length | 79, 118 | 65 | 118 | 0 | 87.3 | 27.465 |
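The metrics' values in the table are plain aggregates over the per-fragment measurements. A minimal sketch of that aggregation step (illustrative only; the paper's pipeline derives the measurements from ANTLR-generated ASTs):

```python
import statistics

def procon_aggregates(values):
    """Aggregate one lexical property (e.g. per-fragment `if` counts)
    into the four PROCON summary values."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": round(statistics.mean(values), 2),
        "stdDev": round(statistics.stdev(values), 2),
    }

# `if` counts of Program 1 and Program 2 from the table above:
agg = procon_aggregates([2, 1])
print(agg)  # avg 1.5 and stdDev 0.71 match the table's `if`/Count row
```

(The table's min of 0 suggests that fragments not containing the construct also contribute zero-valued measurements; that detail is omitted in this sketch.)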
Program Fragment 1
Program Fragment 2
ANTLR generated Abstract Syntax Tree (AST)
7
Defect Estimation Scenarios
8
Broad Idea of our approach
9
Details of source files present in various datasets
10
Language | Total file count | Total bug-linked file count | Files without defect information | Total files for training the model |
C | 11718 | 3224 | 3468 | 6448 (=3224*2) |
C++ | 7202 | 174 | 6827 | 348 (=174*2) |
Java | 7500 | 318 | 7076 | 636 (=318*2) |
Python | 4023 | 1440 | 2048 | 2750 (=1375*2) |
Details of Defect reports
11
Performance Evaluation: Detecting the defectiveness
12
ML model
Key Inference from Defect Estimation experiments
13
Comparison with the State-of-the-art methods: at Approach level
14
Comparison with the State-of-the-art methods: at Dataset level
15
Tool: Defect Estimator for Source Code (DESCo)
16
Source file: https://github.com/apache/httpd/blob/trunk/server/request.c, Defect report: https://bz.apache.org/bugzilla/show_bug.cgi?id=45187
Conclusion
17
Publications out of this work
18
19
Thank You!
Questions?
@DrRituKapur
dev.ritu.kapur@gmail.com
dr-ritu-kapur-36174454
https://sites.google.com/view/ritu-kapur
Awards and Achievements
20
Problem #2: Improving the performance of CRUSO
21
Problem #2: Improving the performance of CRUSO
22
[1]. Sharma, Shipra, and Balwinder Sodhi. "Using Stack Overflow content to assist in code review." Software: Practice and Experience 49.8 (2019): 1255-1277.
Tool Input Interface
23
Tool Output Interface
24
CRUSO-P detects semantically similar code fragments
25
CRUSO-P detects semantically similar code fragments (cont.)
26
Basic tenets behind our system
27
Existing methods for representing source code
28
Proposed System architecture
29
GitHub source files and SO posts considered
30
Training corpus (lines of code measured by cloc): Files, Blank, Comment, SLOC. Testing corpus: Test file pairs, Relevant corpus size, Models tested.
Language | Files | Blank | Comment | SLOC | Test file pairs | Relevant corpus size | Models tested |
C | 32099 | 2908784 | 2490163 | 14908295 | 5000 | 30036 | 21 |
C# | 8112 | 303416 | 198693 | 2342959 | 5000 | 7076 | 21 |
Java | 142266 | 437851 | 659172 | 2157881 | 5000 | 127568 | 21 |
JavaScript | 15737 | 177587 | 226724 | 1259902 | 5000 | 12485 | 21 |
Python | 6012 | 300109 | 412452 | 1248494 | 5000 | 5378 | 20 |
Language | Max(μ) | Min(μ) | Avg(μ) | StdDev(μ) | Count of question posts |
C | 2880 | -24 | 1.4619 | 18.9996 | 43306 |
C# | 1463 | -17 | 1.9487 | 10.0811 | 219821 |
Java | 23665 | -31 | 1.7041 | 45.9577 | 294629 |
JavaScript | 2689 | -31 | 1.2953 | 9.3526 | 245863 |
Python | 1738 | -17 | 2.4483 | 13.3160 | 100109 |
Language | Max(μ) | Min(μ) | Avg(μ) | StdDev(μ) | Count of answer posts |
C | 1803 | -28 | 2.715 | 27.767 | 9788 |
C# | 1197 | -9 | 2.645 | 11.5883 | 60505 |
Java | 30924 | -8 | 2.9954 | 120.601 | 67450 |
JavaScript | 988 | -8 | 1.9928 | 12.0559 | 66096 |
Python | 776 | -6 | 3.2657 | 12.9174 | 25167 |
Experimental Summary
31
Experiment | Research question addressed | Major findings |
Experiment #1 | Performance in detecting source code similarity | High accuracy across the five considered languages |
Experiment #2 | Performance comparison of CRUSO-P with CRUSO | Improvement in response time, storage, and accuracy |
Experiment #1: Performance in detecting source code similarity
32
Language | Number of repositories | Avg. Number of Commits | Avg. Number of Contributors | Avg. Number of Releases | Avg. Number of Stars | Total Number of source files | Count of randomly selected source files |
Java | 30 | 18905.7 | 357.7 | 188 | 6697.83 | 142266 | 4000 |
Python | 9 | 107621 | 494.6 | 270.8 | 8445.17 | 6012 | 4000 |
C | 8 | 105013.8 | 536.8 | 266.4 | 9966.67 | 32099 | 4000 |
C# | 10 | 19526.1 | 521.6 | 226.7 | 10125.67 | 6000 | 4000 |
JavaScript | 7 | 15648.8 | 386.4 | 185.7 | 6588.7 | 15000 | 4000 |
Columns grouped as: with no transformation (Avg(𝛂), ACC, F1), with function relocation (Avg(𝛂tf1), ACC, F1), with variable renaming (Avg(𝛂tf2), ACC, F1).
Language | Avg(𝛂) | ACC | F1 | Avg(𝛂tf1) | ACC | F1 | Avg(𝛂tf2) | ACC | F1 |
Java | 0.99 | 0.991 | 0.991 | 0.950 | 0.995 | 0.880 | 0.818 | 0.994 | 0.827 |
Python | 0.995 | 0.993 | 0.9798 | 0.990 | 0.994 | 0.846 | 0.907 | 0.995 | 0.862 |
C | 0.994 | 0.994 | 0.992 | 0.956 | 0.997 | 0.922 | 0.923 | 0.996 | 0.886 |
C# | 0.988 | 0.988 | 0.700 | 0.959 | 0.995 | 0.856 | 0.7745 | 0.992 | 0.785 |
JavaScript | 0.984 | 0.993 | 0.818 | 0.951 | 0.994 | 0.852 | 0.755 | 0.993 | 0.797 |
Additional Experiment: Do PVA input parameters affect the performance?
33
Experiment #1: Performance Evaluation of CRUSO-P
34
Thresholds: Avg(𝛂), StdDev(𝛂); Performance: ACC, F1 score.
Language | Avg(𝛂) | StdDev(𝛂) | ACC | F1 score | Number of posts | Avg. time taken (in seconds) |
C | 0.963 | 0.0704 | 0.992 | 0.992 | 5000 | 402 |
C# | 0.954 | 0.0979 | 0.8559 | 0.8365 | 5000 | 413 |
Java | 0.97 | 0.0668 | 0.993 | 0.993 | 5000 | 389 |
JavaScript | 0.967 | 0.0719 | 0.8766 | 0.8612 | 5000 | 451 |
Python | 0.9617 | 0.0764 | 0.991 | 0.9909 | 5000 | 368 |
Experiment #2: Comparison of CRUSO-P with CRUSO
35
Tool | Response time (s): C | C# | Java | JavaScript | Python | Avg. response time (s) | Storage (MB) | Accuracy |
CRUSO-P | 1.09 | 13.15 | 11.47 | 4.35 | 1.35 | 6.28 | 121.53 | 99.3% |
CRUSO | 284.74 | 291.09 | 289.15 | 281.81 | 292.8 | 287.92 | 14239 | 94% |
Defect Estimator for Source Code -- DESCo
36
[1]. Ritu Kapur and Balwinder Sodhi. A Defect Estimator for Source code: Linking defect reports with programming constructs usage metrics. ACM Transactions on Software Engineering and Methodology (TOSEM) 29.2 (2020): 1-35.
[2]. Wang, Song, Taiyue Liu, and Lin Tan. "Automatically learning semantic features for defect prediction." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.
Evaluation Metrics
37
Performance Evaluation: Estimating the defect properties
38
ML model
Performance Evaluation: Estimating the defect properties (Cont.)
39
Abstract Software categories derived from MavenCentral
40
Paragraph Vectors Algorithm
41
[1]. Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
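As a one-formula summary of the cited algorithm [1]: the distributed-memory variant (PV-DM) trains a paragraph vector d jointly with word vectors to maximize the average log-probability of predicting each word from its context window and its paragraph (notation follows Le and Mikolov; k is the context radius):

\[ \frac{1}{T} \sum_{t=k}^{T-k} \log p\left(w_t \mid d,\, w_{t-k}, \ldots, w_{t+k}\right) \]

where p is a softmax over the vocabulary computed from the concatenated (or averaged) paragraph and word vectors.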
Correlation metrics
42
[1]. Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing, pages 1–4. Springer, 2009.
[2]. Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
[3]. Wayne W Daniel. The spearman rank correlation coefficient. Biostatistics: A Foundation for Analysis in the Health Sciences, 1987.
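Of the three coefficients cited above, Pearson's r [1] operates on the raw values (Kendall's τ and Spearman's ρ operate on ranks instead) and is straightforward to compute directly; a small pure-Python sketch:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear samples give r = +1; inversely linear give r = -1:
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # → 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # → -1.0
```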
Effect of SDEE metrics on the overall development effort
43
ANTLR generated Abstract Syntax Tree (AST)
44
Results and Observations: Magnitude of Relative Errors
45
SDEE tool
46
Comparison of Similarity detection methods
47
[1]. Trstenjak, Bruno, Sasa Mikac, and Dzenana Donko. "KNN with TF-IDF based framework for text categorization." Procedia Engineering 69 (2014): 1356-1364.
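The TF-IDF baseline [1] represents each document as a weighted term vector and compares vectors by cosine similarity; a toy sketch (the whitespace tokenizer and plain log IDF are simplifying assumptions, not the cited framework):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weight vectors for a list of whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    return [{t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["for i in range", "for j in range", "printf hello world"]
vecs = tfidf_vectors(docs)
# The two loop snippets share more weighted terms than the printf snippet:
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```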
Relational Schema of the SDEE dataset
48
Software description similarity detection model
49
Measure | Value (in %) |
Accuracy | 99.3 |
Precision | 99.24 |
Recall | 98.62 |
F1 Score | 99.3 |
ROC area | 98.62 |
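The measures tabulated above follow the standard confusion-matrix definitions; a small sketch with illustrative counts (not the experiment's actual confusion matrix):

```python
def classification_measures(tp, fp, fn, tn):
    """Standard confusion-matrix measures, returned as percentages."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {name: round(value * 100, 2)
            for name, value in [("accuracy", accuracy), ("precision", precision),
                                ("recall", recall), ("f1", f1)]}

# Hypothetical counts chosen only to illustrate the formulas:
print(classification_measures(tp=98, fp=1, fn=2, tn=99))
```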
Establishing the Mapping
50
[1]. Github: http://github.com/, accessed October 18, 2017.
[2]. Apache Bugzilla: https://bz.apache.org/bugzilla/ , accessed October 18, 2017.
Mapping open-source projects hosted in various Version Control Systems (VCSs), e.g. GitHub [1], to defect reports filed in various Open Bug Repositories (OBRs), e.g. Bugzilla [2].
For projects having both source code information at VCSs and bug information at OBRs, we link the preferences in choosing various programming constructs with the metadata information from bug reports.
Does a particular programming style have any effect on the quality of software?
Architecture of DESCo System
51
Problem Formulation: Selecting the best performing model
52
D: Dataset
L: Set of considered programming languages
𝜆: A specific programming language, s.t. 𝜆 ∈ L.
E: Set of evaluation metrics
𝚫: ML model
A: Set of ML algorithms
P: Set of input parameter combinations of A
𝛼: An ML algorithm, s.t. 𝛼 ∈ A.
𝜋: A parameter combination, s.t. 𝜋 ∈ P.
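The formulation above amounts to an exhaustive search: evaluate every (𝛼, 𝜋) pair on D under the metrics in E and keep the best-performing model 𝚫. A minimal sketch with hypothetical algorithm/parameter names and a single accuracy-style metric:

```python
from itertools import product

def select_best_model(algorithms, param_grid, evaluate):
    """Pick the (algorithm, params) pair that maximizes the evaluation score.

    `evaluate(algorithm, params)` is assumed to train on the dataset D and
    return a score such as cross-validated accuracy."""
    best = None
    for alg, params in product(algorithms, param_grid):
        score = evaluate(alg, params)
        if best is None or score > best[0]:
            best = (score, alg, params)
    return best

# Toy stand-in for actually training and evaluating on D:
scores = {("svm", "rbf"): 0.91, ("svm", "linear"): 0.88,
          ("rf", "100-trees"): 0.93, ("rf", "500-trees"): 0.92}
best = select_best_model(["svm", "rf"],
                         ["rbf", "linear", "100-trees", "500-trees"],
                         lambda a, p: scores.get((a, p), 0.0))
print(best)  # → (0.93, 'rf', '100-trees')
```

In practice each algorithm has its own parameter grid, as tabulated on the parameter-combinations slide; the flat grid here is only for brevity.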
Parameter Combinations of different ML techniques
53
Support Vector Machine (SVM)
54
Deep Belief Network
[1]. Wang, Song, Taiyue Liu, and Lin Tan. "Automatically learning semantic features for defect prediction." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.
55
[Figure: Deep Belief Network architecture. The input layer x encodes token vectors of source files (e.g. File1[ . . . if foo for bar . . . ] and File2[ . . . foo for if bar . . . ]); hidden layers h1 through hl-1 feed the final hidden layer hl, whose output layer yields the learned features f.]
Random Forest
56
Decision Tree Learning Algorithm
57
[Figure: Example decision tree. The root tests age: the <= 30 branch tests student? (no → no, yes → yes), the 31…40 branch predicts yes, and the > 40 branch tests credit rating? (excellent → no, fair → yes).]
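A tree like the one in this figure is equivalent to nested conditionals; a sketch (the attribute and class-label names follow the textbook "buys computer" example and are assumed here):

```python
def buys_computer(age, student, credit_rating):
    """Classify a customer by walking the example decision tree."""
    if age <= 30:
        return "yes" if student else "no"   # <= 30 branch: decided by student?
    elif age <= 40:
        return "yes"                        # 31…40 branch: always yes
    else:                                   # > 40 branch: decided by credit rating
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer(25, True, "fair"))        # → yes
print(buys_computer(45, False, "excellent"))  # → no
```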
Example: Decision Tree Learning Algorithm
58
Defining the PROCON metrics
59
Software Metric | Description |
maxXCount | maximum number of times a construct X is used in source code |
minXCount | minimum number of times a construct X is used in source code |
avgXCount | average number of times a construct X is used in source code |
stdDevXCount | standard deviation of number of times a construct X is used in source code |
maxXDepth | maximum depth at which a construct X is used in the AST of source code |
minXDepth | minimum depth at which a construct X is used in the AST of source code |
avgXDepth | average depth at which a construct X is used in the AST of source code |
stdDevXDepth | standard deviation of depth at which a construct X is used in the AST of source code |
maxXLength | maximum lexical length of the body of construct X used in a source file |
minXLength | minimum lexical length of the body of construct X used in a source file |
avgXLength | average lexical length of the body of construct X used in a source file |
stdDevXLength | standard deviation of the lexical length of the body of construct X used in a source file |
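The paper extracts these properties from ANTLR-generated ASTs; as an illustrative stand-in, Python's stdlib ast module can measure the count and depth of a construct in the same spirit:

```python
import ast

def construct_depths(source, construct=ast.If):
    """Collect the AST depth of every occurrence of `construct` in `source`."""
    depths = []

    def walk(node, depth):
        if isinstance(node, construct):
            depths.append(depth)
        for child in ast.iter_child_nodes(node):
            walk(child, depth + 1)

    walk(ast.parse(source), 0)
    return depths

code = """
def f(x):
    if x > 0:
        if x > 10:
            return 'big'
    return 'small'
"""
depths = construct_depths(code)
print(len(depths), depths)  # two `if` constructs, the nested one deeper
```

From such per-occurrence depths, the max/min/avg/stdDev metrics above follow by simple aggregation.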
Establishing the mapping
60
PROCON dataset builder
61
Parameter Combinations of different ML techniques
62
Relational schema of the SOPostsDB
63