1 of 63


Estimating Source Code Defectiveness: Linking defect reports with programming constructs usage metrics.

Prof. Balwinder Sodhi

sodhi@iitrpr.ac.in

Applied and Contemporary Software Engineering (ACSE) Lab

Indian Institute of Technology (IIT) Ropar

Published in ACM Transactions on Software Engineering and Methodology (TOSEM) 2020.

2 of 63

Problem Scenario

  • Software development requires significant involvement of experts
    • 40% of the software lifecycle cost relates to maintenance [3]
  • Dependency on experts
    • Experienced developers are required to train newcomers [1]
    • The cost of a skilled software developer is high
    • Shortcoming: the success of a software project is restricted by the skill set and experience of its team members [2]

[1]. Malheiros, Yuri, et al. "A source code recommender system to support newcomers." Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual. IEEE, 2012.

[2]. Leonard‐Barton, Dorothy. "Core capabilities and core rigidities: A paradox in managing new product development." Strategic management journal 13.S1 (1992): 111-125.

[3]. Dehaghani, Sayed Mehdi Hejazi, and Nafiseh Hajrahimi. "Which factors affect software projects maintenance cost more?." Acta Informatica Medica 21.1 (2013): 63.


3 of 63

Defectiveness - Defining the concept

  • Includes two significant aspects
    • The likelihood of finding defects in source code
    • The properties of the potential defects
      • Severity
      • Type
  • Focus: Non-Functional requirement aspects
    • such as quality, reliability, performance, etc.


[Figure: side-by-side code fragments contrasting a poor or inefficient coding style with a better, efficient one]

4 of 63

Defects related to Non-functional aspects: Examples

  • Performance: Batik inside of Eclipse using bridge memory heap causes crash [1]
  • Security: XML vulnerabilities in Python [2]
  • Accessibility: Unlocking a file owned by another user doesn't work [3]
  • Programming language compatibility: build.xml isn't forward compatible with Java 6 [4]
  • Maintainability: FAQ references third-party libraries that have been refactored or renamed [5]


5 of 63

Hypothesis

  • Programming constructs used while coding affect software quality

  • Programming Construct (PROCON) Metrics → a novel Source Code Representation


6 of 63

PROCON Metrics

  • Lexical properties associated with the usage of programming constructs
    • Count
    • Depth
    • Length
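
To make these three properties concrete, below is a minimal computation sketch. The actual pipeline extracts them from ANTLR-generated ASTs for C, C++, Java, and Python; this illustration instead uses Python's built-in ast module, for the `if` construct only.

```python
# Sketch: Count, Depth, and Length of the `if` construct in one file.
import ast
import statistics

source = """
def f(x, n):
    if x > 0:
        if x > n:
            return "big"
    return "small"
"""
tree = ast.parse(source)

depths, lengths = [], []
def walk(node, depth):
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.If):
            depths.append(depth + 1)   # Depth: AST level of the construct
            # Length: lexical length of the construct's source text
            lengths.append(len(ast.get_source_segment(source, child) or ""))
        walk(child, depth + 1)
walk(tree, 0)

count = len(depths)                    # Count: number of `if` occurrences
print("ifCount:", count)
print("maxIfDepth:", max(depths), "avgIfDepth:", statistics.mean(depths),
      "stdDevIfDepth:", statistics.stdev(depths) if count > 1 else 0.0)
print("avgIfLength:", statistics.mean(lengths))
```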


| Programming Construct | Lexical Property | Program 1 | Program 2 | max | min | avg | stdDev |
|---|---|---|---|---|---|---|---|
| if | Count | 2 | 1 | 2 | 0 | 1.5 | 0.71 |
| if | Depth | 9, 11 | 9 | 11 | 0 | 9.66 | 1.155 |
| if | Length | 79, 118 | 65 | 118 | 0 | 87.3 | 27.465 |

[Figure: Program Fragment 1 and Program Fragment 2, the source listings from which the above metric values are computed]

7 of 63

ANTLR-generated Abstract Syntax Tree (AST)


8 of 63

Defect Estimation Scenarios

  • Phase 1: Is the input source file likely to be defective or not?
  • Phase 2: What are the likely defect characteristics?
    • Specific priority or severity
      • e.g., critical, high, medium, or low
    • Specific type
      • e.g., enhancement or bug-fix
    • Specific operating system (OS)
    • Specific hardware
    • Specific level of user exposure
      • measured via the number of comments on the defect report
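
A minimal sketch of how these two phases can be chained, assuming X holds PROCON metric vectors (random placeholders below) and with illustrative classifier choices:

```python
# Sketch: Phase 1 (likely-to-be-defective?) gating Phase 2 (defect
# characteristics, e.g., severity). All data here are random placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 36))               # placeholder PROCON feature vectors
y_defective = rng.integers(0, 2, 200)   # Phase 1 labels
y_severity = rng.integers(0, 4, 200)    # Phase 2 labels (critical..low)

phase1 = SVC(kernel="rbf").fit(X, y_defective)
phase2_severity = LogisticRegression(max_iter=1000).fit(X, y_severity)

x_new = X[:1]
if phase1.predict(x_new)[0] == 1:       # defect likely -> characterize it
    print("Predicted severity class:", phase2_severity.predict(x_new)[0])
```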


9 of 63

Broad Idea of our approach

  • Major software artifacts
    • PROCON dataset
    • DESCo tool


10 of 63

Details of source files present in various datasets

| Language | Total file count | Total bug-linked file count | Files without defect information | Total files for training the model |
|---|---|---|---|---|
| C | 11718 | 3224 | 3468 | 6448 (= 3224 × 2) |
| C++ | 7202 | 174 | 6827 | 348 (= 174 × 2) |
| Java | 7500 | 318 | 7076 | 636 (= 318 × 2) |
| Python | 4023 | 1440 | 2048 | 2750 (= 1375 × 2) |

11 of 63

Details of Defect reports


12 of 63

Performance Evaluation: Detecting the defectiveness

  • Best result:
    • Python dataset trained using the supervised Deep Belief Network (DBN) classifier
    • MeanAcc value of 95.96% (with 𝛔 = 0.0022)
  • Suitable <ML technique, parameter> combinations: SVM with a radial kernel, LSVM, Logistic Regression, Gaussian, and supervised DBN classification models
  • Not advisable: the SVM and Nu-SVM classifiers with a poly kernel, and the MLP classifier
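
Each MeanAcc and 𝛔 pair above summarizes cross-validated accuracy scores. A minimal sketch of producing such numbers for several <ML technique, parameter> combinations, with X and y as random placeholders for PROCON features and defectiveness labels:

```python
# Sketch: obtaining MeanAcc and sigma per <ML technique, parameter>
# combination via 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X, y = rng.random((300, 36)), rng.integers(0, 2, 300)

candidates = {
    "SVM (radial kernel)": SVC(kernel="rbf"),
    "LSVM": LinearSVC(max_iter=5000),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: MeanAcc={scores.mean():.4f}, sigma={scores.std():.4f}")
```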


13 of 63

Key Inference from Defect Estimation experiments

  • Inference: different ML algorithms emerge as best-performing for predicting different types of defect characteristics.
    • For predicting medium-exposure defects:
      • ML models trained using LSVM on C++ source files
      • MeanAcc value of 77.7% (with 𝛔 = 0.1435)
    • For predicting defects annotated with a specific OS type:
      • SVM classifier trained on the Java dataset
      • MeanAcc value of 68.8% (with 𝛔 = 0.115)


14 of 63

Comparison with the State-of-the-art methods: at Approach level

  • DESCo outperforms the state-of-the-art approach
    • in 50 of 54 parameter combinations of the ML models
  • Highest MeanAcc value: 80.8% (𝛔 = 0.047) with an SVM classifier trained on the Java dataset
  • Improvement: 44.9% over the state-of-the-art approach (point 'AB' in the plot)

14

15 of 63

Comparison with the State-of-the-art methods: at Dataset level

  • Observations
    • DESCo with PROCON: highest MeanAcc of 84.47%
      • RF classifier with 100 estimators
    • Supervised DBN classifier with PROMISE: highest MeanAcc of 65.2%
      • Improvement of 19.46%
    • Supervised DBN classifier with the Semantic dataset: highest MeanAcc of 70.9%
      • Improvement of 29.9%


16 of 63

Tool: Defect Estimator for Source Code (DESCo)


17 of 63

Conclusion

  • Proposed PROCON metrics
    • To capture programming construct usage information present in various software artifacts
      • Such as VCS repositories and defect-reporting websites
  • Developed knowledge warehouses
    • By extracting relevant information from various online sources, specific to the considered software development tasks
    • Comprising:
      • ML models
      • The PROCON dataset
      • Software used for manipulating them
  • Built an expert system for defect prediction: DESCo
    • DESCo with PROCON: highest MeanAcc of 84.47%
    • Outperforms an existing defect prediction method
    • Defect prediction performance improves with the PROCON dataset


18 of 63

Publications out of this work

  1. Ritu Kapur and Balwinder Sodhi. "Towards a Knowledge Warehouse and Expert System for the Automation of SDLC tasks." 2019 IEEE/ACM International Conference on Software and System Processes (ICSSP). IEEE, 2019.
  2. Ritu Kapur and Balwinder Sodhi. A Defect Estimator for Source code: Linking defect reports with programming constructs usage metrics. ACM Transactions on Software Engineering and Methodology (TOSEM) 29.2 (2020): 1-35. (ESEC/FSE’21 Journal-First)


19 of 63


Thank You!

Questions?

@DrRituKapur

dev.ritu.kapur@gmail.com

dr-ritu-kapur-36174454

https://sites.google.com/view/ritu-kapur

20 of 63

Awards and Achievements

  • Awards
    • Received the Microsoft Research Travel Grant Award for presenting this work at ICSE/ICSSP 2019
    • Received the ACM SIGSOFT CAPS Award, which partially supported registration at ICSE/ICSSP 2019
  • Internship
    • Summer internship at the Institute of Software Engineering, University of Stuttgart, Germany


21 of 63

Problem #2: Improving the performance of CRUSO


22 of 63

Problem #2: Improving the performance of CRUSO

  • Proposed solution: CRUSO-P, a PVA (Paragraph Vectors Algorithm)-based alternative
    • CRUSO [1]: a code-review-assisting tool
      • Detects relevant StackOverflow (SO) posts using the Winnowing algorithm
      • Estimates defectiveness using the content of the matched SO posts
    • Central idea: replace the Winnowing algorithm with PVA
    • Cause of CRUSO's inefficiency: variable-length fingerprints require a large number of comparisons
      • Considerable memory usage and response time
    • Programming languages considered:
      • C, C#, Java, JavaScript, and Python


[1]. Sharma, Shipra, and Balwinder Sodhi. "Using Stack Overflow content to assist in code review." Software: Practice and Experience 49.8 (2019): 1255-1277.

23 of 63

Tool Input Interface

23

24 of 63

Tool Output Interface


25 of 63

CRUSO-P detects semantically similar code fragments


26 of 63

CRUSO-P detects semantically similar code fragments (contd.)

26

27 of 63

Basic tenets behind our system

  • Professional programmer-support forums such as StackOverflow contain useful information about problems faced by programmers when developing software
    • Code fragments
    • Associated Q/A discussions
  • If a given source code under review, c, is found to be sufficiently similar to a code fragment cp present in an SO post p
    • Then we can infer the defectiveness of c by analysing the Q/A discussions and metadata of post p


28 of 63

Existing methods for representing source code

  • Methods:
    • Fingerprint-based methods
      • Fingerprint: a compact collection of integers
      • Winnowing algorithm; hash-based algorithms → MD5 and SHA
    • Abstract syntax tree (AST)-based methods
      • Capture the syntactic details of programming construct usage
    • Software metrics-based methods
      • Capture source-code-specific information
  • Limitations:
    • Fingerprint-based methods:
      • High processing cost and storage requirements
      • Variable-length fingerprints of almost the same size as the source code
    • AST-based and software metrics-based methods:
      • Focus on {defective, unpredictable} outcomes
      • No work on detecting source code that is unlikely to be defective


29 of 63

Proposed System architecture

  • Preparing the SO posts database and code vectors
    • Extract code, text, and metadata fields from SO posts
    • Source code similarity detection model:
      • Train a PVA model using 188,200+ GitHub source files
    • Reference dataset:
      • PVA vector representations of the code fragments present in SO posts
  • Determining the defectiveness of a source code under review
    • Using the text sentiment and metadata fields of the matched posts
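
A minimal sketch of this flow, assuming gensim's Doc2Vec as the PVA implementation. The post IDs, code fragments, and query code below are invented for illustration; vector_size=10 and epochs=5 mirror the best-performing PVA settings reported later in this deck.

```python
# Sketch of the CRUSO-P flow: train a PVA (Doc2Vec) model, store one
# vector per SO code fragment (the "reference dataset"), then match the
# code under review against it by cosine similarity.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

so_fragments = {                      # hypothetical SO post code fragments
    "post_101": "total = 0\nfor i in range(n): total += a[i]",
    "post_202": "if x is None: raise ValueError",
}
corpus = [TaggedDocument(code.split(), [pid])
          for pid, code in so_fragments.items()]
model = Doc2Vec(corpus, vector_size=10, epochs=5, min_count=1)

review_code = "s = 0\nfor i in range(n): s += a[i]"
vec = model.infer_vector(review_code.split())

# Nearest reference vectors -> candidate posts whose Q/A discussions and
# metadata are then analysed to estimate defectiveness.
for post_id, similarity in model.dv.most_similar([vec], topn=2):
    print(post_id, round(similarity, 3))
```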


30 of 63

GitHub source files and SO posts considered

Training and testing corpus (lines of code measured by cloc):

| Language | Files | Blank | Comment | SLOC | Test file pairs | Relevant corpus size | Models tested |
|---|---|---|---|---|---|---|---|
| C | 32099 | 2908784 | 2490163 | 14908295 | 5000 | 30036 | 21 |
| C# | 8112 | 303416 | 198693 | 2342959 | 5000 | 7076 | 21 |
| Java | 142266 | 437851 | 659172 | 2157881 | 5000 | 127568 | 21 |
| JavaScript | 15737 | 177587 | 226724 | 1259902 | 5000 | 12485 | 21 |
| Python | 6012 | 300109 | 412452 | 1248494 | 5000 | 5378 | 20 |

StackOverflow question posts considered:

| Language | Max(μ) | Min(μ) | Avg(μ) | StdDev(μ) | Count of posts |
|---|---|---|---|---|---|
| C | 2880 | -24 | 1.4619 | 18.9996 | 43306 |
| C# | 1463 | -17 | 1.9487 | 10.0811 | 219821 |
| Java | 23665 | -31 | 1.7041 | 45.9577 | 294629 |
| JavaScript | 2689 | -31 | 1.2953 | 9.3526 | 245863 |
| Python | 1738 | -17 | 2.4483 | 13.3160 | 100109 |

StackOverflow answer posts considered:

| Language | Max(μ) | Min(μ) | Avg(μ) | StdDev(μ) | Count of posts |
|---|---|---|---|---|---|
| C | 1803 | -28 | 2.715 | 27.767 | 9788 |
| C# | 1197 | -9 | 2.645 | 11.5883 | 60505 |
| Java | 30924 | -8 | 2.9954 | 120.601 | 67450 |
| JavaScript | 988 | -8 | 1.9928 | 12.0559 | 66096 |
| Python | 776 | -6 | 3.2657 | 12.9174 | 25167 |

31 of 63

Experimental Summary


| Experiment | Research question addressed | Major findings |
|---|---|---|
| Experiment #1 | What is the highest accuracy achieved by CRUSO-P? Is CRUSO-P inclined to any specific programming language? | 99.2%, for the C programming language; no → uninclined |
| Experiment #2 | Performance comparison of CRUSO-P with CRUSO: response time, storage reduction, and accuracy | Improvements of 97.8% (response time), 99.15% (storage), and 5.6% (accuracy) |

32 of 63

Experiment #1: Performance in detecting source code similarity

  • Accuracy
    • With no transformation: > 98% → CRUSO-P performs well for all programming languages
    • With transformations: > 98% → CRUSO-P is resilient to source code obfuscation


| Language | Number of repositories | Avg. number of commits | Avg. number of contributors | Avg. number of releases | Avg. number of stars | Total number of source files | Count of randomly selected source files |
|---|---|---|---|---|---|---|---|
| Java | 30 | 18905.7 | 357.7 | 188 | 6697.83 | 142266 | 4000 |
| Python | 9 | 107621 | 494.6 | 270.8 | 8445.17 | 6012 | 4000 |
| C | 8 | 105013.8 | 536.8 | 266.4 | 9966.67 | 32099 | 4000 |
| C# | 10 | 19526.1 | 521.6 | 226.7 | 10125.67 | 6000 | 4000 |
| JavaScript | 7 | 15648.8 | 386.4 | 185.7 | 6588.7 | 15000 | 4000 |

(Column groups: with no transformation | with function relocation | with variable renaming.)

| Language | Avg(𝛂) | ACC | F1 | Avg(𝛂tf1) | ACC | F1 | Avg(𝛂tf2) | ACC | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Java | 0.99 | 0.991 | 0.991 | 0.950 | 0.995 | 0.880 | 0.818 | 0.994 | 0.827 |
| Python | 0.995 | 0.993 | 0.9798 | 0.990 | 0.994 | 0.846 | 0.907 | 0.995 | 0.862 |
| C | 0.994 | 0.994 | 0.992 | 0.956 | 0.997 | 0.922 | 0.923 | 0.996 | 0.886 |
| C# | 0.988 | 0.988 | 0.700 | 0.959 | 0.995 | 0.856 | 0.7745 | 0.992 | 0.785 |
| JavaScript | 0.984 | 0.993 | 0.818 | 0.951 | 0.994 | 0.852 | 0.755 | 0.993 | 0.797 |

33 of 63

Additional experiment: Do PVA input parameters affect the performance?

  • Performance
    • Improves with an increase in vector size
    • Declines with an increase in epoch count
    • Highest accuracy: 99.4% for the C programming language
      • at <#epochs, vector size> = <5, 10>


34 of 63

Experiment #1: Performance Evaluation of CRUSO-P

  • Procedure:
    • Test CRUSO-P on a collection of randomly selected SO posts
    • Record the accuracy and F1 score values
  • Observations:
    • Highest F1 score: 99.3%, for the Java programming language
    • Accuracy values above 85% for all the programming languages
      • CRUSO-P performs well across languages


(Avg(𝛂) and StdDev(𝛂) are the similarity thresholds; ACC and F1 measure performance.)

| Language | Avg(𝛂) | StdDev(𝛂) | ACC | F1 score | Number of posts | Avg. time taken (s) |
|---|---|---|---|---|---|---|
| C | 0.963 | 0.0704 | 0.992 | 0.992 | 5000 | 402 |
| C# | 0.954 | 0.0979 | 0.8559 | 0.8365 | 5000 | 413 |
| Java | 0.97 | 0.0668 | 0.993 | 0.993 | 5000 | 389 |
| JavaScript | 0.967 | 0.0719 | 0.8766 | 0.8612 | 5000 | 451 |
| Python | 0.9617 | 0.0764 | 0.991 | 0.9909 | 5000 | 368 |

35 of 63

Experiment #2: Comparison of CRUSO-P with CRUSO

  • Procedure:
    • Test CRUSO-P on a collection of randomly selected source files
    • Record the accuracy and F1 score values
  • Observations:
    • Avg. response time: 6.28 seconds
      • 97.82% speed improvement over CRUSO
    • The vectors table of CRUSO-P occupies 121.53 MB
      • 99.15% storage reduction over CRUSO
    • Highest accuracy of CRUSO-P: 99.3%
      • 5.6% improvement over CRUSO


(Per-language columns show response time in seconds.)

| Tool | C | C# | Java | JavaScript | Python | Avg. response time (s) | Storage (MB) | Accuracy |
|---|---|---|---|---|---|---|---|---|
| CRUSO-P | 1.09 | 13.15 | 11.47 | 4.35 | 1.35 | 6.28 | 121.53 | 99.3% |
| CRUSO | 284.74 | 291.09 | 289.15 | 281.81 | 292.8 | 287.92 | 14239 | 94% |

36 of 63

Defect Estimator for Source Code -- DESCo

  • An automated solution for predicting defects and their characteristics [1]
  • Most related work: Abstract Syntax Tree (AST)-based methods [2]
    • Token vectors capture programming construct occurrences
      • Structural and contextual information is preserved
    • ML models are trained on the token vectors
  • Novelty in our work:
    • Detection of the characteristics of defects
    • PROCON metrics
  • Existing gap: focus on mere occurrences of a subset of programming constructs


[1]. Ritu Kapur and Balwinder Sodhi. A Defect Estimator for Source code: Linking defect reports with programming constructs usage metrics. ACM Transactions on Software Engineering and Methodology (TOSEM) 29.2 (2020): 1-35.

[2]. Wang, Song, Taiyue Liu, and Lin Tan. "Automatically learning semantic features for defect prediction." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.

37 of 63

Evaluation Metrics

  • F1 score: measures the performance of a model
  • Receiver Operating Characteristic (ROC) curve area
    • Measures the quality of the output
    • A larger area under the curve indicates better-quality output
    • Maximum: 1
    • Minimum: 0
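
Both metrics are standard; a minimal sketch of computing them with scikit-learn on toy labels (illustrative values only):

```python
# Sketch: the two evaluation metrics used in this work.
from sklearn.metrics import f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1]               # ground-truth defectiveness
y_pred = [1, 0, 1, 0, 0, 1]               # hard predictions -> F1 score
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7]  # predicted probabilities -> ROC area

print("F1 score:", round(f1_score(y_true, y_pred), 3))
print("ROC area:", round(roc_auc_score(y_true, y_score), 3))
```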


38 of 63

Performance Evaluation: Estimating the defect properties


39 of 63

Performance Evaluation: Estimating the defect properties (Cont.)


40 of 63

Abstract Software categories derived from MavenCentral

  • Software Library
  • Software Utilities and Plugin
  • Software tool
  • Software metrics
  • Software driving engine
  • Software Framework
  • Software Middleware
  • Software Client
  • Software Server
  • Software Driver
  • Software File System
  • Others


41 of 63

Paragraph Vectors Algorithm

  • An unsupervised algorithm
  • Learns fixed-length feature representations from variable-length pieces of text
    • Such as sentences, paragraphs, and documents
  • Each document is represented by a dense vector
    • Trained to predict words in the document
  • Document vectors are unique among documents
    • Word vectors are shared


[1]. Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
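
A toy illustration of these properties using gensim's Doc2Vec implementation of PVA (the corpus and vector size are arbitrary):

```python
# Sketch: variable-length texts map to fixed-length document vectors;
# word vectors are shared across documents, document vectors are per-document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["the quick brown fox",
        "a quick brown dog",
        "paragraph vectors learn fixed length representations"]
corpus = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
model = Doc2Vec(corpus, vector_size=16, min_count=1, epochs=40)

print(len(model.dv[0]))                  # fixed-length document vector (16)
print(len(model.wv["quick"]))            # shared word vector (16)
print(model.infer_vector("another quick fox".split())[:4])  # unseen text
```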

42 of 63

Correlation metrics

  • Pearson Correlation[1]
    • Degree of linear correlation
  • Kendall rank Correlation[2]
    • Strength of dependence between two variables
  • Spearman rank Correlation[3]
    • Degree of association between two variables


[1]. Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing, pages 1–4. Springer, 2009.

[2]. Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[3]. Wayne W Daniel. The spearman rank correlation coefficient. Biostatistics: A Foundation for Analysis in the Health Sciences, 1987.
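
A minimal sketch of computing the three measures with scipy.stats, e.g., when relating an SDEE metric to development effort (the data points are invented):

```python
# Sketch: Pearson, Kendall, and Spearman correlation via scipy.
from scipy import stats

sloc_modifications = [120, 340, 560, 800, 1500]   # hypothetical SLOCm values
effort_hours = [35, 80, 150, 190, 420]            # hypothetical effort

print("Pearson r:   ", stats.pearsonr(sloc_modifications, effort_hours)[0])
print("Kendall tau: ", stats.kendalltau(sloc_modifications, effort_hours)[0])
print("Spearman rho:", stats.spearmanr(sloc_modifications, effort_hours)[0])
```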

43 of 63

Effect of SDEE metrics on the overall development effort


  • SLOC modifications (SLOCm) performed on a software repository vs. the effort expended in developing it
  • Skill factor of a developer vs. the effort expended in developing the software repository
  • Productivity of a developer vs. the effort expended in developing the software repository

44 of 63

ANTLR-generated Abstract Syntax Tree (AST)


45 of 63

Results and Observations: Magnitude of Relative Errors


46 of 63

SDEE tool


47 of 63

Comparison of Similarity detection methods


[1]. Trstenjak, Bruno, Sasa Mikac, and Dzenana Donko. "KNN with TF-IDF based framework for text categorization." Procedia Engineering 69 (2014): 1356-1364.

  • Evaluation criterion: percentage of correct predictions made by the model
  • Observation: improvement of 49.91% over the existing method [1]

48 of 63

Relational Schema of the SDEE dataset


49 of 63

Software description similarity detection model


| Measure | Value (in %) |
|---|---|
| Accuracy | 99.3 |
| Precision | 99.24 |
| Recall | 98.62 |
| F1 score | 99.3 |
| ROC area | 98.62 |

50 of 63

Establishing the Mapping


[1]. Github: http://github.com/, accessed October 18, 2017.

[2]. Apache Bugzilla: https://bz.apache.org/bugzilla/ , accessed October 18, 2017.

  • Mapping: open-source projects in various Version Control Systems (VCSs), e.g., GitHub [1] ↔ defect reports in various Open Bug Repositories (OBRs), e.g., Bugzilla [2]
    • Built for projects having both source code information at VCSs and bug information at OBRs
  • Links the preferences in choosing various programming constructs with the metadata information from bug reports
  • Underlying question: does a particular programming style have any effect on the quality of software?

51 of 63

Architecture of DESCo System


52 of 63

Problem Formulation: Selecting the best performing model

  • For a given defect estimation scenario 𝜓 and a given dataset D𝜆:


  • D: dataset
  • L: set of considered programming languages
  • 𝜆: a specific programming language, s.t. 𝜆 ∈ L
  • E: set of evaluation metrics
  • 𝚫: ML model
  • A: set of ML algorithms
  • P: set of input parameter combinations of A
  • 𝛼: an ML algorithm, s.t. 𝛼 ∈ A
  • 𝜋: a parameter combination, s.t. 𝜋 ∈ P
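
The selection formula itself appears only as a figure on the original slide; given the definitions above, it presumably picks the model built from the <algorithm, parameter> pair that maximizes the evaluation metrics for scenario 𝜓. A reconstruction, not the slide's verbatim formula:

```latex
\Delta_{\text{best}}(\psi, D_{\lambda}) =
  \operatorname*{arg\,max}_{\alpha \in A,\; \pi \in P}
  E\big(\Delta(\alpha, \pi, D_{\lambda}),\, \psi\big)
```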

53 of 63

Parameter Combinations of different ML techniques


54 of 63

Support Vector Machine (SVM)

  • A classification technique: find the model that best splits two classes

[Figure: points of two classes separated by a line/hyperplane with maximal margin]

  • Why is it the best split?
    • The distance between the points and the line is maximal
  • Formal definition:
    • The hyperplane that best splits the data
      • Because it is as far as possible from the support vectors
      • i.e., it maximizes the margin
  • A constrained optimization problem (formulated below)
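
For reference (not shown on the original slide), the standard hard-margin form of this constrained optimization problem, over training points x_i with labels y_i ∈ {−1, +1}, is:

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```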


55 of 63

Deep Belief Network

  • Definition:
    • Multiple layers of hidden units
    • Connections between the layers
    • No connections among the nodes within the same layer
  • A DBN can learn to probabilistically reconstruct its input
  • Example with programming constructs [1]

[1]. Wang, Song, Taiyue Liu, and Lin Tan. "Automatically learning semantic features for defect prediction." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.


[Figure: a DBN whose input layer x is fed token sequences from source files (File1 [ ... if foo for bar ... ], File2 [ ... foo for if bar ... ]); hidden layers h1 through hl-1 feed an output layer l that yields the learned features hl]

56 of 63

Random Forest

  • Definition:
    • Used for classification or regression
    • Divides the features into subsets
    • Training: constructs a multitude of decision trees
    • Testing: outputs the class that is the mode of the classes predicted by the individual trees
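
A minimal sketch matching the configuration used elsewhere in this deck ("RF classifier with 100 estimators"); X and y are random placeholders for PROCON features and labels:

```python
# Sketch: random forest classification with 100 trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X, y = rng.random((300, 36)), rng.integers(0, 2, 300)

rf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(rf.predict(X[:3]))   # each prediction is the mode of the trees' votes
```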


57 of 63

Decision Tree Learning Algorithm

  • A decision tree uses a tree-like graph or model of decisions
    • Each internal node represents a decision or a "test" on an attribute
    • Each branch represents an outcome of the test
    • Each leaf node represents a class label
    • Paths from the root to the leaves represent classification rules
  • Algorithm
    • Start: all training examples are at the root
    • Partitioning: training examples are partitioned based on selected attributes
    • Selecting test attributes: on the basis of a heuristic or statistical measure (e.g., information gain)
  • Example (see the tree sketched below) →


age?
├── <= 30 → student?
│     ├── no → no
│     └── yes → yes
├── 31…40 → yes
└── > 40 → credit rating?
      ├── excellent → no
      └── fair → yes

58 of 63

Example: Decision Tree Learning Algorithm

  • Classes
    • Class P: buys_computer = "yes"
    • Class N: buys_computer = "no"
    • Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
  • Finding the information gain of attributes
    • Gain(age) = Info(D) - Info_age(D)
    • Now, Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
    • So, Gain(age) = 0.940 - 0.694 = 0.246
    • Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048
    • So the attribute selected for the split → age
  • We keep splitting in this manner at each step
    • With age <= 30, we next decide among {income, student, credit_rating}
      • The student attribute achieves the maximum gain
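
The arithmetic above can be checked with a few lines of Python (a verification sketch, not part of the original deck):

```python
# Verifying the information-gain computation for the buys_computer example.
from math import log2

def info(*counts):
    """Entropy I(c1, c2, ...) of a class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_d = info(9, 5)                                        # 0.940
info_age = (5/14)*info(2, 3) + (4/14)*info(4, 0) + (5/14)*info(3, 2)
print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# -> 0.94 0.694 0.246, so `age` gives the largest gain and is split first
```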


59 of 63

Defining the PROCON metrics


| Software metric | Description |
|---|---|
| maxXCount | Maximum number of times a construct X is used in the source code |
| minXCount | Minimum number of times a construct X is used in the source code |
| avgXCount | Average number of times a construct X is used in the source code |
| stdDevXCount | Standard deviation of the number of times a construct X is used in the source code |
| maxXDepth | Maximum depth at which a construct X is used in the AST of the source code |
| minXDepth | Minimum depth at which a construct X is used in the AST of the source code |
| avgXDepth | Average depth at which a construct X is used in the AST of the source code |
| stdDevXDepth | Standard deviation of the depth at which a construct X is used in the AST of the source code |
| maxXLength | Maximum lexical length of the body of construct X used in the source code |
| minXLength | Minimum lexical length of the body of construct X used in the source code |
| avgXLength | Average lexical length of the body of construct X used in the source code |
| stdDevXLength | Standard deviation of the lexical length of the body of construct X used in the source code |


60 of 63

Establishing the mapping


61 of 63

PROCON dataset builder


62 of 63

Parameter Combinations of different ML techniques


63 of 63

Relational schema of the SOPostsDB
