1 of 63


Estimating Source Code Defectiveness: Linking defect reports with programming constructs usage metrics.

Prof. Balwinder Sodhi

sodhi@iitrpr.ac.in

Applied and Contemporary Software Engineering (ACSE) Lab

Indian Institute of Technology (IIT) Ropar

Published in ACM Transactions on Software Engineering and Methodology (TOSEM) 2020.

2 of 63

Problem Scenario

  • Software development requires significant involvement of experts
    • 40% of the software lifecycle cost relates to maintenance [3]
  • Dependency on experts
    • Experienced developers are required to train newcomers [1]
    • The cost of a skilled software developer is high
    • Shortcoming: the success of a software project is restricted by the skill set and experience of its team members [2]

[1]. Malheiros, Yuri, et al. "A source code recommender system to support newcomers." Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual. IEEE, 2012.

[2]. Leonard‐Barton, Dorothy. "Core capabilities and core rigidities: A paradox in managing new product development." Strategic management journal 13.S1 (1992): 111-125.

[3]. Dehaghani, Sayed Mehdi Hejazi, and Nafiseh Hajrahimi. "Which factors affect software projects maintenance cost more?." Acta Informatica Medica 21.1 (2013): 63.


3 of 63

Defectiveness - Defining the concept

  • Includes two significant aspects
    • The likelihood of finding defects in source code
    • The properties of the potential defects
      • Severity
      • Type
  • Focus: Non-Functional requirement aspects
    • such as quality, reliability, performance, etc.


[Figure: side-by-side code fragments contrasting a poor or inefficient coding style with a better, efficient one]

4 of 63

Defects related to Non-functional aspects: Examples

  • Performance: Batik inside of Eclipse using bridge memory heap causes crash [1]
  • Security: XML vulnerabilities in Python [2]
  • Accessibility: Unlocking a file owned by another user doesn't work [3]
  • Programming language compatibility: build.xml isn't forward compatible with Java 6 [4]
  • Maintainability: FAQ references third-party libraries that have been refactored or renamed [5]


5 of 63

Hypothesis

  • Programming constructs used while coding affect software quality

  • Programming Construct (PROCON) Metrics → a novel Source Code Representation


6 of 63

PROCON Metrics

  • Lexical properties associated with the usage of programming constructs
    • Count
    • Depth
    • Length
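
To make these three properties concrete, below is a minimal computation sketch. The actual pipeline extracts them from ANTLR-generated ASTs for C, C++, Java, and Python; this illustration instead uses Python's built-in ast module, for the `if` construct only.

```python
# Sketch: Count, Depth, and Length of the `if` construct in one file.
import ast
import statistics

source = """
def f(x, n):
    if x > 0:
        if x > n:
            return "big"
    return "small"
"""
tree = ast.parse(source)

depths, lengths = [], []
def walk(node, depth):
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.If):
            depths.append(depth + 1)   # Depth: AST level of the construct
            # Length: lexical length of the construct's source text
            lengths.append(len(ast.get_source_segment(source, child) or ""))
        walk(child, depth + 1)
walk(tree, 0)

count = len(depths)                    # Count: number of `if` occurrences
print("ifCount:", count)
print("maxIfDepth:", max(depths), "avgIfDepth:", statistics.mean(depths),
      "stdDevIfDepth:", statistics.stdev(depths) if count > 1 else 0.0)
print("avgIfLength:", statistics.mean(lengths))
```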


| Programming Construct | Lexical Property | Program 1 | Program 2 | max | min | avg | stdDev |
|---|---|---|---|---|---|---|---|
| if | Count | 2 | 1 | 2 | 0 | 1.5 | 0.71 |
| if | Depth | 9, 11 | 9 | 11 | 0 | 9.66 | 1.155 |
| if | Length | 79, 118 | 65 | 118 | 0 | 87.3 | 27.465 |

[Figure: Program Fragment 1 and Program Fragment 2, the source listings from which the above metric values are computed]

7 of 63

ANTLR-generated Abstract Syntax Tree (AST)


8 of 63

Defect Estimation Scenarios

  • Phase 1: Is the input source file likely to be defective or not?
  • Phase 2: What are the likely defect characteristics?
    • Specific priority or severity
      • e.g., critical, high, medium, or low
    • Specific type
      • e.g., enhancement or bug-fix
    • Specific operating system (OS)
    • Specific hardware
    • Specific level of user exposure
      • measured via the number of comments on the defect report
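
A minimal sketch of how these two phases can be chained, assuming X holds PROCON metric vectors (random placeholders below) and with illustrative classifier choices:

```python
# Sketch: Phase 1 (likely-to-be-defective?) gating Phase 2 (defect
# characteristics, e.g., severity). All data here are random placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 36))               # placeholder PROCON feature vectors
y_defective = rng.integers(0, 2, 200)   # Phase 1 labels
y_severity = rng.integers(0, 4, 200)    # Phase 2 labels (critical..low)

phase1 = SVC(kernel="rbf").fit(X, y_defective)
phase2_severity = LogisticRegression(max_iter=1000).fit(X, y_severity)

x_new = X[:1]
if phase1.predict(x_new)[0] == 1:       # defect likely -> characterize it
    print("Predicted severity class:", phase2_severity.predict(x_new)[0])
```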


9 of 63

Broad Idea of our approach

  • Major software artifacts
    • PROCON dataset
    • DESCo tool


10 of 63

Details of source files present in various datasets

| Language | Total file count | Total bug-linked file count | Files without defect information | Total files for training the model |
|---|---|---|---|---|
| C | 11718 | 3224 | 3468 | 6448 (= 3224 × 2) |
| C++ | 7202 | 174 | 6827 | 348 (= 174 × 2) |
| Java | 7500 | 318 | 7076 | 636 (= 318 × 2) |
| Python | 4023 | 1440 | 2048 | 2750 (= 1375 × 2) |

11 of 63

Details of Defect reports


12 of 63

Performance Evaluation: Detecting the defectiveness

  • Best result:
    • Python dataset trained using the supervised Deep Belief Network (DBN) classifier
    • MeanAcc value of 95.96% (with 𝛔 = 0.0022)
  • Suitable <ML technique, parameter> combinations: SVM with a radial kernel, LSVM, Logistic Regression, Gaussian, and supervised DBN classification models
  • Not advisable: the SVM and Nu-SVM classifiers with a poly kernel, and the MLP classifier
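
Each MeanAcc and 𝛔 pair above summarizes cross-validated accuracy scores. A minimal sketch of producing such numbers for several <ML technique, parameter> combinations, with X and y as random placeholders for PROCON features and defectiveness labels:

```python
# Sketch: obtaining MeanAcc and sigma per <ML technique, parameter>
# combination via 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X, y = rng.random((300, 36)), rng.integers(0, 2, 300)

candidates = {
    "SVM (radial kernel)": SVC(kernel="rbf"),
    "LSVM": LinearSVC(max_iter=5000),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: MeanAcc={scores.mean():.4f}, sigma={scores.std():.4f}")
```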


13 of 63

Key Inference from Defect Estimation experiments

  • Inference: different ML algorithms emerge as best-performing for predicting different types of defect characteristics.
    • For predicting medium-exposure defects:
      • ML models trained using LSVM on C++ source files
      • MeanAcc value of 77.7% (with 𝛔 = 0.1435)
    • For predicting defects annotated with a specific OS type:
      • SVM classifier trained on the Java dataset
      • MeanAcc value of 68.8% (with 𝛔 = 0.115)


14 of 63

Comparison with the State-of-the-art methods: at Approach level

  • DESCo outperforms the state-of-the-art approach
    • in 50 of 54 parameter combinations of the ML models
  • Highest MeanAcc value: 80.8% (𝛔 = 0.047) with an SVM classifier trained on the Java dataset
  • Improvement: 44.9% over the state-of-the-art approach (point 'AB' in the plot)

14

15 of 63

Comparison with the State-of-the-art methods: at Dataset level

  • Observations
    • DESCo with PROCON: highest MeanAcc of 84.47%
      • RF classifier with 100 estimators
    • Supervised DBN classifier with PROMISE: highest MeanAcc of 65.2%
      • Improvement of 19.46%
    • Supervised DBN classifier with the Semantic dataset: highest MeanAcc of 70.9%
      • Improvement of 29.9%


16 of 63

Tool: Defect Estimator for Source Code (DESCo)


17 of 63

Conclusion

  • Proposed PROCON metrics
    • To capture programming construct usage information present in various software artifacts
      • Such as VCS repositories and defect-reporting websites
  • Developed knowledge warehouses
    • By extracting relevant information from various online sources, specific to the considered software development tasks
    • Comprising:
      • ML models
      • The PROCON dataset
      • Software used for manipulating them
  • Built an expert system for defect prediction: DESCo
    • DESCo with PROCON: highest MeanAcc of 84.47%
    • Outperforms an existing defect prediction method
    • Defect prediction performance improves with the PROCON dataset


18 of 63

Publications out of this work

  1. Ritu Kapur and Balwinder Sodhi. "Towards a Knowledge Warehouse and Expert System for the Automation of SDLC tasks." 2019 IEEE/ACM International Conference on Software and System Processes (ICSSP). IEEE, 2019.
  2. Ritu Kapur and Balwinder Sodhi. A Defect Estimator for Source code: Linking defect reports with programming constructs usage metrics. ACM Transactions on Software Engineering and Methodology (TOSEM) 29.2 (2020): 1-35. (ESEC/FSE’21 Journal-First)


19 of 63


Thank You!

Questions?

@DrRituKapur

dev.ritu.kapur@gmail.com

dr-ritu-kapur-36174454

https://sites.google.com/view/ritu-kapur

20 of 63

Awards and Achievements

  • Awards
    • Received the Microsoft Research Travel Grant Award for presenting this work at ICSE/ICSSP 2019
    • Received the ACM SIGSOFT CAPS Award, which partially supported registration at ICSE/ICSSP 2019
  • Internship
    • Summer internship at the Institute of Software Engineering, University of Stuttgart, Germany


21 of 63

Problem #2: Improving the performance of CRUSO


22 of 63

Problem #2: Improving the performance of CRUSO

  • Proposed solution: CRUSO-P, a PVA (Paragraph Vectors Algorithm)-based alternative
    • CRUSO [1]: a code-review-assisting tool
      • Detects relevant StackOverflow (SO) posts using the Winnowing algorithm
      • Estimates defectiveness using the content of the matched SO posts
    • Central idea: replace the Winnowing algorithm with PVA
    • Cause of CRUSO's inefficiency: variable-length fingerprints require a large number of comparisons
      • Considerable memory usage and response time
    • Programming languages considered:
      • C, C#, Java, JavaScript, and Python


[1]. Sharma, Shipra, and Balwinder Sodhi. "Using Stack Overflow content to assist in code review." Software: Practice and Experience 49.8 (2019): 1255-1277.

23 of 63

Tool Input Interface

23

24 of 63

Tool Output Interface


25 of 63

CRUSO-P detects semantically similar code fragments


26 of 63

CRUSO-P detects semantically similar code fragments (contd.)

26

27 of 63

Basic tenets behind our system

  • Professional programmer-support forums such as StackOverflow contain useful information about problems faced by programmers when developing software
    • Code fragments
    • Associated Q/A discussions
  • If a given source code under review, c, is found to be sufficiently similar to a code fragment cp present in an SO post p
    • Then we can infer the defectiveness of c by analysing the Q/A discussions and metadata of post p


28 of 63

Existing methods for representing source code

  • Methods:
    • Fingerprint-based methods
      • Fingerprint: a compact collection of integers
      • Winnowing algorithm; hash-based algorithms → MD5 and SHA
    • Abstract syntax tree (AST)-based methods
      • Capture the syntactic details of programming construct usage
    • Software metrics-based methods
      • Capture source-code-specific information
  • Limitations:
    • Fingerprint-based methods:
      • High processing cost and storage requirements
      • Variable-length fingerprints of almost the same size as the source code
    • AST-based and software metrics-based methods:
      • Focus on {defective, unpredictable} outcomes
      • No work on detecting source code that is unlikely to be defective


29 of 63

Proposed System architecture

  • Preparing the SO posts database and code vectors
    • Extract code, text, and metadata fields from SO posts
    • Source code similarity detection model:
      • Train a PVA model using 188,200+ GitHub source files
    • Reference dataset:
      • PVA vector representations of the code fragments present in SO posts
  • Determining the defectiveness of a source code under review
    • Using the text sentiment and metadata fields of the matched posts
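
A minimal sketch of this flow, assuming gensim's Doc2Vec as the PVA implementation. The post IDs, code fragments, and query code below are invented for illustration; vector_size=10 and epochs=5 mirror the best-performing PVA settings reported later in this deck.

```python
# Sketch of the CRUSO-P flow: train a PVA (Doc2Vec) model, store one
# vector per SO code fragment (the "reference dataset"), then match the
# code under review against it by cosine similarity.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

so_fragments = {                      # hypothetical SO post code fragments
    "post_101": "total = 0\nfor i in range(n): total += a[i]",
    "post_202": "if x is None: raise ValueError",
}
corpus = [TaggedDocument(code.split(), [pid])
          for pid, code in so_fragments.items()]
model = Doc2Vec(corpus, vector_size=10, epochs=5, min_count=1)

review_code = "s = 0\nfor i in range(n): s += a[i]"
vec = model.infer_vector(review_code.split())

# Nearest reference vectors -> candidate posts whose Q/A discussions and
# metadata are then analysed to estimate defectiveness.
for post_id, similarity in model.dv.most_similar([vec], topn=2):
    print(post_id, round(similarity, 3))
```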


30 of 63

GitHub source files and SO posts considered

Training and testing corpus (lines of code measured by cloc):

| Language | Files | Blank | Comment | SLOC | Test file pairs | Relevant corpus size | Models tested |
|---|---|---|---|---|---|---|---|
| C | 32099 | 2908784 | 2490163 | 14908295 | 5000 | 30036 | 21 |
| C# | 8112 | 303416 | 198693 | 2342959 | 5000 | 7076 | 21 |
| Java | 142266 | 437851 | 659172 | 2157881 | 5000 | 127568 | 21 |
| JavaScript | 15737 | 177587 | 226724 | 1259902 | 5000 | 12485 | 21 |
| Python | 6012 | 300109 | 412452 | 1248494 | 5000 | 5378 | 20 |

StackOverflow question posts considered:

| Language | Max(μ) | Min(μ) | Avg(μ) | StdDev(μ) | Count of posts |
|---|---|---|---|---|---|
| C | 2880 | -24 | 1.4619 | 18.9996 | 43306 |
| C# | 1463 | -17 | 1.9487 | 10.0811 | 219821 |
| Java | 23665 | -31 | 1.7041 | 45.9577 | 294629 |
| JavaScript | 2689 | -31 | 1.2953 | 9.3526 | 245863 |
| Python | 1738 | -17 | 2.4483 | 13.3160 | 100109 |

StackOverflow answer posts considered:

| Language | Max(μ) | Min(μ) | Avg(μ) | StdDev(μ) | Count of posts |
|---|---|---|---|---|---|
| C | 1803 | -28 | 2.715 | 27.767 | 9788 |
| C# | 1197 | -9 | 2.645 | 11.5883 | 60505 |
| Java | 30924 | -8 | 2.9954 | 120.601 | 67450 |
| JavaScript | 988 | -8 | 1.9928 | 12.0559 | 66096 |
| Python | 776 | -6 | 3.2657 | 12.9174 | 25167 |

31 of 63

Experimental Summary


| Experiment | Research question addressed | Major findings |
|---|---|---|
| Experiment #1 | What is the highest accuracy achieved by CRUSO-P? Is CRUSO-P inclined to any specific programming language? | 99.2%, for the C programming language; no → uninclined |
| Experiment #2 | Performance comparison of CRUSO-P with CRUSO: response time, storage reduction, and accuracy | Improvements of 97.8% (response time), 99.15% (storage), and 5.6% (accuracy) |

32 of 63

Experiment #1: Performance in detecting source code similarity

  • Accuracy
    • With no transformation: > 98% → CRUSO-P performs well for all programming languages
    • With transformations: > 98% → CRUSO-P is resilient to source code obfuscation


| Language | Number of repositories | Avg. number of commits | Avg. number of contributors | Avg. number of releases | Avg. number of stars | Total number of source files | Count of randomly selected source files |
|---|---|---|---|---|---|---|---|
| Java | 30 | 18905.7 | 357.7 | 188 | 6697.83 | 142266 | 4000 |
| Python | 9 | 107621 | 494.6 | 270.8 | 8445.17 | 6012 | 4000 |
| C | 8 | 105013.8 | 536.8 | 266.4 | 9966.67 | 32099 | 4000 |
| C# | 10 | 19526.1 | 521.6 | 226.7 | 10125.67 | 6000 | 4000 |
| JavaScript | 7 | 15648.8 | 386.4 | 185.7 | 6588.7 | 15000 | 4000 |

(Column groups: with no transformation | with function relocation | with variable renaming.)

| Language | Avg(𝛂) | ACC | F1 | Avg(𝛂tf1) | ACC | F1 | Avg(𝛂tf2) | ACC | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Java | 0.99 | 0.991 | 0.991 | 0.950 | 0.995 | 0.880 | 0.818 | 0.994 | 0.827 |
| Python | 0.995 | 0.993 | 0.9798 | 0.990 | 0.994 | 0.846 | 0.907 | 0.995 | 0.862 |
| C | 0.994 | 0.994 | 0.992 | 0.956 | 0.997 | 0.922 | 0.923 | 0.996 | 0.886 |
| C# | 0.988 | 0.988 | 0.700 | 0.959 | 0.995 | 0.856 | 0.7745 | 0.992 | 0.785 |
| JavaScript | 0.984 | 0.993 | 0.818 | 0.951 | 0.994 | 0.852 | 0.755 | 0.993 | 0.797 |

33 of 63

Additional experiment: Do PVA input parameters affect the performance?

  • Performance
    • Improves with an increase in vector size
    • Declines with an increase in epoch count
    • Highest accuracy: 99.4% for the C programming language
      • at <#epochs, vector size> = <5, 10>


34 of 63

Experiment #1: Performance Evaluation of CRUSO-P

  • Procedure:
    • Test CRUSO-P on a collection of randomly selected SO posts
    • Record the accuracy and F1 score values
  • Observations:
    • Highest F1 score: 99.3%, for the Java programming language
    • Accuracy values above 85% for all the programming languages
      • CRUSO-P performs well across languages


(Avg(𝛂) and StdDev(𝛂) are the similarity thresholds; ACC and F1 measure performance.)

| Language | Avg(𝛂) | StdDev(𝛂) | ACC | F1 score | Number of posts | Avg. time taken (s) |
|---|---|---|---|---|---|---|
| C | 0.963 | 0.0704 | 0.992 | 0.992 | 5000 | 402 |
| C# | 0.954 | 0.0979 | 0.8559 | 0.8365 | 5000 | 413 |
| Java | 0.97 | 0.0668 | 0.993 | 0.993 | 5000 | 389 |
| JavaScript | 0.967 | 0.0719 | 0.8766 | 0.8612 | 5000 | 451 |
| Python | 0.9617 | 0.0764 | 0.991 | 0.9909 | 5000 | 368 |

35 of 63

Experiment #2: Comparison of CRUSO-P with CRUSO

  • Procedure:
    • Test CRUSO-P on a collection of randomly selected source files
    • Record the accuracy and F1 score values
  • Observations:
    • Avg. response time: 6.28 seconds
      • 97.82% speed improvement over CRUSO
    • The vectors table of CRUSO-P occupies 121.53 MB
      • 99.15% storage reduction over CRUSO
    • Highest accuracy of CRUSO-P: 99.3%
      • 5.6% improvement over CRUSO


(Per-language columns show response time in seconds.)

| Tool | C | C# | Java | JavaScript | Python | Avg. response time (s) | Storage (MB) | Accuracy |
|---|---|---|---|---|---|---|---|---|
| CRUSO-P | 1.09 | 13.15 | 11.47 | 4.35 | 1.35 | 6.28 | 121.53 | 99.3% |
| CRUSO | 284.74 | 291.09 | 289.15 | 281.81 | 292.8 | 287.92 | 14239 | 94% |

36 of 63

Defect Estimator for Source Code -- DESCo

  • An automated solution for predicting defects and their characteristics [1]
  • Most related work: Abstract Syntax Tree (AST)-based methods [2]
    • Token vectors capture programming construct occurrences
      • Structural and contextual information is preserved
    • ML models are trained on the token vectors
  • Novelty in our work:
    • Detection of the characteristics of defects
    • PROCON metrics
  • Existing gap: focus on mere occurrences of a subset of programming constructs


[1]. Ritu Kapur and Balwinder Sodhi. A Defect Estimator for Source code: Linking defect reports with programming constructs usage metrics. ACM Transactions on Software Engineering and Methodology (TOSEM) 29.2 (2020): 1-35.

[2]. Wang, Song, Taiyue Liu, and Lin Tan. "Automatically learning semantic features for defect prediction." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.

37 of 63

Evaluation Metrics

  • F1 score: measures the performance of a model
  • Receiver Operating Characteristic (ROC) curve area
    • Measures the quality of the output
    • A larger area under the curve indicates better-quality output
    • Maximum: 1
    • Minimum: 0
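
Both metrics are standard; a minimal sketch of computing them with scikit-learn on toy labels (illustrative values only):

```python
# Sketch: the two evaluation metrics used in this work.
from sklearn.metrics import f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1]               # ground-truth defectiveness
y_pred = [1, 0, 1, 0, 0, 1]               # hard predictions -> F1 score
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7]  # predicted probabilities -> ROC area

print("F1 score:", round(f1_score(y_true, y_pred), 3))
print("ROC area:", round(roc_auc_score(y_true, y_score), 3))
```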


38 of 63

Performance Evaluation: Estimating the defect properties


39 of 63

Performance Evaluation: Estimating the defect properties (Cont.)


40 of 63

Abstract Software categories derived from MavenCentral

  • Software Library
  • Software Utilities and Plugin
  • Software tool
  • Software metrics
  • Software driving engine
  • Software Framework
  • Software Middleware
  • Software Client
  • Software Server
  • Software Driver
  • Software File System
  • Others


41 of 63

Paragraph Vectors Algorithm

  • An unsupervised algorithm
  • Learns fixed-length feature representations from variable-length pieces of text
    • Such as sentences, paragraphs, and documents
  • Each document is represented by a dense vector
    • Trained to predict words in the document
  • Document vectors are unique among documents
    • Word vectors are shared


[1]. Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
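
A toy illustration of these properties using gensim's Doc2Vec implementation of PVA (the corpus and vector size are arbitrary):

```python
# Sketch: variable-length texts map to fixed-length document vectors;
# word vectors are shared across documents, document vectors are per-document.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["the quick brown fox",
        "a quick brown dog",
        "paragraph vectors learn fixed length representations"]
corpus = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
model = Doc2Vec(corpus, vector_size=16, min_count=1, epochs=40)

print(len(model.dv[0]))                  # fixed-length document vector (16)
print(len(model.wv["quick"]))            # shared word vector (16)
print(model.infer_vector("another quick fox".split())[:4])  # unseen text
```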

42 of 63

Correlation metrics

  • Pearson Correlation[1]
    • Degree of linear correlation
  • Kendall rank Correlation[2]
    • Strength of dependence between two variables
  • Spearman rank Correlation[3]
    • Degree of association between two variables


[1]. Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing, pages 1–4. Springer, 2009.

[2]. Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[3]. Wayne W Daniel. The spearman rank correlation coefficient. Biostatistics: A Foundation for Analysis in the Health Sciences, 1987.
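
A minimal sketch of computing the three measures with scipy.stats, e.g., when relating an SDEE metric to development effort (the data points are invented):

```python
# Sketch: Pearson, Kendall, and Spearman correlation via scipy.
from scipy import stats

sloc_modifications = [120, 340, 560, 800, 1500]   # hypothetical SLOCm values
effort_hours = [35, 80, 150, 190, 420]            # hypothetical effort

print("Pearson r:   ", stats.pearsonr(sloc_modifications, effort_hours)[0])
print("Kendall tau: ", stats.kendalltau(sloc_modifications, effort_hours)[0])
print("Spearman rho:", stats.spearmanr(sloc_modifications, effort_hours)[0])
```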

43 of 63

Effect of SDEE metrics on the overall development effort


  • SLOC modifications (SLOCm) performed on a software repository vs. the effort expended in developing it
  • Skill factor of a developer vs. the effort expended in developing the software repository
  • Productivity of a developer vs. the effort expended in developing the software repository

44 of 63

ANTLR-generated Abstract Syntax Tree (AST)


45 of 63

Results and Observations: Magnitude of Relative Errors


46 of 63

SDEE tool


47 of 63

Comparison of Similarity detection methods


[1]. Trstenjak, Bruno, Sasa Mikac, and Dzenana Donko. "KNN with TF-IDF based framework for text categorization." Procedia Engineering 69 (2014): 1356-1364.

  • Evaluation criterion: percentage of correct predictions made by the model
  • Observation: improvement of 49.91% over the existing method [1]

48 of 63

Relational Schema of the SDEE dataset


49 of 63

Software description similarity detection model


| Measure | Value (in %) |
|---|---|
| Accuracy | 99.3 |
| Precision | 99.24 |
| Recall | 98.62 |
| F1 score | 99.3 |
| ROC area | 98.62 |

50 of 63

Establishing the Mapping


[1]. Github: http://github.com/, accessed October 18, 2017.

[2]. Apache Bugzilla: https://bz.apache.org/bugzilla/ , accessed October 18, 2017.

  • Mapping: open-source projects in various Version Control Systems (VCSs), e.g., GitHub [1] ↔ defect reports in various Open Bug Repositories (OBRs), e.g., Bugzilla [2]
    • Built for projects having both source code information at VCSs and bug information at OBRs
  • Links the preferences in choosing various programming constructs with the metadata information from bug reports
  • Underlying question: does a particular programming style have any effect on the quality of software?

51 of 63

Architecture of DESCo System


52 of 63

Problem Formulation: Selecting the best performing model

  • For a given defect estimation scenario 𝜓 and a given dataset D𝜆:


  • D: dataset
  • L: set of considered programming languages
  • 𝜆: a specific programming language, s.t. 𝜆 ∈ L
  • E: set of evaluation metrics
  • 𝚫: ML model
  • A: set of ML algorithms
  • P: set of input parameter combinations of A
  • 𝛼: an ML algorithm, s.t. 𝛼 ∈ A
  • 𝜋: a parameter combination, s.t. 𝜋 ∈ P
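
The selection formula itself appears only as a figure on the original slide; given the definitions above, it presumably picks the model built from the <algorithm, parameter> pair that maximizes the evaluation metrics for scenario 𝜓. A reconstruction, not the slide's verbatim formula:

```latex
\Delta_{\text{best}}(\psi, D_{\lambda}) =
  \operatorname*{arg\,max}_{\alpha \in A,\; \pi \in P}
  E\big(\Delta(\alpha, \pi, D_{\lambda}),\, \psi\big)
```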

53 of 63

Parameter Combinations of different ML techniques


54 of 63

Support Vector Machine (SVM)

  • A classification technique: find the model that best splits two classes

[Figure: points of two classes separated by a line/hyperplane with maximal margin]

  • Why is it the best split?
    • The distance between the points and the line is maximal
  • Formal definition:
    • The hyperplane that best splits the data
      • Because it is as far as possible from the support vectors
      • i.e., it maximizes the margin
  • A constrained optimization problem (formulated below)
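
For reference (not shown on the original slide), the standard hard-margin form of this constrained optimization problem, over training points x_i with labels y_i ∈ {−1, +1}, is:

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```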


55 of 63

Deep Belief Network

  • Definition:
    • Multiple layers of hidden units
    • Connections between the layers
    • No connections among the nodes within the same layer
  • A DBN can learn to probabilistically reconstruct its input
  • Example with programming constructs [1]

[1]. Wang, Song, Taiyue Liu, and Lin Tan. "Automatically learning semantic features for defect prediction." Proceedings of the 38th International Conference on Software Engineering. ACM, 2016.


[Figure: a DBN whose input layer x is fed token sequences from source files (File1 [ ... if foo for bar ... ], File2 [ ... foo for if bar ... ]); hidden layers h1 through hl-1 feed an output layer l that yields the learned features hl]

56 of 63

Random Forest

  • Definition:
    • Used for classification or regression
    • Divides the features into subsets
    • Training: constructs a multitude of decision trees
    • Testing: outputs the class that is the mode of the classes predicted by the individual trees
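
A minimal sketch matching the configuration used elsewhere in this deck ("RF classifier with 100 estimators"); X and y are random placeholders for PROCON features and labels:

```python
# Sketch: random forest classification with 100 trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X, y = rng.random((300, 36)), rng.integers(0, 2, 300)

rf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(rf.predict(X[:3]))   # each prediction is the mode of the trees' votes
```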


57 of 63

Decision Tree Learning Algorithm

  • A decision tree uses a tree-like graph or model of decisions
    • Each internal node represents a decision or a "test" on an attribute
    • Each branch represents an outcome of the test
    • Each leaf node represents a class label
    • Paths from the root to the leaves represent classification rules
  • Algorithm
    • Start: all training examples are at the root
    • Partitioning: training examples are partitioned based on selected attributes
    • Selecting test attributes: on the basis of a heuristic or statistical measure (e.g., information gain)
  • Example (see the tree sketched below) →


age?
├── <= 30 → student?
│     ├── no → no
│     └── yes → yes
├── 31…40 → yes
└── > 40 → credit rating?
      ├── excellent → no
      └── fair → yes

58 of 63

Example: Decision Tree Learning Algorithm

  • Classes
    • Class P: buys_computer = "yes"
    • Class N: buys_computer = "no"
    • Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
  • Finding the information gain of attributes
    • Gain(age) = Info(D) - Info_age(D)
    • Now, Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
    • So, Gain(age) = 0.940 - 0.694 = 0.246
    • Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048
    • So the attribute selected for the split → age
  • We keep splitting in this manner at each step
    • With age <= 30, we next decide among {income, student, credit_rating}
      • The student attribute achieves the maximum gain
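
The arithmetic above can be checked with a few lines of Python (a verification sketch, not part of the original deck):

```python
# Verifying the information-gain computation for the buys_computer example.
from math import log2

def info(*counts):
    """Entropy I(c1, c2, ...) of a class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_d = info(9, 5)                                        # 0.940
info_age = (5/14)*info(2, 3) + (4/14)*info(4, 0) + (5/14)*info(3, 2)
print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# -> 0.94 0.694 0.246, so `age` gives the largest gain and is split first
```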


59 of 63

Defining the PROCON metrics


| Software metric | Description |
|---|---|
| maxXCount | Maximum number of times a construct X is used in the source code |
| minXCount | Minimum number of times a construct X is used in the source code |
| avgXCount | Average number of times a construct X is used in the source code |
| stdDevXCount | Standard deviation of the number of times a construct X is used in the source code |
| maxXDepth | Maximum depth at which a construct X is used in the AST of the source code |
| minXDepth | Minimum depth at which a construct X is used in the AST of the source code |
| avgXDepth | Average depth at which a construct X is used in the AST of the source code |
| stdDevXDepth | Standard deviation of the depth at which a construct X is used in the AST of the source code |
| maxXLength | Maximum lexical length of the body of construct X used in the source code |
| minXLength | Minimum lexical length of the body of construct X used in the source code |
| avgXLength | Average lexical length of the body of construct X used in the source code |
| stdDevXLength | Standard deviation of the lexical length of the body of construct X used in the source code |


60 of 63

Establishing the mapping


61 of 63

PROCON dataset builder


62 of 63

Parameter Combinations of different ML techniques


63 of 63

Relational schema of the SOPostsDB
