1 of 29

TRACEABILITY TRANSFORMED: GENERATING MORE ACCURATE LINKS WITH PRE-TRAINED BERT MODELS

Jinfeng Lin, Yalin Liu, Qingkai Zeng, Meng Jiang, Jane Cleland-Huang

2 of 29

TRACES, WHAT ARE THEY?

  • Links between software artifacts
  • Software artifacts
    • Code
    • Test cases
    • Requirements
    • Design definitions

  • As a project grows, creating and maintaining trace links becomes increasingly expensive.

3 of 29

ARTIFACTS

NATURAL LANGUAGE ARTIFACTS

  • Feature requests
  • Requirements
  • Design definitions

PROGRAMMING LANGUAGE ARTIFACTS

  • Code files/snippets
  • Function definitions
  • Code change sets

4 of 29

PAST WORK

  • Techniques that rely on word matching
  • Some deep learning, but only trained on small datasets

  • Why aren’t the previous models doing that well?
    • Sparsity of training data
    • Training and applying multi-layer neural networks on a large industrial project is significantly slower than traditional information retrieval techniques.

5 of 29

THEIR SOLUTION/MODEL

T-BERT

Based on Google’s BERT (Bidirectional Encoder Representations from Transformers)

Used to generate trace links between natural language artifacts (NLA) and programming language artifacts (PLA)

6 of 29

TRAINING (3 PHASES)

Pre-training

Intermediate training

Finetuning

7 of 29

PRE-TRAINING

  • Leverage Microsoft’s BERT model that was pre-trained on code.
  • Pre-trained with the ‘Replaced Token Detection’ objective (sketched below).
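A minimal sketch of the ‘Replaced Token Detection’ idea (presenter’s illustration, not the paper’s code): a stand-in for the learned generator randomly swaps some code tokens, and the pre-trained model acts as a discriminator that flags which positions were replaced.

# Conceptual sketch of the Replaced Token Detection (RTD) objective.
# The random "generator" below is a toy stand-in for a learned generator.
import random

def corrupt(tokens, vocab, replace_rate=0.15):
    """Replace a fraction of tokens with plausible alternatives and record
    which positions were replaced (these become the discriminator's labels)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < replace_rate:
            corrupted.append(random.choice(vocab))  # stand-in for a learned generator
            labels.append(1)                        # 1 = replaced
        else:
            corrupted.append(tok)
            labels.append(0)                        # 0 = original
    return corrupted, labels

code_tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
vocab = ["sub", "x", "y", "*", "-", "print"]
corrupted, labels = corrupt(code_tokens, vocab)
# The pre-trained model is trained to predict `labels` (original vs. replaced)
# for every position in `corrupted`.
print(list(zip(corrupted, labels)))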

8 of 29

INTERMEDIATE TRAINING

  • The intermediate task they trained their model on is based on CodeSearchNet
  • Binary classification task
  • Used to determine whether a given function description (doc-string) describes a given Python code snippet (see the sketch below)
  • Much bigger dataset
  • In total, three different architectures were tried here.
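A minimal sketch of this relation-classification step, assuming a HuggingFace cross-encoder setup; the model name (microsoft/codebert-base), the freshly initialized classification head, and the modern tokenizer call are stand-ins rather than the paper’s exact code.

# Sketch of the intermediate binary task:
# does this doc-string describe this Python code snippet?
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = unrelated, 1 = related

doc = "Return the sum of two numbers."
code = "def add(a, b):\n    return a + b"

# Encode the NL/PL pair as one sequence (the cross-encoder / 'single' style input).
inputs = tokenizer(doc, code, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prob_related = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(doc-string describes code) = {prob_related:.3f}")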

9 of 29

THREE VARIANTS (INTERMEDIATE TRAINING)

Single

Twin

Siamese
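A rough sketch of how the three classifiers differ in wiring; class and field names are invented for illustration, and the comparison heads assume HuggingFace-style encoders that expose last_hidden_state.

# Illustrative sketch of the three T-BERT variants (not the authors' code).
import torch
import torch.nn as nn

class SingleModel(nn.Module):
    """SINGLE: concatenate NL and PL into one sequence, one BERT pass."""
    def __init__(self, bert, hidden=768):
        super().__init__()
        self.bert, self.cls = bert, nn.Linear(hidden, 2)
    def forward(self, pair_ids, pair_mask):
        h = self.bert(pair_ids, attention_mask=pair_mask).last_hidden_state[:, 0]
        return self.cls(h)

class TwinModel(nn.Module):
    """TWIN: two separate BERT encoders, one for NL and one for PL."""
    def __init__(self, nl_bert, pl_bert, hidden=768):
        super().__init__()
        self.nl_bert, self.pl_bert = nl_bert, pl_bert
        self.cls = nn.Linear(hidden * 2, 2)
    def forward(self, nl_ids, pl_ids):
        nl = self.nl_bert(nl_ids).last_hidden_state[:, 0]
        pl = self.pl_bert(pl_ids).last_hidden_state[:, 0]
        return self.cls(torch.cat([nl, pl], dim=-1))

class SiameseModel(TwinModel):
    """SIAMESE: like TWIN, but the two encoders share the same weights."""
    def __init__(self, shared_bert, hidden=768):
        super().__init__(shared_bert, shared_bert, hidden)

A design note worth keeping in mind for RQ1: the single model must re-encode every candidate NL/PL pair from scratch, while the twin and siamese models can pre-compute artifact embeddings, which makes them cheaper at retrieval time.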

10 of 29

TWIN

11 of 29

SIAMESE

12 of 29

SINGLE

13 of 29

FINETUNING

  • Final training is done on predicting trace links between an issue’s discussion and its commits
  • For each type of artifact, the text is concatenated to formulate input sequences for the T-BERT model (i.e., issue summary + issue description, and commit message + code change set), as sketched below.
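A minimal sketch of that concatenation step; the field names and toy data are assumptions, not the authors’ preprocessing code.

# Sketch of assembling T-BERT input text from the raw artifacts
# (dictionary keys and example values are illustrative).
def issue_to_text(issue: dict) -> str:
    # NL artifact: issue summary + issue description
    return f"{issue['summary']} {issue['description']}"

def commit_to_text(commit: dict) -> str:
    # PL artifact: commit message + code change set (diff)
    return f"{commit['message']} {commit['diff']}"

issue = {"summary": "Crash on empty config",
         "description": "Loading a config file with no sections raises KeyError."}
commit = {"message": "Handle empty config files",
          "diff": "-    section = cfg['main']\n+    section = cfg.get('main', {})"}

nl_text, pl_text = issue_to_text(issue), commit_to_text(commit)
print(nl_text)
print(pl_text)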

14 of 29

FINAL PIPELINE

The T-BERT model was implemented with PyTorch v1.6.0 and the HuggingFace Transformers library v2.8.0.
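If reproducing, a quick sanity check that the environment matches those versions (assuming both packages are installed):

# Check the installed versions against the ones reported on this slide.
import torch
import transformers

print(torch.__version__)          # paper used 1.6.0
print(transformers.__version__)   # paper used 2.8.0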

15 of 29

DATASETS

  • CodeSearchNet
  • Their own novel dataset

16 of 29

CodeSearchNet

  • Functions and their associated doc-strings for six different programming languages
  • Only Python is used in this paper

17 of 29

THEIR OWN NOVEL DATASET

  • Three open-source projects on GitHub (Pgcli, Flask, and Keras)
  • Retrieved issues and their discussions as the source artifacts (NLA) and commits as the target artifacts (PLA)
  • Each issue provides a short summary and a longer description
  • A ‘golden link set’ is built from the issue tags that committers embed in commit messages (see the sketch below)
  • The dataset is split into ten folds, of which eight were used for training, one for development, and one for testing.
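A minimal sketch of how such golden links could be recovered from issue tags; the regex pattern and data shapes are assumptions, not the authors’ mining script.

# Sketch: recover (issue, commit) "golden links" from issue tags that
# committers embed in commit messages, e.g. "Handle empty config files (#123)".
import re

ISSUE_TAG = re.compile(r"#(\d+)")

def golden_links(commits):
    """Yield (issue_id, commit_sha) pairs for every issue tag found."""
    for commit in commits:
        for issue_id in ISSUE_TAG.findall(commit["message"]):
            yield issue_id, commit["sha"]

commits = [{"sha": "a1b2c3", "message": "Handle empty config files (#123)"},
           {"sha": "d4e5f6", "message": "Refactor CLI entry point"}]
print(list(golden_links(commits)))   # [('123', 'a1b2c3')]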

18 of 29

ONLINE NEGATIVE SAMPLING (ONS)

  • An alternative to dynamic random negative sampling (DRNS)
  • Guo et al. observed a glass ceiling in achievable trace-link performance
  • They hypothesize that at earlier stages of training, the model can effectively learn rules for distinguishing positive and negative examples
  • DRNS does well at providing those kinds of naïve examples
  • They want to offer the model higher-quality negative samples
  • Negative examples are generated at the batch level rather than at the beginning of each epoch (see the sketch below)
  • Inspired by the face recognition domain
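A minimal sketch of the batch-level idea, assuming each training batch holds linked (NL, PL) pairs: every NL artifact is paired with the other, non-linked PL artifacts in the same batch, and the highest-scoring ones are kept as hard negatives. The scoring and selection rule here are illustrative, not the paper’s exact procedure.

# Sketch of online negative sampling (ONS) at the batch level.
# The dot-product score stands in for the current model's relevance score.
import torch

def online_negative_sampling(nl_embs, pl_embs, neg_per_pos=1):
    """nl_embs[i] and pl_embs[i] form the i-th positive (linked) pair.
    Return index pairs (i, j), i != j, of the hardest in-batch negatives."""
    scores = nl_embs @ pl_embs.T                 # batch x batch similarity
    scores.fill_diagonal_(float("-inf"))         # exclude the positive pairs
    hard_js = scores.topk(neg_per_pos, dim=1).indices
    return [(i, int(j)) for i in range(len(nl_embs)) for j in hard_js[i]]

# Toy batch of 4 linked (NL, PL) embedding pairs.
nl = torch.randn(4, 8)
pl = torch.randn(4, 8)
print(online_negative_sampling(nl, pl))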

19 of 29

TRAINING THE MODEL

  • 8 epochs of intermediate training
  • 400 epochs of fine-tuning
  • For each task, a batch size of 8 and a gradient accumulation step of 8 (see the sketch below)
  • An initial learning rate of 1e-5, with a linear scheduler controlling the learning rate at run time
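A minimal sketch of that optimizer/scheduler setup using the HuggingFace linear scheduler; the dummy model, zero warmup, and step counts are placeholders, not the paper’s exact configuration.

# Sketch of the training-loop settings described on this slide.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 2)                      # stand-in for T-BERT
batch_size, accumulation_steps = 8, 8              # effective batch = 64
total_steps = 1_000                                # placeholder

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0,
    num_training_steps=total_steps // accumulation_steps)

for step in range(total_steps):
    loss = model(torch.randn(batch_size, 8)).sum() / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:       # update every 8 batches
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()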

20 of 29

QUICK WORD ON THE METRICS

  • Mean Average Precision (MAP@3): evaluates the ranking of relevant artifacts over the retrieved ones.
  • Mean Reciprocal Rank (MRR): focuses on the first relevant result for a query, while ignoring the overall ranking.
  • Precision@K: a trace model with high Precision@K means users are more likely to find at least one related target artifact in the top K results.
  • F-Score: the F-1 score assigns equal weight to precision and recall, while the F-2 score favors recall over precision.
  • MRR and Precision@K ignore recall and focus on whether the search results surface something useful for a user.
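Toy implementations of Precision@K, reciprocal rank, and the F-beta score (MRR is simply the mean of the reciprocal ranks over all queries); this is an illustration, not the paper’s evaluation script.

# Toy versions of the metrics discussed on this slide.

def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k retrieved artifacts that are relevant."""
    return sum(1 for a in ranked[:k] if a in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant artifact (0 if none is retrieved)."""
    for rank, a in enumerate(ranked, start=1):
        if a in relevant:
            return 1.0 / rank
    return 0.0

def f_beta(precision, recall, beta=1.0):
    """F-1 weights precision and recall equally; F-2 (beta=2) favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

ranked = ["commit_7", "commit_2", "commit_9"]   # model's ranking for one issue
relevant = {"commit_2"}                         # golden links for that issue
print(precision_at_k(ranked, relevant, k=3))    # 0.333...
print(reciprocal_rank(ranked, relevant))        # 0.5
print(f_beta(0.6, 0.8, beta=2))                 # recall-weighted F-2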

21 of 29

EVALUATION ON CODESEARCHNET

22 of 29

EVALUATION ON TRACEABILITY

23 of 29

RESEARCH QUESTIONS

  • RQ1: Given three variants of T-BERT models, based on single, twin, and siamese BERT relation classifiers, which is the best architecture for addressing NLA-PLA traceability with respect to both accuracy and efficiency?
  • RQ2: Which training technique improves accuracy without suffering from the previously observed glass ceiling?
  • RQ3: Can T-BERT transfer knowledge from a resource-rich retrieval task to enhance the accuracy and performance of the downstream NLA-PLA tracing challenge?

24 of 29

RQ1: TWIN, SINGLE OR SIAMESE?

Single seems to do better on accuracy, but it is less efficient at inference time.

25 of 29

RQ2: TRAINING TECHNIQUES TO BREAK THE GLASS CEILING? (ONS)

  • ONS seems to be the winning strategy here

26 of 29

27 of 29

RQ3: TRANSFER KNOWLEDGE?

  • Transfer learning does lead to better results

28 of 29

EVALUATION ON TRACEABILITY

29 of 29

LIMITATIONS

  • Only evaluated on Python
  • Only 3 OSS projects
  • More hyperparameter tuning needed (limited hardware resources)