1 of 29

TRACEABILITY TRANSFORMED: GENERATING MORE ACCURATE LINKS WITH PRE-TRAINED BERT MODELS

Jinfeng Lin, Yalin Liu, Qingkai Zeng, Meng Jiang, Jane Cleland-Huang

2 of 29

TRACES, WHAT ARE THEY?

  • Links between software artifacts
  • Software artifacts
    • Code
    • Test cases
    • Requirements
    • Design definitions

  • As a project grows, creating and maintaining trace links becomes increasingly expensive.

3 of 29

ARTIFACTS

NATURAL LANGUAGE ARTIFACTS

  • Feature requests
  • Requirements
  • Design definitions

PROGRAMMING LANGUAGE ARTIFACTS

  • Code files/snippets
  • Function definitions
  • Code change sets

4 of 29

PAST WORK

  • Techniques that rely on word matching
  • Some deep learning, but only trained on small datasets

  • Why aren’t the previous models doing that well?
    • Sparsity of training data
    • Training and applying multi-layer neural networks on a large industrial project is significantly slower than traditional information retrieval techniques.

5 of 29

THEIR SOLUTION/MODEL

T-BERT

Based on Google’s BERT (Bidirectional Encoder Representations from Transformers)

Used to generate trace links between natural language artifacts (NLA) and programming language artifacts (PLA)

6 of 29

TRAINING (3 PHASES)

Pre-training

Intermediate training

Finetuning

7 of 29

PRE-TRAINING

  • Leverage Microsoft’s BERT model that was pre-trained on code.
  • Pre-trained with the ‘Replaced Token Detection’ objective (sketched below).
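A minimal sketch of the ‘Replaced Token Detection’ idea (presenter’s illustration, not the paper’s code): a stand-in for the learned generator randomly swaps some code tokens, and the pre-trained model acts as a discriminator that flags which positions were replaced.

# Conceptual sketch of the Replaced Token Detection (RTD) objective.
# The random "generator" below is a toy stand-in for a learned generator.
import random

def corrupt(tokens, vocab, replace_rate=0.15):
    """Replace a fraction of tokens with plausible alternatives and record
    which positions were replaced (these become the discriminator's labels)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < replace_rate:
            corrupted.append(random.choice(vocab))  # stand-in for a learned generator
            labels.append(1)                        # 1 = replaced
        else:
            corrupted.append(tok)
            labels.append(0)                        # 0 = original
    return corrupted, labels

code_tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
vocab = ["sub", "x", "y", "*", "-", "print"]
corrupted, labels = corrupt(code_tokens, vocab)
# The pre-trained model is trained to predict `labels` (original vs. replaced)
# for every position in `corrupted`.
print(list(zip(corrupted, labels)))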

8 of 29

INTERMEDIATE TRAINING

  • The intermediate task they trained their model on is based on CodeSearchNet
  • Binary classification task
  • Used to determine whether a given function description (doc-string) describes a given Python code snippet (see the sketch below)
  • Much bigger dataset
  • In total, three different architectures were tried here.
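A minimal sketch of this relation-classification step, assuming a HuggingFace cross-encoder setup; the model name (microsoft/codebert-base), the freshly initialized classification head, and the modern tokenizer call are stand-ins rather than the paper’s exact code.

# Sketch of the intermediate binary task:
# does this doc-string describe this Python code snippet?
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = unrelated, 1 = related

doc = "Return the sum of two numbers."
code = "def add(a, b):\n    return a + b"

# Encode the NL/PL pair as one sequence (the cross-encoder / 'single' style input).
inputs = tokenizer(doc, code, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prob_related = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(doc-string describes code) = {prob_related:.3f}")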

9 of 29

THREE VARIANTS (INTERMEDIATE TRAINING)

Single

Twin

Siamese
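A rough sketch of how the three classifiers differ in wiring; class and field names are invented for illustration, and the comparison heads assume HuggingFace-style encoders that expose last_hidden_state.

# Illustrative sketch of the three T-BERT variants (not the authors' code).
import torch
import torch.nn as nn

class SingleModel(nn.Module):
    """SINGLE: concatenate NL and PL into one sequence, one BERT pass."""
    def __init__(self, bert, hidden=768):
        super().__init__()
        self.bert, self.cls = bert, nn.Linear(hidden, 2)
    def forward(self, pair_ids, pair_mask):
        h = self.bert(pair_ids, attention_mask=pair_mask).last_hidden_state[:, 0]
        return self.cls(h)

class TwinModel(nn.Module):
    """TWIN: two separate BERT encoders, one for NL and one for PL."""
    def __init__(self, nl_bert, pl_bert, hidden=768):
        super().__init__()
        self.nl_bert, self.pl_bert = nl_bert, pl_bert
        self.cls = nn.Linear(hidden * 2, 2)
    def forward(self, nl_ids, pl_ids):
        nl = self.nl_bert(nl_ids).last_hidden_state[:, 0]
        pl = self.pl_bert(pl_ids).last_hidden_state[:, 0]
        return self.cls(torch.cat([nl, pl], dim=-1))

class SiameseModel(TwinModel):
    """SIAMESE: like TWIN, but the two encoders share the same weights."""
    def __init__(self, shared_bert, hidden=768):
        super().__init__(shared_bert, shared_bert, hidden)

A design note worth keeping in mind for RQ1: the single model must re-encode every candidate NL/PL pair from scratch, while the twin and siamese models can pre-compute artifact embeddings, which makes them cheaper at retrieval time.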

10 of 29

TWIN

11 of 29

SIAMESE

12 of 29

SINGLE

13 of 29

FINETUNING

  • Final training is done on predicting trace links between an issue’s discussion and its commits
  • For each type of artifact, the text is concatenated to formulate input sequences for the T-BERT model (i.e., issue summary + issue description, and commit message + code change set), as sketched below.
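A minimal sketch of that concatenation step; the field names and toy data are assumptions, not the authors’ preprocessing code.

# Sketch of assembling T-BERT input text from the raw artifacts
# (dictionary keys and example values are illustrative).
def issue_to_text(issue: dict) -> str:
    # NL artifact: issue summary + issue description
    return f"{issue['summary']} {issue['description']}"

def commit_to_text(commit: dict) -> str:
    # PL artifact: commit message + code change set (diff)
    return f"{commit['message']} {commit['diff']}"

issue = {"summary": "Crash on empty config",
         "description": "Loading a config file with no sections raises KeyError."}
commit = {"message": "Handle empty config files",
          "diff": "-    section = cfg['main']\n+    section = cfg.get('main', {})"}

nl_text, pl_text = issue_to_text(issue), commit_to_text(commit)
print(nl_text)
print(pl_text)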

14 of 29

FINAL PIPELINE

The T-BERT model was implemented with PyTorch v1.6.0 and the HuggingFace Transformers library v2.8.0.
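If reproducing, a quick sanity check that the environment matches those versions (assuming both packages are installed):

# Check the installed versions against the ones reported on this slide.
import torch
import transformers

print(torch.__version__)          # paper used 1.6.0
print(transformers.__version__)   # paper used 2.8.0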

15 of 29

DATASETS

  • CodeSearchNet
  • Their own novel dataset

16 of 29

CodeSearchNet

  • Functions and their associated doc-strings for six different programming languages
  • Only Python is used in this paper

17 of 29

THEIR OWN NOVEL DATASET

  • Three open-source projects on GitHub (Pgcli, Flask, and Keras)
  • Retrieved issues and their discussions as the source artifacts (NLA) and commits as the target artifacts (PLA)
  • Each issue provides a short summary and a longer description
  • A ‘golden link set’ is built from the issue tags that committers embed in commit messages (see the sketch below)
  • The dataset is split into ten folds, of which eight were used for training, one for development, and one for testing.
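A minimal sketch of how such golden links could be recovered from issue tags; the regex pattern and data shapes are assumptions, not the authors’ mining script.

# Sketch: recover (issue, commit) "golden links" from issue tags that
# committers embed in commit messages, e.g. "Handle empty config files (#123)".
import re

ISSUE_TAG = re.compile(r"#(\d+)")

def golden_links(commits):
    """Yield (issue_id, commit_sha) pairs for every issue tag found."""
    for commit in commits:
        for issue_id in ISSUE_TAG.findall(commit["message"]):
            yield issue_id, commit["sha"]

commits = [{"sha": "a1b2c3", "message": "Handle empty config files (#123)"},
           {"sha": "d4e5f6", "message": "Refactor CLI entry point"}]
print(list(golden_links(commits)))   # [('123', 'a1b2c3')]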

18 of 29

ONLINE NEGATIVE SAMPLING (ONS)

  • An alternative to dynamic random negative sampling (DRNS)
  • Guo et al. observed a glass ceiling in achievable trace-link performance
  • They hypothesize that at earlier stages of training, the model can effectively learn rules for distinguishing positive and negative examples
  • DRNS does well at providing those kinds of naïve examples
  • They want to offer the model higher-quality negative samples
  • Negative examples are generated at the batch level rather than at the beginning of each epoch (see the sketch below)
  • Inspired by the face recognition domain
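A minimal sketch of the batch-level idea, assuming each training batch holds linked (NL, PL) pairs: every NL artifact is paired with the other, non-linked PL artifacts in the same batch, and the highest-scoring ones are kept as hard negatives. The scoring and selection rule here are illustrative, not the paper’s exact procedure.

# Sketch of online negative sampling (ONS) at the batch level.
# The dot-product score stands in for the current model's relevance score.
import torch

def online_negative_sampling(nl_embs, pl_embs, neg_per_pos=1):
    """nl_embs[i] and pl_embs[i] form the i-th positive (linked) pair.
    Return index pairs (i, j), i != j, of the hardest in-batch negatives."""
    scores = nl_embs @ pl_embs.T                 # batch x batch similarity
    scores.fill_diagonal_(float("-inf"))         # exclude the positive pairs
    hard_js = scores.topk(neg_per_pos, dim=1).indices
    return [(i, int(j)) for i in range(len(nl_embs)) for j in hard_js[i]]

# Toy batch of 4 linked (NL, PL) embedding pairs.
nl = torch.randn(4, 8)
pl = torch.randn(4, 8)
print(online_negative_sampling(nl, pl))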

19 of 29

TRAINING THE MODEL

  • 8 epochs of intermediate training
  • 400 epochs of fine-tuning
  • For each task, a batch size of 8 and a gradient accumulation step of 8 (see the sketch below)
  • An initial learning rate of 1e-5, with a linear scheduler controlling the learning rate at run time
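A minimal sketch of that optimizer/scheduler setup using the HuggingFace linear scheduler; the dummy model, zero warmup, and step counts are placeholders, not the paper’s exact configuration.

# Sketch of the training-loop settings described on this slide.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 2)                      # stand-in for T-BERT
batch_size, accumulation_steps = 8, 8              # effective batch = 64
total_steps = 1_000                                # placeholder

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0,
    num_training_steps=total_steps // accumulation_steps)

for step in range(total_steps):
    loss = model(torch.randn(batch_size, 8)).sum() / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:       # update every 8 batches
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()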

20 of 29

QUICK WORD ON THE METRICS

  • Mean Average Precision (MAP@3): evaluates the ranking of relevant artifacts over the retrieved ones.
  • Mean Reciprocal Rank (MRR): focuses on the first relevant result for a query, while ignoring the overall ranking.
  • Precision@K: a trace model with high Precision@K means users are more likely to find at least one related target artifact in the top K results.
  • F-Score: the F-1 score assigns equal weight to precision and recall, while the F-2 score favors recall over precision.
  • MRR and Precision@K ignore recall and focus on whether the search results surface something useful for a user.
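Toy implementations of Precision@K, reciprocal rank, and the F-beta score (MRR is simply the mean of the reciprocal ranks over all queries); this is an illustration, not the paper’s evaluation script.

# Toy versions of the metrics discussed on this slide.

def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k retrieved artifacts that are relevant."""
    return sum(1 for a in ranked[:k] if a in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant artifact (0 if none is retrieved)."""
    for rank, a in enumerate(ranked, start=1):
        if a in relevant:
            return 1.0 / rank
    return 0.0

def f_beta(precision, recall, beta=1.0):
    """F-1 weights precision and recall equally; F-2 (beta=2) favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

ranked = ["commit_7", "commit_2", "commit_9"]   # model's ranking for one issue
relevant = {"commit_2"}                         # golden links for that issue
print(precision_at_k(ranked, relevant, k=3))    # 0.333...
print(reciprocal_rank(ranked, relevant))        # 0.5
print(f_beta(0.6, 0.8, beta=2))                 # recall-weighted F-2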

21 of 29

EVALUATION ON CODESEARCHNET

22 of 29

EVALUATION ON TRACEABILITY

23 of 29

RESEARCH QUESTIONS

  • RQ1: Given three variants of T-BERT models, based on single, twin, and siamese BERT relation classifiers, which is the best architecture for addressing NLA-PLA traceability with respect to both accuracy and efficiency?
  • RQ2: Which training technique improves accuracy without suffering from the previously observed glass ceiling?
  • RQ3: Can T-BERT transfer knowledge from a resource-rich retrieval task to enhance the accuracy and performance of the downstream NLA-PLA tracing challenge?

24 of 29

RQ1: TWIN, SINGLE OR SIAMESE?

Single seems to do better on accuracy, but it is less efficient at inference time.

25 of 29

RQ2: TRAINING TECHNIQUES TO BREAK THE GLASS CEILING? (ONS)

  • ONS seems to be the winning strategy here

26 of 29

27 of 29

RQ3: TRANSFER KNOWLEDGE?

  • Transfer learning does lead to better results

28 of 29

EVALUATION ON TRACEABILITY

29 of 29

LIMITATIONS

  • Only evaluated on Python
  • Only 3 OSS projects
  • More hyperparameter tuning needed (limited hardware resources)