EXPLORING USEFULNESS OF C COMMENTS USING SILVER STANDARD MODELS

Aritra Mitra

Computer Science and Engineering, IIT Kharagpur

TABLE OF CONTENTS

Introduction

Methods

Results

Conclusion and Discussion

INTRODUCTION

  • Classify comments as ‘useful’ or ‘not useful’.
  • Train models on the provided dataset.
  • Augment the dataset using an LLM.
    • GPT-4o-mini was used.
  • Train models on the augmented dataset and compare the results.

METHODS

SilverCodeBERT

    • Uses CodeBERT to encode both the comment (NL) and its code context (PL)
    • Suggested by GPT-4o-mini in 40% of trials

SilverDoubleBERT

    • Uses BERT to encode the comment (NL) and CodeBERT to encode its code context (PL)
    • Suggested by GPT-4o-mini in 35% of trials

SilverLSTM

    • Uses a bi-LSTM to encode both the comment (NL) and its code context (PL)
    • Suggested by GPT-4o-mini in 25% of trials

RESULTS

  • Augmentation using GPT-4o-mini
    • 754 instances (485 ‘useful’, 269 ‘not useful’)
  • Augmented instances labelled using I/O, Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting.
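
The three prompting styles named above differ mainly in how much intermediate reasoning the model is asked to produce before it commits to a label. The sketch below illustrates that difference only. The prompt wording, the `build_prompt` helper, and the style keys `io`, `cot`, and `tot` are all hypothetical, since the slides do not give the prompts that were actually used, and the ToT variant shown is a single-prompt approximation rather than a full tree search.

```python
# Hypothetical prompt templates. The wording used in the study is not given
# in the slides, so everything below is an illustrative assumption.
TASK = (
    "Classify the following C code comment as 'useful' or 'not useful' "
    "given the surrounding code."
)


def build_prompt(style: str, comment: str, context: str) -> str:
    """Return a labelling prompt for one (comment, context) pair."""
    body = f"{TASK}\n\nComment:\n{comment}\n\nCode context:\n{context}\n\n"
    if style == "io":
        # Input/output prompting: ask for the label directly.
        return body + "Answer with exactly one label."
    if style == "cot":
        # Chain-of-Thought: ask for step-by-step reasoning before the label.
        return body + (
            "Reason step by step, then give the final label on the last line."
        )
    if style == "tot":
        # Tree-of-Thought (single-prompt approximation): propose several
        # reasoning paths, evaluate them, and keep the strongest one.
        return body + (
            "Propose three independent lines of reasoning, evaluate each, "
            "discard the weak ones, then give the final label on the last line."
        )
    raise ValueError(f"unknown prompting style: {style!r}")


if __name__ == "__main__":
    example_comment = "/* increment i */"
    example_context = "i++;"
    for style in ("io", "cot", "tot"):
        print(f"--- {style} ---")
        print(build_prompt(style, example_comment, example_context))
```

Keeping the task description and the (comment, context) pair identical across styles means any difference in the resulting labels can be attributed to the reasoning instruction alone.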

RESULTS

Observations

  • All of the models performed better with the augmented data.
  • Increase in F1-scores indicates better identification and classification of useful comments.
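
As a reminder of what the F1-score measures, the sketch below computes it for the ‘useful’ class from confusion counts. The `f1_score` helper and the counts passed to it are hypothetical and chosen only to illustrate the formula. They are not results from this study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive ('useful') class from confusion counts.

    tp: useful comments predicted as useful
    fp: not-useful comments predicted as useful
    fn: useful comments predicted as not useful
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    before = f1_score(tp=60, fp=30, fn=40)
    after = f1_score(tp=75, fp=25, fn=25)
    print(f"F1 before: {before:.3f}")
    print(f"F1 after:  {after:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it rises only when the model finds more of the useful comments without also mislabelling more of the not-useful ones.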

Insights

  • Transformer-based models outperformed the bi-LSTM-based model.
    • This suggests that subword-level semantics matter more than character-level semantics.
  • The DoubleBERT-based approach outperformed the CodeBERT-only approach.
    • This suggests that comments carry the semantics of natural language more than those of programming language.

CONCLUSION AND DISCUSSION

Recap

  • Binary classification of comments into ‘useful’ and ‘not useful’.
  • Used models suggested by GPT-4o-mini (ChatGPT) to perform the classification (‘silver standard models’).
  • Augmented the dataset using GPT-4o-mini and compared the results.

Conclusion

  • Achieved significant separability using silver standard models.
  • These models pave the way for gold standard models for this task.

Discussion

  • LLM-generated data helps increase the variance of the training data, thereby reducing bias.
    • This results in a better F1-score when training data is scarce.
  • However, gold standard data is more reliable and credible than LLM-generated data.

THANK YOU!

Any questions?
