EXPLORING USEFULNESS OF C COMMENTS USING SILVER STANDARD MODELS

Aritra Mitra

Computer Science and Engineering, IIT Kharagpur

TABLE OF CONTENTS

Introduction

Methods

Results

Conclusion and Discussion

INTRODUCTION

Classify C code comments as ‘useful’ or ‘not useful’.
Train models using the provided dataset.
Augment the dataset using an LLM (GPT-4o-mini).
Train models using the augmented dataset and compare the results.

METHODS

SilverCodeBERT

Uses CodeBERT for comment (NL) and context (PL)
Suggested by GPT-4o-mini in 40% of trials

SilverDoubleBERT

Uses BERT for comment (NL) and CodeBERT for context (PL)
Suggested by GPT-4o-mini in 35% of trials

SilverLSTM

Uses bi-LSTM for comment (NL) and context (PL)
Suggested by GPT-4o-mini in 25% of trials

RESULTS

Augmentation using GPT-4o-mini

754 instances (485 ‘useful’, 269 ‘not useful’)

Augmented instances were labelled using input-output (I/O), Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting.
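The three prompting styles named above can be illustrated with a minimal sketch. The prompt wording, the `build_prompt` and `parse_label` helpers, and the single-prompt approximation of ToT are all assumptions made for illustration; they are not the prompts used in this work.

```python
# Hypothetical prompt templates; the actual prompts are not given in the slides.
TASK = (
    "Decide whether the C comment is 'useful' or 'not useful' "
    "for understanding the accompanying code."
)

INSTRUCTIONS = {
    # I/O: ask directly for the label, with no intermediate reasoning.
    "io": "Answer with the label only.",
    # CoT: ask for step-by-step reasoning before the label.
    "cot": (
        "Think step by step about what the comment adds beyond the code, "
        "then give the label on the last line."
    ),
    # ToT (simplified to a single prompt here): explore several lines of
    # reasoning, evaluate them, and keep the strongest.
    "tot": (
        "Consider three independent lines of reasoning, evaluate each briefly, "
        "discard the weak ones, then give the label on the last line."
    ),
}


def build_prompt(strategy: str, comment: str, code: str) -> str:
    """Build a labelling prompt for one comment/code pair."""
    if strategy not in INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy}")
    return (
        f"{TASK}\n{INSTRUCTIONS[strategy]}\n\n"
        f"Comment:\n{comment}\n\nCode:\n{code}\n"
    )


def parse_label(response: str) -> str:
    """Read the label from the last non-empty line of a model response."""
    lines = response.strip().splitlines()
    if not lines:
        raise ValueError("empty response")
    last = lines[-1].lower()
    # Check 'not useful' first, since it contains the substring 'useful'.
    if "not useful" in last:
        return "not useful"
    if "useful" in last:
        return "useful"
    raise ValueError(f"no label found in: {lines[-1]!r}")
```

The string returned by `build_prompt` would be sent to the LLM, and `parse_label` would turn the reply into one of the two class labels.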

Observations

All of the models performed better with the augmented data.
Increase in F1-scores indicates better identification and classification of useful comments.
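For reference, the F1-score is the harmonic mean of precision and recall on the ‘useful’ class. A minimal sketch of the computation follows; the function names and the example labels in the usage note are illustrative and are not results from this work.

```python
def confusion_counts(y_true, y_pred, positive="useful"):
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1
        elif pred != positive and truth == positive:
            fn += 1
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Usage: `f1_score(*confusion_counts(y_true, y_pred))`, where `y_true` and `y_pred` are equal-length lists of ‘useful’ / ‘not useful’ labels.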

Insights

Transformer-based models performed better than the bi-LSTM-based model.

This suggests that subword-level semantics matter more than character-level semantics.

SilverDoubleBERT performed better than SilverCodeBERT.

This suggests that comments carry natural-language semantics more than programming-language semantics.

CONCLUSION AND DISCUSSION

Recap

Binary classification of comments into ‘useful’ and ‘not useful’.
Used models generated by GPT-4o-mini (ChatGPT) to perform the classification (‘silver standard models’).
Augmented the dataset using GPT-4o-mini and compared the results.

Conclusion

The silver standard models achieved significant separability between the two classes.
These models pave the way for gold standard models for this task.

Discussion

LLM-generated data increases the variance of the training data, which helps reduce bias.

This improves the F1-score when training data is scarce.

However, gold standard data is more reliable and credible than LLM-generated data.

THANK YOU!

Any questions?