1 of 8

FIRE 2023 – IRSE: SOFTWARE METADATA CLASSIFICATION BASED ON GENERATIVE ARTIFICIAL INTELLIGENCE

Seetharam Killivalavan – II Year

Dr. Durairaj Thenmozhi – Assistant Professor

Department of Computer Science

Sri Sivasubramaniya Nadar College of Engineering


2 of 8

INTRODUCTION

Objective:

    • Improving the accuracy of a binary code comment quality classification model by augmenting its training data.

Input Data:

    • 9048 labeled pairs of C code and comments, categorized as Useful or Not Useful.
    • Generated code-comment pairs that augment the seed data.

Enhancement Goal:

    • Developing a classification model that demonstrates the accuracy gain obtained from training on the augmented dataset.

Significance:

    • Addressing the need for heightened accuracy in code comment quality classification.

Enhancing Binary Code Comment Quality Classification


3 of 8

MULTI-SOURCE DATA COLLECTION

Data Source 1: OpenAI API - Curie engine

    • Utilizing the power of the Curie engine through the OpenAI API.
    • Generating synthetic code-comment pairs, simulating real-world coding scenarios.
    • Leveraging Curie's natural language understanding for authentic and relevant examples.
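As a sketch, the Curie engine can be driven through the legacy (pre-1.0) `openai` Python client to produce synthetic pairs from a seed pair; the prompt wording, generation parameters, and the `build_prompt`/`generate_pair` helpers below are illustrative assumptions, not the exact pipeline:

```python
def build_prompt(seed_code: str, seed_comment: str) -> str:
    """Compose a one-shot prompt from a seed C code-comment pair
    (format is an assumption, not the presentation's exact prompt)."""
    return (
        "Generate a new C code snippet with an accompanying comment, "
        "in the style of the example below.\n\n"
        f"Code:\n{seed_code}\n"
        f"Comment: {seed_comment}\n\n"
        "Code:\n"
    )

def generate_pair(seed_code: str, seed_comment: str) -> str:
    """Ask the Curie engine for a fresh code-comment pair (needs an API key)."""
    import openai  # legacy (<1.0) client, where `engine=` selects Curie
    resp = openai.Completion.create(
        engine="text-curie-001",
        prompt=build_prompt(seed_code, seed_comment),
        max_tokens=150,
        temperature=0.7,  # some variety across generated pairs
    )
    return resp["choices"][0]["text"]
```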

Data Source 2: GitHub Repositories

    • Tapping into the vast landscape of GitHub repositories.
    • Extracting real-world code snippets and associated comments from open-source projects.
    • Capturing the nuances of actual development practices for authenticity.
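One simple way to mine such pairs from repository files is to pair each block comment with the first code line that follows it; this heuristic and the regex below are simplified assumptions (they ignore `//` line comments and multi-line declarations):

```python
import re

# Matches /* ... */ block comments, including multi-line ones.
_BLOCK_COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def extract_pairs(source: str) -> list[tuple[str, str]]:
    """Pair each block comment in a C source string with the first
    non-empty line of code that follows it."""
    pairs = []
    for m in _BLOCK_COMMENT.finditer(source):
        rest = source[m.end():].lstrip("\n")
        code_line = rest.split("\n", 1)[0].strip()
        if code_line:
            pairs.append((m.group(1).strip(), code_line))
    return pairs
```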


4 of 8

DATA PROCESSING AND BERT EMBEDDINGS

Data Processing Layer:

    • Transitioning to the data processing layer for deeper analysis of the collected pairs.
    • Feeding code-comment pairs from both sources as input prompts to the Curie engine.

BERT Embeddings:

    • Introducing BERT (Bidirectional Encoder Representations from Transformers) embeddings.
    • Providing a contextualized representation of code and comment for improved interpretation.

5 of 8

THE STRENGTH OF OUR ARCHITECTURE

Holistic Data Approach:

    • Blending synthetic and real-world examples for a comprehensive dataset that mirrors coding diversity.

Contextual Understanding with BERT:

    • BERT embeddings enhance the model's contextual grasp, refining assessments for accuracy.

Scalability and Adaptability:

    • Designed for seamless scalability, accommodating future data sources and technological advancements.


6 of 8

LABEL GENERATION AND DATASET CONSTRUCTION

Label Generation with Curie Engine:

    • Presenting prompts to the Curie Engine, incorporating both code and comment.
    • Generating labels indicating whether a comment is deemed "Useful" or "Not Useful" for the associated code.
    • Sophisticated interplay of machine learning and natural language processing.
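The labelling step can be sketched as a zero-temperature classification prompt against the same legacy client; the prompt wording and the parsing rule below are assumptions:

```python
def label_prompt(code: str, comment: str) -> str:
    """Ask for a binary usefulness judgement (wording is an assumption)."""
    return (
        "Decide whether the comment is Useful or Not Useful "
        "for the C code below.\n\n"
        f"Code:\n{code}\n"
        f"Comment: {comment}\n"
        "Answer (Useful or Not Useful):"
    )

def parse_label(completion: str) -> str:
    """Map the model's free-text answer onto the two dataset labels."""
    return "Not Useful" if "not useful" in completion.lower() else "Useful"

def label_pair(code: str, comment: str) -> str:
    """Label one pair with the Curie engine (needs an API key to run)."""
    import openai  # legacy (<1.0) client
    resp = openai.Completion.create(
        engine="text-curie-001",
        prompt=label_prompt(code, comment),
        max_tokens=5,
        temperature=0.0,  # deterministic labels
    )
    return parse_label(resp["choices"][0]["text"])
```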

Meticulous Dataset Construction:

    • Assembling the labeled pairs from the Curie model.
    • Forming the building blocks of a new dataset, each entry comprising code, comment, and generated label.


7 of 8

KEY RESULTS AND ACHIEVEMENTS


Precision and Recall Improvements:

    • SVM precision increased by 6%, reaching 0.85.
    • ANN recall improved by 1.5%, reaching 0.746.
    • Minor enhancements observed in ANN models with tanh and logistic activations.

Data Diversity Impact:

    • Addition of 1239 LLM-generated entries enriched training data diversity.
    • Improved sensitivity and generalization capabilities observed in models.
    • Enhanced ability to distinguish between "Useful" and "Not Useful" comments.

Efficient Dataset Generation:

    • Curie model and BERT embeddings efficiently captured code comment intricacies.
    • Labeling using LLMs streamlined dataset creation, proving more efficient than manual efforts.
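The augmentation effect can be illustrated end to end by training the same SVM with and without the extra rows; in this sketch the features are synthetic stand-ins for the BERT embeddings, and the dataset sizes and scores are illustrative, not the reported results:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the embedded code-comment pairs.
X, y = make_classification(n_samples=1200, n_features=32,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Treat the first 800 training rows as the seed set and the
# remainder as LLM-generated additions.
svm_seed = SVC().fit(X_train[:800], y_train[:800])
svm_full = SVC().fit(X_train, y_train)

p_seed = precision_score(y_test, svm_seed.predict(X_test))
p_full = precision_score(y_test, svm_full.predict(X_test))
print(f"precision seed-only: {p_seed:.3f}  augmented: {p_full:.3f}")
```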

8 of 8
