1 of 8

FIRE 2023 – IRSE: SOFTWARE METADATA CLASSIFICATION BASED ON GENERATIVE ARTIFICIAL INTELLIGENCE

Seetharam Killivalavan – II Year

Dr. Durairaj Thenmozhi – Assistant Professor

Department of Computer Science

Sri Sivasubramaniya Nadar College of Engineering


2 of 8

INTRODUCTION

Objective:

    • Improving the accuracy of a binary code comment quality classification model by augmenting its training data.

Input Data:

    • 9048 labeled pairs of C code and comments, categorized as Useful or Not Useful.
    • Generated code-comment pairs that augment the seed data.

Enhancement Goal:

    • Developing a classification model that demonstrates the accuracy gain obtained from training on the augmented dataset.

Significance:

    • Addressing the need for heightened accuracy in code comment quality classification.

Enhancing Binary Code Comment Quality Classification


3 of 8

MULTI-SOURCE DATA COLLECTION

Data Source 1: OpenAI API - Curie engine

    • Utilizing the power of the Curie engine through the OpenAI API.
    • Generating synthetic code-comment pairs, simulating real-world coding scenarios.
    • Leveraging Curie's natural language understanding for authentic and relevant examples.
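As a sketch, the Curie engine can be driven through the legacy (pre-1.0) `openai` Python client to produce synthetic pairs from a seed pair; the prompt wording, generation parameters, and the `build_prompt`/`generate_pair` helpers below are illustrative assumptions, not the exact pipeline:

```python
def build_prompt(seed_code: str, seed_comment: str) -> str:
    """Compose a one-shot prompt from a seed C code-comment pair
    (format is an assumption, not the presentation's exact prompt)."""
    return (
        "Generate a new C code snippet with an accompanying comment, "
        "in the style of the example below.\n\n"
        f"Code:\n{seed_code}\n"
        f"Comment: {seed_comment}\n\n"
        "Code:\n"
    )

def generate_pair(seed_code: str, seed_comment: str) -> str:
    """Ask the Curie engine for a fresh code-comment pair (needs an API key)."""
    import openai  # legacy (<1.0) client, where `engine=` selects Curie
    resp = openai.Completion.create(
        engine="text-curie-001",
        prompt=build_prompt(seed_code, seed_comment),
        max_tokens=150,
        temperature=0.7,  # some variety across generated pairs
    )
    return resp["choices"][0]["text"]
```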

Data Source 2: GitHub Repositories

    • Tapping into the vast landscape of GitHub repositories.
    • Extracting real-world code snippets and associated comments from open-source projects.
    • Capturing the nuances of actual development practices for authenticity.
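One simple way to mine such pairs from repository files is to pair each block comment with the first code line that follows it; this heuristic and the regex below are simplified assumptions (they ignore `//` line comments and multi-line declarations):

```python
import re

# Matches /* ... */ block comments, including multi-line ones.
_BLOCK_COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)

def extract_pairs(source: str) -> list[tuple[str, str]]:
    """Pair each block comment in a C source string with the first
    non-empty line of code that follows it."""
    pairs = []
    for m in _BLOCK_COMMENT.finditer(source):
        rest = source[m.end():].lstrip("\n")
        code_line = rest.split("\n", 1)[0].strip()
        if code_line:
            pairs.append((m.group(1).strip(), code_line))
    return pairs
```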


4 of 8

DATA PROCESSING AND BERT EMBEDDINGS

Data Processing Layer:

    • Transitioning to the data processing layer for deeper analysis of the collected pairs.
    • Feeding code-comment pairs from both sources as input prompts to the Curie engine.

BERT Embeddings:

    • Introducing BERT (Bidirectional Encoder Representations from Transformers) embeddings.
    • Providing a contextualized representation of code and comment for improved interpretation.

5 of 8

THE STRENGTH OF OUR ARCHITECTURE

Holistic Data Approach:

    • Blending synthetic and real-world examples for a comprehensive dataset that mirrors coding diversity.

Contextual Understanding with BERT:

    • BERT embeddings enhance the model's contextual grasp, refining assessments for accuracy.

Scalability and Adaptability:

    • Designed for seamless scalability, accommodating future data sources and technological advancements.


6 of 8

LABEL GENERATION AND DATASET CONSTRUCTION

Label Generation with Curie Engine:

    • Presenting prompts to the Curie Engine, incorporating both code and comment.
    • Generating labels indicating whether a comment is deemed "Useful" or "Not Useful" for the associated code.
    • Sophisticated interplay of machine learning and natural language processing.
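The labelling step can be sketched as a zero-temperature classification prompt against the same legacy client; the prompt wording and the parsing rule below are assumptions:

```python
def label_prompt(code: str, comment: str) -> str:
    """Ask for a binary usefulness judgement (wording is an assumption)."""
    return (
        "Decide whether the comment is Useful or Not Useful "
        "for the C code below.\n\n"
        f"Code:\n{code}\n"
        f"Comment: {comment}\n"
        "Answer (Useful or Not Useful):"
    )

def parse_label(completion: str) -> str:
    """Map the model's free-text answer onto the two dataset labels."""
    return "Not Useful" if "not useful" in completion.lower() else "Useful"

def label_pair(code: str, comment: str) -> str:
    """Label one pair with the Curie engine (needs an API key to run)."""
    import openai  # legacy (<1.0) client
    resp = openai.Completion.create(
        engine="text-curie-001",
        prompt=label_prompt(code, comment),
        max_tokens=5,
        temperature=0.0,  # deterministic labels
    )
    return parse_label(resp["choices"][0]["text"])
```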

Meticulous Dataset Construction:

    • Assembling the labeled pairs from the Curie model.
    • Forming the building blocks of a new dataset, each entry comprising code, comment, and generated label.


7 of 8

KEY RESULTS AND ACHIEVEMENTS


Precision and Recall Improvements:

    • SVM precision increased by 6%, reaching 0.85.
    • ANN recall improved by 1.5%, reaching 0.746.
    • Minor enhancements observed in ANN models with tanh and logistic activations.

Data Diversity Impact:

    • Addition of 1239 LLM-generated entries enriched training data diversity.
    • Improved sensitivity and generalization capabilities observed in models.
    • Enhanced ability to distinguish between "Useful" and "Not Useful" comments.

Efficient Dataset Generation:

    • Curie model and BERT embeddings efficiently captured code comment intricacies.
    • Labeling using LLMs streamlined dataset creation, proving more efficient than manual efforts.
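The augmentation effect can be illustrated end to end by training the same SVM with and without the extra rows; in this sketch the features are synthetic stand-ins for the BERT embeddings, and the dataset sizes and scores are illustrative, not the reported results:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the embedded code-comment pairs.
X, y = make_classification(n_samples=1200, n_features=32,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Treat the first 800 training rows as the seed set and the
# remainder as LLM-generated additions.
svm_seed = SVC().fit(X_train[:800], y_train[:800])
svm_full = SVC().fit(X_train, y_train)

p_seed = precision_score(y_test, svm_seed.predict(X_test))
p_full = precision_score(y_test, svm_full.predict(X_test))
print(f"precision seed-only: {p_seed:.3f}  augmented: {p_full:.3f}")
```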

8 of 8
