1 of 12

SANN: A Subtree-based Attention Neural Network Model for Student Success Prediction Through Source Code Analysis

Muntasir Hoq

North Carolina State University

mhoq@ncsu.edu

Peter Brusilovsky

University of Pittsburgh

peterb@pitt.edu

Bita Akram

North Carolina State University

bakram@ncsu.edu

2 of 12

Motivation

  • Automated code analysis tools can help in understanding students' knowledge and learning processes

  • These tools, combined with student success prediction, can improve learner experiences and enable personalized interventions for struggling students

  • Embedding models can condense student programs into vector representations that can be used to further analyze student code


3 of 12

Literature review

  • Existing models demonstrate the effectiveness of capturing information from source code ASTs
    • TBCNN
    • TreeLSTM
    • ast2vec
    • code2vec
  • BKT, DKT, and program trajectories have been used in student success prediction
  • Recent studies also provide evidence of the effectiveness of models like code2vec in student success prediction


4 of 12

Overview

  • We propose a Subtree-based Attention Neural Network (SANN)
  • It creates vector representations of student code
  • Subtrees are extracted sequentially to preserve semantic information
  • A two-way embedding approach (node-based embedding + subtree-based embedding) helps detect similar code structures effectively
  • An attention mechanism retains as much subtree information as possible, mitigating the vanishing gradient problem
  • SANN was evaluated on two student code classification tasks to test its effectiveness
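The paper's pipeline operates on ASTs of student programs; as an illustrative sketch only (not SANN's actual procedure), the idea of extracting subtrees sequentially can be shown with Python's stdlib `ast` module. Treating each statement node as a subtree root and capping subtree size with a `max_nodes` parameter are both assumptions made for this sketch:

```python
import ast

def extract_subtrees(source, max_nodes=50):
    """Yield one subtree (as a list of node-type names) per statement,
    in the order ast.walk visits them, so the sequential structure of
    the program is preserved."""
    subtrees = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):  # each statement roots one subtree
            tokens = [type(n).__name__ for n in ast.walk(node)]
            if len(tokens) <= max_nodes:  # skip oversized subtrees
                subtrees.append(tokens)
    return subtrees

program = "x = 1\nif x > 0:\n    print(x)\n"
print(extract_subtrees(program))
```

Each extracted token list can then be fed to an embedding layer; keeping the statements in source order is what preserves the semantic flow mentioned above.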


5 of 12

Architecture

[Figure: SANN model architecture]


6 of 12

Classification tasks

  • Task 1: Student programming submissions from a single assignment
  • Task 2: Student programming submissions across multiple assignments

  • Predictions: Correct/Incorrect submission

  • Dataset:
    • CodeWorkout (Spring 2019)


Properties                Task 1   Task 2
Compilable submissions     1850     9403
Correct                     344     3162
Incorrect                  1506     6241

7 of 12

Research questions

  • RQ1. How well does SANN perform in classifying source codes from a single assignment?


Model      Accuracy   Precision   Recall   F1-score
SVM        0.85       0.78        0.63     0.63
KNN        0.86       0.79        0.66     0.70
XGBoost    0.86       0.76        0.77     0.76
code2vec   0.89       0.84        0.77     0.80
SANN       0.92       0.92        0.80     0.86

8 of 12

Research questions

  • RQ2. How well does SANN perform in classifying source codes across multiple assignments?


Model      Accuracy   Precision   Recall   F1-score
SVM        0.74       0.71        0.70     0.70
KNN        0.75       0.72        0.70     0.71
XGBoost    0.77       0.75        0.74     0.74
code2vec   0.79       0.76        0.76     0.76
SANN       0.86       0.85        0.83     0.84

9 of 12

Research questions

  • RQ3. How well does the integration of the two levels of embedding (node-based embedding and subtree-based embedding) improve the performance of the model?


Embedding approach        Accuracy   Precision   Recall   F1-score
node-based embedding      0.75       0.72        0.71     0.71
subtree-based embedding   0.83       0.82        0.82     0.82
two-way embedding         0.86       0.85        0.83     0.84
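The slides do not spell out how the two embedding levels are fused, so the following is a minimal NumPy sketch under assumed design choices: mean-pooling over node vectors, softmax attention-pooling over subtree vectors, and concatenation of the two summaries. The attention query `w` would be learned during training; here it is random for illustration:

```python
import numpy as np

def attention_pool(subtree_vecs, w):
    """Softmax-weighted sum over subtree vectors: every subtree keeps a
    nonzero contribution, unlike the last hidden state of a plain RNN."""
    scores = subtree_vecs @ w
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ subtree_vecs

def two_way_embed(node_vecs, subtree_vecs, w):
    """Concatenate a node-level summary (mean pooling) with a
    subtree-level summary (attention pooling) into one program vector."""
    return np.concatenate([node_vecs.mean(axis=0),
                           attention_pool(subtree_vecs, w)])

rng = np.random.default_rng(0)
node_vecs = rng.normal(size=(30, 16))    # one row per AST node
subtree_vecs = rng.normal(size=(7, 16))  # one row per extracted subtree
w = rng.normal(size=16)                  # attention query (learned in practice)
print(two_way_embed(node_vecs, subtree_vecs, w).shape)  # (32,)
```

Concatenation lets the classifier see both views at once, which is consistent with the two-way row outperforming either one-way embedding in the table above.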

10 of 12

Conclusion

  • SANN outperformed both traditional machine learning models and code2vec in classifying student submissions as correct or incorrect

  • The two-way embedding approach embeds subtrees more effectively than either one-way embedding, capturing more useful information from student programs


11 of 12

Limitations and future work

  • SANN lacks generalization to larger and deeper ASTs with more nodes

  • Such ASTs result in a larger number of subtrees with high node counts

  • In the future, we will devise a more effective subtree extraction process:
    • Regulate the depth of each subtree
    • Extract non-overlapping subtrees
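The planned direction, depth-regulated and non-overlapping subtrees, could look like the following sketch (again using Python's stdlib `ast` module purely for illustration; `max_depth` is a hypothetical parameter): recurse past any node whose subtree exceeds the depth budget, and emit a node as a subtree root only once its entire subtree fits, never revisiting an emitted region:

```python
import ast

def bounded_subtree_roots(source, max_depth=3):
    """Partition an AST into non-overlapping subtrees: descend through
    nodes deeper than max_depth, and emit a node (by type name) once
    its entire subtree fits within the depth budget."""
    def depth(node):
        kids = list(ast.iter_child_nodes(node))
        return 1 + max(map(depth, kids), default=0)

    roots = []
    def visit(node):
        if depth(node) <= max_depth:
            roots.append(type(node).__name__)  # emit; do not descend further
        else:
            for child in ast.iter_child_nodes(node):
                visit(child)
    visit(ast.parse(source))
    return roots

src = "for i in range(3):\n    if i:\n        print(i)\n"
print(bounded_subtree_roots(src))  # ['Name', 'Call', 'Name', 'Call']
```

Because every node is assigned to exactly one emitted subtree, this scheme avoids the explosion of overlapping subtrees noted above for large, deep ASTs.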


12 of 12

Thank you!
