1 of 12

SANN: A Subtree-based Attention Neural Network Model for Student Success Prediction Through Source Code Analysis

Muntasir Hoq

North Carolina State University

mhoq@ncsu.edu

Peter Brusilovsky

University of Pittsburgh

peterb@pitt.edu

Bita Akram

North Carolina State University

bakram@ncsu.edu

2 of 12

Motivation

  • Automated code analysis tools can help in understanding students' knowledge and learning processes

  • These tools, combined with student success prediction, can improve learner experiences and enable personalized interventions for struggling students

  • Embedding models can condense student programs into vector representations that can be used to further analyze student code


3 of 12

Literature review

  • Existing models demonstrate the effectiveness of capturing information from source code ASTs
    • TBCNN
    • TreeLSTM
    • ast2vec
    • code2vec
  • BKT, DKT, and program trajectories have been used in student success prediction
  • Recent studies also provide evidence of the effectiveness of models like code2vec in student success prediction


4 of 12

Overview

  • We propose a Subtree-based Attention Neural Network (SANN)
  • It creates vector representations of student code
  • Subtrees are extracted sequentially to preserve semantic information
  • A two-way embedding approach (node-based embedding + subtree-based embedding) helps detect similar code structures effectively
  • An attention mechanism retains as much subtree information as possible, mitigating the vanishing gradient problem
  • SANN was evaluated on two student code classification tasks to test its effectiveness
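The paper's pipeline operates on ASTs of student programs; as an illustrative sketch only (not SANN's actual procedure), the idea of extracting subtrees sequentially can be shown with Python's stdlib `ast` module. Treating each statement node as a subtree root and capping subtree size with a `max_nodes` parameter are both assumptions made for this sketch:

```python
import ast

def extract_subtrees(source, max_nodes=50):
    """Yield one subtree (as a list of node-type names) per statement,
    in the order ast.walk visits them, so the sequential structure of
    the program is preserved."""
    subtrees = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):  # each statement roots one subtree
            tokens = [type(n).__name__ for n in ast.walk(node)]
            if len(tokens) <= max_nodes:  # skip oversized subtrees
                subtrees.append(tokens)
    return subtrees

program = "x = 1\nif x > 0:\n    print(x)\n"
print(extract_subtrees(program))
```

Each extracted token list can then be fed to an embedding layer; keeping the statements in source order is what preserves the semantic flow mentioned above.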


5 of 12

Architecture

[Figure: SANN model architecture]


6 of 12

Classification tasks

  • Task 1: Student programming submissions from a single assignment
  • Task 2: Student programming submissions across multiple assignments

  • Predictions: Correct/Incorrect submission

  • Dataset:
    • CodeWorkout (Spring 2019)


Properties                Task 1   Task 2
Compilable submissions     1850     9403
Correct                     344     3162
Incorrect                  1506     6241

7 of 12

Research questions

  • RQ1. How well does SANN perform in classifying source codes from a single assignment?


Model      Accuracy   Precision   Recall   F1-score
SVM        0.85       0.78        0.63     0.63
KNN        0.86       0.79        0.66     0.70
XGBoost    0.86       0.76        0.77     0.76
code2vec   0.89       0.84        0.77     0.80
SANN       0.92       0.92        0.80     0.86

8 of 12

Research questions

  • RQ2. How well does SANN perform in classifying source codes across multiple assignments?


Model      Accuracy   Precision   Recall   F1-score
SVM        0.74       0.71        0.70     0.70
KNN        0.75       0.72        0.70     0.71
XGBoost    0.77       0.75        0.74     0.74
code2vec   0.79       0.76        0.76     0.76
SANN       0.86       0.85        0.83     0.84

9 of 12

Research questions

  • RQ3. How well does the integration of the two levels of embedding (node-based embedding and subtree-based embedding) improve the performance of the model?


Embedding approach        Accuracy   Precision   Recall   F1-score
node-based embedding      0.75       0.72        0.71     0.71
subtree-based embedding   0.83       0.82        0.82     0.82
two-way embedding         0.86       0.85        0.83     0.84
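The slides do not spell out how the two embedding levels are fused, so the following is a minimal NumPy sketch under assumed design choices: mean-pooling over node vectors, softmax attention-pooling over subtree vectors, and concatenation of the two summaries. The attention query `w` would be learned during training; here it is random for illustration:

```python
import numpy as np

def attention_pool(subtree_vecs, w):
    """Softmax-weighted sum over subtree vectors: every subtree keeps a
    nonzero contribution, unlike the last hidden state of a plain RNN."""
    scores = subtree_vecs @ w
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ subtree_vecs

def two_way_embed(node_vecs, subtree_vecs, w):
    """Concatenate a node-level summary (mean pooling) with a
    subtree-level summary (attention pooling) into one program vector."""
    return np.concatenate([node_vecs.mean(axis=0),
                           attention_pool(subtree_vecs, w)])

rng = np.random.default_rng(0)
node_vecs = rng.normal(size=(30, 16))    # one row per AST node
subtree_vecs = rng.normal(size=(7, 16))  # one row per extracted subtree
w = rng.normal(size=16)                  # attention query (learned in practice)
print(two_way_embed(node_vecs, subtree_vecs, w).shape)  # (32,)
```

Concatenation lets the classifier see both views at once, which is consistent with the two-way row outperforming either one-way embedding in the table above.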

10 of 12

Conclusion

  • SANN outperformed both traditional machine learning models and code2vec in classifying student submissions as correct or incorrect

  • The two-way embedding approach embeds subtrees more effectively than either one-way embedding, capturing more useful information from student programs


11 of 12

Limitations and future work

  • SANN lacks generalization to larger and deeper ASTs with more nodes

  • Such ASTs result in a larger number of subtrees with high node counts

  • In the future, we will devise a more effective subtree extraction process:
    • Regulate the depth of each subtree
    • Extract non-overlapping subtrees
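The planned direction, depth-regulated and non-overlapping subtrees, could look like the following sketch (again using Python's stdlib `ast` module purely for illustration; `max_depth` is a hypothetical parameter): recurse past any node whose subtree exceeds the depth budget, and emit a node as a subtree root only once its entire subtree fits, never revisiting an emitted region:

```python
import ast

def bounded_subtree_roots(source, max_depth=3):
    """Partition an AST into non-overlapping subtrees: descend through
    nodes deeper than max_depth, and emit a node (by type name) once
    its entire subtree fits within the depth budget."""
    def depth(node):
        kids = list(ast.iter_child_nodes(node))
        return 1 + max(map(depth, kids), default=0)

    roots = []
    def visit(node):
        if depth(node) <= max_depth:
            roots.append(type(node).__name__)  # emit; do not descend further
        else:
            for child in ast.iter_child_nodes(node):
                visit(child)
    visit(ast.parse(source))
    return roots

src = "for i in range(3):\n    if i:\n        print(i)\n"
print(bounded_subtree_roots(src))  # ['Name', 'Call', 'Name', 'Call']
```

Because every node is assigned to exactly one emitted subtree, this scheme avoids the explosion of overlapping subtrees noted above for large, deep ASTs.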


12 of 12

Thank you!
