1 of 24

Can Students Understand AI Decisions Based on Variables Extracted via AutoML?

Jackie Tang, Nigel Bosch


IEEE International Conference on Systems, Man, and Cybernetics

(Human + Machine) Learning Lab

2 of 24

BACKGROUND

  • E-learning and educational data mining (EDM) are transforming education

  • Machine learning provides valuable insights into student performance (e.g., student-facing dashboards)

But there is a catch: can students understand these AI decisions?

Challenge: ensuring AI decisions are understandable

3 of 24

Student-facing dashboard


4 of 24

E-Learning

E-learning system data flow:

  • Interaction data: clicks, keyboard, views
  • Assessments: midterm exam, final exam
  • Modeling → prediction → evaluation

5 of 24

Educational Data Mining

Pipeline: data processing → feature extraction → ML algorithm → prediction

6 of 24

Example


Feature type: Combined features
  Description: Created by multiplying two or more existing aggregated features; they capture the combined effect of multiple features.
  Example: MIN(assessmentsmerged.PERCENTILE(score))
  Translation: The lowest score a student has achieved, ranked as a percentile relative to their peers.

Feature type: Statistical aggregation features
  Description: Created by applying statistical aggregation functions.
  Example: SUM(studentVle.date)
  Translation: Adds up all the dates on which a student clicked on any course material.

Feature type: Time series features
  Description: Extracts a large number of time series characteristics by applying mathematical transformations and statistical functions to time series data, then calculating properties of the resulting transformed series.
  Example: forumng__change_quantiles__f_agg_"var"__isabs_False__qh_0.6__ql_0.2
  Translation: Variance of the differences in the number of forum clicks, excluding the lowest 20% and the highest 40% of values.

Feature type: Expert features
  Description: Created by experts, who identify and construct meaningful variables that capture important aspects of the problem domain.
  Example: Score_higher_than_mean
  Translation: The number of scores a student received that are higher than the class average.

7 of 24

Research Questions

  • RQ1: How does the interpretability of AutoML features compare to expert-crafted features?
    • Hypothesis: AutoML features are harder to interpret

  • RQ2: What characteristics make a feature more interpretable to students?
    • Hypothesis: statistical methods, aggregation functions, and term familiarity may influence interpretability


8 of 24

Dataset


OULAD
  • Open University, UK
  • 2013–2014
  • Distance learning courses
  • ~32,000 students
  • Student demographics, course info, VLE interactions
  • VLE activity, assessments
  • Large-scale online learning environment data

EPM
  • Middle East Technical University, Turkey
  • 2012
  • Face-to-face learning processes
  • 115 students
  • Student activities, resource usage, academic performance
  • Student activities, resource accesses, time spent
  • Detailed process-oriented data on learning behaviors

9 of 24

Feature Engineering


Featuretools (AutoML):
  • Automated feature extraction for relational data
  • Uses Deep Feature Synthesis to create features across multiple tables
  • Resulted in 668 features after removing invariant and redundant ones
  • Capable of creating complex, multi-level aggregations

tsfresh (AutoML):
  • Specialized in time series feature extraction
  • Generates both simple and complex features from time series data
  • Initially produced 6,312 features
  • Captures trends like student pacing or time-allocation strategies

Expert features:
  • Focused on predictive utility rather than interpretability
  • Created a smaller set of features (75 for EPM, 28 for OULAD)
  • Included derived features like "past due" submissions
  • Incorporated common educational metrics (e.g., quiz score quantiles)
  • Designed to represent typical feature engineering work in educational data mining

Selection for survey use: top 15 features via a decision tree model

10 of 24

Feature Engineering


Pipeline: data preprocessing → feature generation → feature selection

Featuretools feature-generation steps:
  1. Entity setup: define the "students" and "vle" entities
  2. Relationship definition: link the entities on student_id
  3. Primitive definition: aggregation primitive "median"; transform primitive "sum"
  4. Feature generation: apply the "sum" transform to the "click" column in the "vle" entity, then aggregate these sums using the "median" function for each student
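The "sum then median" aggregation described above can be sketched in plain Python. This is a toy re-implementation of what the generated feature computes, not Featuretools itself; the sample rows and the per-session grouping are hypothetical stand-ins for the real entity set:

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy rows from a "vle" entity: (student_id, session_id, clicks)
vle_rows = [
    ("s1", "a", 3), ("s1", "a", 2), ("s1", "b", 5),
    ("s2", "a", 1), ("s2", "b", 1),
]

def median_of_click_sums(rows):
    """Apply the 'sum' transform to clicks within each (student, session),
    then aggregate those sums with 'median' per student."""
    per_session = defaultdict(lambda: defaultdict(int))
    for student, session, clicks in rows:
        per_session[student][session] += clicks
    return {student: median(sums.values())
            for student, sums in per_session.items()}

print(median_of_click_sums(vle_rows))  # {'s1': 5.0, 's2': 1.0}
```

Deep Feature Synthesis builds such features automatically by composing every applicable transform/aggregation primitive along the defined entity relationships.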

tsfresh change_quantiles steps:
  1. Segment time series: tsfresh identifies the data points at the 20th and 60th percentiles of the time series and uses these points to divide it into segments.
  2. Calculate changes: within each segment, tsfresh computes the changes between consecutive data points. Since isabs=False, it uses the raw changes, not their absolute values.
  3. Compute variance: tsfresh calculates the variance of all the changes computed in step 2. Variance measures how far the changes are spread out from their average.

Example features: MIN(assessmentsmerged.PERCENTILE(score)) (Featuretools) and forumng__change_quantiles__f_agg_"var"__isabs_False__qh_0.6__ql_0.2 (tsfresh)
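The change_quantiles computation can be mirrored in plain Python. This is a minimal stdlib sketch of what f_agg="var", isabs_False, qh_0.6, ql_0.2 computes; the real tsfresh implementation uses NumPy and may differ in quantile-interpolation details:

```python
from statistics import pvariance

def quantile(values, q):
    """Linear-interpolation quantile (like NumPy's default method)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def change_quantiles(series, ql=0.2, qh=0.6, isabs=False):
    """Variance of consecutive changes whose two endpoints both lie
    inside the corridor between the ql and qh quantiles of the series."""
    lo, hi = quantile(series, ql), quantile(series, qh)
    inside = [lo <= v <= hi for v in series]
    changes = [b - a for a, b in zip(series, series[1:])]
    kept = [abs(c) if isabs else c
            for c, ok_a, ok_b in zip(changes, inside, inside[1:])
            if ok_a and ok_b]
    return pvariance(kept) if kept else 0.0
```

Applied to a student's daily forum-click counts, a high value means that activity in the mid-range of the series changes erratically from day to day; a constant or steadily increasing series yields zero.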

11 of 24

Method


Participants:

  • 199 college students, mean age 29.6 years
  • Diverse backgrounds, primarily novice to intermediate AI/ML knowledge

Task structure:

  • 20 prediction tasks (5 predictions × 2 datasets × 2 feature types)
  • Rate the interpretability of each feature on a 5-point scale
  • Base payment: $5 USD
  • Bonus: additional $2 for the top 10% of performers

Machine learning models:

  • Random forest classification/regression
  • 70/30 train/test split

Dataset | Feature set   | Metric | Result
--------|---------------|--------|-------
EPM     | TSFRESH       | R²     | 0.480
EPM     | Expert        | R²     | 0.443
OULAD   | Featuretools  | AUC    | 0.812
OULAD   | Expert        | AUC    | 0.789
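For context on the AUC figures above: AUC equals the probability that a randomly chosen positive example receives a higher model score than a randomly chosen negative one. A minimal stdlib sketch (the toy labels and scores are illustrative, not from the study):

```python
def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))  # 0.75
```

An AUC of 0.812 therefore means the Featuretools model ranks a passing student above a failing one about 81% of the time.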

12 of 24

Research procedure

  1. Consenting / introduction
  2. Verification (pass / no pass)
  3. Prediction tasks, for each feature type:
     • Make your own prediction
     • Result shown; hidden value revealed
     • Prediction task with ML prediction result
  4. Rate interpretability

13 of 24

Research procedure



16 of 24

Research Procedure


Verification step: participants predict the outcome (Y/N), and their answer is matched against the real result (e.g., PASS).

Participants then rate the interpretability of each feature.

Note shown to participants: "We previously evaluated the algorithm on a large dataset of student data, and the system reliably performs the task with 75–85% accuracy."

17 of 24

Results - RQ1 Expert vs AutoML Interpretability

  • Expert features were significantly more interpretable
    • Kruskal-Wallis test: H = 315.607, p < .001
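The Kruskal-Wallis test pools all interpretability ratings, ranks them, and compares mean ranks across groups. A minimal stdlib sketch of the H statistic (tie correction omitted, so it matches library implementations only for tie-free data; the two toy groups are illustrative):

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1),
    where R_i is the rank sum of group i (average ranks for ties)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n = len(pooled)
    return 12 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

print(round(kruskal_h([1, 2, 3], [4, 5, 6]), 3))  # 3.857
```

Larger H means the groups' rank distributions separate more cleanly; an H of 315.6 on this sample size corresponds to an extremely small p-value.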


18 of 24

Results - RQ2: Data Type

Top 5 interpretability by data type:

  • Most interpretable: score-based features
  • Moderate: keystroke, time, click, date
  • Least interpretable: mousewheel interactions
  • Significant differences: H = 221.235, p < .001

19 of 24

Results - RQ2: Aggregation

  • Number of levels:
  • Significant difference: H-value = 82.016, p < .001

  • Function types:
  • Significant differences: F = 38.126, p = .004


20 of 24

Results - RQ2: Familiarity

  • Recurrent exposure: how often participants encountered a particular aggregation function during the survey
    • No significant correlation (Spearman's rho = .009, p = .983)

  • Lexical familiarity: how common or recognizable the words used in feature descriptions are in everyday language
    • No significant correlation (Spearman's rho = -.029, p = .837)
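Spearman's rho, used for both correlations above, is simply the Pearson correlation computed on rank-transformed values (average ranks for ties). A minimal stdlib sketch with illustrative data:

```python
def spearman_rho(xs, ys):
    """Pearson correlation of the average-rank transforms of xs and ys."""
    def ranks(values):
        s = sorted(values)
        avg = {}
        i = 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            avg[s[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
            i = j
        return [avg[v] for v in values]

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(round(spearman_rho([1, 2, 3, 4], [10, 20, 40, 30]), 6))  # 0.8
```

Rho values near zero, as reported here, mean rating order is essentially unrelated to exposure frequency or word familiarity.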


21 of 24

Results - Key Takeaways

  • Root data type significantly impacts interpretability
    • Preference for score and timing data over interaction data

  • Aggregation functions affect interpretability
    • Cumulative and proportional calculations were the most understandable

  • Non-effects:
    • Repeated exposure and lexical familiarity did not significantly impact interpretability


22 of 24

Limitations and Future Work

Limitations:
  • Short-term study
  • Limited to the college student perspective

Future research:
  • Long-term interpretability impacts
  • Contextual factors in real classrooms
  • Multi-stakeholder interpretability (teachers, parents)
  • Relationship between interpretability and AI trust/reliance


23 of 24

Questions?


Contact:

24 of 24

Citation
