1 of 24

Can Students Understand AI Decisions Based on Variables Extracted via AutoML?

Jackie Tang, Nigel Bosch


IEEE International Conference on Systems, Man, and Cybernetics

(Human + Machine) Learning Lab

2 of 24

BACKGROUND

  • E-learning and educational data mining (EDM) are transforming education

  • Machine learning provides valuable insights into student performance (e.g., student-facing dashboards)

But there is a catch: can students understand these AI decisions?

Challenge: ensuring AI decisions are understandable

3 of 24

Student-facing dashboard


4 of 24

E-Learning

E-learning system data flow:

  • Interaction data: clicks, keyboard, views
  • Assessments: midterm exam, final exam
  • Modeling → prediction → evaluation

5 of 24

Educational Data Mining

Pipeline: data processing → feature extraction → ML algorithm → prediction

6 of 24

Example


Feature type: Combined features
  Description: Created by multiplying two or more existing aggregated features; they capture the combined effect of multiple features.
  Example: MIN(assessmentsmerged.PERCENTILE(score))
  Translation: The lowest score a student has achieved, ranked as a percentile relative to their peers.

Feature type: Statistical aggregation features
  Description: Created by applying statistical aggregation functions.
  Example: SUM(studentVle.date)
  Translation: Adds up all the dates on which a student clicked on any course material.

Feature type: Time series features
  Description: Extracts a large number of time series characteristics by applying mathematical transformations and statistical functions to time series data, then calculating properties of the resulting transformed series.
  Example: forumng__change_quantiles__f_agg_"var"__isabs_False__qh_0.6__ql_0.2
  Translation: Variance of the differences in the number of forum clicks, excluding the lowest 20% and the highest 40% of values.

Feature type: Expert features
  Description: Created by experts, who identify and construct meaningful variables that capture important aspects of the problem domain.
  Example: Score_higher_than_mean
  Translation: The number of scores a student received that are higher than the class average.

7 of 24

Research Questions

  • RQ1: How does the interpretability of AutoML features compare to expert-crafted features?
    • Hypothesis: AutoML features are harder to interpret

  • RQ2: What characteristics make a feature more interpretable to students?
    • Hypothesis: statistical methods, aggregation functions, and term familiarity may influence interpretability


8 of 24

Dataset


OULAD
  • Open University, UK
  • 2013–2014
  • Distance learning courses
  • ~32,000 students
  • Student demographics, course info, VLE interactions
  • VLE activity, assessments
  • Large-scale online learning environment data

EPM
  • Middle East Technical University, Turkey
  • 2012
  • Face-to-face learning processes
  • 115 students
  • Student activities, resource usage, academic performance
  • Student activities, resource accesses, time spent
  • Detailed process-oriented data on learning behaviors

9 of 24

Feature Engineering


Featuretools (AutoML):
  • Automated feature extraction for relational data
  • Uses Deep Feature Synthesis to create features across multiple tables
  • Resulted in 668 features after removing invariant and redundant ones
  • Capable of creating complex, multi-level aggregations

tsfresh (AutoML):
  • Specialized in time series feature extraction
  • Generates both simple and complex features from time series data
  • Initially produced 6,312 features
  • Captures trends like student pacing or time-allocation strategies

Expert features:
  • Focused on predictive utility rather than interpretability
  • Created a smaller set of features (75 for EPM, 28 for OULAD)
  • Included derived features like "past due" submissions
  • Incorporated common educational metrics (e.g., quiz score quantiles)
  • Designed to represent typical feature engineering work in educational data mining

Selection for survey use: top 15 features via a decision tree model

10 of 24

Feature Engineering


Pipeline: data preprocessing → feature generation → feature selection

Featuretools feature-generation steps:
  1. Entity setup: define the "students" and "vle" entities
  2. Relationship definition: link the entities on student_id
  3. Primitive definition: aggregation primitive "median"; transform primitive "sum"
  4. Feature generation: apply the "sum" transform to the "click" column in the "vle" entity, then aggregate these sums using the "median" function for each student
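The "sum then median" aggregation described above can be sketched in plain Python. This is a toy re-implementation of what the generated feature computes, not Featuretools itself; the sample rows and the per-session grouping are hypothetical stand-ins for the real entity set:

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy rows from a "vle" entity: (student_id, session_id, clicks)
vle_rows = [
    ("s1", "a", 3), ("s1", "a", 2), ("s1", "b", 5),
    ("s2", "a", 1), ("s2", "b", 1),
]

def median_of_click_sums(rows):
    """Apply the 'sum' transform to clicks within each (student, session),
    then aggregate those sums with 'median' per student."""
    per_session = defaultdict(lambda: defaultdict(int))
    for student, session, clicks in rows:
        per_session[student][session] += clicks
    return {student: median(sums.values())
            for student, sums in per_session.items()}

print(median_of_click_sums(vle_rows))  # {'s1': 5.0, 's2': 1.0}
```

Deep Feature Synthesis builds such features automatically by composing every applicable transform/aggregation primitive along the defined entity relationships.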

tsfresh change_quantiles steps:
  1. Segment time series: tsfresh identifies the data points at the 20th and 60th percentiles of the time series and uses these points to divide it into segments.
  2. Calculate changes: within each segment, tsfresh computes the changes between consecutive data points. Since isabs=False, it uses the raw changes, not their absolute values.
  3. Compute variance: tsfresh calculates the variance of all the changes computed in step 2. Variance measures how far the changes are spread out from their average.

Example features: MIN(assessmentsmerged.PERCENTILE(score)) (Featuretools) and forumng__change_quantiles__f_agg_"var"__isabs_False__qh_0.6__ql_0.2 (tsfresh)
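The change_quantiles computation can be mirrored in plain Python. This is a minimal stdlib sketch of what f_agg="var", isabs_False, qh_0.6, ql_0.2 computes; the real tsfresh implementation uses NumPy and may differ in quantile-interpolation details:

```python
from statistics import pvariance

def quantile(values, q):
    """Linear-interpolation quantile (like NumPy's default method)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def change_quantiles(series, ql=0.2, qh=0.6, isabs=False):
    """Variance of consecutive changes whose two endpoints both lie
    inside the corridor between the ql and qh quantiles of the series."""
    lo, hi = quantile(series, ql), quantile(series, qh)
    inside = [lo <= v <= hi for v in series]
    changes = [b - a for a, b in zip(series, series[1:])]
    kept = [abs(c) if isabs else c
            for c, ok_a, ok_b in zip(changes, inside, inside[1:])
            if ok_a and ok_b]
    return pvariance(kept) if kept else 0.0
```

Applied to a student's daily forum-click counts, a high value means that activity in the mid-range of the series changes erratically from day to day; a constant or steadily increasing series yields zero.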

11 of 24

Method


Participants:

  • 199 college students, mean age 29.6 years
  • Diverse backgrounds, primarily novice to intermediate AI/ML knowledge

Task structure:

  • 20 prediction tasks (5 predictions × 2 datasets × 2 feature types)
  • Rate the interpretability of each feature on a 5-point scale
  • Base payment: $5 USD
  • Bonus: additional $2 for the top 10% of performers

Machine learning models:

  • Random forest classification/regression
  • 70/30 train/test split

Dataset | Feature set   | Metric | Result
--------|---------------|--------|-------
EPM     | TSFRESH       | R²     | 0.480
EPM     | Expert        | R²     | 0.443
OULAD   | Featuretools  | AUC    | 0.812
OULAD   | Expert        | AUC    | 0.789
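For context on the AUC figures above: AUC equals the probability that a randomly chosen positive example receives a higher model score than a randomly chosen negative one. A minimal stdlib sketch (the toy labels and scores are illustrative, not from the study):

```python
def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))  # 0.75
```

An AUC of 0.812 therefore means the Featuretools model ranks a passing student above a failing one about 81% of the time.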

12 of 24

Research procedure

  1. Consenting / introduction
  2. Verification (pass / no pass)
  3. Prediction tasks, for each feature type:
     • Make your own prediction
     • Result shown; hidden value revealed
     • Prediction task with ML prediction result
  4. Rate interpretability

13 of 24

Research procedure



16 of 24

Research Procedure


Verification step: participants predict the outcome (Y/N), and their answer is matched against the real result (e.g., PASS).

Participants then rate the interpretability of each feature.

Note shown to participants: "We previously evaluated the algorithm on a large dataset of student data, and the system reliably performs the task with 75–85% accuracy."

17 of 24

Results - RQ1 Expert vs AutoML Interpretability

  • Expert features were significantly more interpretable
    • Kruskal-Wallis test: H = 315.607, p < .001
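The Kruskal-Wallis test pools all interpretability ratings, ranks them, and compares mean ranks across groups. A minimal stdlib sketch of the H statistic (tie correction omitted, so it matches library implementations only for tie-free data; the two toy groups are illustrative):

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1),
    where R_i is the rank sum of group i (average ranks for ties)."""
    pooled = sorted(v for g in groups for v in g)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n = len(pooled)
    return 12 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

print(round(kruskal_h([1, 2, 3], [4, 5, 6]), 3))  # 3.857
```

Larger H means the groups' rank distributions separate more cleanly; an H of 315.6 on this sample size corresponds to an extremely small p-value.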


18 of 24

Results - RQ2: Data Type

Top 5 interpretability by data type:

  • Most interpretable: score-based features
  • Moderate: keystroke, time, click, date
  • Least interpretable: mousewheel interactions
  • Significant differences: H = 221.235, p < .001

19 of 24

Results - RQ2: Aggregation

  • Number of levels:
  • Significant difference: H-value = 82.016, p < .001

  • Function types:
  • Significant differences: F = 38.126, p = .004


20 of 24

Results - RQ2: Familiarity

  • Recurrent exposure: how often participants encountered a particular aggregation function during the survey
    • No significant correlation (Spearman's rho = .009, p = .983)

  • Lexical familiarity: how common or recognizable the words used in feature descriptions are in everyday language
    • No significant correlation (Spearman's rho = -.029, p = .837)
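Spearman's rho, used for both correlations above, is simply the Pearson correlation computed on rank-transformed values (average ranks for ties). A minimal stdlib sketch with illustrative data:

```python
def spearman_rho(xs, ys):
    """Pearson correlation of the average-rank transforms of xs and ys."""
    def ranks(values):
        s = sorted(values)
        avg = {}
        i = 0
        while i < len(s):
            j = i
            while j < len(s) and s[j] == s[i]:
                j += 1
            avg[s[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
            i = j
        return [avg[v] for v in values]

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(round(spearman_rho([1, 2, 3, 4], [10, 20, 40, 30]), 6))  # 0.8
```

Rho values near zero, as reported here, mean rating order is essentially unrelated to exposure frequency or word familiarity.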


21 of 24

Results - Key Takeaways

  • Root data type significantly impacts interpretability
    • Preference for score and timing data over interaction data

  • Aggregation functions affect interpretability
    • Cumulative and proportional calculations were the most understandable

  • Non-effects:
    • Repeated exposure and lexical familiarity did not significantly impact interpretability


22 of 24

Limitations and Future Work

Limitations:
  • Short-term study
  • Limited to the college student perspective

Future research:
  • Long-term interpretability impacts
  • Contextual factors in real classrooms
  • Multi-stakeholder interpretability (teachers, parents)
  • Relationship between interpretability and AI trust/reliance


23 of 24

Questions?


Contact:

24 of 24

Citation
