1 of 16

Riiid! Answer Correctness Prediction

Karmen Kink, Lisa Korotkova,

Taido Purason, Villem Tõnisson

2 of 16

Goal

In this competition, your challenge is to create algorithms for "Knowledge Tracing," the modeling of student knowledge over time. The goal is to accurately predict how students will perform on future interactions.

Data collected from Santa, an AI tutoring service (~780 000 students in South Korea) that prepares students for the TOEIC test

(link to paper)

3 of 16

Dataset

Training data columns

  • row_id
  • timestamp: the time in milliseconds between this user interaction and the first event completion from that user.
  • user_id
  • content_id: ID code for the user interaction
  • content_type_id: 0 if the event was a question being posed to the user, 1 if the event was the user watching a lecture.
  • task_container_id: ID code for the batch of questions or lectures.
  • user_answer: the user's answer to the question, if any.
  • answered_correctly: if the user responded correctly (-1 for lecture events).
  • prior_question_elapsed_time: The average time in milliseconds it took a user to answer each question in the previous question bundle
  • prior_question_had_explanation: Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle
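With over 100 million training rows, loading this table with compact per-column dtypes saves a lot of memory. A minimal sketch, assuming the standard Kaggle train.csv layout with the column names listed above (the inline two-row sample stands in for the real file):

```python
import io
import pandas as pd

# Compact dtypes for the training columns described above.
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",  # NaN on a user's first bundle
}

# Tiny inline sample standing in for the real ~100M-row train.csv.
sample = io.StringIO(
    "row_id,timestamp,user_id,content_id,content_type_id,task_container_id,"
    "user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation\n"
    "0,0,115,5692,0,1,3,1,,\n"
    "1,56943,115,5716,0,2,2,1,37000.0,False\n"
)
train = pd.read_csv(sample, dtype=dtypes)

# The explanation flag mixes True/False with missing values, so cast it to
# pandas' nullable boolean dtype after reading.
train["prior_question_had_explanation"] = train["prior_question_had_explanation"].astype("boolean")
```

The nullable `boolean` dtype keeps the missing first-bundle values as `pd.NA` instead of silently upcasting the whole column to object or float.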

4 of 16

Dataset

Question metadata

  • question_id: foreign key for the train/test content_id column, when the content type is question (0).
  • bundle_id: id for a bundle of questions (they are served together).
  • correct_answer: the answer to the question.
  • part: id of the relevant section of the TOEIC test.
  • tags: tag code(s) for the question.

Lecture metadata

  • lecture_id: foreign key for the train/test content_id column, when the content type is lecture (1).
  • part: id of the relevant section of the test
  • tag: tag code for the lecture.
  • type_of: brief description of the core purpose of the lecture
    • concept, solving question, intention, starter
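The metadata joins onto the interaction log through content_id: question rows (content_type_id == 0) match question_id, lecture rows match lecture_id. A sketch with hypothetical mini-frames mirroring the schemas above:

```python
import pandas as pd

# Toy interaction log and question metadata (values are made up).
train = pd.DataFrame({
    "user_id":            [115, 115, 2746],
    "content_id":         [5692, 89, 5692],
    "content_type_id":    [0, 1, 0],    # 0 = question, 1 = lecture
    "answered_correctly": [1, -1, 0],   # -1 marks lecture events
})
questions = pd.DataFrame({
    "question_id":    [5692],
    "bundle_id":      [5692],
    "correct_answer": [3],
    "part":           [5],
    "tags":           ["151 168"],
})

# Keep only question rows, then attach their metadata.
q_rows = train[train["content_type_id"] == 0]
merged = q_rows.merge(questions, left_on="content_id",
                      right_on="question_id", how="left")
```

A left join keeps question rows even if a content_id were missing from the metadata table, which makes gaps easy to spot.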

5 of 16

Dataset

  • Train set size: over 101 million rows
  • Test set size: 2.5 million rows
  • The train/test data is complete, in the sense that there are no missing interactions in the union of train and test data.

  • The test data follows chronologically after the train data. The test iterations give interactions of users chronologically.
  • The hidden test set contains new users but not new questions.
  • You can only submit from Kaggle Notebooks
  • You must use their custom riiideducation Python module to submit the prediction.

6 of 16

Evaluation

Metric: AUC

Scores on the public leaderboard:

  • Bronze: 0.756 - 0.760
  • Silver: 0.760 - 0.777
  • Gold: 0.778 - ...
  • Best score: 0.790
  • Best public notebook score: 0.756 (LGBM)
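AUC is the probability that a randomly chosen correct answer is scored above a randomly chosen incorrect one, so it can be computed from ranks alone. A minimal stdlib sketch via the rank-sum (Mann-Whitney U) formulation, using midranks for tied scores:

```python
def auc(labels, scores):
    """ROC AUC from binary labels (0/1) and real-valued scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, averaging ranks within groups of tied scores.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        mid = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = mid
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    # U statistic of the positives, normalized to [0, 1].
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

Because only the ordering of scores matters, any monotone transformation of the predicted probabilities leaves the leaderboard score unchanged.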

7 of 16

EDA

8 of 16

EDA

  • Latest user timestamp distribution
  • Percentage of questions answered correctly by user vs. number of questions answered
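The second plot can be reproduced with a single groupby: drop lecture rows, then aggregate each user's count and mean of answered_correctly. A sketch on a hypothetical interaction log:

```python
import pandas as pd

# Toy interaction log; answered_correctly == -1 marks lecture events.
df = pd.DataFrame({
    "user_id":            [1, 1, 1, 2, 2, 3],
    "answered_correctly": [1, 0, 1, -1, 1, 0],
})

# Exclude lectures, then compute per-user volume and accuracy.
qs = df[df["answered_correctly"] != -1]
per_user = qs.groupby("user_id")["answered_correctly"].agg(
    n_answered="count", pct_correct="mean"
)
```

Scattering pct_correct against n_answered (log scale on the count axis helps, given the long tail) gives the plot described above.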

9 of 16

EDA

Experienced users do better

(histograms of per-user accuracy for users with <50, ≥50, and ≥500 questions answered)

10 of 16

EDA

  • Questions have tags
  • Some are easier, some are harder

11 of 16

EDA

prior_question_elapsed_time distribution

12 of 16

Previous solutions

  • LGBM

Added features:

  • mean of target per user
  • mean of target per question
  • mean of whether the prior question included an explanation (per user)
  • how many times a question set has been seen by the same user on average
  • how many and which lectures a user has watched
  • ...

New or rarely seen questions in the test set were assigned the global mean; questions known to be very easy or very hard were assigned correspondingly high or low values.

Hyperparameters: objective=binary, boosting=gbdt, max_bin=800, lr=0.0175, num_leaves=80, early_stopping_rounds=12
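The per-user and per-question target means are plain target encodings with a global-mean fallback for unseen keys. A sketch of that part of the pipeline (column names follow the dataset; the toy values are made up), producing features that would then feed the LGBM model with the hyperparameters above:

```python
import pandas as pd

# Toy training interactions.
train = pd.DataFrame({
    "user_id":            [1, 1, 2, 2, 2],
    "content_id":         [10, 11, 10, 12, 11],
    "answered_correctly": [1, 0, 1, 1, 0],
})

# Encoding tables: global, per-user, and per-question accuracy.
global_mean = train["answered_correctly"].mean()
user_mean = train.groupby("user_id")["answered_correctly"].mean()
question_mean = train.groupby("content_id")["answered_correctly"].mean()

# Apply to test rows; unseen users/questions fall back to the global mean.
test = pd.DataFrame({"user_id": [1, 3], "content_id": [10, 99]})
test["user_acc"] = test["user_id"].map(user_mean).fillna(global_mean)
test["question_acc"] = test["content_id"].map(question_mean).fillna(global_mean)
```

In the real pipeline these encodings have to be computed only from past interactions (or out of fold) to avoid leaking each row's own label into its feature.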

13 of 16

Previous solutions

  • CNN

Added features were very similar to those of the previous approach

Hyperparameters: optimizer=Adam, lr=0.01, loss=binary_crossentropy, metric=binary_accuracy, dropout=0.1

14 of 16

Previous solutions

  • FTRL (Follow The Regularized Leader)
  • Implementation from the datatable library
  • Only 6 features (user_id, question_id, prior_question_elapsed_time, bundle_id, part, tags)
  • 90M train rows, the rest validation
  • 20 seconds of training
  • AUC 0.74 public, 0.72 validation
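The solution above used datatable's built-in Ftrl model, but the underlying per-coordinate update (FTRL-proximal, McMahan et al.) is short enough to sketch from scratch. Class name, feature hashing, and hyperparameters below are my own choices, not the competition code:

```python
import math
import zlib

class FTRL:
    """Minimal FTRL-proximal logistic regression over hashed binary features."""

    def __init__(self, dim=2**20, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.dim, self.alpha, self.beta, self.l1, self.l2 = dim, alpha, beta, l1, l2
        self.z = [0.0] * dim   # accumulated (adjusted) gradients
        self.n = [0.0] * dim   # accumulated squared gradients

    def _indices(self, features):
        # Deterministically hash string features into the weight space.
        return [zlib.crc32(f.encode()) % self.dim for f in features]

    def _weight(self, i):
        # Lazy closed-form weight: zero inside the L1 band, shrunk otherwise.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, features):
        wx = sum(self._weight(i) for i in self._indices(features))
        return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

    def update(self, features, y):
        p = self.predict(features)
        g = p - y  # logistic-loss gradient for a binary (0/1) feature
        for i in self._indices(features):
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
```

Each row's categorical values (user_id, question_id, part, ...) would be fed in as strings like `"user_id=115"`; the lazy weight computation is what lets this train in seconds over millions of rows.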

15 of 16

Difficulties

Submissions only from Kaggle kernels

  • CPU Notebook ≤ 9 hours run-time
  • GPU Notebook ≤ 9 hours run-time
  • TPU Notebook ≤ 3 hours run-time

Memory issues

Kaggle Time Series API

Competition module

Prediction in batches
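Schematically, the submission loop pulls chronological test batches from the competition environment and submits predictions for each before receiving the next. The sketch below mocks the environment so the flow is runnable; the real riiideducation module exposes a similar iterate-and-predict interface, but the class and method names here are stand-ins:

```python
import pandas as pd

class MockEnv:
    """Stand-in for the competition environment: serves batches, collects predictions."""

    def __init__(self, batches):
        self._batches = batches
        self.submitted = []

    def iter_test(self):
        # Each item is (test batch, prediction template to fill in).
        for test_df, sample_pred in self._batches:
            yield test_df, sample_pred

    def predict(self, pred_df):
        self.submitted.append(pred_df)

def run_submission(env, score_fn):
    # For each chronological batch, fill answered_correctly and submit.
    for test_df, sample_pred in env.iter_test():
        sample_pred = sample_pred.copy()
        sample_pred["answered_correctly"] = [score_fn(row) for _, row in test_df.iterrows()]
        env.predict(sample_pred)

batches = [
    (pd.DataFrame({"row_id": [0, 1], "user_id": [7, 8]}),
     pd.DataFrame({"row_id": [0, 1], "answered_correctly": [0.5, 0.5]})),
    (pd.DataFrame({"row_id": [2], "user_id": [7]}),
     pd.DataFrame({"row_id": [2], "answered_correctly": [0.5]})),
]
env = MockEnv(batches)
run_submission(env, score_fn=lambda row: 0.65)  # constant-baseline predictor
```

Because batches arrive one at a time, any per-user running statistics (counts, accuracies) have to be updated incrementally inside this loop rather than recomputed from the full history.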

16 of 16

Ideas

Train / validation split

Validation should include both completely new users and users already seen in training
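One way to sketch that split: hold out every row of a few randomly chosen users (the "new" users) plus the chronological tail of each remaining user's history (the "already seen" users), so validation mimics the hidden test set. The fractions and frame below are arbitrary illustrations, not a tuned scheme:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy interaction log: three users with per-user chronological timestamps.
df = pd.DataFrame({
    "user_id":   [1] * 6 + [2] * 4 + [3] * 3,
    "timestamp": list(range(6)) + list(range(4)) + list(range(3)),
})

# Hold out one user entirely -> simulates unseen users in the test set.
users = df["user_id"].unique()
new_users = set(rng.choice(users, size=1, replace=False))

# Hold out the last 2 interactions of every user -> simulates the
# chronological continuation of known users.
df = df.sort_values(["user_id", "timestamp"])
tail = df.groupby("user_id").cumcount(ascending=False) < 2

is_val = df["user_id"].isin(new_users) | tail
valid, train = df[is_val], df[~is_val]
```

Keeping the tail split chronological matters: a random row split would let training see interactions that happen after the validation rows, inflating the validation AUC.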

Freely & publicly available external data is allowed, including pre-trained models

Feature engineering