“Riiid Labs, an AI solutions provider delivering creative disruption to the education market, empowers global education players to rethink traditional ways of learning leveraging AI. With a strong belief in equal opportunity in education, Riiid launched an AI tutor based on deep-learning algorithms in 2017 that attracted more than one million South Korean students. This year, the company released EdNet, the world’s largest open database for AI education containing more than 100 million student interactions.” Source
The TOEIC Listening & Reading test is an objective test with 200 questions and a two-hour time limit: Listening (approximately 45 minutes, 100 questions) and Reading (75 minutes, 100 questions). Source
“Santa TOEIC is the AI-based web/mobile learning platform for the TOEIC. AI tutor provides a one-on-one curriculum, effectively increasing scores based on the essential questions and lectures for each user.” Source
train.csv
Feature Name | Description |
---|---|
row_id | (int64) ID code for the row |
timestamp | (int64) The time in milliseconds between this user interaction and the first event completion from the user |
user_id | (int32) ID code for the user |
content_id | (int16) ID code for the user interaction |
content_type_id | (int8) 0 if the event was a question prompted to the user, 1 if the event was watching a lecture |
task_container_id | (int16) ID code for the batch of questions or lectures. -1 for lectures |
user_answer | (int8) The user's answer to the question, if any. -1 for lectures |
answered_correctly | (int8) If the user responded correctly, if any. -1 for lectures |
prior_question_elapsed_time | (float32) The average time in milliseconds the user took to answer each question in the previous question bundle, ignoring any lectures in between. Null for a user's first question bundle or lecture. |
prior_question_had_explanation | (bool) Whether or not the user saw an explanation and the correct response(s) after answering the previous question bundle ignoring any lectures in between. The value is shared across a single question bundle and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test where they did not get any feedback. |
lectures.csv
Metadata for the lectures watched by users as they progress in their education.
Feature Name | Description |
---|---|
lecture_id | Foreign key for the train/test content_id column, when the content type is a lecture (1). |
part | Top-level category code for the lecture. |
tag | One tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together. |
type_of | Brief description of the core purpose of the lecture. |
questions.csv
Metadata for the questions posed to users.
Feature Name | Description |
---|---|
question_id | Foreign key for the train/test content_id column, when the content type is a question (0).
bundle_id | Code for which questions are served together. |
correct_answer | The answer to the question. It can be compared with the train user_answer column to check if the user was right. |
part | The relevant section of the TOEIC test. |
tags | One or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together. |
Feature Engineering Data Dictionary
Feature Name | Description |
---|---|
question_had_explanation | Indicates if a question had an explanation |
user_acc_mean | The number of questions a user answered correctly divided by all questions they've answered |
user_lectures_running_total | The running total of lectures a user has watched at a given timestamp |
avg_user_q_time | The average amount of time a user spends on a question |
mean_bundle_accuracy | The average accuracy of a bundle of questions |
mean_content_accuracy | The number of questions users answered correctly divided by all questions answered for a specific content/topic
mean_tagcount_accuracy | The average accuracy for questions that share the same number of tags
mean_tags_accuracy | The average accuracy for questions that share the same tags
mean_task_accuracy | The average accuracy of a specific task_container_id
median_content_accuracy | The median accuracy for a specific content type |
median_task_accuracy | The median accuracy for a task
q_time | The amount of time a user spent on the previous question |
question_content_asked | The type of question asked |
question_priortime_asked | The timestamp of the previous question prompted to the user |
question_task_asked | The type of question asked in a bundle |
question_timestamp_asked | The timestamp a question was prompted to the user |
std_content_accuracy | The standard deviation of content accuracy |
std_task_accuracy | The standard deviation of task accuracy |
std_timestamp_accuracy | The standard deviation of accuracy for a given timestamp |
skew_content_accuracy | The skewness of accuracy for a specific content type |
skew_task_accuracy | The skewness of accuracy for a specific task |
skew_timestamp_accuracy | The skewness of accuracy for a given timestamp |
avg_user_q_time [scaled] | Scaled version of avg_user_q_time using MinMaxScaler. Returned from the prep_riiid function
user_lectures_running_total [scaled] | Scaled version of user_lectures_running_total using MinMaxScaler. Returned from the prep_riiid function
- Are questions with explanations answered correctly more often?
- Do users with higher accuracy take longer to answer questions?
- Are there questions that are more difficult to answer than others?
- How long does the average user engage with the platform?
- Does the number of lectures a user watches impact their accuracy?
Data acquired from Kaggle. The data is stored in three files: lectures.csv, questions.csv, and train.csv. The primary dataset, train.csv, contains 100+ million user interactions from 390,000+ users; we used a random sample of 100K users for our analysis. The original 10 features describe the type of question, the time it took to answer, and whether the user's response was correct.
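A minimal sketch of how such a sample might be drawn, assuming train.csv from Kaggle is available locally; the dtype map follows the data dictionary above, and the random_state is illustrative:

```python
import pandas as pd

# Compact dtypes from the data dictionary keep the 100M+ row file
# manageable in memory (chunked reading may still be needed).
dtypes = {
    "row_id": "int64", "timestamp": "int64", "user_id": "int32",
    "content_id": "int16", "content_type_id": "int8",
    "task_container_id": "int16", "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}
train = pd.read_csv("train.csv", dtype=dtypes)

# Keep every interaction for a random sample of 100K users.
sampled_users = train["user_id"].drop_duplicates().sample(100_000, random_state=123)
train = train[train["user_id"].isin(sampled_users)]
```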
Missing Values
- Filled missing boolean values in `question_had_explanation` with False. Missing values indicated that the question did not have an explanation or the user viewed a lecture.
- Filled missing values in `prior_question_elapsed_time` with 0. Missing values indicated that a user viewed a lecture before answering the first question in a bundle of questions.
- Dropped columns: `lecture_id`, `tag`, `lecture_part`, `type_of`, `question_id`, `bundle_id`, `correct_answer`, `question_part`, and `tags`.
- Dropped rows considered lectures: where `answered_correctly` = -1.
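A minimal pandas sketch of these cleaning steps, assuming df is the interaction dataframe after merging in the lecture and question metadata (the function name is illustrative):

```python
import pandas as pd

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Missing flags mean the question had no explanation or the row is a lecture.
    df["question_had_explanation"] = df["question_had_explanation"].fillna(False)

    # Missing elapsed times occur before a user's first question bundle.
    df["prior_question_elapsed_time"] = df["prior_question_elapsed_time"].fillna(0)

    # Drop the metadata columns listed above.
    df = df.drop(columns=[
        "lecture_id", "tag", "lecture_part", "type_of", "question_id",
        "bundle_id", "correct_answer", "question_part", "tags",
    ], errors="ignore")

    # Drop lecture rows (answered_correctly == -1 marks lectures).
    return df[df["answered_correctly"] != -1]
```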
Feature Engineering
- Created new features using descriptive statistics for content, task, timestamp, and whether a question had an explanation (see the sketch after this list).
- Scaled timestamp from milliseconds to weeks to look at trends over time.
- Refer to the feature engineering data dictionary above for more information.
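A sketch of two of these transformations, assuming df is the interaction dataframe before lecture rows are dropped; column names beyond the data dictionary are illustrative:

```python
# Scale the millisecond timestamp to weeks to examine trends over time.
MS_PER_WEEK = 1000 * 60 * 60 * 24 * 7
df["timestamp_weeks"] = df["timestamp"] // MS_PER_WEEK

# Example descriptive-statistic features: mean/std accuracy per content_id,
# computed on question rows only (lectures have answered_correctly = -1).
questions_only = df[df["content_type_id"] == 0]
content_stats = (
    questions_only.groupby("content_id")["answered_correctly"]
    .agg(["mean", "std"])
    .rename(columns={"mean": "mean_content_accuracy", "std": "std_content_accuracy"})
    .reset_index()
)
df = df.merge(content_stats, on="content_id", how="left")

# Running total of lectures watched per user, computed before lecture
# rows are dropped (content_type_id is 1 for lectures).
df = df.sort_values(["user_id", "timestamp"])
df["user_lectures_running_total"] = df.groupby("user_id")["content_type_id"].cumsum()
```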
Preprocessing
- Scaled `mean_timestamp_accuracy`, `mean_priortime_accuracy`, `user_lectures_running_total`, and `avg_user_q_time` using MinMaxScaler, as sketched below.
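A sketch of the scaling step, assuming the data has already been split into train, validate, and test dataframes (names assumed); fitting on train only avoids leaking information into the other splits:

```python
from sklearn.preprocessing import MinMaxScaler

scale_cols = ["mean_timestamp_accuracy", "mean_priortime_accuracy",
              "user_lectures_running_total", "avg_user_q_time"]

# Fit on train only, then apply the same transform to validate and test.
scaler = MinMaxScaler()
train[scale_cols] = scaler.fit_transform(train[scale_cols])
validate[scale_cols] = scaler.transform(validate[scale_cols])
test[scale_cols] = scaler.transform(test[scale_cols])
```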
Exploration
- Used scatterplots, barplots, and histograms to visualize interactions between features and the target variable.
- Performed hypothesis tests to find statistically significant relationships between features.
Hypothesis – Answered correctly vs. Question had an explanation
Null hypothesis: Whether a student gets a question right is independent of whether a question had an explanation.
Alternative hypothesis: Whether a student gets a question right is dependent on whether a question had an explanation.
Test: Chi-Squared Test
Results: With a p-value less than alpha, we reject the null hypothesis.
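A sketch of how this test might be run with scipy, assuming df holds the question rows and alpha = 0.05:

```python
from scipy import stats
import pandas as pd

alpha = 0.05

# Contingency table of correctness vs. whether the question had an explanation.
observed = pd.crosstab(df["answered_correctly"], df["question_had_explanation"])
chi2, p, dof, expected = stats.chi2_contingency(observed)

if p < alpha:
    print("We reject the null hypothesis.")
else:
    print("We fail to reject the null hypothesis.")
```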
Hypothesis – Answered correctly vs. Part
Null hypothesis: Whether a user answers a question correctly is independent of the type of question being asked.
Alternative hypothesis: Whether a user answers a question correctly is dependent upon the type of question being asked.
Test: Chi-Squared Test
Results: With a p-value less than alpha, we reject the null hypothesis.
Hypothesis – Number of lectures a user has watched vs. Average task accuracy
Null hypothesis: There is no linear relationship between the number of lectures a user has watched and their average task accuracy.
Alternative hypothesis: There is a linear relationship between the number of lectures a user has watched and their average task accuracy.
Test: Pearson Correlation Test
Results: With a p-value less than alpha, we reject the null hypothesis.
- On average, as the number of lectures a user has seen increases, so does their task accuracy.
- Viewing lectures may have a weak positive impact on user accuracy.
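A sketch of the correlation test, assuming user_df is a per-user summary frame with one row per user (the name and aggregation are illustrative):

```python
from scipy import stats

alpha = 0.05

# Correlate lectures watched with average task accuracy across users.
r, p = stats.pearsonr(user_df["user_lectures_running_total"],
                      user_df["mean_task_accuracy"])

if p < alpha:
    print(f"We reject the null hypothesis (r = {r:.3f}).")
```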
Hypothesis – Average user question time vs. Average user accuracy
Null hypothesis: There is no linear relationship between the average time a user takes to answer a question and their average accuracy.
Alternative hypothesis: There is a linear relationship between the average time a user takes to answer a question and their average accuracy.
Test: Pearson Correlation Test
Results: With a p-value less than alpha, we reject the null hypothesis.
- Users who take longer to answer questions tend to have lower overall accuracy, and vice versa.
Hypothesis – Average user accuracy vs. Average content accuracy
Null hypothesis: Users with above-average accuracy spend the same amount of time or more time on questions with lower-than-average content accuracy than users with average or lower-than-average accuracy.
Alternative hypothesis: Users with above-average accuracy spend less time on questions with lower-than-average content accuracy than users with average or lower-than-average accuracy.
Test: Two-Sample One-Tailed T-Test
Results: With a p-value less than alpha, and t less than 0, we reject the null hypothesis.
- If users with above-average accuracy answer questions (difficult and otherwise) more quickly than other users, then they may be more prepared for the content.
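A sketch of the one-tailed test, assuming hard_q holds interactions on questions with lower-than-average content accuracy and overall_mean is the mean user accuracy (both names are illustrative); halving the two-sided p-value and requiring t < 0 matches the decision rule above:

```python
from scipy import stats

alpha = 0.05

# Split question times by whether the user's overall accuracy is above average.
high_acc_times = hard_q.loc[hard_q["user_acc_mean"] > overall_mean, "q_time"]
other_times = hard_q.loc[hard_q["user_acc_mean"] <= overall_mean, "q_time"]

# Two-sample t-test, evaluated one-tailed: do high-accuracy users spend less time?
t, p = stats.ttest_ind(high_acc_times, other_times, equal_var=False)

if p / 2 < alpha and t < 0:
    print("We reject the null hypothesis.")
```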
First, we created a baseline model to compare our model performances against. The baseline predicts the most common outcome in the training dataset, answered_correctly = 1, for a baseline accuracy of 50%; in other words, the baseline assumes a user answers correctly 50% of the time (a minimal sketch follows the model list). Models evaluated on the train, validate, and test sets were:
- Logistic Regression
- Random Forest
- LGBM
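A minimal sketch of the baseline calculation, assuming y_train is the answered_correctly target series (name assumed):

```python
# Baseline: always predict the most common outcome in the training data.
baseline_prediction = y_train.mode()[0]            # answered_correctly = 1
baseline_accuracy = (y_train == baseline_prediction).mean()
print(f"Baseline accuracy: {baseline_accuracy:.0%}")
```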
Our LGBM model performed the best, with an AUC score of 0.744. AUC measures the trade-off between True Positives and False Positives across classification thresholds. A True Positive means our model predicted that a student answered a question correctly and their response was in fact correct. A False Positive means our model predicted a student answered correctly when their answer was incorrect. An AUC score ranges between 0 and 1; the higher the score, the better the model separates correct from incorrect responses.
"Gradient boosting algorithm sequentially combines weak learners (decision tree) in way that each new tree fits to the residuals from the previous step so that the model improves. The final model aggregates the results from each step and a strong learner is achieved." Source
- Performance decreases on incomplete sentences and narration.
- Performance increases with explanations and visuals.
- Lectures don't significantly impact performance.
- LightGBM: AUC score of 0.744
- The LightGBM model surpassed the baseline by 0.24, a 47% improvement (the difference between the scores divided by the baseline).
- Offer more study material for difficult subjects, and develop students' ability to understand English from context clues when photos aren't available
- Revise lectures to have a stronger impact over the course of students' studies
- Use the model to tailor content to students' needs
- Riiid's program will better prepare students for the TOEIC
- Use this predictive model on Riiid's other educational programs.
- Explore more features and different classification models, such as XGBoost.
- Improve model to predict new student performance more effectively.
All files are reproducible and available for download and use.
- Read this README.md
- Download the aquire.py, prepare.py, and Final-Report.ipynb files
- Run Final-Report.ipynb