Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
Danil Gusak*, Anna Volodkevich*, Anton Klenitskiy*, Alexey Vasilev, Evgeny Frolov
Sequential recommender systems currently dominate the next‑item prediction task, but common evaluation protocols for sequential recommendations often fall short of real‑world scenarios. Leave‑one‑out splits introduce temporal leakage and unrealistically long test horizons, while global temporal splits lack clear rules for selecting target interactions and constructing a validation subset that provides necessary consistency between validation and test metrics. We systematically compare splitting strategies across multiple datasets and baselines, showing that the choice of split can significantly reorder model rankings and influence deployment decisions. Our results lay the groundwork for more realistic and reproducible evaluation guidelines.

Data splitting and target selection strategies for sequential recommendations. (a) Leave-one-out split. (b) Global temporal split: all interactions after timepoint T_test are placed in the holdout set; targets for these holdout sequences are chosen according to (c). (c) Target item selection options for each holdout sequence (applicable to both test and validation sequences).
Note: the experiments were conducted with python==3.10.16.
Install requirements:
pip install -r requirements.txt
Specify environment variables:
# data path. Replace with your path if necessary.
export SEQ_SPLITS_DATA_PATH=$(pwd)/data
# src path
export PYTHONPATH="./"
We use the Hydra framework to configure our experiments.
We worked with eight publicly available datasets: Beauty, Sports, Movielens-1m, Movielens-20m, BeerAdvocate, Diginetica, Zvuk, and YooChoose. To manage computational costs while ensuring sufficient data for analysis, we sampled 2,000,000 users from the YooChoose dataset and 20,000 users from Zvuk. Raw datasets (before the preprocessing step) are available for direct download: Raw Data.
Each dataset has a corresponding config, e.g. runs/configs/dataset/Beauty.yaml.
Create data directories:
mkdir $SEQ_SPLITS_DATA_PATH
mkdir $SEQ_SPLITS_DATA_PATH/{raw,preprocessed,splitted}
Data folder structure:
- The raw data files are expected in the raw subdirectory. Move the downloaded raw data .csv files here.
- Data after preprocessing will be placed in the preprocessed subdirectory.
- Data after splitting will be placed in the splitted subdirectory.
To run dataset preprocessing for a specific dataset, use:
# specific dataset
python runs/preprocess.py +dataset=Beauty
# all datasets
python runs/preprocess.py -m +dataset=Beauty,BeerAdvocate,Diginetica,Movielens-1m,Sports,Zvuk,Movielens-20m,YooChoose
See preprocess.yaml for possible configuration options. In the paper, we apply p-core filtering with p equal to 5 to discard unpopular items and short user sequences. Furthermore, we eliminate consecutive repeated items in user interaction histories.
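As an illustration, here is a minimal pandas sketch of what p-core filtering and consecutive-duplicate removal amount to (the column names user_id, item_id, and timestamp are assumptions; the actual logic lives in runs/preprocess.py and its config):

import pandas as pd

def pcore_filter(df: pd.DataFrame, p: int = 5) -> pd.DataFrame:
    # Iteratively drop items and users with fewer than p interactions,
    # until both conditions hold simultaneously.
    while True:
        keep = (
            df["item_id"].map(df["item_id"].value_counts()).ge(p)
            & df["user_id"].map(df["user_id"].value_counts()).ge(p)
        )
        if keep.all():
            return df
        df = df[keep]

def drop_consecutive_repeats(df: pd.DataFrame) -> pd.DataFrame:
    # Drop an interaction if it repeats the previous item in the same user's history.
    df = df.sort_values(["user_id", "timestamp"])
    repeated = df.groupby("user_id")["item_id"].shift() == df["item_id"]
    return df[~repeated]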
Split the selected dataset into training, validation, and test subsets.
Data after splitting will be placed in the splitted subdirectory.
See split.yaml for possible configuration options.
For GTS, validation split options are:
- by_time (Global temporal, GT);
- last_train_item (Last training item, LTI);
- by_user (User-based, UB).
# example for LOO
python runs/split.py split_type=leave-one-out dataset=Beauty
# example for GTS with Global temporal validation
python runs/split.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=by_time dataset=Sports
# example for GTS with Last training item validation
python runs/split.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=last_train_item dataset=Beauty
# example for GTS with User-based validation
python runs/split.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=by_user split_params.validation_size=1024 dataset=Beauty
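Conceptually, the global temporal split places all interactions after a quantile-based timepoint into the holdout set. A simplified sketch of that idea, assuming user_id/timestamp columns (this is not the repository code, which additionally handles validation selection and target options):

import pandas as pd

def global_temporal_split(df: pd.DataFrame, quantile: float = 0.9):
    # T_test is the given quantile of all interaction timestamps.
    t_test = df["timestamp"].quantile(quantile)
    train = df[df["timestamp"] <= t_test]
    # Holdout sequences: full histories of users with interactions after T_test;
    # their post-T_test items are the candidate targets (option (c) in the figure).
    holdout_users = df.loc[df["timestamp"] > t_test, "user_id"].unique()
    holdout = df[df["user_id"].isin(holdout_users)]
    return train, holdout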
To run all splits, execute split.sh. Replace SEQ_SPLITS_DATA_PATH with your path if necessary.
./runs/run_sh/split.sh
Calculate different resulting subset statistics for the chosen splitting strategy. See statistics.yaml for possible configuration options.
# example for GTS with LTI validation
python runs/statistics.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=val_last_train_item dataset=Beauty
To run all statistics calculation, execute stats.sh. Replace SEQ_SPLITS_DATA_PATH with your path if necessary.
./runs/run_sh/stats.sh
We use popular unsampled top-K ranking metrics: Normalized Discounted Cumulative Gain (NDCG@K), Mean Reciprocal Rank (MRR@K), and HitRate (HR@K), with K = 5, 10, 20, 50, 100. We compute metrics using the RePlay framework.
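For reference, a minimal sketch of how these metrics behave for a single holdout interaction with one ground-truth item (the experiments themselves rely on RePlay's implementations):

import numpy as np

def rank_metrics(recommended: list, target, k: int = 10) -> dict:
    # HR@K, MRR@K, and NDCG@K for one user with a single target item;
    # dataset-level values are averages over all evaluated users.
    top_k = recommended[:k]
    if target not in top_k:
        return {"HR": 0.0, "MRR": 0.0, "NDCG": 0.0}
    rank = top_k.index(target) + 1  # 1-based position of the hit
    return {"HR": 1.0, "MRR": 1.0 / rank, "NDCG": 1.0 / np.log2(rank + 1)}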
We conduct our experiments using three popular sequential recommender systems baselines:
- SASRec (denoted as SASRec⁺ in the paper);
- BERT4Rec;
- GRU4Rec.
Each model has a config with a corresponding hyperparameter grid, e.g. runs/configs/model/BERT4Rec.yaml. Parameter grid_point_number identifies the position of a configuration in the hyperparameter grid.
Config train.yaml combines all configurations required for model training and evaluation.
For GTS (split_type=global_timesplit), validation split options are:
- val_by_time (Global temporal, GT);
- val_last_train_item (Last training item, LTI);
- val_by_user (User-based, UB).
Run training and validation/test metrics computation:
# example for LOO
python runs/train.py split_type=leave-one-out dataset=Beauty
# example for GTS Last with GT validation
python runs/train.py dataset=Sports split_type=global_timesplit split_subtype=val_by_time quantile=0.9 cuda_visible_devices=0
# example for GTS Last with LTI validation
python runs/train.py dataset=Beauty split_type=global_timesplit split_subtype=val_last_train_item quantile=0.9 cuda_visible_devices=1
# example for GTS Last with UB validation
python runs/train.py dataset=Beauty split_type=global_timesplit split_subtype=val_by_user quantile=0.9 cuda_visible_devices=1
Run training with subsequent GTS Successive evaluation:
# example for GTS with User-based validation
python runs/train.py --config-name=train evaluator.successive_test=True split_type=global_timesplit split_subtype=val_by_user dataset=Beauty quantile=0.9
# same example, but run several grid points at once
python runs/train.py --config-name=train -m evaluator.successive_test=True split_type=global_timesplit split_subtype=val_by_user quantile=0.9 dataset=Beauty model.grid_point_number=0,1
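Under Successive evaluation, every item in a user's test segment serves as a target in turn, conditioned on all interactions that precede it. A simplified sketch of how one holdout sequence expands into (input prefix, target) pairs under this scheme (a hypothetical helper, not the evaluator's actual interface):

def successive_eval_pairs(train_prefix: list, holdout_items: list):
    # Yield (input_sequence, target_item) pairs for successive evaluation.
    history = list(train_prefix)
    for target in holdout_items:
        yield list(history), target  # predict the next holdout item...
        history.append(target)       # ...then reveal it and continue

# Example: prefix [1, 2] and holdout [3, 4] yield ([1, 2], 3) and ([1, 2, 3], 4).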
All of the experiment results are available in the results directory, organized by split configuration, dataset, quantile, and model:

results/
└─ <split_type>/
   └─ <split_subtype>/
      └─ <dataset>/
         └─ <quantile>/
            └─ <model>/
               └─ final_results/
                  └─ ...

Each final_results directory contains .csv files with all test and validation metrics for different targets and for every hyperparameter setting (108 grid points for SASRec⁺/BERT4Rec, 104 for GRU4Rec).
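As an example of post-processing these files, a short pandas sketch that gathers per-grid-point metrics and selects the best configuration by a validation metric (the glob pattern and column names such as val_ndcg@10 are assumptions; adjust them to the actual CSV headers):

import glob
import pandas as pd

# Collect result files for one split/dataset/quantile/model combination
# (path segments follow the directory tree above).
files = glob.glob("results/global_timesplit/val_by_time/Beauty/0.9/SASRec/final_results/*.csv")
results = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Pick the grid point with the best validation NDCG@10 and report its test metric.
best = results.sort_values("val_ndcg@10", ascending=False).iloc[0]
print(best["grid_point_number"], best["test_ndcg@10"])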
- Dataset_statistics_tables.ipynb: run after Calculate Split Statistics and the previous pipeline steps to build all tables from the paper along with additional statistics;
- Time_gaps.ipynb: run after Data Splitting and the previous pipeline steps to build Figure 4;
- Test_vs_test.ipynb: reproduces Figure 5 and Figure 6 from the paper;
- Test_vs_validation.ipynb: reproduces Figure 8.
If you find our work helpful, please consider citing the paper:
@inproceedings{timetosplit2025,
title={Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders},
author={Gusak, Danil and Volodkevich, Anna and Klenitskiy, Anton and Vasilev, Alexey and Frolov, Evgeny},
booktitle={Proceedings of the 19th ACM Conference on Recommender Systems},
doi={10.1145/3705328.3748164},
year={2025}
}