Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders
Danil Gusak*, Anna Volodkevich*, Anton Klenitskiy*, Alexey Vasilev, Evgeny Frolov
Sequential recommender systems currently dominate the next‑item prediction task, but common evaluation protocols for sequential recommendations often fall short of real‑world scenarios. Leave‑one‑out splits introduce temporal leakage and unrealistically long test horizons, while global temporal splits lack clear rules for selecting target interactions and constructing a validation subset that provides necessary consistency between validation and test metrics. We systematically compare splitting strategies across multiple datasets and baselines, showing that the choice of split can significantly reorder model rankings and influence deployment decisions. Our results lay the groundwork for more realistic and reproducible evaluation guidelines.

Data splitting and target selection strategies for sequential recommendations. (a) Leave-one-out split. (b) Global temporal split: all interactions after timepoint T_test are placed in the holdout set; targets for these holdout sequences are chosen according to (c). (c) Target item selection options for each holdout sequence (applicable to both test and validation sequences).
Note: the experiments were conducted with python==3.10.16.
Install requirements:
pip install -r requirements.txt
Specify environment variables:
# data path. Replace with your path if necessary.
export SEQ_SPLITS_DATA_PATH=$(pwd)/data
# src path
export PYTHONPATH="./"
We use the Hydra framework to configure our experiments.
We worked with eight publicly available datasets: Beauty, Sports, Movielens-1m, Movielens-20m, BeerAdvocate, Diginetica, Zvuk, and YooChoose. To manage computational costs while ensuring sufficient data for analysis, we sampled 2,000,000 users from the YooChoose dataset and 20,000 users from Zvuk. Raw datasets (before the preprocessing step) are available for direct download: Raw Data.
Each dataset has a corresponding config, e.g. runs/configs/dataset/Beauty.yaml.
Create data directories:
mkdir $SEQ_SPLITS_DATA_PATH
mkdir $SEQ_SPLITS_DATA_PATH/{raw,preprocessed,splitted}
Data folder structure:
- The raw data files are expected in the raw subdirectory. Move the downloaded raw data .csv files here.
- Data after preprocessing will be placed in the preprocessed subdirectory.
- Data after splitting will be placed in the splitted subdirectory.
To run dataset preprocessing for a specific dataset, use:
# specific dataset
python runs/preprocess.py +dataset=Beauty
# all datasets
python runs/preprocess.py -m +dataset=Beauty,BeerAdvocate,Diginetica,Movielens-1m,Sports,Zvuk,Movielens-20m,YooChoose
See preprocess.yaml for possible configuration options. In the paper, we apply p-core filtering with p equal to 5 to discard unpopular items and short user sequences. Furthermore, we eliminate consecutive repeated items in user interaction histories.
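As an illustration, here is a minimal pandas sketch of what p-core filtering and consecutive-duplicate removal amount to (the column names user_id, item_id, and timestamp are assumptions; the actual logic lives in runs/preprocess.py and its config):

import pandas as pd

def pcore_filter(df: pd.DataFrame, p: int = 5) -> pd.DataFrame:
    # Iteratively drop items and users with fewer than p interactions,
    # until both conditions hold simultaneously.
    while True:
        keep = (
            df["item_id"].map(df["item_id"].value_counts()).ge(p)
            & df["user_id"].map(df["user_id"].value_counts()).ge(p)
        )
        if keep.all():
            return df
        df = df[keep]

def drop_consecutive_repeats(df: pd.DataFrame) -> pd.DataFrame:
    # Drop an interaction if it repeats the previous item in the same user's history.
    df = df.sort_values(["user_id", "timestamp"])
    repeated = df.groupby("user_id")["item_id"].shift() == df["item_id"]
    return df[~repeated]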
Split the selected dataset into training, validation, and test subsets.
Data after splitting will be placed in the splitted subdirectory.
See split.yaml for possible configuration options.
For GTS, validation split options are:
- by_time (Global temporal, GT);
- last_train_item (Last training item, LTI);
- by_user (User-based, UB).
# example for LOO
python runs/split.py split_type=leave-one-out dataset=Beauty
# example for GTS with Global temporal validation
python runs/split.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=by_time dataset=Sports
# example for GTS with Last training item validation
python runs/split.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=last_train_item dataset=Beauty
# example for GTS with User-based validation
python runs/split.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=by_user split_params.validation_size=1024 dataset=Beauty
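Conceptually, the global temporal split places all interactions after a quantile-based timepoint into the holdout set. A simplified sketch of that idea, assuming user_id/timestamp columns (this is not the repository code, which additionally handles validation selection and target options):

import pandas as pd

def global_temporal_split(df: pd.DataFrame, quantile: float = 0.9):
    # T_test is the given quantile of all interaction timestamps.
    t_test = df["timestamp"].quantile(quantile)
    train = df[df["timestamp"] <= t_test]
    # Holdout sequences: full histories of users with interactions after T_test;
    # their post-T_test items are the candidate targets (option (c) in the figure).
    holdout_users = df.loc[df["timestamp"] > t_test, "user_id"].unique()
    holdout = df[df["user_id"].isin(holdout_users)]
    return train, holdout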
To run all splits, execute split.sh. Replace SEQ_SPLITS_DATA_PATH with your path if necessary.
./runs/run_sh/split.sh
Calculate different resulting subset statistics for the chosen splitting strategy. See statistics.yaml for possible configuration options.
# example for GTS with LTI validation
python runs/statistics.py split_type=global_timesplit split_params.quantile=0.9 split_params.validation_type=val_last_train_item dataset=Beauty
To run all statistics calculation, execute stats.sh. Replace SEQ_SPLITS_DATA_PATH with your path if necessary.
./runs/run_sh/stats.sh
We use popular unsampled top-K ranking metrics: Normalized Discounted Cumulative Gain (NDCG@K), Mean Reciprocal Rank (MRR@K), and HitRate (HR@K), with K = 5, 10, 20, 50, 100. We compute metrics using the RePlay framework.
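For reference, a minimal sketch of how these metrics behave for a single holdout interaction with one ground-truth item (the experiments themselves rely on RePlay's implementations):

import numpy as np

def rank_metrics(recommended: list, target, k: int = 10) -> dict:
    # HR@K, MRR@K, and NDCG@K for one user with a single target item;
    # dataset-level values are averages over all evaluated users.
    top_k = recommended[:k]
    if target not in top_k:
        return {"HR": 0.0, "MRR": 0.0, "NDCG": 0.0}
    rank = top_k.index(target) + 1  # 1-based position of the hit
    return {"HR": 1.0, "MRR": 1.0 / rank, "NDCG": 1.0 / np.log2(rank + 1)}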
We conduct our experiments using three popular sequential recommender systems baselines:
- SASRec (denoted as SASRec⁺ in the paper);
- BERT4Rec;
- GRU4Rec.
Each model has a config with a corresponding hyperparameter grid, e.g. runs/configs/model/BERT4Rec.yaml. Parameter grid_point_number identifies the position of a configuration in the hyperparameter grid.
Config train.yaml combines all configurations required for model training and evaluation.
For GTS (split_type=global_timesplit), validation split options are:
- val_by_time (Global temporal, GT);
- val_last_train_item (Last training item, LTI);
- val_by_user (User-based, UB).
Run training and validation/test metrics computation:
# example for LOO
python runs/train.py split_type=leave-one-out dataset=Beauty
# example for GTS Last with GT validation
python runs/train.py dataset=Sports split_type=global_timesplit split_subtype=val_by_time quantile=0.9 cuda_visible_devices=0
# example for GTS Last with LTI validation
python runs/train.py dataset=Beauty split_type=global_timesplit split_subtype=val_last_train_item quantile=0.9 cuda_visible_devices=1
# example for GTS Last with UB validation
python runs/train.py dataset=Beauty split_type=global_timesplit split_subtype=val_by_user quantile=0.9 cuda_visible_devices=1
Run training with subsequent GTS Successive evaluation:
# example for GTS with User-based validation
python runs/train.py --config-name=train evaluator.successive_test=True split_type=global_timesplit split_subtype=val_by_user dataset=Beauty quantile=0.9
# same example, but run several grid points at once
python runs/train.py --config-name=train -m evaluator.successive_test=True split_type=global_timesplit split_subtype=val_by_user quantile=0.9 dataset=Beauty model.grid_point_number=0,1
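Under Successive evaluation, every item in a user's test segment serves as a target in turn, conditioned on all interactions that precede it. A simplified sketch of how one holdout sequence expands into (input prefix, target) pairs under this scheme (a hypothetical helper, not the evaluator's actual interface):

def successive_eval_pairs(train_prefix: list, holdout_items: list):
    # Yield (input_sequence, target_item) pairs for successive evaluation.
    history = list(train_prefix)
    for target in holdout_items:
        yield list(history), target  # predict the next holdout item...
        history.append(target)       # ...then reveal it and continue

# Example: prefix [1, 2] and holdout [3, 4] yield ([1, 2], 3) and ([1, 2, 3], 4).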
All of the experiment results are available in the results directory, organized by split configuration, dataset, quantile, and model:

results/
└─ <split_type>/
   └─ <split_subtype>/
      └─ <dataset>/
         └─ <quantile>/
            └─ <model>/
               └─ final_results/
                  └─ ...

Each final_results directory contains .csv files with all test and validation metrics for different targets and for every hyperparameter setting (108 grid points for SASRec⁺/BERT4Rec, 104 for GRU4Rec).
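As an example of post-processing these files, a short pandas sketch that gathers per-grid-point metrics and selects the best configuration by a validation metric (the glob pattern and column names such as val_ndcg@10 are assumptions; adjust them to the actual CSV headers):

import glob
import pandas as pd

# Collect result files for one split/dataset/quantile/model combination
# (path segments follow the directory tree above).
files = glob.glob("results/global_timesplit/val_by_time/Beauty/0.9/SASRec/final_results/*.csv")
results = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Pick the grid point with the best validation NDCG@10 and report its test metric.
best = results.sort_values("val_ndcg@10", ascending=False).iloc[0]
print(best["grid_point_number"], best["test_ndcg@10"])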
- Dataset_statistics_tables.ipynb: run after Calculate Split Statistics and the previous pipeline steps to build all tables from the paper along with additional statistics;
- Time_gaps.ipynb: run after Data Splitting and the previous pipeline steps to build Figure 4;
- Test_vs_test.ipynb: reproduces Figure 5 and Figure 6 from the paper;
- Test_vs_validation.ipynb: reproduces Figure 8.
If you find our work helpful, please consider citing the paper:
@inproceedings{timetosplit2025,
title={Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders},
author={Gusak, Danil and Volodkevich, Anna and Klenitskiy, Anton and Vasilev, Alexey and Frolov, Evgeny},
booktitle={Proceedings of the 19th ACM Conference on Recommender Systems},
doi={10.1145/3705328.3748164},
year={2025}
}