Main Recent Update:
- [Feb. 3, 2025] Paper accepted to ACM SIGMOD! The full paper can be found at paper link.
Related Repository:
- TSDistEval: Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures
If you find our work helpful, please consider citing:
"A Structured Study of Multivariate Time-Series Distance Measures" ACM SIGMOD 2025.
@inproceedings{dhondt2025structured,
title={A Structured Study of Multivariate Time-Series Distance Measures},
author={d’Hondt, Jens E and Li, Haojun and Yang, Fan and Papapetrou, Odysseas and Paparrizos, John},
booktitle={Proceedings of the 2025 ACM SIGMOD international conference on management of data},
year={2025}
}
"Debunking four long-standing misconceptions of time-series distance measures" ACM SIGMOD 2020.
@inproceedings{paparrizos2020debunking,
title={Debunking four long-standing misconceptions of time-series distance measures},
author={Paparrizos, John and Liu, Chunwei and Elmore, Aaron J and Franklin, Michael J},
booktitle={Proceedings of the 2020 ACM SIGMOD international conference on management of data},
pages={1887--1905},
year={2020}
}
Distance measures are fundamental to time series analysis and have been extensively studied for decades. Until now, research efforts mainly focused on univariate time series, leaving multivariate cases largely under-explored. Furthermore, the existing experimental studies on multivariate distances have critical limitations: (a) focusing only on lock-step and elastic measures while ignoring categories such as sliding and kernel measures; (b) considering only one normalization technique; and (c) placing limited focus on statistical analysis of findings. Motivated by these shortcomings, we present the most complete evaluation of multivariate distance measures to date. Our study examines 30 standalone measures across 8 categories, 2 channel-dependency models, and considers 13 normalizations. We perform a comprehensive evaluation across 30 datasets and 3 downstream tasks, accompanied by rigorous statistical analysis. To ensure fairness, we conduct a thorough investigation of parameters for methods in both a supervised and an unsupervised manner. Our work verifies and extends earlier findings, showing that insights from univariate distance measures also apply to the multivariate case: (a) alternative normalization methods outperform Z-score, and for the first time, we demonstrate statistical differences in certain categories for the multivariate case; (b) multiple lock-step measures are better suited than Euclidean distance, when it comes to multivariate time series; and (c) newer elastic measures outperform the widely adopted Dynamic Time Warping distance, especially with proper parameter tuning in the supervised setting. Moreover, our results reveal that (a) sliding measures offer the best trade-off between accuracy and runtime; (b) current normalization techniques fail to significantly enhance accuracy on multivariate time series and, surprisingly, do not outperform the no normalization case, indicating a lack of appropriate solutions for normalizing multivariate time series; and (c) independent consideration of time series channels is beneficial only for elastic measures. In summary, we offer guidelines to aid in designing and selecting preprocessing strategies and multivariate distance measures for our community.
Set up the environment with required packages:
pip install -r requirements.txt
Getting the data from:
- The original UEA archive, preprocessed to have equal-length time-series can be downloaded from the following link.
- The downsampled version of the UEA archive (used to evaluate elastic measures) can be downloaded from here.
- The TSB-AD-M archive can be downloaded from here, with the related repository available here.
This repository primarily focuses on benchmarking experiments. To compute MTS distances using this code, please refer to the example as bellow. A complete list of supported MTS distances is available in src/onennclassifier.py
.
import numpy as np
from src.onennclassifier import MEASURES
euc_dist = MEASURES['l2']
sbd_indep = MEASURES['sbd-i']
sbd_dep = MEASURES['sbd-d']
dtw_indep = MEASURES['dtw-i']
dtw_dep = MEASURES['dtw-d']
X = np.random.rand(1, 2, 50) # shape: batchsize x n_channel x n_timesteps
Y = np.random.rand(1, 2, 50) # shape: batchsize x n_channel x n_timesteps
result_euc = euc_dist(X, Y)
result_sbd_indep = sbd_indep(X, Y)
result_sbd_dep = sbd_dep(X, Y)
result_dtw_indep = dtw_indep(X, Y)
result_dtw_dep = dtw_dep(X, Y)
print("Euclidean Distance: ", result_euc)
print("SBD-Independent Distance: ", result_sbd_indep)
print("SBD-Dependent Distance: ", result_sbd_dep)
print("DTW-Independent Distance: ", result_dtw_indep)
print("DTW-Dependent Distance: ", result_dtw_dep)
You can run either inference or LOOCV validation on a specific dataset with a specific metric and normalization method through running the src/classification.py script, with the following arguments:
-mp
- run type (options: inference, loocv)-d
or--data
- path to the data directory-p
or--problem
- name of the dataset to run classification on (e.g. BasicMotions)-m
or--measure
- name of the measure to use (e.g. l2)-n
or--norm
- name of the normalization method to use (options: zscore, minmax, median, mean, unit, sigmoid, tanh, none)-c
or--metric_params
- additional parameters for the metric (e.g. gamma for SINK kernel), passed as key=value pairs separated by spaces
Example 1: Run inference with Euclidean distance on the BasicMotions dataset with z-score normalization we would run:
python -m src.classification -mp inference -d $DATASET_DIR$ -p BasicMotions -m l2 -n zscore-i
Example 2: Run inference with DTW-D distance on the BasicMotions dataset with z-score normalization we would run:
python -m src.classification -mp inference -d $DATASET_DIR$ -p BasicMotions -m dtw-d -n zscore-i -c sakoe_chiba_radius=0.1
We provide an example of performing clustering experiments by running src/Clustering/Clustering_pipeline.py script, with the following arguments:
-p
- path to the data directory-f
- name of the dataset to run clustering on (e.g. BasicMotions)-a
- name of the clustering algorithm to use (e.g. PAM_DTW_D)-i
- the iteration index (e.g. 1)-s
- path to save results
Example: Run clustering with PAM + DTW-D on the BasicMotions dataset with Nonorm we would run:
python ./src/Clustering/Clustering_pipeline.py -p $DATASET_DIR$ -f BasicMotions -a PAM_DTW_D -i 1 -s ./Clustering_results
We provide an example of performing anomaly detection experiments by running src/Anomaly_Detection/AD_pipeline.py script, with the following arguments:
-p
- path to the data directory-f
- name of the dataset to run anomaly detection on (e.g. 010_MSL_id_9_Sensor_tr_554_1st_1172.csv)-s
- path to save results-sw
- the sliding window size (e.g. 100)-ws
- the strategy to determine the sliding window size (options: auto, fixed)-st
- the stride (e.g. 50)-dm
- name of the measure to use (e.g. euclidean)
Example: Run anomaly detection (sliding window = 100 and stride = 50) with euclidean on the 010_MSL_id_9_Sensor_tr_554_1st_1172.csv with Nonorm we would run:
python ./src/Anomaly_Detection/AD_pipeline.py -p $DATASET_DIR$ -f 010_MSL_id_9_Sensor_tr_554_1st_1172.csv -s ./AD_results -sw 100 -ws fixed -st 50 -dm euclidean
If you have any questions or suggestions, feel free to contact:
- Jens E. d'Hondt (j.e.d.hondt@tue.nl)
- Haojun Li (li.14118@osu.edu)
- Fan Yang (yang.7007@osu.edu)
- Odysseas Papapetrou (o.papapetrou@tue.nl)
- John Paparrizos (paparrizos.1@osu.edu)
Or describe it in Issues.
We would like to acknowledge Ryan DeMilt (demilt.4@osu.edu) for the valuable contributions to this work. Also appreciate the following github repos a lot for their valuable code base: