eXascaleInfolab/ImputeGAP

Welcome to ImputeGAP

ImputeGAP is a comprehensive Python library for imputation of missing values in time series data. It implements user-friendly APIs to easily visualize, analyze, and repair your own time series datasets. The library supports a diverse range of imputation methods and modular missing data simulation catering to datasets with varying characteristics. ImputeGAP includes extensive customization options, such as automated hyperparameter tuning, benchmarking, explainability, downstream evaluation, and compatibility with popular time series frameworks.

In detail, the package provides:

  • Access to commonly used datasets in time series research (Datasets).
  • Automated preprocessing with built-in methods for normalizing time series (PreProcessing).
  • Configurable contamination module that simulates real-world missingness patterns (Patterns).
  • Parameterizable state-of-the-art time series imputation algorithms (Algorithms).
  • Benchmarking to foster reproducibility in time series imputation (Benchmark).
  • Modular tools to analyze the behavior of imputation algorithms and assess their impact on key downstream tasks in time series analysis (Downstream).
  • Fine-grained analysis of the impact of time series features on imputation results (Explainer).
  • Plug-and-play integration of new datasets and algorithms in various languages such as Python, C++, Matlab, Java, and R.

If you like our library, please star our GitHub repository.



Tools URL
📚 Documentation https://imputegap.readthedocs.io/en/latest/
📦 PyPI https://pypi.org/project/imputegap/
📁 Datasets Repository

List of available imputation algorithms

Family | Algorithm | Venue -- Year
Deep Learning | BitGraph [32] | ICLR -- 2024
Deep Learning | BayOTIDE [30] | PMLR -- 2024
Deep Learning | MissNet [27] | KDD -- 2024
Deep Learning | MPIN [25] | PVLDB -- 2024
Deep Learning | PRISTI [26] | ICDE -- 2023
Deep Learning | GRIN [29] | ICLR -- 2022
Deep Learning | HKMF_T [31] | TKDE -- 2021
Deep Learning | DeepMVI [24] | PVLDB -- 2021
Deep Learning | MRNN [22] | IEEE Trans on BE -- 2019
Deep Learning | BRITS [23] | NeurIPS -- 2018
Deep Learning | GAIN [28] | ICML -- 2018
Matrix Completion | CDRec [1] | KAIS -- 2020
Matrix Completion | TRMF [8] | NeurIPS -- 2016
Matrix Completion | GROUSE [3] | PMLR -- 2016
Matrix Completion | ROSL [4] | CVPR -- 2014
Matrix Completion | SoftImpute [6] | JMLR -- 2010
Matrix Completion | SVT [7] | SIAM J. OPTIM -- 2010
Matrix Completion | SPIRIT [5] | VLDB -- 2005
Matrix Completion | IterativeSVD [2] | BIOINFORMATICS -- 2001
Pattern Search | TKCM [11] | EDBT -- 2017
Pattern Search | STMVL [9] | IJCAI -- 2016
Pattern Search | DynaMMo [10] | KDD -- 2009
Machine Learning | IIM [12] | ICDE -- 2019
Machine Learning | XGBOOST [13] | KDD -- 2016
Machine Learning | MICE [14] | Statistical Software -- 2011
Machine Learning | MissForest [15] | BioInformatics -- 2011
Statistics | KNNImpute | -
Statistics | Interpolation | -
Statistics | MinImpute | -
Statistics | ZeroImpute | -
Statistics | MeanImpute | -
Statistics | MeanImputeBySeries | -


System Requirements

ImputeGAP requires Python 3.10 or later (except 3.13) and a Unix-compatible environment.

To create and set up an environment with Python 3.12, please refer to the installation guide.


Installation

To install the latest version of ImputeGAP from PyPI, run the following command:

pip install imputegap

Alternatively, you can install the library from source:

git clone https://github.com/eXascaleInfolab/ImputeGAP
cd ./ImputeGAP
pip install -e .

Loading and Preprocessing

ImputeGAP comes with several time series datasets. The list of datasets is described here.

As an example, we start with eeg-alcohol, a standard dataset of EEG recordings from individuals with a genetic predisposition to alcoholism. The dataset contains measurements from 64 electrodes placed on the subjects' scalps, sampled at 256 Hz (3.9-ms epoch) for 1 second. The dataset therefore consists of 64 series, each containing 256 values.

Example Loading

You can find this example in the file runner_loading.py.

from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()
print(f"ImputeGAP datasets : {ts.datasets}")

# load and normalize the dataset from file or from the code
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# plot and print a subset of time series
ts.plot(input_data=ts.data, nbr_series=9, nbr_val=100, save_path="./imputegap_assets")
ts.print(nbr_series=9, nbr_val=20)
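
To make the normalization step concrete, here is a minimal sketch of what z-score normalization does to a single series. It is a plain-Python illustration, not ImputeGAP's implementation; the helper name z_score is hypothetical, and whether the library standardizes each series separately or the whole matrix at once is not stated here.

```python
import math

def z_score(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        # a constant series has no spread; map it to all zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

series = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = z_score(series)
print(normalized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this transformation, series recorded on different scales become directly comparable, which is why normalization usually precedes contamination and imputation.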

Contamination

We now describe how to simulate missing values in the loaded dataset. ImputeGAP implements eight different missingness patterns.

For more details, please refer to the documentation on this page.

Example Contamination

You can find this example in the file runner_contamination.py.

As an example, we show how to contaminate the eeg-alcohol dataset with the MCAR pattern:

from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series with MCAR pattern
ts_m = ts.Contamination.mcar(ts.data, rate_dataset=0.2, rate_series=0.4, block_size=10, seed=True)

# [OPTIONAL] plot the contaminated time series
ts.plot(ts.data, ts_m, nbr_series=9, subplot=True, save_path="./imputegap_assets/contamination")
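
The idea behind a block-wise MCAR pattern can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code behind ts.Contamination.mcar: the function mcar_blocks is hypothetical, it reuses the parameter names from the call above for readability, and it takes an integer seed where the library call takes a boolean.

```python
import math
import random

def mcar_blocks(data, rate_dataset=0.2, rate_series=0.4, block_size=10, seed=42):
    """Return a copy of `data` with NaN blocks placed uniformly at random."""
    rng = random.Random(seed)  # fixed seed makes the mask reproducible
    contaminated = [list(row) for row in data]
    # number of series that receive missing values
    n_targets = max(1, round(len(data) * rate_dataset))
    for s in rng.sample(range(len(data)), n_targets):
        length = len(contaminated[s])
        n_blocks = int(length * rate_series) // block_size
        # block starts lie on a grid of width block_size, so blocks never overlap
        grid = range(0, length - block_size + 1, block_size)
        for start in rng.sample(grid, n_blocks):
            for t in range(start, start + block_size):
                contaminated[s][t] = math.nan
    return contaminated

# synthetic stand-in with the same shape as eeg-alcohol: 64 series x 256 values
data = [[float(t) for t in range(256)] for _ in range(64)]
masked = mcar_blocks(data)
missing_per_series = [sum(math.isnan(v) for v in row) for row in masked]
print(sum(1 for m in missing_per_series if m > 0), max(missing_per_series))  # 13 100
```

With these settings, 13 of the 64 series are contaminated, and each loses 10 blocks of 10 values. The original matrix is left untouched so it can later serve as ground truth.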

All missingness patterns developed in ImputeGAP are available in the ts.patterns module. To list all the available patterns, you can use this command:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"Missingness patterns : {ts.patterns}")

Imputation

In this section, we illustrate how to impute the contaminated time series. Our library implements five families of imputation algorithms: Statistical, Machine Learning, Matrix Completion, Deep Learning, and Pattern Search methods. The list of algorithms and their optimizers is described here.
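
To give a feel for the simplest of these families, the sketch below implements two statistical baselines in plain Python: filling gaps with the series mean and linear interpolation between observed neighbours. The function names are hypothetical and the code does not reproduce ImputeGAP's own MeanImputeBySeries or Interpolation implementations; it assumes every series has at least one observed value.

```python
import math

def mean_impute(series):
    """Replace every NaN with the mean of the observed values of the series."""
    observed = [v for v in series if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in series]

def linear_interpolate(series):
    """Fill each NaN on the line joining its nearest observed neighbours."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if not math.isnan(v)]
    for i, v in enumerate(series):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]   # leading gap: copy the first observation
        elif right is None:
            filled[i] = series[left]    # trailing gap: copy the last observation
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled

nan = math.nan
series = [1.0, nan, nan, 4.0, 6.0, nan]
print(mean_impute(series))
print(linear_interpolate(series))
```

Baselines like these ignore correlations across series, which is what the matrix completion, pattern search, and deep learning families try to exploit.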

Example Imputation

You can find this example in the file runner_imputation.py.

Let's illustrate the imputation using the CDRec Algorithm from the Matrix Completion family.

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate the time series
ts_m = ts.Contamination.mcar(ts.data)

# impute the contaminated series
imputer = Imputation.MatrixCompletion.CDRec(ts_m)
imputer.impute()

# compute and print the imputation metrics
imputer.score(ts.data, imputer.recov_data)
ts.print_results(imputer.metrics)

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, algorithm=imputer.algorithm, save_path="./imputegap_assets/imputation")
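
The scoring step compares the recovered values with the ground truth. As a rough illustration, the sketch below computes RMSE and MAE on a single series, assuming the error is measured only at the positions that were masked. The helper masked_errors is hypothetical and is not the code behind imputer.score.

```python
import math

def masked_errors(truth, contaminated, recovered):
    """RMSE and MAE restricted to the positions that were missing."""
    diffs = [
        r - t
        for t, c, r in zip(truth, contaminated, recovered)
        if math.isnan(c)  # only positions hidden by the contamination count
    ]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae

nan = math.nan
truth        = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = [1.0, nan, 3.0, nan, 5.0]
recovered    = [1.0, 2.5, 3.0, 2.5, 5.0]
rmse, mae = masked_errors(truth, contaminated, recovered)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")  # RMSE=1.118  MAE=1.000
```

Restricting the error to the masked positions keeps untouched values, which are trivially correct, from diluting the score.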

Imputation can be performed using either default values or user-defined values. To specify the parameters, please use a dictionary in the following format:

config = {"rank": 5, "epsilon": 0.01, "iterations": 100}
imputer.impute(params=config)

All algorithms developed in ImputeGAP are available in the ts.algorithms module. To list all the available algorithms, you can use this command:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"Imputation algorithms : {ts.algorithms}")

Parameter Tuning

The Optimizer component manages algorithm configuration and hyperparameter tuning. To invoke the tuning process, specify the optimization options in the impute call instead of fixed parameter values. The options are passed as a dictionary containing the ground truth, the chosen optimizer, and the optimizer's options. Several search algorithms are available, including those provided by Ray Tune.
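
The reason the tuning call needs the ground truth is that each candidate configuration is scored against the known values. The sketch below shows this with an exhaustive grid search over one parameter of a toy imputer. Everything in it is hypothetical (the imputer, its window parameter, and the helper names), and grid search is swapped in here for the search strategies that ImputeGAP actually offers.

```python
import math

def neighbor_mean_impute(series, window):
    """Toy imputer: fill each gap with the mean of observed values within `window` steps."""
    filled = list(series)
    for i, v in enumerate(series):
        if math.isnan(v):
            lo, hi = max(0, i - window), min(len(series), i + window + 1)
            near = [x for x in series[lo:hi] if not math.isnan(x)]
            filled[i] = sum(near) / len(near) if near else 0.0
    return filled

def rmse_on_gaps(truth, contaminated, recovered):
    """RMSE over the positions that are missing in `contaminated`."""
    sq = [(r - t) ** 2 for t, c, r in zip(truth, contaminated, recovered) if math.isnan(c)]
    return math.sqrt(sum(sq) / len(sq))

def grid_search(truth, contaminated, candidates):
    """Score every candidate against the ground truth and keep the best one."""
    scores = {
        w: rmse_on_gaps(truth, contaminated, neighbor_mean_impute(contaminated, w))
        for w in candidates
    }
    best = min(scores, key=scores.get)
    return best, scores

truth = [math.sin(t / 5.0) for t in range(100)]
contaminated = [math.nan if t % 7 == 3 else v for t, v in enumerate(truth)]
best_window, scores = grid_search(truth, contaminated, candidates=[1, 2, 5, 10, 20])
print(best_window, round(scores[best_window], 4))
```

Smarter optimizers replace the exhaustive loop with a guided search over the parameter space, but the objective they minimize has the same shape: impute, compare with the ground truth, keep the best configuration.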

Example Auto-ML

You can find this example in the file runner_optimization.py.

Let's illustrate the imputation using the CDRec Algorithm and Ray-Tune AutoML:

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()
print(f"AutoML Optimizers : {ts.optimizers}")

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# contaminate and impute the time series
ts_m = ts.Contamination.mcar(ts.data)
imputer = Imputation.MatrixCompletion.CDRec(ts_m)

# use Ray Tune to fine tune the imputation algorithm
imputer.impute(user_def=False, params={"input_data": ts.data, "optimizer": "ray_tune"})

# compute the imputation metrics with optimized parameter values
imputer.score(ts.data, imputer.recov_data)

# compute the imputation metrics with default parameter values
imputer_def = Imputation.MatrixCompletion.CDRec(ts_m).impute()
imputer_def.score(ts.data, imputer_def.recov_data)

# print the imputation metrics with default and optimized parameter values
ts.print_results(imputer_def.metrics, text="Default values")
ts.print_results(imputer.metrics, text="Optimized values")

# plot the recovered time series
ts.plot(input_data=ts.data, incomp_data=ts_m, recov_data=imputer.recov_data, nbr_series=9, subplot=True, save_path="./imputegap_assets/imputation", display=True)

# save hyperparameters
utils.save_optimization(optimal_params=imputer.parameters, algorithm=imputer.algorithm, dataset="eeg-alcohol", optimizer="ray_tune", file_name="./imputegap_assets/params")

All optimizers developed in ImputeGAP are available in the ts.optimizers module.

To list all the available optimizers, you can use this command:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"AutoML Optimizers : {ts.optimizers}")

Explainer

ImputeGAP provides insights into an algorithm's behavior by identifying the features that most affect the imputation results. It trains a regression model to predict imputation results across various methods and uses SHapley Additive exPlanations (SHAP) to reveal how different time series features influence the model's predictions. The documentation for the explainer is described here.
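
To illustrate what a Shapley attribution means, the sketch below computes exact Shapley values for a tiny hand-written model by averaging each feature's marginal contribution over all feature orderings. This is a toy under stated assumptions: the model predicted_rmse, its coefficients, and the three feature names are invented for the example, and the SHAP library relies on faster estimators rather than this brute-force enumeration.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)            # start from the baseline input
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

# Hypothetical regression model: predicted RMSE from three made-up series features.
def predicted_rmse(f):
    return 0.1 + 0.5 * f["entropy"] - 0.3 * f["correlation"] + 0.2 * f["entropy"] * f["trend"]

instance = {"entropy": 0.8, "correlation": 0.5, "trend": 1.0}
baseline = {"entropy": 0.0, "correlation": 0.0, "trend": 0.0}
phi = shapley_values(predicted_rmse, instance, baseline)
for name, value in phi.items():
    print(f"{name:>12}: {value:+.3f}")
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term between entropy and trend is split evenly between the two features.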

Example Explainer

You can find this example in the file runner_explainer.py.

Let’s illustrate the explainer using the CDRec Algorithm and MCAR missingness pattern:

from imputegap.recovery.manager import TimeSeries
from imputegap.recovery.explainer import Explainer
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()

# load and normalize the dataset
ts.load_series(utils.search_path("eeg-alcohol"))
ts.normalize(normalizer="z_score")

# configure the explanation
shap_values, shap_details = Explainer.shap_explainer(input_data=ts.data, 
                                                     extractor="pycatch", 
                                                     pattern="mcar", 
                                                     file_name=ts.name,
                                                     algorithm="CDRec")

# print the impact of each feature
Explainer.print(shap_values, shap_details)

To list all the available feature extractors, you can use this command:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"ImputeGAP features extractors : {ts.extractors}")

Downstream

ImputeGAP includes a dedicated module for systematically evaluating the impact of data imputation on downstream tasks. Currently, forecasting is the primary supported task, with plans to expand to additional applications in the future. The example below demonstrates how to define the forecasting task and specify the predictive model. The documentation for the downstream evaluation is described here.

Example Downstream

You can find this example in the file runner_downstream.py.

Below is an example of how to call the downstream process by defining a dictionary for the evaluator and selecting the forecasting model (here "hw-add"):

from imputegap.recovery.imputation import Imputation
from imputegap.recovery.manager import TimeSeries
from imputegap.tools import utils

# initialize the time series object
ts = TimeSeries()
print(f"ImputeGAP downstream models for forecasting : {ts.downstream_models}")

# load and normalize the dataset
ts.load_series(utils.search_path("forecast-economy"))
ts.normalize(normalizer="min_max")

# contaminate the time series
ts_m = ts.Contamination.aligned(ts.data, rate_series=0.8)

# define and impute the contaminated series
imputer = Imputation.MatrixCompletion.CDRec(ts_m)
imputer.impute()

# compute and print the downstream results
downstream_config = {"task": "forecast", "model": "hw-add", "comparator": "ZeroImpute"}
imputer.score(ts.data, imputer.recov_data, downstream=downstream_config)
ts.print_results(imputer.downstream_metrics, algorithm=imputer.algorithm)
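
The intuition behind a downstream comparison can be sketched without any library: impute the same gappy series in two ways, forecast from each version, and compare the forecast errors on held-out values. The sketch below is a simplified stand-in, not ImputeGAP's downstream module: simple exponential smoothing is swapped in for the Holt-Winters model, linear interpolation plays the role of the evaluated imputer, zero filling plays the comparator, and all function names are hypothetical.

```python
import math

def ses_forecast(history, alpha=0.5):
    """Simple exponential smoothing; the final level is the flat forecast."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def fill_zero(series):
    return [0.0 if math.isnan(v) else v for v in series]

def fill_linear(series):
    """Linear interpolation between the nearest observed neighbours (interior gaps only)."""
    filled = list(series)
    for i, v in enumerate(series):
        if math.isnan(v):
            left = max(k for k in range(i) if not math.isnan(series[k]))
            right = min(k for k in range(i + 1, len(series)) if not math.isnan(series[k]))
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled

def forecast_mae(train, future):
    """Mean absolute error of the flat forecast over the held-out horizon."""
    prediction = ses_forecast(train)
    return sum(abs(prediction - y) for y in future) / len(future)

full = [10.0 + 0.5 * math.sin(t / 3.0) for t in range(45)]
train, future = full[:40], full[40:]
gappy = [math.nan if 30 <= t < 38 else v for t, v in enumerate(train)]

mae_zero = forecast_mae(fill_zero(gappy), future)
mae_linear = forecast_mae(fill_linear(gappy), future)
print(f"zero fill: {mae_zero:.3f}, linear fill: {mae_linear:.3f}")
```

On this series the zero-filled gap drags the smoothed level far below the true range, so the forecast built on the interpolated series has the lower error. A downstream evaluation surfaces exactly this kind of difference, which pointwise imputation metrics alone may not reveal.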

To list all the available downstream models, you can use this command:

from imputegap.recovery.manager import TimeSeries
ts = TimeSeries()
print(f"ImputeGAP downstream models for forecasting : {ts.downstream_models}")

Benchmark

ImputeGAP can serve as a common test-bed for comparing the effectiveness and efficiency of time series imputation algorithms [33]. Users have full control over the benchmark by customizing various parameters, including the list of datasets to evaluate, the algorithms to compare, the choice of optimizer to fine-tune the algorithms on the chosen datasets, the missingness patterns, and the range of missing rates. The default metrics are RMSE, MAE, MI, Pearson correlation, and runtime. The documentation for the benchmark is described here.

Example Benchmark

You can find this example in the file runner_benchmark.py.

The benchmarking module can be utilized as follows:

from imputegap.recovery.benchmark import Benchmark

save_dir = "./imputegap_assets/benchmark"
nbr_runs = 1

datasets = ["eeg-alcohol"]

optimizers = ["default_params"]

algorithms = ["SoftImpute", "KNNImpute"]

patterns = ["mcar"]

x_axis = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8]  # missing rates (named x_axis to avoid shadowing the built-in range)

# launch the evaluation
list_results, sum_scores = Benchmark().eval(algorithms=algorithms, datasets=datasets, patterns=patterns, x_axis=x_axis, optimizers=optimizers, save_dir=save_dir, runs=nbr_runs)

You can change the optimizer using the following command:

optimizer = {"optimizer": "ray_tune", "options": {"n_calls": 1, "max_concurrent_trials": 1}}
optimizers = [optimizer]
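
Conceptually, a benchmark run is a nested loop over missing rates and algorithms that records an error metric and the runtime for each combination. The sketch below shows that structure with two toy imputers on a synthetic series. It is an illustration under stated assumptions, not the logic of Benchmark().eval: all names are hypothetical and only RMSE and runtime are collected.

```python
import math
import random
import time

def contaminate(series, rate, rng):
    """Hide a fraction `rate` of the values, chosen uniformly at random."""
    hidden = set(rng.sample(range(len(series)), int(len(series) * rate)))
    return [math.nan if i in hidden else v for i, v in enumerate(series)]

def zero_impute(series):
    return [0.0 if math.isnan(v) else v for v in series]

def mean_impute(series):
    observed = [v for v in series if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in series]

def rmse_on_gaps(truth, contaminated, recovered):
    sq = [(r - t) ** 2 for t, c, r in zip(truth, contaminated, recovered) if math.isnan(c)]
    return math.sqrt(sum(sq) / len(sq))

algorithms = {"zero": zero_impute, "mean": mean_impute}
rates = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8]
truth = [5.0 + math.sin(t / 10.0) for t in range(500)]

results = {}
for rate in rates:
    # same seed per rate, so every algorithm sees the same mask
    contaminated = contaminate(truth, rate, random.Random(0))
    for name, impute in algorithms.items():
        start = time.perf_counter()
        recovered = impute(contaminated)
        runtime = time.perf_counter() - start
        results[(name, rate)] = {
            "RMSE": rmse_on_gaps(truth, contaminated, recovered),
            "runtime": runtime,
        }

for (name, rate), row in sorted(results.items()):
    print(f"{name:>4}  rate={rate:<4}  RMSE={row['RMSE']:.3f}  runtime={row['runtime']:.6f}s")
```

Giving every algorithm the same mask at each rate is the design choice that matters most here: it ensures that differences in the scores come from the algorithms rather than from the contamination.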

Integration

To add your own imputation algorithm in Python or C++, please refer to the detailed integration guide.


Citing

If you use ImputeGAP in your research, please cite the paper:

@article{nater2025imputegap,
  title = {ImputeGAP: A Comprehensive Library for Time Series Imputation},
  author = {Nater, Quentin and Khayati, Mourad and Pasquier, Jacques},
  year = {2025},
  eprint = {2503.15250},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2503.15250}
}

Core Contributors

Quentin Nater, Mourad Khayati

References

[1]: Mourad Khayati, Philippe Cudré-Mauroux, Michael H. Böhlen: Scalable recovery of missing blocks in time series with high and low cross-correlations. Knowl. Inf. Syst. 62(6): 2257-2280 (2020)

[2]: Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie, Robert Tibshirani, David Botstein, Russ B. Altman: Missing value estimation methods for DNA microarrays. Bioinform. 17(6): 520-525 (2001)

[3]: Dejiao Zhang, Laura Balzano: Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation. AISTATS 2016: 1460-1468

[4]: Xianbiao Shu, Fatih Porikli, Narendra Ahuja: Robust Orthonormal Subspace Learning: Efficient Recovery of Corrupted Low-Rank Matrices. CVPR 2014: 3874-3881

[5]: Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos: Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005: 697-708

[6]: Rahul Mazumder, Trevor Hastie, Robert Tibshirani: Spectral Regularization Algorithms for Learning Large Incomplete Matrices. J. Mach. Learn. Res. 11: 2287-2322 (2010)

[7]: Jian-Feng Cai, Emmanuel J. Candès, Zuowei Shen: A Singular Value Thresholding Algorithm for Matrix Completion. SIAM J. Optim. 20(4): 1956-1982 (2010)

[8]: Hsiang-Fu Yu, Nikhil Rao, Inderjit S. Dhillon: Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction. NIPS 2016: 847-855

[9]: Xiuwen Yi, Yu Zheng, Junbo Zhang, Tianrui Li: ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data. IJCAI 2016: 2704-2710

[10]: Lei Li, James McCann, Nancy S. Pollard, Christos Faloutsos: DynaMMo: mining and summarization of coevolving sequences with missing values. KDD 2009: 507-516

[11]: Kevin Wellenzohn, Michael H. Böhlen, Anton Dignös, Johann Gamper, Hannes Mitterer: Continuous Imputation of Missing Values in Streams of Pattern-Determining Time Series. EDBT 2017: 330-341

[12]: Aoqian Zhang, Shaoxu Song, Yu Sun, Jianmin Wang: Learning Individual Models for Imputation (Technical Report). CoRR abs/2004.03436 (2020)

[13]: Tianqi Chen, Carlos Guestrin: XGBoost: A Scalable Tree Boosting System. KDD 2016: 785-794

[14]: Patrick Royston, Ian R. White: Multiple Imputation by Chained Equations (MICE): Implementation in Stata. J. Stat. Softw. 45(4): 1-20 (2011)

[15]: Daniel J. Stekhoven, Peter Bühlmann: MissForest - non-parametric missing value imputation for mixed-type data. Bioinform. 28(1): 112-118 (2012)

[22]: Jinsung Yoon, William R. Zame, Mihaela van der Schaar: Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks. IEEE Trans. Biomed. Eng. 66(5): 1477-1490 (2019)

[23]: Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li: BRITS: Bidirectional Recurrent Imputation for Time Series. NeurIPS 2018: 6776-6786

[24]: Parikshit Bansal, Prathamesh Deshpande, Sunita Sarawagi: Missing Value Imputation on Multidimensional Time Series. Proc. VLDB Endow. 14(11): 2533-2545 (2021)

[25]: Xiao Li, Huan Li, Hua Lu, Christian S. Jensen, Varun Pandey, Volker Markl: Missing Value Imputation for Multi-attribute Sensor Data Streams via Message Propagation (Extended Version). CoRR abs/2311.07344 (2023)

[26]: Mingzhe Liu, Han Huang, Hao Feng, Leilei Sun, Bowen Du, Yanjie Fu: PriSTI: A Conditional Diffusion Framework for Spatiotemporal Imputation. ICDE 2023: 1927-1939

[27]: Kohei Obata, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai: Mining of Switching Sparse Networks for Missing Value Imputation in Multivariate Time Series. KDD 2024: 2296-2306

[28]: Jinsung Yoon, James Jordon, Mihaela van der Schaar: GAIN: Missing Data Imputation using Generative Adversarial Nets. ICML 2018: 5675-5684

[29]: Andrea Cini, Ivan Marisca, Cesare Alippi: Multivariate Time Series Imputation by Graph Neural Networks. CoRR abs/2108.00298 (2021)

[30]: Shikai Fang, Qingsong Wen, Yingtao Luo, Shandian Zhe, Liang Sun: BayOTIDE: Bayesian Online Multivariate Time Series Imputation with Functional Decomposition. ICML 2024

[31]: Liang Wang, Simeng Wu, Tianheng Wu, Xianping Tao, Jian Lu: HKMF-T: Recover From Blackouts in Tagged Time Series With Hankel Matrix Factorization. IEEE Trans. Knowl. Data Eng. 33(11): 3582-3593 (2021)

[32]: Xiaodan Chen, Xiucheng Li, Bo Liu, Zhijun Li: Biased Temporal Convolution Graph Network for Time Series Forecasting with Missing Values. ICLR 2024

[33]: Mourad Khayati, Alberto Lerner, Zakhar Tymchenko, Philippe Cudré-Mauroux: Mind the Gap: An Experimental Evaluation of Imputation of Missing Values Techniques in Time Series. Proc. VLDB Endow. 13(5): 768-782 (2020)

[34]: Mourad Khayati, Quentin Nater, Jacques Pasquier: ImputeVIS: An Interactive Evaluator to Benchmark Imputation Techniques for Time Series Data. Proc. VLDB Endow. 17(12): 4329-4332 (2024)
