Please use the following BibTeX entry to cite our paper, which was accepted for presentation at VLDB 2024:
@article{lopster2024,
  title     = {Generalizable Data Cleaning of Tabular Data in Latent Space},
  author    = {dos Reis, Eduardo and Abdelaal, Mohamed and Binnig, Carsten},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {17},
  number    = {13},
  pages     = {4786--4798},
  year      = {2024},
  publisher = {VLDB Endowment}
}
We provide both training and cleaning in a single script, lopster.py.
Step 1. Lopster expects two CSV files inside a folder <my dataset folder>: clean.csv, containing a (mostly) clean sample of the data, and dirty01.csv, the target dataset to be cleaned. Keep this folder in the project root for ease of use:
- <my dataset folder>
  - clean.csv
  - dirty01.csv
Step 2. Open the dataset_configuration.json file and replace the "empty" JSON entry with your dataset information, renaming the entry to <my dataset folder>.
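The exact keys come from the existing entries in dataset_configuration.json; a hypothetical entry might look like the following (the key names below are illustrative assumptions, not the real schema; copy the "empty" entry and fill it in):

"my dataset folder": {
    "clean_path": "my dataset folder/clean.csv",
    "dirty_path": "my dataset folder/dirty01.csv"
}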
The script then writes a lopster.csv file containing the fully cleaned version of the dataset under <my dataset folder>. If it is the first run for the current dataset, the script first trains a Lopster model using TensorFlow. Example usage:
python3 lopster.py --dataset <my dataset folder>
You can also change the training hyperparameters (K, latent space dimensionality, training epochs, batch size):
python3 lopster.py --dataset <my dataset folder> --path <path/to/model/folder/> --K 12 --latent 120 --epochs 100 --batch_size 256
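Once cleaning finishes, you can sanity-check the output by diffing it against the dirty input. A minimal sketch, assuming both files share the same shape and column order (the folder name is a placeholder for your <my dataset folder>):

import pandas as pd

dirty = pd.read_csv("my_dataset/dirty01.csv")    # placeholder path
cleaned = pd.read_csv("my_dataset/lopster.csv")  # output of lopster.py

# Count cells the cleaner modified; string-compare to sidestep dtype drift.
changed = (dirty.astype(str) != cleaned.astype(str)).to_numpy().sum()
print(f"{changed} of {dirty.size} cells modified")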
We provide the REIN benchmark data used in the paper, as is, in rein_data.zip. After decompressing the data into the DATASETS_REIN/ folder, create a Python 3 environment and install the dependencies listed in requirements.txt. A new Lopster model can then be trained on any dataset as long as its configuration is mapped in the datasets.py script.
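A typical setup, assuming a standard venv workflow (adjust the target directory if the archive already contains a top-level folder):

unzip rein_data.zip -d DATASETS_REIN/
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt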
To reproduce the paper's results on the REIN benchmark, use the provided script rein_benchmark.py. For example:
python3 rein_benchmark.py --dataset adult --K 12 --latent 120 --epochs 100 --batch_size 256 --eval_tuples 30000
All our ablation studies are available in ablation_studies.py, which is preconfigured to replicate only the ablations published in the paper:
python3 ablation_studies.py --dataset soccer_PLAYER --K 12 --latent 120 --epochs 40 --batch_size 256 --training_tuples 30000
All evaluation files are saved inside the evaluation/ folder as .csv files and can be easily plotted. We provide a script, plotter.py, to replicate our plots. Usage is as follows:
python3 plotter.py --experiment tuplewise --dataset adult --y_title --legend
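For custom plots, the evaluation CSVs can also be loaded directly. A minimal sketch, assuming standard comma-separated files (the file name below is hypothetical; list the evaluation/ folder for the actual names):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("evaluation/adult_tuplewise.csv")  # hypothetical file name
print(df.columns.tolist())  # inspect the available columns first
df.plot(x=df.columns[0])    # quick line plot against the first column
plt.savefig("adult_tuplewise.png")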
The largest dataset used in the paper's experiments is available for download at Large Soccer Dataset. All the others are available in the DATASETS_REIN/ folder.