This repository contains the code for the PVLDB paper: "Scalable Model-Based Management of Massive High Frequency Wind Turbine Data with ModelarDB"
Abduvoris Abduvakhobov, Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen
ModelarDB Analyzer is a Python program developed for analyzing the compression effectiveness and the quality of the compressed data produced by the ModelarDB TSMS's error-bounded lossy and lossless compression. We use ModelarDB's JVM-based implementation in our evaluation.
Experiments can be performed with the public WTM dataset, which is also included in the data/ directory in different formats and structures (i.e., as univariate and multivariate time series).
Scalable Model-Based Management of Massive High Frequency Wind Turbine Data with ModelarDB
by Abduvoris Abduvakhobov, Søren Kejser Jensen, Torben Bach Pedersen, and Christian Thomsen
in The Proceedings of the VLDB Endowment, 17(13): 4723-4732, 2024
Links: PVLDB
- Java Development Kit for ModelarDB. The following were tested:
- OpenJDK 11
- Oracle's Java SE Development Kit 11
- Scala Build Tool (sbt)
- Conda package manager
- Clone ModelarDB Analyzer from this repository
- Create a new virtual environment with conda using the requirements.txt file:
conda create --name <my_env> --file requirements.txt
- Activate the new environment:
conda activate <my_env>
- In the project root, clone ModelarDB's JVM-based implementation.
- Change to the ModelarDB directory and verify that HEAD is at tag v0.3.0. If it is not, check the tag out (see the example commands below).
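For example, assuming the JVM-based implementation is hosted at https://github.com/ModelarData/ModelarDB (verify the URL before cloning):
git clone https://github.com/ModelarData/ModelarDB.git
cd ModelarDB
git checkout v0.3.0
cd ..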
ModelarDB is configured using a configuration file (i.e., modelardb.conf). In our experiments, we only use a few of its options, so we created config.cfg for centralized and simplified management of all configurations used in this paper. For more details on ModelarDB's configuration options, check the instructions in the ModelarDB repository and the comments in the modelardb.conf file. Specifically, we configure only the following parameters:
- modelardb.engine - the processing engine. We use the option spark (i.e., Apache Spark) for evaluating OLAP queries, the compression factor, and data quality.
- modelardb.source - a file path or ip:port pointing to the data source. In our case, these are the high-frequency datasets PCD, MTD, and WTM*.
- modelardb.dimensions - the path to a schema file (dimensions file) that specifies a hierarchy for the ingested data. We include the dimensions file for WTM. More on dimensions can be found here.
- modelardb.error_bound - the pointwise error bound used for lossy compression with ModelarDB.
- modelardb.sampling_interval - the sampling interval of the data source. ModelarDB's JVM-based implementation assumes that a regular time series is provided for ingestion.
- modelardb.interface - we configure the Apache Arrow Flight interface for the evaluation of OLAP queries with ModelarDB.
*Please note that only WTM is a public dataset that can be used for reproducibility. Thus, in this repository, we automatically configure all ModelarDB configuration options for WTM (see the sketch below); the remaining parameters are left unchanged.
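For reference, a minimal sketch of what the WTM-tuned settings look like, using the key-value layout shown in ModelarDB's modelardb.conf. The paths and values below are illustrative placeholders, not the exact values from config.cfg, and the value names (e.g., for the Apache Arrow Flight interface) should be checked against the comments in modelardb.conf:
modelardb.engine spark
modelardb.source data/WTM
modelardb.dimensions data/dimensions.txt
modelardb.error_bound 0.0
modelardb.sampling_interval 1000
modelardb.interface arrow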
Use ModelarDB Analyzer to evaluate ModelarDB's compression effectiveness and the MAPE of its lossy compression:
- Change to ModelarDB:
cd ModelarDB
- To add extended logging to ModelarDB, apply the git patch file in the patches directory with:
git apply ../patches/ModelarDB-Extended-Logging.patch
- Run main.py:
python3 main.py
Please note that the configurations in config.cfg are tuned for WTM, so you do not need to change them when testing with WTM.
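For context, MAPE (Mean Absolute Percentage Error) is the data quality metric evaluated for the lossy compression. A minimal sketch of the standard definition in Python (not main.py's actual implementation; the function name and arrays are illustrative):

```python
import numpy as np

def mape(original: np.ndarray, reconstructed: np.ndarray) -> float:
    # Mean Absolute Percentage Error between the raw values and the
    # values reconstructed from ModelarDB's error-bounded models.
    # Zeros in the original series would cause division by zero and
    # must be handled by the caller.
    return 100.0 * float(np.mean(np.abs((original - reconstructed) / original)))

# A uniform 1% pointwise deviation yields a MAPE of 1.0.
print(mape(np.array([10.0, 20.0, 30.0]), np.array([10.1, 20.2, 30.3])))
```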
- In the ModelarDB directory, run:
git restore .
- Change to the project root:
cd ..
- Run:
python3 utils/conf_change_olap_querying.py
- Run loader_querying.sh with two parameters: 1) the path to ModelarDB and 2) the error bound value for ingestion, e.g.:
./loader_querying.sh ModelarDB/ 0.0
ModelarDB will start ingesting the WTM dataset and open the query interface using Apache Arrow Flight at localhost:9999.
- In a separate terminal window, run:
python3 run_olap_queries.py
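For context, run_olap_queries.py talks to this Apache Arrow Flight endpoint. A minimal client sketch, assuming the server accepts a SQL query as the Flight ticket (a common convention; the query and table name below are illustrative, see run_olap_queries.py for the exact protocol and the queries used in the paper):

```python
import pyarrow.flight as flight

# Connect to the query interface opened by ModelarDB.
client = flight.FlightClient("grpc://localhost:9999")

# Send a query as a ticket and fetch the result as an Arrow table.
ticket = flight.Ticket(b"SELECT COUNT(*) FROM wtm")  # illustrative query
reader = client.do_get(ticket)
print(reader.read_all())
```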
The transfer efficiency experiments were performed only with the highest-frequency, proprietary dataset PCD. To set up the edge and cloud nodes, we followed ModelarDB's user manual and the configuration file modelardb.conf.
Other hints:
- Make sure to configure the same dimensions file and error bound in the modelardb.conf file of both ModelarDB instances (edge and cloud); the hints are combined in the sketch after this list
- Make sure that the cloud instance's modelardb.conf enables the server transfer mode*:
modelardb.transfer server
- Make sure that the edge instance's modelardb.conf file has the cloud instance's IP address for transfer:
modelardb.transfer xxx.xx.xx.xx
- Start the cloud and the edge instance, either with sbt or a compiled jar file on the edge and as a Spark job in the cloud. For more options, refer to ModelarDB's user manual.
* For the transfer efficiency experiments, which require an edge-to-cloud environment, make sure that the hostname's IP resolves to the address shown in ModelarDB's modelardb.conf file. Otherwise, it must be changed in /etc/hosts on the cloud node.
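Putting these hints together, the transfer-related lines of the two modelardb.conf files look as follows (a sketch; the path, error bound, and IP address are placeholders):
Both instances (identical dimensions file and error bound):
modelardb.dimensions data/dimensions.txt
modelardb.error_bound 1.0
Cloud instance only:
modelardb.transfer server
Edge instance only (the cloud instance's IP address):
modelardb.transfer 192.0.2.10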
- Change to the Baselines directory
- Choose the subdirectory for the chosen method (i.e., either AGG-experiments or IoTDB-experiments). There you will find the respective bash scripts and detailed instructions for both AGG and IoTDB.
The default configurations for both methods use the public WTM dataset.
The code is licensed under version 2.0 of the Apache License and a copy of the license is bundled with the program.
The code uses components from ModelarDB which is released under version 2.0 of the Apache License.