In recent years, interest from academia and industry in applying data science technologies to analyze large amounts of data has grown rapidly. While this process creates a myriad of artifacts (datasets, pipeline scripts, etc.), there has so far been no systematic attempt to holistically collect and exploit the knowledge and experience implicitly contained in those artifacts. Instead, data scientists recover information and experience from colleagues or learn via trial and error. Hence, this paper presents KGLiDS, a scalable system that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables a variety of downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML, and shows that KGLiDS is significantly faster and has a lower memory footprint than the state of the art while achieving comparable or better accuracy.
Try out our KGLiDS Colab Demo and KGLiDS DataPrep Demo, which demonstrate our APIs on Kaggle data!
To learn more about Linked Data Science and its applications, please watch Dr. Mansour's talk at Waterloo DSG Seminar (Here).
- Clone the `kglids` repo
- Create the `kglids` Conda environment (Python 3.8) and install the pip requirements
- Activate the `kglids` environment
```bash
conda create -n kglids python=3.8 -y
conda activate kglids
pip install -r requirements.txt
```
Generating the LiDS graph:
- Add the data sources to `config.py`:
```python
# sample configuration
# list of data sources to process
data_sources = [DataSource(name='benchmark',
                           path='/home/projects/sources/kaggle',
                           file_type='csv')]
```
- Run the Data profiler:
```bash
cd kg_governor/data_profiling/src/
python main.py
```
- Run the Knowledge graph builder to generate the data_items graph:
```bash
cd kg_governor/knowledge_graph_construction/src/
python data_global_schema_builder.py
```
- Run the Pipeline abstractor to generate the pipeline named graph(s):
```bash
cd kg_governor/pipeline_abstraction/
python pipelines_analysis.py
```
Uploading the LiDS graph to the graph engine (we recommend using GraphDB): please see populate_graphdb.py for an example of uploading graphs to GraphDB; a minimal sketch of the idea is shown below.
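As a rough illustration of that step, the snippet below uploads a serialized graph file to a GraphDB repository through its RDF4J-compatible REST endpoint. The server URL, repository name, and file name are assumptions for illustration only; populate_graphdb.py remains the authoritative reference.
```python
# Minimal sketch: upload a Turtle file to a GraphDB repository via its
# RDF4J-compatible REST API. The server URL, repository name, and file
# path below are illustrative assumptions -- adapt them to your setup.
import requests

GRAPHDB_URL = "http://localhost:7200"    # default GraphDB endpoint (assumption)
REPOSITORY = "kglids"                    # hypothetical repository name
GRAPH_FILE = "data_items_graph.ttl"      # hypothetical serialized LiDS graph file

with open(GRAPH_FILE, "rb") as f:
    response = requests.post(
        f"{GRAPHDB_URL}/repositories/{REPOSITORY}/statements",
        data=f,
        headers={"Content-Type": "text/turtle"},
    )
response.raise_for_status()
print("Graph uploaded successfully.")
```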
Using the KGLiDS APIs:
KGLiDS provides predefined operations in the form of Python APIs that integrate seamlessly with a conventional data science pipeline; a rough usage sketch is shown below. Check out the full list of KGLiDS APIs.
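As a rough illustration of what such an integration looks like from a notebook, the sketch below invokes a data-discovery operation on a local dataframe. The import path, constructor parameters, and method name are hypothetical placeholders, not the documented KGLiDS signatures; consult the API list above for the actual interface.
```python
# Illustrative sketch only: the import path, constructor parameters, and
# method name below are hypothetical placeholders, not the documented
# KGLiDS API. Consult the API list for the actual interface.
import pandas as pd
from kglids_api import KGLiDS  # hypothetical import path

# Connect to the graph engine hosting the LiDS graph (parameters are assumptions)
kglids = KGLiDS(endpoint='http://localhost:7200', repository='kglids')

df = pd.read_csv('titanic.csv')  # any local table

# Hypothetical data-discovery call: retrieve tables related to the given dataframe
candidates = kglids.find_related_tables(df, top_k=5)
print(candidates)
```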
To store the created knowledge graph in a standardized and well-structured way,
we developed an ontology for linked data science: the LiDS Ontology.
Check out the LiDS Ontology!
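For instance, once the LiDS graph is loaded into GraphDB, it can be queried with standard SPARQL. The sketch below lists the ontology classes instantiated in the graph; the endpoint URL and repository name are assumptions, and the query uses only generic RDF terms rather than specific LiDS ontology predicates.
```python
# Minimal sketch: query the LiDS graph in GraphDB with standard SPARQL.
# The endpoint URL and repository name are illustrative assumptions.
import requests

SPARQL_ENDPOINT = "http://localhost:7200/repositories/kglids"

# Generic query: list the classes instantiated in the graph and their counts
query = """
SELECT DISTINCT ?class (COUNT(?s) AS ?instances)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?instances)
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["class"]["value"], binding["instances"]["value"])
```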
The following benchmark datasets were used to evaluate KGLiDS:
- Data Discovery: Table Union Search
- Kaggle
See the full list of supported APIs here.
If you find our work useful, please cite it in your research.
@INPROCEEDINGS{kglids,
author={Helali, Mossad and Monjazeb, Niki and Vashisth, Shubham and Carrier, Philippe and Helal, Ahmed and Cavalcante, Antonio and Ammar, Khaled and Hose, Katja and Mansour, Essam},
booktitle={2024 IEEE 40th International Conference on Data Engineering (ICDE)},
title={KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science},
year={2024},
pages={179-192},
url={https://doi.org/10.1109/ICDE60146.2024.00021},
ISSN={2375-026X},
}
We welcome contributions and bug fixes. Please don't hesitate to open a PR or create an issue if you encounter any bugs.
For any questions, please contact us: