
Wilhelm Data Loader


Wilhelm Data Loader is a bundle of data pipelines that read wilhelmlang.com's vocabulary from supported data sources and load it into graph databases

Some features can be reused as an SDK, which can be installed via

pip install wilhelm_data_loader

Detailed documentation can be found at sdk.wilhelmlang.com

Wiktionary Data Loader (ArangoDB)

wilhelm-data-loader works naturally for a single-tenant application: wilhelmlang.com. To support cross-language inferencing, all data are loaded into a single Database. The data of each language resides in a dedicated Collection

There are n + 2 Collections loaded:

  • n document collections for n languages supported by wiktionary-data
  • 1 document collection for "Definition" entity, where the English definition of each word resides in one document
  • 1 edge collection for connections between words and definitions as well as those among words themselves

Tip

See Collection Types for differences between document & edge collections
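
As a concrete illustration, the collections above can be provisioned with python-arango. This is a minimal sketch; the database name, credentials, and the set of supported languages are hypothetical:

from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("wilhelm", username="root", password="passwd")  # hypothetical credentials

languages = ["German", "Latin", "AncientGreek"]  # hypothetical subset of the n languages

# n document collections, one per language
for language in languages:
    if not db.has_collection(language):
        db.create_collection(language)

# 1 document collection holding the English definitions
if not db.has_collection("Definition"):
    db.create_collection("Definition")

# 1 edge collection connecting words to definitions and words to words
if not db.has_collection("LinksTo"):
    db.create_collection("LinksTo", edge=True)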

Each collection generates an index on the word term. If the term comes with a gender modifier, such as "das Auto" (car, in German), a new computed attribute with the modifier stripped off is used for indexing instead
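
A minimal sketch of this indexing scheme, assuming the stripped form is computed in the loader and stored as a separate attribute before insertion (the attribute name indexed_term and the credentials are hypothetical):

from arango import ArangoClient

GENDER_ARTICLES = ("der ", "die ", "das ")

def strip_gender_article(term: str) -> str:
    """Return the term without its leading German gender article, if any."""
    for article in GENDER_ARTICLES:
        if term.startswith(article):
            return term[len(article):]
    return term

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("wilhelm", username="root", password="passwd")  # hypothetical credentials
german = db.collection("German")

# Persistent index on the computed, article-free attribute
german.add_persistent_index(fields=["indexed_term"])

word = {"term": "das Auto", "definition": "car"}
word["indexed_term"] = strip_gender_article(word["term"])  # "Auto"
german.insert(word)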

Wilhelm Vocabulary Loader

The absolute fastest way (by far) to load large datasets into Neo4j is to use the bulk loader
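
The bulk loader (neo4j-admin database import) consumes CSV files with special header columns such as :ID, :LABEL, :START_ID, :END_ID, and :TYPE. Below is a minimal sketch of preparing such files; the file names and vocabulary rows are hypothetical, and the exact neo4j-admin invocation varies by Neo4j version:

import csv

words = [("w1", "das Auto", "German"), ("w2", "car", "Definition")]  # hypothetical rows
links = [("w1", "w2", "DEFINED_AS")]  # hypothetical relationship

with open("nodes.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow([":ID", "term", ":LABEL"])
    writer.writerows(words)

with open("relationships.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    writer.writerows(links)

# The generated files are then passed to the bulk loader, e.g.
# neo4j-admin database import full --nodes=nodes.csv --relationships=relationships.csv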

The cache here is defined as the set of all connected components formed by all vocabularies.

Computing the cache directly within the webservice is not possible because Hugging Face Datasets does not offer a Java API. A common cache store such as Redis is overkill because this cache is going to be read-only. The best option is therefore a file-based cache
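
Since the cache is read-only at serving time, it can be a plain JSON file written once by the loader. A minimal sketch, assuming each connected component is a set of term strings and a hypothetical file name:

import json

def write_cache(components, path="vocabulary-components.json"):
    """Serialize the connected components to a JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump([sorted(component) for component in components], file, ensure_ascii=False)

def read_cache(path="vocabulary-components.json"):
    """Load the connected components back from the JSON file."""
    with open(path, encoding="utf-8") as file:
        return [set(component) for component in json.load(file)]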

Computing Cache

Since wilhelm-vocabulary is a highly personalized and manually-made data set, it is safe to assume the data size won't be large. In fact, it's no more than tens of thousands of nodes. This allows for a simpler cache-loading algorithm that is easier to maintain
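
At that scale, a straightforward union-find pass over the vocabulary edges is enough to compute the connected components. A minimal sketch; the edge-pair input format is an assumption:

from collections import defaultdict

def connected_components(edges):
    """Group terms into connected components with union-find (path compression)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for source, target in edges:  # edges: iterable of (source_term, target_term)
        union(source, target)

    components = defaultdict(set)
    for node in parent:
        components[find(node)].add(node)
    return list(components.values())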

Development

Environment Setup

Get the source code:

git clone git@github.com:QubitPi/wilhelm-data-loader.git
cd wilhelm-data-loader

It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python environment by

python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv

To activate this environment:

source .venv/bin/activate

or, on Windows

.venv\Scripts\activate

Tip

To deactivate this environment, use

deactivate

Installing Dependencies

pip3 install -r requirements.txt

License

The use and distribution terms for Wilhelm Data Loader are covered by the Apache License, Version 2.0.
