Wilhelm Data Loader is a bundle of data pipelines that read wilhelmlang.com's vocabulary from supported data sources and load it into graph databases.
Some features can be reused as an SDK, which can be installed via
pip install wilhelm_data_loader
Detailed documentation can be found at sdk.wilhelmlang.com
wilhelm-data-loader is designed for a single-tenant application, wilhelmlang.com. To support cross-language inference, all data is loaded into a single Database. The data of each language resides in a dedicated Collection.
There are n + 2 Collections loaded:
- n document collections for n languages supported by wiktionary-data
- 1 document collection for "Definition" entity, where the English definition of each word resides in one document
- 1 edge collection for connections between words and definitions as well as those among words themselves
See Collection Types for differences between document & edge collections
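The n + 2 layout can be sketched in plain Python, without touching a database. The language names and the edge collection name "Link" below are hypothetical placeholders; only the "Definition" collection is named in this document.

```python
def plan_collections(languages):
    """Map each collection name to its collection type ("document" or "edge")."""
    # n document collections, one per supported language
    plan = {language: "document" for language in languages}
    # 1 document collection holding the English definitions
    plan["Definition"] = "document"
    # 1 edge collection for word-definition and word-word links
    # ("Link" is an assumed name, not taken from the project)
    plan["Link"] = "edge"
    return plan


languages = ["German", "Latin", "AncientGreek"]  # assumed example languages
plan = plan_collections(languages)
print(len(plan) == len(languages) + 2)
```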
Each collection builds an index on the word term. If the term comes with a gender modifier, such as "das Auto" (car, in German), a new computed attribute with the modifier stripped off is indexed instead.
By far the fastest way to load large datasets into Neo4j is to use the bulk loader.
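As an illustration of the stripped attribute, here is a minimal sketch in Python. It assumes the modifier is a leading German definite article; the function name `index_key` and the article list are hypothetical, and the loader itself may compute this attribute inside the database rather than in Python.

```python
# Assumed set of gender modifiers (German definite articles only)
GERMAN_ARTICLES = ("der", "die", "das")


def index_key(term: str) -> str:
    """Return the term without a leading article, or the term unchanged."""
    head, _, rest = term.partition(" ")
    # Strip only when something follows the article, so a bare "die" is kept
    if rest and head.lower() in GERMAN_ARTICLES:
        return rest.strip()
    return term


print(index_key("das Auto"))  # Auto
print(index_key("lesen"))     # lesen
```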
The cache here is defined as the set of all connected components formed by all vocabularies.
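To make the definition concrete, the following sketch computes connected components with a breadth-first search over an undirected graph. The function name and the sample vocabulary are assumptions for illustration, not the loader's actual code.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return a list of sets, one per connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        # Links are treated as undirected for the purpose of grouping
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical sample data: two German words share one definition
nodes = ["das Auto", "der Wagen", "car", "lesen", "to read"]
edges = [("das Auto", "car"), ("der Wagen", "car"), ("lesen", "to read")]
for component in connected_components(nodes, edges):
    print(sorted(component))
```

With a few tens of thousands of nodes, a single pass like this runs in time linear in the number of nodes and edges.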
Computing the cache directly within the webservice is not possible because Hugging Face Datasets does not have a Java API. A common cache store such as Redis is overkill because this cache is read-only. The best option is therefore a file-based cache.
Since wilhelm-vocabulary is a highly personalized, manually curated dataset, it is safe to assume that it won't grow large. In fact, it contains no more than tens of thousands of nodes. This allows for a simpler cache-loading algorithm that is easier to maintain.
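A read-only, file-based cache of this size can be a single file that is written once and loaded whole. The sketch below uses JSON as the file format; that choice, the function names, and the sample data are assumptions, not the project's actual cache layout.

```python
import json
import tempfile
from pathlib import Path


def write_cache(components, path):
    """Serialise components (an iterable of sets) as a JSON list of sorted lists."""
    serialisable = [sorted(component) for component in components]
    Path(path).write_text(
        json.dumps(serialisable, ensure_ascii=False), encoding="utf-8"
    )


def read_cache(path):
    """Load the whole cache into memory as a list of sets."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return [set(component) for component in raw]


# Hypothetical sample components
components = [{"das Auto", "car"}, {"lesen", "to read"}]
with tempfile.TemporaryDirectory() as directory:
    cache_file = Path(directory) / "cache.json"
    write_cache(components, cache_file)
    print(read_cache(cache_file) == components)
```

A plain-text format such as JSON has the side benefit that a consumer written in another language can read the same file.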
Get the source code:
git clone git@github.com:QubitPi/wilhelm-data-loader.git
cd wilhelm-data-loader
It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python environment with
python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv
To activate this environment:
source .venv/bin/activate
or, on Windows
.venv\Scripts\activate
To deactivate this environment, use
deactivate
Install the dependencies:
pip3 install -r requirements.txt
The use and distribution terms for Wilhelm Graph Database Python SDK are covered by the Apache License, Version 2.0.