
Wilhelm Data Loader


Wilhelm Data Loader is a bundle of data pipelines that read wilhelmlang.com's vocabulary from supported data sources and load it into graph databases

Some features can be reused as an SDK, which can be installed via

pip install wilhelm_data_loader

Detailed documentation can be found at sdk.wilhelmlang.com

Wiktionary Data Loader (ArangoDB)

wilhelm-data-loader works naturally for a single-tenant application: wilhelmlang.com. To support cross-language inferencing, all data are loaded into a single Database. The data of each language resides in a dedicated Collection

There are n + 2 Collections loaded:

  • n document collections for n languages supported by wiktionary-data
  • 1 document collection for "Definition" entity, where the English definition of each word resides in one document
  • 1 edge collection for connections between words and definitions as well as those among words themselves

Tip

See Collection Types for differences between document & edge collections
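
As a concrete illustration, the collections above can be provisioned with python-arango. This is a minimal sketch; the database name, credentials, and the set of supported languages are hypothetical:

from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("wilhelm", username="root", password="passwd")  # hypothetical credentials

languages = ["German", "Latin", "AncientGreek"]  # hypothetical subset of the n languages

# n document collections, one per language
for language in languages:
    if not db.has_collection(language):
        db.create_collection(language)

# 1 document collection holding the English definitions
if not db.has_collection("Definition"):
    db.create_collection("Definition")

# 1 edge collection connecting words to definitions and words to words
if not db.has_collection("LinksTo"):
    db.create_collection("LinksTo", edge=True)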

Each collection generates an index on the word term. If the term comes with a gender modifier, such as "das Auto" (car, in German), a new computed attribute with the modifier stripped off is used for indexing instead
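
A minimal sketch of this indexing scheme, assuming the stripped form is computed in the loader and stored as a separate attribute before insertion (the attribute name indexed_term and the credentials are hypothetical):

from arango import ArangoClient

GENDER_ARTICLES = ("der ", "die ", "das ")

def strip_gender_article(term: str) -> str:
    """Return the term without its leading German gender article, if any."""
    for article in GENDER_ARTICLES:
        if term.startswith(article):
            return term[len(article):]
    return term

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("wilhelm", username="root", password="passwd")  # hypothetical credentials
german = db.collection("German")

# Persistent index on the computed, article-free attribute
german.add_persistent_index(fields=["indexed_term"])

word = {"term": "das Auto", "definition": "car"}
word["indexed_term"] = strip_gender_article(word["term"])  # "Auto"
german.insert(word)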

Wilhelm Vocabulary Loader

The absolute fastest way (by far) to load large datasets into Neo4j is to use the bulk loader
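
The bulk loader (neo4j-admin database import) consumes CSV files with special header columns such as :ID, :LABEL, :START_ID, :END_ID, and :TYPE. Below is a minimal sketch of preparing such files; the file names and vocabulary rows are hypothetical, and the exact neo4j-admin invocation varies by Neo4j version:

import csv

words = [("w1", "das Auto", "German"), ("w2", "car", "Definition")]  # hypothetical rows
links = [("w1", "w2", "DEFINED_AS")]  # hypothetical relationship

with open("nodes.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow([":ID", "term", ":LABEL"])
    writer.writerows(words)

with open("relationships.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    writer.writerows(links)

# The generated files are then passed to the bulk loader, e.g.
# neo4j-admin database import full --nodes=nodes.csv --relationships=relationships.csv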

The cache here is defined as the set of all connected components formed by all vocabularies.

Computing the cache directly within the webservice is not possible because Hugging Face Datasets does not offer a Java API. A common cache store such as Redis is overkill because this cache is going to be read-only. The best option is therefore a file-based cache
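
Since the cache is read-only at serving time, it can be a plain JSON file written once by the loader. A minimal sketch, assuming each connected component is a set of term strings and a hypothetical file name:

import json

def write_cache(components, path="vocabulary-components.json"):
    """Serialize the connected components to a JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump([sorted(component) for component in components], file, ensure_ascii=False)

def read_cache(path="vocabulary-components.json"):
    """Load the connected components back from the JSON file."""
    with open(path, encoding="utf-8") as file:
        return [set(component) for component in json.load(file)]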

Computing Cache

Since wilhelm-vocabulary is a highly personalized and manually-made data set, it is safe to assume the data size won't be large. In fact, it's no more than tens of thousands of nodes. This allows for a simpler cache-loading algorithm that is easier to maintain
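
At that scale, a straightforward union-find pass over the vocabulary edges is enough to compute the connected components. A minimal sketch; the edge-pair input format is an assumption:

from collections import defaultdict

def connected_components(edges):
    """Group terms into connected components with union-find (path compression)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for source, target in edges:  # edges: iterable of (source_term, target_term)
        union(source, target)

    components = defaultdict(set)
    for node in parent:
        components[find(node)].add(node)
    return list(components.values())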

Development

Environment Setup

Get the source code:

git clone git@github.com:QubitPi/wilhelm-data-loader.git
cd wilhelm-data-loader

It is strongly recommended to work in an isolated environment. Install virtualenv and create an isolated Python environment by

python3 -m pip install --user -U virtualenv
python3 -m virtualenv .venv

To activate this environment:

source .venv/bin/activate

or, on Windows

.venv\Scripts\activate

Tip

To deactivate this environment, use

deactivate

Installing Dependencies

pip3 install -r requirements.txt

License

The use and distribution terms for Wilhelm Data Loader are covered by the Apache License, Version 2.0.
