- First you'll have to download the Tatoeba dataset. Run

  ```sh
  ./download-data.sh
  ```

  You will find the data under `data-tatoeba/` inside this folder.
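
  If you want to check what the script fetched, a minimal sketch (the exact file names under `data-tatoeba/` depend on what the script downloads):

  ```python
  from pathlib import Path

  # List whatever download-data.sh placed in data-tatoeba/.
  for path in sorted(Path("data-tatoeba").iterdir()):
      print(path.name, f"{path.stat().st_size:,} bytes")
  ```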
- Download any required sentence pairs you are interested in and place them in `data-tatoeba/`. As these sentence pairs are always created on the fly, this step has to happen manually. Sentence pair data should be named `sentences_{SRC_LANG}_{TGT_LANG}.tsv`, where `SRC_LANG` and `TGT_LANG` are lowercased two-letter language codes, for example `sentences_uk_de.tsv`.
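
  Tatoeba pair exports are plain tab-separated text, but the exact columns depend on the fields you selected when creating the export. A minimal sketch for peeking at a downloaded file before processing it (uses the example file name from above):

  ```python
  import csv

  # Print the first rows of a sentence pair export so you can check
  # the column layout before running the pipeline.
  with open("data-tatoeba/sentences_uk_de.tsv", encoding="utf-8", newline="") as f:
      for i, row in enumerate(csv.reader(f, delimiter="\t")):
          print(row)
          if i == 4:
              break
  ```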
- Run the `./generate_data.ipynb` notebook to process raw data into TSVs. Change input variables as needed.
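
  The notebook's actual variable names aren't documented here; hypothetically, its inputs follow the same language-code convention as the file names above:

  ```python
  # Hypothetical input variables -- adjust to the names actually used
  # in generate_data.ipynb.
  SRC_LANG = "uk"  # lowercased two-letter source language code
  TGT_LANG = "de"  # lowercased two-letter target language code
  INPUT_FILE = f"data-tatoeba/sentences_{SRC_LANG}_{TGT_LANG}.tsv"
  ```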
- Run the `./import_sql.ipynb` notebook to import the generated TSVs into a local SQLite database.
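
  The notebook encapsulates the actual import; conceptually it boils down to something like this sketch (database, table, and column names here are hypothetical, not the ones the notebook uses):

  ```python
  import csv
  import sqlite3

  # Hypothetical TSV -> SQLite import; the real schema lives in
  # import_sql.ipynb.
  conn = sqlite3.connect("local.db")
  conn.execute("CREATE TABLE IF NOT EXISTS sentence_pairs (src TEXT, tgt TEXT)")
  with open("data-tatoeba/sentences_uk_de.tsv", encoding="utf-8", newline="") as f:
      rows = [(r[0], r[1]) for r in csv.reader(f, delimiter="\t") if len(r) >= 2]
  conn.executemany("INSERT INTO sentence_pairs VALUES (?, ?)", rows)
  conn.commit()
  conn.close()
  ```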
- Run the `./generate_exercise_precursors.ipynb` notebook to generate exercise precursors.
- Run the `./similar_words.ipynb` notebook to generate similar words.
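
  How the notebook defines "similar" isn't spelled out here; purely as an illustration, naive look-alike candidates can be produced with the standard library (this is not necessarily the notebook's method):

  ```python
  import difflib

  # Naive string-similarity illustration; similar_words.ipynb may use
  # a different notion of similarity entirely.
  vocabulary = ["Haus", "Maus", "Baum", "Traum", "Hose"]
  print(difflib.get_close_matches("Haus", vocabulary, n=3, cutoff=0.6))
  # -> ['Haus', 'Maus']
  ```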
- Optionally do a manual quality control check over the generated data.
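
  A quick programmatic spot check can complement the manual pass, e.g. counting rows and flagging empty fields (the path is an assumption -- point it at wherever your notebook wrote its output):

  ```python
  import csv

  # Flag rows with empty fields in a generated TSV.
  with open("exercise-import.tsv", encoding="utf-8", newline="") as f:
      rows = list(csv.reader(f, delimiter="\t"))
  bad = [i for i, r in enumerate(rows) if any(not field.strip() for field in r)]
  print(f"{len(rows)} rows, {len(bad)} with empty fields: {bad[:10]}")
  ```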
- To populate the exercise table, which is the main entity in the API server, run

  ```sh
  python3 populate_exercise_table.py
  ```

  The script expects the following two files to exist (they have to be moved to that folder manually):
  - `data-import/exercise-import.tsv`: the output of the `./generate_exercise_precursors.ipynb` notebook
  - `data-import/similar-words-import.tsv`: the output of the `./similar_words.ipynb` notebook
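
  Before running the script, you can confirm that both expected files are in place:

  ```python
  from pathlib import Path

  # Verify the manually moved inputs exist before running
  # populate_exercise_table.py.
  for name in ("exercise-import.tsv", "similar-words-import.tsv"):
      path = Path("data-import") / name
      print(path, "ok" if path.is_file() else "MISSING")
  ```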
- Optionally run the `./generate_sentence_audio` notebook to generate audio files.
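
  Which TTS backend the notebook uses isn't documented here; purely as an illustration, synthesizing one sentence with the gTTS package would look like this:

  ```python
  # Illustrative only: gTTS (pip install gTTS) is one possible TTS
  # backend; generate_sentence_audio may use something else.
  from gtts import gTTS

  gTTS("Guten Morgen!", lang="de").save("guten_morgen.mp3")
  ```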
After all of the above steps, you should have a ready-to-use `taskpool.db` SQLite DB in the parent folder, which will be used by the API server.
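
A quick way to confirm the database came out right is to list its tables (the relative path assumes you run this from inside this folder):

```python
import sqlite3

# Final sanity check: list the tables in the generated database.
conn = sqlite3.connect("../taskpool.db")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print([name for (name,) in tables])
conn.close()
```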