This repository contains scripts and associated data for importing, converting and otherwise processing corpus data in Kielipankki – The Language Bank of Finland.
In particular, the repository contains scripts converting data to the VRT (VeRticalized Text) format used as an input format for the IMS Open Corpus Workbench (CWB) and Korp. Although many scripts are specific to the Language Bank of Finland, some of them may be more generally useful, or at least they may be adapted to other environments.
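As a rough illustration of what VRT looks like, the following minimal Python sketch renders tokenized sentences as VRT: one token per line, positional attributes separated by tabs, and structures marked with XML-like tags. It is not taken from the scripts in this repository. The function name `to_vrt`, the structure names `text` and `sentence`, and the attribute order (word, lemma, part of speech) are assumptions chosen for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vrt(text_id, sentences):
    """Render tokenized sentences as a VRT string (illustrative sketch).

    sentences: a list of sentences, each a list of (word, lemma, pos)
    tuples. The attribute set and its order are assumptions made for
    this example, not a fixed convention.
    """
    lines = [f"<text id={quoteattr(text_id)}>"]
    for sentence in sentences:
        lines.append("<sentence>")
        for token in sentence:
            # One token per line; positional attributes are tab-separated.
            # The characters &, < and > are escaped as XML entities.
            lines.append("\t".join(escape(attr) for attr in token))
        lines.append("</sentence>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_vrt("t1", [[("Dogs", "dog", "N"), ("bark", "bark", "V")]]), end="")
```

Real corpora typically carry more attributes and structural levels (such as paragraphs), so the sketch only shows the basic shape of the format.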
Note: Corpus data itself should not be included in this public repository. Neither should any secret or private information, such as passwords.
In general, the top-level directory structure is as follows:
corp/
: corpus-specific scripts (and associated data), containing subdirectories by corpus, group of corpora, corpus origin (owner) or corpus type (such as speech)

docs/
: general documentation on corpus processing

fulltext/
: scripts to be included on corpus full-text HTML pages

scripts/
: general-purpose scripts

vrt-tools/
: FIN-CLARIN VRT Tools: tools for processing VeRticalized Text data
For a major subsystem of scripts, such as harvesting or parsing, you may add a top-level directory of its own or, alternatively, a subdirectory under scripts/.
If you are unsure where to put scripts that are still in development, you can develop them in a private branch first and merge it into master only once the scripts are relatively stable.
This repository is the public successor of the previous Kielipankki-konversio repository that also contained some private data. The repository was made public on 2020-02-04. Any further development should be done in this repository.