This repository contains a set of scripts for creating the corpus and database used in the FILTER project. The corpus is a compilation of four folk poetry collections, of which currently two are public and two non-public. The pipeline produces a set of tables in CSV format, which are stored in the hsci-r/filter-data repository.
After cloning this repository, initialize the Git submodules for the source data using the command:
git submodule update --init --recursive
Further, install the Python dependencies. The preferred way is through Anaconda, using the environment file env.yml. On CSC computing clusters, it is recommended to use Tykky.
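For example, assuming a working Anaconda or Miniconda installation, the environment can typically be created and activated as follows (the environment name is the one defined inside env.yml):

conda env create -f env.yml
conda activate <environment name defined in env.yml>

On CSC clusters, Tykky can be used to build a containerized environment from the same env.yml file; see the CSC documentation for the exact commands.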
The different steps of the pipeline are called using GNU Make. The environment variable DATA_DIR should be set to the path of the output directory (which will contain the resulting CSV files). For example, to run the preprocessing step (the combined target), execute:

DATA_DIR=/path/to/filter-data make combined
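Alternatively, DATA_DIR can be exported once for the shell session, so that it does not need to be repeated for every make invocation:

export DATA_DIR=/path/to/filter-data
make combined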
The corpus consists of four collections, which are linked as submodules in data/raw:
- Suomen Kansan Vanhat Runot (SKVR) (public)
- Eesti Regilaulude Andmebaas (ERAB) (public)
- Julkaisemattomat Runot (JR) (private)
- Kirjalliset Runot (KR) (private)
The private repositories are planned to be published soon, but currently the pipeline can also be executed without them.
A description of the format of the source files can be found here.
TODO
TODO
The code published in this repository is licensed under the MIT license.
For the folk poetry materials linked as submodules, see the information in their repositories.