This repository contains a set of scripts for creating the corpus and database used in the FILTER project. The corpus is a compilation of four folk poetry collections, of which currently two are public and two non-public. The pipeline produces a set of tables in CSV format, which are stored in the hsci-r/filter-data repository.
After cloning this repository, initialize the Git submodules for the source data using the command:
git submodule update --init --recursive
Further, install the Python dependencies. The preferred way is through Anaconda, using the environment file env.yml. On CSC computing clusters, it is recommended to use Tykky.
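For example, assuming a working Anaconda or Miniconda installation, the environment can typically be created and activated as follows (the environment name is the one defined inside env.yml):

conda env create -f env.yml
conda activate <environment name defined in env.yml>

On CSC clusters, Tykky can be used to build a containerized environment from the same env.yml file; see the CSC documentation for the exact commands.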
The different steps of the pipeline are called using GNU Make. The environment variable DATA_DIR should be set to the path of the output directory (which will contain the resulting CSV files). For example, to run the preprocessing step (the combined target), execute:

DATA_DIR=/path/to/filter-data make combined
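Alternatively, DATA_DIR can be exported once for the shell session, so that it does not need to be repeated for every make invocation:

export DATA_DIR=/path/to/filter-data
make combined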
The corpus consists of four collections, which are linked as submodules in data/raw:
- Suomen Kansan Vanhat Runot (SKVR) (public)
- Eesti Regilaulude Andmebaas (ERAB) (public)
- Julkaisemattomat Runot (JR) (private)
- Kirjalliset Runot (KR) (private)
The private repositories are planned to be published soon, but currently the pipeline can also be executed without them.
A description of the format of the source files can be found here.
TODO
TODO
The code published in this repository is licensed under the MIT license.
For the folk poetry materials linked as submodules, see the information in their repositories.