This repository contains code to automatically create a tagged sense corpus from OpenSubtitles2018. It also contains a lot of corpus-wrangling code, most notably code to convert (the CC-NC licensed) EuroSense into a format usable by finn-wsd-eval.
You will need HFST and OMorFi installed globally before beginning, since neither is currently installable from PyPI. You will also need poetry. You can then run
$ ./install.sh
There is a Makefile. Reading its source is a recommended next step after this README. It has variables for most file paths, which are overridable and have defaults. Overriding them is convenient when you supply intermediate steps/upstream corpora yourself, when you want outputs in a particular place, or when running with Docker and using bind mounts to make the aforementioned files appear on the host.
You can make the data needed for finn-wsd-eval by running::
make wsd-eval
which will make the STIFF and EuroSense WSD evaluation corpora, including trying to fetch all dependencies. However:

- It will take a long time. The longest step is building STIFF from scratch, which can take around two weeks. To speed things up, you can supply a premade stiff.raw.xml.zst downloaded from here (TODO).
- It will not fetch one dependency with restrictions upon it: BABEL2WN_MAP.
You will next need to set the environment variable BABEL2WN_MAP as the path to a TSV mapping from BabelNet synsets to WordNet synsets. You can either:
- Obtain the BabelNet indices by following these instructions and dump out the TSV by following the instructions at https://github.com/frankier/babelnet-lookup
- If you are affiliated with a research institution, I have permission to send you the TSV file, but you must send me a direct communication from your institutional email address. (Please shortly state your position/affiliation and non-commercial research use in the email so there is a record.)
- Alternatively (subject to the same conditions), if you prefer, I can just send you eurosense.unified.sample.xml and eurosense.unified.sample.key
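Once you have the TSV, loading it is straightforward. Here is a minimal sketch, assuming a two-column layout (BabelNet synset ID, then WordNet synset ID, tab-separated); check the actual dump produced by babelnet-lookup, as its layout may differ:

```python
import io

def load_babel2wn(fileobj):
    """Parse a two-column TSV of BabelNet synset ID -> WordNet synset ID.

    The exact column layout of BABEL2WN_MAP is an assumption here; adjust
    if your dump from babelnet-lookup differs.
    """
    mapping = {}
    for line in fileobj:
        line = line.rstrip("\n")
        if not line:
            continue
        babel_id, wn_id = line.split("\t")
        # One BabelNet synset can map to several WordNet synsets
        mapping.setdefault(babel_id, []).append(wn_id)
    return mapping

# Synthetic example data, not real synset IDs
sample = "bn:00000001n\twn:00001740n\nbn:00000001n\twn:00002000n\n"
print(load_babel2wn(io.StringIO(sample)))
```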
Run::
make corpus-eval
Both of the following pipelines first create a corpus tagged in the unified format, which consists of an XML file and a key file, and then create a directory containing the files needed by finn-wsd-eval.
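For orientation, here is a minimal sketch of reading such a key file, assuming it follows the common WSD gold-key convention of one instance ID followed by its sense key(s) per line; the instance IDs and sense keys below are made up, and the unified format's actual conventions may differ:

```python
def parse_key_file(lines):
    """Parse a WSD key file: each line is an instance ID followed by one
    or more sense keys, whitespace-separated. (That the unified format's
    .key file follows this common convention is an assumption.)
    """
    gold = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        gold[parts[0]] = parts[1:]
    return gold

# Hypothetical instance IDs and sense keys for illustration only
sample_key = [
    "stiff.d000.s000.t000 tietää%2:31:00::",
    "stiff.d000.s000.t001 aika%1:28:00::",
]
print(parse_key_file(sample_key))
```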
poetry run python scripts/fetch_opensubtitles2018.py cmn-fin
poetry run python scripts/pipeline.py mk-stiff cmn-fin stiff.raw.xml.zst
poetry run python scripts/variants.py proc bilingual-precision-4 stiff.raw.xml.zst stiff.bp4.xml.zst
./stiff2unified.sh stiff.bp4.xml.zst stiff.unified.bp4.xml stiff.unified.bp4.key
You will first need to obtain EuroSense. Since there are some language tagging issues with the original, I currently recommend you use a version I have attempted to fix.
You will next need to set the environment variable BABEL2WN_MAP as the path to a TSV mapping from BabelNet synsets to WordNet synsets. You can either:
- Obtain the BabelNet indices by following these instructions and dump out the TSV by following the instructions at https://github.com/frankier/babelnet-lookup
- If you are affiliated with a research institution, I have permission to send you the TSV file, but you must send me a direct communication from your institutional email address. (Please shortly state your position/affiliation and non-commercial research use in the email so there is a record.)
- Alternatively (subject to the same conditions) if you prefer, I can just send you eurosense.unified.sample.xml eurosense.unified.sample.key
Then run::
poetry run python scripts/pipeline.py eurosense2unified \
/path/to/eurosense.v1.0.high-precision.xml eurosense.unified.sample.xml \
eurosense.unified.sample.key
First obtain finn-man-ann.
Then run::
poetry run python scripts/munge.py man-ann-select --source=europarl \
../finn-man-ann/ann.xml - \
| poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 \
../finn-man-ann/ann.xml man-ann-opensubs18.xml
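To illustrate what man-ann-select does conceptually, here is a sketch that selects sentences by a source attribute from a miniature, made-up XML document; the real finn-man-ann schema may well differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a multi-source annotation file; the real
# finn-man-ann schema may differ. This just illustrates the idea of
# selecting annotations by their source, as man-ann-select does.
doc = ET.fromstring(
    "<corpus>"
    "<sentence source='europarl'><ann>a</ann></sentence>"
    "<sentence source='OpenSubtitles2018'><ann>b</ann></sentence>"
    "</corpus>"
)

def select_by_source(root, source):
    """Return all sentence elements whose source attribute matches."""
    return [s for s in root.iter("sentence") if s.get("source") == source]

print(len(select_by_source(doc, "europarl")))  # -> 1
```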
This makes a directory usable by finn-wsd-eval.
poetry run python scripts/pipeline.py unified-to-eval \
/path/to/stiff-or-eurosense.unified.xml /path/to/stiff-or-eurosense.unified.key \
stiff-or-eurosense.eval/
TODO: STIFF
poetry run python scripts/filter.py tok-span-dom man-ann-europarl.xml \
man-ann-europarl.filtered.xml
poetry run python scripts/pipeline.py stiff2unified --eurosense \
man-ann-europarl.filtered.xml man-ann-europarl.uni.xml man-ann-europarl.uni.key
poetry run python scripts/pipeline.py stiff2unified man-ann-opensubs18.xml \
man-ann-opensubs18.uni.xml man-ann-opensubs18.uni.key
poetry run python scripts/pipeline.py unified-auto-man-to-evals \
eurosense.unified.sample.xml man-ann-europarl.uni.xml \
eurosense.unified.sample.key man-ann-europarl.uni.key eurosense.eval
First process finn-man-ann.
poetry run python scripts/variants.py eval /path/to/stiff.raw.xml.zst stiff-eval-out
poetry run python scripts/eval.py pr-eval --score=tok <(poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 /path/to/finn-man-ann/ann.xml -) stiff-eval-out stiff-eval.csv
poetry run python scripts/munge.py man-ann-select --source=europarl /path/to/finn-man-ann/ann.xml - | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
mkdir eurosense-pr
mv /path/to/eurosense/high-precision.xml eurosense-pr/EP.xml
mv /path/to/eurosense/high-coverage.xml eurosense-pr/EC.xml
poetry run python scripts/eval.py pr-eval --score=tok man-ann-europarl.xml eurosense-pr europarl.csv
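As a rough illustration of the token-level scoring that pr-eval --score=tok performs, the following sketch computes precision and recall over sets of (position, sense) pairs. It is a simplification for illustration, not the script's actual implementation:

```python
def pr(gold, predicted):
    """Token-level precision/recall of predicted sense tags against gold.

    gold and predicted are sets of (token_position, sense) pairs. This is
    a simplified sketch of the kind of scoring pr-eval --score=tok does,
    not its exact implementation.
    """
    tp = len(gold & predicted)  # tags that exactly match a gold tag
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {(0, "s1"), (1, "s2"), (3, "s3")}
pred = {(0, "s1"), (1, "sX"), (2, "s4")}
print(pr(gold, pred))  # one correct tag out of 3 predicted and 3 gold
```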
Warning: the resulting plot may be misleading...
poetry run python scripts/eval.py pr-plot stiff-eval.csv europarl.csv
For help using the tools, try running them with --help. The main entry points are in scripts.
- scripts/tag.py: Produce an unfiltered STIFF
- scripts/filter.py: Filter STIFF according to various criteria
- scripts/munge.py: Convert between different corpus/stream formats
- scripts/stiff2unified.sh: Convert from STIFF format to the unified format
- scripts/pipeline.py: Various pipelines composing multiple layers of filtering/conversion
The Makefile and Makefile.manann are also worth reading to see how these tools are composed.