Extract competency triples from written text.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Python 3.7
- pipenv
- (optional) pyenv to install the required Python version automatically
- If pyenv is not installed, Python 3.7 must already be present; otherwise pyenv will install it
- Java JRE 1.8+ for CoreNLP server
- Stanford CoreNLP
Set up a Python virtual environment and install all dependencies
$ pipenv install --dev
ComPex requires an installation of CoreNLP with German models. Download the required CoreNLP Java server and German models from here to a destination of your choosing. You can use the following script to automate this process; it downloads all required files to ./.corenlp:
$ ./download_corenlp.sh
Enter pipenv virtual environment
$ pipenv shell
Set the environment variable $CORENLP_HOME to the directory where CoreNLP and the German models are located. If you used the helper script download_corenlp.sh, the files are in ./.corenlp.
$ export CORENLP_HOME=./.corenlp
Show general help
$ python -m compex -h
Show help for the extract subcommand
$ python -m compex extract -h
Extract competencies from a simple sentence (you can pipe text data into compex!)
$ echo "Die studierenden beherrschen grundlegende Techniken des wissenschaftlichen Arbeitens." | python -m compex extract
or use a file
$ python -m compex extract testsentences.txt
or use stdin
$ python -m compex extract < testsentences.txt
Check for taxonomy verbs. This checks whether a found competency verb is in the given taxonomy verb dictionary; if not, it is ignored. In addition, this option fills the taxonomy_dimension field of the extracted competency. You can use the sample file blooms_taxonomy.json.
$ python -m compex extract --taxonomyjson blooms_taxonomy.json testsentences.txt
Sample output on stdout (formatted for better readability)
{
"Die studierenden beherrschen grundlegende Techniken des wissenschaftlichen Arbeitens.": [
{
"objects": [],
"taxonomy_dimension": null,
"word": {
"index": 2,
"word": "beherrschen"
}
}
]
}
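The JSON structure above can also be consumed programmatically. Below is a minimal sketch (an illustration only, not part of compex itself) that calls the CLI from Python via the subprocess module and parses the result, assuming the pipenv environment is active and $CORENLP_HOME is set as above:

import json
import subprocess

# Illustrative only: invoke the compex CLI and parse its JSON output.
sentence = "Die studierenden beherrschen grundlegende Techniken des wissenschaftlichen Arbeitens."
result = subprocess.run(
    ["python", "-m", "compex", "extract"],
    input=sentence,
    capture_output=True,
    text=True,
    check=True,
)

# The output maps each input sentence to a list of extracted competencies.
for text, competencies in json.loads(result.stdout).items():
    for competency in competencies:
        print(text, "->", competency["word"]["word"])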
Evaluate compex against pre-annotated data. Outputs recall, precision and F1.
To evaluate, a pre-annotated WebAnno TSV 3.2 file is needed. See here for the file format. You can use WebAnno to annotate data and evaluate compex with it. This repository contains pre-annotated data from module handbooks of Department VI of Beuth University of Applied Sciences Berlin. They can be found at tests/resources/bht-annotated. The corresponding WebAnno project is located at tests/resources/webanno/BHT+Test_2020-03-22_1808.zip.
Show help for the evaluate subcommand
$ python -m compex evaluate -h
Evaluate only competency verbs
$ python -m compex evaluate tests/resources/test.tsv
Evaluate competency verbs and objects
$ python -m compex evaluate --objects tests/resources/test.tsv
Evaluate competency verbs, objects and contexts
$ python -m compex evaluate --objects --contexts tests/resources/test.tsv
It is possible to use a dedicated taxonomy JSON file, just like with the extract function
$ python -m compex evaluate --taxonomyjson blooms_taxonomy.json tests/resources/test.tsv
Sample evaluation output on stdout (formatted for better readability)
{
"f1": 0.5024705551113972,
"negatives": {
"false": 168.36206347622323,
"true": 81.63793652377686
},
"positives": {
"false": 137.53333333333336,
"true": 154.4666666666666
},
"precision": 0.5289954337899542,
"recall": 0.4784786862008745
}
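The metrics in this output follow the standard definitions and can be recomputed from the positive/negative counts. A small check in Python, using the values from the sample above:

# Recompute precision, recall, and F1 from the sample counts above.
tp = 154.4666666666666   # true positives
fp = 137.53333333333336  # false positives
fn = 168.36206347622323  # false negatives

precision = tp / (tp + fp)                          # ~0.529
recall = tp / (tp + fn)                             # ~0.478
f1 = 2 * precision * recall / (precision + recall)  # ~0.502
print(precision, recall, f1)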
Run unit tests. The CoreNLP server in ./.corenlp is required!
$ pytest
Run coverage
$ coverage run --source=./compex/ -m pytest
Export coverage report as html
$ coverage html
Generate coverage badge
$ coverage-badge -o coverage.svg
- Python 3.7
- pipenv - Python Development Workflow for Humans
- stanfordnlp - Python NLP Library for many Human Languages
- Stanford CoreNLP - Natural language software
- jsonpickle - Python library for serialization and deserialization of complex Python objects
- Timo Raschke - Initial work - traschke
This project is licensed under the MIT License - see the LICENSE file for details