An end-to-end NLP pipeline for Coptic text in UTF-8 encoding.
Online production version available as a web interface at: https://tools.copticscriptorium.org/coptic-nlp/
The pipeline supports normalization, segmentation (at the word and subword levels), part of speech tagging, lemmatization, language of origin detection, sentence splitting, syntactic dependency parsing, multiword expression recognition, entity recognition, Wikification, and more.
🔥New🔥: The coptic-nlp now supports both Sahidic and Bohairic Coptic dialect varieties!
The NLP pipeline can run as a script or as part of the included web interface via a web server (e.g. Apache, using index.py as a landing page). To run it:
- Clone this repository into the directory that the script should run on
- Ensure that the relevant user has permissions to run scripts in the directory
- Make sure the dependencies under Requirements are installed
The NLP pipeline will run on Python 2.7+ or Python 3.5+ (2.6 and lower are not supported). See requirements.txt for required libraries.
The pipeline also requires perl for segmentation. If you want to use the old marmot tagger instead of flair (not recommended), or the old Malt parser model or Marmot tagging model (also not recommended) instead of the Python models, java needs to be available. Additionally if you want to use the old TreeTagger for POS tagging and lemmatization, TreeTagger must be installed. These are not included in the distribution but the script will offer to attempt to download them if they are missing.
Note on using TreeTagger with older Linux distributions: the latest TreeTagger binaries do not run on some older Linux distributions. When automatically downloading TreeTagger, the script will attempt to notice this. If you receive the error FATAL: kernel too old
, please contact @amir-zeldes or open an issue describing your Linux version so it can be added to the script handler. The compatible older version of TreeTagger can be downloaded manually from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2-old5.tar.gz. This should not be necessary is you are using our recommended Python tagger built using flair.
usage: python coptic_nlp.py [OPTIONS] files
positional arguments:
files File name or pattern of files to process (e.g. *.txt)
optional arguments:
-h, --help show this help message and exit
standard module options:
-u, --unary Binarize unary XML milestone tags
-t, --tag Do POS tagging
-l, --lemma Do lemmatization
-n, --norm Do normalization
-m, --multiword Tag multiword expressions
-b, --breaklines Add line tags at line breaks
-p, --parse Parse with dependency parser
-e, --etym Add etymolgical language of origin for loan words
-s SENT, --sent SENT XML tag to split sentences, e.g. verse for <verse ..>
(otherwise PUNCT tag is used to split sentences);
use -s=predict to use neural segmenter instead
-o {pipes,sgml,conllu}, --outmode {pipes,sgml,conllu}
Output SGML, conllu or tokenize with pipes
--dialect {auto,sahidic,bohairic}
Coptic dialect of input data (default: auto-detect)
less common options:
-f, --finitestate Use old finite-state tokenizer (less accurate)
-d {0,1,2,3}, --detokenize {0,1,2,3}
Re-group non-standard bound groups (a.k.a.
'laytonize') - 1=normal 2=aggressive 3=smart
--segment_merged When re-grouping bound groups, assume merged groups
have segmentation boundary between them
-q, --quiet Suppress verbose messages
-x EXTENSION, --extension EXTENSION
Extension for SGML mode output files (default: tt)
--stdout Print output to stdout, do not create output file
--para Add <p> tags if <hi rend="ekthetic"> is present
--space Add spaces around punctuation
--from_pipes Tokenization is indicated in input via pipes
--dirout DIROUT Optional output directory (default: this dir)
--meta META Add fixed meta data string read from this file name
--parse_only Only add a parse to an existing tagged SGML input
--no_tok Do not tokenize at all, input is one token per line
--pos_spans Harvest POS tags and lemmas from SGML spans
--merge_parse Merge/add a parse into a ready SGML file
--version Print version number and quit
--treetagger Tag using TreeTagger instead of flair
--marmot Tag using Marmot instead of flair
--malt Parse using MaltParser instead of Diaparser (requires Java)
--no_gold_parse Do not use UD_Coptic cache for gold parses
--processing_meta Add segmentation/tagging/parsing/entities="auto"
--old_testament Use Old Testament identities (Jesus means Jesus son of Naue i.e. Joshua, etc.)
Add norm, lemma, parse, tag, unary tags, find multiword expressions and do language recognition:
> python coptic_nlp.py -penmult infile.txt
Just tokenize a file using pipes and dashes:
> python coptic_nlp.py -o pipes infile.txt
Tokenize with pipes and mark up line breaks, conservatively detokenize bound groups, assume seg boundary at merge site:
> python coptic_nlp.py -b -d 1 --segment_merged -o pipes infile.txt
Normalize, tag, lemmatize, find multiword expressions and parse, splitting sentences by <verse> tags:
> python coptic_nlp.py -pnmlt -s verse infile.txt
Add full analyses to a whole directory of *.xml files, output to a specified directory:
> python coptic_nlp.py -penmult --dirout /home/cop/out/ *.xml
Parse a tagged SGML file into CoNLL tabular format for treebanking, use translation tag to recognize sentences:
> python coptic_nlp.py --no_tok --parse_only --pos_spans -s translation infile.tt
Merge a parse into a tagged SGML file's <norm> tags, use translation tag to recognize sentences:
> python coptic_nlp.py --merge_parse --pos_spans -s translation infile.tt
The pipeline accepts the following kinds of input:
- Plain text, with bound groups separated by underscores or spaces.
- Note that if punctuation has not been separated from bound groups, you can use the
--space
option to attempt to automatically separate punctuation - If your Coptic text represents line breaks as new line characters, you can automatically add line break tags using
-b
/--breaklines
- Gold tokenization information may be present in the input as pipes between part-of-speech bearing units and hyphens between morphemes
- Note that if punctuation has not been separated from bound groups, you can use the
- XML/SGML input, with bound groups separated by underscores or spaces. The script will retain XML tags as-is around Coptic text.
- Coptic Scriptorium style TreeTagger SGML, with normalized units in tags such as .
- This input format is used when adding a parse to an existing .tt file using the
--merge_parse
option - It is also possible to enrich .tt files with additional annotations, e.g. only add language of origin tags to an existing analysis
- This input format is used when adding a parse to an existing .tt file using the
- Plain text input, untokenized, no meaningful line breaks, spaces separating bound groups
Ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡϣⲃⲏⲣ ⲛ̄ⲧⲙⲛⲧⲁⲧⲥⲱⲧⲙ
- XML input with meaningful line breaks and gold tokenization, underscores separating bound groups
Ⲁ|ϥ|ⲥ<hi renf="large">ⲱ</hi>ⲧⲙ_
ⲛϭⲓ|ⲡ|ϣⲃⲏⲣ_ⲛ̄|ⲧ|ⲙⲛⲧ-ⲁ
ⲧ-ⲥⲱⲧⲙ
- TreeTagger SGML input with some analyses, but for example missing the syntactic parse
<lb n="1">
<norm_group norm_group="ⲁϥⲥⲱⲧⲙ" orig_group="ⲁϥⲥⲱⲧⲙ">
<norm norm="ⲁ" orig="Ⲁ" lemma="ⲁ" pos="APST">
Ⲁ
</norm>
<norm norm="ϥ" orig="ϥ" lemma="ⲛⲧⲟϥ" pos="PPERS">
ϥ
</norm>
<norm norm="ⲥⲱⲧⲙ" orig="ⲥⲱⲧⲙ" lemma="ⲥⲱⲧⲙ" pos="V">
ⲥ
<hi rend="large">
ⲱ
</hi>
ⲧⲙ
</norm>
</norm_group>
</lb>
<lb n="2">
<norm_group norm_group="ⲛϭⲓⲡϣⲃⲏⲣ" orig_group="ⲛϭⲓⲡϣⲃⲏⲣ">
<norm norm="ⲛϭⲓ" orig="ⲛϭⲓ" lemma="ⲛϭⲓ" pos="PTC">
ⲛϭⲓ
</norm>
<norm norm="ⲡ" orig="ⲡ" lemma="ⲡ" pos="ART">
ⲡ
</norm>
<norm norm="ϣⲃⲏⲣ" orig="ϣⲃⲏⲣ" lemma="ϣⲃⲏⲣ" pos="N">
ϣⲃⲏⲣ
</norm>
</norm_group>
<norm_group norm_group="ⲛⲧⲙⲛⲧⲁⲧⲥⲱⲧⲙ" orig_group="ⲛ̄ⲧⲙⲛⲧⲁⲧⲥⲱⲧⲙ">
<norm norm="ⲛ" orig="ⲛ̄" lemma="ⲛ" pos="PREP">
ⲛ
</norm>
<norm norm="ⲧ" orig="ⲧ" lemma="ⲡ" pos="ART">
ⲧ
</norm>
<norm norm="ⲙⲛⲧⲁⲧⲥⲱⲧⲙ" orig="ⲙⲛⲧⲁⲧⲥⲱⲧⲙ" lemma="ⲙⲛⲧⲁⲧⲥⲱⲧⲙ" pos="N">
<morph morph="ⲙⲛⲧ">
ⲙⲛⲧ
</morph>
<morph morph="ⲁⲧ">
ⲁ
</lb>
<lb n="3">
ⲧ
</morph>
<morph morph="ⲥⲱⲧⲙ">
ⲥⲱⲧⲙ
</morph>
</norm>
</norm_group>
</lb>
If all requirements are installed correctly, you can verify that modules are working correctly by running the built-in unit tests:
python run_tests.py
We would like to thank Tito Orlandi (CMCL) for contributing Sahidic lexicon data to the project, Hany Takla (Saint Shenouda the Archimandrite Society) for much of the Bohairic data that the tools are based on and Pishoy Georgios for contributing lexical data for Bohairic. A full list of contributors and collaborators in other projects can be found on the Coptic Scriptorium website.