asahala/BabyLemmatizer

BabyLemmatizer 2.2

State-of-the-art neural part-of-speech tagger and lemmatizer fine-tuned for cuneiform languages such as Akkadian, Sumerian and Urartian. BabyLemmatizer models also exist for other ancient languages such as Ancient Greek.

BabyLemmatizer is fully based on OpenNMT, which makes it simpler to use than the previous BabyLemmatizer version, which depended on an outdated version of TurkuNLP with some problematic dependencies. At its current stage, BabyLemmatizer can be used for part-of-speech tagging and lemmatization of transliterated Akkadian texts. Unlike the old version, BabyLemmatizer uses an unindexed character-based representation for syllabic signs and sign-based tokenization for logograms, which maximizes its capability to discriminate between predictable and suppletive grapheme-to-phoneme relations. For network architecture and encoding of the input sequences, see this description. For the user manual of BabyLemmatizer, click here.

Brief description

BabyLemmatizer approaches POS-tagging and lemmatization as a machine translation task. It features a POS-tagger and lemmatizer that combine the strengths of encoder-decoder neural networks (i.e. predicting analyses for unseen word forms) with statistical and heuristic dictionary-based methods, which are currently used only lightly to post-correct the annotations and score their reliability. BabyLemmatizer is useful for making Akkadian texts searchable and usable for other natural language processing tasks, such as building word embeddings, as transliterated texts are practically impossible to use efficiently due to the orthographic and morphological complexity of the language.

Transliteration   Lemma     POS-tag
IMIN{+et}         sebe      NU
a-di              adi       PRP
IMIN{+et}         sebe      NU
a-ra-an-šu        arnu      N
pu-uṭ-ri          paṭāru    V
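
The rows above map naturally onto CoNLL-U, the file format BabyLemmatizer takes as input (see Lemmatization further down). As a minimal sketch, assuming the standard ten tab-separated CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, ...), the helper below pulls the form, lemma and tags out of annotated lines. The function name read_annotations and the sample lines are illustrative only and are not part of BabyLemmatizer; which of UPOS and XPOS carries the tag you care about depends on your data.

```python
def read_annotations(lines):
    """Yield (form, lemma, upos, xpos) tuples from CoNLL-U lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence breaks (empty lines) and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a token line
        yield fields[1], fields[2], fields[3], fields[4]


# Hypothetical sample; real files come from your own corpus.
sample = [
    "# hypothetical example sentence",
    "1\ta-di\tadi\tPRP\tPRP\t_\t_\t_\t_\t_",
    "2\tpu-uṭ-ri\tpaṭāru\tV\tV\t_\t_\t_\t_\t_",
    "",
]

for form, lemma, upos, xpos in read_annotations(sample):
    print(form, lemma, xpos)
```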

Requirements

  1. OpenNMT-py
  2. Python 3.9+

BabyLemmatizer has been tested with Python 3.9 and OpenNMT-py 3.2.0, and it is highly recommended to use these versions for the virtual environment. OpenNMT-py versions other than 3.2.0 might cause unexpected crashes. See the instructions below for installing the supported version.

Setting up BabyLemmatizer

The easiest way to get BabyLemmatizer running is to create a Python 3.9 virtual environment for OpenNMT-py. This ensures that all necessary requirements stay installed and do not conflict with your other libraries. This is fairly simple to do:

  1. make a directory and go there (you need to use this path later)
  2. python3.9 -m venv OpenNMT
  3. source OpenNMT/bin/activate
  4. pip install --upgrade pip
  5. pip install OpenNMT-py==3.2.0

Then you need to clone the BabyLemmatizer repository and edit preferences.py to add paths to the virtual environment and OpenNMT binaries. Note that onmt_path should point to the directory where OpenNMT keeps the files build_vocab.py and train.py, for example:

python_path = '/yourpath/OpenNMT/bin/'
onmt_path = '/yourpath/OpenNMT/lib/python3.9/site-packages/onmt/bin/'

Now you can run preferences.py. If lots of OpenNMT documentation prints on your screen, everything should be okay.
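
If you want to double-check the onmt_path value before running anything, a small sanity check along the following lines may help. This is only a sketch: the helper missing_onmt_scripts is hypothetical and not part of BabyLemmatizer, and it checks nothing beyond the presence of the two files named above.

```python
import os


def missing_onmt_scripts(onmt_path, required=("build_vocab.py", "train.py")):
    """Return the required script names that are not found in onmt_path."""
    return [name for name in required
            if not os.path.isfile(os.path.join(onmt_path, name))]


# Replace this with the value you set in preferences.py.
onmt_path = "/yourpath/OpenNMT/lib/python3.9/site-packages/onmt/bin/"
missing = missing_onmt_scripts(onmt_path)
print("OK" if not missing else "Missing: " + ", ".join(missing))
```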

Pretrained models

The following pretrained models are available for version 2.1 (and newer): Sumerian (includes two sub-models: literary and administrative), Neo-Assyrian, Middle Assyrian (augmented model), First Millennium Babylonian (Late and Neo-Babylonian, Standard Babylonian), Second Millennium Babylonian (e.g. Middle Babylonian), Urartian, Latin (demo), Ancient Greek (demo).

To use these models, clone or download the repository you want and extract the .tar.gz file to the models directory, e.g. tar -xf sumerian-lit.tar.gz. You can rerun the evaluation with python babylemmatizer.py --evaluate=sumerian-lit. If you want to use a custom model path, see the command-line parameters below for how to specify it. To lemmatize a text, see the instructions later in this file and the User Manual.
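
The extraction step can also be done from Python with the standard tarfile module. The sketch below assumes the archive unpacks into a single model directory and that models is the target folder; extract_model is a hypothetical helper, not something BabyLemmatizer ships. Only extract archives from sources you trust.

```python
import tarfile
from pathlib import Path


def extract_model(archive, models_dir="models"):
    """Extract a model archive into models_dir and return its top-level names."""
    target = Path(models_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:*") as tar:
        # Collect the top-level entries so the caller can see what was unpacked.
        names = sorted({member.name.split("/")[0] for member in tar.getmembers()})
        tar.extractall(target)
    return names


# Example call (the archive name is taken from the text above):
# extract_model("sumerian-lit.tar.gz")
```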

Version compatibility

Models trained with 2.0 are compatible with 2.1 and 2.2.

User Manual

See the BabyLemmatizer Manual for more in-depth instructions on how to lemmatize the demo text, train models and evaluate them.

Command line use

At present, BabyLemmatizer should be used via command line instead of calling it directly in Python.

Lemmatization

To lemmatize an unlemmatized corpus, run the following command:

python3 babylemmatizer.py --lemmatize=modelname --filename=corpus_file

where corpus_file points to the CoNLL-U file (e.g. input/example.conllu) you want to lemmatize and modelname to the model you want to use. Lemmatization is done on GPU by default, but if you don't have a CUDA-capable GPU, you can add the parameter --use-cpu. If you use a custom model directory, remember to add the --model-path=yourpath argument.
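
If you drive the lemmatizer from a script, one option is to assemble the command as an argument list and hand it to subprocess. The sketch below uses only the flags described above; build_lemmatize_command is a hypothetical wrapper, and it assumes that the script is run from the repository root with the Python interpreter of your virtual environment.

```python
import subprocess
import sys


def build_lemmatize_command(model, filename, use_cpu=False, model_path=None,
                            script="babylemmatizer.py"):
    """Assemble the argument list for a lemmatization run."""
    cmd = [sys.executable, script,
           f"--lemmatize={model}",
           f"--filename={filename}"]
    if use_cpu:
        cmd.append("--use-cpu")
    if model_path is not None:
        cmd.append(f"--model-path={model_path}")
    return cmd


cmd = build_lemmatize_command("sumerian-lit", "myworkpath/corpus_file", use_cpu=True)
print(" ".join(cmd))

# To actually start the run from the repository root:
# subprocess.run(cmd, check=True)
```

Building each parameter as a single name=value string also sidesteps the missing = problem listed under Bugs.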

It is recommended to keep the file you are lemmatizing in a working directory of its own, because the lemmatizer produces several output files. For example, if your unlemmatized CoNLL-U file is in myworkpath/, use --filename=myworkpath/corpus_file. For more information about lemmatization, see the BabyLemmatizer Manual.

Training and evaluation

Training and evaluation can be done using the babylemmatizer.py command-line interface. The interface is purposefully simple and does not give the user direct access to any additional parameters.

GENERAL PARAMETERS (use only one)
--build=<arg>                  Builds data from CoNLL-U files in your conllu folder
--train=<arg>                  Trains a model or models from the built data
--build-train=<arg>            Builds data and trains a model or models
--evaluate=<arg>               Evaluates and cross-validates your model or models
--evaluate-fast=<arg>          Rerun evaluation without running POS-tagger and lemmatizer
                               (only post-corrections are applied, useful if you want to tweak the override lexicon
                                or if you just want to quickly see the evaluation results of your model again.
                                You must run --evaluate at least once for your model before using --evaluate-fast)

PATH AND OPTIONS
--use-cpu                      Use CPU instead of GPU (read more below)
--conllu-path=<arg>            Path where to read CoNLL-U files
--model-path=<arg>             Path where to save/read models

OPTIONAL OPTIONS FOR --build and --build-train
--tokenizer=<arg>              Select input tokenization type when you use --build or --build-train (default = 0)
                               0 : Partly unindexed logo-syllabic tokenization (Akkadian, Elamite, Hittite, Urartian, Hurrian)
                               1 : Indexed logo-syllabic (Sumerian)
                               2 : Character sequences (Non-cuneiform languages, like Greek, Latin, Sanskrit etc.)
--lemmatizer-context=<arg>     Number of surrounding XPOS tags used in lemmatization (default = 1)
--tagger-context=<arg>         Number of surrounding forms used in tagging (default = 2)

All the general parameters take one mandatory argument, which points to the data in your conllu folder if you are building new data, or to your models folder if you are training or evaluating models. For example, if you have the CoNLL-U files assyrian-train.conllu, assyrian-dev.conllu and assyrian-test.conllu and want to build data and train models for them, you can call BabyLemmatizer with python babylemmatizer.py --build-train=assyrian. If you want to train several models for n-fold cross-validation, you can have train/dev/test CoNLL-U files with prefixes followed by numbers, e.g. with n=10 assyrian0, assyrian1, ..., assyrian9, and use the command python babylemmatizer.py --build-train=assyrian*. Similarly, to cross-validate these models after training, use python babylemmatizer.py --evaluate=assyrian*.
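
Before starting a long cross-validation run, it can be worth checking that every fold has its three files. The sketch below assumes the fold files follow the same naming pattern as the single-model case, i.e. assyrian0-train.conllu, assyrian0-dev.conllu, assyrian0-test.conllu and so on. That pattern is inferred from the example above and is not confirmed here, and missing_fold_files is a hypothetical helper.

```python
from pathlib import Path


def missing_fold_files(conllu_dir, prefix, n_folds, parts=("train", "dev", "test")):
    """List expected per-fold CoNLL-U files that are absent from conllu_dir."""
    base = Path(conllu_dir)
    # Assumed naming scheme: <prefix><fold>-<part>.conllu
    expected = [f"{prefix}{fold}-{part}.conllu"
                for fold in range(n_folds) for part in parts]
    return [name for name in expected if not (base / name).is_file()]


# Show the first few missing files (if any) for a 10-fold setup.
print(missing_fold_files("conllu", "assyrian", 10)[:3])
```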

Note that --tokenizer, --lemmatizer-context and --tagger-context take effect only when you build the model. They do nothing if used with --evaluate or --lemmatize, as the tokenization and context window preferences are saved in your model.
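
To give a rough idea of what a context window means here, the sketch below collects the neighbouring forms around each word. It is purely illustrative: it assumes the window is symmetric (the same number of forms on each side) and uses a made-up padding symbol. How BabyLemmatizer actually encodes context is covered in the architecture description linked at the top, not by this code.

```python
def context_windows(forms, size=2, pad="<pad>"):
    """Return (left, form, right) triples with `size` neighbours per side."""
    # Pad both ends so that edge words still get a full-width window.
    padded = [pad] * size + list(forms) + [pad] * size
    windows = []
    for i, form in enumerate(forms):
        center = i + size
        left = padded[center - size:center]
        right = padded[center + 1:center + 1 + size]
        windows.append((left, form, right))
    return windows


for left, form, right in context_windows(["a-di", "IMIN{+et}", "a-ra-an-šu"], size=1):
    print(left, form, right)
```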

Using CPU: If you want to use CPU instead of GPU (e.g. if you get a CUDA error), use the parameter --use-cpu together with --train, --build-train or --evaluate. Note that training models on CPU is extremely slow and may take days depending on your training data size and hardware. However, you can lemmatize new texts on CPU without too much waiting.

On the first run, OpenNMT may take a while to initialize (up to a few minutes depending on your system).

Performance

For 10-fold cross-validated results see Sahala & Lindén (2023). The table below summarizes the performance of the pretrained models. Full lacunae are not counted in the results as labeling them is trivial. Gothic, Greek and Latin data come from PROIEL.

             Gothic  Greek  Latin  Sum-L  Sum-A  Bab-1st  Bab-2nd  Neo-Ass  Mid-Ass  Urartian
Tagger       95.54   97.22  95.38  94.00  96.48  96.84    97.85    97.49    96.76    96.51
Lemmatizer   95.36   97.62  96.49  93.70  95.42  95.36    94.59    95.44    94.46    93.96
OOV-rate     12.44   11.03  10.58  19.04  5.44   6.63     13.04    9.51     10.21    8.26

Sum-L = Sumerian literary, Sum-A = Sumerian administrative, Bab-1st = first millennium Babylonian, Bab-2nd = second millennium Babylonian, Neo-Ass = Neo-Assyrian, Mid-Ass = Middle Assyrian.

Citations

If you use BabyLemmatizer for annotating your data or training new models, please cite Sahala and Lindén 2023:

@inproceedings{sahala2023babylemmatizer,
  title={A Neural Pipeline for Lemmatizing and POS-tagging Cuneiform Languages},
  author={Sahala, A. J. Aleksi and Lindén, Krister},
  booktitle={Proceedings of the Ancient Language Processing Workshop at the 14th International Conference on Recent Advances in Natural Language Processing RANLP 2023},
  pages={203-212},
  year={2023}
}

For use-cases of the earlier version of BabyLemmatizer, see Sahala et al. 2022.

Upcoming features

In order of priority:

  • Advanced command-line use (tuning the neural net, customizing folders etc)
  • Phonological transcription
  • Morphological analysis
  • Named-entity recognition
  • Direct Oracc ATF support (E. Pagé-Perron has perhaps already done ATF<->CoNLL-U scripts)
  • Server-side use

Bugs

  • If the user forgets the = between a parameter and its argument on the command line, things go wrong
  • CoNLL-U line write protection DOES NOT WORK! Do not use it or you will get weird results (as of August 2024)
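
Until the missing = problem is fixed, a wrapper script can catch it before the lemmatizer is called. The sketch below simply flags any value-taking parameter from this README that appears without =<arg>. The helper flags_missing_equals is hypothetical and not part of BabyLemmatizer.

```python
# Parameters that this README documents as taking an argument.
VALUE_FLAGS = {
    "--build", "--train", "--build-train", "--evaluate", "--evaluate-fast",
    "--lemmatize", "--filename", "--conllu-path", "--model-path",
    "--tokenizer", "--lemmatizer-context", "--tagger-context",
}


def flags_missing_equals(argv):
    """Return value-taking flags that were given without '=<arg>'."""
    return [arg for arg in argv if arg in VALUE_FLAGS]


bad = flags_missing_equals(
    ["--lemmatize", "sumerian-lit", "--filename=input/example.conllu"]
)
print(bad)  # ['--lemmatize']
```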

Todo

  • Simplify commandline parameters, --lemmatize is a bit illogical now
  • Make filepaths and star expressions more robust
  • Fix missing = causing problems
  • conf score-wise evaluation
  • global version and logger
  • publish data splitter for conlluplus
  • make data merger and augmentation scripts public
  • remove unused code
  • Oracc guideword field for conllu+
  • [DONE] --lemmatize to work with conlluplus
  • [DONE] lemmatization cycle as automatic as possible
  • [DONE] write-protected fields
  • [DONE] model versioning
  • [DONE but needs CMD params] adjustable context window
  • [DONE] normalize function for conlluplus
  • [DONE] force determinative capitalization for train data

If willpower

  • rewrite conllu+ class in a way that updates can be done on the fly instead of reiterating
  • tag confusion matrix
  • tag-wise evaluation
  • category-wise evaluation (logo/logosyll/syll)
  • add possibility to use external validation set

Latest updates

  • 2.2 (2024-06-07): --lemmatizer-context and --tagger-context parameters can be used to adjust how much context information is taken into account in tagging and lemmatization. These parameters are used only with the --build and --build-train parameters. The settings are saved in the model's config.yaml.
  • 2.1 (2023-09-05): --tokenizer parameter; models now remember which tokenization to use if it is defined during --build. Models created in version 2.0 will use logo-syllabic tokenization by default, unless you create a file config.yaml in your model directory (the same place where the yaml files are) and type tokenizer=2 on the first line (see command line parameters for possible values).
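
For version 2.0 models, the config.yaml described in the 2.1 entry can also be created from Python. This is a minimal sketch under the assumption that the file needs nothing but the single tokenizer line. The helper set_legacy_tokenizer is hypothetical, and it deliberately refuses to overwrite an existing config.yaml, since newer models store further settings there.

```python
from pathlib import Path


def set_legacy_tokenizer(model_dir, tokenizer=2):
    """Create config.yaml with a tokenizer line if the file does not exist yet."""
    if tokenizer not in (0, 1, 2):
        raise ValueError("tokenizer must be 0, 1 or 2")
    config = Path(model_dir) / "config.yaml"
    if config.exists():
        return False  # leave an existing configuration untouched
    config.write_text(f"tokenizer={tokenizer}\n", encoding="utf-8")
    return True


# Example call; the model directory name is a placeholder:
# set_legacy_tokenizer("models/mymodel", tokenizer=2)
```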

Data-openness Disclaimer

This tool was made possible by open data, namely thousands of work-hours invested in annotating Oracc projects. If you use BabyLemmatizer for your dataset, it is HIGHLY advised that you share your corpus openly (e.g. under CC BY-SA). Sitting on a corpus denies it the recognition it could have if it were distributed openly. Just be sure to publish a paper describing your data to ensure academically valued citations.