Official code for the paper "Evaluating Neural Word Embeddings for Sanskrit".
SanEval is a toolkit for evaluating the quality of Sanskrit embeddings. We assess their generalization power by using them as features on a broad and diverse set of tasks. We include a suite of 4 intrinsic tasks that evaluate which linguistic properties are encoded in word embeddings. Our goal is to ease the study and development of general-purpose fixed-size word representations for Sanskrit.
This code is written in Python. The dependencies are:
- Python 3.6
- The packages listed in `requirements.txt`, installed with `pip install -r requirements.txt`
- SanEval includes a series of intrinsic tasks that evaluate which linguistic properties are encoded in your word embeddings.
- We use the `SLP1` transliteration scheme for our data. You can change it to another scheme using this code.
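Because SLP1 assigns exactly one ASCII character to each Sanskrit phoneme, converting it to another scheme can be done with a per-character lookup. The sketch below maps SLP1 to IAST; the function name `slp1_to_iast` and the table are illustrative assumptions and are not the conversion code shipped with this repository. Accent marks and other extended SLP1 symbols are not handled.

```python
# Hypothetical helper (not part of SanEval): SLP1 -> IAST by character lookup.
SLP1_TO_IAST = {
    # vowels
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "e": "e", "E": "ai", "o": "o", "O": "au",
    # anusvara and visarga
    "M": "ṃ", "H": "ḥ",
    # stops and nasals
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    # semivowels, sibilants, and h
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}


def slp1_to_iast(text: str) -> str:
    """Convert SLP1 text to IAST; unknown characters pass through unchanged."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(slp1_to_iast("rAma"))
    print(slp1_to_iast("kfzRa"))
```

The reverse direction (IAST to SLP1) needs more care, since IAST digraphs such as `kh` or `ai` must be matched before their single-letter prefixes.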
Task | Metric | #dev | #test |
---|---|---|---|
Relatedness | F-score | 4.5k | 9k |
Similarity | Accuracy | na | 3k |
Categorization Syntactic | Purity | na | 1.1k |
Categorization Semantic | Purity | na | 150 |
Analogy Syntactic | Accuracy | na | 10k |
Analogy Semantic | Accuracy | na | 6.4k |
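The categorization tasks in the table are scored with purity. As a minimal sketch of that metric in its standard form (each cluster is credited with its most frequent gold category, and the credits are divided by the number of words), the function below takes predicted cluster ids and gold labels. The name `purity` and its inputs are assumptions for illustration; the evaluation script in this repository may compute it differently.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Standard clustering purity: sum of per-cluster majority counts / N."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each gold label occurs inside each predicted cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        label_counts[cluster][label] += 1

    # Each cluster contributes the size of its largest gold-label group.
    majority_total = sum(max(c.values()) for c in label_counts.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    # Toy example: two clusters of three words each, one mistake per cluster.
    clusters = [0, 0, 0, 1, 1, 1]
    gold = ["noun", "noun", "verb", "verb", "verb", "noun"]
    print(purity(clusters, gold))
```

In the toy example each cluster has two words from its majority category, so the score is 4 out of 6.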
- You can download the pretrained models from this link. A `README.md` is given for each model.
- Place the `models` folder in the parent directory path.
- Pretrained vectors can be downloaded from this link. Place this folder in the `EvalSan/evaluations/Intrinsic/` path. These vectors are used by the evaluation script.
Please refer to the `models` folder for more details. To train the embeddings, run the following command:
bash train_embeddings.sh
To evaluate your word embeddings, run the following command:
bash run_SanEval.sh
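For readers who want a feel for what an analogy accuracy measures, here is a small, dependency-free sketch of the common vector-offset method: for a question a : b :: c : ?, the predicted word is the one whose vector is closest in cosine similarity to b - a + c, excluding the three question words. The function names and the two-dimensional toy vectors are assumptions made up for this illustration; this is not the code that `run_SanEval.sh` executes.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the question words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped.
    """
    usable = [q for q in questions if all(w in vectors for w in q)]
    if not usable:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in usable)
    return correct / len(usable)


# Toy 2-d vectors for SLP1 words, invented purely for illustration.
TOY_VECTORS = {
    "nara": [1.0, 0.0],
    "nArI": [1.0, 1.0],
    "rAjA": [3.0, 0.0],
    "rAjYI": [3.0, 1.0],
    "vana": [0.0, 1.0],
}
TOY_QUESTIONS = [
    ("nara", "nArI", "rAjA", "rAjYI"),
    ("rAjA", "rAjYI", "nara", "nArI"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_VECTORS, "nara", "nArI", "rAjA"))
    print(analogy_accuracy(TOY_VECTORS, TOY_QUESTIONS))
```

Skipping out-of-vocabulary questions is one design choice among several; counting them as errors instead gives a stricter score, so it is worth checking which convention an evaluation uses before comparing numbers.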
If you use our tool, we would appreciate it if you cite the following paper:
@inproceedings{sandhan-etal-2023-evaluating,
title = "Evaluating Neural Word Embeddings for {S}anskrit",
author = "Sandhan, Jivnesh and
Paranjay, Om Adideva and
Digumarthi, Komal and
Behera, Laxmidhar and
Goyal, Pawan",
booktitle = "Proceedings of the Computational {S}anskrit {\&} Digital Humanities: Selected papers presented at the 18th World {S}anskrit Conference",
month = jan,
year = "2023",
address = "Canberra, Australia (Online mode)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.wsc-csdh.2",
pages = "21--37",
}
This project is licensed under the terms of the Apache License 2.0.