EvalSan: Evaluation Toolkit for Sanskrit Embeddings

Official code for the paper "Evaluating Neural Word Embeddings for Sanskrit".

SanEval is a toolkit for evaluating the quality of Sanskrit embeddings. We assess their generalization power by using them as features on a broad and diverse set of tasks. We include a suite of 4 intrinsic tasks (relatedness, similarity, categorization, and analogy) that evaluate which linguistic properties are encoded in word embeddings. Our goal is to ease the study and development of general-purpose fixed-size word representations for Sanskrit.

Dependencies

This code is written in Python and requires Python 3.6. Install the remaining dependencies with:

pip install -r requirements.txt

Evaluation tasks

Intrinsic tasks

SanEval includes a series of intrinsic tasks to evaluate which linguistic properties are encoded in your word embeddings.
Our data uses the SLP1 transliteration scheme. You can change it to another scheme using this code.
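As an illustration of what changing the scheme involves, here is a minimal sketch of an SLP1-to-IAST converter. It is not the conversion script referred to above: the function name `slp1_to_iast` and the mapping table are assumptions made for this example, and only the basic SLP1 letters are covered (no accents or Vedic extensions). Because SLP1 uses exactly one ASCII character per sound, a character-by-character lookup is enough in this direction.

```python
# Hypothetical helper, not part of the toolkit: basic SLP1 -> IAST mapping.
SLP1_TO_IAST = {
    # vowels
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "e": "e", "E": "ai", "o": "o", "O": "au",
    # anusvara and visarga
    "M": "ṃ", "H": "ḥ",
    # stops and nasals
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    # semivowels, sibilants, h
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}

# str.translate needs a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(SLP1_TO_IAST)


def slp1_to_iast(text):
    """Convert SLP1 text to IAST; characters outside the table pass through."""
    return text.translate(_TABLE)


print(slp1_to_iast("saMskftam"))  # saṃskṛtam
print(slp1_to_iast("Darma"))      # dharma
```

The reverse direction (IAST to SLP1) needs more care, since IAST digraphs such as "kh" or "ai" must be matched before single letters.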

Task	Metric	#dev	#test
Relatedness	F-score	4.5k	9k
Similarity	Accuracy	na	3k
Categorization Syntactic	Purity	na	1.1k
Categorization Semantic	Purity	na	150
Analogy Syntactic	Accuracy	na	10k
Analogy Semantic	Accuracy	na	6.4k
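The categorization tasks in the table are scored with purity. As a rough guide to how that metric is usually defined, the sketch below computes purity from a list of predicted cluster ids and a list of gold categories: each cluster is credited with its most frequent gold category, and the credited items are divided by the total. The function name `purity` and the toy data are assumptions for illustration; the clustering algorithm and exact scoring used by the evaluation script are not shown here.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Fraction of items that belong to the majority gold label of their cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty input")

    # Count gold labels inside each predicted cluster.
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        counts_by_cluster[cluster][label] += 1

    # Each cluster contributes the size of its largest gold class.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example: six words, three predicted clusters, two gold categories.
clusters = [0, 0, 0, 1, 1, 2]
gold = ["noun", "noun", "verb", "verb", "verb", "noun"]
score = purity(clusters, gold)
print(round(score, 4))  # 0.8333
```

Purity rises trivially as the number of clusters grows (one cluster per word gives 1.0), so scores are only comparable when the number of clusters is held fixed.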

Pretrained models

You can download the pretrained models from this link. A README.md is provided for each model.
Place the models folder in the parent directory.
Pretrained vectors can be downloaded from this link. Place this folder under EvalSan/evaluations/Intrinsic/. These vectors are used by the evaluation script.
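For a quick sanity check of downloaded vectors outside the evaluation script, something like the following sketch may help. It assumes the vectors are stored in the common word2vec-style text format (an optional "count dimension" header line, then one word followed by its values per line); the actual file format of the pretrained vectors is not specified here, so check the model's README.md first. The names `parse_text_vectors` and `cosine` are hypothetical helpers, and the sample lines are made-up toy data.

```python
import math


def parse_text_vectors(lines):
    """Parse word2vec-style text lines into a {word: [floats]} dictionary."""
    vectors = {}
    for index, line in enumerate(lines):
        parts = line.split()
        # Assumption: a first line with exactly two fields is a header.
        if index == 0 and len(parts) == 2:
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data in SLP1; real files would be read with open(path, encoding="utf-8").
sample = ["3 2", "rAma 1.0 0.0", "sItA 1.0 1.0", "vana 0.0 2.0"]
vectors = parse_text_vectors(sample)
print(round(cosine(vectors["rAma"], vectors["sItA"]), 4))  # 0.7071
```

The header heuristic would misread a one-dimensional vector on the first line as a header; that case does not occur in practice for word embeddings.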

How to train the models

To train the embeddings, run the following command. Please refer to the models folder for more details.

bash train_embeddings.sh

How to run evaluation

To evaluate your word embeddings, run the following command:

bash run_SanEval.sh

Citation

If you use our tool, we'd appreciate it if you cite the following paper:

@inproceedings{sandhan-etal-2023-evaluating,
    title = "Evaluating Neural Word Embeddings for {S}anskrit",
    author = "Sandhan, Jivnesh  and
      Paranjay, Om Adideva  and
      Digumarthi, Komal  and
      Behra, Laxmidhar  and
      Goyal, Pawan",
    booktitle = "Proceedings of the Computational {S}anskrit {\&} Digital Humanities: Selected papers presented at the 18th World {S}anskrit Conference",
    month = jan,
    year = "2023",
    address = "Canberra, Australia (Online mode)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wsc-csdh.2",
    pages = "21--37",
}

License

This project is licensed under the terms of the Apache license 2.0.
