Support Sanskrit
QubitPi committed Nov 20, 2024
1 parent 197443a commit c23c545
Showing 3 changed files with 27 additions and 3 deletions.
23 changes: 21 additions & 2 deletions README.md
@@ -9,6 +9,7 @@ language:
- ko
- peo
- akk
- sa
configs:
- config_name: Languages
data_files:
@@ -24,22 +25,27 @@ configs:
path: old-persian-wiktextract-data.jsonl
- split: Akkadian
path: akkadian-wiktextract-data.jsonl
- split: Sanskrit
path: sanskrit-wiktextract-data.jsonl
- config_name: Graph
data_files:
- split: AllLanguage
path: word-definition-graph-data.jsonl
tags:
- Natural Language Processing
- NLP
- Wiktionary
- Vocabulary
- German
- Latin
- Ancient Greek
- Korean
- Old Persian
- Akkadian
- Vocabulary
- Sanskrit
- Knowledge Graph
size_categories:
- 1M<n<10M
- 100M<n<1B
---

Wiktionary Data on Hugging Face Datasets
@@ -61,6 +67,7 @@ supports the following languages:
- __한국어__ - Korean
- __𐎠𐎼𐎹__ - [Old Persian](https://en.wikipedia.org/wiki/Old_Persian_cuneiform)
- __𒀝𒅗𒁺𒌑(𒌝)__ - [Akkadian](https://en.wikipedia.org/wiki/Akkadian_language)
- __संस्कृतम्__ - Sanskrit, or Classical Sanskrit

[wiktionary-data]() was originally a sub-module of [wilhelm-graphdb](https://github.com/QubitPi/wilhelm-graphdb). As
the dataset grew, I noticed a wave of exciting possibilities it could bring about that
@@ -84,11 +91,23 @@ There are __two__ data subsets:
- `Korean`
- `OldPersian`
- `Akkadian`
- `Sanskrit`

2. __Graph__ subset that is useful for constructing knowledge graphs:

- `AllLanguage`: all the languages in a giant graph

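As a rough illustration of how one of the per-language split files listed above might be consumed, the sketch below reads a JSONL split line by line and filters it by part of speech. The field names (`term`, `part of speech`, `definitions`, `audios`) are the ones written by `extract.py` in this commit. The helper names `read_entries` and `terms_with_pos`, and the `"noun"` value in the usage comment, are assumptions made for illustration only.

```python
import json
from typing import Iterable, Iterator


def read_entries(jsonl_path: str) -> Iterator[dict]:
    """Yield one vocabulary entry per non-empty line of a split file."""
    with open(jsonl_path, encoding="utf-8") as split_file:
        for line in split_file:
            if line.strip():  # tolerate a trailing blank line
                yield json.loads(line)


def terms_with_pos(entries: Iterable[dict], pos: str) -> list:
    """Collect the terms whose "part of speech" field equals `pos`."""
    return [entry["term"] for entry in entries if entry.get("part of speech") == pos]


# Hypothetical usage, assuming the Sanskrit split sits in the working directory
# and that "noun" is one of the part-of-speech values it contains:
#   nouns = terms_with_pos(read_entries("sanskrit-wiktextract-data.jsonl"), "noun")
```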
The _Graph_ data ontology is the following:

<div align="center">
<img src="ontology.png" width="50%" alt="Error loading ontology.png"/>
</div>

> [!TIP]
>
> Two words are structurally similar if and only if they share the same\
> [stem](https://en.wikipedia.org/wiki/Word_stem)
Development
-----------

Binary file added ontology.png
7 changes: 6 additions & 1 deletion wiktionary/wiktextract/extract.py
@@ -44,7 +44,8 @@ def extract_data(wiktextract_data_path: str):
open("ancient-greek-wiktextract-data.jsonl", "w") as ancient_greek,
open("korean-wiktextract-data.jsonl", "w") as korean,
open("old-persian-wiktextract-data.jsonl", "w") as old_persian,
- open("akkadian-wiktextract-data.jsonl", "w") as akkadian
+ open("akkadian-wiktextract-data.jsonl", "w") as akkadian,
+ open("sanskrit-wiktextract-data.jsonl", "w") as sanskrit
):
for line in data:
vocabulary = json.loads(line)
@@ -81,6 +82,10 @@ def extract_data(wiktextract_data_path: str):
if vocabulary["lang"] == "Akkadian":
akkadian.write(json.dumps({"term": term, "part of speech": pos, "definitions": definitions, "audios": audios}))
akkadian.write("\n")
if vocabulary["lang"] == "Sanskrit":
sanskrit.write(json.dumps({"term": term, "part of speech": pos, "definitions": definitions, "audios": audios}))
sanskrit.write("\n")


def extract_graph(wiktextract_data_path: str):
import json
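
The hunk above adds Sanskrit by repeating the `if vocabulary["lang"] == ...` block once more. One possible alternative, offered only as a sketch and not as how `extract.py` is actually written, is to route entries through a language-to-file mapping. The function name `route_by_language` and its `split_files` and `to_record` parameters are hypothetical; `to_record` stands in for the part of `extract_data` that builds the `term`, `part of speech`, `definitions`, and `audios` fields, which this diff does not show.

```python
import json
from contextlib import ExitStack
from typing import Callable, Dict, Iterable


def route_by_language(
    lines: Iterable[str],
    split_files: Dict[str, str],
    to_record: Callable[[dict], dict],
) -> None:
    """Write each entry to the split file registered for its language."""
    with ExitStack() as stack:
        # Open every output file once; ExitStack closes them all on exit.
        writers = {
            lang: stack.enter_context(open(path, "w", encoding="utf-8"))
            for lang, path in split_files.items()
        }
        for line in lines:
            vocabulary = json.loads(line)
            writer = writers.get(vocabulary["lang"])
            if writer is None:
                continue  # language not covered by any split
            writer.write(json.dumps(to_record(vocabulary)))
            writer.write("\n")
```

With this shape, supporting another language would come down to one more mapping entry, for example `"Sanskrit": "sanskrit-wiktextract-data.jsonl"`, instead of a new `open(...)` clause plus a new `if` block.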

0 comments on commit c23c545
