Support Sanskrit

QubitPi · Nov 20, 2024 · c23c545
1 parent 197443a
commit c23c545
Showing 3 changed files with 27 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -9,6 +9,7 @@ language:
   - ko
   - peo
   - akk
+  - sa
 configs:
   - config_name: Languages
     data_files:
@@ -24,22 +25,27 @@ configs:
       path: old-persian-wiktextract-data.jsonl
     - split: Akkadian
       path: akkadian-wiktextract-data.jsonl
+    - split: Sanskrit
+      path: sanskrit-wiktextract-data.jsonl
   - config_name: Graph
     data_files:
     - split: AllLanguage
       path: word-definition-graph-data.jsonl
 tags:
+  - Natural Language Processing
+  - NLP
   - Wiktionary
+  - Vocabulary
   - German
   - Latin
   - Ancient Greek
   - Korean
   - Old Persian
   - Akkadian
-  - Vocabulary
+  - Sanskrit
   - Knowledge Graph
 size_categories:
-  - 1M<n<10M
+  - 100M<n<1B
 ---
 
 Wiktionary Data on Hugging Face Datasets
@@ -61,6 +67,7 @@ supports the following languages:
 - __한국어__ - Korean
 - __𐎠𐎼𐎹__ - [Old Persian](https://en.wikipedia.org/wiki/Old_Persian_cuneiform)
 - __𒀝𒅗𒁺𒌑(𒌝)__ - [Akkadian](https://en.wikipedia.org/wiki/Akkadian_language)
+- __संस्कृतम्__ - Sanskrit, or Classical Sanskrit
 
 [wiktionary-data]() was originally a sub-module of [wilhelm-graphdb](https://github.com/QubitPi/wilhelm-graphdb). As
 the dataset grew bigger, I noticed a wave of more exciting possibilities this dataset can bring about that
@@ -84,11 +91,23 @@ There are __two__ data subsets:
    - `Korean`
    - `OldPersian`
    - `Akkadian`
+   - `Sanskrit`
 
 2. __Graph__ subset that is useful for constructing knowledge graphs:
 
    - `AllLanguage`: all the languages in a giant graph
 
+   The _Graph_ data ontology is the following:
+
+   <div align="center">
+       <img src="ontology.png" width="50%" alt="Error loading ontology.png"/>
+   </div>
+
+> [!TIP]
+>
+> Two words are structurally similar if and only if they share the same\
+> [stem](https://en.wikipedia.org/wiki/Word_stem).
+
 Development
 -----------
 

diff --git a/ontology.png b/ontology.png
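Usage note on the README changes above: a minimal sketch of reading the new `Sanskrit` split from a local copy of `sanskrit-wiktextract-data.jsonl`. It assumes one JSON object per line with the `term`, `part of speech`, `definitions`, and `audios` keys that `extract.py` writes. The helper name `read_split` is hypothetical and not part of this repository; only the standard library is used.

```python
import json


def read_split(path):
    """Yield one dict per non-empty line of a JSONL split file.

    Hypothetical helper: assumes each line is a JSON object shaped like the
    records written by extract.py in this commit.
    """
    with open(path, encoding="utf-8") as split_file:
        for line in split_file:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

Calling `next(read_split("sanskrit-wiktextract-data.jsonl"))` would return the first Sanskrit entry as a dict, provided the file exists in the working directory.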
diff --git a/wiktionary/wiktextract/extract.py b/wiktionary/wiktextract/extract.py
@@ -44,7 +44,8 @@ def extract_data(wiktextract_data_path: str):
           open("ancient-greek-wiktextract-data.jsonl", "w") as ancient_greek,
           open("korean-wiktextract-data.jsonl", "w") as korean,
           open("old-persian-wiktextract-data.jsonl", "w") as old_persian,
-          open("akkadian-wiktextract-data.jsonl", "w") as akkadian
+          open("akkadian-wiktextract-data.jsonl", "w") as akkadian,
+          open("sanskrit-wiktextract-data.jsonl", "w") as sanskrit
     ):
         for line in data:
             vocabulary = json.loads(line)
@@ -81,6 +82,10 @@ def extract_data(wiktextract_data_path: str):
                 if vocabulary["lang"] == "Akkadian":
                     akkadian.write(json.dumps({"term": term, "part of speech": pos, "definitions": definitions, "audios": audios}))
                     akkadian.write("\n")
+                if vocabulary["lang"] == "Sanskrit":
+                    sanskrit.write(json.dumps({"term": term, "part of speech": pos, "definitions": definitions, "audios": audios}))
+                    sanskrit.write("\n")
+
 
 def extract_graph(wiktextract_data_path: str):
     import json
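
Review note on the `extract_data` hunks above: every new language adds one more `open(...)` and one more `if vocabulary["lang"] == ...` branch. The following is a minimal sketch of a table-driven alternative, not the code in this commit. The names `LANGUAGE_FILES` and `write_by_language` are hypothetical, only the two `lang` values visible in the diff are listed, and the sketch assumes each entry has already been reduced to the `lang`, `term`, `part of speech`, `definitions`, and `audios` fields.

```python
import json
from contextlib import ExitStack
from pathlib import Path

# Hypothetical mapping from the "lang" value to its output file.
# Only the two lang values that appear in this diff are included.
LANGUAGE_FILES = {
    "Akkadian": "akkadian-wiktextract-data.jsonl",
    "Sanskrit": "sanskrit-wiktextract-data.jsonl",
}

RECORD_KEYS = ("term", "part of speech", "definitions", "audios")


def write_by_language(entries, out_dir="."):
    """Write each entry to its language's JSONL file; return per-language counts."""
    counts = {lang: 0 for lang in LANGUAGE_FILES}
    with ExitStack() as stack:
        # Open every output file once; ExitStack closes them all on exit.
        handles = {
            lang: stack.enter_context(
                open(Path(out_dir) / name, "w", encoding="utf-8")
            )
            for lang, name in LANGUAGE_FILES.items()
        }
        for entry in entries:
            handle = handles.get(entry["lang"])
            if handle is None:
                continue  # language without a configured split
            record = {key: entry[key] for key in RECORD_KEYS}
            handle.write(json.dumps(record))
            handle.write("\n")
            counts[entry["lang"]] += 1
    return counts
```

With this shape, supporting another language would be a one-line change to the mapping rather than a new branch, at the cost of moving the file names out of the `with` statement.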