Graph data in separate config
QubitPi committed Nov 20, 2024
1 parent e240108 commit 4963e7a
Showing 3 changed files with 17 additions and 1 deletion.
6 changes: 5 additions & 1 deletion .github/workflows/ci-cd.yaml
@@ -79,8 +79,12 @@ jobs:
git lfs install
git lfs track "*-wiktextract-data.jsonl"
git lfs track "word-definition-graph-data.jsonl"
git add *-wiktextract-data.jsonl
git commit -m "Extract raw-wiktextract-data.jsonl into per-language wiktextract-data.jsonl"
git add word-definition-graph-data.jsonl
git commit -m "Extract raw-wiktextract-data.jsonl into per-language wiktextract-data.jsonl and generate graph data"
git push https://QubitPi:$HF_TOKEN@huggingface.co/datasets/QubitPi/wiktionary-data master:main -f
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
11 changes: 11 additions & 0 deletions README.md
@@ -24,6 +24,10 @@ configs:
path: old-persian-wiktextract-data.jsonl
- split: Akkadian
path: akkadian-wiktextract-data.jsonl
- config_name: Graph
data_files:
- split: All
path: word-definition-graph-data.jsonl
tags:
- Wiktionary
- German
@@ -33,6 +37,7 @@ tags:
- Old Persian
- Akkadian
- Vocabulary
- Knowledge Graph
size_categories:
- 1M<n<10M
---
@@ -78,6 +83,12 @@ The available splits are
- `OldPersian`
- `Akkadian`

In addition, a separate split for graph data is offered:

- `GraphData`

This split covers all languages and combines their word-definition data into a single large graph.
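
As a rough illustration of how the graph data could be loaded, here is a minimal sketch using the Hugging Face `datasets` library. It assumes the config name `Graph` and the split name `All` declared in the YAML front matter of this same diff, and the repository ID `QubitPi/wiktionary-data` taken from the push URL in the workflow. The helper name `load_graph_split` is hypothetical and not part of the repository.

```python
def load_graph_split(repo_id="QubitPi/wiktionary-data", config="Graph", split="All"):
    """Load the graph data from the Hugging Face Hub (requires network access).

    The default config and split names are assumptions taken from the
    dataset card's YAML front matter; adjust them if the card changes.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, config, split=split)


if __name__ == "__main__":
    graph = load_graph_split()
    print(graph)
```

Keeping the names as keyword defaults makes it easy to switch to a different config or split without editing the function body.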

Development
-----------

1 change: 1 addition & 0 deletions extract.py
@@ -23,3 +23,4 @@
args = vars(parser.parse_args())

extract_data(args["input"])
extract_graph(args["input"])
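
The diff does not show the body of `extract_graph`. The following is only a sketch of what such a step might look like, assuming the input is wiktextract JSONL where each record carries a `word` and a list of `senses` with `glosses`. The output record shape (`word`, `definition`) and the function names `iter_definition_edges` and `extract_graph_sketch` are hypothetical, not the repository's actual implementation.

```python
import json


def iter_definition_edges(lines):
    """Yield one {"word", "definition"} record per gloss in wiktextract JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        word = entry.get("word")
        if not word:
            continue  # records without a headword are skipped
        for sense in entry.get("senses", []):
            for gloss in sense.get("glosses", []):
                # Hypothetical edge shape: word -> definition
                yield {"word": word, "definition": gloss}


def extract_graph_sketch(input_path, output_path="word-definition-graph-data.jsonl"):
    """Hypothetical stand-in for extract_graph: write one edge per line as JSONL."""
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for edge in iter_definition_edges(src):
            dst.write(json.dumps(edge, ensure_ascii=False) + "\n")
```

Separating the edge generator from the file handling keeps the parsing logic testable on in-memory lines without touching the file system.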
