Dataset preparation

Scripts for dataset preparation are located in the scripts/dataset directory and should be run from the root of the repository.

1. Building the dataset

The dataset is downloaded from the open API of Polish Court Judgements. The following procedure downloads the data and stores it in MongoDB. Whenever a script interacts with the outside environment (storing data in MongoDB or pushing files to the Hugging Face Hub), it is run outside DVC. Before downloading, make sure the proper environment variables are set in the .env file (a minimal connection sketch follows the variable listing):

MONGO_URI=<mongo_uri_including_password>
MONGO_DB_NAME="datasets"
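
For reference, a minimal sketch of how these variables are typically consumed, assuming python-dotenv and pymongo (the actual download scripts may handle the connection differently):

    import os

    from dotenv import load_dotenv  # assumes python-dotenv is installed
    from pymongo import MongoClient  # assumes pymongo is installed

    # Load MONGO_URI and MONGO_DB_NAME from the .env file into the environment
    load_dotenv()

    client = MongoClient(os.environ["MONGO_URI"])
    db = client[os.environ.get("MONGO_DB_NAME", "datasets")]

    # e.g. list the collections created by the download scripts
    print(db.list_collection_names())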

Raw dataset

  1. Download judgements metadata - this will store the metadata in the database:

    PYTHONPATH=. python scripts/dataset/download_pl_metadata.py
  2. Download judgement texts (the XML content of judgements) - this will add the content to the database:

    PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
        --data-type content \
        --n-jobs 10
  3. Download additional details available for each judgement - this will add the acquired details to the database:

    PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
        --data-type details \
        --n-jobs 10
  4. Map court and department ids to court names:

    PYTHONPATH=. python scripts/dataset/map_court_dep_id_2_name.py --n-jobs 10

    Remark: The mapping file available at data/datasets/pl/court_id_2_name.csv was prepared based on data published at https://orzeczenia.wroclaw.sa.gov.pl/indices

  5. Extract raw text from the XML content, along with details of judgments not available through the API:

    PYTHONPATH=. python scripts/dataset/extract_pl_xml.py --n-jobs 10
  6. For further processing, prepare a local dataset dump in a parquet file, version it with DVC, and push it to remote storage:

    PYTHONPATH=. python scripts/dataset/dump_pl_dataset.py \
        --file-name data/datasets/pl/raw/raw.parquet \
        --filter-empty-content
    dvc add data/datasets/pl/raw && dvc push
  7. Generate the dataset card for pl-court-raw:

    dvc repro raw_dataset_readme && dvc push
  8. Upload the pl-court-raw dataset (with card) to Hugging Face (see the loading sketch after this list):

    PYTHONPATH=. python scripts/dataset/push_raw_dataset.py --repo-id "JuDDGES/pl-court-raw" --commit-message <commit_message>
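
Once pushed, the raw dataset can be loaded straight from the Hub, e.g. with the datasets library (a minimal sketch; the split name "train" is an assumption and the schema may differ):

    from datasets import load_dataset  # assumes the `datasets` library is installed

    # Stream the raw Polish court dataset pushed in step 8 (split name assumed)
    raw = load_dataset("JuDDGES/pl-court-raw", split="train", streaming=True)
    print(next(iter(raw)))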

Instruction dataset

  1. Generate the instruction dataset and upload it to Hugging Face (pl-court-instruct):

    NUM_JOBS=8 dvc repro build_instruct_dataset
  2. Generate the dataset card for pl-court-instruct:

    dvc repro instruct_dataset_readme && dvc push
  3. Upload the pl-court-instruct dataset card to Hugging Face (see the verification sketch after this list):

    PYTHONPATH=. python scripts/dataset/push_instruct_readme.py --repo-id JuDDGES/pl-court-instruct
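
After uploading, the pushed card can be verified by fetching it back from the Hub (a minimal sketch using huggingface_hub, assuming you are authenticated):

    from huggingface_hub import hf_hub_download  # assumes huggingface_hub is installed

    # Download the README (dataset card) pushed above and print its local path
    path = hf_hub_download(
        repo_id="JuDDGES/pl-court-instruct",
        filename="README.md",
        repo_type="dataset",
    )
    print(path)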

Graph dataset

  1. Embed judgments with a pre-trained language model (documents are chunked and embeddings are computed per chunk):

    CUDA_VISIBLE_DEVICES=<device_number> dvc repro embed
  2. Aggregate chunk embeddings into document embeddings (see the pooling sketch after this list):

    NUM_PROC=4 dvc repro embed aggregate_embeddings
  3. Optionally, ingest the data into MongoDB (e.g. for vector search):

    PYTHONPATH=. python scripts/embed/ingest.py --embeddings-file <embeddings_file>
  4. Generate the graph dataset:

    dvc repro embed build_graph_dataset
  5. Generate the dataset card and upload it to Hugging Face (remember to be logged in to Hugging Face or set the HUGGING_FACE_HUB_TOKEN env variable):

    PYTHONPATH=. python scripts/dataset/upload_graph_dataset.py \
        --root-dir <dir_to_dataset> \
        --repo-id JuDDGES/pl-court-graph \
        --commit-message <message>
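
For intuition, the aggregation in step 2 boils down to pooling per-chunk vectors into a single vector per document. Below is a minimal illustrative sketch using mean pooling (the actual aggregate_embeddings stage may use a different strategy):

    import numpy as np

    def aggregate_chunk_embeddings(chunk_embeddings: list[np.ndarray]) -> np.ndarray:
        """Mean-pool per-chunk embeddings into one document embedding."""
        return np.stack(chunk_embeddings).mean(axis=0)

    # Toy example: three 4-dimensional chunk embeddings for one judgment
    chunks = [np.random.rand(4) for _ in range(3)]
    doc_embedding = aggregate_chunk_embeddings(chunks)
    print(doc_embedding.shape)  # (4,)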