Scripts for dataset preparation are located in the `scripts/dataset` directory and should be run from the root of the repository.
The dataset was downloaded from the open API of Polish Court Judgements. The following procedure will download the data and store it in MongoDB. Whenever a script interacts with the outside environment (storing data in MongoDB or pushing files to huggingface-hub), it is run outside dvc.
Prior to downloading, make sure you have the proper environment variables set in the `.env` file:

```
MONGO_URI=<mongo_uri_including_password>
MONGO_DB_NAME="datasets"
```
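These variables are read by the download scripts. As a quick connectivity check you can run a minimal sketch like the one below (it assumes `pymongo` and `python-dotenv` are installed; neither is prescribed by this section):

```python
# Minimal MongoDB connectivity check -- a sketch, not part of the pipeline.
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # picks up MONGO_URI and MONGO_DB_NAME from .env

client = MongoClient(os.environ["MONGO_URI"])
db = client[os.environ["MONGO_DB_NAME"]]
print(db.list_collection_names())  # succeeds if the credentials are correct
```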
- Download judgements metadata - this will store metadata in the database:

  ```shell
  PYTHONPATH=. python scripts/dataset/download_pl_metadata.py
  ```
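  After the script finishes you can peek at what was stored, reusing the `db` handle from the snippet above (the collection name here is an assumption; `list_collection_names()` shows the real ones):

  ```python
  from pprint import pprint

  collection = db["judgements"]  # hypothetical collection name
  print(collection.count_documents({}))
  pprint(collection.find_one())
  ```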
- Download judgements text (XML content of judgements) - this will alter the database with content:

  ```shell
  PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
      --data-type content \
      --n-jobs 10
  ```
- Download additional details available for each judgement - this will alter the database with the acquired details:

  ```shell
  PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
      --data-type details \
      --n-jobs 10
  ```
- Map id of courts and departments to court name:

  ```shell
  PYTHONPATH=. python scripts/dataset/map_court_dep_id_2_name.py --n-jobs 10
  ```

  Remark: The file with the mapping, available at `data/datasets/pl/court_id_2_name.csv`, was prepared based on data published at https://orzeczenia.wroclaw.sa.gov.pl/indices.
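  Since the mapping is a plain CSV file, you can inspect it directly, e.g. with pandas (the column names are not documented here, so check the header):

  ```python
  import pandas as pd

  mapping = pd.read_csv("data/datasets/pl/court_id_2_name.csv")
  print(mapping.head())  # inspect the id and name columns
  ```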
- Extract raw text from the XML content, along with details of judgements not available through the API:

  ```shell
  PYTHONPATH=. python scripts/dataset/extract_pl_xml.py --n-jobs 10
  ```
- For further processing, prepare a local dataset dump in a parquet file, version it with dvc, and push it to remote storage:

  ```shell
  PYTHONPATH=. python scripts/dataset/dump_pl_dataset.py \
      --file-name data/datasets/pl/raw/raw.parquet \
      --filter-empty-content
  dvc add data/datasets/pl/raw && dvc push
  ```
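  A quick sanity check of the dump, e.g. with pandas (a sketch; the exact schema is best inspected from the file itself):

  ```python
  import pandas as pd

  df = pd.read_parquet("data/datasets/pl/raw/raw.parquet")
  print(df.shape)
  print(df.columns.tolist())  # inspect the actual schema
  ```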
- Generate dataset card for `pl-court-raw`:

  ```shell
  dvc repro raw_dataset_readme && dvc push
  ```
- Upload `pl-court-raw` dataset (with card) to huggingface:

  ```shell
  PYTHONPATH=. python scripts/dataset/push_raw_dataset.py \
      --repo-id "JuDDGES/pl-court-raw" \
      --commit-message <commit_message>
  ```
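  Once pushed, the dataset can be loaded back with the `datasets` library (the `train` split name is an assumption):

  ```python
  from datasets import load_dataset

  raw = load_dataset("JuDDGES/pl-court-raw", split="train")  # split name is an assumption
  print(raw)
  ```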
- Generate instruction dataset and upload it to huggingface (`pl-court-instruct`):

  ```shell
  NUM_JOBS=8 dvc repro build_instruct_dataset
  ```
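  To verify the upload, one record can be printed (again, the split name is an assumption):

  ```python
  from datasets import load_dataset

  instruct = load_dataset("JuDDGES/pl-court-instruct", split="train")  # split name is an assumption
  print(instruct[0])  # inspect the instruction/output fields
  ```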
- Generate dataset card for `pl-court-instruct`:

  ```shell
  dvc repro instruct_dataset_readme && dvc push
  ```
- Upload `pl-court-instruct` dataset card to huggingface:

  ```shell
  PYTHONPATH=. python scripts/dataset/push_instruct_readme.py --repo-id JuDDGES/pl-court-instruct
  ```
- Embed judgements with a pre-trained language model (documents are chunked and embeddings are computed per chunk):

  ```shell
  CUDA_VISIBLE_DEVICES=<device_number> dvc repro embed
  ```
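  Conceptually, the stage performs per-chunk encoding along the lines of the sketch below; the actual model and chunking are configured in the dvc `embed` stage, so everything here (model choice, chunk size) is an assumption:

  ```python
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("sdadas/mmlw-roberta-large")  # hypothetical choice of Polish encoder

  def chunk(text: str, size: int = 512) -> list[str]:
      # naive fixed-size character chunking, purely for illustration
      return [text[i : i + size] for i in range(0, len(text), size)]

  judgement_text = "..."  # full raw text of one judgement
  chunk_embeddings = model.encode(chunk(judgement_text))  # shape: (n_chunks, dim)
  ```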
- Aggregate embeddings of chunks into embeddings of documents:

  ```shell
  NUM_PROC=4 dvc repro embed aggregate_embeddings
  ```
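  Aggregation boils down to pooling the chunk vectors of each document into a single vector; a sketch assuming simple mean pooling (the actual strategy is defined by the `aggregate_embeddings` stage):

  ```python
  import numpy as np

  chunk_embeddings = np.random.rand(8, 1024)  # dummy (n_chunks, dim) array for one document
  doc_embedding = chunk_embeddings.mean(axis=0)  # single (dim,) document embedding
  ```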
- Optionally, ingest the data into MongoDB (e.g. for vector search):

  ```shell
  PYTHONPATH=. python scripts/embed/ingest.py --embeddings-file <embeddings>
  ```
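  With the embeddings ingested, a similarity query can be issued, e.g. via MongoDB Atlas `$vectorSearch` (requires a vector search index on the collection). The index, collection, and field names below are assumptions and must match whatever the ingestion script created; `db` and `doc_embedding` are reused from the snippets above:

  ```python
  results = db["judgements"].aggregate([
      {
          "$vectorSearch": {
              "index": "judgement_embeddings",  # hypothetical index name
              "path": "embedding",              # hypothetical vector field
              "queryVector": doc_embedding.tolist(),
              "numCandidates": 100,
              "limit": 5,
          }
      }
  ])
  for doc in results:
      print(doc["_id"])
  ```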
- Generate graph dataset:

  ```shell
  dvc repro embed build_graph_dataset
  ```
- Generate dataset card and upload it to huggingface (remember to be logged in to huggingface or set the `HUGGING_FACE_HUB_TOKEN` env variable):

  ```shell
  PYTHONPATH=. python scripts/dataset/upload_graph_dataset.py \
      --root-dir <dir_to_dataset> \
      --repo-id JuDDGES/pl-court-graph \
      --commit-message <message>
  ```
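  The uploaded files can later be fetched without cloning the repository, e.g. with `huggingface_hub`:

  ```python
  from huggingface_hub import snapshot_download

  local_dir = snapshot_download("JuDDGES/pl-court-graph", repo_type="dataset")
  print(local_dir)  # local path containing the dataset files
  ```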