Dataset preparation

Scripts for dataset preparation are located in the scripts/dataset directory and should be run from the root of the repository.

1. Building the dataset

The dataset is downloaded from the open API of Polish Court Judgements. The following procedure downloads the data and stores it in MongoDB. Whenever a script interacts with the outside environment (storing data in MongoDB or pushing files to the Hugging Face Hub), it is run outside DVC. Before downloading, make sure the proper environment variables are set in the .env file (a minimal connection sketch follows the variable listing):

MONGO_URI=<mongo_uri_including_password>
MONGO_DB_NAME="datasets"
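
For reference, a minimal sketch of how these variables are typically consumed, assuming python-dotenv and pymongo (the actual download scripts may handle the connection differently):

    import os

    from dotenv import load_dotenv  # assumes python-dotenv is installed
    from pymongo import MongoClient  # assumes pymongo is installed

    # Load MONGO_URI and MONGO_DB_NAME from the .env file into the environment
    load_dotenv()

    client = MongoClient(os.environ["MONGO_URI"])
    db = client[os.environ.get("MONGO_DB_NAME", "datasets")]

    # e.g. list the collections created by the download scripts
    print(db.list_collection_names())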

Raw dataset

  1. Download judgements metadata - this will store the metadata in the database:

    PYTHONPATH=. python scripts/dataset/download_pl_metadata.py
  2. Download judgement texts (the XML content of judgements) - this will add the content to the database:

    PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
        --data-type content \
        --n-jobs 10
  3. Download additional details available for each judgement - this will add the acquired details to the database:

    PYTHONPATH=. python scripts/dataset/download_pl_additional_data.py \
        --data-type details \
        --n-jobs 10
  4. Map court and department ids to court names:

    PYTHONPATH=. python scripts/dataset/map_court_dep_id_2_name.py --n-jobs 10

    Remark: The mapping file available at data/datasets/pl/court_id_2_name.csv was prepared based on data published at https://orzeczenia.wroclaw.sa.gov.pl/indices

  5. Extract raw text from the XML content, along with details of judgments not available through the API:

    PYTHONPATH=. python scripts/dataset/extract_pl_xml.py --n-jobs 10
  6. For further processing, prepare a local dataset dump in a parquet file, version it with DVC, and push it to remote storage:

    PYTHONPATH=. python scripts/dataset/dump_pl_dataset.py \
        --file-name data/datasets/pl/raw/raw.parquet \
        --filter-empty-content
    dvc add data/datasets/pl/raw && dvc push
  7. Generate the dataset card for pl-court-raw:

    dvc repro raw_dataset_readme && dvc push
  8. Upload the pl-court-raw dataset (with card) to Hugging Face (see the loading sketch after this list):

    PYTHONPATH=. python scripts/dataset/push_raw_dataset.py --repo-id "JuDDGES/pl-court-raw" --commit-message <commit_message>
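
Once pushed, the raw dataset can be loaded straight from the Hub, e.g. with the datasets library (a minimal sketch; the split name "train" is an assumption and the schema may differ):

    from datasets import load_dataset  # assumes the `datasets` library is installed

    # Stream the raw Polish court dataset pushed in step 8 (split name assumed)
    raw = load_dataset("JuDDGES/pl-court-raw", split="train", streaming=True)
    print(next(iter(raw)))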

Instruction dataset

  1. Generate the instruction dataset and upload it to Hugging Face (pl-court-instruct):

    NUM_JOBS=8 dvc repro build_instruct_dataset
  2. Generate the dataset card for pl-court-instruct:

    dvc repro instruct_dataset_readme && dvc push
  3. Upload the pl-court-instruct dataset card to Hugging Face (see the verification sketch after this list):

    PYTHONPATH=. python scripts/dataset/push_instruct_readme.py --repo-id JuDDGES/pl-court-instruct
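
After uploading, the pushed card can be verified by fetching it back from the Hub (a minimal sketch using huggingface_hub, assuming you are authenticated):

    from huggingface_hub import hf_hub_download  # assumes huggingface_hub is installed

    # Download the README (dataset card) pushed above and print its local path
    path = hf_hub_download(
        repo_id="JuDDGES/pl-court-instruct",
        filename="README.md",
        repo_type="dataset",
    )
    print(path)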

Graph dataset

  1. Embed judgments with a pre-trained language model (documents are chunked and embeddings are computed per chunk):

    CUDA_VISIBLE_DEVICES=<device_number> dvc repro embed
  2. Aggregate chunk embeddings into document embeddings (see the pooling sketch after this list):

    NUM_PROC=4 dvc repro embed aggregate_embeddings
  3. Optionally, ingest the data into MongoDB (e.g. for vector search):

    PYTHONPATH=. python scripts/embed/ingest.py --embeddings-file <embeddings_file>
  4. Generate the graph dataset:

    dvc repro embed build_graph_dataset
  5. Generate the dataset card and upload it to Hugging Face (remember to be logged in to Hugging Face or set the HUGGING_FACE_HUB_TOKEN env variable):

    PYTHONPATH=. python scripts/dataset/upload_graph_dataset.py \
        --root-dir <dir_to_dataset> \
        --repo-id JuDDGES/pl-court-graph \
        --commit-message <message>
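
For intuition, the aggregation in step 2 boils down to pooling per-chunk vectors into a single vector per document. Below is a minimal illustrative sketch using mean pooling (the actual aggregate_embeddings stage may use a different strategy):

    import numpy as np

    def aggregate_chunk_embeddings(chunk_embeddings: list[np.ndarray]) -> np.ndarray:
        """Mean-pool per-chunk embeddings into one document embedding."""
        return np.stack(chunk_embeddings).mean(axis=0)

    # Toy example: three 4-dimensional chunk embeddings for one judgment
    chunks = [np.random.rand(4) for _ in range(3)]
    doc_embedding = aggregate_chunk_embeddings(chunks)
    print(doc_embedding.shape)  # (4,)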