Download dataset scripts FIX + new option to download datasets from benchmark configs #129

leriomaggio · 2023-02-16T19:09:29Z

This PR includes changes to the datasets/load_datasets.py script and related documentation.

In more detail: the previous version of the script had a bug that made it ignore any dataset name provided via the -d option and download all the datasets instead.

This PR fixes that bug and also improves the internal documentation (i.e. the help message).
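As an illustration of the intended behaviour after the fix, here is a minimal sketch of selecting only the requested datasets. It is not the code from this PR: the `DATASET_LOADERS` registry, the `select_loaders` helper and the dataset names are hypothetical placeholders.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger("load_datasets_sketch")

# Hypothetical registry mapping a dataset name to its download function.
# Each loader takes a target directory and returns True on success.
DATASET_LOADERS: Dict[str, Callable[[str], bool]] = {
    "dataset_a": lambda root: True,
    "dataset_b": lambda root: True,
    "dataset_c": lambda root: True,
}


def select_loaders(
    requested: Iterable[str],
    registry: Dict[str, Callable[[str], bool]],
) -> Dict[str, Callable[[str], bool]]:
    """Keep only the loaders whose names were explicitly requested."""
    selected = {}
    for name in requested:
        if name in registry:
            selected[name] = registry[name]
        else:
            # Report unknown names instead of silently ignoring them.
            logger.warning(f"Unknown dataset '{name}', skipping.")
    return selected
```

With this shape, passing two names yields at most two downloads, and a misspelled name produces a warning rather than a silent fallback to downloading everything.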

Moreover, a new option has been added to the script: -c, --configs.

This option overrides the manual selection of datasets to download by automatically extracting the names of the required datasets from the input configuration file(s).

This is particularly useful when preparing to run multiple benchmark experiments, because all the datasets they need can be downloaded in a single step.
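The override described above could be wired up roughly as follows. This is a sketch under assumptions, not the actual script: only `-d` and `-c, --configs` come from this PR description, while the long name `--datasets`, the help texts and the `resolve_dataset_names` helper are hypothetical.

```python
import argparse
from typing import Callable, Iterable, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Download datasets used by the benchmarks (sketch)."
    )
    # The long option name "--datasets" is an assumption.
    parser.add_argument("-d", "--datasets", nargs="*", default=[],
                        help="Names of the datasets to download.")
    parser.add_argument("-c", "--configs", nargs="*", default=[],
                        help="Benchmark configuration files to read dataset names from.")
    return parser


def resolve_dataset_names(
    args: argparse.Namespace,
    names_from_configs: Callable[[Sequence[str]], Iterable[str]],
) -> List[str]:
    """Return the names to download; config files take precedence over -d."""
    if args.configs:
        return sorted(names_from_configs(args.configs))
    return list(args.datasets)
```

Passing the config reader in as a callable keeps the precedence rule separate from the file format, so manual selection with `-d` keeps working unchanged when no config file is given.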

A new README.md file has been added to the datasets package, in line with the other packages included in the benchmark.
This documentation file includes all the detailed instructions on how to run and use the load_datasets utility.

Also, a new section has been added to the main README.md file to highlight the new features.

This fixes a bug in the script that was completely ignoring the values passed in via the command-line option. Moreover, extra checks and warning log messages now report what is going on during dataset download. This is useful when downloading several datasets at once.

Simplified report messages, and re-used pre-existing load function (not reinventing the wheel).

Improved the description of the script when executed with the `--help` option. Moreover, I realised that the reference Python version is 3.9 (also included in the Azure pipeline config), and f-strings are already used elsewhere. Therefore, all string formatting now uses f-strings.

New utility script used to collect the names of public datasets used in input benchmark configuration files. This script can be used to gather all the dataset names that need to be downloaded to run a specific set of experiments.
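A minimal sketch of such a collector is shown below. The configuration layout is an assumption made for illustration (a top-level `cases` list whose entries hold a `dataset` list of objects with `source` and `name` keys, where `source == "synthetic"` marks generated data that needs no download); the function name `collect_dataset_names` is hypothetical as well.

```python
import json
from typing import Iterable, Set


def collect_dataset_names(config_paths: Iterable[str]) -> Set[str]:
    """Gather the names of downloadable datasets from JSON config files."""
    names: Set[str] = set()
    for path in config_paths:
        with open(path, encoding="utf-8") as fh:
            config = json.load(fh)
        # Assumed layout: {"cases": [{"dataset": [{"source": ..., "name": ...}]}]}
        for case in config.get("cases", []):
            for dataset in case.get("dataset", []):
                if dataset.get("source") == "synthetic":
                    # Assumed marker for generated data: nothing to download.
                    continue
                name = dataset.get("name")
                if name:
                    names.add(name)
    return names
```

Returning a set means a dataset shared by several cases or several config files is listed once.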

New documentation README with instructions on how to use the utilities included in the `datasets` package. This file follows a structure similar to the README files of other packages.

The load_datasets utility script has been extended with an extra option to download datasets whose names are automatically extracted from input configuration files. This avoids having multiple scripts that must be executed in a specific order, while still keeping the flexibility to download datasets manually. The documentation has been updated and improved after this change.

Alexsandruss · 2023-03-20T21:42:07Z

Dataset loading will be changed soon as part of a complete refactoring of the benchmarks. --dummy-run or another runner argument will be used to make a run that only downloads the datasets.

Alexsandruss · 2024-07-26T09:09:47Z

Implemented with the --prefetch-datasets flag in the refactoring (#133).

leriomaggio added 7 commits February 16, 2023 15:47

simplified log msgs and reused pre-exst function

39ed884

New utility to collect DS names from json config

4330770

New documentation for datasets utilities

2cc2baf

Added section on Data download

fd481b5

leriomaggio requested a review from Alexsandruss as a code owner February 16, 2023 19:09

leriomaggio changed the title ~~Download dataset scripts FIX + new option to automatically download datasets in benchmark configs~~ Download dataset scripts FIX + new option to download datasets from benchmark configs Feb 16, 2023

leriomaggio added 3 commits February 16, 2023 19:47

MyPy & PEP8 fix

7ae3888

MyPy FIX

c10996f

MyPy FIX

0e81f45

Alexsandruss closed this Jul 26, 2024
