Download dataset scripts FIX + new option to download datasets from benchmark configs #129

Closed


leriomaggio

This PR includes changes to the datasets/load_datasets.py script and related documentation.

In more detail: the previous version of the script had a bug that caused it to completely ignore any dataset name passed via the -d option and to download all the datasets instead.

This has been fixed in this PR, along with some improvements to the internal documentation (i.e. the help message).
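To illustrate the fixed behaviour, here is a minimal sketch, not the PR's actual code. The registry `AVAILABLE_DATASETS`, the dataset names, the long option name `--datasets`, and the fallback to downloading everything when `-d` is omitted are all assumptions made for this example.

```python
import argparse

# Hypothetical registry: the names below are placeholders, not real datasets.
AVAILABLE_DATASETS = {
    "dataset_a": lambda: "loaded dataset_a",
    "dataset_b": lambda: "loaded dataset_b",
    "dataset_c": lambda: "loaded dataset_c",
}


def build_parser():
    parser = argparse.ArgumentParser(description="Download benchmark datasets (sketch).")
    # "-d" comes from the PR description; "--datasets" is an assumed long form.
    parser.add_argument("-d", "--datasets", nargs="*", default=[],
                        help="names of the datasets to download")
    return parser


def select_datasets(requested, available):
    """Return the names to download, honouring the requested names.

    Assumption: an empty request means "download everything".
    Unknown names are reported with a warning and skipped.
    """
    if not requested:
        return sorted(available)
    selected = []
    for name in requested:
        if name in available:
            selected.append(name)
        else:
            print(f"WARNING: unknown dataset '{name}', skipping")
    return selected


if __name__ == "__main__":
    args = build_parser().parse_args()
    for name in select_datasets(args.datasets, AVAILABLE_DATASETS):
        print(f"Downloading {name}: {AVAILABLE_DATASETS[name]()}")
```

The point of the sketch is that the names parsed from `-d` are actually passed on to the selection step, which is what the buggy version failed to do.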

Moreover, a new option has also been added to the script, namely -c, --configs.

This option overrides the manual selection of datasets to download by automatically extracting the names of the required datasets from the input configuration file(s).

This is particularly useful when preparing to run multiple benchmark experiments, because all the datasets they need can be downloaded beforehand.
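As a rough sketch of how dataset names could be extracted from configuration files: the code below assumes a JSON config with a top-level `cases` list whose entries carry a `dataset` list of objects with a `name` key. That layout, and both function names, are assumptions for illustration only and may not match the real benchmark configs or the PR's implementation.

```python
import json
from pathlib import Path


def dataset_names_from_config(config: dict) -> set[str]:
    """Collect dataset names from one parsed config (assumed layout)."""
    names = set()
    for case in config.get("cases", []):
        for dataset in case.get("dataset", []):
            name = dataset.get("name")
            if name:
                names.add(name)
    return names


def dataset_names_from_files(paths) -> list[str]:
    """Merge the dataset names found in several config files, without duplicates."""
    names = set()
    for path in paths:
        with Path(path).open(encoding="utf-8") as fh:
            names |= dataset_names_from_config(json.load(fh))
    return sorted(names)
```

Collecting into a set means a dataset shared by several configs is downloaded only once.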

A new README.md file has been added to the datasets package, in line with the other packages included in the benchmark.
This documentation file includes all the detailed instructions on how to run and use the load_datasets utility.

Also, a new section has been added to the main README.md file to highlight the new features.

This fixes a bug in the script that was completely ignoring
the values passed in via the command-line option.
Moreover, extra checks and warning logs/messages are now
included to report what is going on during dataset download.

This would be nice to have when downloading various datasets
at once.
Simplified report messages, and re-used pre-existing
load function (not reinventing the wheel).
Improved the description of the script when executed with
the `--help` option.

Moreover, I realised that the reference Python version is 3.9
(also included in Azure pipeline config).
Plus f-strings are already used elsewhere.

Therefore, all string formatting now uses f-strings.
New utility script used to collect the names of public datasets
used in input benchmark configuration files.

This script can be used to gather all the dataset names that
need to be downloaded to run a specific set of experiments.
New documentation readme with instructions on how to use
utilities included in the `datasets` package.

This doc file follows a structure similar to the README
doc files used for other packages.
The load_datasets utility script has been extended with the
extra option to download datasets automatically extracted from
input configuration files.

This avoids having multiple scripts that must be executed in a specific
order, while still keeping the flexibility to download datasets
manually.

The documentation has been updated and improved to reflect this change.
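The precedence described above, where `-c` overrides a manual `-d` selection, might look roughly like the sketch below. The long option name `--datasets`, the function `resolve_selection`, and the `names_from_configs` callable are hypothetical; only `-d` and `-c, --configs` come from this PR.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Download benchmark datasets (sketch).")
    # "--datasets" is an assumed long form of "-d".
    parser.add_argument("-d", "--datasets", nargs="*", default=[],
                        help="dataset names selected manually")
    parser.add_argument("-c", "--configs", nargs="*", default=[],
                        help="benchmark config files to extract dataset names from")
    return parser


def resolve_selection(args, names_from_configs):
    """Decide which datasets to download.

    names_from_configs is a hypothetical callable mapping a list of
    config paths to a list of dataset names.
    """
    if args.configs:
        if args.datasets:
            print("WARNING: -d is ignored because -c was given")
        return names_from_configs(args.configs)
    return list(args.datasets)
```

Keeping both options in a single script is what removes the need to run a separate name-collection script first.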
@leriomaggio leriomaggio changed the title Download dataset scripts FIX + new option to automatically download datasets in benchmark configs Download dataset scripts FIX + new option to download datasets from benchmark configs Feb 16, 2023
@Alexsandruss
Contributor

Dataset loading will change soon as part of a complete refactoring of the benchmarks. --dummy-run or another runner argument will be used to only download the datasets during a run.

@Alexsandruss
Contributor

Implemented with the --prefetch-datasets flag in the refactoring (#133).
