Download dataset scripts FIX + new option to download datasets from benchmark configs #129
Closed
leriomaggio wants to merge 10 commits into IntelPython:main from leriomaggio:fix-load-datasets-download
This fixes a bug in the script, which was completely ignoring the values passed in via the command-line option. Moreover, extra checks and warning log messages are now included to report what is going on during dataset download. This is useful when downloading several datasets at once.
Simplified report messages, and re-used pre-existing load function (not reinventing the wheel).
Improved the description of the script when executed with the `--help` option. Moreover, I realised that the reference Python version is 3.9 (also included in the Azure pipeline config), and f-strings are already used elsewhere. Therefore, all string formatting now uses f-strings.
New utility script used to collect the names of public datasets used in input benchmark configuration files. This script can be used to gather the names of all the datasets that need to be downloaded to run a specific set of experiments.
New documentation readme with instructions on how to use utilities included in the `datasets` package. This doc file adopts a similar structure as in other README doc files used for other packages.
The load_datasets utility script has been extended with the extra option to download datasets automatically extracted from input configuration files. This would avoid having multiple scripts - to be executed in specific order - while still keeping flexibility to download datasets manually. Documentation has been updated, and improved after this change.
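The name-collection step described in these commit messages can be illustrated with a minimal sketch. It is not the code from this PR: the function name `collect_dataset_names` and the configuration layout (a JSON file with a `cases` list whose entries hold a `dataset` list of `{"source": ..., "name": ...}` items) are assumptions made for illustration only.

```python
import json
from pathlib import Path
from typing import Iterable, Set


def collect_dataset_names(config_paths: Iterable[str]) -> Set[str]:
    """Return the names of all non-synthetic datasets found in the configs.

    Hypothetical helper: the JSON layout read here is an assumption,
    not the confirmed schema of the benchmark configuration files.
    """
    names: Set[str] = set()
    for path in config_paths:
        config = json.loads(Path(path).read_text())
        for case in config.get("cases", []):
            for dataset in case.get("dataset", []):
                # Assumed convention: synthetic datasets are generated
                # locally, so they never need to be downloaded.
                if dataset.get("source") == "synthetic":
                    continue
                name = dataset.get("name")
                if name:
                    names.add(name)
    return names
```

Returning a set means a dataset referenced by several configuration files is only listed once.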
leriomaggio changed the title from "Download dataset scripts FIX + new option to automatically download datasets in benchmark configs" to "Download dataset scripts FIX + new option to download datasets from benchmark configs" on Feb 16, 2023
Datasets loading will be changed soon as part of complete refactoring of benchmarks.
Implemented with
This PR includes changes to the `datasets/load_datasets.py` script and related documentation.
In more detail: the previous version of the script had a bug that caused it to ignore any dataset name provided via the `-d` option (it downloaded all the datasets instead). This has been fixed in this PR, along with some improvements to the internal documentation (i.e. the help message).
Moreover, a new option has been added to the script, namely `-c`, `--configs`. This option overrides the manual selection of datasets to download by automatically extracting the names of the required datasets from the input configuration file(s). This is particularly useful when preparing to run multiple benchmark experiments, since all the necessary datasets are downloaded.
A new `README.md` file has been added to the `datasets` package, in line with other packages included in the benchmark. This documentation file includes detailed instructions on how to run and use the `load_datasets` utility. Also, a new section has been added to the main `README.md` file to highlight the new features.
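The selection behaviour described above (honour `-d`, let `-c`/`--configs` override it, and warn about unknown names) could look roughly like the following sketch. It is an illustration under stated assumptions, not the actual `load_datasets.py` code: the long option `--datasets`, the function names `build_parser` and `select_datasets`, and the `available` registry argument are all hypothetical.

```python
import argparse
import logging
from typing import Collection, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two options discussed in this PR."""
    parser = argparse.ArgumentParser(
        description="Download benchmark datasets (illustrative sketch)."
    )
    # `-d` comes from the PR text; the long form `--datasets` is assumed.
    parser.add_argument("-d", "--datasets", nargs="*", default=[],
                        help="names of the datasets to download")
    parser.add_argument("-c", "--configs", nargs="*", default=[],
                        help="benchmark configuration file(s) to scan")
    return parser


def select_datasets(requested: Sequence[str],
                    from_configs: Sequence[str],
                    available: Collection[str]) -> List[str]:
    """Pick the datasets to download, keeping order and dropping repeats.

    Names extracted from configs take precedence over manual selection.
    Names missing from `available` are skipped with a warning.
    """
    chosen = from_configs if from_configs else requested
    selected: List[str] = []
    for name in dict.fromkeys(chosen):  # de-duplicate, keep order
        if name in available:
            selected.append(name)
        else:
            logging.warning("Unknown dataset %r: skipping", name)
    return selected
```

The point of the sketch is that the values parsed from `-d` are actually passed on to the selection step, which is what the earlier version of the script failed to do.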