Adding new splits to a dataset script with existing old splits info in metadata's dataset_info fails
#5315
EDIT: One idea:
@albertvillanova You mean in cases when the script was changed? I suggest that we:
I started it here: https://github.com/huggingface/datasets/pull/5327/files What do you think, @albertvillanova?
I edited my previous comment:
I agree with you there are 2 things to be addressed here:
Describe the bug
If you first create a custom dataset with a specific set of splits, generate metadata with `datasets-cli test ... --save_info`, and then change your script to include more splits, it fails. That's what happened in https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/discussions/2#6385fd1269634850f8ddff48.
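To make the failure mode concrete, here is a minimal sketch that does not use the `datasets` library at all. A plain dict stands in for the splits recorded in the metadata, and `find_unrecorded_splits` is a hypothetical helper introduced only for illustration; the real splits mapping in `datasets` may behave differently.

```python
# Stand-in for the splits recorded in README.md metadata after the first
# `--save_info` run (hypothetical values; the real object is not a plain dict).
recorded_splits = {"train": {"num_examples": 100}}

# Splits that the updated script now yields.
script_splits = ["train", "test"]


def find_unrecorded_splits(recorded, requested):
    """Return the requested split names that are missing from the recorded info."""
    return [name for name in requested if name not in recorded]


missing = find_unrecorded_splits(recorded_splits, script_splits)
print(missing)

# Looking up a split that was never recorded raises, which is the general
# shape of the problem reported in this issue.
try:
    recorded_splits["test"]
except KeyError as err:
    print(f"lookup failed for split {err}")
```

A check like this, run before indexing into the recorded info, is one way a loader could tell that stale metadata is the problem rather than the script itself.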
Steps to reproduce the bug
1. Create a dataset with a script that defines only a "train" split in `_split_generators`. Specifically, if you really want to reproduce, copy https://huggingface.co/datasets/mrdbourke/food_vision_199_classes/blob/main/food_vision_199_classes.py
2. Run `datasets-cli test dataset_script.py --save_info --all_configs`. This generates metadata YAML in `README.md` that contains info about the splits.
3. Change the script to include both "train" and "test" splits (uncomment these lines).
4. Call `load_dataset` and get an error.
5. Try to regenerate `README.md` with `datasets-cli` as in step 2 and get the same error.
This is because `dataset.info.splits` contains only the "train" split, so when we do `self.info.splits[split_generator.name]`, it tries to infer something like `info.splits['train[50%]']`, which is not the case, and it fails.
Expected behavior
To be discussed?
This can be solved by first removing the splits information from the metadata file, but I wonder if there is a better way.
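As a rough illustration of the manual workaround, the sketch below removes the `splits` entry from metadata that has already been parsed into a Python dict. The helper `drop_splits_info` is hypothetical, and the layout it assumes (a `dataset_info` key holding either one dict or a list of per-config dicts) is an assumption about the README YAML, not something confirmed in this thread. Reading and writing the YAML front matter itself is left out.

```python
import copy


def drop_splits_info(metadata):
    """Return a copy of parsed README metadata with `splits` removed from `dataset_info`."""
    cleaned = copy.deepcopy(metadata)
    info = cleaned.get("dataset_info")
    # Assumption: one dict for a single config, or a list of dicts for several.
    entries = info if isinstance(info, list) else [info]
    for entry in entries:
        if isinstance(entry, dict):
            entry.pop("splits", None)
    return cleaned


# Hypothetical metadata, shaped like the assumed README layout.
metadata = {
    "dataset_info": {
        "features": [{"name": "label", "dtype": "string"}],
        "splits": [{"name": "train", "num_examples": 100}],
    }
}
cleaned = drop_splits_info(metadata)
print(cleaned)
```

The function works on a copy so the original metadata stays available for comparison. After the stale `splits` entry is gone, the metadata can be regenerated with `datasets-cli test ... --save_info` as in the steps above.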
Environment info