
Bugfix: Huggingface dataset uploader #216

Merged
josvandervelde merged 3 commits into develop from bugfix/huggingface-dataset-uploader on Dec 1, 2023

Conversation

@josvandervelde commented Dec 1, 2023

Andrej found a bug in the HuggingFace uploader: it used username/dataset_name_cleaned as the repo_id, while you'd expect the dataset_name to already include the username.

In this PR, this is fixed:

  • Instead of using dataset.name as the repo_id, dataset.platform_resource_identifier is now used, because this identifier should uniquely identify the dataset on HF.
  • An error is thrown when you try to upload data to HF if the platform of the referenced dataset is not "huggingface" or if its platform_resource_identifier is not a valid <namespace>/<dataset_name>.
  • On POST/PUT of a resource's metadata, there are now checks (a rough sketch follows this list):
      • If platform=huggingface, the platform_resource_identifier should be a valid repo_id. Unfortunately, this might be an id without <namespace>, because we have to support legacy HF datasets, and I couldn't think of a clean way to allow legacy ids only for the connector (not for the POST/PUT routes) without slightly complicated and ugly code.
      • If platform=openml, the platform_resource_identifier should be a positive integer.
      • Other platforms are not checked yet. Jean Matias is currently busy with Zenodo; I'll ask if he can add a check in his PR.
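
For illustration, such a check could look like the sketch below; the function name, regex, and exact error messages are assumptions for this sketch, not the code merged in this PR.

```python
import re

# <namespace>/<dataset_name>, with the namespace optional for legacy HF datasets.
# The allowed characters are an assumption; see the Hugging Face docs for the real rules.
HUGGINGFACE_REPO_ID = re.compile(r"^(?:[A-Za-z0-9][A-Za-z0-9_.-]*/)?[A-Za-z0-9][A-Za-z0-9_.-]*$")


def check_platform_resource_identifier(platform: str, identifier: str) -> None:
    """Raise a ValueError if the identifier is not valid for the given platform."""
    if platform == "huggingface":
        if not HUGGINGFACE_REPO_ID.match(identifier):
            raise ValueError(
                f"{identifier!r} is not a valid Hugging Face repo_id "
                "(expected <namespace>/<dataset_name>)."
            )
    elif platform == "openml":
        if not identifier.isdigit() or int(identifier) <= 0:
            raise ValueError(f"{identifier!r} is not a valid OpenML id (a positive integer).")
    # Other platforms are not checked (yet).
```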

…ure that the HuggingFace datasets have a correct platform_resource_identifier

```diff
-repo_id = f"{username}/{dataset_name_cleaned}"
+dataset: Dataset = _get_resource(session=session, identifier=identifier)
+repo_id = dataset.platform_resource_identifier
+if dataset.platform != PlatformName.huggingface or not repo_id:
```
Collaborator

I would split this check into two separate ones and provide different error messages.
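
For illustration only, continuing the snippet above, the split might look roughly like this, assuming the route reports errors via FastAPI's HTTPException; the actual messages in commit 3f0e878 may differ.

```python
from fastapi import HTTPException, status

# First check: the dataset must live on the Hugging Face platform.
if dataset.platform != PlatformName.huggingface:
    raise HTTPException(
        status_code=status.HTTP_400_BAD_REQUEST,
        detail=f"Dataset {identifier} is not stored on the Hugging Face platform.",
    )
# Second check: the dataset must have a platform_resource_identifier (the HF repo_id).
if not dataset.platform_resource_identifier:
    raise HTTPException(
        status_code=status.HTTP_400_BAD_REQUEST,
        detail=f"Dataset {identifier} has no platform_resource_identifier.",
    )
```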

Author

Done! 3f0e878

@josvandervelde (Author)

As discussed:

It would probably be better to have an endpoint to which you can send all the metadata together with the data. It could then:

  1. check the metadata (return if problems)
  2. check whether it's possible to create the repo (return if problems)
  3. create the repo on HF
  4. upload the data (return if problems)
  5. create the AIoD metadata

We'll leave that out of scope for this PR; a rough sketch of the flow follows.
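
A minimal sketch of that combined flow, purely for illustration: the function, the helper names, and the file path are hypothetical; only the huggingface_hub (HfApi) calls are real library methods.

```python
import huggingface_hub


def validate_metadata(metadata: dict) -> None:
    """Step 1 (placeholder): check the metadata before touching Hugging Face."""
    if not metadata.get("platform_resource_identifier"):
        raise ValueError("Missing platform_resource_identifier.")


def store_aiod_metadata(metadata: dict) -> None:
    """Step 5 (placeholder): persist the metadata in the AIoD database."""


def upload_dataset_with_data(metadata: dict, data_path: str, token: str) -> None:
    validate_metadata(metadata)  # 1. check the metadata (raise if problems)
    repo_id = metadata["platform_resource_identifier"]

    api = huggingface_hub.HfApi(token=token)
    # 2. check whether the repo can be created (here: only that it doesn't exist yet)
    if api.repo_exists(repo_id, repo_type="dataset"):
        raise ValueError(f"Repo {repo_id} already exists on Hugging Face.")

    # 3. create the repo on HF
    api.create_repo(repo_id, repo_type="dataset")
    # 4. upload the data ("data.csv" is an arbitrary example path in the repo)
    api.upload_file(
        path_or_fileobj=data_path,
        path_in_repo="data.csv",
        repo_id=repo_id,
        repo_type="dataset",
    )
    # 5. only now create the AIoD metadata record
    store_aiod_metadata(metadata)
```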

@josvandervelde merged commit e4d5285 into develop on Dec 1, 2023
1 check passed
@josvandervelde deleted the bugfix/huggingface-dataset-uploader branch on December 1, 2023 at 14:06