feature request pipeline to fetch upload data to hugging face #1239

Collaborator

@koch3092 koch3092 commented Nov 28, 2024

Description

Add CRUD operations for Hugging Face datasets and records.

Motivation and Context

Closes #1213

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

Implemented Tasks

  • Subtask 1
  • Subtask 2
  • Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@koch3092 koch3092 requested a review from Wendong-Fan November 28, 2024 09:38
@koch3092 koch3092 self-assigned this Nov 28, 2024
@koch3092 koch3092 linked an issue Nov 28, 2024 that may be closed by this pull request
2 tasks
Member

@Wendong-Fan Wendong-Fan left a comment

Thanks @koch3092, I left some initial comments. We also need to support creating a dataset card for CAMEL; see https://huggingface.co/docs/datasets/en/dataset_card

@Wendong-Fan Wendong-Fan added the Data Related to camel data processing label Nov 29, 2024
@Wendong-Fan Wendong-Fan modified the milestones: Sprint 17, Sprint 17.5 Nov 29, 2024
@Wendong-Fan Wendong-Fan requested review from willshang76, zjrwtx and CaelumF and removed request for willshang76 November 30, 2024 20:09
@zjrwtx
Collaborator

zjrwtx commented Dec 3, 2024

Can we support picture-type data and examples?

https://huggingface.co/docs/hub/datasets-adding

…oad-data-to-hugging-face

# Conflicts:
#	poetry.lock
1. Throw NotImplementedError in abstract methods
2. Update logging to use camel.logger
3. Check the validity of JSON in upload_file()
4. Update poetry.lock
5. Update the token assignment method in the test
Member

@Wendong-Fan Wendong-Fan left a comment

Thanks @koch3092! I added one commit here, 9f82a29, to simplify the code structure and made some updates based on the comments below. Feel free to review the changes and merge this if there are no questions.

        Returns:
            str: The URL of the created dataset.
        """
        raise NotImplementedError("Method not implemented.")
Member

use pass instead
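
For context on the suggestion above: once a method is decorated with `abc.abstractmethod`, Python refuses to instantiate the base class at all, so an explicit `raise NotImplementedError` in the body is not needed to stop the base class from being used directly, and `pass` (or just the docstring) is enough. A minimal sketch follows; the class and method names (`BaseDatasetManager`, `InMemoryManager`, `create_dataset`) are hypothetical and are not taken from the PR's code.

```python
from abc import ABC, abstractmethod


class BaseDatasetManager(ABC):  # hypothetical name, for illustration only
    @abstractmethod
    def create_dataset(self, name: str) -> str:
        """Create a dataset and return its URL."""
        pass  # the body is a placeholder; subclasses must override it


class InMemoryManager(BaseDatasetManager):  # hypothetical concrete subclass
    def create_dataset(self, name: str) -> str:
        return f"memory://{name}"


def can_instantiate_base() -> bool:
    """Return True only if the abstract base class can be instantiated."""
    try:
        BaseDatasetManager()
    except TypeError:
        # ABCs with unimplemented abstract methods raise TypeError here.
        return False
    return True
```

One trade-off to keep in mind: a subclass that calls `super().create_dataset(...)` silently gets `None` with `pass`, whereas `raise NotImplementedError` would fail loudly.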

Comment on lines 35 to 38
class RepoType(str, Enum):
    DATASET = "dataset"
    MODEL = "model"
    SPACE = "space"
Member

move to camel.types
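
Wherever the enum ends up living, the `str` mixin is what makes it convenient to pass around: members compare equal to plain strings and can be looked up from them. A small sketch of that behaviour, repeating the enum from the diff; the helper `describe` is hypothetical and only there for illustration.

```python
from enum import Enum


class RepoType(str, Enum):
    DATASET = "dataset"
    MODEL = "model"
    SPACE = "space"


def describe(repo_type: RepoType) -> str:
    """Hypothetical helper: render the repo type for a log line or request."""
    # Use .value explicitly; str()/format() of mixed-in enums differs
    # between Python versions.
    return f"repo_type={repo_type.value}"
```

For example, `RepoType("model")` returns `RepoType.MODEL`, and `RepoType.DATASET == "dataset"` is true because of the `str` base class.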

Comment on lines 90 to 94
if not authors:
    authors = []

if not tags:
    tags = []
Member

authors and tags could be unified, set within metadata

Comment on lines 96 to 103
metadata = {
    "license": license,
    "task_categories": task_categories if task_categories else [],
    "language": language if language else [],
    "tags": tags,
    "pretty_name": dataset_name,
    "size_categories": [size_category] if size_category else [],
}
Member

Could we remove keys with None or empty values by using metadata = {k: v for k, v in metadata.items() if v}?
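
A rough sketch of how that filter could look in practice. The helper name `build_metadata` and its reduced parameter list are assumptions made for illustration; the PR's actual function signature is not shown in this thread. Note that the `if v` test is a truthiness check, so it drops empty lists and empty strings as well as `None`.

```python
from typing import Any, Dict, List, Optional


def build_metadata(  # hypothetical helper, not the PR's actual function
    dataset_name: str,
    license: Optional[str] = None,
    tags: Optional[List[str]] = None,
    language: Optional[List[str]] = None,
) -> Dict[str, Any]:
    metadata = {
        "license": license,
        "language": language,
        "tags": tags,
        "pretty_name": dataset_name,
    }
    # Truthiness filter: removes None, "" and [] alike, so the
    # `x if x else []` defaults are no longer needed.
    return {k: v for k, v in metadata.items() if v}
```

With this approach, `build_metadata("demo", license="mit", tags=[])` keeps only `license` and `pretty_name`. If a falsy value such as `0` or `False` ever needs to be preserved, `if v is not None` would be the safer condition.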

Comment on lines 258 to 262
if not existing_records:
    raise ValueError(
        f"Dataset '{dataset_name}' does not have an existing file to "
        f"update. Use `add_records` first."
    )
Member

Add the records directly and give the user a warning message?
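
One way the suggested fallback might look, as a sketch only. The function below works on plain lists rather than on the Hub, and the merge-by-`id` behaviour, the function signature, and the use of the standard `warnings` module (rather than the project's logger) are all assumptions, not the PR's implementation.

```python
import warnings
from typing import Any, Dict, List


def update_records(  # hypothetical, in-memory stand-in for the real method
    existing_records: List[Dict[str, Any]],
    new_records: List[Dict[str, Any]],
    key: str = "id",
) -> List[Dict[str, Any]]:
    if not existing_records:
        # Instead of raising ValueError, warn and fall back to a plain add.
        warnings.warn(
            "No existing records found; adding the records as new.",
            UserWarning,
            stacklevel=2,
        )
        return list(new_records)

    # Assumed update semantics: records sharing the same key are replaced,
    # unseen keys are appended.
    merged = {record[key]: record for record in existing_records}
    for record in new_records:
        merged[record[key]] = record
    return list(merged.values())
```

The caller still gets a usable result on an empty dataset, while the warning makes it visible that an update silently turned into an insert.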

@koch3092
Collaborator Author

koch3092 commented Dec 9, 2024

> Thanks @koch3092 ! Added one commit here 9f82a29 to simplify the code structure and did some update based on comment below, feel free to review the change and merge this if there's no question

Thanks @Wendong-Fan, it looks good to me.

@Wendong-Fan Wendong-Fan merged commit c16af26 into master Dec 9, 2024
6 checks passed
@Wendong-Fan Wendong-Fan deleted the 1213-feature-request-pipeline-to-fetch-upload-data-to-hugging-face branch December 9, 2024 05:28
Labels
Data Related to camel data processing
Development

Successfully merging this pull request may close these issues.

[Feature Request] Pipeline to fetch & upload data to hugging face
4 participants