Update datasets to 2.11.0 #1002

Closed
4 of 6 tasks
severo opened this issue Mar 31, 2023 · 7 comments

Comments

@severo
Collaborator

severo commented Mar 31, 2023

See https://github.com/huggingface/datasets/releases/tag/2.11.0

TODO: See discussions below

Useful changes for the datasets server (please complete if there are more, @huggingface/datasets)

Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in huggingface/datasets#5573

  • this removes the need to depend on PyTorch to decode audio files
  • this became possible with soundfile 0.12, which bundles recent libsndfile binaries with MP3 support
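
For reference, a minimal sketch of how one might check that the installed soundfile is at least 0.12 before dropping torchaudio. It only uses the standard library. The helper names (`version_tuple`, `meets_minimum`, `soundfile_is_recent_enough`) are hypothetical, and the version number is only a proxy: actual MP3 support depends on the libsndfile build that gets loaded.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.12.1' into (0, 12, 1), keeping only leading numeric components."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple = (0, 12)) -> bool:
    """True if the installed version string is at least `minimum`."""
    return version_tuple(installed) >= minimum


def soundfile_is_recent_enough(minimum: tuple = (0, 12)) -> bool:
    """Hypothetical helper: False if soundfile is missing or older than `minimum`."""
    try:
        installed = metadata.version("soundfile")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


if __name__ == "__main__":
    print("soundfile >= 0.12 installed:", soundfile_is_recent_enough())
```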

Should we remove the dependencies on torch and torchaudio? cc @polinaeterna

Add writer_batch_size for ArrowBasedBuilder by @lhoestq in huggingface/datasets#5565

  • allows specifying the row group / record batch size when you download_and_prepare() a dataset
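
As a rough illustration of what a writer batch size controls, here is a pure-Python sketch that groups rows into batches of at most `writer_batch_size` rows, the way a writer would flush one record batch / row group at a time. This is not the datasets implementation; `iter_batches` is a hypothetical helper and the row counts are made up.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], writer_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `writer_batch_size` rows."""
    if writer_batch_size <= 0:
        raise ValueError("writer_batch_size must be a positive integer")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, writer_batch_size))
        if not batch:
            return
        yield batch


# 2500 rows written with a batch size of 1000 give two full batches and one partial one.
batch_sizes = [len(batch) for batch in iter_batches(range(2500), 1000)]
print(batch_sizes)  # [1000, 1000, 500]
```

A smaller batch size means more, smaller groups in the output file; a larger one means fewer, bigger groups.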

Needed for #833 I think; cc @lhoestq

Allow direct cast from binary to Audio/Image by @mariosasko in huggingface/datasets#5644

Should we adapt the code in https://github.com/huggingface/datasets-server/blob/main/services/worker/src/worker/features.py due to that?

Support streaming datasets with numpy.load by @albertvillanova in huggingface/datasets#5626

Should we refresh some datasets after that?

@albertvillanova
Member

albertvillanova commented Apr 3, 2023

I would suggest splitting this issue into several subtasks. I'm editing the description above...

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator Author

severo commented May 2, 2023

Keep open. Note also #1099 to upgrade to 2.12.0

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@albertvillanova
Member

Keep open

@albertvillanova
Member

I think "Use writer_batch_size for ArrowBasedBuilder" is already done by:

CC: @lhoestq @severo

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Jul 6, 2023