Lower parquet row group size for image datasets #833

Merged
lhoestq merged 8 commits into main from lower-parquet-row-group-size-for-image-datasets
Apr 21, 2023

Conversation

lhoestq
Member

@lhoestq lhoestq commented Feb 22, 2023

Requires test_get_writer_batch_size to be merged, and the `datasets` version to be updated, before this feature can be used.

This should help optimize random access to parquet files for https://github.com/huggingface/datasets-server/pull/687/files
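
To illustrate the motivation: Parquet readers typically fetch data at row-group granularity, so returning a handful of rows means decoding every row group those rows touch. The sketch below does only that arithmetic, assuming fixed-size row groups and a reader that decodes whole groups. The helper names and the 1,000-row comparison size are assumptions for illustration; 100 is the value proposed in this PR.

```python
def row_group_of(row_index: int, row_group_size: int) -> int:
    """Index of the row group that holds `row_index` (fixed-size groups)."""
    return row_index // row_group_size


def rows_decoded_for_slice(offset: int, length: int, row_group_size: int, num_rows: int) -> int:
    """Rows a reader must decode to serve rows [offset, offset + length),
    assuming it always reads whole row groups."""
    if length <= 0 or offset >= num_rows:
        return 0
    last_row = min(offset + length, num_rows) - 1
    first_group = row_group_of(offset, row_group_size)
    last_group = row_group_of(last_row, row_group_size)
    # The final row group may hold fewer rows than row_group_size.
    end = min((last_group + 1) * row_group_size, num_rows)
    return end - first_group * row_group_size
```

Under these assumptions, serving 10 rows starting at row 250 of a 10,000-row file decodes 100 rows with a row group size of 100, versus 1,000 rows with a row group size of 1,000. For image columns, where each row can be large, that difference is what the PR is trying to reduce.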

@HuggingFaceDocBuilder
Collaborator

HuggingFaceDocBuilder commented Feb 22, 2023

The documentation is not available anymore as the PR was closed or merged.

Writer batch size to pass to a dataset builder.
If `None`, then it will use the `datasets` default.
"""
return 100 if "Image(" in str(ds_config_info.features) else None
Collaborator

Can we define a constant for 100?

Collaborator

we can add a constants.py file as in libcommon
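
A minimal sketch of what this suggestion could look like, with the literal `100` replaced by a named constant. The constant name and the `FakeConfigInfo` stand-in are hypothetical and are not taken from the PR; the detection logic mirrors the line under review.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical constant name; per the suggestion it would live in a constants.py module.
ROW_GROUP_SIZE_FOR_IMAGE_DATASETS = 100


@dataclass
class FakeConfigInfo:
    """Stand-in for the dataset config info object; only `features` is used here."""

    features: Any


def get_writer_batch_size(ds_config_info: Any) -> Optional[int]:
    """Writer batch size to pass to a dataset builder.

    If `None`, the `datasets` default is used.
    """
    # Same string-based check as the reviewed line, with the magic number named.
    if "Image(" in str(ds_config_info.features):
        return ROW_GROUP_SIZE_FOR_IMAGE_DATASETS
    return None
```

Naming the value keeps the threshold in one place, which matters if other jobs later need the same number.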

@@ -774,12 +795,17 @@ def compute_parquet_and_dataset_info_response(
parquet_files: List[ParquetFile] = []
dataset_info: dict[str, Any] = {}
for config in config_names:
ds_config_info = get_dataset_config_info(
Collaborator

@severo severo Feb 23, 2023

`get_dataset_config_info` calls `load_dataset_builder` (which we also call below). Maybe we can factor this out in some way?

Also: it can rely on streaming and fail if the dataset does not support streaming, right? In that case, it defeats the purpose of downloading the dataset. Maybe handle that case with a try/except block?
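
One way the try/except suggestion might be sketched: treat a failure to fetch the config info as "features unknown" and fall back to the default batch size. Both function names below are hypothetical, and the info-fetching call is passed in as a callable so the sketch does not assume any particular `datasets` signature.

```python
from typing import Any, Callable, Optional


def get_features_or_none(fetch_config_info: Callable[[], Any]) -> Optional[Any]:
    """Return the features reported by `fetch_config_info`, or None if it fails."""
    try:
        return fetch_config_info().features
    except Exception:
        # Assumption: a failure here (e.g. no streaming support) is non-fatal,
        # because the dataset is going to be downloaded anyway.
        return None


def writer_batch_size_from(features: Optional[Any], image_size: int = 100) -> Optional[int]:
    """None means: let the builder use its default batch size."""
    if features is not None and "Image(" in str(features):
        return image_size
    return None
```

Catching a broad `Exception` is a deliberate simplification here; a real implementation would probably narrow it to the errors the info lookup is known to raise.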

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator

severo commented Mar 27, 2023

Hi, do you want to adapt this PR, now that #875 has been merged to main?

@lhoestq
Member Author

lhoestq commented Mar 27, 2023

We'll release a new version of `datasets` tomorrow that adds the new parameter to `download_and_prepare`. I'll update this PR after the release.

@codecov-commenter

Codecov Report

Patch coverage: 100.00% and project coverage change: -1.79 ⚠️

Comparison is base (4788650) 89.58% compared to head (aef5c96) 87.80%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #833      +/-   ##
==========================================
- Coverage   89.58%   87.80%   -1.79%     
==========================================
  Files         147       94      -53     
  Lines        7854     4115    -3739     
==========================================
- Hits         7036     3613    -3423     
+ Misses        818      502     -316     
| Flag | Coverage Δ |
|---|---|
| jobs_cache_refresh | 98.50% <ø> (ø) |
| jobs_mongodb_migration | 80.57% <ø> (ø) |
| libs_libcommon | 93.55% <100.00%> (+<0.01%) ⬆️ |
| services_admin | 87.32% <ø> (ø) |
| services_api | 84.70% <ø> (ø) |
| services_worker | ? |

| Impacted Files | Coverage Δ |
|---|---|
| libs/libcommon/src/libcommon/constants.py | 100.00% <100.00%> (ø) |

... and 53 files with indirect coverage changes

@severo severo mentioned this pull request Mar 31, 2023
6 tasks
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator

severo commented Apr 20, 2023

keep open

@lhoestq lhoestq marked this pull request as ready for review April 21, 2023 10:42
@lhoestq
Member Author

lhoestq commented Apr 21, 2023

this is ready for review @severo :)

Collaborator

@severo severo left a comment

OK. We will have to force-refresh the image datasets. Should we increment the job runner version? It would refresh everything, instead of just the image datasets...

Another comment: why don't we do the same for audio datasets?

libs/libcommon/src/libcommon/constants.py (resolved)
@lhoestq
Member Author

lhoestq commented Apr 21, 2023

> Another comment: why don't we do the same for audio datasets?

We should! Let me update this.
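
A possible shape for extending the check to audio, sketched under assumptions: the marker table, its name, and the audio value are hypothetical (only the image value of 100 appears earlier in this review), and picking the smallest size when several markers match is a design choice of this sketch, not something stated in the PR.

```python
from typing import Any, Dict, Optional

# Markers searched for in the string form of the features.
# Assumption: audio uses the same row group size as images.
ROW_GROUP_SIZE_BY_MARKER: Dict[str, int] = {"Audio(": 100, "Image(": 100}


def get_writer_batch_size(features: Any) -> Optional[int]:
    """Return the smallest applicable row group size, or None for the default."""
    text = str(features)
    sizes = [size for marker, size in ROW_GROUP_SIZE_BY_MARKER.items() if marker in text]
    return min(sizes) if sizes else None
```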

@lhoestq
Member Author

lhoestq commented Apr 21, 2023

Could we hardcode a rule that skips recomputing the parquet files for datasets that contain no image or audio features?

@severo
Collaborator

severo commented Apr 21, 2023

> Could we hardcode a rule that would not recompute the parquet files if there's no image/audio in it ?

No, I think we should instead find a way to list the datasets that contain a specific feature type (that would be useful anyway, see #561).

@lhoestq
Member Author

lhoestq commented Apr 21, 2023

OK :) I just incremented the job runner version.

@severo
Collaborator

severo commented Apr 21, 2023

:) OK... Let's see how it goes along with #1077... Many jobs will be run this weekend :)

@lhoestq lhoestq merged commit cffaca1 into main Apr 21, 2023
@lhoestq lhoestq deleted the lower-parquet-row-group-size-for-image-datasets branch April 21, 2023 14:09
@lhoestq lhoestq mentioned this pull request May 3, 2023

4 participants