Lower parquet row group size for image datasets #833
Conversation
The documentation is not available anymore as the PR was closed or merged.
```python
    Writer batch size to pass to a dataset builder.
    If `None`, then it will use the `datasets` default.
    """
    return 100 if "Image(" in str(ds_config_info.features) else None
```
Can we define a constant for `100`?
We can add a `constants.py` file, as in `libcommon`.
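A minimal sketch of that suggestion (the module layout and the constant name are assumptions, not what the repository actually uses):

```python
# constants.py — a hypothetical module mirroring libcommon's layout;
# the constant name is an assumption.
PARQUET_WRITER_BATCH_SIZE_FOR_IMAGES = 100

# The snippet above would then become:
#     from .constants import PARQUET_WRITER_BATCH_SIZE_FOR_IMAGES
#     return PARQUET_WRITER_BATCH_SIZE_FOR_IMAGES if "Image(" in str(ds_config_info.features) else None
```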
```
@@ -774,12 +795,17 @@ def compute_parquet_and_dataset_info_response(
    parquet_files: List[ParquetFile] = []
    dataset_info: dict[str, Any] = {}
    for config in config_names:
        ds_config_info = get_dataset_config_info(
```
`get_dataset_config_info` calls `load_dataset_builder` (which we also call below). Maybe we can factor this out in some way?

Also: it can rely on streaming and fail if the dataset does not support streaming, right? In that case, it defeats the purpose of downloading the dataset. Maybe handle that case with a try/except block?
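A possible shape for both suggestions — reuse a single `load_dataset_builder` call instead of also calling `get_dataset_config_info`, and guard the inspection with a try/except. A sketch only; the helper names are hypothetical:

```python
from typing import Optional

from datasets import load_dataset_builder


def get_writer_batch_size(info) -> Optional[int]:
    # as in the snippet above: cap the row group size for image datasets
    return 100 if "Image(" in str(info.features) else None


def writer_batch_size_for(dataset: str, config: str) -> Optional[int]:
    """Inspect a config's features without a separate get_dataset_config_info
    call: reuse load_dataset_builder (already called in the worker) and fall
    back to the `datasets` default if inspection fails."""
    try:
        builder = load_dataset_builder(dataset, config)
        return get_writer_batch_size(builder.info)
    except Exception:
        # e.g. a dataset whose metadata cannot be inspected
        return None
```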
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, do you want to adapt this PR, now that #875 has been merged to main?
We'll do a new release of `datasets`.
Codecov Report

Patch coverage:

```
@@            Coverage Diff             @@
##             main     #833      +/-   ##
==========================================
- Coverage   89.58%   87.80%    -1.79%
==========================================
  Files         147       94       -53
  Lines        7854     4115     -3739
==========================================
- Hits         7036     3613     -3423
+ Misses        818      502      -316
```

Flags with carried forward coverage won't be shown.

... and 53 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

keep open

this is ready for review @severo :)
OK. We will have to force-refresh the image datasets. Should we increment the job runner version? It would refresh everything, instead of just the image datasets...
Another comment: why don't we do the same for audio datasets?
We should! Let me update this.
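One way that update could look — a sketch extending the string check to `Audio(` features (whether audio should use the same value of 100 is an assumption, as are the constant names):

```python
from typing import Optional

# Hypothetical constants; the audio value matching the image value
# is an assumption, not taken from this PR.
PARQUET_WRITER_BATCH_SIZE_FOR_IMAGES = 100
PARQUET_WRITER_BATCH_SIZE_FOR_AUDIO = 100


def get_writer_batch_size(ds_config_info) -> Optional[int]:
    """Writer batch size to pass to a dataset builder.

    If `None`, then it will use the `datasets` default.
    """
    features = str(ds_config_info.features)
    if "Audio(" in features:
        return PARQUET_WRITER_BATCH_SIZE_FOR_AUDIO
    if "Image(" in features:
        return PARQUET_WRITER_BATCH_SIZE_FOR_IMAGES
    return None
```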
Could we hardcode a rule that would not recompute the parquet files if there's no image/audio in the dataset?

No, I think we should instead find a way to get the list of datasets with a specific feature type in it (it's useful anyway -> #561).

OK :) I just increased the job runner version.

:) OK... Let's see how it goes along with #1077... Many jobs will be run this weekend :)
REQUIRES `test_get_writer_batch_size` to be merged, and to update the `datasets` version to use this feature.

This should help optimize random access to parquet files for https://github.com/huggingface/datasets-server/pull/687/files
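For context on why a smaller row group size helps random access: reading one row from a parquet file requires decoding the whole row group that contains it, so capping groups at 100 rows bounds the amount of image data decoded per lookup. A sketch with pyarrow (the file path is a placeholder):

```python
import pyarrow.parquet as pq


def read_row(pf: pq.ParquetFile, row_index: int):
    """Read a single row by decoding only the row group that contains it."""
    rows_seen = 0
    for group_index in range(pf.num_row_groups):
        num_rows = pf.metadata.row_group(group_index).num_rows
        if row_index < rows_seen + num_rows:
            # Only this row group (at most 100 rows here) is decoded.
            table = pf.read_row_group(group_index)
            return table.slice(row_index - rows_seen, 1)
        rows_seen += num_rows
    raise IndexError(row_index)


# Placeholder path: any parquet file produced with writer_batch_size=100.
row = read_row(pq.ParquetFile("data/train-00000-of-00001.parquet"), 12345)
```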