Lower parquet row group size for image datasets #833

lhoestq · 2023-02-22T15:34:07Z

Requires test_get_writer_batch_size to be merged, and the datasets version to be updated, before this feature can be used.

This should help optimize random access to parquet files for https://github.com/huggingface/datasets-server/pull/687/files
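As a rough illustration of why a smaller row group can help random access: a Parquet reader generally fetches data at row-group granularity, so the number of rows touched to serve one requested row is bounded by the row group size. The sketch below only does that arithmetic. `locate_row` is a hypothetical helper, and the size of 1000 is an illustrative comparison value, not a number taken from this PR.

```python
from typing import Tuple


def locate_row(row_index: int, row_group_size: int) -> Tuple[int, int]:
    """Return (row group index, offset inside that group) for a row.

    Hypothetical helper: assumes every row group except possibly the last
    holds exactly `row_group_size` rows.
    """
    if row_index < 0 or row_group_size <= 0:
        raise ValueError("row_index must be >= 0 and row_group_size must be > 0")
    return divmod(row_index, row_group_size)


# Row 250 with groups of 100 rows lands in the third group (index 2), offset 50.
small_group, small_offset = locate_row(250, 100)

# With groups of 1000 rows, the same row sits in group 0 at offset 250,
# so a reader may need to go through a group of up to 1000 rows to reach it.
big_group, big_offset = locate_row(250, 1000)
```

For image rows, where each row can be large, the difference in bytes read per lookup can be significant.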

HuggingFaceDocBuilder · 2023-02-22T15:37:03Z

The documentation is not available anymore as the PR was closed or merged.

severo · 2023-02-23T16:41:24Z

services/worker/src/worker/job_runners/parquet_and_dataset_info.py

+            Writer batch size to pass to a dataset builder.
+            If `None`, then it will use the `datasets` default.
+    """
+    return 100 if "Image(" in str(ds_config_info.features) else None


Can we define a constant for 100?

We can add a constants.py file, as in libcommon.
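A minimal sketch of what this suggestion could look like, assuming a module-level constant in place of the literal `100`. The constant name `ROW_GROUP_SIZE_FOR_IMAGE_DATASETS` is hypothetical, and the function takes the features object directly so the example stays self-contained; the string check itself is the one from the diff above.

```python
from typing import Any, Optional

# Hypothetical name; in the suggestion above it would live in a constants.py module.
ROW_GROUP_SIZE_FOR_IMAGE_DATASETS = 100


def get_writer_batch_size(features: Any) -> Optional[int]:
    """Return a writer batch size for image datasets, else None.

    None means "let the `datasets` library use its default".
    The check mirrors the diff: it looks for "Image(" in the string
    representation of the features.
    """
    if "Image(" in str(features):
        return ROW_GROUP_SIZE_FOR_IMAGE_DATASETS
    return None
```

Naming the value makes it easier to find and change later than a literal buried in a return statement.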

severo · 2023-02-23T16:48:03Z

services/worker/src/worker/job_runners/parquet_and_dataset_info.py

@@ -774,12 +795,17 @@ def compute_parquet_and_dataset_info_response(
    parquet_files: List[ParquetFile] = []
    dataset_info: dict[str, Any] = {}
    for config in config_names:
+        ds_config_info = get_dataset_config_info(


get_dataset_config_info calls load_dataset_builder, which we also call below. Maybe we can factor out the shared call in some way?

Also: it can rely on streaming and fail if the dataset does not support streaming, right? In that case, it defeats the purpose of downloading the dataset. Maybe handle the case with a try/except block?
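One way the try/except idea could be sketched, assuming that falling back to the `datasets` default (`None`) is acceptable when the config info cannot be fetched. `writer_batch_size_or_default` and its parameters are hypothetical; the fetch step is passed in as a callable so the example does not depend on the real `get_dataset_config_info`, and the broad `except Exception` is a simplification that real code would likely narrow.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


def writer_batch_size_or_default(
    fetch_config_info: Callable[[], Any],
    image_batch_size: int = 100,
) -> Optional[int]:
    """Return the image batch size, or None if info is unavailable.

    `fetch_config_info` stands in for a call that may rely on streaming
    and therefore may raise for datasets that do not support it.
    """
    try:
        info = fetch_config_info()
    except Exception:
        # Assumption: any failure here should not block the parquet export,
        # so fall back to the library default.
        return None
    return image_batch_size if "Image(" in str(info.features) else None


def _streaming_unsupported() -> Any:
    # Simulates a dataset whose config info cannot be fetched.
    raise RuntimeError("streaming is not supported for this dataset")


fallback = writer_batch_size_or_default(_streaming_unsupported)
image_case = writer_batch_size_or_default(
    lambda: SimpleNamespace(features="{'image': Image(decode=True, id=None)}")
)
```

With this shape, a dataset that cannot be streamed is still downloaded and converted, only without the smaller row groups.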

github-actions · 2023-03-25T15:03:58Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo · 2023-03-27T08:34:36Z

Hi, do you want to adapt this PR, now that #875 has been merged to main?

lhoestq · 2023-03-27T13:02:15Z

We'll do a new release of datasets tomorrow to add the new parameter to download_and_prepare. I'll update this PR after the release.

codecov-commenter · 2023-03-27T14:39:09Z

Codecov Report

Patch coverage: 100.00% and project coverage change: -1.79 ⚠️

Comparison is base (4788650) 89.58% compared to head (aef5c96) 87.80%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #833      +/-   ##
==========================================
- Coverage   89.58%   87.80%   -1.79%     
==========================================
  Files         147       94      -53     
  Lines        7854     4115    -3739     
==========================================
- Hits         7036     3613    -3423     
+ Misses        818      502     -316

Flag	Coverage Δ
jobs_cache_refresh	`98.50% <ø> (ø)`
jobs_mongodb_migration	`80.57% <ø> (ø)`
libs_libcommon	`93.55% <100.00%> (+<0.01%)`	⬆️
services_admin	`87.32% <ø> (ø)`
services_api	`84.70% <ø> (ø)`
services_worker	`?`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
libs/libcommon/src/libcommon/constants.py	`100.00% <100.00%> (ø)`

... and 53 files with indirect coverage changes


github-actions · 2023-04-20T15:04:06Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo · 2023-04-20T15:27:45Z

keep open

lhoestq · 2023-04-21T10:55:01Z

this is ready for review @severo :)

severo

OK. We will have to force-refresh the image datasets. Should we increment the job runner version? It would refresh everything, instead of just the image datasets...

Another comment: why don't we do the same for audio datasets?

libs/libcommon/src/libcommon/constants.py

lhoestq · 2023-04-21T11:29:22Z

> Another comment: why don't we do the same for audio datasets?

We should! Let me update this.

lhoestq · 2023-04-21T11:30:41Z

Could we hardcode a rule that would not recompute the parquet files if there's no image/audio in it?

severo · 2023-04-21T11:35:10Z

> Could we hardcode a rule that would not recompute the parquet files if there's no image/audio in it?

No, I think we should instead find a way to get the list of datasets that contain a specific feature type (it's useful anyway, see #561).

lhoestq · 2023-04-21T13:53:27Z

Ok :) I just increased the job version

severo · 2023-04-21T14:05:55Z

:) OK... Let's see how it goes along with #1077... Many jobs will be run this weekend :)

lower parquet row group size for image datasets

be1a86e

severo reviewed Feb 23, 2023

lhoestq added 2 commits March 27, 2023 14:54

Merge branch 'main' into lower-parquet-row-group-size-for-image-datasets

106dcd0

sylvain's comments

aef5c96

severo mentioned this pull request Mar 31, 2023

Update datasets to 2.11.0 #1002

Closed

6 tasks

lhoestq added 3 commits April 21, 2023 12:25

Merge branch 'main' into lower-parquet-row-group-size-for-image-datasets

3781cc8

set writer_batch_size

713fb94

remove unused import

aea52b3

lhoestq marked this pull request as ready for review April 21, 2023 10:42

severo approved these changes Apr 21, 2023

reduce row group size for audio as well

08e8246

severo mentioned this pull request Apr 21, 2023

Store /valid and /is-valid in the cache #891

Closed

6 tasks

increase job version

2a9f2a2

lhoestq merged commit cffaca1 into main Apr 21, 2023

lhoestq deleted the lower-parquet-row-group-size-for-image-datasets branch April 21, 2023 14:09

lhoestq mentioned this pull request May 3, 2023

Too large row group size for parquet exports of image datasets #1127

Closed

lhoestq mentioned this pull request May 3, 2023

Re-lower row group #1134

Merged
