Random rows #875

Merged

severo merged 32 commits into main from random-rows

Mar 27, 2023

Collaborator

severo commented Mar 1, 2023

Add a new endpoint: /rows

Parameters: dataset, config, split, offset and limit.

It returns `limit` rows of the split, starting at row index `offset`, by fetching them from the parquet files published on the Hub.

Replaces #687
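
As an illustration of the parameters listed above, here is a minimal sketch of how a client might build a `/rows` request URL. The base URL is an assumption (the host `datasets-server.huggingface.co` is only mentioned in a later commit message), and the helper name `rows_url` and its validation rules are hypothetical, not part of this PR.

```python
from urllib.parse import urlencode

# Assumption: the host mentioned in a later commit message of this PR.
BASE_URL = "https://datasets-server.huggingface.co"


def rows_url(dataset: str, config: str, split: str, offset: int, limit: int) -> str:
    """Build the URL of a /rows request (hypothetical helper, not from the PR)."""
    # The validation below is illustrative; the PR does not state the accepted ranges.
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if limit <= 0:
        raise ValueError("limit must be positive")
    query = urlencode(
        {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "limit": limit,
        }
    )
    return f"{BASE_URL}/rows?{query}"


# Example: the first 10 rows of a split.
print(rows_url("glue", "cola", "train", offset=0, limit=10))
```

`urlencode` percent-encodes special characters, so a namespaced dataset name such as `user/dataset` stays a valid query value.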

severo mentioned this pull request

feat: 🎸 quick and dirty POC for the random rows endpoint #687

Closed

Collaborator

HuggingFaceDocBuilder commented Mar 1, 2023

The documentation is not available anymore as the PR was closed or merged.

codecov-commenter commented Mar 1, 2023

Codecov Report

Patch coverage: 57.81% and project coverage change: -3.10 ⚠️

Comparison is base (dc51be6) 90.89% compared to head (e300fb7) 87.79%.

❗ Current head e300fb7 differs from pull request most recent head 03c00cb. Consider uploading reports for the commit 03c00cb to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #875      +/-   ##
==========================================
- Coverage   90.89%   87.79%   -3.10%     
==========================================
  Files         139       94      -45     
  Lines        7379     4114    -3265     
==========================================
- Hits         6707     3612    -3095     
+ Misses        672      502     -170

Flag	Coverage Δ
jobs_cache_refresh	`98.50% <ø> (ø)`
jobs_mongodb_migration	`80.57% <86.11%> (+0.63%)`	⬆️
libs_libcommon	`93.54% <80.00%> (-0.17%)`	⬇️
services_admin	`87.32% <ø> (ø)`
services_api	`84.70% <43.78%> (-6.87%)`	⬇️
services_worker	`?`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
libs/libcommon/src/libcommon/config.py	`78.33% <ø> (ø)`
services/api/src/api/authentication.py	`100.00% <ø> (ø)`
services/api/src/api/config.py	`100.00% <ø> (ø)`
services/api/src/api/routes/endpoint.py	`78.63% <ø> (ø)`
services/api/tests/conftest.py	`98.46% <ø> (ø)`
libs/libcommon/src/libcommon/queue.py	`92.46% <25.00%> (-1.39%)`	⬇️
services/api/src/api/routes/rows.py	`37.16% <37.16%> (ø)`
services/api/src/api/utils.py	`91.80% <66.66%> (-1.31%)`	⬇️
...n/migrations/_20230323155000_cache_dataset_info.py	`75.00% <75.00%> (ø)`
...n/migrations/_20230323160000_queue_dataset_info.py	`75.00% <75.00%> (ø)`
... and 7 more

... and 48 files with indirect coverage changes


AndreaFrancis reviewed

services/api/pyproject.toml (outdated, resolved)

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

dbe22ab

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo force-pushed the random-rows branch from 16de983 to 5565f28 Compare

March 14, 2023 15:33

lhoestq reviewed

services/api/pyproject.toml (outdated, resolved)

lhoestq reviewed

services/api/src/api/routes/rows.py (outdated, resolved)

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

2b43c31

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo force-pushed the random-rows branch from df6837a to 80b3f08 Compare

March 15, 2023 09:16

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

8af0aa3

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo force-pushed the random-rows branch from 80b3f08 to 793475c Compare

March 15, 2023 09:24

severo changed the base branch from main to fix-type

March 15, 2023 09:24

Base automatically changed from fix-type to main

March 15, 2023 09:46

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

f8e2d54

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo force-pushed the random-rows branch from fbffced to 8631d7c Compare

March 15, 2023 09:55

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

525ffbb

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo force-pushed the random-rows branch from 8631d7c to 0bfe8bb Compare

March 15, 2023 12:33

AndreaFrancis reviewed

services/api/src/api/routes/rows.py (outdated, resolved)

AndreaFrancis reviewed

services/api/src/api/routes/rows.py (resolved)

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

0a2d5e7

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo force-pushed the random-rows branch from a062a16 to ca7df52 Compare

March 16, 2023 15:28

severo added a commit that referenced this pull request


          chore: 🤖 upgrade pyarrow (to use filesystem)

fc114c2

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)

severo force-pushed the random-rows branch from ca7df52 to 2fb7451 Compare

March 21, 2023 09:48

severo requested review from lhoestq, AndreaFrancis, albertvillanova, polinaeterna and mariosasko

March 21, 2023 20:08

severo and others added 23 commits

March 24, 2023 18:11


          refactor: 💡 fix type

feae924


          feat: 🎸 set the hffs commit

22f5277

until hffs has a proper release


          refactor: 💡 don't show the tqdm bars

dc39322


          fix: 🐛 replace the /parquet step by config-parquet

094d31c


          feat: 🎸 add code profiling

f789cc4


          test: 💍 nit: typos

f7c6d38


          feat: 🎸 get parquet files from cache, and add code profiling

cac22da


          fix: 🐛 fix style and test

d699193


          fix: 🐛 pass hf_token to mount the filesystem on gated datasets

6f2b342

also: fix parameter to disable tqdm. also: add e2e tests


          refactor: 💡 remove dead code

13a96f0


          ci: 🎡 increase the timeout limit for e2e tests

2008b69

in case it's what makes the e2e fail (see
https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131
for example)


          ci: 🎡 no need to increase to 30s

50c9c5e


          Update services/api/src/api/routes/rows.py

39d8783

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>


          style: 💄 fix style

4c50453


          feat: 🎸 use the same format as /first-rows for /rows

8ff6414


          ci: 🎡 fix mypy and pip-audit

6523d11


          feat: 🎸 memoïze the result of the parquet query

986838d

I set it to 1024 because we memoize the index() function for 128
splits, which means that we memoize the result of 8 queries per
split on average.


          refactor: 💡 refactor as two classes: Indexer and RowsIndex

32d492f

The LRU cache will store up to 128 RowsIndexes (i.e. an index of the rows
of 128 dataset splits), and up to 1,024 queries (i.e. 8 queries per split
on average).
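
The two cache sizes described in this commit message can be sketched with `functools.lru_cache`. Only the class names `Indexer` and `RowsIndex` and the sizes 128 and 1,024 come from the commit message; the method names, signatures, and the placeholder rows below are assumptions made for illustration, and the real classes read parquet files rather than an in-memory list.

```python
from functools import lru_cache
from typing import Any, Dict, Tuple


class RowsIndex:
    """Illustrative stand-in for the index of the rows of one dataset split."""

    def __init__(self, dataset: str, config: str, split: str) -> None:
        self.key = (dataset, config, split)
        # Placeholder data: the real index is built from parquet files.
        self.rows = [{"idx": i} for i in range(1000)]

    # The cache lives on the function, so it is shared by all instances:
    # 1,024 entries for 128 cached splits is 8 queries per split on average.
    @lru_cache(maxsize=1024)
    def query(self, offset: int, length: int) -> Tuple[Dict[str, Any], ...]:
        # Return a tuple so that callers cannot resize the cached result.
        return tuple(self.rows[offset : offset + length])


class Indexer:
    """Illustrative stand-in for the factory that builds and caches RowsIndex objects."""

    @lru_cache(maxsize=128)
    def get_rows_index(self, dataset: str, config: str, split: str) -> RowsIndex:
        return RowsIndex(dataset, config, split)


indexer = Indexer()
index = indexer.get_rows_index("glue", "cola", "train")
rows = index.query(10, 3)       # computed, then stored in the cache
rows_again = index.query(10, 3)  # served from the cache
print(RowsIndex.query.cache_info())
```

Decorating a method with `lru_cache` keeps every cached instance alive until its entries are evicted, which is acceptable here because both caches are bounded.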


          Update services/api/src/api/routes/rows.py

e139926

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>


          Update services/api/src/api/routes/rows.py

f9d975c

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>


          style: 💄 fix long line

a3eb150


          fix: 🐛 fix a bug: the dataset names can contain a dash

341e789

e.g. openwebtext-10k


          fix: 🐛 another fix on parsing the parquet file names

03c00cb

The previous fix (to support builder names that contain dashes) broke
the detection of the shard number. Note that we cannot support split
names that contain dashes! This is a limitation; maybe we should
store each split in its own directory instead of trying to parse the
file names.
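
The parsing constraint described in the last two commits can be sketched as follows. The file name layout (`<builder>-<split>.parquet`, with an optional `-<shard>-of-<total>` suffix) is an assumption inferred from these commit messages, and `parse_parquet_filename` is a hypothetical helper, not the code merged in this PR.

```python
import re
from typing import NamedTuple, Optional


class ParquetFileName(NamedTuple):
    builder: str
    split: str
    shard: Optional[int]
    num_shards: Optional[int]


# Assumed shard suffix, e.g. "-00000-of-00001" at the end of the stem.
SHARD_SUFFIX = re.compile(r"-(\d+)-of-(\d+)$")


def parse_parquet_filename(filename: str) -> ParquetFileName:
    """Split a parquet file name into builder, split and shard information."""
    if not filename.endswith(".parquet"):
        raise ValueError(f"not a parquet file: {filename}")
    stem = filename[: -len(".parquet")]
    shard = num_shards = None
    # Detect and strip the shard suffix first, so that dashes in the
    # builder name cannot be confused with the shard separator.
    match = SHARD_SUFFIX.search(stem)
    if match:
        shard, num_shards = int(match.group(1)), int(match.group(2))
        stem = stem[: match.start()]
    # The split is whatever follows the LAST dash: the builder may contain
    # dashes, but the split may not.
    builder, sep, split = stem.rpartition("-")
    if not sep or not builder or not split:
        raise ValueError(f"cannot parse: {filename}")
    return ParquetFileName(builder, split, shard, num_shards)


print(parse_parquet_filename("openwebtext-10k-train-00000-of-00001.parquet"))
# A split name with a dash is misread, which is the limitation noted above:
print(parse_parquet_filename("csv-train-small.parquet"))
```

In the second call the builder is parsed as `csv-train` and the split as `small`, which shows why storing each split in its own directory would remove the ambiguity.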

severo force-pushed the random-rows branch from e300fb7 to 03c00cb Compare

March 24, 2023 18:11

AndreaFrancis mentioned this pull request

Split first rows from parquet new Job Runner #988

Merged

Collaborator Author

severo commented Mar 27, 2023

Let's go!

severo merged commit 4788650 into main

severo deleted the random-rows branch

March 27, 2023 08:27

severo mentioned this pull request

Lower parquet row group size for image datasets #833

Merged

mattstern31 added a commit to mattstern31/datasets-server-storage-admin that referenced this pull request


          Random rows (#875)

* chore: 🤖 add hffs and pyarrow dependencies

* feat: 🎸 add basic (and old) logic from #687

* feat: 🎸 change from/to parameters to offset/length

and switch from pa.Table.take to pa.Table.slice (thanks @lhoestq -
huggingface/dataset-viewer#687 (comment))

* style: 💄 fix style

* ci: 🎡 ignore hffs and pyarrow in mypy checks

* chore: 🤖 upgrade pyarrow (to use filesystem)

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
huggingface/dataset-viewer#875 (comment)

* feat: 🎸 use row groups to reduce the response time

based on @lhoestq implementation in
https://huggingface.co/spaces/lhoestq/datasets-explorer/blob/main/app.py

Still a POC. We are querying datasets-server.huggingface.co (hardcoded)
to get the list of parquet files.

* refactor: 💡 factorize mypy exceptions

* style: 💄 fix style

* refactor: 💡 fix type

* feat: 🎸 set the hffs commit

until hffs has a proper release

* refactor: 💡 don't show the tqdm bars

* fix: 🐛 replace the /parquet step by config-parquet

* feat: 🎸 add code profiling

* test: 💍 nit: typos

* feat: 🎸 get parquet files from cache, and add code profiling

* fix: 🐛 fix style and test

* fix: 🐛 pass hf_token to mount the filesystem on gated datasets

also: fix parameter to disable tqdm. also: add e2e tests

* refactor: 💡 remove dead code

* ci: 🎡 increase the timeout limit for e2e tests

in case it's what makes the e2e fail (see
https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131
for example)

* ci: 🎡 no need to increase to 30s

* Update services/api/src/api/routes/rows.py

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>

* style: 💄 fix style

* feat: 🎸 use the same format as /first-rows for /rows

* ci: 🎡 fix mypy and pip-audit

* feat: 🎸 memoïze the result of the parquet query

I set it to 1024 because we memoize the index() function for 128
splits, which means that we memoize the result of 8 queries per
split on average.

* refactor: 💡 refactor as two classes: Indexer and RowsIndex

The LRU cache will store up to 128 RowsIndexes (i.e. an index of the rows
of 128 dataset splits), and up to 1,024 queries (i.e. 8 queries per split
on average).

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* style: 💄 fix long line

* fix: 🐛 fix a bug: the dataset names can contain a dash

e.g. openwebtext-10k

* fix: 🐛 another fix on parsing the parquet file names

The previous fix (to support builder names that contain dashes) broke
the detection of the shard number. Note that we cannot support split
names that contain dashes! This is a limitation; maybe we should
store each split in its own directory instead of trying to parse the
file names.

---------

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
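
The "use row groups to reduce the response time" commit in the message above does not spell out the technique. One plausible reading is that only the row groups overlapping the requested range are read, instead of whole files. The sketch below shows that selection step in pure Python under that assumption; the function name and the list of row group sizes are hypothetical, and the actual code reads parquet metadata through pyarrow, which is not reproduced here.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def select_row_groups(row_group_sizes: List[int], offset: int, length: int) -> Tuple[int, int, int]:
    """Locate the row groups that cover rows [offset, offset + length).

    Returns (first_group, last_group, offset_in_first_group), where both
    group indexes are inclusive.
    """
    # ends[i] is the index (exclusive) of the last row of row group i.
    ends = list(accumulate(row_group_sizes))
    total = ends[-1] if ends else 0
    if offset < 0 or length <= 0 or offset >= total:
        raise ValueError("requested range is outside of the split")
    # Clamp the range to the number of rows actually available.
    last_row = min(offset + length, total) - 1
    # bisect_right finds the first group whose end is strictly after the row.
    first_group = bisect_right(ends, offset)
    last_group = bisect_right(ends, last_row)
    start_of_first_group = ends[first_group] - row_group_sizes[first_group]
    return first_group, last_group, offset - start_of_first_group


# Three row groups of 100, 100 and 50 rows; ask for 60 rows from row 150.
print(select_row_groups([100, 100, 50], offset=150, length=60))
```

Once the groups are known, a reader would load only those groups and slice the result, starting at `offset_in_first_group`, to the requested length.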


Reviewers

lhoestq approved these changes

AndreaFrancis approved these changes

Awaiting requested review from albertvillanova

Awaiting requested review from polinaeterna

Awaiting requested review from mariosasko

Labels

None yet