Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random rows #875

Merged
merged 32 commits into from
Mar 27, 2023
Merged

Random rows #875

merged 32 commits into from
Mar 27, 2023

Conversation

severo
Copy link
Collaborator

@severo severo commented Mar 1, 2023

Add a new endpoint: /rows

Parameters: dataset, config, split, offset and limit.

It returns a list of limit rows of the split, from row idx=offset, by fetching them from the parquet files published on the Hub.

Replaces #687

@HuggingFaceDocBuilder
Copy link
Collaborator

HuggingFaceDocBuilder commented Mar 1, 2023

The documentation is not available anymore as the PR was closed or merged.

@codecov-commenter
Copy link

codecov-commenter commented Mar 1, 2023

Codecov Report

Patch coverage: 57.81% and project coverage change: -3.10 ⚠️

Comparison is base (dc51be6) 90.89% compared to head (e300fb7) 87.79%.

❗ Current head e300fb7 differs from pull request most recent head 03c00cb. Consider uploading reports for the commit 03c00cb to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #875      +/-   ##
==========================================
- Coverage   90.89%   87.79%   -3.10%     
==========================================
  Files         139       94      -45     
  Lines        7379     4114    -3265     
==========================================
- Hits         6707     3612    -3095     
+ Misses        672      502     -170     
Flag Coverage Δ
jobs_cache_refresh 98.50% <ø> (ø)
jobs_mongodb_migration 80.57% <86.11%> (+0.63%) ⬆️
libs_libcommon 93.54% <80.00%> (-0.17%) ⬇️
services_admin 87.32% <ø> (ø)
services_api 84.70% <43.78%> (-6.87%) ⬇️
services_worker ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
libs/libcommon/src/libcommon/config.py 78.33% <ø> (ø)
services/api/src/api/authentication.py 100.00% <ø> (ø)
services/api/src/api/config.py 100.00% <ø> (ø)
services/api/src/api/routes/endpoint.py 78.63% <ø> (ø)
services/api/tests/conftest.py 98.46% <ø> (ø)
libs/libcommon/src/libcommon/queue.py 92.46% <25.00%> (-1.39%) ⬇️
services/api/src/api/routes/rows.py 37.16% <37.16%> (ø)
services/api/src/api/utils.py 91.80% <66.66%> (-1.31%) ⬇️
...n/migrations/_20230323155000_cache_dataset_info.py 75.00% <75.00%> (ø)
...n/migrations/_20230323160000_queue_dataset_info.py 75.00% <75.00%> (ø)
... and 7 more

... and 48 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

severo added a commit that referenced this pull request Mar 1, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
severo added a commit that referenced this pull request Mar 14, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
severo added a commit that referenced this pull request Mar 15, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
severo added a commit that referenced this pull request Mar 15, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
@severo severo changed the base branch from main to fix-type March 15, 2023 09:24
Base automatically changed from fix-type to main March 15, 2023 09:46
severo added a commit that referenced this pull request Mar 15, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
severo added a commit that referenced this pull request Mar 15, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
severo added a commit that referenced this pull request Mar 16, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
severo added a commit that referenced this pull request Mar 21, 2023
also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
#875 (comment)
severo and others added 23 commits March 24, 2023 18:11
until hffs has a proper release
also: fix parameter to disable tqdm. also: add e2e tests
Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
I put it to 1024, because we memoïze the index() function for 128
splits, which means here that we memoïze the result for 8 queries per
split in average.
The LRU cache will store up to 128 RowsIndexes (ie. an index of the rows
of 128 dataset splits), and up to 1,024 queries (ie. 8 queries per split
in average).
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Because the previous fix (to support builder names with dashes into
them) was breaking the detection of the shard number. Note that we
cannot support split names that contain dashes!!! I think it's a
limitation, maybe we should store each split in its own directory
instead of trying to parse.
@severo
Copy link
Collaborator Author

severo commented Mar 27, 2023

Let's go!

@severo severo merged commit 4788650 into main Mar 27, 2023
@severo severo deleted the random-rows branch March 27, 2023 08:27
mattstern31 added a commit to mattstern31/datasets-server-storage-admin that referenced this pull request Nov 11, 2023
* chore: 🤖 add hffs and pyarrow dependencies

* feat: 🎸 add basic (and old) logic from #687

* feat: 🎸 change from/to parameters to offset/length

and use pa.Table.take to pa.Table.slice (thanks @lhoestq -
huggingface/dataset-viewer#687 (comment))

* style: 💄 fix style

* ci: 🎡 ignore hffs and pyarrow in mypy checks

* chore: 🤖 upgrade pyarrow (to use filesystem)

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
huggingface/dataset-viewer#875 (comment)

* feat: 🎸 use row groups to reduce the response time

based on @lhoestq implementation in
https://huggingface.co/spaces/lhoestq/datasets-explorer/blob/main/app.py

Still a POC. We are querying datasets-server.huggingface.co (hardcoded)
to get the list of parquet files.

* refactor: 💡 factorize mypy exceptions

* style: 💄 fix style

* refactor: 💡 fix type

* feat: 🎸 set the hffs commit

until hffs has a proper release

* refactor: 💡 don't show the tqdm bars

* fix: 🐛 replace the /parquet step by config-parquet

* feat: 🎸 add code profiling

* test: 💍 nit: typos

* feat: 🎸 get parquet files from cache, and add code profiling

* fix: 🐛 fix style and test

* fix: 🐛 pass hf_token to mount the filesystem on gated datasets

also: fix parameter to disable tqdm. also: add e2e tests

* refactor: 💡 remove dead code

* ci: 🎡 increase the timeout limit for e2e tests

in case it's what makes the e2e fail (see
https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131
for example)

* ci: 🎡 no need to increase to 30s

* Update services/api/src/api/routes/rows.py

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>

* style: 💄 fix style

* feat: 🎸 use the same format as /first-rows for /rows

* ci: 🎡 fix mypy and pip-audit

* feat: 🎸 memoïze the result of the parquet query

I put it to 1024, because we memoïze the index() function for 128
splits, which means here that we memoïze the result for 8 queries per
split in average.

* refactor: 💡 refactor as two classes: Indexer and RowsIndex

The LRU cache will store up to 128 RowsIndexes (ie. an index of the rows
of 128 dataset splits), and up to 1,024 queries (ie. 8 queries per split
in average).

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* style: 💄 fix long line

* fix: 🐛 fix a bug: the dataset names can contain a dash

ie. openwebtext-10k

* fix: 🐛 another fix on parsing the parquet file names

Because the previous fix (to support builder names with dashes into
them) was breaking the detection of the shard number. Note that we
cannot support split names that contain dashes!!! I think it's a
limitation, maybe we should store each split in its own directory
instead of trying to parse.

---------

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants