Random rows #875
Conversation
The documentation is not available anymore as the PR was closed or merged.
Codecov Report
Patch coverage:

@@            Coverage Diff             @@
##             main     #875      +/-   ##
==========================================
- Coverage   90.89%   87.79%    -3.10%
==========================================
  Files         139       94       -45
  Lines        7379     4114     -3265
==========================================
- Hits         6707     3612     -3095
+ Misses        672      502      -170

Flags with carried forward coverage won't be shown.
... and 48 files with indirect coverage changes.
also list all the modules to ignore for mypy at the same time. thanks @andreasoria #875 (comment)
until hffs has a proper release
also: fix parameter to disable tqdm. also: add e2e tests
in case it's what makes the e2e fail (see https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131 for example)
Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
I set it to 1024 because we memoize the index() function for 128 splits, which means we memoize the results of 8 queries per split on average.
The LRU cache will store up to 128 RowsIndexes (i.e. an index of the rows of 128 dataset splits) and up to 1,024 queries (i.e. 8 queries per split on average).
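As a minimal sketch of this two-level caching with functools.lru_cache (the class and function names here are illustrative, not the PR's exact code):

```python
from functools import lru_cache


class RowsIndex:
    """Illustrative stand-in for the PR's RowsIndex: an index over one split's parquet files."""

    def __init__(self, dataset: str, config: str, split: str) -> None:
        # In the real class, the expensive work (listing and opening the
        # split's parquet files) happens here, which is why it is memoized.
        self.key = (dataset, config, split)

    def query(self, offset: int, length: int) -> list:
        # Placeholder: the real method slices rows out of parquet row groups.
        return []


@lru_cache(maxsize=128)
def get_rows_index(dataset: str, config: str, split: str) -> RowsIndex:
    # First level: one cached RowsIndex per split, up to 128 splits.
    return RowsIndex(dataset, config, split)


@lru_cache(maxsize=1024)
def query_rows(dataset: str, config: str, split: str, offset: int, length: int) -> list:
    # Second level: cache whole query results; with 128 cached splits this
    # leaves room for about 8 distinct queries per split on average.
    return get_rows_index(dataset, config, split).query(offset, length)
```

With this layout, an evicted RowsIndex is simply rebuilt on the next cache miss, while already-cached query results stay valid.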
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
i.e. openwebtext-10k
Because the previous fix (to support builder names that contain dashes) was breaking the detection of the shard number. Note that we cannot support split names that contain dashes! I think it's a limitation; maybe we should store each split in its own directory instead of trying to parse the file names.
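A hedged sketch of the parsing constraint: assuming the parquet files are named `{builder}-{split}-{shard:05d}-of-{nshards:05d}.parquet` (as the discussion above implies), a greedy match lets the builder name contain dashes, but the split name must stay dash-free:

```python
import re

# Assumed file-name pattern, e.g. "openwebtext-10k-train-00000-of-00001.parquet".
PARQUET_NAME = re.compile(
    r"^(?P<builder>.+)-(?P<split>[^-]+)-(?P<shard>\d{5})-of-(?P<nshards>\d{5})\.parquet$"
)


def parse_parquet_name(filename: str) -> dict:
    match = PARQUET_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected parquet file name: {filename}")
    # The greedy builder group absorbs any dashes in the builder name, but the
    # split group forbids dashes, which is the limitation noted above.
    return match.groupdict()


print(parse_parquet_name("openwebtext-10k-train-00000-of-00001.parquet"))
# {'builder': 'openwebtext-10k', 'split': 'train', 'shard': '00000', 'nshards': '00001'}
```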
Let's go!
* chore: 🤖 add hffs and pyarrow dependencies
* feat: 🎸 add basic (and old) logic from #687
* feat: 🎸 change from/to parameters to offset/length and switch from pa.Table.take to pa.Table.slice (thanks @lhoestq - huggingface/dataset-viewer#687 (comment))
* style: 💄 fix style
* ci: 🎡 ignore hffs and pyarrow in mypy checks
* chore: 🤖 upgrade pyarrow (to use filesystem); also list all the modules to ignore for mypy at the same time. thanks @andreasoria huggingface/dataset-viewer#875 (comment)
* feat: 🎸 use row groups to reduce the response time, based on @lhoestq's implementation in https://huggingface.co/spaces/lhoestq/datasets-explorer/blob/main/app.py. Still a POC: we are querying datasets-server.huggingface.co (hardcoded) to get the list of parquet files.
* refactor: 💡 factorize mypy exceptions
* style: 💄 fix style
* refactor: 💡 fix type
* feat: 🎸 set the hffs commit until hffs has a proper release
* refactor: 💡 don't show the tqdm bars
* fix: 🐛 replace the /parquet step by config-parquet
* feat: 🎸 add code profiling
* test: 💍 nit: typos
* feat: 🎸 get parquet files from cache, and add code profiling
* fix: 🐛 fix style and test
* fix: 🐛 pass hf_token to mount the filesystem on gated datasets; also fix the parameter to disable tqdm and add e2e tests
* refactor: 💡 remove dead code
* ci: 🎡 increase the timeout limit for e2e tests, in case it's what makes the e2e fail (see https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131 for example)
* ci: 🎡 no need to increase to 30s
* Update services/api/src/api/routes/rows.py (Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>)
* style: 💄 fix style
* feat: 🎸 use the same format as /first-rows for /rows
* ci: 🎡 fix mypy and pip-audit
* feat: 🎸 memoize the result of the parquet query. I set it to 1024 because we memoize the index() function for 128 splits, which means we memoize the results of 8 queries per split on average.
* refactor: 💡 refactor as two classes: Indexer and RowsIndex. The LRU cache will store up to 128 RowsIndexes (i.e. an index of the rows of 128 dataset splits) and up to 1,024 queries (i.e. 8 queries per split on average).
* Update services/api/src/api/routes/rows.py (Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>)
* Update services/api/src/api/routes/rows.py (Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>)
* style: 💄 fix long line
* fix: 🐛 fix a bug: the dataset names can contain a dash, i.e. openwebtext-10k
* fix: 🐛 another fix on parsing the parquet file names, because the previous fix (to support builder names that contain dashes) was breaking the detection of the shard number. Note that we cannot support split names that contain dashes! I think it's a limitation; maybe we should store each split in its own directory instead of trying to parse the file names.
---------
Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
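The "use row groups" commit above is the core of the data access: instead of downloading a whole parquet shard, only the row groups that overlap the requested window are read, and the result is then trimmed with pa.Table.slice. A minimal sketch with pyarrow, under assumed names (read_window and its windowing logic are illustrative, not the PR's exact code):

```python
import pyarrow.parquet as pq


def read_window(path: str, offset: int, length: int):
    """Read only the row groups that overlap [offset, offset + length)."""
    parquet_file = pq.ParquetFile(path)
    selected, first_row, row_start = [], 0, 0
    for i in range(parquet_file.num_row_groups):
        num_rows = parquet_file.metadata.row_group(i).num_rows
        if row_start + num_rows > offset and row_start < offset + length:
            if not selected:
                first_row = row_start  # absolute row index where the first selected group starts
            selected.append(i)
        row_start += num_rows
    table = parquet_file.read_row_groups(selected)
    # Trim the extra rows at both ends of the selected row groups.
    return table.slice(offset - first_row, length)
```

The payoff is that a request for 10 rows touches only the row groups containing them, rather than the full file.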
Add a new endpoint: /rows
Parameters: dataset, config, split, offset and limit.
It returns a list of limit rows of the split, starting at row idx=offset, by fetching them from the parquet files published on the Hub.
Replaces #687
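For illustration, a hedged sketch of calling the new endpoint; the host is the production deployment mentioned in the commit log, while the dataset and config values are examples, not taken from this PR:

```python
import requests

response = requests.get(
    "https://datasets-server.huggingface.co/rows",
    params={
        "dataset": "openwebtext-10k",  # example dataset from the discussion
        "config": "plain_text",        # assumed config name
        "split": "train",
        "offset": 100,                 # index of the first row to return
        "limit": 10,                   # number of rows to return
    },
)
rows = response.json()
```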