Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* chore: 🤖 add hffs and pyarrow dependencies
* feat: 🎸 add basic (and old) logic from #687
* feat: 🎸 change from/to parameters to offset/length and switch from pa.Table.take to pa.Table.slice (thanks @lhoestq - huggingface/dataset-viewer#687 (comment))
* style: 💄 fix style
* ci: 🎡 ignore hffs and pyarrow in mypy checks
* chore: 🤖 upgrade pyarrow (to use filesystem)
  Also list all the modules to ignore for mypy at the same time. Thanks @andreasoria huggingface/dataset-viewer#875 (comment)
* feat: 🎸 use row groups to reduce the response time
  Based on @lhoestq's implementation in https://huggingface.co/spaces/lhoestq/datasets-explorer/blob/main/app.py
  Still a POC. We are querying datasets-server.huggingface.co (hardcoded) to get the list of parquet files.
* refactor: 💡 factorize mypy exceptions
* style: 💄 fix style
* refactor: 💡 fix type
* feat: 🎸 set the hffs commit until hffs has a proper release
* refactor: 💡 don't show the tqdm bars
* fix: 🐛 replace the /parquet step by config-parquet
* feat: 🎸 add code profiling
* test: 💍 nit: typos
* feat: 🎸 get parquet files from cache, and add code profiling
* fix: 🐛 fix style and test
* fix: 🐛 pass hf_token to mount the filesystem on gated datasets
  Also: fix parameter to disable tqdm.
  Also: add e2e tests.
* refactor: 💡 remove dead code
* ci: 🎡 increase the timeout limit for e2e tests
  In case it's what makes the e2e tests fail (see https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131 for example).
* ci: 🎡 no need to increase to 30s
* Update services/api/src/api/routes/rows.py
  Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
* style: 💄 fix style
* feat: 🎸 use the same format as /first-rows for /rows
* ci: 🎡 fix mypy and pip-audit
* feat: 🎸 memoize the result of the parquet query
  I set the cache size to 1024 because we memoize the index() function for 128 splits, which means that here we memoize the result of 8 queries per split on average.
* refactor: 💡 refactor as two classes: Indexer and RowsIndex
  The LRU cache will store up to 128 RowsIndexes (i.e. an index of the rows of 128 dataset splits), and up to 1,024 queries (i.e. 8 queries per split on average).
* Update services/api/src/api/routes/rows.py
  Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* Update services/api/src/api/routes/rows.py
  Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* style: 💄 fix long line
* fix: 🐛 fix a bug: the dataset names can contain a dash, e.g. openwebtext-10k
* fix: 🐛 another fix on parsing the parquet file names
  The previous fix (to support builder names with dashes in them) was breaking the detection of the shard number. Note that we cannot support split names that contain dashes! I think it's a limitation; maybe we should store each split in its own directory instead of trying to parse the file names.

---------

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
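To illustrate the offset/length and row-group changes described above, here is a minimal sketch of how a request could be mapped to the row groups that need to be read. It is not the code from this commit: the function name `plan_row_group_read` and its return shape are hypothetical, and it works on a plain list of row-group sizes so that it runs without pyarrow.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def plan_row_group_read(
    row_group_sizes: List[int], offset: int, length: int
) -> Tuple[range, int, int]:
    """Hypothetical helper: return (row groups to read, offset inside them, rows to keep)."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    # ends[i] is the total number of rows contained in row groups 0..i.
    ends = list(accumulate(row_group_sizes))
    total = ends[-1] if ends else 0
    if offset >= total or length == 0:
        return range(0), 0, 0
    # Index of the last requested row, clamped to the end of the split.
    last_row = min(offset + length, total) - 1
    # bisect_right finds the first row group whose cumulative end is > row index.
    first_group = bisect_right(ends, offset)
    last_group = bisect_right(ends, last_row)
    start_of_first = ends[first_group] - row_group_sizes[first_group]
    return (
        range(first_group, last_group + 1),
        offset - start_of_first,
        last_row - offset + 1,
    )
```

With pyarrow, the selected groups could then be loaded with `ParquetFile.read_row_groups` and trimmed with `Table.slice(offset, length)` using the two returned integers, so that only the row groups overlapping the request are fetched.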
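The two cache sizes mentioned above (128 indexes, 1,024 queries) can be sketched with `functools.lru_cache`. This is only an illustration under assumptions: the commit uses `Indexer` and `RowsIndex` classes, whereas the functions `get_rows_index` and `query` below are hypothetical stand-ins that return placeholder data.

```python
from functools import lru_cache
from typing import Tuple

MAX_INDEXES = 128   # one cached index per dataset split
MAX_QUERIES = 1024  # i.e. MAX_QUERIES // MAX_INDEXES = 8 queries per split on average


@lru_cache(maxsize=MAX_INDEXES)
def get_rows_index(dataset: str, config: str, split: str) -> Tuple[str, str, str]:
    # Placeholder for the expensive step (listing parquet files, reading metadata).
    return (dataset, config, split)


@lru_cache(maxsize=MAX_QUERIES)
def query(dataset: str, config: str, split: str, offset: int, length: int) -> Tuple[int, ...]:
    # The index lookup is itself memoized, so repeated queries on a split reuse it.
    get_rows_index(dataset, config, split)
    # Placeholder result: the row numbers that would be returned.
    return tuple(range(offset, offset + length))
```

Calling `query.cache_info()` after a few requests shows the hit and miss counts, which is one way to check whether the 8-queries-per-split ratio suits real traffic.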
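The last two fixes concern parsing parquet file names. The sketch below shows why a dash in the builder name is recoverable while a dash in the split name is not. The file name layout `{builder}-{split}[-{shard}-of-{total}].parquet` with five-digit shard numbers, the regular expression, and the function `parse_parquet_filename` are all assumptions made for illustration, not the code from this commit.

```python
import re
from typing import Optional, Tuple

# Assumed layout: "{builder}-{split}.parquet" or
# "{builder}-{split}-{shard:05d}-of-{total:05d}.parquet".
# The builder may contain dashes; the split may not.
FILENAME_PATTERN = re.compile(
    r"^(?P<builder>[\w-]+?)-(?P<split>\w+(?:\.\w+)*?)"
    r"(?:-(?P<shard>\d{5})-of-(?P<total>\d{5}))?\.parquet$"
)


def parse_parquet_filename(filename: str) -> Tuple[str, str, Optional[int]]:
    """Hypothetical parser: return (builder, split, shard number or None)."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected parquet file name: {filename}")
    shard = match.group("shard")
    return match.group("builder"), match.group("split"), int(shard) if shard else None
```

Under these assumptions, `openwebtext-10k-train-00000-of-00001.parquet` parses to builder `openwebtext-10k`, split `train`, shard 0. A split named `my-split` would be misread: `ds-my-split.parquet` yields builder `ds-my` and split `split`, which is the limitation noted in the commit message.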
1 parent a8ac23c, commit 9577665
Showing 16 changed files with 1,251 additions and 17 deletions.