Random rows (#875)
* chore: 🤖 add hffs and pyarrow dependencies

* feat: 🎸 add basic (and old) logic from #687

* feat: 🎸 change from/to parameters to offset/length

and switch from pa.Table.take to pa.Table.slice (thanks @lhoestq -
huggingface/dataset-viewer#687 (comment))
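
For context, a minimal pyarrow sketch of the difference (the table and window values are illustrative): for a contiguous window, Table.slice returns a zero-copy view, while Table.take gathers rows by index and copies them.

```python
import pyarrow as pa

table = pa.table({"col_1": list(range(100))})
offset, length = 10, 5

# Table.take gathers rows by index (works for arbitrary, non-contiguous indices)
by_take = table.take(list(range(offset, offset + length)))

# Table.slice returns a zero-copy view over the contiguous window [offset, offset + length)
by_slice = table.slice(offset, length)

assert by_take.to_pylist() == by_slice.to_pylist()
```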

* style: 💄 fix style

* ci: 🎡 ignore hffs and pyarrow in mypy checks

* chore: 🤖 upgrade pyarrow (to use filesystem)

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
huggingface/dataset-viewer#875 (comment)

* feat: 🎸 use row groups to reduce the response time

based on @lhoestq implementation in
https://huggingface.co/spaces/lhoestq/datasets-explorer/blob/main/app.py

Still a POC. We are querying datasets-server.huggingface.co (hardcoded)
to get the list of parquet files.
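
Roughly, the row-group idea looks like the following sketch (the function and variable names are hypothetical, not this commit's code): read only the row groups that overlap the requested window instead of decoding the whole file.

```python
import pyarrow as pa
import pyarrow.parquet as pq


def read_window(path: str, offset: int, length: int) -> pa.Table:
    """Read only the row groups overlapping [offset, offset + length)."""
    parquet_file = pq.ParquetFile(path)
    selected = []          # indexes of the row groups to read
    first_group_start = 0  # absolute row index where the first selected group starts
    row_start = 0
    for group in range(parquet_file.metadata.num_row_groups):
        row_end = row_start + parquet_file.metadata.row_group(group).num_rows
        if row_end > offset and row_start < offset + length:
            if not selected:
                first_group_start = row_start
            selected.append(group)
        row_start = row_end
    # Read the selected groups, then trim to the exact requested window.
    return parquet_file.read_row_groups(selected).slice(offset - first_group_start, length)
```

Since row counts live in the parquet footer metadata, only the footer and the selected row groups need to be fetched.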

* refactor: 💡 factorize mypy exceptions

* style: 💄 fix style

* refactor: 💡 fix type

* feat: 🎸 set the hffs commit

until hffs has a proper release

* refactor: 💡 don't show the tqdm bars

* fix: 🐛 replace the /parquet step by config-parquet

* feat: 🎸 add code profiling

* test: 💍 nit: typos

* feat: 🎸 get parquet files from cache, and add code profiling

* fix: 🐛 fix style and test

* fix: 🐛 pass hf_token to mount the filesystem on gated datasets

also: fix the parameter that disables tqdm. also: add e2e tests
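
For context, a hedged sketch of the mount; the HfFileSystem constructor arguments and the paths below are assumptions based on the pre-release hffs of the time (pinned by commit in pyproject.toml), not code from this commit.

```python
import pyarrow.parquet as pq
from hffs import HfFileSystem  # assumed pre-release API

hf_token = "hf_..."  # placeholder; the service gets it from its configuration

# Without the token, mounting a gated (or private) dataset's parquet branch
# fails the Hub authentication check.
fs = HfFileSystem(
    "user/dataset",  # hypothetical repository id
    repo_type="dataset",
    revision="refs/convert/parquet",
    token=hf_token,
)
with fs.open("default/train/0000.parquet") as f:  # hypothetical shard path
    num_rows = pq.ParquetFile(f).metadata.num_rows
```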

* refactor: 💡 remove dead code

* ci: 🎡 increase the timeout limit for e2e tests

in case it's what makes the e2e fail (see
https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131
for example)

* ci: 🎡 no need to increase to 30s

* Update services/api/src/api/routes/rows.py

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>

* style: 💄 fix style

* feat: 🎸 use the same format as /first-rows for /rows

* ci: 🎡 fix mypy and pip-audit

* feat: 🎸 memoize the result of the parquet query

I set it to 1024 because we memoize the index() function for 128
splits, which means we memoize the result for 8 queries per split
on average.

* refactor: 💡 refactor as two classes: Indexer and RowsIndex

The LRU cache will store up to 128 RowsIndexes (i.e. an index of the rows
of 128 dataset splits) and up to 1,024 queries (i.e. 8 queries per split
on average).
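
A hedged sketch of that two-level cache (the class bodies and cached_query are placeholders, not this commit's code):

```python
from functools import lru_cache

import pyarrow as pa


class RowsIndex:
    # One instance per dataset split; holds whatever is needed (filesystem,
    # parquet metadata) to answer range queries. Details elided in this sketch.
    def __init__(self, dataset: str, config: str, split: str) -> None:
        self.dataset, self.config, self.split = dataset, config, split

    def query(self, offset: int, length: int) -> pa.Table:
        # Placeholder result; the real method reads only the overlapping row groups.
        return pa.table({"row_idx": list(range(offset, offset + length))})


class Indexer:
    @lru_cache(maxsize=128)  # keep up to 128 split indexes alive
    def get_rows_index(self, dataset: str, config: str, split: str) -> RowsIndex:
        return RowsIndex(dataset, config, split)


indexer = Indexer()


@lru_cache(maxsize=1024)  # ≈ 8 cached query results per cached split, on average
def cached_query(dataset: str, config: str, split: str, offset: int, length: int) -> pa.Table:
    return indexer.get_rows_index(dataset, config, split).query(offset, length)
```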

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* style: 💄 fix long line

* fix: 🐛 fix a bug: the dataset names can contain a dash

e.g. openwebtext-10k

* fix: 🐛 another fix to the parsing of the parquet file names

The previous fix (to support builder names with dashes in them) was
breaking the detection of the shard number. Note that we cannot support
split names that contain dashes. This is a limitation; maybe we should
store each split in its own directory instead of trying to parse the
file names.
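
To illustrate the limitation (this regex is a hypothetical reconstruction of the naming scheme, not this commit's code):

```python
import re

# Hypothetical reconstruction of the shard naming scheme:
#   {builder}-{split}-{shard:05d}-of-{n_shards:05d}.parquet
PATTERN = re.compile(
    r"^(?P<builder>.+)-(?P<split>\w+)-(?P<shard>\d{5})-of-(?P<n_shards>\d{5})\.parquet$"
)

m = PATTERN.match("openwebtext-10k-train-00000-of-00002.parquet")
assert m is not None
assert m["builder"] == "openwebtext-10k" and m["split"] == "train"

# A split named "my-split" would parse as builder "...-my" + split "split",
# so split names containing dashes cannot be recovered reliably.
```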

---------

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
3 people committed Mar 27, 2023
1 parent a8ac23c commit 9577665
Showing 16 changed files with 1,251 additions and 17 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/_quality-python.yml
@@ -53,4 +53,4 @@ jobs:
- name: Run bandit
run: poetry run bandit -r src
- name: Run pip-audit
run: bash -c "poetry run pip-audit -r <(poetry export -f requirements.txt --with dev | sed '/^pymongo==/,+109 d' | sed '/^requests==2.28.2 ;/,+2 d' | sed '/^kenlm @/d' | sed '/^torch @/d' | sed '/^torchaudio @/d' | sed '/^libcommon @/d' | sed '/^trec-car-tools @/d' | sed '/^hffs @/d')"
run: bash -c "poetry run pip-audit -r <(poetry export -f requirements.txt --with dev | sed '/^pymongo==/,+109 d' | sed '/^requests==2.28.2 ;/,+2 d' | sed '/^kenlm @/d' | sed '/^fsspec==/,+2 d' | sed '/^torch @/d' | sed '/^torchaudio @/d' | sed '/^libcommon @/d' | sed '/^trec-car-tools @/d' | sed '/^hffs @/d')"
25 changes: 25 additions & 0 deletions e2e/tests/test_11_auth.py
@@ -50,3 +50,28 @@ def test_auth_e2e(
expected_error_code=expected_error_code,
headers=headers,
)

# ensure the /rows endpoint works as well
offset = 1
limit = 10
rows_response = poll_until_ready_and_assert(
relative_url=f"/rows?dataset={dataset}&config={config}&split={split}&offset={offset}&limit={limit}",
expected_status_code=expected_status_code,
expected_error_code=expected_error_code,
headers=headers,
)
if not expected_error_code:
content = rows_response.json()
assert "rows" in content, rows_response
rows = content["rows"]
assert isinstance(rows, list), rows
assert len(rows) == 3, rows
assert rows[0] == {"row_idx": 1, "row": {"col_1": 1, "col_2": 1, "col_3": 1.0}, "truncated_cells": []}, rows[0]
assert "features" in content, rows_response
features = content["features"]
assert isinstance(features, list), features
assert features == [
{"feature_idx": 0, "name": "col_1", "type": {"dtype": "int64", "_type": "Value"}},
{"feature_idx": 1, "name": "col_2", "type": {"dtype": "int64", "_type": "Value"}},
{"feature_idx": 2, "name": "col_3", "type": {"dtype": "float64", "_type": "Value"}},
], features
3 changes: 2 additions & 1 deletion e2e/tests/utils.py
@@ -108,7 +108,7 @@ def poll_until_ready_and_assert(
expected_error_code: Optional[str],
headers: Optional[Headers] = None,
url: str = URL,
) -> None:
) -> Any:
if headers is None:
headers = {}
interval = INTERVAL
@@ -126,6 +126,7 @@
raise RuntimeError("Poll timeout")
assert response.status_code == expected_status_code, log(response, url)
assert response.headers.get("X-Error-Code") == expected_error_code, log(response, url)
return response


# explicit re-export
811 changes: 809 additions & 2 deletions services/api/poetry.lock

Large diffs are not rendered by default.

15 changes: 12 additions & 3 deletions services/api/pyproject.toml
@@ -6,10 +6,13 @@ version = "0.1.3"
license = "Apache-2.0"

[tool.poetry.dependencies]
datasets = "^2.10.1"
environs = "^9.5.0"
hffs = {git = "https://github.com/huggingface/hffs.git", rev="0e187e74d38e9436353691f4a7a26b15f0663f58"}
jsonschema = "^4.17.0"
libcommon = {path = "../../libs/libcommon", develop = true}
pyjwt = { extras = ["crypto"], version = "^2.6.0" }
pyarrow = "^11.0.0"
python = "3.9.15"
requests = "^2.28.2"
starlette = "^0.25.0"
@@ -56,9 +59,15 @@ preview = true
[tool.mypy]
strict = true
disallow_untyped_calls = false
# ^ call to expected_algorithm.from_jwk force to set this to false
# ^ call to expected_algorithm.from_jwk forces to set this to false

[[tool.mypy.overrides]]
module = "prometheus_client.*"
ignore_missing_imports = true
module = [
"datasets.*",
"hffs.*",
"prometheus_client.*",
"pyarrow.*",
"tqdm.*"
]
# ^ prometheus_client is now typed, but starlette-prometheus requires an old version
ignore_missing_imports = true
20 changes: 20 additions & 0 deletions services/api/src/api/app.py
@@ -17,6 +17,7 @@
from api.prometheus import Prometheus
from api.routes.endpoint import EndpointsDefinition, create_endpoint
from api.routes.healthcheck import healthcheck_endpoint
from api.routes.rows import create_rows_endpoint
from api.routes.valid import create_is_valid_endpoint, create_valid_endpoint
from api.routes.webhook import create_webhook_endpoint

@@ -46,6 +47,10 @@ def create_app_with_config(app_config: AppConfig, endpoint_config: EndpointConfi
if app_config.api.hf_jwt_public_key_url and app_config.api.hf_jwt_algorithm
else None
)
parquet_processing_steps_by_input_type = endpoints_definition.steps_by_input_type_and_endpoint.get("/parquet")
if not parquet_processing_steps_by_input_type or not parquet_processing_steps_by_input_type["config"]:
raise RuntimeError("The parquet endpoint is not configured. Exiting.")
config_parquet_processing_steps = parquet_processing_steps_by_input_type["config"]

middleware = [
Middleware(
@@ -118,6 +123,21 @@ def create_app_with_config(app_config: AppConfig, endpoint_config: EndpointConfi
methods=["POST"],
),
# ^ called by the Hub webhooks
Route(
"/rows",
endpoint=create_rows_endpoint(
config_parquet_processing_steps=config_parquet_processing_steps,
init_processing_steps=init_processing_steps,
hf_endpoint=app_config.common.hf_endpoint,
hf_token=app_config.common.hf_token,
hf_jwt_public_key=hf_jwt_public_key,
hf_jwt_algorithm=app_config.api.hf_jwt_algorithm,
external_auth_url=app_config.api.external_auth_url,
hf_timeout_seconds=app_config.api.hf_timeout_seconds,
max_age_long=app_config.api.max_age_long,
max_age_short=app_config.api.max_age_short,
),
),
]

return Starlette(routes=routes, middleware=middleware, on_shutdown=[resource.release for resource in resources])
2 changes: 1 addition & 1 deletion services/api/src/api/authentication.py
@@ -109,7 +109,7 @@ def auth_check(
try:
logging.debug(
f"Checking authentication on the Hugging Face Hub for dataset {dataset}, url: {url}, timeout:"
f" {hf_timeout_seconds}"
f" {hf_timeout_seconds}, authorization: {auth.authorization}"
)
response = requests.get(url, auth=auth, timeout=hf_timeout_seconds)
except Exception as err:
9 changes: 8 additions & 1 deletion services/api/src/api/routes/endpoint.py
@@ -62,7 +62,14 @@ def get_cache_entry_from_steps(
If no successful result is found, it returns the last one, even if it's an error.
If no entry is found at all, it checks whether the job is still in progress for each processing step.
Raises:
ResponseNotReadyError: if no result is found.
- [`~libcommon.dataset.AskAccessHubRequestError`]: if the request to the Hub to get access to the
dataset failed or timed out.
- [`~libcommon.dataset.DatasetInfoHubRequestError`]: if the request to the Hub to get the dataset
info failed or timed out.
- [`~libcommon.operations.PreviousStepError`]: if a previous step has an error
- [`~libcommon.dataset.DatasetError`]: if the dataset could not be accessed or is not supported
- [`~api.utils.ResponseNotFoundError`]: if no result is found.
- [`~api.utils.ResponseNotReadyError`]: if the response is not ready yet.
Returns: the cached record
"""