Random rows (#875)
* chore: 🤖 add hffs and pyarrow dependencies

* feat: 🎸 add basic (and old) logic from #687

* feat: 🎸 change from/to parameters to offset/length

and switch from pa.Table.take to pa.Table.slice (thanks @lhoestq -
huggingface/dataset-viewer#687 (comment))
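
For context, a minimal pyarrow sketch of the difference (the table and window values are illustrative): for a contiguous window, Table.slice returns a zero-copy view, while Table.take gathers rows by index and copies them.

```python
import pyarrow as pa

table = pa.table({"col_1": list(range(100))})
offset, length = 10, 5

# Table.take gathers rows by index (works for arbitrary, non-contiguous indices)
by_take = table.take(list(range(offset, offset + length)))

# Table.slice returns a zero-copy view over the contiguous window [offset, offset + length)
by_slice = table.slice(offset, length)

assert by_take.to_pylist() == by_slice.to_pylist()
```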

* style: 💄 fix style

* ci: 🎡 ignore hffs and pyarrow in mypy checks

* chore: 🤖 upgrade pyarrow (to use filesystem)

also list all the modules to ignore for mypy at the same time. thanks
@andreasoria
huggingface/dataset-viewer#875 (comment)

* feat: 🎸 use row groups to reduce the response time

based on @lhoestq implementation in
https://huggingface.co/spaces/lhoestq/datasets-explorer/blob/main/app.py

Still a POC. We are querying datasets-server.huggingface.co (hardcoded)
to get the list of parquet files.
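
Roughly, the row-group idea looks like the following sketch (the function and variable names are hypothetical, not this commit's code): read only the row groups that overlap the requested window instead of decoding the whole file.

```python
import pyarrow as pa
import pyarrow.parquet as pq


def read_window(path: str, offset: int, length: int) -> pa.Table:
    """Read only the row groups overlapping [offset, offset + length)."""
    parquet_file = pq.ParquetFile(path)
    selected = []          # indexes of the row groups to read
    first_group_start = 0  # absolute row index where the first selected group starts
    row_start = 0
    for group in range(parquet_file.metadata.num_row_groups):
        row_end = row_start + parquet_file.metadata.row_group(group).num_rows
        if row_end > offset and row_start < offset + length:
            if not selected:
                first_group_start = row_start
            selected.append(group)
        row_start = row_end
    # Read the selected groups, then trim to the exact requested window.
    return parquet_file.read_row_groups(selected).slice(offset - first_group_start, length)
```

Since row counts live in the parquet footer metadata, only the footer and the selected row groups need to be fetched.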

* refactor: 💡 factorize mypy exceptions

* style: 💄 fix style

* refactor: 💡 fix type

* feat: 🎸 set the hffs commit

until hffs has a proper release

* refactor: 💡 don't show the tqdm bars

* fix: 🐛 replace the /parquet step by config-parquet

* feat: 🎸 add code profiling

* test: 💍 nit: typos

* feat: 🎸 get parquet files from cache, and add code profiling

* fix: 🐛 fix style and test

* fix: 🐛 pass hf_token to mount the filesystem on gated datasets

also: fix the parameter that disables tqdm. also: add e2e tests
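
For context, a hedged sketch of the mount; the HfFileSystem constructor arguments and the paths below are assumptions based on the pre-release hffs of the time (pinned by commit in pyproject.toml), not code from this commit.

```python
import pyarrow.parquet as pq
from hffs import HfFileSystem  # assumed pre-release API

hf_token = "hf_..."  # placeholder; the service gets it from its configuration

# Without the token, mounting a gated (or private) dataset's parquet branch
# fails the Hub authentication check.
fs = HfFileSystem(
    "user/dataset",  # hypothetical repository id
    repo_type="dataset",
    revision="refs/convert/parquet",
    token=hf_token,
)
with fs.open("default/train/0000.parquet") as f:  # hypothetical shard path
    num_rows = pq.ParquetFile(f).metadata.num_rows
```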

* refactor: 💡 remove dead code

* ci: 🎡 increase the timeout limit for e2e tests

in case it's what makes the e2e fail (see
https://github.com/huggingface/datasets-server/actions/runs/4428593828/jobs/7768515700#step:7:131
for example)

* ci: 🎡 no need to increase to 30s

* Update services/api/src/api/routes/rows.py

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>

* style: 💄 fix style

* feat: 🎸 use the same format as /first-rows for /rows

* ci: 🎡 fix mypy and pip-audit

* feat: 🎸 memoize the result of the parquet query

I set it to 1024 because we memoize the index() function for 128
splits, which means we memoize the result for 8 queries per split
on average.

* refactor: 💡 refactor as two classes: Indexer and RowsIndex

The LRU cache will store up to 128 RowsIndexes (i.e. an index of the rows
of 128 dataset splits) and up to 1,024 queries (i.e. 8 queries per split
on average).
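
A hedged sketch of that two-level cache (the class bodies and cached_query are placeholders, not this commit's code):

```python
from functools import lru_cache

import pyarrow as pa


class RowsIndex:
    # One instance per dataset split; holds whatever is needed (filesystem,
    # parquet metadata) to answer range queries. Details elided in this sketch.
    def __init__(self, dataset: str, config: str, split: str) -> None:
        self.dataset, self.config, self.split = dataset, config, split

    def query(self, offset: int, length: int) -> pa.Table:
        # Placeholder result; the real method reads only the overlapping row groups.
        return pa.table({"row_idx": list(range(offset, offset + length))})


class Indexer:
    @lru_cache(maxsize=128)  # keep up to 128 split indexes alive
    def get_rows_index(self, dataset: str, config: str, split: str) -> RowsIndex:
        return RowsIndex(dataset, config, split)


indexer = Indexer()


@lru_cache(maxsize=1024)  # ≈ 8 cached query results per cached split, on average
def cached_query(dataset: str, config: str, split: str, offset: int, length: int) -> pa.Table:
    return indexer.get_rows_index(dataset, config, split).query(offset, length)
```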

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Update services/api/src/api/routes/rows.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* style: 💄 fix long line

* fix: 🐛 fix a bug: the dataset names can contain a dash

e.g. openwebtext-10k

* fix: 🐛 another fix to the parsing of the parquet file names

The previous fix (to support builder names with dashes in them) was
breaking the detection of the shard number. Note that we cannot support
split names that contain dashes. This is a limitation; maybe we should
store each split in its own directory instead of trying to parse the
file names.
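
To illustrate the limitation (this regex is a hypothetical reconstruction of the naming scheme, not this commit's code):

```python
import re

# Hypothetical reconstruction of the shard naming scheme:
#   {builder}-{split}-{shard:05d}-of-{n_shards:05d}.parquet
PATTERN = re.compile(
    r"^(?P<builder>.+)-(?P<split>\w+)-(?P<shard>\d{5})-of-(?P<n_shards>\d{5})\.parquet$"
)

m = PATTERN.match("openwebtext-10k-train-00000-of-00002.parquet")
assert m is not None
assert m["builder"] == "openwebtext-10k" and m["split"] == "train"

# A split named "my-split" would parse as builder "...-my" + split "split",
# so split names containing dashes cannot be recovered reliably.
```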

---------

Co-authored-by: Andrea Francis Soria Jimenez <andrea@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
3 people committed Mar 27, 2023
1 parent a8ac23c commit 9577665
Showing 16 changed files with 1,251 additions and 17 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/_quality-python.yml
@@ -53,4 +53,4 @@ jobs:
- name: Run bandit
run: poetry run bandit -r src
- name: Run pip-audit
run: bash -c "poetry run pip-audit -r <(poetry export -f requirements.txt --with dev | sed '/^pymongo==/,+109 d' | sed '/^requests==2.28.2 ;/,+2 d' | sed '/^kenlm @/d' | sed '/^torch @/d' | sed '/^torchaudio @/d' | sed '/^libcommon @/d' | sed '/^trec-car-tools @/d' | sed '/^hffs @/d')"
run: bash -c "poetry run pip-audit -r <(poetry export -f requirements.txt --with dev | sed '/^pymongo==/,+109 d' | sed '/^requests==2.28.2 ;/,+2 d' | sed '/^kenlm @/d' | sed '/^fsspec==/,+2 d' | sed '/^torch @/d' | sed '/^torchaudio @/d' | sed '/^libcommon @/d' | sed '/^trec-car-tools @/d' | sed '/^hffs @/d')"
25 changes: 25 additions & 0 deletions e2e/tests/test_11_auth.py
@@ -50,3 +50,28 @@ def test_auth_e2e(
expected_error_code=expected_error_code,
headers=headers,
)

# ensure the /rows endpoint works as well
offset = 1
limit = 10
rows_response = poll_until_ready_and_assert(
relative_url=f"/rows?dataset={dataset}&config={config}&split={split}&offset={offset}&limit={limit}",
expected_status_code=expected_status_code,
expected_error_code=expected_error_code,
headers=headers,
)
if not expected_error_code:
content = rows_response.json()
assert "rows" in content, rows_response
rows = content["rows"]
assert isinstance(rows, list), rows
assert len(rows) == 3, rows
assert rows[0] == {"row_idx": 1, "row": {"col_1": 1, "col_2": 1, "col_3": 1.0}, "truncated_cells": []}, rows[0]
assert "features" in content, rows_response
features = content["features"]
assert isinstance(features, list), features
assert features == [
{"feature_idx": 0, "name": "col_1", "type": {"dtype": "int64", "_type": "Value"}},
{"feature_idx": 1, "name": "col_2", "type": {"dtype": "int64", "_type": "Value"}},
{"feature_idx": 2, "name": "col_3", "type": {"dtype": "float64", "_type": "Value"}},
], features
3 changes: 2 additions & 1 deletion e2e/tests/utils.py
@@ -108,7 +108,7 @@ def poll_until_ready_and_assert(
expected_error_code: Optional[str],
headers: Optional[Headers] = None,
url: str = URL,
) -> None:
) -> Any:
if headers is None:
headers = {}
interval = INTERVAL
@@ -126,6 +126,7 @@
raise RuntimeError("Poll timeout")
assert response.status_code == expected_status_code, log(response, url)
assert response.headers.get("X-Error-Code") == expected_error_code, log(response, url)
return response


# explicit re-export
811 changes: 809 additions & 2 deletions services/api/poetry.lock

Large diffs are not rendered by default.

15 changes: 12 additions & 3 deletions services/api/pyproject.toml
@@ -6,10 +6,13 @@ version = "0.1.3"
license = "Apache-2.0"

[tool.poetry.dependencies]
datasets = "^2.10.1"
environs = "^9.5.0"
hffs = {git = "https://github.com/huggingface/hffs.git", rev="0e187e74d38e9436353691f4a7a26b15f0663f58"}
jsonschema = "^4.17.0"
libcommon = {path = "../../libs/libcommon", develop = true}
pyjwt = { extras = ["crypto"], version = "^2.6.0" }
pyarrow = "^11.0.0"
python = "3.9.15"
requests = "^2.28.2"
starlette = "^0.25.0"
@@ -56,9 +59,15 @@ preview = true
[tool.mypy]
strict = true
disallow_untyped_calls = false
# ^ call to expected_algorithm.from_jwk force to set this to false
# ^ call to expected_algorithm.from_jwk forces to set this to false

[[tool.mypy.overrides]]
module = "prometheus_client.*"
ignore_missing_imports = true
module = [
"datasets.*",
"hffs.*",
"prometheus_client.*",
"pyarrow.*",
"tqdm.*"
]
# ^ prometheus_client is now typed, but starlette-prometheus requires an old version
ignore_missing_imports = true
20 changes: 20 additions & 0 deletions services/api/src/api/app.py
@@ -17,6 +17,7 @@
from api.prometheus import Prometheus
from api.routes.endpoint import EndpointsDefinition, create_endpoint
from api.routes.healthcheck import healthcheck_endpoint
from api.routes.rows import create_rows_endpoint
from api.routes.valid import create_is_valid_endpoint, create_valid_endpoint
from api.routes.webhook import create_webhook_endpoint

@@ -46,6 +47,10 @@ def create_app_with_config(app_config: AppConfig, endpoint_config: EndpointConfi
if app_config.api.hf_jwt_public_key_url and app_config.api.hf_jwt_algorithm
else None
)
parquet_processing_steps_by_input_type = endpoints_definition.steps_by_input_type_and_endpoint.get("/parquet")
if not parquet_processing_steps_by_input_type or not parquet_processing_steps_by_input_type["config"]:
raise RuntimeError("The parquet endpoint is not configured. Exiting.")
config_parquet_processing_steps = parquet_processing_steps_by_input_type["config"]

middleware = [
Middleware(
@@ -118,6 +123,21 @@ def create_app_with_config(app_config: AppConfig, endpoint_config: EndpointConfi
methods=["POST"],
),
# ^ called by the Hub webhooks
Route(
"/rows",
endpoint=create_rows_endpoint(
config_parquet_processing_steps=config_parquet_processing_steps,
init_processing_steps=init_processing_steps,
hf_endpoint=app_config.common.hf_endpoint,
hf_token=app_config.common.hf_token,
hf_jwt_public_key=hf_jwt_public_key,
hf_jwt_algorithm=app_config.api.hf_jwt_algorithm,
external_auth_url=app_config.api.external_auth_url,
hf_timeout_seconds=app_config.api.hf_timeout_seconds,
max_age_long=app_config.api.max_age_long,
max_age_short=app_config.api.max_age_short,
),
),
]

return Starlette(routes=routes, middleware=middleware, on_shutdown=[resource.release for resource in resources])
2 changes: 1 addition & 1 deletion services/api/src/api/authentication.py
@@ -109,7 +109,7 @@ def auth_check(
try:
logging.debug(
f"Checking authentication on the Hugging Face Hub for dataset {dataset}, url: {url}, timeout:"
f" {hf_timeout_seconds}"
f" {hf_timeout_seconds}, authorization: {auth.authorization}"
)
response = requests.get(url, auth=auth, timeout=hf_timeout_seconds)
except Exception as err:
9 changes: 8 additions & 1 deletion services/api/src/api/routes/endpoint.py
@@ -62,7 +62,14 @@ def get_cache_entry_from_steps(
If no successful result is found, it returns the last one, even if it's an error.
If no entry is found at all, it checks whether the job is still in progress for each processing step.
Raises:
ResponseNotReadyError: if no result is found.
- [`~libcommon.dataset.AskAccessHubRequestError`]: if the request to the Hub to get access to the
dataset failed or timed out.
- [`~libcommon.dataset.DatasetInfoHubRequestError`]: if the request to the Hub to get the dataset
info failed or timed out.
- [`~libcommon.operations.PreviousStepError`]: if a previous step has an error
- [`~libcommon.dataset.DatasetError`]: if the dataset could not be accessed or is not supported
- [`~api.utils.ResponseNotFoundError`]: if no result is found.
- [`~api.utils.ResponseNotReadyError`]: if the response is not ready yet.
Returns: the cached record
"""