Commit

Details (#589)
* chore: 🤖 remove useless file

* docs: ✏️ replace deprecated /rows with /first-rows

* feat: 🎸 update protobuf (fixes a security vulnerability)
severo authored Sep 26, 2022
1 parent 25f6a81 commit 72963ce
Showing 7 changed files with 16 additions and 102 deletions.
2 changes: 1 addition & 1 deletion chart/static-files/openapi.json
@@ -1154,7 +1154,7 @@
"description": "The list of the 100 first rows of a dataset split.",
"externalDocs": {
"description": "See First rows (Hub docs)",
-"url": "https://huggingface.co/docs/datasets-server/rows"
+"url": "https://huggingface.co/docs/datasets-server/first-rows"
},
"operationId": "listFirstRows",
"security": [
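The `listFirstRows` operation above takes `dataset`, `config`, and `split` as query parameters (see `services/api/src/api/routes/first_rows.py` further down). A minimal sketch of building such a request URL — note the base URL is an assumption for illustration and does not appear in this diff:

```python
from urllib.parse import urlencode

# Assumed base URL, for illustration only; it is not stated in this diff.
BASE_URL = "https://datasets-server.huggingface.co"

def first_rows_url(dataset: str, config: str, split: str) -> str:
    """Build the query URL for the listFirstRows operation."""
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{BASE_URL}/first-rows?{query}"

print(first_rows_url("glue", "cola", "train"))
```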
16 changes: 8 additions & 8 deletions chart/values.yaml
@@ -132,15 +132,15 @@ worker:
maxMemoryPct: 0
# Max size (in bytes) of the dataset to fallback in normal mode if streaming fails
maxSizeFallback: "100_000_000"
-# Min size of a cell in the /rows endpoint response in bytes
+# Min size of a cell in the /first-rows endpoint response in bytes
minCellBytes: 100
# Directory of the "numba" library cache
numbaCacheDirectory: "/numba-cache"
-# Max size of the /rows endpoint response in bytes
+# Max size of the /first-rows endpoint response in bytes
rowMaxBytes: "1_000_000"
-# Max number of rows in the /rows endpoint response
+# Max number of rows in the /first-rows endpoint response
rowsMaxNumber: 100
-# Min number of rows in the /rows endpoint response
+# Min number of rows in the /first-rows endpoint response
rowsMinNumber: 10
# Number of seconds a worker will sleep before trying to process a new job
workerSleepSeconds: 15
@@ -176,15 +176,15 @@ worker:
maxMemoryPct: 0
# Max size (in bytes) of the dataset to fallback in normal mode if streaming fails
maxSizeFallback: "100_000_000"
-# Min size of a cell in the /rows endpoint response in bytes
+# Min size of a cell in the /first-rows endpoint response in bytes
minCellBytes: 100
# Directory of the "numba" library cache
numbaCacheDirectory: "/numba-cache"
-# Max size of the /rows endpoint response in bytes
+# Max size of the /first-rows endpoint response in bytes
rowMaxBytes: "1_000_000"
-# Max number of rows in the /rows endpoint response
+# Max number of rows in the /first-rows endpoint response
rowsMaxNumber: 100
-# Min number of rows in the /rows endpoint response
+# Min number of rows in the /first-rows endpoint response
rowsMinNumber: 10
# Number of seconds a worker will sleep before trying to process a new job
workerSleepSeconds: 15
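The `rowMaxBytes`, `rowsMaxNumber`, and `rowsMinNumber` values above bound the size of a /first-rows response. A hedged sketch of how such limits could interact — an illustration only, not the worker's actual truncation logic, which this diff does not show:

```python
import json

# Values mirror the chart defaults above; names are illustrative.
ROWS_MAX_BYTES = 1_000_000
ROWS_MAX_NUMBER = 100
ROWS_MIN_NUMBER = 10

def size_in_bytes(obj) -> int:
    # Measure the object as it would appear serialized in the JSON response.
    return len(json.dumps(obj).encode("utf-8"))

def select_rows(rows: list) -> list:
    """Keep at most ROWS_MAX_NUMBER rows, then drop rows from the end while
    the serialized payload exceeds ROWS_MAX_BYTES, but never go below
    ROWS_MIN_NUMBER rows."""
    selected = rows[:ROWS_MAX_NUMBER]
    while len(selected) > ROWS_MIN_NUMBER and size_in_bytes(selected) > ROWS_MAX_BYTES:
        selected = selected[:-1]
    return selected
```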
2 changes: 1 addition & 1 deletion services/api/src/api/routes/first_rows.py
@@ -37,7 +37,7 @@ async def first_rows_endpoint(request: Request) -> Response:
dataset = request.query_params.get("dataset")
config = request.query_params.get("config")
split = request.query_params.get("split")
-logger.info(f"/rows, dataset={dataset}, config={config}, split={split}")
+logger.info(f"/first-rows, dataset={dataset}, config={config}, split={split}")

if not are_valid_parameters([dataset, config, split]):
raise MissingRequiredParameterError("Parameters 'dataset', 'config' and 'split' are required")
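The `are_valid_parameters` check above rejects missing or empty query parameters before the endpoint proceeds. One plausible implementation, sketched as an assumption — the real helper lives elsewhere in the api service and is not part of this diff:

```python
from typing import List, Optional

def are_valid_parameters(parameters: List[Optional[str]]) -> bool:
    # A parameter is valid when it is present and non-empty;
    # query_params.get() returns None for absent parameters.
    return all(p is not None and p != "" for p in parameters)

print(are_valid_parameters(["glue", "cola", "train"]))  # all present
print(are_valid_parameters(["glue", None, "train"]))    # missing config
```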
59 changes: 0 additions & 59 deletions services/worker/.env.example

This file was deleted.

6 changes: 3 additions & 3 deletions services/worker/README.md
@@ -24,8 +24,8 @@ Set environment variables to configure the following aspects:
- `MONGO_QUEUE_DATABASE`: the name of the database used for storing the queue. Defaults to `"datasets_server_queue"`.
- `MONGO_URL`: the URL used to connect to the mongo db server. Defaults to `"mongodb://localhost:27017"`.
- `NUMBA_CACHE_DIR`: directory where the `numba` decorators (used by `librosa`) can write cache. Required on cloud infrastructure (see https://stackoverflow.com/a/63367171/7351594).
-- `ROWS_MAX_BYTES`: the max size of the /rows endpoint response in bytes. Defaults to `1_000_000` (1 MB).
-- `ROWS_MAX_NUMBER`: the max number of rows fetched by the worker for the split, and provided in the /rows endpoint response. Defaults to `100`.
-- `ROWS_MIN_NUMBER`: the min number of rows fetched by the worker for the split, and provided in the /rows endpoint response. Defaults to `10`.
+- `ROWS_MAX_BYTES`: the max size of the /first-rows endpoint response in bytes. Defaults to `1_000_000` (1 MB).
+- `ROWS_MAX_NUMBER`: the max number of rows fetched by the worker for the split, and provided in the /first-rows endpoint response. Defaults to `100`.
+- `ROWS_MIN_NUMBER`: the min number of rows fetched by the worker for the split, and provided in the /first-rows endpoint response. Defaults to `10`.
- `WORKER_QUEUE`: name of the queue the worker will pull jobs from. It can be `splits_responses` or `first_rows_responses`. The `splits_responses` jobs should be much faster than the `first_rows_responses` ones, so many more workers are needed for `first_rows_responses` than for `splits_responses`. Defaults to `splits_responses`.
- `WORKER_SLEEP_SECONDS`: duration in seconds of a worker wait loop iteration, before checking if resources are available and processing a job if any is available. Note that the worker does not sleep on the first loop after finishing a job. Defaults to `15`.
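The environment variables documented above can be read with plain stdlib code. A sketch using the README's defaults — the variable names and defaults come from the README; everything else is illustrative:

```python
import os

# Defaults mirror the README; override via the environment.
ROWS_MAX_BYTES = int(os.environ.get("ROWS_MAX_BYTES", "1_000_000"))
ROWS_MAX_NUMBER = int(os.environ.get("ROWS_MAX_NUMBER", "100"))
ROWS_MIN_NUMBER = int(os.environ.get("ROWS_MIN_NUMBER", "10"))
WORKER_QUEUE = os.environ.get("WORKER_QUEUE", "splits_responses")
WORKER_SLEEP_SECONDS = int(os.environ.get("WORKER_SLEEP_SECONDS", "15"))
```

Note that `int("1_000_000")` is valid in Python 3.6+, so the underscore-grouped defaults shown in the chart values parse directly.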
31 changes: 2 additions & 29 deletions services/worker/poetry.lock


2 changes: 1 addition & 1 deletion services/worker/src/worker/responses/first_rows.py
@@ -89,7 +89,7 @@ def get_rows(
def get_size_in_bytes(obj: Any):
return sys.getsizeof(orjson_dumps(obj))
# ^^ every row is transformed here in a string, because it corresponds to
-# the size the row will contribute in the JSON response to /rows endpoint.
+# the size the row will contribute in the JSON response to /first-rows endpoint.
# the size the row will contribute in the JSON response to /first-rows endpoint.
# The size of the string is measured in bytes.
# An alternative would have been to look at the memory consumption (pympler) but it's
# less related to what matters here (size of the JSON, number of characters in the
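The comment above explains why each row is serialized before being measured: the relevant size is the row's footprint in the /first-rows JSON response, not its in-memory footprint. A stdlib-only sketch of the same idea — the worker uses `orjson_dumps`, substituted here with `json` so the snippet is self-contained:

```python
import json
import sys

def get_size_in_bytes(obj) -> int:
    # Serialize first: what matters is how many characters the row adds
    # to the JSON response. (The worker uses orjson_dumps instead of json.)
    return sys.getsizeof(json.dumps(obj))

print(get_size_in_bytes({"id": 0, "text": "hello"}))
```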
