The API is unavailable in production #279

severo · 2022-05-17T13:32:24Z

k logs datasets-server-prod-api-6c9f9d5cc-6h52g -f

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/src/services/api/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 369, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 59, in __call__
    return await self.app(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/gzip.py", line 23, in __call__
    await responder(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/gzip.py", line 42, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/base.py", line 57, in __call__
    task_group.cancel_scope.cancel()
  File "/src/services/api/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/base.py", line 30, in coro
    await self.app(scope, request.receive, send_stream.send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/routing.py", line 64, in app
    await response(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/responses.py", line 139, in __call__
    await send({"type": "http.response.body", "body": self.body})
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 68, in sender
    await send(message)
  File "/src/services/api/.venv/lib/python3.9/site-packages/anyio/streams/memory.py", line 221, in send
    raise BrokenResourceError
anyio.BrokenResourceError

It seems like a known error: fastapi/fastapi#4041.

Possibly due to a middleware (maybe PrometheusMiddleware here: https://github.com/huggingface/datasets-server/blob/main/services/api/src/api/app.py#L48)
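For context, the traceback goes through `starlette/middleware/base.py`, which is the `BaseHTTPMiddleware` code path. One workaround often suggested for this class of error is to write the middleware as a plain ASGI callable instead. The sketch below is only an illustration of that idea, not the fix applied here: `RequestCounterMiddleware`, `demo_app` and `call_once` are hypothetical names, and the counter stands in for whatever metrics the real middleware records.

```python
import asyncio
from collections import Counter


class RequestCounterMiddleware:
    """Pure ASGI middleware: wraps the app directly, without BaseHTTPMiddleware."""

    def __init__(self, app):
        self.app = app
        self.status_counts = Counter()  # hypothetical in-process metric store

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket events through untouched.
            await self.app(scope, receive, send)
            return

        async def send_wrapper(message):
            # Record the status code when the response starts.
            if message["type"] == "http.response.start":
                self.status_counts[message["status"]] += 1
            await send(message)

        await self.app(scope, receive, send_wrapper)


async def demo_app(scope, receive, send):
    """Minimal ASGI app used only to exercise the middleware."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call_once(app):
    """Send one fake HTTP request through `app` and collect the sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return sent
```

With Starlette, a class like this can be registered with `app.add_middleware(RequestCounterMiddleware)`. Whether it would avoid the `BrokenResourceError` in this deployment is untested.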


severo · 2022-05-17T13:55:24Z

Possibly it's due to the multiple queries to the database done on every call to the /metrics endpoint:

https://github.com/huggingface/datasets-server/blob/443f00f9eac8165ac53541873c8331091f91821e/services/api/src/api/prometheus.py#L47-L55

I don't know how we should manage this data. Possibly the worker should send the counts to a Prometheus gateway, instead of the API computing them on every call. Another way (at least for the queue) is to use RabbitMQ instead of having the queue logic in the code; surely we could get the metrics easily from there.

cc @McPatate ?

severo · 2022-05-17T14:03:07Z

Confirmed: when only the API is running, the CPU load on the MongoDB machine increases to 100%

I'll remove the metrics for now 😢 and try to implement them better.

Possibly an index could help, but I'm not sure it's a good idea anyway to query the database when /metrics is called.
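One way to avoid hitting the database on every `/metrics` call, sketched here under assumptions, is to cache the computed counts for a fixed time and only recompute once the cache has expired. `CachedMetric`, its `compute` callback and the `ttl_seconds` value are hypothetical; in practice `compute` would wrap the existing count queries.

```python
import threading
import time
from typing import Callable, Dict, Optional


class CachedMetric:
    """Caches the result of an expensive `compute` call for `ttl_seconds`."""

    def __init__(
        self,
        compute: Callable[[], Dict[str, int]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._lock = threading.Lock()
        self._value: Optional[Dict[str, int]] = None
        self._expires_at: Optional[float] = None

    def get(self) -> Optional[Dict[str, int]]:
        with self._lock:
            now = self._clock()
            # Recompute only on the first call or after the TTL has elapsed.
            if self._expires_at is None or now >= self._expires_at:
                self._value = self._compute()
                self._expires_at = now + self._ttl
            return self._value
```

The `/metrics` handler would then call `get()` instead of querying directly, so the database sees at most one round of count queries per TTL window regardless of how often the endpoint is scraped.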

XciD · 2022-05-18T07:17:46Z

RabbitMQ will resolve some of your problems, but will generate new ones :)

Querying Mongo should not cause such high CPU usage; as you said, adding indexes should help.
Also, we can increase the time between queries in Prometheus; you may not need 15s granularity.

severo · 2022-05-18T16:11:06Z

OK, perfect, I'll try to improve the queries and re-enable the metrics

McPatate · 2022-05-18T21:12:36Z

Also if you can, you should be able to store all the running jobs in memory without querying the database :)
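A minimal sketch of that idea, assuming a single process owns the job lifecycle: keep the status of each job in a dictionary and derive the counts from it, so reading the metrics never touches the database. `RunningJobs` and the status strings are hypothetical names for illustration.

```python
import threading
from collections import Counter
from typing import Dict


class RunningJobs:
    """In-memory view of job statuses, updated as jobs change state."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status_by_job: Dict[str, str] = {}

    def set_status(self, job_id: str, status: str) -> None:
        # Called whenever a job is created or moves to a new status.
        with self._lock:
            self._status_by_job[job_id] = status

    def finish(self, job_id: str) -> None:
        # Forget the job once it is done; unknown ids are ignored.
        with self._lock:
            self._status_by_job.pop(job_id, None)

    def counts(self) -> Dict[str, int]:
        # Number of tracked jobs per status, computed from memory only.
        with self._lock:
            return dict(Counter(self._status_by_job.values()))
```

Note that with several workers or pods, each process would only see its own jobs, so the counts would still need to be aggregated somewhere.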


* fix: 🐛 reserve 256M for the API and nginx pods

  The API service seems to need about 52M:

  ```
  process_virtual_memory_bytes 7.59681024e+08
  process_resident_memory_bytes 5.2195328e+07
  ```

* fix: 🐛 remove the PrometheusMiddleware to reduce the RAM usage

  Hopefully it fixes huggingface/dataset-viewer#279. See fastapi/fastapi#4041

severo added the bug and high-priority labels May 17, 2022

severo added a commit that referenced this issue May 17, 2022

fix: 🐛 remove the PrometheusMiddleware to reduce the RAM usage

969fda3

Hopefully it fixes #279. See fastapi/fastapi#4041

severo closed this as completed in 443f00f May 17, 2022

This was referenced May 17, 2022

Use a kubernetes infrastructure #223

Closed

Setup prometheus + grafana #250

Closed

severo added a commit that referenced this issue May 23, 2022

feat: 🎸 add again the metrics about cache and queue

34e6444

See #250 and #279

severo added a commit that referenced this issue May 23, 2022

Reenable metrics (#298)

f68c206

* feat: 🎸 add again the metrics about cache and queue See #250 and #279 * feat: 🎸 add again the starlette metrics
