The API is unavailable in production #279

Closed
severo opened this issue May 17, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@severo
Collaborator

severo commented May 17, 2022

k logs datasets-server-prod-api-6c9f9d5cc-6h52g -f
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/src/services/api/.venv/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 369, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 59, in __call__
    return await self.app(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/gzip.py", line 23, in __call__
    await responder(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/gzip.py", line 42, in __call__
    await self.app(scope, receive, self.send_with_gzip)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/base.py", line 57, in __call__
    task_group.cancel_scope.cancel()
  File "/src/services/api/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/middleware/base.py", line 30, in coro
    await self.app(scope, request.receive, send_stream.send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/routing.py", line 64, in app
    await response(scope, receive, send)
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/responses.py", line 139, in __call__
    await send({"type": "http.response.body", "body": self.body})
  File "/src/services/api/.venv/lib/python3.9/site-packages/starlette/exceptions.py", line 68, in sender
    await send(message)
  File "/src/services/api/.venv/lib/python3.9/site-packages/anyio/streams/memory.py", line 221, in send
    raise BrokenResourceError
anyio.BrokenResourceError

It seems like a known error: fastapi/fastapi#4041.

It is possibly caused by a middleware, maybe the PrometheusMiddleware registered here: https://github.com/huggingface/datasets-server/blob/main/services/api/src/api/app.py#L48
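
The traceback passes through `starlette/middleware/base.py`, which is where Starlette's `BaseHTTPMiddleware` lives. A minimal sketch of the alternative style, a plain ASGI middleware that wraps `send` directly instead of relaying messages through a memory stream, assuming that counting responses is all that is needed. `RequestCounterMiddleware` and `hello_app` are hypothetical names, not the project's code, and nothing here shows that this change resolves the error:

```python
import asyncio
from collections import Counter


class RequestCounterMiddleware:
    """Plain ASGI middleware (hypothetical): counts responses per status code."""

    def __init__(self, app):
        self.app = app
        self.status_counts = Counter()

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass websocket/lifespan traffic through untouched.
            await self.app(scope, receive, send)
            return

        async def counting_send(message):
            # The status code is carried by the "http.response.start" message.
            if message["type"] == "http.response.start":
                self.status_counts[message["status"]] += 1
            await send(message)

        await self.app(scope, receive, counting_send)


async def hello_app(scope, receive, send):
    """Hypothetical downstream ASGI app, used only for the demo."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def demo():
    """Drive one fake HTTP request through the middleware."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    middleware = RequestCounterMiddleware(hello_app)
    await middleware({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return middleware.status_counts, sent
```

Running `asyncio.run(demo())` returns the counter and the two messages forwarded to the server unchanged.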

severo added the bug and high-priority labels May 17, 2022
severo closed this as completed in 443f00f May 17, 2022
@severo
Collaborator Author

severo commented May 17, 2022

It is possibly due to the multiple database queries made on every call to the /metrics endpoint:

https://github.com/huggingface/datasets-server/blob/443f00f9eac8165ac53541873c8331091f91821e/services/api/src/api/prometheus.py#L47-L55

I don't know how we should manage this data. Possibly the workers should send the counts to a Prometheus gateway, instead of the API computing them on every call. Another way (at least for the queue) is to use RabbitMQ instead of keeping the queue logic in our code; surely we could get the metrics easily from there.
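
A rough sketch of the first idea, assuming the gateway is a Prometheus Pushgateway, which accepts metrics in the text exposition format under `/metrics/job/<job>`. The gateway URL, the job name `datasets_worker`, the metric name `queue_jobs` and the counts are all made up for illustration. The request is only built here, not sent:

```python
import urllib.request


def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a status label value to its count.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for status, value in sorted(samples.items()):
        lines.append(f'{name}{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


def build_push_request(gateway_url, job, body):
    """Build (but do not send) a PUT request replacing the metrics of `job`."""
    return urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical counts and endpoint, for illustration only.
payload = render_gauge(
    "queue_jobs",
    "Number of jobs in the queue, by status.",
    {"waiting": 12, "started": 3},
)
request = build_push_request(
    "http://pushgateway.example:9091", "datasets_worker", payload
)
# A worker would then call urllib.request.urlopen(request) when a job
# changes state, so the API no longer has to count on every scrape.
```

With this split, Prometheus scrapes the gateway and the `/metrics` endpoint of the API never has to query the database for these counts.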

cc @McPatate ?

@severo
Collaborator Author

severo commented May 17, 2022

Confirmed: when only the API is running, the CPU load on the MongoDB machine increases to 100%

[Screenshot, 2022-05-17 at 16:00:55]

I'll remove the metrics for now 😢 and try to implement them better.

Possibly an index could help, but I'm not sure it's a good idea anyway to query the database when /metrics is called.

@XciD
Member

XciD commented May 18, 2022

RabbitMQ will resolve some of your problems, but will generate new ones :)

Querying Mongo should not cause such high CPU usage; as you said, adding indexes should help.
Also, we can increase the time between queries in Prometheus; you may not need 15s granularity.
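
A related option on the API side, if the counts do not need to be fresher than some interval, is to cache them between scrapes so that the database is queried at most once per interval, however often `/metrics` is called. A minimal sketch, assuming a hypothetical `count_jobs` function standing in for the real MongoDB queries and an arbitrary 60 s lifetime:

```python
import threading
import time


class CachedValue:
    """Cache the result of an expensive callable for `ttl` seconds."""

    def __init__(self, compute, ttl, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = None  # None means "never computed"

    def get(self):
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                # Only this branch runs the expensive computation.
                self._value = self._compute()
                self._expires_at = now + self._ttl
            return self._value


calls = []


def count_jobs():
    """Hypothetical stand-in for the real database count queries."""
    calls.append(1)
    return {"waiting": 12, "started": 3}


# A fake clock makes the expiry visible without sleeping.
fake_now = [0.0]
cached_counts = CachedValue(count_jobs, ttl=60, clock=lambda: fake_now[0])

cached_counts.get()  # first call: runs the query
cached_counts.get()  # within 60 s: served from the cache
fake_now[0] = 61.0
cached_counts.get()  # after expiry: runs the query again
```

The trade-off is staleness: a value can be up to `ttl` seconds old when it is scraped.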

@severo
Collaborator Author

severo commented May 18, 2022

OK, perfect, I'll try to improve the queries and re-enable the metrics

@McPatate
Member

Also if you can, you should be able to store all the running jobs in memory without querying the database :)
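
A minimal sketch of what keeping the running jobs in memory could look like, assuming the process that exposes the metrics is also the one that starts and finishes the jobs. The class `RunningJobs`, the job ids and the job types are hypothetical:

```python
import threading


class RunningJobs:
    """Thread-safe in-memory registry of the jobs currently being processed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}  # job_id -> job type

    def start(self, job_id, job_type):
        with self._lock:
            self._jobs[job_id] = job_type

    def finish(self, job_id):
        with self._lock:
            # pop() with a default, so finishing a job twice is harmless.
            self._jobs.pop(job_id, None)

    def count_by_type(self):
        """Return the number of running jobs per type, without any query."""
        with self._lock:
            counts = {}
            for job_type in self._jobs.values():
                counts[job_type] = counts.get(job_type, 0) + 1
            return counts


running = RunningJobs()
running.start("job-1", "splits")
running.start("job-2", "first-rows")
running.start("job-3", "splits")
running.finish("job-1")
snapshot = running.count_by_type()
```

If jobs run in separate worker processes, each process would only see its own jobs, so the counts would have to be exposed per worker and summed on the Prometheus side.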

severo added a commit that referenced this issue May 23, 2022
* feat: 🎸 add again the metrics about cache and queue

See #250 and #279

* feat: 🎸 add again the starlette metrics
mattstern31 added a commit to mattstern31/datasets-server-storage-admin that referenced this issue Nov 11, 2023
* fix: 🐛 reserve 256M for the API and nginx pods

The API service seems to need about 52M:

```
process_virtual_memory_bytes 7.59681024e+08
process_resident_memory_bytes 5.2195328e+07
```

* fix: 🐛 remove the PrometheusMiddleware to reduce the RAM usage

Hopefully it fixes
huggingface/dataset-viewer#279. See
fastapi/fastapi#4041