Instrument the application #2

Closed
severo opened this issue Aug 3, 2021 · 9 comments

Comments

@severo
Collaborator

severo commented Aug 3, 2021

Measure the response time, status code, RAM usage, etc., to be able to make decisions (see #1). Also collect statistics about the most common requests (endpoint, dataset, parameters).
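
A minimal sketch of what recording these measurements could look like, using only the standard library. `RequestStats` and `instrumented` are hypothetical names, and the handler shape (keyword parameters in, `(status_code, body)` out) is an assumption for illustration, not the application's actual interface. RAM usage is left out here.

```python
import time
from collections import Counter, defaultdict


class RequestStats:
    """In-memory request statistics (hypothetical helper, not part of the project)."""

    def __init__(self):
        self.status_counts = Counter()      # (endpoint, status_code) -> count
        self.durations = defaultdict(list)  # endpoint -> list of durations in seconds
        self.request_counts = Counter()     # (endpoint, sorted params) -> count

    def record(self, endpoint, status_code, duration, params=None):
        # Sort the parameters so that the same request always maps to the same key.
        key = tuple(sorted((params or {}).items()))
        self.status_counts[(endpoint, status_code)] += 1
        self.durations[endpoint].append(duration)
        self.request_counts[(endpoint, key)] += 1

    def most_common(self, n=3):
        """Return the n most frequent (endpoint, params) combinations."""
        return self.request_counts.most_common(n)


def instrumented(stats, endpoint):
    """Wrap a handler that returns (status_code, body) and record its timing."""
    def decorator(handler):
        def wrapper(**params):
            start = time.perf_counter()
            status_code = 500  # recorded as a server error if the handler raises
            try:
                status_code, body = handler(**params)
                return status_code, body
            finally:
                duration = time.perf_counter() - start
                stats.record(endpoint, status_code, duration, params)
        return wrapper
    return decorator
```

Keeping the counters keyed by endpoint and parameters makes the "most common requests" question a single `most_common()` call; in a real deployment the same calls would more likely feed a metrics backend than an in-memory object.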

@julien-c
Member

julien-c commented Aug 6, 2021

Also probably disk usage, as datasets is probably just filling its cache on disk forever.

@severo
Collaborator Author

severo commented Aug 9, 2021

Yes. The disk usage for datasets cache should be low, since we use streaming (or we error): we never download the data itself. But it's better to monitor than to assume.
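
As a rough illustration of monitoring rather than assuming, the cache directory size can be checked with the standard library. This is a sketch: `directory_size_bytes` and `cache_report` are hypothetical helpers, and the cache path is an assumption to be passed in by the caller.

```python
import os
import shutil


def directory_size_bytes(path):
    """Sum the sizes of regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file disappeared between listing and stat; ignore it.
                continue
    return total


def cache_report(cache_dir):
    """Return the cache size and the free space on the filesystem holding it."""
    usage = shutil.disk_usage(cache_dir)  # named tuple: total, used, free (bytes)
    return {
        "cache_bytes": directory_size_bytes(cache_dir),
        "disk_free_bytes": usage.free,
        "disk_total_bytes": usage.total,
    }
```

Calling `cache_report(...)` periodically and exporting the numbers would show whether the cache really stays small under streaming.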

@severo
Collaborator Author

severo commented Sep 2, 2021

Note that issues like huggingface/datasets#2859 can affect the CPU and memory usage (even block the server completely)

@severo
Collaborator Author

severo commented Sep 2, 2021

Two solutions proposed by the infra team:

https://huggingface.slack.com/archives/CTKK32GE8/p1630588820038000


Morgan:

We have been using Prometheus successfully to report these kinds of metrics on the Inference API.
Some pointers, if it can help:

Philip:

You could also use CloudWatch directly, with either the:

AFAIK most of our production services use Prometheus + Grafana so far, with ELK for additional information, so you could log directly into the existing system for this.

@severo
Collaborator Author

severo commented Sep 23, 2021

Also: alert when 500 errors occur. See #21 (comment)
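
One possible shape for such an alert, as a sketch only: count 5xx responses in a sliding time window and call a notifier once a threshold is reached. `ServerErrorAlert`, the threshold, the window, and the `notify` callback are all assumptions for illustration; an existing alerting stack would usually replace this.

```python
import time
from collections import deque


class ServerErrorAlert:
    """Call `notify` when enough 5xx responses occur within a time window."""

    def __init__(self, threshold, window_seconds, notify, clock=time.monotonic):
        self.threshold = threshold            # number of errors that triggers an alert
        self.window_seconds = window_seconds  # length of the sliding window
        self.notify = notify                  # callback receiving the error count
        self.clock = clock                    # injectable clock, useful for tests
        self.errors = deque()                 # timestamps of recent 5xx responses

    def observe(self, status_code):
        now = self.clock()
        is_server_error = 500 <= status_code < 600
        if is_server_error:
            self.errors.append(now)
        # Drop errors that fell out of the window.
        while self.errors and now - self.errors[0] > self.window_seconds:
            self.errors.popleft()
        if is_server_error and len(self.errors) >= self.threshold:
            self.notify(len(self.errors))
```

With `threshold=1` this alerts on every 500 error; a higher threshold avoids noise from isolated failures.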

@severo
Collaborator Author

severo commented May 12, 2022

Best practices on what to monitor: https://prometheus.io/docs/practices/instrumentation

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator Author

severo commented Sep 16, 2022

Already in place. Let's improve it when needed.

@severo severo closed this as completed Sep 16, 2022