Observability additions #2963
Conversation
Add new data sources (cadvisor for docker containers, dcgm-exporter for nvidia gpus)
Add dashboards for docker / gpus
Add mono-board WIP with variable for Datasource and Job
Not sure how to integrate the new containers into the deployment process, please let me know how that works.
Our deployment setup starts in https://github.com/LAION-AI/Open-Assistant/blob/main/.github/workflows/deploy-to-node.yaml, which references the files in https://github.com/LAION-AI/Open-Assistant/tree/main/ansible
Do you know what Grafana version those dashboards use/need? I tried to copy them in. Not sure if it might be some other issue or not; I was hoping to just copy-paste them in.
It'll be easier to just import it as a new dashboard rather than pasting it into an existing one. I was using a freshly pulled grafana/grafana docker image.
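Importing as a new dashboard can also be done through Grafana's HTTP API instead of the UI. A minimal sketch, assuming Grafana is reachable at localhost:3000 with the default admin credentials and that dashboard.json wraps the exported dashboard in the API's envelope (these details are assumptions, not from this PR):

```shell
# Hypothetical example: push an exported dashboard to a local Grafana instance.
# Assumes default admin:admin credentials and that dashboard.json looks like:
#   {"dashboard": { ...exported dashboard JSON... }, "overwrite": true}
curl -s -X POST http://localhost:3000/api/dashboards/db \
  -u admin:admin \
  -H "Content-Type: application/json" \
  -d @dashboard.json
```

The API route avoids version-specific quirks of pasting JSON into an existing dashboard's edit view.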
I was trying to run in a codespace on this branch and got this:
Not sure what might already be using port 2000.
You can try following some of these to find what is using it: https://www.cyberciti.biz/faq/unix-linux-check-if-port-is-in-use-command/ Alternatively, change the port mapping at Open-Assistant/docker-compose.yaml:298.
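The linked page boils down to something like the following for the port conflict above (a sketch assuming a Linux host with iproute2 installed):

```shell
# Show which process (if any) is listening on TCP port 2000.
# ss ships with iproute2 on most Linux distributions; the -p flag
# (process info) may require root to show other users' processes.
ss -ltnp 'sport = :2000'
```

If `ss` isn't available, `lsof -iTCP:2000 -sTCP:LISTEN` is a common alternative.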
I am hoping to get this merged so I can then use a GitHub codespace to test this properly (just to make sure anyone can run it via a clean GitHub codespace).
This PR contains quite a lot of elements for which I don't see a clear use case. For example, why is there nvidia-gpus.json? We don't have any GPUs in our backend server.
In general we need a proper plan for monitoring. At the least, Grafana should be deployed to a separate machine to monitor the health of the system (and send out alerts).
@andreaskoepf We do need to discuss things properly. I was in the process of writing proper Dockerfiles and updating the GitHub Actions / Ansible for all these new containers yesterday, but I had to pause that and was afk for the rest of the day. I have added two new containers: cadvisor allows monitoring of docker containers (cpu, memory, network, etc.) and should be deployed on any server that is running docker containers; dcgm-exporter gives stats for nvidia gpus and should be deployed onto the inference worker servers so that the gpus can be monitored for utilisation and vram usage (or anything else that would be useful). I mentioned this in the discord:
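For reference, a sketch of what those two services might look like in docker-compose.yaml. The image tags, ports, and mounts are assumptions based on each exporter's published defaults, not taken from this PR's actual diff:

```yaml
# Hypothetical compose additions; verify tags/ports against the real diff.
services:
  # cadvisor: per-container cpu/memory/network stats for any docker host.
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports:
      - "8080:8080"       # cadvisor's default metrics/UI port
    volumes:              # read-only host mounts cadvisor needs to see containers
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  # dcgm-exporter: nvidia gpu utilisation/vram metrics, for inference workers.
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
    ports:
      - "9400:9400"       # dcgm-exporter's default metrics port
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Splitting the two services like this matches the deployment split described above: cadvisor everywhere docker runs, dcgm-exporter only on GPU workers.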
I am closing this since we don't have a clear use case for it right now.
Add new data sources (cadvisor for docker containers, dcgm-exporter for nvidia gpus)
Add dashboards for docker / gpus
Add mono-board WIP with variable for Datasource and Job
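Wiring the new data sources into Prometheus would amount to something like the following scrape jobs. The service names and ports are assumptions based on each exporter's defaults, not taken from this PR:

```yaml
# Sketch of prometheus.yml scrape jobs for the two hypothetical exporters.
scrape_configs:
  - job_name: cadvisor          # docker container stats
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: dcgm-exporter     # nvidia gpu stats
    static_configs:
      - targets: ["dcgm-exporter:9400"]
```

The job_name values would then feed the mono-board's Job variable mentioned above.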