This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
Currently, the NodeMemoryUsage alert is triggered when a node's memory usage exceeds 95%. On a server with a large memory capacity such as 256G, this means the alert fires once less than 12.8G of memory is free. On a computing node that is expected behavior: users are meant to request and use up to 240G, as designed. But in that case, if OpenPAI and the other system processes use more than about 3.2G (256G × 95% − 240G), the alert fires.
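The arithmetic behind the current 95% rule can be checked with a short sketch. The helper name `usage_alert_headroom` is hypothetical and not part of OpenPAI; it only reproduces the numbers above.

```python
# Hypothetical helper (not an OpenPAI function): how much memory non-job
# processes (OS, OpenPAI services) may use before a usage-percentage alert fires.


def usage_alert_headroom(total_gb: float, threshold: float, user_gb: float) -> float:
    """Return the memory (GB) left for non-user processes before usage exceeds threshold."""
    alert_at_gb = total_gb * threshold  # usage level at which the alert fires
    return alert_at_gb - user_gb


total = 256.0
# Free memory remaining at the moment the 95% alert fires.
print(f"free memory when alert fires: {total * (1 - 0.95):.1f}G")
# Room left for system processes when user jobs use 240G.
print(f"system headroom with 240G of user jobs: {usage_alert_headroom(total, 0.95, 240.0):.1f}G")
```

With 240G used by jobs, system processes get only about 3.2G before the node-wide percentage rule fires, which is why a pure percentage threshold is noisy on large nodes.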
So the alert should be triggered when the system uses more memory than expected, but not when a user uses 100% of the requested memory.
If a user needs more memory than requested and hits an OOM, the user should get this information from the job details page and fix it. The admin doesn't need to care about it.
If system processes use more memory than designed, there should be an alert, even if free memory is still sufficient, because this may indicate a potential issue in the system.
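A minimal sketch of the proposed condition, assuming both the node's total used memory and the memory used by user job containers can be measured (how OpenPAI would collect them from its monitoring stack is left open here). `NodeMemory`, `should_alert`, `percent_rule`, and the 16G system budget are hypothetical illustrations, not OpenPAI names or defaults.

```python
from dataclasses import dataclass


@dataclass
class NodeMemory:
    total_gb: float     # physical memory on the node
    used_gb: float      # all memory in use on the node
    job_used_gb: float  # memory used by user job containers (hypothetical input)


def percent_rule(node: NodeMemory, threshold: float = 0.95) -> bool:
    """Current behavior: fire when overall node usage exceeds the threshold."""
    return node.used_gb / node.total_gb > threshold


def system_used_gb(node: NodeMemory) -> float:
    """Memory attributed to the OS and OpenPAI services rather than user jobs."""
    return node.used_gb - node.job_used_gb


def should_alert(node: NodeMemory, system_budget_gb: float) -> bool:
    """Proposed behavior: fire when system processes exceed their budget,
    regardless of how much memory is still free."""
    return system_used_gb(node) > system_budget_gb


# Jobs use all 240G they requested and the system uses 4G:
busy = NodeMemory(total_gb=256, used_gb=244, job_used_gb=240)
# Mostly idle node, but system processes use 20G:
leaky = NodeMemory(total_gb=256, used_gb=40, job_used_gb=20)

print(percent_rule(busy), should_alert(busy, system_budget_gb=16))
print(percent_rule(leaky), should_alert(leaky, system_budget_gb=16))
```

The first node trips the current 95% rule even though the jobs are behaving as designed, while the proposed check stays quiet. The second node is the reverse: plenty of memory is free, so the percentage rule is silent, but the system-side overuse is flagged.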