Nomad Agents DoS on Migration #676

Closed
mpass99 opened this issue Sep 4, 2024 · 1 comment · Fixed by #681
Labels
bug Something isn't working

Comments

mpass99 commented Sep 4, 2024

In #612, we noticed that the Nomad agents get caught in a crash loop when too many allocations are migrated (in response to a restart).

We see OOM kills (telegraf, systemd) and Nomad errors.

  • Investigate whether this error is reproducible without telegraf.
mpass99 added the bug label Sep 4, 2024

mpass99 commented Sep 5, 2024

  • With telegraf running and Nomad stopped, the server idles.

  • Nomad starts successfully as long as the server does not try to schedule many allocations on that agent.

  • With up to 60 runners, the agent works fine.

  • When requesting 80 runners, the agent crashes.

    • The server reports the agent as down.
    • CPU usage is at 100% and memory usage is at 96%.
    • Telegraf gets OOM-killed, which changes almost nothing about the CPU and memory usage.
    • Stopping the Nomad server frees the CPU; memory usage remains at 65%.
  • Disabling telegraf

  • Restarting the node

  • When starting Nomad, the agent still crashes.

    • The server no longer tries to place many allocations on the node (only 4).
    • It appears that some locally stored data makes the agent create many containers (~100; both CNI and runners).
    • Everything gets OOM-killed.
  • It normalizes after restarting Docker and Nomad.

  • Therefore, telegraf does not seem to have a major impact on the observed behavior.
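
To make memory figures like the ones above easier to compare between runs, memory usage could be sampled on the agent while the runner count is increased. The following is a minimal sketch in Python, assuming a Linux host that exposes `/proc/meminfo`. The function names and the definition of "used" (`MemTotal` minus `MemAvailable`) are assumptions for illustration, not how the numbers in this issue were measured.

```python
import time
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into a mapping of field name to value (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_percent(text: str) -> float:
    """Return used memory in percent, assuming used = MemTotal - MemAvailable."""
    fields = parse_meminfo(text)
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def sample(count: int, interval_seconds: float, path: str = "/proc/meminfo") -> list:
    """Hypothetical helper: take `count` readings, `interval_seconds` apart."""
    readings = []
    for i in range(count):
        readings.append(memory_used_percent(Path(path).read_text()))
        if i < count - 1:
            time.sleep(interval_seconds)
    return readings
```

Calling `sample(12, 5.0)` on the agent while requesting runners would yield one reading every five seconds for about a minute, which could then be compared between a run with telegraf and one without it.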
