[batch] OSError: no such device #13861

Closed
danking opened this issue Oct 19, 2023 · 1 comment · Fixed by #13879

danking commented Oct 19, 2023

What happened?

This input container is about to get cleaned up. Maybe we have a race where we measure resource usage even though the job is already complete?

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/batch/resource_usage.py", line 250, in periodically_measure
    await self.measure()
  File "/usr/local/lib/python3.9/dist-packages/batch/resource_usage.py", line 210, in measure
    percent_cpu_usage = self.percent_cpu_usage()
  File "/usr/local/lib/python3.9/dist-packages/batch/resource_usage.py", line 122, in percent_cpu_usage
    now_cpu_ns = self.cpu_ns()
  File "/usr/local/lib/python3.9/dist-packages/batch/resource_usage.py", line 114, in cpu_ns
    for line in f.readlines():
OSError: [Errno 19] No such device

https://cloudlogging.app.goo.gl/tte29H271hPvp4tn9

Version

0.2.124

Relevant log output

No response

@danking danking added the bug label Oct 19, 2023
@daniel-goldstein

There is always a race because crun deletes the cgroup once the container completes. It looks like the memory tracking checks for this exact error but the CPU tracking does not; I would assume both should.
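
For illustration, a minimal sketch of what tolerating this error on the CPU side could look like. This is not the code in `batch/resource_usage.py`: the helper name `read_cpu_ns` is hypothetical, and it assumes a cgroup v2 style `cpu.stat` file with a `usage_usec <n>` line. The only point is that `ENODEV` raised while reading is treated as "the cgroup is gone" rather than as a failure.

```python
import errno
from typing import Optional


def read_cpu_ns(cpu_stat_path: str) -> Optional[int]:
    """Return cumulative CPU time in nanoseconds, or None if the cgroup is gone.

    Hypothetical helper. Assumes a cgroup v2 style `cpu.stat` file that
    contains a line of the form `usage_usec <n>`.
    """
    try:
        with open(cpu_stat_path, 'r', encoding='utf-8') as f:
            for line in f.readlines():
                key, _, value = line.partition(' ')
                if key == 'usage_usec':
                    return int(value) * 1000  # microseconds -> nanoseconds
    except OSError as e:
        # The file was opened, but the cgroup was deleted before the read
        # finished, so the kernel reports "No such device".
        if e.errno == errno.ENODEV:
            return None
        raise
    # The file was readable but had no usage_usec line.
    return None
```

A caller would skip the measurement when this returns `None` instead of letting the exception escape from the periodic measurement loop.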

danking pushed a commit to danking/hail that referenced this issue Oct 20, 2023
Fixes hail-is#13861. CPU monitor races with container deletion just like RAM monitor. I also switched to
catching FileNotFoundError instead of exists since technically the file could disappear between
us checking `exists` and us `open`ing it.
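
The `exists` versus `FileNotFoundError` point in this commit message can be sketched as follows. Both function names are hypothetical and neither is taken from the Hail code base; the sketch only contrasts the check-then-open pattern with the open-and-catch pattern.

```python
import os
from typing import Optional


def read_if_present_racy(path: str) -> Optional[str]:
    # Check-then-open: the file can be deleted after `exists` returns True
    # and before `open` runs, in which case `open` still raises
    # FileNotFoundError.
    if os.path.exists(path):
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    return None


def read_if_present(path: str) -> Optional[str]:
    # Open-and-catch: there is no window between a check and the open,
    # so a concurrently deleted file is reported as None.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError:
        return None
```

The second form only covers the file disappearing before `open`; an error raised during the read itself, such as the `ENODEV` in the traceback, still has to be handled separately.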