Some nodes rootfs is full #2337

Closed

AbdelrahmanElawady opened this issue May 21, 2024 · 2 comments

@AbdelrahmanElawady Contributor
Description

Some nodes on devnet have trouble updating and deploying workloads because their rootfs has filled up. After inspecting some nodes, it turned out the issue stems from the way ZOS handles updates: whenever the node decides to update its packages, the old files get removed. However, these files are not actually freed from rootfs because they are still held open by other processes, for example cloud-hypervisor, virtiofsd, containerd, rfs, etc.
These processes belong to user workloads, so we can't simply stop or restart them. Over time, the deleted-but-still-open files fill up the rootfs until no space is left.
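
For anyone inspecting an affected node, these deleted-but-open files can be spotted through /proc: the kernel appends a " (deleted)" suffix to the fd symlink target of an unlinked file. A minimal diagnostic sketch in Go (not part of zos, just an illustration; needs root to read other processes' fds):

```go
// deleted_fds.go — list files that were unlinked from rootfs but are
// still held open by running processes, i.e. space that the update's
// remove could not actually reclaim.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	fds, err := filepath.Glob("/proc/[0-9]*/fd/*")
	if err != nil {
		panic(err)
	}
	for _, fd := range fds {
		// Each fd entry is a symlink; for an unlinked file the kernel
		// appends " (deleted)" to the resolved target.
		target, err := os.Readlink(fd)
		if err != nil {
			continue // process exited or fd closed mid-scan
		}
		if strings.HasSuffix(target, " (deleted)") {
			fmt.Printf("%s -> %s\n", fd, target)
		}
	}
}
```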

Possible Solutions

Since we can't remove these old files (they belong to user workloads), we can try to minimize how often this situation occurs.
For example, before writing the content of a new package, we can check whether its version differs from the one already on the node. That way, if it's the same package, we won't create these deleted-but-still-in-use files. A sketch of this check follows.
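
A minimal sketch of that check, using illustrative names (digest, needsUpdate) rather than the actual zos update code — comparing content hashes before overwriting means an identical file is never unlinked and rewritten under a running workload:

```go
// same_version.go — skip the write when the staged file matches the
// installed one, so no deleted-but-open copy is created.
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// digest returns the sha256 of the file at path.
func digest(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// needsUpdate reports whether the staged file differs from the installed
// one. A missing or unreadable installed file counts as "needs update".
func needsUpdate(installed, staged string) bool {
	cur, err := digest(installed)
	if err != nil {
		return true
	}
	next, err := digest(staged)
	if err != nil {
		return true
	}
	return !bytes.Equal(cur, next)
}

func main() {
	fmt.Println(needsUpdate("/usr/bin/virtiofsd", "/tmp/staged/virtiofsd"))
}
```

Comparing content hashes rather than version strings would also cover the case where two releases ship identical binaries for some packages.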

@AbdelrahmanElawady AbdelrahmanElawady added the type_bug Something isn't working label May 21, 2024
@AbdelrahmanElawady Contributor Author

Of course, this is a problem on all networks; it just showed up on devnet first.

@muhamadazmy Member

I would like to add that most services are restarted during an update, so none of the zos binaries are an issue; containerd is also not an issue (it is restarted too). But any user-related process (usually managed not by zinit but by one of the zos daemons) is not restarted, since otherwise the user workload would suffer downtime (cloud-hypervisor, virtiofsd, rfs, etc.). That is what causes this problem.

If a node has been running for a really long time and has gone through many updates, and if it has long-running user workloads, those workloads end up holding the files they were started with (say, the cloud-hypervisor binary).
