Some nodes rootfs is full #2337

Closed

AbdelrahmanElawady opened this issue May 21, 2024 · 2 comments

@AbdelrahmanElawady Contributor
Description

Some nodes on devnet have trouble updating and deploying workloads because their rootfs has filled up. After inspecting some nodes, it turned out the issue stems from the way ZOS handles updates: whenever the node decides to update its packages, the old files get removed. However, these files are not actually freed from rootfs because they are still held open by other processes, for example cloud-hypervisor, virtiofsd, containerd, rfs, etc.
These processes belong to user workloads, so we can't simply stop or restart them. Over time, the deleted-but-still-open files fill up the rootfs until no space is left.
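
For anyone inspecting an affected node, these deleted-but-open files can be spotted through /proc: the kernel appends a " (deleted)" suffix to the fd symlink target of an unlinked file. A minimal diagnostic sketch in Go (not part of zos, just an illustration; needs root to read other processes' fds):

```go
// deleted_fds.go — list files that were unlinked from rootfs but are
// still held open by running processes, i.e. space that the update's
// remove could not actually reclaim.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	fds, err := filepath.Glob("/proc/[0-9]*/fd/*")
	if err != nil {
		panic(err)
	}
	for _, fd := range fds {
		// Each fd entry is a symlink; for an unlinked file the kernel
		// appends " (deleted)" to the resolved target.
		target, err := os.Readlink(fd)
		if err != nil {
			continue // process exited or fd closed mid-scan
		}
		if strings.HasSuffix(target, " (deleted)") {
			fmt.Printf("%s -> %s\n", fd, target)
		}
	}
}
```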

Possible Solutions

Since we can't remove these old files (they belong to user workloads), we can try to minimize how often this situation occurs.
For example, before writing the content of a new package, we can check whether its version differs from the one already on the node. That way, if it's the same package, we won't create these deleted-but-still-in-use files. A sketch of this check follows.
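
A minimal sketch of that check, using illustrative names (digest, needsUpdate) rather than the actual zos update code — comparing content hashes before overwriting means an identical file is never unlinked and rewritten under a running workload:

```go
// same_version.go — skip the write when the staged file matches the
// installed one, so no deleted-but-open copy is created.
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// digest returns the sha256 of the file at path.
func digest(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// needsUpdate reports whether the staged file differs from the installed
// one. A missing or unreadable installed file counts as "needs update".
func needsUpdate(installed, staged string) bool {
	cur, err := digest(installed)
	if err != nil {
		return true
	}
	next, err := digest(staged)
	if err != nil {
		return true
	}
	return !bytes.Equal(cur, next)
}

func main() {
	fmt.Println(needsUpdate("/usr/bin/virtiofsd", "/tmp/staged/virtiofsd"))
}
```

Comparing content hashes rather than version strings would also cover the case where two releases ship identical binaries for some packages.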

@AbdelrahmanElawady AbdelrahmanElawady added the type_bug Something isn't working label May 21, 2024
@AbdelrahmanElawady Contributor Author

Of course, this is a problem on all networks; it just showed up on devnet first.

@muhamadazmy Member

I would like to add that most services are restarted during an update, so none of the zos binaries are an issue; containerd is also not an issue (it is restarted too). But any user-related process (usually managed not by zinit but by one of the zos daemons) is not restarted, since otherwise the user workload would suffer downtime (cloud-hypervisor, virtiofsd, rfs, etc.). That is what causes this problem.

If a node has been running for a really long time and has gone through many updates, and if it has long-running user workloads, those workloads end up holding the files they were started with (say, the cloud-hypervisor binary).
