-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace usage of schedule_timeout with schedule_timeout_interruptible #16150
Conversation
There's a few other places in the code base where it looks like https://github.com/search?q=repo%3Aopenzfs%2Fzfs+%22schedule_timeout%28%22&type=code I didn't have time to investigate and test these areas of the code, but if any of the maintainers also think these should be included in this submission without testing I could do so. |
I would ask if you can explain more why this causes an OOM killer to fire - unless we're leaking memory somehow, this should at worst be a busywait, no? |
Sure! I originally had written this up, but removed it to cut down on the size of the Description since it wasn't entirely relevant. The OOM killing behavior is particular to the kernel configuration (which I believe should be common) that we were running with and is primarily unrelated to ZFS. I haven't 100% confirmed this, but I have a high degree of confidence it's related to the kernel audit configuration we had on the host. Each time the OOM killer ran we would notice a particularly large amount of slab_unreclaimable memory being allocated on the
After some code inspection, we can see that for each file path lookup loop execution, kern_path will allocate some memory from the When the code goes to call This indicates to me the possibility that each loop execution is allocating memory that is not immediately being free'd, depending on how the audit system handles the deallocation/free'ing of references its holding onto. This caused memory usage to explode during the execution of the loop as these objects were not being free'd as quickly as they were alloc'd. I haven't gone much deeper than this beyond using some bcc/bpf tooling to approximate memory allocations coming from this loop, but this seems like the most likely explanation for what were seeing regarding the OOM killers we saw whenever we hit this scenario. |
Here's some example output of the memleak bcc tool I used to get some hints where memory was being allocated during import. This indicates ~400MB of memory was allocated from this code path alone (for allocations that it was able to capture)
|
Another thing to mention is that this memory is not leaked and it is eventually free'd by the audit processes, but it is not immediate and often not before the OOM killer was triggered. We also did another test with kernel auditing disabled and this memory allocation behavior was not present. |
@Dan-Perry thanks for catching this and the details about the OOM behavior. I've seen similar unfortunate-ness with memory not being freed as quickly as you might expect when kernel auditing is enabled. I'm surprised this wasn't noticed until now, but you're right we're definitely busy waiting. It looks like the other two cases of this in These days it looks like the right function to use in this situation in |
6e695fd
to
1ec0720
Compare
Thank you for taking a look! I updated the branch with the corresponding changes to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Can you just address the "line > 80 characters" cstyle warning in zvol_os.c
.
This commit replaces current usages of schedule_timeout() with schedule_timeout_interruptible() in code paths that expect the running task to sleep for a short period of time. When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. By calling schedule_timeout_interruptible() to set the task state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of putting the task to sleep for the specified timeout. Signed-off-by: Daniel Perry <dtperry@amazon.com>
1ec0720
to
824ee31
Compare
Should be all set now |
…openzfs#16150) This commit replaces current usages of schedule_timeout() with schedule_timeout_interruptible() in code paths that expect the running task to sleep for a short period of time. When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. By calling schedule_timeout_interruptible() to set the task state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of putting the task to sleep for the specified timeout. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Daniel Perry <dtperry@amazon.com> Closes openzfs#16150
…openzfs#16150) This commit replaces current usages of schedule_timeout() with schedule_timeout_interruptible() in code paths that expect the running task to sleep for a short period of time. When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. By calling schedule_timeout_interruptible() to set the task state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of putting the task to sleep for the specified timeout. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Daniel Perry <dtperry@amazon.com> Closes openzfs#16150
This commit replaces the current usage of schedule_timeout() with schedule_timeout_interruptible() when device paths cannot be found in vdev_disk_open(). When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. It will instead continue to busy-loop until either zfs_vdev_open_timeout_ms has passed or the device path is found.
By calling schedule_timeout_interruptible() to set the task's state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of making the task sleep in between block device path lookup attempts.
Motivation and Context
This commit fixes a bug in the current implementation of vdev_disk_open() when a block device path cannot be found during pool import that in our setup occasionally resulted in OOM killers running. This behavior was exhibited on a server with zfs 2.1.7 running on top of a 6.1 based Linux kernel, and should apply to the current master branch of ZFS as well.
Description
If a block device path is missing during zpool import, the intended behavior of vdev_disk_open() is to retry lookup attempts multiple times until the device path is found or until zfs_vdev_open_timeout_ms milliseconds have passed - sleeping for ~10ms inbetween attempts to yield the CPU to other processes.
However given the current implementation, this intended behavior is not properly realized and vdev_disk_open() spends zero time sleeping and will continuously retry block device path lookups until one of the previously mentioned conditions are met. This is because the task will not sleep if the current task state is TASK_RUNNING prior to calling schedule_timeout(). The behavior of schedule_timeout() is summarized in some detail here: https://elixir.bootlin.com/linux/v6.1.89/source/kernel/time/timer.c#L1895.
The effects of this can be reproduced via the following:
This experiment shows that the device path finding loop contained in vdev_disk_open() was looped through roughly ~2 million times during the import of the zpool.
A Flamegraph also confirms where the CPU is spending most of its cycles during import with a missing device path:
How Has This Been Tested?
These changes were tested on a Linux 6.1 based server running zfs 2.1.7. By reproducing the same experiment as mentioned in the description we can confirm that the number of loop executions in vdev_disk_open() will be far less and upperbounded by the number of times blkdev_get_by_path() is called.
Reproducer Test Results:
These test results indicate that the vdev_disk_open() block device path finding loop executed at most 135 times across all vdevs. This aligns much more with the expected number of times this loop should run with a zfs_vdev_open_timeout_ms setting of 1000 milliseconds.
Types of changes
Checklist:
Signed-off-by
.