Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace usage of schedule_timeout with schedule_timeout_interruptible #16150

Merged
merged 1 commit into from
May 9, 2024

Conversation

Dan-Perry
Copy link
Contributor

@Dan-Perry Dan-Perry commented May 1, 2024

This commit replaces the current usage of schedule_timeout() with schedule_timeout_interruptible() when device paths cannot be found in vdev_disk_open(). When schedule_timeout() is called without previously calling set_current_state(), the running task never sleeps because the task state remains in TASK_RUNNING. It will instead continue to busy-loop until either zfs_vdev_open_timeout_ms has passed or the device path is found.

By calling schedule_timeout_interruptible() to set the task's state to TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the intended/desired behavior of making the task sleep in between block device path lookup attempts.

Motivation and Context

This commit fixes a bug in the current implementation of vdev_disk_open() when a block device path cannot be found during pool import that in our setup occasionally resulted in OOM killers running. This behavior was exhibited on a server with zfs 2.1.7 running on top of a 6.1 based Linux kernel, and should apply to the current master branch of ZFS as well.

Description

If a block device path is missing during zpool import, the intended behavior of vdev_disk_open() is to retry lookup attempts multiple times until the device path is found or until zfs_vdev_open_timeout_ms milliseconds have passed - sleeping for ~10ms inbetween attempts to yield the CPU to other processes.

However given the current implementation, this intended behavior is not properly realized and vdev_disk_open() spends zero time sleeping and will continuously retry block device path lookups until one of the previously mentioned conditions are met. This is because the task will not sleep if the current task state is TASK_RUNNING prior to calling schedule_timeout(). The behavior of schedule_timeout() is summarized in some detail here: https://elixir.bootlin.com/linux/v6.1.89/source/kernel/time/timer.c#L1895.

The effects of this can be reproduced via the following:

# Create temporary block devices
dd if=/dev/zero of=/disk.img bs=64M count=1
losetup /dev/loop0 /disk.img
dd if=/dev/zero of=/cache.img bs=100M count=1
losetup /dev/loop1 /cache.img
parted -s /dev/loop1 mklabel gpt
parted -s /dev/loop1 mkpart logical 17408B 99MB

# Create Pool
sudo zpool create temporary-pool /dev/loop0 cache /dev/loop1p1

# Export pool and remove cache device
zpool export temporary-pool
parted /dev/loop1 rm 1

# Import pool and count the number of times blkdev_get_by_path() is called
bpftrace -e 'kprobe:blkdev_get_by_path { @count++; }' &
pid=$!
zpool import temporary-pool
kill $pid

# Output:
@count: 2092665

This experiment shows that the device path finding loop contained in vdev_disk_open() was looped through roughly ~2 million times during the import of the zpool.

A Flamegraph also confirms where the CPU is spending most of its cycles during import with a missing device path:

image (6)

How Has This Been Tested?

These changes were tested on a Linux 6.1 based server running zfs 2.1.7. By reproducing the same experiment as mentioned in the description we can confirm that the number of loop executions in vdev_disk_open() will be far less and upperbounded by the number of times blkdev_get_by_path() is called.

Reproducer Test Results:

# Create temporary block devices
dd if=/dev/zero of=/disk.img bs=64M count=1
losetup /dev/loop0 /disk.img
dd if=/dev/zero of=/cache.img bs=100M count=1
losetup /dev/loop1 /cache.img
parted -s /dev/loop1 mklabel gpt
parted -s /dev/loop1 mkpart logical 17408B 99MB

# Create Pool
sudo zpool create temporary-pool /dev/loop0 cache /dev/loop1p1

# Export pool and remove cache device
zpool export temporary-pool
parted /dev/loop1 rm 1

# Import pool and count the number of times blkdev_get_by_path() is called
bpftrace -e 'kprobe:blkdev_get_by_path { @count++; }' &
pid=$!
zpool import temporary-pool
kill $pid

# Output:
@count: 135

These test results indicate that the vdev_disk_open() block device path finding loop executed at most 135 times across all vdevs. This aligns much more with the expected number of times this loop should run with a zfs_vdev_open_timeout_ms setting of 1000 milliseconds.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@Dan-Perry
Copy link
Contributor Author

Dan-Perry commented May 1, 2024

There's a few other places in the code base where it looks like schedule_timeout() is also used incorrectly (via not setting the current task state).

https://github.com/search?q=repo%3Aopenzfs%2Fzfs+%22schedule_timeout%28%22&type=code

I didn't have time to investigate and test these areas of the code, but if any of the maintainers also think these should be included in this submission without testing I could do so.

@rincebrain
Copy link
Contributor

I would ask if you can explain more why this causes an OOM killer to fire - unless we're leaking memory somehow, this should at worst be a busywait, no?

@Dan-Perry
Copy link
Contributor Author

Dan-Perry commented May 1, 2024

Sure! I originally had written this up, but removed it to cut down on the size of the Description since it wasn't entirely relevant.

The OOM killing behavior is particular to the kernel configuration (which I believe should be common) that we were running with and is primarily unrelated to ZFS. I haven't 100% confirmed this, but I have a high degree of confidence it's related to the kernel audit configuration we had on the host.

Each time the OOM killer ran we would notice a particularly large amount of slab_unreclaimable memory being allocated on the names_cache (approximately 6GB).

Apr 26 15:52:55 ip-198-19-48-19 kernel: Unreclaimable slab info:
Apr 26 15:52:55 ip-198-19-48-19 kernel: Name Used Total
...
Apr 26 15:52:55 ip-198-19-48-19 kernel: names_cache 6012688KB 6012688KB
...

After some code inspection, we can see that for each file path lookup loop execution, kern_path will allocate some memory from the names_cache in getname_kernel. This is followed by a call to audit_getname which will increment the refcnt on the returned filename struct that will be used to determine if the file path exists.

When the code goes to call putname on this allocated memory, this memory is not actually freed by the time we exit kern_path because the audit structures are still holding a reference.

This indicates to me the possibility that each loop execution is allocating memory that is not immediately being free'd, depending on how the audit system handles the deallocation/free'ing of references its holding onto. This caused memory usage to explode during the execution of the loop as these objects were not being free'd as quickly as they were alloc'd.

I haven't gone much deeper than this beyond using some bcc/bpf tooling to approximate memory allocations coming from this loop, but this seems like the most likely explanation for what were seeing regarding the OOM killers we saw whenever we hit this scenario.

@Dan-Perry
Copy link
Contributor Author

Here's some example output of the memleak bcc tool I used to get some hints where memory was being allocated during import. This indicates ~400MB of memory was allocated from this code path alone (for allocations that it was able to capture)

431009792 bytes in 105227 allocations from stack
		0xffffffff892f1cd0	 kmem_cache_alloc+0x1e0 [kernel]
		0xffffffff892f1cd0	 kmem_cache_alloc+0x1e0 [kernel]
		0xffffffff89342605 getname_kernel+0x25 [kernel]
		0xffffffff89349dd5 kern_path+0x15 [kernel]
		0xffffffff8951ebd7	 lookup_bdev+0x37 [kernel]
		0xffffffff8951fdcd   blkdev_get_by_path+0x2d [kernel]
		0xffffffffc07baaac	 vdev_disk_open+0x1cc [zfs]
		0xffffffffc075289f	 vdev_open+0xcf [zfs]
		0xffffffffc0734f15	 spa_load_l2cache_with_ashift+0x375 [zfs]
		0xffffffffc07352a3	 spa_ld_open_aux_vdevs+0x253 [zfs]
		0xffffffffc0738185	 spa_load_impl.constprop.0+0x115 [zfs]
		0xffffffffc07387a7	 spa_load+0x67 [zfs]
		0xffffffffc07388b4  spa_load_best+0x54 [zfs]
		0xffffffffc073950a	 spa_import+0x1ea [zfs]
		0xffffffffc078c850	 zfs_ioc_pool_import+0x140 [zfs]
		0xffffffffc079886d  zfsdev_ioctl_common+0x63d [zfs]
		0xffffffffc07c56b3	 zfsdev_ioctl+0x53 [zfs]
		0xffffffff8934e8b7  __x64_sys_ioctl+0x87 [kernel]
		0xffffffff899fe148	 do_syscall_64+0x38 [kernel]
		0xffffffff89c0011c	 entry_SYSCALL_64_after_hwframe+0x64 [kernel]

@Dan-Perry
Copy link
Contributor Author

Another thing to mention is that this memory is not leaked and it is eventually free'd by the audit processes, but it is not immediate and often not before the OOM killer was triggered.

We also did another test with kernel auditing disabled and this memory allocation behavior was not present.

@Dan-Perry Dan-Perry marked this pull request as ready for review May 1, 2024 20:53
@behlendorf
Copy link
Contributor

@Dan-Perry thanks for catching this and the details about the OOM behavior. I've seen similar unfortunate-ness with memory not being freed as quickly as you might expect when kernel auditing is enabled. I'm surprised this wasn't noticed until now, but you're right we're definitely busy waiting.

It looks like the other two cases of this in module/os/linux/zfs/zvol_os.c and module/os/linux/spl/spl-taskq.c need the same treatment. Can you update them as part of this PR.

These days it looks like the right function to use in this situation in schedule_timeout_idle(). Of course it was only added as of the 4.6 kernel so we'd need to add an autoconf check for it. Let not bother with that yet. For these cases I think sticking with schedule_timeout_interruptible(1) which we already depend on elsewhere is reasonable.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label May 2, 2024
@Dan-Perry Dan-Perry force-pushed the vdev-disk-open-sleep branch 3 times, most recently from 6e695fd to 1ec0720 Compare May 3, 2024 03:11
@Dan-Perry
Copy link
Contributor Author

Thank you for taking a look! I updated the branch with the corresponding changes to module/os/linux/zfs/zvol_os.c and module/os/linux/spl/spl-taskq.c.

Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Can you just address the "line > 80 characters" cstyle warning in zvol_os.c.

This commit replaces current usages of schedule_timeout() with
schedule_timeout_interruptible() in code paths that expect the running
task to sleep for a short period of time. When schedule_timeout() is
called without previously calling set_current_state(), the running
task never sleeps because the task state remains in TASK_RUNNING.

By calling schedule_timeout_interruptible() to set the task state to
TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the
intended/desired behavior of putting the task to sleep for the
specified timeout.

Signed-off-by: Daniel Perry <dtperry@amazon.com>
@Dan-Perry Dan-Perry force-pushed the vdev-disk-open-sleep branch from 1ec0720 to 824ee31 Compare May 3, 2024 18:01
@Dan-Perry
Copy link
Contributor Author

Looks good. Can you just address the "line > 80 characters" cstyle warning in zvol_os.c.

Should be all set now

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 3, 2024
@behlendorf behlendorf merged commit 2dff752 into openzfs:master May 9, 2024
24 of 25 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
…openzfs#16150)

This commit replaces current usages of schedule_timeout() with
schedule_timeout_interruptible() in code paths that expect the running
task to sleep for a short period of time. When schedule_timeout() is
called without previously calling set_current_state(), the running
task never sleeps because the task state remains in TASK_RUNNING.

By calling schedule_timeout_interruptible() to set the task state to
TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the
intended/desired behavior of putting the task to sleep for the
specified timeout.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Daniel Perry <dtperry@amazon.com>
Closes openzfs#16150
ptr1337 pushed a commit to CachyOS/zfs that referenced this pull request Nov 14, 2024
…openzfs#16150)

This commit replaces current usages of schedule_timeout() with
schedule_timeout_interruptible() in code paths that expect the running
task to sleep for a short period of time. When schedule_timeout() is
called without previously calling set_current_state(), the running
task never sleeps because the task state remains in TASK_RUNNING.

By calling schedule_timeout_interruptible() to set the task state to
TASK_INTERRUPTIBLE before calling schedule_timeout() we achieve the
intended/desired behavior of putting the task to sleep for the
specified timeout.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Daniel Perry <dtperry@amazon.com>
Closes openzfs#16150
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants