`os.listdir` (and presumably other os functions) fails in the face of signals #955
Comments
Hey @emeryberger, thanks for reporting this issue and sorry for the delay in the response. We've reproduced it and are able to confirm that the problem exists. Here is a debug log:
From the log we can see that when an application gets interrupted in a

A separate problem is the possibility of errors caused by interrupts during other system calls, for which we don't have a reproduction yet. I've created a separate issue for that.
## Description of change

When a user application gets interrupted in a `readdir` syscall, the underlying chain of `readdir` FUSE requests gets reset to an offset which Mountpoint considers stale. In that case Mountpoint still completes the interrupted `readdir` request, but the kernel partially discards the response. We already cache the last response, so we can use it to serve the request which follows the interrupt.

Relevant issues: #955

## Does this change impact existing behavior?

This is not a breaking change. Previously an error was returned; now it is handled properly.

---

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the [Developer Certificate of Origin (DCO)](https://developercertificate.org/).

---------

Signed-off-by: Vladislav Volodkin <vlaad@amazon.co.uk>
Signed-off-by: Vlad Volodkin <vlaad@amazon.com>
Co-authored-by: Vladislav Volodkin <vlaad@amazon.co.uk>
Co-authored-by: Vlad Volodkin <vlaad@amazon.com>
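To make the idea in the description above concrete, here is a toy Python model of a directory handle that replays its cached last response when a request arrives at an offset it has already served. This is only a sketch, not Mountpoint's actual (Rust) implementation: the `ReaddirHandle` class, its `batch_size` parameter, and the rule that any offset inside the last served batch can be replayed are all assumptions made for illustration.

```python
import errno


class ReaddirHandle:
    """Toy directory handle serving entries in batches (hypothetical model)."""

    def __init__(self, entries, batch_size=2):
        self.entries = list(entries)
        self.batch_size = batch_size
        self.next_offset = 0       # offset the handle expects to see next
        self.last_offset = None    # offset of the most recently served request
        self.last_response = []    # cached reply to that request

    def readdir(self, offset):
        # Normal case: the request continues where the previous one ended.
        if offset == self.next_offset:
            batch = self.entries[offset:offset + self.batch_size]
            self.last_offset = offset
            self.last_response = batch
            self.next_offset = offset + len(batch)
            return batch

        # Request following an interrupt: the caller asks again for entries
        # that were already served. Replay them from the cached response
        # (assumption: any offset inside the last batch is acceptable).
        if self.last_offset is not None and self.last_offset <= offset < self.next_offset:
            return self.last_response[offset - self.last_offset:]

        # Anything else is a genuinely stale, out-of-order request.
        raise OSError(
            errno.EINVAL,
            f"out-of-order readdir, expected={self.next_offset}, actual={offset}",
        )
```

Without the middle branch, a repeated request at an earlier offset would fall through to the error, which mirrors the `out-of-order readdir` warning in the issue's log output.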
Mountpoint for Amazon S3 version
mount-s3 1.7.2
AWS Region
us-east-1
Describe the running environment
Running on EC2, accessing an S3 bucket through my account, using this AMI: Deep Learning Base Proprietary Nvidia Driver GPU AMI (Ubuntu 20.04) 20240314. Same setup as here: plasma-umass/scalene#841
Mountpoint options
What happened?
This error (failure when running in a mounted S3 system) was brought to my attention with this issue with the Scalene profiler: plasma-umass/scalene#841
The root cause turns out to be the CPU timer signal; if the `os.listdir` function is interrupted by a `signal.SIGALRM`, the call fails with an `OSError`. I set the frequency below to a level that triggers the failure roughly half the time; setting it to 1 second makes it never happen. Since the default CPU sampling frequency used by Scalene is 0.01 seconds, it fails consistently. Note that the profiler `py-spy` also causes this failure.
MRE here:
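The original reproduction script did not survive on this page. The following is a hypothetical sketch of the kind of script the paragraph above describes, not the author's MRE: it arms a repeating `SIGALRM` timer and calls `os.listdir` in a loop, counting how many calls raise `OSError`. The function name `listdir_under_timer` and its default values are assumptions chosen for illustration.

```python
import os
import signal


def listdir_under_timer(path, interval=0.01, iterations=1000):
    """Call os.listdir(path) repeatedly while SIGALRM fires every `interval` s.

    Returns (successes, failures), where a failure is an OSError from listdir.
    """
    def on_alarm(signum, frame):
        pass  # the handler does nothing; only the interruption matters

    previous = signal.signal(signal.SIGALRM, on_alarm)
    # ITIMER_REAL delivers SIGALRM; first expiry and period are both `interval`.
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    successes = failures = 0
    try:
        for _ in range(iterations):
            try:
                os.listdir(path)
                successes += 1
            except OSError:
                failures += 1
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        # Restore the earlier handler (None means it was not set from Python).
        signal.signal(signal.SIGALRM,
                      previous if previous is not None else signal.SIG_DFL)
    return successes, failures
```

Pointing `path` at a directory inside the mounted bucket would be the case of interest; on a local filesystem the failure count would be expected to stay at zero.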
Example failure:
I have implemented a workaround for this situation (wrapping all `os` functions so that they block the `SIGALRM` signal) - plasma-umass/scalene#842 - but it seems like it is exposing a race condition in mount-s3, or at minimum, undesirable behavior.
Relevant log output
2024-07-29T00:47:59.230133Z WARN mountpoint_s3::cli: failed to detect network throughput. Using 10 gbps as throughput. Use --maximum-throughput-gbps CLI flag to configure a target throughput appropriate for the instance. Detection failed due to: failed to get network throughput
2024-07-29T00:48:15.393878Z WARN readdirplus{req=10 ino=1 fh=1 offset=1}: mountpoint_s3::fuse: readdirplus failed: out-of-order readdir, expected=3, actual=1
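For readers who want to try the workaround mentioned under "What happened?", here is a minimal sketch of wrapping a function so that `SIGALRM` is blocked while it runs. It is not the code from plasma-umass/scalene#842; the decorator name `block_sigalrm` and the `safe_listdir` wrapper are hypothetical, and the sketch assumes a Unix platform where `signal.pthread_sigmask` is available.

```python
import functools
import os
import signal


def block_sigalrm(func):
    """Return a wrapper that blocks SIGALRM for the duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Block SIGALRM in the calling thread and remember the previous mask.
        old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGALRM})
        try:
            return func(*args, **kwargs)
        finally:
            # Restore the previous mask; a pending SIGALRM is delivered now.
            signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return wrapper


# Hypothetical usage: a listdir that cannot be interrupted by the timer signal.
safe_listdir = block_sigalrm(os.listdir)
```

A signal that arrives while blocked stays pending and is delivered once the mask is restored, so a timer-based profiler would still receive its signal, just slightly later.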