-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] Don't rmtree if unmounting cloudfuse fails #12975
Conversation
if self.cloudfuse: | ||
n_mounted = len([config for config in self.cloudfuse if config['mounted']]) | ||
if n_mounted > 0: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth outlining here what the danger is and adding a comment for defensive pieces of code like this. Otherwise code like this will pile up and we will forget what it was instituted for. I'm very likely to delete what I think is redundant or unnecessary code unless a comment warns me against that line of thinking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, we should institute a requirement that JVMJob cloudfuse mount destinations start with /cloudfuse
, as that is the only mount point that we set as slave
and could therefore be used in JVMJobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, apologies, I think bailing out early if an unmount fails is self-explanatory enough. I think what initially gave me pause is this happening in a JVMJob. What's the state of a JVM container where an unmount has failed? That means that another job that comes in could potentially see that mount? That seems really not great, I feel like we should kill the JVM container at that point and create a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth taking a separate look at JVMJob fuse mounts separately from DockerJob fuse mounts to properly assess where things could go wrong and what the outcome should be.
I added a finally to bucket mounting operations to mark it as mounted so we always assume it could have been partially mounted. Feel free to push back on this. |
I suppose this is exactly what happened when the user was able to create a duplicate mount. The unmount succeeded, but it did not rid the worker of the fuse process. Perhaps the more appropriate thing to do is instead of relying on the outcome of the unmount operation, we should check for the existence of the FUSE process |
Is it worth making a separate PR that explicitly manages the gcsfuse processes PIDs? If yes, can you do that? |
config, | ||
) | ||
finally: | ||
config['mounted'] = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes things more confusing. I don't know what "partially mounted" means, and I would prefer to treat config['mounted'] = True
to mean "we believe it was mounted" and not "we believe it might have been mounted". What happens if the mount failed and then we later try to unmount?
Ya I can do this. Should we try to just keep it simple in this PR then? Like just what you had originally where we return after any calls to unmount fail |
@@ -2138,6 +2148,8 @@ async def run(self): | |||
for config in self.cloudfuse: | |||
bucket = config['bucket'] | |||
assert bucket | |||
assert config['mount_path'] != '/io' | |||
assert config['mount_path'].startswith('/cloudfuse') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did this slip in from another change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Daniel requested this for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind if we move it into a separate PR, just that we should ultimately have these assertions.
except asyncio.CancelledError: | ||
raise | ||
except Exception: | ||
log.exception(f'error while unmounting cloudfuse in jvm job {self.jvm_name}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this config['mounted'] = False
or True
thing feels like a more complicated version of just letting the exception raise all the way up.
If the unmount
function raises an exception, shouldn't we just let it propagate all the way up? That will skip cleanup.
We should, perhaps, separately, inspect /proc/mounts
to verify that there are no mounts under the scratch directory we're about to delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what to do here. What if the user gives us a bad bucket or doesn't have permissions? If we let the exception propagate here even if the mount attempt failed, then that will bork the worker especially for the JVM case where we wouldn't return the JVM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the user gives us a bad bucket or doesn't have permissions?
Wouldn't this mean that the mount never succeeded in the first place?
closed in favor of #12985 |
I think this change is better than #12975
I'm not sure this change is thorough enough. Is there a way for a bucket to get partially mounted but have config['mounted'] still be False?