
[batch] Don't rmtree if unmounting cloudfuse fails #12975

Closed
wants to merge 2 commits

Conversation


@jigold jigold commented May 3, 2023

I'm not sure this change is thorough enough. Is there a way for a bucket to get partially mounted but have config['mounted'] still be False?

if self.cloudfuse:
    n_mounted = len([config for config in self.cloudfuse if config['mounted']])
    if n_mounted > 0:
        return

@daniel-goldstein daniel-goldstein May 3, 2023


I think it's worth outlining here what the danger is and adding a comment for defensive pieces of code like this. Otherwise code like this will pile up and we will forget what it was instituted for. I'm very likely to delete what I think is redundant or unnecessary code unless a comment warns me against that line of thinking.
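For context, a hedged sketch of the cleanup path and the danger being guarded against; the function shape and names (cleanup_scratch, scratch_dir) are illustrative, not the actual worker code:

import shutil

def cleanup_scratch(cloudfuse_configs, scratch_dir):
    # Illustrative sketch: if any cloudfuse bucket is still believed to be
    # mounted (e.g. because an unmount failed), bail out before deleting the
    # scratch directory, so rmtree never recurses through a live FUSE mount
    # into the user's bucket.
    if cloudfuse_configs:
        n_mounted = len([config for config in cloudfuse_configs if config['mounted']])
        if n_mounted > 0:
            return
    shutil.rmtree(scratch_dir, ignore_errors=True)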

Contributor


Additionally, we should institute a requirement that JVMJob cloudfuse mount destinations start with /cloudfuse, as that is the only mount point that we set as slave and could therefore be used in JVMJobs.

Contributor


Actually, apologies, I think bailing out early if an unmount fails is self-explanatory enough. I think what initially gave me pause is this happening in a JVMJob. What's the state of a JVM container where an unmount has failed? That means that another job that comes in could potentially see that mount? That seems really not great; I feel like we should kill the JVM container at that point and create a new one.

Contributor


It might be worth looking at JVMJob fuse mounts separately from DockerJob fuse mounts to properly assess where things could go wrong and what the outcome should be.


jigold commented May 3, 2023

I added a finally to the bucket mounting operations that marks the bucket as mounted, so we always assume it could have been at least partially mounted. Feel free to push back on this.

@daniel-goldstein

> I'm not sure this change is thorough enough. Is there a way for a bucket to get partially mounted but have config['mounted'] still be False?

I suppose this is exactly what happened when the user was able to create a duplicate mount. The unmount succeeded, but it did not rid the worker of the fuse process. Perhaps the more appropriate thing to do, instead of relying on the outcome of the unmount operation, is to check for the existence of the FUSE process.
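A minimal sketch of that FUSE-process check, assuming the process can be identified by scanning /proc for a gcsfuse command line that mentions the mount path; the worker's actual bookkeeping may track PIDs differently:

import os

def find_fuse_pids(mount_path: str) -> list:
    # Heuristic: look for a live gcsfuse process whose command line mentions
    # the mount path. A worker that records the PID at mount time could check
    # that PID directly instead.
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open(f'/proc/{entry}/cmdline', 'rb') as f:
                args = f.read().split(b'\0')
        except (FileNotFoundError, PermissionError):
            continue  # process exited (or is inaccessible) while scanning
        if any(b'gcsfuse' in arg for arg in args) and any(mount_path.encode() in arg for arg in args):
            pids.append(int(entry))
    return pids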


jigold commented May 3, 2023

> I'm not sure this change is thorough enough. Is there a way for a bucket to get partially mounted but have config['mounted'] still be False?
>
> I suppose this is exactly what happened when the user was able to create a duplicate mount. The unmount succeeded, but it did not rid the worker of the fuse process. Perhaps the more appropriate thing to do, instead of relying on the outcome of the unmount operation, is to check for the existence of the FUSE process.

Is it worth making a separate PR that explicitly manages the gcsfuse process PIDs? If yes, can you do that?

        config,
    )
finally:
    config['mounted'] = True
Contributor


I think this makes things more confusing. I don't know what "partially mounted" means, and I would prefer to treat config['mounted'] = True as meaning "we believe it was mounted" and not "we believe it might have been mounted". What happens if the mount failed and then we later try to unmount?
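For comparison, a hedged sketch of that stricter interpretation, where 'mounted' flips to True only after a successful mount and back to False only after a successful unmount; mount_cloudfuse and unmount_cloudfuse are stand-ins for the worker's real helpers, passed in only to keep the sketch self-contained:

async def mount_all(cloudfuse_configs, mount_cloudfuse):
    for config in cloudfuse_configs:
        await mount_cloudfuse(config)  # raises on failure, so 'mounted' stays False
        config['mounted'] = True

async def unmount_all(cloudfuse_configs, unmount_cloudfuse):
    # Only touch buckets we believe were actually mounted.
    for config in cloudfuse_configs:
        if config['mounted']:
            await unmount_cloudfuse(config)
            config['mounted'] = False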

@daniel-goldstein

> Is it worth making a separate PR that explicitly manages the gcsfuse process PIDs? If yes, can you do that?

Ya, I can do this. Should we try to just keep it simple in this PR then? Like just what you had originally, where we return after any call to unmount fails.

@@ -2138,6 +2148,8 @@ async def run(self):
        for config in self.cloudfuse:
            bucket = config['bucket']
            assert bucket
            assert config['mount_path'] != '/io'
            assert config['mount_path'].startswith('/cloudfuse')
Contributor


did this slip in from another change?

Contributor Author


Daniel requested this for this PR.

Contributor


I don't mind if we move it into a separate PR, just that we should ultimately have these assertions.

    except asyncio.CancelledError:
        raise
    except Exception:
        log.exception(f'error while unmounting cloudfuse in jvm job {self.jvm_name}')
Contributor


So, this config['mounted'] = False or True thing feels like a more complicated version of just letting the exception raise all the way up.

If the unmount function raises an exception, shouldn't we just let it propagate all the way up? That will skip cleanup.

We should, perhaps, separately, inspect /proc/mounts to verify that there are no mounts under the scratch directory we're about to delete.
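A minimal sketch of that /proc/mounts check; the helper name and the rmtree usage below are hypothetical:

def mounts_under(directory: str) -> list:
    # The second whitespace-separated field of each /proc/mounts line is the
    # mount point (with spaces escaped as \040). Return every mount point at
    # or below `directory`.
    prefix = directory.rstrip('/') + '/'
    mount_points = []
    with open('/proc/mounts') as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and (fields[1] + '/').startswith(prefix):
                mount_points.append(fields[1])
    return mount_points

# Before deleting a scratch directory:
#     assert not mounts_under(scratch_dir), f'refusing to rmtree {scratch_dir}: still mounted'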


@jigold jigold May 3, 2023


I'm not quite sure what to do here. What if the user gives us a bad bucket or doesn't have permissions? If we let the exception propagate here even when the mount attempt failed, that will bork the worker, especially in the JVM case, where we wouldn't return the JVM.

Contributor


> What if the user gives us a bad bucket or doesn't have permissions?

Wouldn't this mean that the mount never succeeded in the first place?


jigold commented May 4, 2023

closed in favor of #12985

@jigold jigold closed this May 4, 2023
danking pushed a commit that referenced this pull request May 4, 2023