
[batch] Don't rmtree if unmounting cloudfuse fails #12975

Closed
wants to merge 2 commits

Conversation


@jigold jigold commented May 3, 2023

I'm not sure this change is thorough enough. Is there a way for a bucket to get partially mounted but have config['mounted'] still be False?

if self.cloudfuse:
    n_mounted = len([config for config in self.cloudfuse if config['mounted']])
    if n_mounted > 0:
        return

@daniel-goldstein daniel-goldstein May 3, 2023


I think it's worth outlining here what the danger is and adding a comment for defensive pieces of code like this. Otherwise code like this will pile up and we will forget what it was instituted for. I'm very likely to delete what I think is redundant or unnecessary code unless a comment warns me against that line of thinking.
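For context, a hedged sketch of the cleanup path and the danger being guarded against; the function shape and names (cleanup_scratch, scratch_dir) are illustrative, not the actual worker code:

import shutil

def cleanup_scratch(cloudfuse_configs, scratch_dir):
    # Illustrative sketch: if any cloudfuse bucket is still believed to be
    # mounted (e.g. because an unmount failed), bail out before deleting the
    # scratch directory, so rmtree never recurses through a live FUSE mount
    # into the user's bucket.
    if cloudfuse_configs:
        n_mounted = len([config for config in cloudfuse_configs if config['mounted']])
        if n_mounted > 0:
            return
    shutil.rmtree(scratch_dir, ignore_errors=True)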

Contributor


Additionally, we should institute a requirement that JVMJob cloudfuse mount destinations start with /cloudfuse, as that is the only mount point that we set as slave and could therefore be used in JVMJobs.

Contributor


Actually, apologies, I think bailing out early if an unmount fails is self-explanatory enough. I think what initially gave me pause is this happening in a JVMJob. What's the state of a JVM container where an unmount has failed? That means that another job that comes in could potentially see that mount? That seems really not great; I feel like we should kill the JVM container at that point and create a new one.

Contributor


It might be worth looking at JVMJob fuse mounts separately from DockerJob fuse mounts to properly assess where things could go wrong and what the outcome should be.


jigold commented May 3, 2023

I added a finally to the bucket mounting operations that marks the bucket as mounted, so we always assume it could have been at least partially mounted. Feel free to push back on this.

@daniel-goldstein

> I'm not sure this change is thorough enough. Is there a way for a bucket to get partially mounted but have config['mounted'] still be False?

I suppose this is exactly what happened when the user was able to create a duplicate mount. The unmount succeeded, but it did not rid the worker of the fuse process. Perhaps the more appropriate thing to do, instead of relying on the outcome of the unmount operation, is to check for the existence of the FUSE process.
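A minimal sketch of that FUSE-process check, assuming the process can be identified by scanning /proc for a gcsfuse command line that mentions the mount path; the worker's actual bookkeeping may track PIDs differently:

import os

def find_fuse_pids(mount_path: str) -> list:
    # Heuristic: look for a live gcsfuse process whose command line mentions
    # the mount path. A worker that records the PID at mount time could check
    # that PID directly instead.
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open(f'/proc/{entry}/cmdline', 'rb') as f:
                args = f.read().split(b'\0')
        except (FileNotFoundError, PermissionError):
            continue  # process exited (or is inaccessible) while scanning
        if any(b'gcsfuse' in arg for arg in args) and any(mount_path.encode() in arg for arg in args):
            pids.append(int(entry))
    return pids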


jigold commented May 3, 2023

> I'm not sure this change is thorough enough. Is there a way for a bucket to get partially mounted but have config['mounted'] still be False?
>
> I suppose this is exactly what happened when the user was able to create a duplicate mount. The unmount succeeded, but it did not rid the worker of the fuse process. Perhaps the more appropriate thing to do, instead of relying on the outcome of the unmount operation, is to check for the existence of the FUSE process.

Is it worth making a separate PR that explicitly manages the gcsfuse process PIDs? If yes, can you do that?

        config,
    )
finally:
    config['mounted'] = True
Contributor


I think this makes things more confusing. I don't know what "partially mounted" means, and I would prefer to treat config['mounted'] = True as meaning "we believe it was mounted" and not "we believe it might have been mounted". What happens if the mount failed and then we later try to unmount?
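For comparison, a hedged sketch of that stricter interpretation, where 'mounted' flips to True only after a successful mount and back to False only after a successful unmount; mount_cloudfuse and unmount_cloudfuse are stand-ins for the worker's real helpers, passed in only to keep the sketch self-contained:

async def mount_all(cloudfuse_configs, mount_cloudfuse):
    for config in cloudfuse_configs:
        await mount_cloudfuse(config)  # raises on failure, so 'mounted' stays False
        config['mounted'] = True

async def unmount_all(cloudfuse_configs, unmount_cloudfuse):
    # Only touch buckets we believe were actually mounted.
    for config in cloudfuse_configs:
        if config['mounted']:
            await unmount_cloudfuse(config)
            config['mounted'] = False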

@daniel-goldstein

> Is it worth making a separate PR that explicitly manages the gcsfuse process PIDs? If yes, can you do that?

Ya, I can do this. Should we try to just keep it simple in this PR then? Like just what you had originally, where we return after any call to unmount fails.

@@ -2138,6 +2148,8 @@ async def run(self):
        for config in self.cloudfuse:
            bucket = config['bucket']
            assert bucket
            assert config['mount_path'] != '/io'
            assert config['mount_path'].startswith('/cloudfuse')
Contributor


did this slip in from another change?

Contributor Author


Daniel requested this for this PR.

Contributor


I don't mind if we move it into a separate PR, just that we should ultimately have these assertions.

    except asyncio.CancelledError:
        raise
    except Exception:
        log.exception(f'error while unmounting cloudfuse in jvm job {self.jvm_name}')
Contributor


So, this config['mounted'] = False or True thing feels like a more complicated version of just letting the exception raise all the way up.

If the unmount function raises an exception, shouldn't we just let it propagate all the way up? That will skip cleanup.

We should, perhaps, separately, inspect /proc/mounts to verify that there are no mounts under the scratch directory we're about to delete.
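A minimal sketch of that /proc/mounts check; the helper name and the rmtree usage below are hypothetical:

def mounts_under(directory: str) -> list:
    # The second whitespace-separated field of each /proc/mounts line is the
    # mount point (with spaces escaped as \040). Return every mount point at
    # or below `directory`.
    prefix = directory.rstrip('/') + '/'
    mount_points = []
    with open('/proc/mounts') as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and (fields[1] + '/').startswith(prefix):
                mount_points.append(fields[1])
    return mount_points

# Before deleting a scratch directory:
#     assert not mounts_under(scratch_dir), f'refusing to rmtree {scratch_dir}: still mounted'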


@jigold jigold May 3, 2023


I'm not quite sure what to do here. What if the user gives us a bad bucket or doesn't have permissions? If we let the exception propagate here even when the mount attempt failed, that will bork the worker, especially in the JVM case, where we wouldn't return the JVM.

Contributor


> What if the user gives us a bad bucket or doesn't have permissions?

Wouldn't this mean that the mount never succeeded in the first place?


jigold commented May 4, 2023

closed in favor of #12985

@jigold jigold closed this May 4, 2023
danking pushed a commit that referenced this pull request May 4, 2023