
In memory security contexts are not supported #520

Closed
riedel opened this issue Sep 12, 2021 · 10 comments · Fixed by #524
Assignees
Labels
bug Something isn't working

Comments

@riedel
Member

riedel commented Sep 12, 2021

See https://github.com/dask/distributed/blob/3754a6694f72207b69aaf661445aecf18fb5f0cb/distributed/security.py#L211

i.e. one cannot use Security.temporary() to create a cluster.

distributed allows using in-memory keys/certificates, where the PEM content itself is held in memory rather than in a file. It detects this by checking whether the value is multiline.

The expansion into the extra worker command-line arguments does not support this:

["--tls-" + key.replace("_", "-"), value]

In this case a temporary file needs to be created.
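A minimal sketch of what such an expansion could look like. The helper name `tls_args` and the exact option dict are illustrative, not the actual dask-jobqueue code; the only assumption taken from distributed is that a multiline value means inline PEM content:

```python
import tempfile


def tls_args(tls_opts):
    """Expand TLS options into worker CLI arguments.

    If a value is multiline, it is treated as in-memory PEM content
    (the same heuristic distributed's Security class uses) and written
    to a temporary file whose path is passed on the command line.
    Returns the argument list and the created file paths, which the
    caller must keep until the cluster is closed.
    """
    args = []
    tempfiles = []
    for key, value in tls_opts.items():
        if value is not None and "\n" in value:
            f = tempfile.NamedTemporaryFile(
                mode="w", suffix=".pem", delete=False
            )
            f.write(value)
            f.close()
            tempfiles.append(f.name)
            value = f.name  # substitute the file path for the content
        args.extend(["--tls-" + key.replace("_", "-"), value])
    return args, tempfiles
```

Note that the returned `tempfiles` list is exactly the lifecycle problem discussed below: someone has to delete those files, but only once the workers no longer need them.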

@riedel
Member Author

riedel commented Sep 12, 2021

Just tracking this because I will need to fix it in #519 to work on #496.

@riedel riedel self-assigned this Sep 12, 2021
@riedel
Member Author

riedel commented Sep 12, 2021

I guess the main challenge will be handling the temporary files, as they need to survive until the cluster is closed.
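One way to frame that challenge: give the cluster object an owner for the files it materializes and delete them on close. The class below is a hypothetical sketch of that pattern, not dask-jobqueue code:

```python
import os
import tempfile


class ClusterCertFiles:
    """Sketch: tie temporary certificate files to a cluster's lifetime.

    A cluster object would hold one instance of this and call
    cleanup() from its close() method, so the files outlive job
    submission but not the cluster itself.
    """

    def __init__(self):
        self._paths = []

    def materialize(self, pem_content):
        """Write in-memory PEM content to a temp file, return its path."""
        f = tempfile.NamedTemporaryFile(
            mode="w", suffix=".pem", delete=False
        )
        f.write(pem_content)
        f.close()
        self._paths.append(f.name)
        return f.name

    def cleanup(self):
        """Remove all materialized files; safe to call more than once."""
        for p in self._paths:
            try:
                os.unlink(p)
            except FileNotFoundError:
                pass
        self._paths.clear()
```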

@guillaumeeb
Member

guillaumeeb commented Oct 7, 2021

Can you think of another solution to avoid having to go through a file (which seems to involve a lot of unwanted code...)?

@riedel
Member Author

riedel commented Oct 8, 2021

Not to defend myself, but temporary certificates always go through a file: https://github.com/dask/distributed/blob/8fa3854514c5ede53e9c0381776f6109882ff778/distributed/security.py#L213 .

However, you are correct that it is a problem regarding the lifecycle of the file and correct use of the API. It seems the command-line interface of the workers that dask-jobqueue uses does not support any other way to ingest a certificate into the process (I have tried putting the whole key on the command line).

What dask-cloudprovider does is create a dask config: https://github.com/jacobtomlinson/dask-cloudprovider/blob/feb86a8da834bc0b34d6408686c683b976d85a10/dask_cloudprovider/generic/vmcluster.py#L253-L266
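To illustrate the idea behind that approach: the TLS material is embedded in a config dict which is serialized into an environment variable that the worker process reads at startup. Dask has its own serialization helper for this (`dask.config.serialize`); the standalone version below mimics the scheme for illustration only and is not the actual dask-cloudprovider code:

```python
import base64
import json
import zlib


def serialize_config(config):
    """Serialize a config dict for passing through an env var.

    Illustrative stand-in for dask's own config serialization
    (json -> zlib -> urlsafe base64), so no file ever touches disk.
    """
    raw = json.dumps(config).encode()
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode()


# Hypothetical inline TLS config; the PEM string is a placeholder.
tls_config = {
    "distributed": {
        "comm": {
            "tls": {
                "ca-file": "-----BEGIN CERTIFICATE-----\n...",
            }
        }
    }
}

# The worker process would deserialize this at startup.
env = {"DASK_INTERNAL_INHERIT_CONFIG": serialize_config(tls_config)}
```

The appeal is that no temporary file lifecycle is needed at all; the drawback, as noted below, is that changing how dask-jobqueue passes config to workers could have rather global effects.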

My rationale was that this might change or break more things in dask-jobqueue, because it could have rather global effects. But maybe I am mistaken here.

@guillaumeeb
Member

Okay, so I guess we should stick with this approach and say we do not support in-memory contexts.

@riedel
Member Author

riedel commented Oct 10, 2021

Sorry for asking, but do you mean that

a) in-memory contexts should not be automatically converted into temp files for the workers, so we do not support distributed.security.Security.temporary() and users need to explicitly write their certificates somewhere

or

b) we just do not support keeping them in memory for the workers that are sent to the batch scheduler, which would in fact add support from a user's perspective?

In case of a), I guess an exception should be thrown and there would be no sound way of certificate auto-generation. I could then strip #519 down to a bare minimum and add a "not supported" exception instead. This would force users to handle temp certificates themselves.

Case b) is covered in #519. As the tempfile code is only triggered when people use distributed.security.Security.temporary(), I personally see no real harm. And like @jacobtomlinson said, we are assuming a shared file system anyway. One alternative would be to create a whole temporary directory where everything lives, including the jobscript (which is also a temporary file); that way no additional tempfile/dir lifecycle handling would be needed, but the changes would affect other parts of the code. One last alternative I can think of is generating the certificate file only in the jobscript; however, this would mean that all jobscripts need to be changed to support it.

@guillaumeeb
Member

Sorry, I thought I had answered this... I meant b).

@riedel
Member Author

riedel commented Oct 31, 2021

And like @jacobtomlinson said we are assuming a shared file system anyways.

I was actually thinking this the whole time too, until you reminded me of /tmp. I started digging into the code to find where this assumption is actually encoded. It seems that HTCondor is explicit about it:

__doc__ = """ Launch Dask on an HTCondor cluster with a shared file system

However, when setting up the new CI I think I completely ignored this:

volumes:
secrets:

So I think it probably does not depend on this.

Here is a hint that this can be very different on different setups:

and/or shared filesystem setup, one of those methods may not work,

Here you find hints that even python might not be shared:
https://github.com/dask/dask-jobqueue/blob/58e0ce2111e74484dca4e2469698bbe68a2240b2/docs/source/debug.rst#checking-job-script

Bottom line: the testing of security should actually be done on all schedulers and the assumption should be dropped, IMHO. It might well be the case that file-based certificates also need to be copied to the workers. #524

What do you think?

@guillaumeeb: thanks for nagging me here. I think this led to much clearer thinking. One positive outcome should be that the testing strategy gets overhauled a bit, since the CI run on #519 for all the schedulers should be a warning...

@riedel
Member Author

riedel commented Oct 31, 2021

I confirmed that file-based certificates also won't work on the HTCondor and Slurm CI: #532. Now the prize question: is this a bug in the CI or in the code? It should at least be documented that one has to be smart enough to put the certificates in some accessible place. On the other hand: how would code fixing this issue know where that is...

@riedel riedel added the bug Something isn't working label Nov 16, 2021
@guillaumeeb
Member

Sorry for the delay @riedel, and as always thanks for your investments in dask-jobqueue!

Now the price question: is this a bug of the ci or the code

I'd say neither.

I think all dask-jobqueue Clusters can work without any shared file system in some specific setups. Some CI environments show this. That should remain the case.

The code for generating a temporary file should just try to do it in the user's folder, and be explicit (with some logging?) that the location where the file is generated must be accessible from all workers. Maybe we can add a configuration option for this location (as we do for the worker log directory). All this should also be clearly documented.

For the CI, we could test this only in CI environments with shared folders. I don't know if we want to add a shared file system to all tests just to test this. I don't think so.
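A sketch of what such a fallback could look like. The function and option names are hypothetical (no such config key exists in dask-jobqueue); the only idea taken from the discussion is "prefer a configured location, fall back to the user's home folder, and log where the file ends up":

```python
import logging
import os

logger = logging.getLogger(__name__)


def cert_directory(configured=None):
    """Choose where temporary certificate files should be written.

    If a location is configured (analogous to the worker log-directory
    option), use it; otherwise fall back to the user's home folder,
    which is shared across nodes on typical HPC setups. Either way,
    the location must be accessible from all workers.
    """
    if configured:
        os.makedirs(configured, exist_ok=True)
        directory = configured
    else:
        directory = os.path.expanduser("~")
    logger.info(
        "Writing temporary certificates to %s; this location must be "
        "readable from all worker nodes.", directory
    )
    return directory
```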

riedel added a commit to riedel/dask-jobqueue that referenced this issue Nov 27, 2021