More options for using AiiDA with locked-down supercomputers #3929
Comments
Just to comment - I think this could become an increasingly important issue, if current trends continue. There was a security incident affecting Tier 1 and 2 HPC machines in the UK this week and it looks possible that SSH-alone access may no longer be possible. If these sorts of highly-targeted (and possibly state-sponsored?) attacks continue, it seems highly likely that many more HPC machines will become locked down with 2FA and similar measures. This is quite a significant threat to usefulness and impact of AiiDA, if not addressed. |
I fully agree with @ConradJohnston that we are likely to see more centres adopting 2FA rather than fewer, and this potentially poses a great problem for us. The model could shift to having AiiDA run on the cluster, but this will likely face resistance from admins as well, given the database and AMQP services we require. In addition, this way AiiDA loses one of its strong suits: it is easy to target multiple compute resources from one central machine. The options mentioned by @ltalirz are certainly worth looking into, but they have serious restrictions. One of the nice mechanisms of AiiDA 1.0 is that it can gracefully recover from temporary connection problems. By going for persistent connections that have to be manually authorized, we severely undermine the automation of AiiDA. It will require a lot more human interaction, which is clearly undesirable. Option number 2 runs the risk that we will end up with many custom solutions for the various centers if a standard does not emerge over time. This will likely make the configuration of machines even more complicated and burdensome for users (never mind the development overhead for the AiiDA team). I guess what I am saying here is that we should try to start addressing these issues with the big computational centers themselves. If we think that operational schemes and tools like AiiDA are going to be more and more common, we should reach out to them to make them aware of these use cases so they can estimate what the impact of their changes might be. @ConradJohnston if you would be willing to help us out as the AiiDA ambassador and envoy for the UK 😉 that'd be great |
I was just about to write to the AiiDA mailing list concerning this. Archer mentions you will need "SSH key and a password" https://www.archer.ac.uk/status/ Also pinging @giovannipizzi for info |
@ltalirz So previously it was the case that you could access the ARCHER service with just a password. You could then have installed your own SSH key to be able to use AiiDA. My understanding is that going forward, and also for ARCHER2, you will now need both an SSH key (password protected or not) and to enter a password in the shell. @sphuber - Happy to help on that front! However, if this is going to be an ongoing issue across different HPC services, we should perhaps draft a standard letter to outline what AiiDA is and what the problem is. Otherwise, the risk is that it's dismissed as a niche use-case. |
I was pointed at this thread by a researcher who wants to use AiiDA on ARCHER following the changes to access mechanisms. I think that we (by "we", I mean the community of HPC professionals, RSEs, tool developers and researchers) all need to work together to find solutions that let people use these tools while maintaining security on HPC systems. It is my opinion that the use of 2FA (likely with TOTP solutions) is going to become much more widespread on HPC systems. One solution to this issue that some US centres have used is to provide dedicated workflow nodes (see, for example: https://docs.nersc.gov/jobs/workflow/workflow_nodes/). I appreciate that this does not allow users to use different resources, but it at least provides a way for them to run in some form. Time-limited key access is also an option, as mentioned above; maybe this is a more attractive solution. This was how things used to work back in the days of the grid (with grid proxy certificates). We (ARCHER) are gathering requirements for use of these tools in the world of 2FA so I will feed the useful comments here into that wider discussion. If we were to put together a virtual event to discuss requirements and possible solutions, would the AiiDA team be interested in being involved? |
Hi @aturner-epcc , thanks for stopping by :-) |
@ltalirz Thanks, the virtual meeting is still just vaporware at the moment but I think it is an important issue so will try and find a way to make it happen. I will drop you a line once we are a bit further on with organising such a meeting. |
@ltalirz @aturner-epcc Any update with this? I would love to continue using archer with AiiDA in a way without any workarounds. |
Our group at the University of Bath also has an increasing number of AiiDA users and we would be happy if we could use ARCHER to perform our calculations. |
@zhubonan @pzarabadip No update yet but this use case has been flagged to the service. |
Sorry, nothing constructive... Just a heavy AiiDA user who is just starting a project on ARCHER and would really need this... |
I discussed with @pzarabadip this evening - he has a temporary solution for ARCHER2 that he discussed with the people responsible for the service, but it may (or may not) still need some tweaks before being released publicly (feel free to contact him if you're interested). As ARCHER2 starts to be opened from next week, he will be in contact with them to get feedback on what an "official" version could look like. |
FYI: I have made a solution for ARCHER2 (in the form of transport+scheduler plugins) available here: https://github.com/zhubonan/aiida-archer2-scheduler. |
Hi @zhubonan , Great to have another proposed workaround to this issue. :) However, I don't think this is a better solution than simply opening a connection to a locked-down supercomputer the canonical way using SSH and then just forwarding the AiiDA traffic to it locally. I very much appreciate the work that went into this, and we do need a solution for sure, but I wouldn't be supportive of this. There's a risk of reputational damage to the AiiDA project - if a lot of users are using this kind of solution, HPC administrators may choose to disallow AiiDA on their systems. I'd rather be inconvenienced and respect the access that
Hi @ConradJohnston, thanks for your reply. I think these are valid concerns. Having the password in plain text in an environment variable is certainly not ideal, so users should use it at their own risk and minimise the exposure as suggested in the README.md file of the plugin. If there is a better way to pass secret information to a long-running python process, I am more than happy to implement it for the scheduler plugin.
Can you please elaborate how to do this, is that already supported by AiiDA?
I don't think this is quite true - most people manually launching SSH are likely to store password in plain text, somewhere. If a password manager is used then the password will leak to the clipboard/buffer anyway. If the system is compromised then the chance of leaking the password is the same, if not more. |
It's supported by AiiDA natively in the sense that AiiDA doesn't know or care that you're doing it - it's some SSH config magic.
You may be right there. However, I still think we need to tread lightly when it comes to this business and engage the administrators as best we can. For example, the move to 2FA and keyboard-interactive authentication on many systems came as a reaction to the 2020 attacks on HPC, which reportedly exploited users' passwordless SSH keys.
I think this quote captures well the tension that exists. Admins have users who do not maintain best practice or even actively try to circumvent measures. At the same time, job and workflow managers exist and are growing in popularity and sophistication. There's perhaps a need to arrange a short workshop on this issue and try to invite as many relevant system admins as possible. Do you think this would be feasible, @giovannipizzi ? |
PASC would have been one possible venue for this but the deadline for minisymposia suggestions just passed (Nov 13th). Anyhow, I agree that a meeting on this with broad participation could be very useful and would probably be a good time investment |
@zhubonan From a service provider perspective, I do not think we would be able to endorse this as an appropriate approach for connecting to ARCHER2 from AiiDA. I think you are generally correct that the risk is low but the precedent of coding your own workarounds to security setup is definitely not something we can support. At the moment, for this type of use case, we generally recommend the use of SSH multiplexing which, I think, gets close to what you are trying to achieve. You setup the SSH connection using your credentials and then all SSH traffic is routed through the already established connection. For us, the advantage over your approach is that you are using a standard feature of SSH rather than trying to code your own workaround. In the longer term, we are looking at better ways to support workflow managers given the rise in popularity (given that MFA seems definitely here to stay for HPC access) so would definitely be interested in participating in a workshop to look at this. We have a few ideas of how to go about this and it would be really valuable to get input from AiiDA users and developers. |
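A minimal sketch of the multiplexing workflow described above, assuming the OpenSSH `ssh` client is installed locally: the user opens one master connection interactively (password, 2FA, etc.), and later processes reuse its control socket. Since the thread notes that the Python SSH library used by AiiDA does not support this, the sketch simply shells out to `ssh`. The function names, the host name in the usage note, and the socket location are hypothetical; the flags (`-M`, `-N`, `-f`, `-O check`, `ControlPath`, `ControlPersist`) are standard OpenSSH options. This is not how AiiDA itself connects.

```python
import subprocess
from typing import List

# Assumed socket location; %r, %h and %p are expanded by OpenSSH itself.
# The directory ~/.ssh/cm_socket must already exist.
CONTROL_PATH = "~/.ssh/cm_socket/%r@%h:%p"


def master_command(host: str, control_path: str = CONTROL_PATH) -> List[str]:
    """Build the command that opens a background master connection."""
    return [
        "ssh",
        "-M",  # act as the multiplexing master
        "-N",  # do not run a remote command
        "-f",  # go to the background after authentication
        "-o", f"ControlPath={control_path}",
        "-o", "ControlPersist=yes",
        host,
    ]


def check_command(host: str, control_path: str = CONTROL_PATH) -> List[str]:
    """Build the command that asks the master whether it is still running."""
    return ["ssh", "-o", f"ControlPath={control_path}", "-O", "check", host]


def master_is_alive(host: str, control_path: str = CONTROL_PATH) -> bool:
    """Return True if a master connection for `host` answers the check."""
    try:
        result = subprocess.run(
            check_command(host, control_path),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # No ssh executable on this machine.
        return False
    return result.returncode == 0
```

A user would run the list returned by `master_command("login.example.org")` once in a terminal to authenticate, and a monitoring script could then poll `master_is_alive("login.example.org")` to notice when the connection needs to be re-opened by hand.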
Hi @aturner-epcc , thanks a lot for your input! SSH multiplexing is the first option mentioned in the original post in this thread - unfortunately, the python library AiiDA uses for handling the SSH connections does not support
I think it is worth for someone of the AiiDA team (or outside!) to once have a look at @aturner-epcc From your experience, can there be any performance issues with multiplexing compared to opening multiple connections? If we organize a meeting/workshop around this question, we'll make sure to invite you. |
@ltalirz @ConradJohnston @aturner-epcc thanks for getting in touch and discussing this. @aturner-epcc I agree that having to "work around" the security setup is not good. Multiplexing is probably the way to go moving forward. One potential issue would be that the AiiDA daemon will not be able to "reconnect" to the cluster unattended. The master connection can be watched with
I just had a test of this,
|
I did some further research - it seems that the "ControlMaster" style multiplexing is a feature of OpenSSH, and I did not find many other libraries that support it, e.g. by using an existing socket from OpenSSH's ControlMaster. I have tried There is this PR for paramiko that has been there for a while: paramiko/paramiko#1341, which seems to add support for it, but it is not merged. |
Thanks a lot for checking @zhubonan ! It looks to me like what is holding up the rebased version of the paramiko PR is just an issue in the CI setup. |
Thanks for kick starting that PR again. Hopefully it can be merged soon, I will do some experiments in the mean time and see if it works. |
Thanks for the useful discussion here everybody! I agree that the solution of @zhubonan, while practical, should be considered a workaround and not be used in production, or at least not suggested as an endorsed solution - it should be very clear in the readme that we actually suggest not to use it in practice. I agree with @ConradJohnston that when it comes to these things, people will think that AiiDA is doing it in an insecure way. I'm happy to have a discussion around how to move forward. At the time I had looked, indeed it seemed that by design it's hard to support the ControlMaster feature outside of the SSH executable. But let's see what happens with the paramiko PR and we can discuss if this would work! Luckily AiiDA will pause the processes if there are connection issues (to be tested in case the connection issues come from the multiplexing not working anymore because the underlying connection was closed) so we'll just have to live with someone re-opening the connection if it goes down, and re-playing the processes. Also, @aturner-epcc, have you already checked/discussed with the CSCS people the solution they want to provide for their supercomputers in Switzerland? E.g. https://products.cscs.ch/firecrest/ ? We're going to discuss whether we can add support for FirecREST in AiiDA at the next coding week, in 2 weeks |
In cases where HPC centers don't offer access via SSH keys (e.g. requiring 2-factor authentication), the only way to use AiiDA currently is to install it on the cluster (which is possible and has become a lot easier since the introduction of the aiida-core and aiida-core.services conda packages).
However, there are also alternative routes we could explore, which I list below such that they don't get lost:
1. Opening an SSH connection once and reusing it (e.g. using ControlPersist yes and ControlPath ~/.ssh/cm_socket/%r@%h:%p in the ~/.ssh/config file). This is currently not supported by paramiko, but there are alternative Python bindings like ssh2-python and parallel-ssh we could look into.
2. The team at NERSC has developed a small set of scripts called sshproxy that grants temporary access to SSH keys on the cluster side. I.e. you authenticate once, then something on the cluster "enables" your key for a period of time (say 24h), and after that time the key is disabled again. It seems they haven't put it on their GitHub yet, but if they were asked perhaps they would be fine with open sourcing it. Of course, this route would always require action from the cluster administrator.
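For the first route above, the relevant `~/.ssh/config` stanza can be sketched with a small helper that renders it as text. This is only an illustration: the function name, host alias, host name and user below are hypothetical, and `ControlMaster auto` is an addition not mentioned in the list (it is the standard OpenSSH directive that lets later sessions reuse the first connection). Note that the OpenSSH option is spelled `ControlPersist`.

```python
def render_ssh_config(
    host_alias: str,
    hostname: str,
    user: str,
    control_dir: str = "~/.ssh/cm_socket",
) -> str:
    """Render an ssh_config stanza that enables connection sharing."""
    lines = [
        f"Host {host_alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        # Reuse an existing master connection, or become the master if none exists.
        "    ControlMaster auto",
        # Socket file; %r, %h and %p are expanded by OpenSSH (user, host, port).
        f"    ControlPath {control_dir}/%r@%h:%p",
        # Keep the master connection open after the first session exits.
        "    ControlPersist yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; review the output before adding it to ~/.ssh/config.
    print(render_ssh_config("mycluster", "login.example.org", "alice"))
```

The directory given as `control_dir` has to exist before the first connection is opened, otherwise `ssh` cannot create the socket.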
P.S. We might anyhow want to look into ssh2-python for performance reasons. See also here and here for a comparison with paramiko.
Mentioning @sphuber for info