
Cluster workflow feature to allow shell commands or script to run before remote server setup (e.g. slurm) (wrap install script) #1722

Open
wwarriner opened this issue Oct 24, 2019 · 124 comments
Labels: feature-request (Request for new features or functionality), ssh (Issue in vscode-remote SSH)

@wwarriner

I want to be able to connect to our institution's cluster using VS Code Remote SSH, with the server running on a compute node instead of the login node. The preferred workflow is to SSH into the login node, use a command to allocate a job and spin up an interactive shell on a compute node, and then run any further tasks from there. VS Code Remote SSH doesn't appear to have a feature that facilitates this workflow. I want to be able to inject the spin-up command immediately after SSH'ing into the cluster, but before the VS Code server is set up on the cluster, and before any other tasks are run.

@wwarriner
Author

I managed to modify the extension.js file in the following way:

Ctrl+F for the string literal "bash"
Change "bash" to "bash -c \"MY_COMMAND bash\""

I've confirmed that this correctly starts the VS Code Remote SSH server on a compute node. Now I am running into a port-forwarding issue, possibly related to issue #92. Our compute nodes have the ports used by VS Code Remote SSH disabled, so there isn't an easy way around this issue.

Thanks for the hard work on this so far! This extension has extraordinary potential. Being able to run and modify a Jupyter notebook remotely on our cluster, while using IntelliSense and GitLens, AND conda environment detection and dynamic swapping, all in a single application for FREE is incredible.

@roblourens
Member

roblourens commented Oct 27, 2019

Our compute nodes have the ports used by VS Code Remote SSH disabled, so there isn't an easy way around this issue.

Do you mean that port forwarding for ssh is disabled on that server? Or are you able to forward some other port over an ssh connection to that server?

@roblourens roblourens added the info-needed Issue requires more information from poster label Oct 27, 2019
@wwarriner
Author

Port forwarding for SSH is not disabled on any part of our cluster. I am not intentionally attempting to forward any other ports to the server. I was using remote.SSH.enableDynamicForwarding and remote.SSH.useLocalServer. Your questions gave me the idea to disable those options. I can't determine whether that has helped, because my earlier assertion was incorrect: I can't actually get the server to run on a compute node.

To address that issue, and to clarify our workflow some: we are using Slurm. It is highly preferred to have tasks running within a job context so that login node resources aren't consumed. To do that, we create a job using srun (or one of its siblings) with appropriate resource request parameters. Any commands we want to run are provided as the final argument to srun. Every call to srun must include a command, apparently because it uses execve() to invoke it; if no command is passed, srun fails with an error message. With that in mind, setting up the VS Code server on the remote would have to be funneled through a call to srun. Any other method of invocation (such as bash -c) results in commands being run outside the job context, and thus on the login node. Naively modifying the bash invocation does not work, apparently because srun never receives any arguments. It isn't clear to me how the server installer gets invoked and set up, so I can't offer any suggestions.

As a side note, it is also possible to pass --pty bash to srun to get a terminal within the job context on a node allocated to that job. Looking at #1671, specifically here, it seems like it should be possible to adjust the invocation of bash -ilc to do additional things (found by Ctrl+F). I've tried testing this, but as far as I can tell that code is never called; I used echo for debugging.

@roblourens
Member

What code do you mean by "that code"? I don't think the issue you point to is related.

We run the installer script essentially like echo <installer script here> | ssh hostname bash. There is an old feature request to be able to run a custom script before running the installer. I am not sure whether that would help you here; is there a way with Slurm to run a command, then have the rest of the same script run in a job context?

It sounds more like you need a way to wrap the full installer script in a custom command, like srun "<installer script here>". Is that right?
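As a stand-in demo of the pipe-into-a-wrapped-shell pattern being discussed (no ssh or Slurm here: `env` plays the role of the hypothetical wrapper command, and the one-line "installer" is made up):

```shell
# Today the install runs roughly as:
#   echo "<installer script>" | ssh hostname bash
# The request is to let the trailing "bash" be wrapped, e.g.:
#   echo "<installer script>" | ssh hostname "srun -n 1 bash"
# Local stand-in, with `env` as the wrapper command:
echo 'echo "installer ran on $(hostname)"' | env bash
```

Because the script arrives on stdin of whatever command the wrapper finally executes, any wrapper that ends by running `bash` (as `srun ... bash` should) would receive it unchanged.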

@wwarriner
Author

Yes to your last question, ideally with the ability to customize the wrapping command.

@roblourens roblourens self-assigned this Oct 29, 2019
@roblourens roblourens added feature-request Request for new features or functionality ssh Issue in vscode-remote SSH and removed info-needed Issue requires more information from poster labels Oct 29, 2019
@roblourens roblourens added this to the Backlog milestone Oct 29, 2019
@roblourens roblourens changed the title Cluster workflow feature to allow shell commands or script to run before remote server setup Cluster workflow feature to allow shell commands or script to run before remote server setup (wrap install command)0 Oct 29, 2019
@roblourens roblourens changed the title Cluster workflow feature to allow shell commands or script to run before remote server setup (wrap install command)0 Cluster workflow feature to allow shell commands or script to run before remote server setup (wrap install script) Oct 29, 2019
@nicocarbone

This would be an important feature for vscode-remote. I am currently trying to use VS Code to run some interactive Python code on a shared cluster, and the only way of doing it is with Slurm's srun command. I'll try to find a workaround, but I think there really is a use case for this feature request.

@daferna

daferna commented Nov 13, 2019

I've got the same issue, but with using LSF instead of SLURM.
As @roblourens points out here: #1829 (comment)
just running the install script and starting the server only solves half the problem. Once the server is started, I surmise that VS Code will still try SSHing directly into the desired (login-restricted) machine to discover which port the remote server picked, as well as to start the new terminals that show up in the GUI.

Basically, the only way this can work is if all subprocesses for servers and user terminals are strictly forked children from the original seed shell acquired from LSF/SLURM/whatever job manager you are using. A hacky workaround may be to use something like Paramiko to start a mini-SSH server from the seed shell and then login to this mini server directly from VS Code (assuming there isn't a firewall blocking you, but obviously reverse SSH tunnels can be used to get around that).

@benfei

benfei commented Dec 22, 2019

Another possible resolution to this issue is by enabling a direct connection to the remote server.
That is, the user would:

  1. Launch vscode-server on a remote (possible login-restricted) host.
  2. Enter the remote server address and port in vscode, and connect to it.

That way, no ssh is required at all and it can work on login-restricted hosts.

@ihnorton

A slight variant on this: I would like to be able to get the target address for SSH from a script (think cat'ing a file that is semi-frequently updated with the address of a dynamic resource). Currently I am using a ProxyCommand configured in my SSH config, but that has the disadvantage of requiring a second process.
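For context, a minimal sketch of the ProxyCommand workaround described above (the host alias, helper script path, and port are all hypothetical; `get-node-address` is assumed to print the current hostname of the dynamic resource):

```
# ~/.ssh/config (local machine) -- hypothetical entry
Host dynamic-box
    # The helper script is re-run on every connection, so the address
    # can change between connections; nc opens the raw TCP connection.
    ProxyCommand sh -c 'nc "$(~/bin/get-node-address)" 22'
```

The request in this comment is essentially to fold that lookup into the extension itself, so no second `sh`/`nc` process is needed.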

@brando90

brando90 commented Feb 9, 2020

I want to be able to connect to our institution's cluster using VS Code Remote SSH, with the server running on a compute node instead of the login node. The preferred workflow is to SSH into the login node, use a command to allocate a job and spin up an interactive shell on a compute node, and then run any further tasks from there. VS Code Remote SSH doesn't appear to have a feature that facilitates this workflow. I want to be able to inject the spin-up command immediately after SSH'ing into the cluster, but before the VS Code server is set up on the cluster, and before any other tasks are run.

@wwarriner Is the issue you are referring to the same one as the one on this stack overflow SO question?

It sounds like we are having a similar problem, when I spin an interactive job and try to run my debugger, I can't do it because it goes back to the head node and tries to run things there.

https://stackoverflow.com/questions/60141905/how-to-run-code-in-a-debugging-session-from-vs-code-on-a-remote-using-an-interac

@brando90

brando90 commented Feb 10, 2020

The problem is more serious than I thought. Not only can I not run the debugger in the interactive session, I can't even "Run Without Debugging" without it switching to the Python Debug Console on its own. That means I have to run things manually with python main.py, but then I can't use the variable pane... which is a big loss! (I was already willing to give up breakpoints by using pdb, which I wasn't a big fan of, but OK, fine while things get fixed...)

What I am doing is switching my terminal to conoder_ssh_to_job and then clicking Run Without Debugging (or ^F5 / Ctrl+Fn+F5), and although I made sure the interactive session was selected at the bottom of my integrated terminal, it switches by itself to the Python Debug Console window/pane, which is not connected to the interactive session I requested from my cluster...

@daeh

daeh commented Feb 20, 2020

Am I reading this right that currently the only way to have the language server run on a compute node rather than the head/login node is to modify extension.js? Or is there a different preferred solution? I'm also getting weird port conflicts when I modify extension.js.

(I'm also using Slurm, and the Python language server eating up 300 GB on the head node disrupts the whole department.)

@daeh

daeh commented Mar 19, 2020

I'm curious if this is on the roadmap for the near future. With my university going entirely remote for the foreseeable future, being able to use this extension to work on the cluster would be absolutely amazing.

@brando90

Yes, I also want this feature a lot with universities going remote due to COVID-19

@brando90

Another possible resolution to this issue is by enabling a direct connection to the remote server.
That is, the user would:

  1. Launch vscode-server on a remote (possible login-restricted) host.
  2. Enter the remote server address and port in vscode, and connect to it.

That way, no ssh is required at all and it can work on login-restricted hosts.

how do you do that? Have you tried it?

@roblourens roblourens changed the title Cluster workflow feature to allow shell commands or script to run before remote server setup (wrap install script) Cluster workflow feature to allow shell commands or script to run before remote server setup (e.g. slurm) (wrap install script) Mar 25, 2020
@roblourens
Member

No capacity to address this in the near future but I am interested to hear how the cluster setup works for other users - if anyone is not using slurm/srun as described above please let me know what it would take to make this work for you.

@alfonzso

alfonzso commented Apr 8, 2020

I put this to settings.json:

"terminal.integrated.shellArgs.linux": [
    "-c",
    "export FAF=FEF ; exec $SHELL -l",
  ]


After that, every Linux shell will have the "FAF" env variable (what I wanted); furthermore, thanks to the "exec" command, no new process is created!

I hope this will be useful for someone :D !
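The export-then-exec trick can be checked from any shell; FAF/FEF are the names from the comment above, and `printenv` stands in for the final login shell:

```shell
# export in the shell, then exec so the final command replaces the shell
# process (no extra process) and inherits the exported variable.
bash -c 'export FAF=FEF; exec printenv FAF'
# prints: FEF
```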

@Nosferican

I guess this is related. I would like VS Code clients (e.g. the Julia client) to have an option to start in the Slurm job I am currently in, and not on the login node.

@tzom

tzom commented May 10, 2023

My use case has been resolved with the advent of Remote Tunnels. Here is my process, assuming code CLI is installed and on the path.

  1. Start an sbatch job with code tunnel as the payload.
  2. Get the 8 digit code from the output.
  3. Go to my local machine and enter that code at the supplied URL (https://github.com/login/device)
  4. Open VSCode locally
  5. ctrl+shift+p and type "tunnel" to find the "Remote-Tunnels: Connect to Tunnel..." command.
  6. The tunnel I created shows up in the list, click it.
  7. All set, now everything is on a compute node in a job context!

It is a bit annoying to have different tokens for each node, but not the end of the world. Still, please go upvote #8110 to see if we can get this mitigated!
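For reference, step 1 of the process above could be sketched as a minimal sbatch script (the walltime is illustrative, and it assumes the `code` CLI is on PATH on the compute node; check `code tunnel --help` for the flags your version supports):

```shell
#!/bin/bash
#SBATCH --job-name=code-tunnel
#SBATCH --time=08:00:00

# The job's stdout will contain the 8-digit device code and login URL
# mentioned in step 2.
code tunnel --accept-server-license-terms
```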

Looks like this "just works" now. I was able to connect to a new node on our cluster using a previous tunnel session stored locally.

I don't understand. What happens if the sbatch job finishes (i.e. gets canceled, etc.)? Does the tunnel magically restart the job?
If yes, this would indeed be a solution. If not, then this is nothing other than using ssh directly, since you still need to run sbatch/srun manually. I would like to understand what the difference really is between a tunnel and ssh in the context of Slurm.

@tzom

tzom commented May 10, 2023

Apologies for my delay on this. I've been trying to get this working on an HPC running Slurm, with the server running on a compute-node and not the head-node. I also want a 'one click' way of doing this - i.e. I open vscode, click connect to compute-node, it submits a Slurm job that starts vscode-server on the compute-node, and vscode connects to it.

The HPC I'm working with has a specific Slurm setup:

  • Can ssh to any node (No PAM setup).
  • use_interactive_step Slurm configuration setting is not set, so salloc returns you a shell where you ran it: run it on the head-node and you get a shell there.
  • Compute nodes can't be accessed from the wider network, only a head-node.

I tried many, many ways to do this, and vscode is a bit difficult with how it does things, it seems. My plan was to sign up for the vscode-server preview and use that, but without the tunneling to vscode.dev/browser functionality. I discovered that you can only connect through a tunnel to an already running server (makes sense), so my made-up requirement of a one-click solution fails with tunnels. I then realised that the vscode-server preview without the tunnels is basically the exact same as the vscode-server that gets installed after connecting with SSH anyway ... so I decided to figure out what vscode is doing and try to hijack that process.

I think the standard process of connecting vscode to a remote server goes like this:

  1. vscode checks the local cache to see if it's already connected to this remote before
  2. If so, it runs some commands over ssh to find the server and check if it's running. When vscode starts the server up it writes a log and the pid (process id) in files stored in .vscode-server/. So vscode looks through those files and checks if the process is running. Note that vscode-server doesn't do this, vscode does.
  3. If that server isn't running, vscode sends the command to start it.
  4. Once it has started, vscode connects.

This is all done with ssh commands, I think, without a tty request (meaning srun --pty can't be run in a RemoteCommand here). My idea was to trick vscode into connecting to the existing server after I have started it on a compute-node myself through ssh.

I will post all the commands but they are a bit of a mess, and anyone looking to replicate what I've done would probably have to rewrite a lot of it anyway. My basic process is:

  1. ssh to the target compute-node by proxyjumping through the head-node. This means everything in the Remote Command is executed on the compute-node.
  2. In the Remote Command, use salloc --no-shell to reserve my resources through Slurm.
  3. After that succeeds, in the same Remote Command, use srun to run server_start.sh on the HPC, which starts one instance of the vscode server in the background within that resource allocation (you need to specify --jobid to do this). The command vscode executes to start vscode-server is in the Remote-SSH log when you connect, so I just used that. After the server starts, the script saves the pid of vscode-server in the pid file in the .vscode-server/ folder so vscode can pick it up, see it running already, and not try to start it itself. There should be a pid file there already if vscode-server has been started before.
  4. Then in the Remote Command, detect that vscode-server has started (I currently do this by waiting until a certain string appears in the vscode-server logfile), and after that succeeds, run /bin/bash --login so you and vscode have a terminal to interact with. I actually run && sh -c 'echo $$ > vscode-ssh.pid; exec /bin/bash --login' because then I can catch the pid of the ssh instance that vscode uses for the connection and kill the job/vscode-server when that process dies.
  5. Then the server_start.sh script waits until that ssh process dies, kills the Slurm job, and cleans everything up. This means that when someone disconnects or closes vscode, there are no orphaned Slurm jobs or vscode-servers left running.

Hopefully that makes some sort of sense? Bear in mind I am very bad at bash coding.

Here's the silly Remote Command where 42168j2b4guo2u8t4ou6 is my vscode-server id: salloc --no-shell -n 1 -c 16 -J vscode_${USER} --nodelist compute-node-name -K && srun -n 1 -c 16 --jobid $(echo $(squeue --me --name=vscode_$USER --states=R -h -O JobID)) server_start.sh & until grep -q "Extension host agent listening" $HOME/.vscode-server/.42168j2b4guo2u8t4ou6.log;do sleep 1;done && sh -c 'echo $$ > vscode-ssh.pid; exec /bin/bash --login'

Here's my server_start.sh script:


#!/bin/bash
server_id="42168j2b4guo2u8t4ou6"

rm $HOME/.vscode-server/.$server_id.log &> /dev/null
rm $HOME/.vscode-server/.$server_id.pid &> /dev/null
rm $HOME/vscode-ssh.pid &> /dev/null

echo "removed logs and pid files"

module load anaconda3/2022.05

$HOME/.vscode-server/bin/$server_id/bin/code-server --start-server --host=127.0.0.1 --accept-server-license-terms --enable-remote-auto-shutdown --port=0 --telemetry-level all --without-connection-token &> $HOME/.vscode-server/.$server_id.log &
echo $! > $HOME/.vscode-server/.$server_id.pid 

echo "started server and wrote server pid file"

until grep -q "Extension host agent listening" $HOME/.vscode-server/.$server_id.log
do
     sleep 1
done

echo "server started"

until [ -f $HOME/vscode-ssh.pid ]
do
     sleep 1
done

echo "vscode-ssh.pid file found."

vscode_ssh=$(cat $HOME/vscode-ssh.pid)

echo "waiting for vscode ssh to disconnect so I can shut down the server"
tail --pid=$vscode_ssh -f /dev/null

echo "Removing vscode-ssh.pid file"
rm $HOME/vscode-ssh.pid
echo "Cancelling job"
scancel $(squeue --me --name=vscode_$USER --states=R -h -O JobID)
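The wait-then-cleanup pattern at the end of the script (tail --pid blocks until a process exits) can be demonstrated without Slurm; here `sleep` stands in for the vscode ssh process:

```shell
# Start a stand-in process and record its pid.
sleep 1 &
watched=$!

# GNU tail --pid polls the pid and exits once that process is gone.
tail --pid=$watched -f /dev/null

# At this point the watched process has exited; do the cleanup.
echo "watched pid $watched exited; cleaning up"
```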

It's silly because I think a lot of this could have been avoided if PAM was set up and/or salloc dropped you into your resource instead of back to where you submitted it, but I haven't tested that yet.

This looks great! If it works reliably, this should be in the VS Code Remote extensions...

@wwarriner
Author

@tzom I'll admit this isn't exactly the solution I had in mind when I created this issue 3.5 years ago, but it certainly lowers the barrier to entry and greatly simplifies the workflow I use. Yes, a job needs to be created manually. On our system we can create sbatch jobs that run for 12 hours, which is plenty for a working day. I don't run long-running batch tasks in the same jobs I do development work in, so I don't need the job for longer than that.

@roblourens I do have one request that I see as a natural extension of Tunnels. I see that Tunnels send data through an Azure service. For our institution, this would be challenging to get approved for working with PHI/HIPAA data, and even what we classify as sensitive data, due to the "unknown" nature of that Azure intermediary. I am aware that the code is open source, but we can't see what is actually processing our requests; it is all taken on trust. To work with PHI, that's not enough.

Is there a way we can have the simplicity of remote tunnels, but entirely contained within a service we control?

@roblourens
Member

Not currently, the only solution for that is SSH

@bamurtaugh
Member

Thanks for the additional info @wwarriner!

Is there a way we can have the simplicity of remote tunnels, but entirely contained within a service we control?

This sounds like it would be tracked by this feature request: microsoft/vscode#168492.

And if you'd like further info on how tunnels are secured, we have a section in our docs: https://code.visualstudio.com/docs/remote/tunnels#_how-are-tunnels-secured.

@wwarriner
Author

wwarriner commented May 24, 2023

@bamurtaugh Thank you for the link, I also found #7527 via the issue you linked, which appears to be requesting a feature like serve-local but (maybe?) closely matches what I'm after.

I'll keep an eye out for both!

I appreciate the link for tunnel security. Personally it sounds reasonable, though I am not an expert.

Perhaps the following will sound familiar. At our institution, any application touching PHI/HIPAA data must be approved. Part of the approval is a security review in which the entire proposed network sequence is inspected. With a tunnel, routing information is no longer open to inspection by our firewall (which would raise eyebrows for PHI/HIPAA data), and traffic is routed through a third-party domain.

The simple solution is to not use a tunnel for PHI/HIPAA data. But I can see a future where development work could be adjacent enough to PHI/HIPAA data where this might come under scrutiny. Having the option to start our own SSH Server gives us a more palatable routing configuration.

I would also hazard a guess that the folks at https://github.com/OSC/Open-OnDemand would be pleased at the idea of making the proposed VSCode SSH Server an interactive app on their service. I know that re-hosting might violate the current ToS, but a carve-out is something to think about for the future in terms of VSCode use within Academic Research Computing development.

@GeorgeBGM

I tried the suggestions people shared above inside VS Code, but I got the following error message. How should I solve this problem?

The terminal process failed to launch: A native exception occurred during launch (forkpty(3) failed.).

@bamurtaugh
Member

@George-du can you please share some more info about which suggestion / specific steps you tried?


@xangma

xangma commented Sep 27, 2023

I've done it! I'm back with another ridiculous solution, but this one is hopefully better than my last one ... it's not finished yet but it at least allows commands to be run before the remote server is set up 🎉🥳

Given that you can define the path to your ssh binary in vscode's remote ssh settings, I decided to wrap it in a bash script. The script:

  • Pretends to be ssh and intercepts the ssh commands sent from vscode,
  • Uses salloc to reserve resources on the cluster (currently set in the RemoteCommand in the ssh_config),
  • Figures out where those resources are,
  • Proxyjumps through the login node and runs bash within the Slurm allocation using srun,
  • Allows vscode to continue to send its commands to the bash shell on the compute node to run the remote server.

The code is in a repo here and is still very much a prototype. It may leave jobs running, etc.; I just wanted to share that I got the basics working.

I understand this was a solved problem for many (running vscode connected to a compute node), but for particular Slurm configurations (with no pam_slurm_adopt or use_interactive_step setting), it means that when someone SSHs into a compute node it doesn't attach them to their job, and cgroups don't apply (I think that's the case ...). This behaviour makes it difficult to get vscode server running in the right place with the right cgroup restrictions.

I'm planning on wrapping this up into an extension where you could interactively build the ssh_config host entry (including RemoteCommand) in a panel on the left-hand side of vscode, click connect, and it would do the steps I described above. I have no idea if this is possible, though.

A note to Microsoft: it would have really helped to have extension.js and localServer.js open sourced/developed on this GitHub repo. I instead had to find the files myself and reformat the JavaScript so I could study the connection process. I was saddened to see it come out at >50,000 lines in one file. I'm not sure if single extension files are a requirement of vscode extensions, but it still made me sad 😣😢

@KangByungwoo

@wwarriner I followed your steps and it works really nicely! But one issue is that every time I restart the compute node and the remote VS Code session, it asks me to authenticate my GitHub or Microsoft account, as shown in the screenshot below. Is there a way to avoid this?
image

@eugeneteoh

The best solution we've found so far:

  1. Start a job on the cluster that just runs sshd:
    #!/bin/bash
    
    #SBATCH --job-name="tunnel"
    #SBATCH --time=8:00:00     # walltime
    
    /usr/sbin/sshd -D -p 2222 -f /dev/null -h ${HOME}/.ssh/id_ecdsa # uses the user key as the host key
    
  2. Connect to that via ProxyCommand, e.g. by adding something like this to your .ssh/config:
    Host hpc-job
        ProxyCommand ssh hpc "nc \$(squeue --me --name=tunnel --states=R -h -O NodeList) 2222"
        StrictHostKeyChecking no
    

Thanks for this @simonbyrne! This is still the best solution for me so far. The only problem is that the SLURM_* environment variables are not passed to the proxied ssh session. Has anyone solved this?

@grvlbit

grvlbit commented Feb 12, 2024

@eugeneteoh We have implemented a hackish solution to get the SLURM_ variables into the VS Code session by storing them in a file that is sourced during login. To do this properly, we have added a trap that cleans up the environment variable file on job termination.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name="code-tunnel"
#SBATCH --signal=B:TERM@60 # tells the controller
                           # to send SIGTERM to the job 60 secs
                           # before its time ends to give it a
                           # chance for better cleanup.

cleanup() {
    echo "Caught signal - removing SLURM env file"
    rm -f ~/.code-tunnel-env.bash
}

# Trap the timeout signal (SIGTERM) and call the cleanup function
trap 'cleanup' SIGTERM

# store SLURM variables to file
env | awk -F= '$1~/^SLURM_/{print "export "$0}' > ~/.code-tunnel-env.bash

/usr/sbin/sshd -D -p 2222 -f /dev/null -h ${HOME}/.ssh/id_ecdsa &
wait

Then have users add the following line to their .bashrc file:

# source slurm environment if we're connecting through code-tunnel
[ -f ~/.code-tunnel-env.bash ] && source ~/.code-tunnel-env.bash
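The awk filter used in the job script above can be sanity-checked with a stand-in environment (the variable values are made up):

```shell
# Keep only SLURM_* entries and prefix each with "export", as in the job script.
printf 'SLURM_JOB_ID=123\nPATH=/usr/bin\nSLURM_NNODES=2\n' \
  | awk -F= '$1~/^SLURM_/{print "export "$0}'
# prints:
# export SLURM_JOB_ID=123
# export SLURM_NNODES=2
```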

@williamberrios

^ I just tested it and it works. Thanks :)

@xangma

xangma commented Mar 27, 2024

Hi all, feel free to ask me to be quiet about this in this thread, but I've managed to get my vscode slurm ssh wrapper script working really well now!

It is a wrapper around ssh that you point vscode to instead of your ssh binary. It:

  1. Detects when you connect vscode to your remote slurm cluster,
  2. Checks for an already running vscode job using the jobname,
  3. Connects to existing job if it exists, creates a new job if not,
  4. Proxyjumps to the compute node that your job was assigned,
  5. Starts a little "watcher" process that watches for when you disconnect vscode/close the window on your local machine,
  6. Connects!
  7. Then when you close your local vscode window, if you don't reconnect within a timeout your vscode slurm job is cancelled for you! 🥳 🎉

Finally, this script is where I want it to be! I even battled through PowerShell to make a Windows version. Next I just need to make it into an extension that probes your cluster's Slurm config and generates the appropriate cluster-specific ssh configs for different Slurm resource combos on-the-fly in a GUI :-D

@NoCreativeIdeaForGoodUserName

Hi all, feel free to ask me to be quiet about this in this thread, but I've managed to get my vscode slurm ssh wrapper script working really well now!

It is a wrapper around ssh that you point vscode to instead of your ssh binary. It:

  1. Detects when you connect vscode to your remote slurm cluster,
  2. Checks for an already running vscode job using the jobname,
  3. Connects to existing job if it exists, creates a new job if not,
  4. Proxyjumps to the compute node that your job was assigned,
  5. Starts a little "watcher" process that watches for when you disconnect vscode/close the window on your local machine,
  6. Connects!
  7. Then when you close your local vscode window, if you don't reconnect within a timeout your vscode slurm job is cancelled for you! 🥳 🎉

Finally this script is where I want it to be! I even battled through the Powershell to make a Windows version. Next I just need to make it into an extension that probes your cluster's slurm config and generates the appropriate cluster-specific ssh configs for different slurm resource combos on-the-fly in a GUI :-D

I tried to run this script on a Windows 10 laptop that connects via WSL2 to a Linux server, and I just get "Could not establish connection to "server_name": spawn UNKNOWN". The installation was as described in the readme, with the appropriate path changes in

Remote-SSH: Settings -> Remote.SSH: Path

and the changes to the ssh config. Using the .sh and the .ps1 scripts does not work; the error remains the same. I have not really worked with a cluster before, so I do not really know where to even start resolving these issues.

@martenreeh

@eugeneteoh We have implemented a hackish solution to get the SLURM_ variables to the VS code session by storing them into a file that is sourced during login. To do this properly we have added a trap that will clean up the environment variables file on job termination.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name="code-tunnel"
#SBATCH --signal=B:TERM@60 # tells the controller
                           # to send SIGTERM to the job 60 secs
                           # before its time ends to give it a
                           # chance for better cleanup.

cleanup() {
    echo "Caught signal - removing SLURM env file"
    rm -f ~/.code-tunnel-env.bash
}

# Trap the timeout signal (SIGTERM) and call the cleanup function
trap 'cleanup' SIGTERM

# store SLURM variables to file
env | awk -F= '$1~/^SLURM_/{print "export "$0}' > ~/.code-tunnel-env.bash

/usr/sbin/sshd -D -p 2222 -f /dev/null -h ${HOME}/.ssh/id_ecdsa &
wait

Then have users add the following line to their .bashrc file:

# source slurm environment if we're connecting through code-tunnel
[ -f ~/.code-tunnel-env.bash ] && source ~/.code-tunnel-env.bash

Would you mind expanding on the full procedure? I'm not sure I understand.

So the SBATCH part reserves resources for the vs-code server to run on, and saves some string that helps identify the node we got, which is where we later want the vs-code server to start running.

What do we do with this part?

Host hpc-job
    ProxyCommand ssh hpc "nc \$(squeue --me --name=tunnel --states=R -h -O NodeList) 2222"
    StrictHostKeyChecking no

What does something like that mean? Is it exactly this, or should the real host name be in there somewhere? Is ProxyCommand literally ProxyCommand, or is it something set elsewhere? And does this go in a ~/.ssh/config file on the remote cluster, or locally (on the device from which VS Code is originally run, before any SSH-ing)?

Is a change to VS Code required (changes are mentioned in the thread, but not specifically in this solution)?
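To answer the config question as I understand it: ProxyCommand is a standard OpenSSH keyword, and this block belongs in the local ~/.ssh/config on the machine running VS Code, not on the cluster. A fuller sketch (the hpc alias, the host name, the job name tunnel, and port 2222 are illustrative assumptions matching the quoted snippets):

```
# Local ~/.ssh/config (on the laptop/desktop running VS Code)
Host hpc
    HostName login.cluster.example.org   # hypothetical login node
    User myuser                          # hypothetical user name

Host hpc-job
    # Ask the login node to open a raw TCP connection (nc) to port 2222
    # on whichever node the running "tunnel" job was allocated.
    ProxyCommand ssh hpc "nc \$(squeue --me --name=tunnel --states=R -h -O NodeList) 2222"
    StrictHostKeyChecking no
```

VS Code then connects to hpc-job instead of hpc, so nothing inside VS Code needs to change beyond selecting that host.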

@xangma

xangma commented May 2, 2024

Hi all, feel free to ask me to be quiet about this in this thread, but I've managed to get my vscode slurm ssh wrapper script working really well now!
It is a wrapper around ssh that you point vscode to instead of your ssh binary. It:

  1. Detects when you connect vscode to your remote slurm cluster,
  2. Checks for an already running vscode job using the jobname,
  3. Connects to existing job if it exists, creates a new job if not,
  4. Proxyjumps to the compute node that your job was assigned,
  5. Starts a little "watcher" process that watches for when you disconnect vscode/close the window on your local machine,
  6. Connects!
  7. Then when you close your local vscode window, if you don't reconnect within a timeout your vscode slurm job is cancelled for you! 🥳 🎉

Finally this script is where I want it to be! I even battled through the Powershell to make a Windows version. Next I just need to make it into an extension that probes your cluster's slurm config and generates the appropriate cluster-specific ssh configs for different slurm resource combos on-the-fly in a GUI :-D
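The flow in steps 1-7 can be roughly sketched in shell. The job name, the ProxyJump target, and the first_node helper below are my own illustrative assumptions, not the actual wrapper script; the helper is pure text processing so it can be checked without a cluster.

```shell
#!/usr/bin/env bash
# Rough sketch of the wrapper's core flow; not the real script.
JOB_NAME="vscode-tunnel"   # assumed job name used to find an existing job

# Reduce a squeue NodeList field to a single node name, e.g.
# "gpu[07-12]" -> "gpu07", "node03" -> "node03".
first_node() {
    echo "$1" | sed -E 's/\[([0-9]+)[^]]*\]/\1/'
}

# On a real cluster, something like:
#   node=$(squeue --me --name="$JOB_NAME" --states=R -h -O NodeList)
#   [ -z "$node" ] && sbatch --job-name="$JOB_NAME" tunnel.sbatch  # then wait for it to start
#   exec ssh -o ProxyJump=hpc "$(first_node "$node")" "$@"
first_node "gpu[07-12]"   # prints: gpu07
```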

I tried to run this script on a Windows 10 laptop that connects via WSL2 to a Linux server, and I just get "Could not establish connection to 'server_name': spawn UNKNOWN". The installation was as described in the readme, with the appropriate path changes in

Remote-SSH: Settings -> Remote.SSH: Path

and the changes to the ssh config. Using the .sh and the .ps1 scripts does not work; the error remains the same. I have not really worked with a cluster before, so I do not know where to even start resolving these issues.

Hi, sorry I haven't replied to this already, and feel free to open an issue in my repo to not clog up the thread. I did the Windows version as a little extra/bonus (as not many people need it where I work), so it isn't as mature/robust as the Linux solution. I have however pushed some fixes plus some guidance for Windows. It should be:

  1. Edit ssh.bat to point to the location of ssh_wrapper.ps1,
  2. Change vscode to point to ssh.bat as the ssh binary path,
  3. Change/check vscode settings (local server true, remote command true, etc.),
  4. Define slurm job resources in the remote command of a Host in your ssh_config (like in the readme),
  5. Cross fingers and connect.

Feel free to share ssh connection logs in an issue in my repo; just click "details" when connecting :-)

@xangma

xangma commented Jul 29, 2024

@eugeneteoh We have implemented a hackish solution to get the SLURM_ variables to the VS code session by storing them into a file that is sourced during login. To do this properly we have added a trap that will clean up the environment variables file on job termination.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name="code-tunnel"
#SBATCH --signal=B:TERM@60 # tells the controller
                           # to send SIGTERM to the job 60 secs
                           # before its time ends to give it a
                           # chance for better cleanup.

cleanup() {
    echo "Caught signal - removing SLURM env file"
    rm -f ~/.code-tunnel-env.bash
}

# Trap the timeout signal (SIGTERM) and call the cleanup function
trap 'cleanup' SIGTERM

# store SLURM variables to file
env | awk -F= '$1~/^SLURM_/{print "export "$0}' > ~/.code-tunnel-env.bash

/usr/sbin/sshd -D -p 2222 -f /dev/null -h ${HOME}/.ssh/id_ecdsa &
wait

Then have users add the following line to their .bashrc file:

# source slurm environment if we're connecting through code-tunnel
[ -f ~/.code-tunnel-env.bash ] && source ~/.code-tunnel-env.bash

This is great. Made me wonder whether this is possible and it seemingly is? I haven't tested it yet.

@mredenti

mredenti commented Aug 2, 2024

Hi all, feel free to ask me to be quiet about this in this thread, but I've managed to get my vscode slurm ssh wrapper script working really well now!

It is a wrapper around ssh that you point vscode to instead of your ssh binary. It:

  1. Detects when you connect vscode to your remote slurm cluster,
  2. Checks for an already running vscode job using the jobname,
  3. Connects to existing job if it exists, creates a new job if not,
  4. Proxyjumps to the compute node that your job was assigned,
  5. Starts a little "watcher" process that watches for when you disconnect vscode/close the window on your local machine,
  6. Connects!
  7. Then when you close your local vscode window, if you don't reconnect within a timeout your vscode slurm job is cancelled for you! 🥳 🎉

Finally this script is where I want it to be! I even battled through the Powershell to make a Windows version. Next I just need to make it into an extension that probes your cluster's slurm config and generates the appropriate cluster-specific ssh configs for different slurm resource combos on-the-fly in a GUI :-D

On some HPC clusters, a password is needed to be able to proxyjump to the compute nodes. Is there not a way around this? Like somehow running the vscode server upon allocation?

@xangma

xangma commented Aug 30, 2024

On some HPC clusters, a password is needed to be able to proxyjump to the compute nodes. Is there not a way around this? Like somehow running the vscode server upon allocation?

I'd hope it pops up a little window or you can type the password into the terminal? Without agent forwarding I have to type my password a bunch of times to get it to connect to the compute node.

@PeterKADam

The admins of the HPC I use have recently disabled SSH access from the frontend to allocated Slurm nodes, and now require users to connect to running jobs with srun --jobid $job_id --overlap --pty bash, which breaks the otherwise well-working ProxyJump approach.

I have made a script that gets a job id and connects to it, and it works through a normal CLI (ssh -t). Is there still no decent way to connect to a node through a frontend with some intermediary command?
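That workflow (find the job id, then attach with srun) can be sketched as follows. The job name, the frontend alias, and the attach_cmd helper are illustrative assumptions; the helper is kept as pure string construction so the logic can be checked without a cluster.

```shell
#!/usr/bin/env bash
# Build the srun attach command for a given Slurm job id.
attach_cmd() {
    echo "srun --jobid $1 --overlap --pty bash"
}

# On a real cluster, one would resolve the id and attach through the frontend:
#   job_id=$(squeue --me --name=myjob --states=R -h -O JobID | tr -d ' ')
#   ssh -t frontend "$(attach_cmd "$job_id")"
attach_cmd 12345   # prints: srun --jobid 12345 --overlap --pty bash
```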

@davide-q

Is there still no decent way to connect to a node through a frontend with some intermediary command?

Sadly not. The way VS Code communicates with the compute nodes strictly requires full SSH capabilities; a tty proxy such as srun --pty cannot work. I banged my head against it over and over.

You may ask your admins to enable pam_slurm_adopt.so so that you can still SSH into nodes, but only nodes on which you have a job running. This is accepted at many HPC centers nowadays, provided they run Slurm, which appears to be the case given that you mention srun.
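For reference, pam_slurm_adopt is enabled admin-side with a one-line addition to the compute nodes' SSH PAM stack (the exact file and stack position vary by distribution; this is a sketch):

```
# /etc/pam.d/sshd on the compute nodes, near the end of the account stack
account    required    pam_slurm_adopt.so
```

With this in place, sshd rejects logins from users with no running job on that node, and adopts permitted sessions into the job's cgroup so they cannot exceed the job's allocation.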

@PeterKADam

You may ask your admins to enable pam_slurm_adopt.so so that you can still SSH into nodes, but only nodes on which you have a job running. This is accepted at many HPC centers nowadays, provided they run Slurm, which appears to be the case given that you mention srun.

I was given the reasoning that SSH-ing into the nodes did not respect resource allocations from Slurm, allowing you to steal resources from other users if you were sharing the node. So I guess I'm forced to use Remote Tunnels instead.

@davide-q

I was given the reasoning that SSH-ing into the nodes did not respect resource allocations from Slurm, allowing you to steal resources from other users if you were sharing the node.

AFAIK, that's incorrect: https://slurm.schedmd.com/pam_slurm_adopt.html
