# Not working? Are the cuda drivers / etc still working? #3
I haven't actually done a fresh install in a while; I'll check it out today. As for the AMI, I tend to be skeptical of things I don't understand, and what goes into their AMI fits into that category! Absolutely no other reason beyond that.

---
Hmmm, this is strange. I just launched an instance, and even before it started installing A1111 or InvokeAI (note that I generally use just InvokeAI), as soon as the CUDA installation finished, it seemed happy:

I did have to raise the minimum spot price (I'm running in us-east-2, and it's gotten expensive there!), but other than that, I made no changes to the repo.

---
Hmm, thanks for verifying. I'll try again tomorrow. I'm familiar with Linux/shell/AWS CLI, but far from an expert. Do you have any tips on how to debug? E.g., I see install steps happening in `setup.sh` — how do I see logs of what happened? Also, is there a way to watch the output live as it's running? I.e., can I SSH in during setup and "attach" to the running script? That would also tell me when it's finished (right now I've just been setting a timer for 10 minutes, per the README). Thanks again!

---
You can normally SSH in after a few minutes. The commands in `setup.sh` are run as part of the instance's "user data". The output is stored in `/var/log/cloud-init-output.log`; I often just run a

You can also do something like

---
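The log-viewing commands elided above would presumably look something like this — a minimal sketch, assuming the standard cloud-init log location (run it on the instance itself):

```shell
#!/bin/sh
# Sketch: inspect the setup.sh / user-data output on the instance.
# The path below is the standard cloud-init location; adjust if yours differs.
LOG=/var/log/cloud-init-output.log

if [ -r "$LOG" ]; then
    tail -n 50 "$LOG"    # last 50 lines; swap in `tail -f` to watch live
else
    echo "no readable log at $LOG (are you on the instance?)"
fi
```

With `tail -f` you effectively "attach" to the running setup: the stream goes quiet (and cloud-init typically logs a "finished" line) when the script is done.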
Ah perfect -

I.e., after stopping/starting the instance, when I run:

I see:

Any idea why this would happen? (Note I also tried

---
Thanks for the explanation of how duckdns helps - that makes sense. I'm not sure exactly how ephemeral storage works, but it looks like the CUDA drivers are installed there? Does it get wiped if I stop the instance and start again later? And would that explain the problem, i.e. is this installing CUDA to the ephemeral storage? (Lines 57 to 61 in 2dae649)

I don't see any obvious CUDA files there...

---
The ephemeral storage is local to the physical VM, rather than being a network volume. It's pretty fast, but it does not persist between shutdowns/startups; it should only be used for temporary storage. The drivers aren't actually installed there; it's just that the package is downloaded there (and then deleted). The installation goes to /usr/local/bin or something (I didn't actually look, but that would be my assumption; it's somewhere on the EBS volume). I don't think this explains anything related to your issues, to be honest :-)

---
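One way to check this for yourself — a sketch using only `df`; the paths here are guesses at where things might land, not taken from the repo:

```shell
#!/bin/sh
# Sketch: show which filesystem backs each path. Paths on the ephemeral
# (instance-store) mount show a different device than the root EBS volume.
for p in / /tmp /usr/local; do
    if [ -e "$p" ]; then
        df -h "$p" | tail -n 1    # last line = device, size, and mount point
    fi
done
```

If a path's device matches the root (`/`) entry, it lives on the EBS volume and survives stop/start.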
Ok, so I just made two changes related to A1111; I now install

I was not able to reproduce your start/stop bug. I stopped my instance, started it, waited about a minute or so for the DNS to update, SSH'ed in, and opened a tunnel. A1111 worked, and at full speed. Given that everything is automated, perhaps try deleting everything and recreating from the current version, and see if the problem still exists?

---
Ah nice. I've terminated the instance and started a new one. I haven't been able to reproduce the issue yet (I tried restarting, and `systemctl` stopping/starting `sdwebgui` and `invokeai` alternately). Note that when it was broken yesterday I had noticed from

---
By the way, I got direct browsing working without a tunnel.

I think that's secure enough?

---
Ok, now we're getting somewhere! Running out of disk space is Bad™. I'm willing to blame everything on that. The default filesystem is probably too small to install both A1111 and InvokeAI and have them be usable. Give it another 10GB or so and recreate. It's hard for me to pick a good default, because every extra GB costs money, albeit very little ($0.08/mo per GB).

I wouldn't use ephemeral storage for anything like that. Honestly, don't use it at all until you know why you want to use it. If you're not sure, don't :-)

The instructions you've seen are correct. Put your A1111 models in `/home/admin/stable-diffusion-webui/models/`.
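If you'd rather not recreate, the root volume can also be grown in place. A hedged sketch — the volume ID, device, and partition below are placeholders (not from this thread), and the block is guarded so nothing runs until you opt in:

```shell
#!/bin/sh
# Sketch: grow the root EBS volume, then the partition, then the filesystem.
# VOL/DEV/PART are made-up placeholders; substitute your instance's values.
VOL=vol-0123456789abcdef0
DEV=/dev/nvme0n1
PART=1

if [ -n "${APPLY:-}" ]; then
    aws ec2 modify-volume --volume-id "$VOL" --size 30   # grow volume to 30 GB
    sudo growpart "$DEV" "$PART"                         # grow the partition
    sudo resize2fs "${DEV}p${PART}"                      # grow the ext4 filesystem
else
    echo "dry run: set APPLY=1 to resize $VOL"
fi
```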
It's actually not secure in the slightest, except that it's security by obscurity, aka "what are the odds that someone will find your server and port?". It's probably not terrible, and what's the worst that can happen? They see your images? They generate some for free? I can't endorse it, but if you're ok with the risks, your implementation looks correct to me.

I'm glad to hear everything is working. If you have no other questions, can you close this issue? (If you do, please ask! This is not me trying to get rid of you, just me making sure I've answered everything you've asked.)

---
Ah great, okay, thanks so much for explaining this too - this was really helpful! And just one last bit before closing (sorry, the issue is solved; this is just me rambling):

I'm only allowing access from my home IP address though. I.e., I think what you're describing would be possible with

---
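A home-IP-only setup like this can be expressed as a single security-group rule. A sketch with the AWS CLI — the group ID and port are placeholders (7860 is only a guess at the A1111 default), and it's guarded so nothing runs until you opt in:

```shell
#!/bin/sh
# Sketch: open one port only to the caller's current public IP, as a /32.
# SG and PORT are placeholders; checkip.amazonaws.com echoes your public IP.
SG=sg-0123456789abcdef0
PORT=7860

if [ -n "${APPLY:-}" ]; then
    MYIP=$(curl -s https://checkip.amazonaws.com)
    aws ec2 authorize-security-group-ingress \
        --group-id "$SG" --protocol tcp --port "$PORT" --cidr "${MYIP}/32"
else
    echo "dry run: set APPLY=1 to add the /32 rule to $SG"
fi
```

If your home IP changes (most residential connections rotate eventually), revoke the old rule and re-run this.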
Oh, sorry, I totally missed that you only opened it to your IP! I probably should have read what you wrote, and not what I thought I saw ;-). As you say, there's basically zero risk in that case[1].

Also, please feel free to keep asking or sharing! I wrote this repo for myself, but it's always great to hear that it helps others, and I'm (almost) always glad to try to help someone get up and running.

[1] If your ISP uses "CGN", aka "NAT444", then you might be opening it to a bunch of people. Most don't, but some do.

---
Ok perfect, thanks again! PS - I wrote up a quick Python helper to find the cheapest region. We can knock the price down from $0.21 to $0.07 by moving to a cheaper region.

It's just a bit annoying having to open a new support request for each new availability zone to increase the allowed spot CPUs. It seems like that can take a couple of days.

---
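The core of such a helper might look like the sketch below. The prices are made-up illustrative numbers, not real quotes; in practice you would fill the dict per region from `aws ec2 describe-spot-price-history` (or boto3's equivalent):

```python
# Sketch: pick the cheapest region from a table of hourly spot prices.
# The sample prices are invented for illustration only.

def cheapest_region(prices):
    """Return the (region, price) pair with the lowest hourly price."""
    region = min(prices, key=prices.get)
    return region, prices[region]

if __name__ == "__main__":
    # In practice, populate this from `describe-spot-price-history` per region.
    sample = {
        "us-east-1": 0.21,
        "us-east-2": 0.18,
        "us-west-2": 0.07,
        "eu-west-1": 0.12,
    }
    region, price = cheapest_region(sample)
    print(f"cheapest: {region} at ${price:.2f}/hr")
```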
Hi, thanks for documenting this setup. You've made it nice and easy to follow along!
**First**

My first question is just - is there a reason to manually set up the Linux / CUDA drivers, rather than using the AWS Deep Learning AMI? I wonder if there will be issues over time as some of the packages get upgraded while some things are hardcoded.

**Second**
I was able to set up the instance successfully. I've also confirmed that both services are available. However, stable diffusion doesn't seem to be working: I haven't gotten automatic1111 to work at all, and invokeai seems very slow (I'm guessing it's CPU-only rather than using the GPU).

When using automatic1111, I tried running `journalctl -u sdwebgui.service` and get 2 errors:

I was able to fix the tcmalloc error with (reference)

But I wasn't able to fix the GPU issue. I loaded python in the venv and don't see CUDA available.

I also don't see CUDA available when running `nvidia-smi`. Result:
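For anyone landing here later, the in-venv check described above can be sketched like this (it assumes the venv uses PyTorch, which both UIs do; the function degrades gracefully when torch is absent):

```python
# Sketch: report whether PyTorch inside the venv can see a CUDA GPU.
def cuda_status():
    try:
        import torch  # only present inside the A1111 / InvokeAI venv
    except ImportError:
        return "torch not installed in this environment"
    if torch.cuda.is_available():
        return f"cuda ok: {torch.cuda.get_device_name(0)}"
    return "torch installed, but CUDA is NOT available (running CPU-only)"

if __name__ == "__main__":
    print(cuda_status())
```

If this reports CPU-only while `nvidia-smi` also fails, the driver (not PyTorch) is the thing to fix first.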