Not working? Are the cuda drivers / etc still working? #3

Closed
maurera opened this issue Jan 3, 2024 · 15 comments

maurera commented Jan 3, 2024

Hi, thanks for documenting this setup. You've made it nice and easy to follow along!

First

My first question is simply: is there a reason to set up the Linux / CUDA drivers manually, rather than using the AWS Deep Learning AMI? I wonder whether issues will creep in over time as some packages get upgraded while other things are hardcoded.

Second

I was able to set up the instance successfully, and I've confirmed that both services are available. However, Stable Diffusion doesn't seem to be working.

I haven't gotten AUTOMATIC1111 to work, and InvokeAI seems very slow (I'm guessing it's running CPU-only rather than using the GPU).

When using AUTOMATIC1111, I ran journalctl -u sdwebgui.service and got two errors:

Cannot locate TCMalloc (improves CPU memory usage)
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Traceback (most recent call last):
  File "/home/admin/stable-diffusion-webui/launch.py", line 48, in <module>
    main()
  File "/home/admin/stable-diffusion-webui/launch.py", line 39, in main
    prepare_environment()
  File "/home/admin/stable-diffusion-webui/modules/launch_utils.py", line 384, in prepare_environment
    raise RuntimeError(
RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

I was able to fix the TCMalloc error with (reference):

sudo apt-get install libgoogle-perftools4 libtcmalloc-minimal4 -y

But I wasn't able to fix the GPU issue. I loaded Python in the venv, and CUDA isn't available:

>>> import torch
>>> print(torch.__version__)
2.0.1+cu118
>>> print(torch.cuda.is_available())
False

I also don't see the GPU when running nvidia-smi. Result:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
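(A few generic NVIDIA driver checks for anyone hitting the same wall; these are standard diagnostics, not commands from the thread.)

lsmod | grep nvidia                 # is the nvidia kernel module loaded at all?
cat /proc/driver/nvidia/version     # driver version, if the module is up
sudo dmesg | grep -i nvidia         # kernel messages from the driver load (or its failure)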

mikeage (Owner) commented Jan 3, 2024

I haven't actually done a fresh install in a while; I'll check it out today.

As far as the AMI goes, I tend to be skeptical of things I don't understand, and what goes into their AMI fits into that category! Absolutely no other reason beyond that.

mikeage (Owner) commented Jan 3, 2024

Hmmm, this is strange. I just launched an instance, and before it even started installing A1111 or InvokeAI (note that I generally use just InvokeAI), as soon as the CUDA installation finished, it seemed happy:

admin@ip-172-31-14-255:~$ nvidia-smi
Wed Jan  3 04:23:15 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0              25W /  70W |      2MiB / 15360MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I did have to raise the maximum spot price (the MaxPrice in the spot options; I'm running in us-east-2, and it's gotten expensive there!), but other than that, I made no changes to the repo.

  0 mikemi@MCMIKEMI-32Y5M ~/src/stable-diffusion-aws {main *|u=}$ git rev-parse HEAD
2dae6496fdb5b48723032768c11b9100f80d695f
  0 mikemi@MCMIKEMI-32Y5M ~/src/stable-diffusion-aws {main *|u=}$ git diff
diff --git i/README.md w/README.md
index 030f1e7..f60c1bd 100644
--- i/README.md
+++ w/README.md
@@ -35,7 +35,7 @@ aws ec2 run-instances \
     --user-data file://setup.sh \
     --metadata-options "InstanceMetadataTags=enabled" \
     --tag-specifications "ResourceType=spot-instances-request,Tags=[{Key=creator,Value=stable-diffusion-aws}]" "ResourceType=instance,Tags=[{Key=INSTALL_AUTOMATIC1111,Value=$INSTALL_AUTOMATIC1111},{Key=INSTALL_INVOKEAI,Value=$INSTALL_INVOKEAI},{Key=GUI_TO_START,Value=$GUI_TO_START}]" \
-    --instance-market-options 'MarketType=spot,SpotOptions={MaxPrice=0.20,SpotInstanceType=persistent,InstanceInterruptionBehavior=stop}'
+    --instance-market-options 'MarketType=spot,SpotOptions={MaxPrice=0.40,SpotInstanceType=persistent,InstanceInterruptionBehavior=stop}'

 ```

maurera (Author) commented Jan 3, 2024

Hmm, thanks for verifying. I’ll try again tomorrow.

I'm familiar with Linux / shell / the AWS CLI, but I'm far from an expert. Do you have any tips on how to debug?

E.g., I see install steps happening in setup.sh. How do I see the logs of what happened? Also, is there a way to watch the output live as it runs, i.e., some way to SSH in during setup and "attach" to the running script? That would also give me a clue as to when it's finished (right now I've just been setting a timer for 10 minutes, per the README).

Thanks again!

mikeage (Owner) commented Jan 3, 2024

You can normally SSH in after a few minutes.

The commands in setup.sh are run as part of the "user data". The output is stored in /var/log/cloud-init-output.log; I often just run a tail -F on this to see when it finishes.

You can also do something like ps fauwx, which will make it pretty obvious what's running at any given moment.
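Putting those two suggestions together, a minimal watch-the-setup recipe (both commands as given above, paths from the thread):

tail -F /var/log/cloud-init-output.log   # live output of the user-data (setup.sh) run; quiets down when setup finishes
ps fauwx                                 # process tree; shows which setup step is currently executing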

maurera (Author) commented Jan 4, 2024

Ah, perfect - tail -F is great both for /var/log/cloud-init-output.log during initial setup and for /var/log/sdwebui.log once it's running.

  1. Good news - I was initially able to get it working. I think I wasn't being patient enough before trying to stop/start the instance (I think it took ~18 minutes for initial setup)
  2. Bad news - It stopped working after I stopped / started the instance

I.e., after stopping/starting the instance, when I run:

tail -F /var/log/sdwebui.log

I see:

################################################################
Launching launch.py...
################################################################
Cannot locate TCMalloc (improves CPU memory usage)
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0]
Version: v1.7.0
Commit hash: cf2772fab0af5573da775e7437e6acdca424f26e
Traceback (most recent call last):
  File "/home/admin/stable-diffusion-webui/launch.py", line 48, in <module>
    main()
  File "/home/admin/stable-diffusion-webui/launch.py", line 39, in main
    prepare_environment()
  File "/home/admin/stable-diffusion-webui/modules/launch_utils.py", line 384, in prepare_environment
    raise RuntimeError(
RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check

Any idea why this would happen? (Note I also tried systemctl disable sdwebgui and systemctl stop sdwebgui in order to switch between invokeai and sdwebgui. I'm not sure if that's the correct workflow...)

  3. Is there any chance you could add some info on how to get this working with duckdns? I added the SSH info and I can see that the instance registers successfully on duckdns using my token (i.e., I see the IP address on duckdns update to the EC2 instance's address). Am I then just supposed to navigate to http://MYSUBDOMAIN.duckdns.org:7860/ ? Does the tunnel need to be open? I tried this and it wouldn't connect. I also tried opening port 7860 in the security group to just my home IP address. That didn't work either (from reading around, maybe I need to add --listen to the sdwebgui startup command?)

mikeage (Owner) commented Jan 4, 2024

  1. Good!
  2. I stopped using A1111 a while back, and now use Invoke-AI exclusively. I'll check into this.
  3. The AWS Security Group does not allow inbound connections on port 7860, and even if it did, I think (but this should be confirmed) that it only listens on localhost:7860, not 0.0.0.0:7860 (this fits with your comment about it not working even when you added the rule, and possibly needing a --listen parameter). My intention behind duckdns was to make it easier to always SSH to the same address, rather than having to either pay for an Elastic IP (which costs money when not in use) or look up the IP each time you shut down / start up the node. But I still use a tunnel; the command is just ssh foo.duckdns.org -L 7860:localhost:7860 (spelled out below) instead of having to swap out the IP each time.
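For reference, the tunnel workflow from point 3, spelled out (foo.duckdns.org is the placeholder from the comment above; admin is the login user used elsewhere in the thread):

ssh admin@foo.duckdns.org -L 7860:localhost:7860   # keep this session open...
# ...then browse to http://localhost:7860 on the local machine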

maurera (Author) commented Jan 4, 2024

Thanks for the explanation of how duckdns helps - that makes sense.

I'm not sure exactly how ephemeral storage works, but it looks like the CUDA drivers are installed there? Does that get wiped if I stop the instance and start it again later? And would that explain the problem?

I.e., is this installing CUDA to the ephemeral storage?

# install CUDA (from https://developer.nvidia.com/cuda-downloads)
cd /mnt/ephemeral
sudo -u admin wget --no-verbose https://developer.download.nvidia.com/compute/cuda/$CUDA_VERSION/local_installers/cuda_${CUDA_FULL_VERSION}_linux.run
sudo sh cuda_${CUDA_FULL_VERSION}_linux.run --silent
sudo -u admin rm cuda_${CUDA_FULL_VERSION}_linux.run

I don't see any obvious CUDA files there...

admin@ip-****:/mnt/ephemeral$ ls -l /mnt/ephemeral/
total 8388632
drwxr-xr-x 4 admin admin 4096 Jan 4 05:53 cache
drwx------ 2 root root 16384 Jan 4 05:53 lost+found
-rw------- 1 root root 8589934592 Jan 4 05:54 swapfile

mikeage (Owner) commented Jan 4, 2024

The ephemeral storage is local to the physical VM, rather than being a network volume. It's pretty fast, but it does not persist between shutdowns / startups. It should only be used for temporary storage.

The drivers aren't actually installed there; the package is just downloaded there (and then deleted). The installation itself goes somewhere like /usr/local/cuda (I didn't actually check, but in any case it's on the EBS volume, which persists).

I don't think this explains anything related to your issues, to be honest :-)
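A quick way to check both claims (a sketch based on the standard CUDA runfile layout, not commands from the thread):

ls -d /usr/local/cuda*      # the runfile installs under /usr/local/cuda-<version>, on the EBS root
df -h / /mnt/ephemeral      # root volume (EBS, persists) vs. the ephemeral mount (wiped on stop)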

mikeage (Owner) commented Jan 4, 2024

Ok, so I just made two changes related to A1111; I now install libtcmalloc-minimal4 and use xformers. Neither should be related, but they're both good ideas, I think, and xformers in particular gives about a 33% speed boost in my experience (from ~6it/s to 8it/s). A1111 should now perform much better.

I was not able to reproduce your start/stop bug. I stopped my instance, started it, waited about a minute or so for the DNS to update, SSH'ed in, and opened a tunnel. A1111 worked, and at full speed.

Given that everything is automated, perhaps try deleting everything and recreating, based on the current version, and see if the problem still exists?

maurera (Author) commented Jan 4, 2024

Ah nice.

I've terminated the instance and started a new one. I haven't been able to reproduce the issue yet (I tried restarting, and systemctl stopping / starting sdwebgui and invokeai alternately).

Note that when it was broken yesterday, I had noticed from df -h that /dev/nvme0n1p1 had 0 space available. Could that be the culprit? I had added a LoRA (a few hundred MB), added SD 1.5, and maybe generated about 30 images. I guess I should be using ephemeral storage for this, but the workflows I've been reading online say to add models to /home/admin/stable-diffusion-webui/models/

maurera (Author) commented Jan 4, 2024

By the way, I got direct browsing working without a tunnel.

  1. Add the --listen flag in setup.sh:
ExecStart=/usr/bin/env bash /home/admin/stable-diffusion-webui/webui.sh --xformers --listen
  2. Open a hole in the security group for the computer you're working from:
export WORKSTATION_PUBLIC_IP=$(curl ipinfo.io/ip)
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 7860 --cidr ${WORKSTATION_PUBLIC_IP}/32
  3. Browse to http://mysubdomain.duckdns.org:7860/
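(A hedged companion to step 2: the mirror-image call closes the hole again when you're done, with the same parameters as the authorize call.)

# revoke the rule when finished; arguments mirror authorize-security-group-ingress
aws ec2 revoke-security-group-ingress --group-id $SG_ID --protocol tcp --port 7860 --cidr ${WORKSTATION_PUBLIC_IP}/32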

I think that's secure enough?

mikeage (Owner) commented Jan 4, 2024

> Note that when it was broken yesterday, I had noticed from df -h that /dev/nvme0n1p1 had 0 space available. Could that be the culprit? I had added a LoRA (a few hundred MB), added SD 1.5, and maybe generated about 30 images. I guess I should be using ephemeral storage for this, but the workflows I've been reading online say to add models to /home/admin/stable-diffusion-webui/models/

Ok, now we're getting somewhere!

Running out of disk space is Bad™. I'm willing to blame everything on that.

The default filesystem is probably too small to install both A1111 and Invoke and have them be usable. Give it another 10GB or so and recreate. It's hard for me to give a good default, because every extra GB costs money, albeit very little ($0.08/mo per GB).
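(As a sketch of a no-recreate alternative: grow the volume in place. This assumes $VOLUME_ID stands in for your root EBS volume and that the root filesystem is ext4, which the lost+found directory in the listing above suggests.)

aws ec2 modify-volume --volume-id $VOLUME_ID --size 30   # size in GiB; $VOLUME_ID is a placeholder
# then, on the instance, grow the partition and the filesystem to match
sudo growpart /dev/nvme0n1 1      # growpart ships in the cloud-guest-utils package
sudo resize2fs /dev/nvme0n1p1     # ext4 resize; use xfs_growfs instead if the filesystem is XFS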

I wouldn't use ephemeral storage for anything like that. Honestly, don't use it at all, until you know why you want to use it. If you're not sure, don't :-)

The instructions you've seen are correct. Put your A1111 models in /home/admin/stable-diffusion-webui/models/.

> By the way, I got direct browsing working without a tunnel.
> ...
> I think that's secure enough?

It's actually not secure in the slightest, except that it's security by obscurity, aka, "what are the odds that someone will find your server and port?". It's probably not terrible, and what's the worst that can happen? They see your images? They generate some for free? I can't endorse it, but if you're ok with the risks, your implementation looks correct to me.

I'm glad to hear everything is working. If you have no other questions, can you close this issue? (if you do, please ask! This is not me trying to get rid of you, this is just me making sure I've answered everything you've asked)

maurera (Author) commented Jan 4, 2024

> Give it another 10GB or so and recreate.

Ah great, okay - thanks so much for explaining this too; it was really helpful!

And just one last bit before closing (sorry - the issue is solved; this is just me rambling).

> It's actually not secure in the slightest, except that it's security by obscurity

I'm only allowing access from my home IP address, though. I.e., I think what you're describing would be possible with --cidr 0.0.0.0/0, but I've only opened it up to --cidr ${WORKSTATION_PUBLIC_IP}/32. Isn't the only risk that someone will find my house, connect to my Wi-Fi, and then connect to my server from my home network?

mikeage (Owner) commented Jan 4, 2024

oh, sorry, I totally missed that you only opened it to your IP! I probably should have read what you wrote, and not what I thought I saw ;-). As you say, there's basically zero risk in that case[1].

Also, please feel free to keep asking or sharing! I wrote this repo for myself, but it's always great to hear that it helps others, and I'm (almost) always glad to try to help someone get up and running.

[1] If your ISP uses "CGN" aka "NAT444" then you might be opening it to a bunch of people. Most don't, but there are some that do.
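(A quick heuristic for checking this, reusing the curl trick from earlier in the thread: compare your router's reported WAN address with what the internet sees.)

# if your router's WAN IP differs from the address printed here, you're likely behind CGN / NAT444
curl ipinfo.io/ip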

maurera (Author) commented Jan 4, 2024

Ok perfect, thanks again!

PS - I wrote up a quick Python helper to find the cheapest region. We can knock the price down from $0.21 to $0.07 by moving to a cheaper region:

import json
import pandas as pd
import requests

def parse_json_to_df(json_data) -> pd.DataFrame:
    # read an aws spot pricing json data object and convert to DataFrame
    result = []
    for region_list in json_data['config']['regions']:
        region = region_list['region']
        for instance_list in region_list['instanceTypes']:
            for size_list in instance_list['sizes']:
                instance_type = size_list['size']
                for values_list in size_list['valueColumns']:
                    os = values_list['name']
                    if os != 'linux':
                        continue
                    prices = values_list['prices']
                    price = prices.get('USD')
                    result.append([region, instance_type, os, price])
    df = pd.DataFrame(result, columns=['region', 'instanceType', 'os', 'price'])
    # prices arrive as strings; make them numeric so sorting compares values, not text
    df['price'] = pd.to_numeric(df['price'], errors='coerce')
    return df

spot_prices_url = 'https://website.spot.ec2.aws.a2z.com/spot.json'
data_string = requests.get(spot_prices_url).content.decode()
data_json = json.loads(data_string)
df = parse_json_to_df(data_json)

# find the 5 cheapest regions
print(df[df.instanceType=='g4dn.xlarge'].sort_values('price').reset_index(drop=True).iloc[:5,:])
# output is 
#             region instanceType     os   price
# 0       eu-south-1  g4dn.xlarge  linux  0.0704
# 1   ap-northeast-3  g4dn.xlarge  linux  0.0808
# 2        sa-east-1  g4dn.xlarge  linux   0.111
# 3       me-south-1  g4dn.xlarge  linux   0.118
# 4  us-west-2-lax-1  g4dn.xlarge  linux  0.1285

It's just a bit annoying to have to open a new quota-increase request for each new region to raise the allowed number of spot vCPUs. It seems like it can take a couple of days.
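(A sketch of filing that request from the CLI instead of the console. The quota code below is assumed to be "All G and VT Spot Instance Requests"; confirm it with the list call first.)

# find the spot-related quota codes for EC2 in the target region
aws service-quotas list-service-quotas --service-code ec2 --region eu-south-1 \
    --query "Quotas[?contains(QuotaName, 'Spot')].[QuotaCode,QuotaName]" --output table
# request the increase; the value is in vCPUs (a g4dn.xlarge uses 4)
aws service-quotas request-service-quota-increase --service-code ec2 \
    --quota-code L-3819A6DF --desired-value 8 --region eu-south-1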

maurera closed this as completed Jan 5, 2024