Add cpu offloading #23

Merged (1 commit, Oct 11, 2024)
Conversation

@Ednaordinary (Contributor) commented Oct 10, 2024

This adds CPU offloading, allowing 768p 10s 24fps to run on a single 3090 (and likely within 12 GB too; if you have a 12 GB GPU, please let me know if it works!).

With these changes, inference can be run as in the following example (768p, 5s):

import time
import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16, fp16 or fp32	

model = PyramidDiTForVideoGeneration(
    'pyramid-flow',                                         # The downloaded checkpoint dir
    model_dtype,
    model_variant='diffusion_transformer_768p',     # 'diffusion_transformer_384p'
)

model.vae.enable_tiling() # this may be unnecessary with offloading, still testing

prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

start_time = time.time()
with torch.no_grad(), torch.amp.autocast('cuda', enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,     
        width=1280,
        temp=16,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,         # The guidance for the first frame
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        cpu_offloading=True,
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
print(f'Max allocated memory: {torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3:.3f}GiB')
print(f'Time: {time.time() - start_time:.2f} seconds')

This makes inference take longer, since module locations have to be swapped between the CPU and GPU, but it also allows the model to run in far less VRAM.

One possible downside is that it's not possible to specify the load device. I can add another parameter if needed, though I didn't want to complicate the pipeline too much. i2v should also be supported with this method, but I haven't added or tested it there.
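For reference, the offloading pattern is roughly the sketch below. This is not the PR's exact code: the helper name is illustrative, and the device parameter shows what an explicit load-device option could look like.

import torch

def run_offloaded(module, *args, device="cuda", **kwargs):
    # Move the sub-module (DiT, VAE, text encoder, ...) to the GPU only for
    # the stage that needs it, then return it to the CPU and release the cache.
    module.to(device)
    with torch.no_grad():
        out = module(*args, **kwargs)
    module.to("cpu")
    torch.cuda.empty_cache()
    return out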

I'll have exact memory allocations and timings shortly.

@Ednaordinary (Contributor Author)

768p, 10s:

Max allocated memory: 10.431GiB
Time: 1239.03 seconds

768p, 5s:

Max allocated memory: 10.432GiB
Time: 478.9 seconds

384p, 5s:

Max allocated memory: 10.432GiB
Time: 466.34 seconds

Please note that nvidia-smi/nvtop will likely report misleading VRAM usage, as this version seems to use less VRAM when less VRAM is available (I was able to run 768p 10s 24fps with only 12 GB free on my 24 GB card). What those tools show appears to be reserved memory rather than allocated memory.
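If you want to compare both numbers yourself, a minimal sketch (assuming a CUDA build of PyTorch) is:

import torch

# "Allocated" counts tensors actually in use; "reserved" counts the memory held
# by PyTorch's caching allocator, which is roughly what nvidia-smi/nvtop report.
allocated = torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3
reserved = torch.cuda.max_memory_reserved(device="cuda") / 1024 ** 3
print(f"Max allocated: {allocated:.3f} GiB")
print(f"Max reserved:  {reserved:.3f} GiB")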

@jy0205 (Owner) commented Oct 11, 2024

Thanks for your great work on CPU offloading! We will merge it into the main branch.

@Hussain-X

Nice, thanks. Maybe my 12 GB 4070 can run this now.

@drwootton

I have an RTX 3060 and an RTX 4070, each with 12 GB, running Ubuntu Linux. My RTX 4070 drives my X display server, which takes about 1 GB of VRAM.
If I try to run on the RTX 4070, I get an out-of-memory error. If I run on the RTX 3060, the 768p version runs fine.
The run reports 10.43 GB of memory used, so this makes sense: that plus roughly 1 GB for the X server (and some CUDA overhead) pushes it over 12 GB.
If I shut down my X server, the RTX 4070 runs fine.

So 12 GB of VRAM seems to be the current minimum. Not a problem, just an observation.

Also, as another data point, for the 10-second 768p video, the time is reported as 3261 seconds on the RTX 3060 and 1510 seconds on the RTX 4070.

@Mikerhinos

How do we activate it?

I tried running app.py -cpu_offloading=True, with both image-to-video and text-to-video, and it goes up to 24 GB of used VRAM (I have 16 GB, so it spills into shared memory, which is super slow: it estimates around 3h45 for the default text-to-video test :( ).
I even tried hardcoding the True value in the .py file and it still uses 24 GB of VRAM.

@Ednaordinary (Contributor Author)

It should be as simple as changing cpu_offload=False to cpu_offload=True in app.py on lines 99 and 121.

If this doesn't work, make sure you are using a more recent commit with the torch.cuda.empty_cache() lines merged, as I suspect your issue comes from cached VRAM not being deallocated as inference continues, because system-memory fallback is available.
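For illustration, the changed call in app.py might end up looking roughly like the sketch below. The arguments mirror the example in the PR description; the real file passes its own UI-driven values, and the flag appears as cpu_offloading in the generate() example above, though the comment here calls it cpu_offload, so check which name your copy of app.py uses.

frames = model.generate(
    prompt=prompt,
    num_inference_steps=[20, 20, 20],
    video_num_inference_steps=[10, 10, 10],
    height=768,
    width=1280,
    temp=16,
    guidance_scale=9.0,
    video_guidance_scale=5.0,
    output_type="pil",
    cpu_offloading=True,   # was False on the lines mentioned above
    save_memory=True,
)
torch.cuda.empty_cache()   # recent commits also clear cached VRAM; placement here is illustrative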

@Mikerhinos commented Oct 12, 2024

Oh, it's in app.py? I changed it in the changed file from the commit, lol.
I modified app.py and it worked for the first step (estimating 2 min to complete the video) using 13 GB of VRAM, then I got this in the terminal and usage rose to 20 GB (it's a fresh install from a couple of hours ago):

(dit) Warning: Do not preload pipeline components (i.e. to cuda) with cpu offloading enabled! Otherwise, a second transfer will occur needlessly taking up time.
(vae) Warning: Do not preload pipeline components (i.e. to cuda) with cpu offloading enabled! Otherwise, a second transfer will occur needlessly taking up time.

Edit: step 1 completed and VRAM has now jumped to 27 GB

@Ednaordinary (Contributor Author) commented Oct 12, 2024

Ah, app.py preloads components, which it shouldn't do when CPU offloading is enabled. Delete lines 52-54 here:

[Screenshot attachment: Screenshot_20241012-035126.png]

I'll make a PR to handle CPU offloading in app.py better tomorrow/today (after my next sleep).
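For context, the lines being deleted are eager device transfers, likely something along these lines (attribute names are assumptions based on the warnings above; the exact code in app.py may differ):

# With cpu_offloading enabled, generate() moves each component to the GPU on
# demand, so eager transfers like these only trigger the warnings above and
# cause a needless second copy.
model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")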

@Mikerhinos

No more warning, but there's still a jump to 20 GB at the end of step 1; then after a couple of minutes it goes to step 2 and jumps to 27 GB of VRAM (RAM usage drops from 32 GB to 23 GB over the same period, not sure if that helps).

"I'll make a PR to handle CPU offloading in app.py better tomorrow/today (after my next sleep)"
OK, thanks (no pressure) :) I'll make some other attempts tomorrow or in a few days.

My hardware, by the way:

  • RTX 4070 Ti Super 16GB
  • 32GB RAM

@Ednaordinary (Contributor Author)

Should be fixed in #76

feifeiobama also beat me to simplifying cpu_offloading in app.py, thanks!
