Add cpu offloading #23

Merged (1 commit, Oct 11, 2024)
Conversation

@Ednaordinary (Contributor) commented Oct 10, 2024

This adds CPU offloading, allowing 768p 10s 24fps to run on a single 3090 (and likely within 12 GB too; if you have a 12 GB GPU, please let me know if it works!).

With these changes, inference can be run as in the following example (768p, 5s):

import time
import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16, fp16 or fp32	

model = PyramidDiTForVideoGeneration(
    'pyramid-flow',                                         # The downloaded checkpoint dir
    model_dtype,
    model_variant='diffusion_transformer_768p',     # 'diffusion_transformer_384p'
)

model.vae.enable_tiling() # this may be unnecessary with offloading, still testing

prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

start_time = time.time()
with torch.no_grad(), torch.amp.autocast('cuda', enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,     
        width=1280,
        temp=16,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,         # The guidance for the first frame
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        cpu_offloading=True,
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
print(f'Max allocated memory: {torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3:.3f}GiB')
print(f'Time: {time.time() - start_time:.2f} seconds')

This makes inference take longer, since module locations have to be swapped between the CPU and GPU, but it also allows the model to run in far less VRAM.

One possible downside is that it's not possible to specify the load device. I can add another parameter if needed, though I didn't want to complicate the pipeline too much. i2v should also be supported with this method, but I haven't added or tested it there.
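For reference, the offloading pattern is roughly the sketch below. This is not the PR's exact code: the helper name is illustrative, and the device parameter shows what an explicit load-device option could look like.

import torch

def run_offloaded(module, *args, device="cuda", **kwargs):
    # Move the sub-module (DiT, VAE, text encoder, ...) to the GPU only for
    # the stage that needs it, then return it to the CPU and release the cache.
    module.to(device)
    with torch.no_grad():
        out = module(*args, **kwargs)
    module.to("cpu")
    torch.cuda.empty_cache()
    return out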

I'll have exact memory allocations and timings shortly.

@Ednaordinary (Contributor Author)

768p, 10s:

Max allocated memory: 10.431GiB
Time: 1239.03 seconds

768p, 5s:

Max allocated memory: 10.432GiB
Time: 478.9 seconds

384p, 5s:

Max allocated memory: 10.432GiB
Time: 466.34 seconds

Please note that nvidia-smi/nvtop will likely report misleading VRAM usage, as this version seems to use less VRAM when less VRAM is available (I was able to run 768p 10s 24fps with only 12 GB free on my 24 GB card). What those tools show appears to be reserved memory rather than allocated memory.
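If you want to compare both numbers yourself, a minimal sketch (assuming a CUDA build of PyTorch) is:

import torch

# "Allocated" counts tensors actually in use; "reserved" counts the memory held
# by PyTorch's caching allocator, which is roughly what nvidia-smi/nvtop report.
allocated = torch.cuda.max_memory_allocated(device="cuda") / 1024 ** 3
reserved = torch.cuda.max_memory_reserved(device="cuda") / 1024 ** 3
print(f"Max allocated: {allocated:.3f} GiB")
print(f"Max reserved:  {reserved:.3f} GiB")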

@jy0205 (Owner) commented Oct 11, 2024

Thanks for your great work on CPU offloading! We will merge it into the main branch.

@Hussain-X

Nice, thanks. Maybe my 12 GB 4070 can run this now.

@drwootton

I have an RTX 3060 and an RTX 4070, each with 12 GB, running Ubuntu Linux. My RTX 4070 drives my X display server, which takes about 1 GB of VRAM.
If I try to run on the RTX 4070, I get an out-of-memory error. If I run on the RTX 3060, the 768p version runs fine.
The run reports 10.43 GB of memory used, so this makes sense: that plus roughly 1 GB for the X server (and some CUDA overhead) pushes it over 12 GB.
If I shut down my X server, the RTX 4070 runs fine.

So 12 GB of VRAM seems to be the current minimum. Not a problem, just an observation.

Also, as another data point, for the 10-second 768p video, the time is reported as 3261 seconds on the RTX 3060 and 1510 seconds on the RTX 4070.

@Mikerhinos

How do we activate it?

I tried running app.py -cpu_offloading=True, with both image-to-video and text-to-video, and it goes up to 24 GB of used VRAM (I have 16 GB, so it spills into shared memory, which is super slow: it estimates around 3h45 for the default text-to-video test :( ).
I even tried hardcoding the True value in the .py file and it still uses 24 GB of VRAM.

@Ednaordinary (Contributor Author)

It should be as simple as changing cpu_offload=False to cpu_offload=True in app.py on lines 99 and 121.

If this doesn't work, make sure you are using a more recent commit with the torch.cuda.empty_cache() lines merged, as I suspect your issue comes from cached VRAM not being deallocated as inference continues, because system-memory fallback is available.
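For illustration, the changed call in app.py might end up looking roughly like the sketch below. The arguments mirror the example in the PR description; the real file passes its own UI-driven values, and the flag appears as cpu_offloading in the generate() example above, though the comment here calls it cpu_offload, so check which name your copy of app.py uses.

frames = model.generate(
    prompt=prompt,
    num_inference_steps=[20, 20, 20],
    video_num_inference_steps=[10, 10, 10],
    height=768,
    width=1280,
    temp=16,
    guidance_scale=9.0,
    video_guidance_scale=5.0,
    output_type="pil",
    cpu_offloading=True,   # was False on the lines mentioned above
    save_memory=True,
)
torch.cuda.empty_cache()   # recent commits also clear cached VRAM; placement here is illustrative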

@Mikerhinos commented Oct 12, 2024

Oh, it's in app.py? I changed it in the changed file from the commit, lol.
I modified app.py and it worked for the first step (estimating 2 min to complete the video) using 13 GB of VRAM, then I got this in the terminal and usage rose to 20 GB (it's a fresh install from a couple of hours ago):

(dit) Warning: Do not preload pipeline components (i.e. to cuda) with cpu offloading enabled! Otherwise, a second transfer will occur needlessly taking up time.
(vae) Warning: Do not preload pipeline components (i.e. to cuda) with cpu offloading enabled! Otherwise, a second transfer will occur needlessly taking up time.

Edit: step 1 completed and VRAM has now jumped to 27 GB

@Ednaordinary (Contributor Author) commented Oct 12, 2024

Ah, app.py preloads components, which it shouldn't do when CPU offloading is enabled. Delete lines 52-54 here:

[Screenshot attachment: Screenshot_20241012-035126.png]

I'll make a PR to handle CPU offloading in app.py better tomorrow/today (after my next sleep).
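For context, the lines being deleted are eager device transfers, likely something along these lines (attribute names are assumptions based on the warnings above; the exact code in app.py may differ):

# With cpu_offloading enabled, generate() moves each component to the GPU on
# demand, so eager transfers like these only trigger the warnings above and
# cause a needless second copy.
model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")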

@Mikerhinos

No more warning, but there's still a jump to 20 GB at the end of step 1; then after a couple of minutes it goes to step 2 and jumps to 27 GB of VRAM (RAM usage drops from 32 GB to 23 GB over the same period, not sure if that helps).

"I'll make a PR to handle CPU offloading in app.py better tomorrow/today (after my next sleep)"
OK, thanks (no pressure) :) I'll make some other attempts tomorrow or in a few days.

My hardware, by the way:

  • RTX 4070 Ti Super 16GB
  • 32GB RAM

@Ednaordinary (Contributor Author)

Should be fixed in #76

feifeiobama also beat me to simplifying cpu_offloading in app.py, thanks!
