
Add support to sequential cpu offload 8GB VRAM maybe less #75

Merged 2 commits into jy0205:main on Oct 13, 2024

Conversation

@rodjjo (Contributor) commented Oct 13, 2024

What

Use even less memory to run the model.

Why

To help people with 8GB VRAM GPUs.

Description

Add support for sequential CPU offloading.

  • Takes longer, but runs with very low VRAM and is much faster than running on CPU only.
  • Keeps the previous offloading logic.

For the reviewer

Generate a text-to-video and an image-to-video to validate (I ported the logic from a personal project to this one).

How to use:

from pyramid_dit import PyramidDiTForVideoGeneration  # model class from this repo

model = PyramidDiTForVideoGeneration(
    'PATH',                                         # the downloaded checkpoint dir
    model_dtype,                                    # e.g. 'bf16'
    model_variant='diffusion_transformer_768p',     # or 'diffusion_transformer_384p'
)

model.vae.enable_tiling()
model.enable_sequential_cpu_offload()
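
Under the hood, a sequential offload like this typically wraps each large sub-module with accelerate's cpu_offload, which keeps the weights in CPU RAM and streams each layer to the GPU only for its forward pass. A minimal sketch of the idea; the sub-module names (dit, text_encoder, vae) are assumptions for illustration, not taken from this PR:

import torch
from accelerate import cpu_offload

def enable_sequential_cpu_offload(self, device='cuda'):
    # Keep all weights in CPU RAM; accelerate moves each sub-layer to the
    # GPU just in time for its forward pass, then evicts it. Slower, but
    # peak VRAM drops to roughly one layer's weights plus activations.
    for module in (self.dit, self.text_encoder, self.vae):  # names assumed
        cpu_offload(module, execution_device=torch.device(device))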


@rodjjo changed the title from "Add support to sequential cpu offload 8GB VRAM maybe" to "Add support to sequential cpu offload 8GB VRAM maybe less" on Oct 13, 2024
@Ednaordinary (Contributor) commented Oct 13, 2024

I couldn't get this to run; it looks like the cpu_offload function didn't get imported.

Oops: just need to change the accelerate import to include cpu_offload.

@feifeiobama (Collaborator)

> I couldn't get this to run; it looks like the cpu_offload function didn't get imported.
>
> Oops: just need to change the accelerate import to include cpu_offload.

This is indeed a problem. @rodjjo, could you submit another commit to include from accelerate import Accelerator, cpu_offload? Then I will immediately merge this pull request and update the README.md.
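
For reference, the fix is a one-line change to the existing accelerate import (the exact file it lives in is not shown in this thread):

# import cpu_offload alongside Accelerator
from accelerate import Accelerator, cpu_offload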

@Ednaordinary (Contributor)

With a really short video, I was also able to run in as little as 2GB with #76 merged and tile_sample_min_size lowered to 64 (48, 32, and 16 progressively became slower, and VRAM usage didn't go down further).

With this, VAE decode is now the limiting factor. The exact VAE decode usage seems to vary with temp:

  • temp=16 (5 sec): 4.808 GiB
  • temp=10: 3.242 GiB
  • temp=5: 1.889 GiB
  • temp=3: 1.780 GiB

(anything in between should work; these are just the values I sampled)
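
To reproduce this lower-VRAM setup, the tile size can be shrunk after enabling tiling. A sketch, assuming tile_sample_min_size is a plain attribute on the VAE as the comment above implies:

model.vae.enable_tiling()
# smaller tiles trade decode speed for lower VRAM; 64 was the sweet
# spot reported above (48/32/16 got slower with no VRAM reduction)
model.vae.tile_sample_min_size = 64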

@feifeiobama (Collaborator) commented Oct 13, 2024

> With a really short video, I was also able to run in as little as 2GB with #76 merged and tile_sample_min_size lowered to 64 (48, 32, and 16 progressively became slower, and VRAM usage didn't go down further).
>
> With this, VAE decode is now the limiting factor. The exact VAE decode usage seems to vary with temp:
>
> temp=16 (5 sec): 4.808 GiB; temp=10: 3.242 GiB; temp=5: 1.889 GiB; temp=3: 1.780 GiB
>
> (anything in between should work; these are just the values I sampled)

This looks great. Thank you @rodjjo!

@rodjjo (Contributor, Author) commented Oct 13, 2024

Sorry for missing the import :D

@feifeiobama merged commit 04e68e1 into jy0205:main on Oct 13, 2024
@rodjjo deleted the sequential branch on October 13, 2024 at 10:48