
Add support to sequential cpu offload 8GB VRAM maybe less #75

Merged 2 commits into jy0205:main on Oct 13, 2024

Conversation

@rodjjo (Contributor) commented Oct 13, 2024

What

Use even less memory to run the model.

Why

To help people with 8GB VRAM GPUs.

Description

Add support for sequential CPU offloading.

  • Takes longer, but runs with very low VRAM and is much faster than running on CPU only.
  • Keeps the previous offloading logic.

For the reviewer

Generate a text-to-video and an image-to-video to validate (I ported the logic from a personal project to this one).

How to use:

from pyramid_dit import PyramidDiTForVideoGeneration  # model class from this repo

model = PyramidDiTForVideoGeneration(
    'PATH',                                         # the downloaded checkpoint dir
    model_dtype,                                    # e.g. 'bf16'
    model_variant='diffusion_transformer_768p',     # or 'diffusion_transformer_384p'
)

model.vae.enable_tiling()
model.enable_sequential_cpu_offload()
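
Under the hood, a sequential offload like this typically wraps each large sub-module with accelerate's cpu_offload, which keeps the weights in CPU RAM and streams each layer to the GPU only for its forward pass. A minimal sketch of the idea; the sub-module names (dit, text_encoder, vae) are assumptions for illustration, not taken from this PR:

import torch
from accelerate import cpu_offload

def enable_sequential_cpu_offload(self, device='cuda'):
    # Keep all weights in CPU RAM; accelerate moves each sub-layer to the
    # GPU just in time for its forward pass, then evicts it. Slower, but
    # peak VRAM drops to roughly one layer's weights plus activations.
    for module in (self.dit, self.text_encoder, self.vae):  # names assumed
        cpu_offload(module, execution_device=torch.device(device))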


@rodjjo changed the title from "Add support to sequential cpu offload 8GB VRAM maybe" to "Add support to sequential cpu offload 8GB VRAM maybe less" on Oct 13, 2024
@Ednaordinary (Contributor) commented Oct 13, 2024

I couldn't get this to run; it looks like the cpu_offload function didn't get imported.

Oops: just need to change the accelerate import to include cpu_offload.

@feifeiobama (Collaborator)

> I couldn't get this to run; it looks like the cpu_offload function didn't get imported.
>
> Oops: just need to change the accelerate import to include cpu_offload.

This is indeed a problem. @rodjjo, could you submit another commit to include from accelerate import Accelerator, cpu_offload? Then I will immediately merge this pull request and update the README.md.
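
For reference, the fix is a one-line change to the existing accelerate import (the exact file it lives in is not shown in this thread):

# import cpu_offload alongside Accelerator
from accelerate import Accelerator, cpu_offload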

@Ednaordinary (Contributor)

With a really short video, I was also able to run in as little as 2GB with #76 merged and tile_sample_min_size lowered to 64 (48, 32, and 16 progressively became slower, and VRAM usage didn't go down further).

With this, VAE decode is now the limiting factor. The exact VAE decode usage seems to vary with temp:

  • temp=16 (5 sec): 4.808 GiB
  • temp=10: 3.242 GiB
  • temp=5: 1.889 GiB
  • temp=3: 1.780 GiB

(anything in between should work; these are just the values I sampled)
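
To reproduce this lower-VRAM setup, the tile size can be shrunk after enabling tiling. A sketch, assuming tile_sample_min_size is a plain attribute on the VAE as the comment above implies:

model.vae.enable_tiling()
# smaller tiles trade decode speed for lower VRAM; 64 was the sweet
# spot reported above (48/32/16 got slower with no VRAM reduction)
model.vae.tile_sample_min_size = 64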

@feifeiobama (Collaborator) commented Oct 13, 2024

> With a really short video, I was also able to run in as little as 2GB with #76 merged and tile_sample_min_size lowered to 64 (48, 32, and 16 progressively became slower, and VRAM usage didn't go down further).
>
> With this, VAE decode is now the limiting factor. The exact VAE decode usage seems to vary with temp:
>
> temp=16 (5 sec): 4.808 GiB; temp=10: 3.242 GiB; temp=5: 1.889 GiB; temp=3: 1.780 GiB
>
> (anything in between should work; these are just the values I sampled)

This looks great. Thank you @rodjjo!

@rodjjo (Contributor, Author) commented Oct 13, 2024

Sorry for missing the import :D

@feifeiobama merged commit 04e68e1 into jy0205:main on Oct 13, 2024
@rodjjo deleted the sequential branch on October 13, 2024 at 10:48