Add CogVideoX text-to-video generation model #9082

Merged: 107 commits, merged on Aug 7, 2024

Changes from all commits (107 commits)
c8e5491
Create autoencoder_kl3d.py
zRzRzRzRzRzRzR Jul 30, 2024
c341786
vae draft
zRzRzRzRzRzRzR Jul 30, 2024
bd6efd5
initial draft of cogvideo transformer
a-r-r-o-w Jul 30, 2024
bb91775
add imports
a-r-r-o-w Jul 30, 2024
59e6669
fix attention mask
a-r-r-o-w Jul 30, 2024
45cb1f9
fix layernorms
a-r-r-o-w Jul 30, 2024
84ff56e
fix with some review guide
zRzRzRzRzRzRzR Jul 30, 2024
a3d827f
rename
zRzRzRzRzRzRzR Jul 30, 2024
dc7e6e8
fix error
zRzRzRzRzRzRzR Jul 30, 2024
aff72ec
Update autoencoder_kl3d.py
zRzRzRzRzRzRzR Jul 30, 2024
cb5348a
fix nasty bug in 3d sincos pos embeds
a-r-r-o-w Jul 30, 2024
e982881
refactor
a-r-r-o-w Jul 31, 2024
d963b1a
update conversion script for latest modeling changes
a-r-r-o-w Jul 31, 2024
1696758
remove debug prints
a-r-r-o-w Jul 31, 2024
21a0fc1
make style
a-r-r-o-w Jul 31, 2024
d83c1f8
add workflow to rebase with upstream main nightly.
sayakpaul Jul 29, 2024
dfeb329
add upstream
sayakpaul Jul 29, 2024
71bcb1e
Revert "add workflow to rebase with upstream main nightly."
sayakpaul Jul 29, 2024
0980f4d
add workflow for rebasing with upstream automatically.
sayakpaul Jul 29, 2024
ee40f0e
follow review guide
zRzRzRzRzRzRzR Jul 31, 2024
8fe54bc
add
zRzRzRzRzRzRzR Jul 31, 2024
1c661ce
remove deriving and using nn.module
zRzRzRzRzRzRzR Jul 31, 2024
73b041e
Merge branch 'cogvideox' into cogvideox-common-draft-1
a-r-r-o-w Jul 31, 2024
b305280
add skeleton for pipeline
a-r-r-o-w Jul 31, 2024
6bcafcb
make fix-copies
a-r-r-o-w Jul 31, 2024
ec9508c
Merge branch 'main' into cogvideox-common-draft-2
a-r-r-o-w Jul 31, 2024
3ae9413
undo unnecessary changes added on cogvideo-vae by mistake
a-r-r-o-w Jul 31, 2024
2be7469
groups->norm_num_groups
a-r-r-o-w Jul 31, 2024
9f9d0cb
verify CogVideoXSpatialNorm3D implementation
a-r-r-o-w Jul 31, 2024
c43a8f5
minor factor and repositioning of code in order of invocation
a-r-r-o-w Jul 31, 2024
5f183bf
reorder upsampling/downsampling blocks in order of invocation
a-r-r-o-w Jul 31, 2024
470815c
minor refactor
a-r-r-o-w Jul 31, 2024
e67cc5a
implement encode prompt
a-r-r-o-w Jul 31, 2024
d45d199
make style
a-r-r-o-w Jul 31, 2024
73469f9
make fix-copies
a-r-r-o-w Jul 31, 2024
45f7127
fix bug in handling long prompts
a-r-r-o-w Jul 31, 2024
a449ceb
update conversion script
a-r-r-o-w Jul 31, 2024
4498cfc
add doc draft
zRzRzRzRzRzRzR Jul 31, 2024
2956866
Merge branch 'cogvideox-common-draft-2' of https://github.com/hugging…
zRzRzRzRzRzRzR Jul 31, 2024
bb4740c
add clear_fake_cp_cache
zRzRzRzRzRzRzR Jul 31, 2024
e05f834
refactor vae
a-r-r-o-w Aug 1, 2024
03c28ee
modeling fixes
a-r-r-o-w Aug 1, 2024
712ddbe
make style
a-r-r-o-w Aug 1, 2024
03ee7cd
add pipeline implementation
a-r-r-o-w Aug 1, 2024
a31db5f
using with 226 instead of 225 of final weight
zRzRzRzRzRzRzR Aug 1, 2024
351d1f0
remove 0.transformer_blocks.encoder.embed_tokens.weight
zRzRzRzRzRzRzR Aug 1, 2024
d0b8db2
update
a-r-r-o-w Aug 1, 2024
fe6f5d6
ensure tokenizer config correctly uses 226 as text length
a-r-r-o-w Aug 1, 2024
4c2e887
add cogvideo specific attn processor
a-r-r-o-w Aug 1, 2024
41da084
remove debug prints
a-r-r-o-w Aug 1, 2024
77558f3
add pipeline docs
a-r-r-o-w Aug 1, 2024
e12458e
make style
a-r-r-o-w Aug 1, 2024
c33dd02
remove incorrect copied from
a-r-r-o-w Aug 1, 2024
71e7c82
vae problem fix
zRzRzRzRzRzRzR Aug 2, 2024
ec53a30
schedule
zRzRzRzRzRzRzR Aug 2, 2024
551c884
remove debug prints
a-r-r-o-w Aug 2, 2024
3def905
update
a-r-r-o-w Aug 2, 2024
65f6211
Merge pull request #4 from huggingface/cogvideox-refactor-to-diffusers
a-r-r-o-w Aug 2, 2024
21509aa
fp16 problem
zRzRzRzRzRzRzR Aug 2, 2024
b42b079
fix some comment
zRzRzRzRzRzRzR Aug 3, 2024
477e12b
fix
zRzRzRzRzRzRzR Aug 3, 2024
fd0831c
timestep fix
zRzRzRzRzRzRzR Aug 3, 2024
d99528b
Restore the timesteps parameter
zRzRzRzRzRzRzR Aug 3, 2024
c7ee165
Update downsampling.py
zRzRzRzRzRzRzR Aug 3, 2024
61c6da0
remove chunked ff code; reuse and refactor to support temb directly i…
a-r-r-o-w Aug 3, 2024
fa7fa9c
make inference 2-3x faster (by fixing the bug i introduced) 🚀😎
a-r-r-o-w Aug 3, 2024
6988cc3
new schedule with dpm
zRzRzRzRzRzRzR Aug 4, 2024
ba4223a
remove attenstion mask
zRzRzRzRzRzRzR Aug 4, 2024
312f7dc
apply suggestions from review
a-r-r-o-w Aug 4, 2024
1b1b26b
make style
a-r-r-o-w Aug 4, 2024
ba1855c
add workflow to rebase with upstream main nightly.
sayakpaul Jul 29, 2024
7360ea1
add upstream
sayakpaul Jul 29, 2024
2f1b787
Revert "add workflow to rebase with upstream main nightly."
sayakpaul Jul 29, 2024
90aa8be
add workflow for rebasing with upstream automatically.
sayakpaul Jul 29, 2024
5781e01
Merge branch 'huggingface:main' into main
a-r-r-o-w Aug 4, 2024
92c8c00
make fix-copies
a-r-r-o-w Aug 4, 2024
fd11c0f
Merge branch 'main' into cogvideox-common-draft-2
a-r-r-o-w Aug 4, 2024
03580c0
remove cogvideox-specific attention processor
a-r-r-o-w Aug 4, 2024
01c2dff
update docs
a-r-r-o-w Aug 4, 2024
311845f
update docs
a-r-r-o-w Aug 4, 2024
1b1b737
cogvideox branch
zRzRzRzRzRzRzR Aug 5, 2024
2d9602c
add CogVideoX team, Tsinghua University & ZhipuAI
zRzRzRzRzRzRzR Aug 5, 2024
fb6130f
Merge branch 'cogvideox-common-draft-2' of github.com:huggingface/dif…
zRzRzRzRzRzRzR Aug 5, 2024
511c9ef
merge remote branch
zRzRzRzRzRzRzR Aug 5, 2024
123ecef
Merge branch 'main' into cogvideox-2b
a-r-r-o-w Aug 5, 2024
cf7369d
fix some error
zRzRzRzRzRzRzR Aug 5, 2024
9c6b889
rename unsample and add some docs
zRzRzRzRzRzRzR Aug 5, 2024
22dcceb
messages
zRzRzRzRzRzRzR Aug 5, 2024
e4d65cc
update
yiyixuxu Aug 5, 2024
6f4e60b
Merge branch 'cogvideox-2b' of github.com:zRzRzRzRzRzRzR/diffusers in…
yiyixuxu Aug 5, 2024
70a54a8
use num_frames instead of num_seconds
a-r-r-o-w Aug 5, 2024
b3428ad
Merge branch 'main' into cogvideox-2b
a-r-r-o-w Aug 5, 2024
9a0b906
restore
zRzRzRzRzRzRzR Aug 5, 2024
32da2e7
Update lora_conversion_utils.py
zRzRzRzRzRzRzR Aug 5, 2024
878f609
remove dynamic guidance scale
a-r-r-o-w Aug 5, 2024
de9e0b2
address review comments
a-r-r-o-w Aug 6, 2024
9c086f5
dynamic cfg; fix cfg support
a-r-r-o-w Aug 6, 2024
62d94aa
address review comments
a-r-r-o-w Aug 6, 2024
5e4dd15
update tests
a-r-r-o-w Aug 6, 2024
884ddd0
Merge branch 'main' into cogvideox-2b
a-r-r-o-w Aug 6, 2024
d1c575a
fix docs error
a-r-r-o-w Aug 6, 2024
11224d9
alternative implementation to context parallel cache
a-r-r-o-w Aug 6, 2024
70cea91
Update docs/source/en/api/pipelines/cogvideox.md
yiyixuxu Aug 6, 2024
cbc4d32
remove tiling and slicing until their implementations are complete
a-r-r-o-w Aug 6, 2024
14698d0
Merge branch 'main' into cogvideox-2b
sayakpaul Aug 7, 2024
8be845d
Merge branch 'main' into cogvideox-2b
sayakpaul Aug 7, 2024
827a70a
Apply suggestions from code review
sayakpaul Aug 7, 2024
6 changes: 6 additions & 0 deletions docs/source/en/_toctree.yml
@@ -239,6 +239,8 @@
title: VQModel
- local: api/models/autoencoderkl
title: AutoencoderKL
- local: api/models/autoencoderkl_cogvideox
title: AutoencoderKLCogVideoX
- local: api/models/asymmetricautoencoderkl
title: AsymmetricAutoencoderKL
- local: api/models/stable_cascade_unet
@@ -263,6 +265,8 @@
title: FluxTransformer2DModel
- local: api/models/latte_transformer3d
title: LatteTransformer3DModel
- local: api/models/cogvideox_transformer3d
title: CogVideoXTransformer3DModel
- local: api/models/lumina_nextdit2d
title: LuminaNextDiT2DModel
- local: api/models/transformer_temporal
@@ -302,6 +306,8 @@
title: AutoPipeline
- local: api/pipelines/blip_diffusion
title: BLIP-Diffusion
- local: api/pipelines/cogvideox
title: CogVideoX
- local: api/pipelines/consistency_models
title: Consistency Models
- local: api/pipelines/controlnet
2 changes: 2 additions & 0 deletions docs/source/en/api/loaders/single_file.md
@@ -22,6 +22,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:

## Supported pipelines

- [`CogVideoXPipeline`]
- [`StableDiffusionPipeline`]
- [`StableDiffusionImg2ImgPipeline`]
- [`StableDiffusionInpaintPipeline`]
@@ -49,6 +50,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
- [`UNet2DConditionModel`]
- [`StableCascadeUNet`]
- [`AutoencoderKL`]
- [`AutoencoderKLCogVideoX`]
- [`ControlNetModel`]
- [`SD3Transformer2DModel`]
- [`FluxTransformer2DModel`]
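Since the list above now includes `CogVideoXPipeline` and `AutoencoderKLCogVideoX`, here is a hedged sketch of what single-file loading could look like; the checkpoint path is hypothetical and only illustrates the documented API surface, not a verified workflow from this PR.

```python
import torch
from diffusers import CogVideoXPipeline

# Hypothetical local checkpoint path, shown only as an illustration of the
# from_single_file support listed in the doc above.
pipe = CogVideoXPipeline.from_single_file(
    "./checkpoints/cogvideox-2b.safetensors", torch_dtype=torch.float16
)
```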
37 changes: 37 additions & 0 deletions docs/source/en/api/models/autoencoderkl_cogvideox.md
@@ -0,0 +1,37 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLCogVideoX

The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```
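For illustration, a minimal sketch of round-tripping a video tensor through `encode` and `decode`; the tensor shape is an assumption chosen for brevity, not a value taken from the CogVideoX configuration.

```python
import torch

# Illustrative shape only: (batch, channels, frames, height, width).
# Real CogVideoX videos are far larger; this is just a round-trip sketch.
video = torch.randn(1, 3, 9, 64, 64, dtype=torch.float16, device="cuda")

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # sample from the latent posterior
    reconstruction = vae.decode(latents).sample       # decoded video tensor
```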

## AutoencoderKLCogVideoX

[[autodoc]] AutoencoderKLCogVideoX
- decode
- encode
- all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
30 changes: 30 additions & 0 deletions docs/source/en/api/models/cogvideox_transformer3d.md
@@ -0,0 +1,30 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# CogVideoXTransformer3DModel

The Diffusion Transformer model for 3D video data used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```
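For illustration, a minimal sketch of a single denoising forward pass; the latent and prompt-embedding shapes below follow the CogVideoX-2b configuration as an assumption and are not stated in the original doc.

```python
import torch

# Assumed shapes: latents of (batch, frames, latent channels, height, width)
# and T5 prompt embeddings of (batch, sequence length, hidden size).
latents = torch.randn(1, 13, 16, 60, 90, dtype=torch.float16, device="cuda")
prompt_embeds = torch.randn(1, 226, 4096, dtype=torch.float16, device="cuda")
timestep = torch.tensor([999], device="cuda")

with torch.no_grad():
    noise_pred = transformer(
        hidden_states=latents,
        encoder_hidden_states=prompt_embeds,
        timestep=timestep,
    ).sample
```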

## CogVideoXTransformer3DModel

[[autodoc]] CogVideoXTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
91 changes: 91 additions & 0 deletions docs/source/en/api/pipelines/cogvideox.md
@@ -0,0 +1,91 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# CogVideoX

<!-- TODO: update paper with ArXiv link when ready. -->

[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI.

The abstract from the paper is:

*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at https://github.com/THUDM/CogVideo.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
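For example, assuming the `CogVideoXDPMScheduler` class added in this PR is exported from `diffusers`, a sketch of swapping the pipeline's scheduler from its existing config looks like this:

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")

# Swap in the DPM variant, reusing the current scheduler's configuration.
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
```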

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found at [THUDM/CogVideo](https://github.com/THUDM/CogVideo), and the original weights are available under [hf.co/THUDM](https://huggingface.co/THUDM).

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Review comment (Member): Do we need to include this code block to demonstrate torch.compile, or is it to show inference time without torch.compile? If it's not necessary, I'm more in favor of just showing the below to keep it simpler.

```python
# create pipeline
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16).to("cuda")

# set to channels_last
pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

# compile
pipe.transformer = torch.compile(pipe.transformer)
pipe.vae.decode = torch.compile(pipe.vae.decode)

# inference
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipe.transformer = torch.compile(pipe.transformer)
pipe.vae.decode = torch.compile(pipe.vae.decode)

# CogVideoX works very well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
```

The [benchmark](TODO: link) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```

Review comment (Member): We can also include a tip section like we have in Flux: https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux. This way users are aware of the optimizations that are possible.

Review comment (Member): We should probably also mention that users can benefit from context-parallel caching.
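As a hedged illustration of the kind of optimization such a tip could cover, assuming the standard `DiffusionPipeline` offloading API applies to `CogVideoXPipeline`:

```python
# Trade throughput for memory: keep components on the CPU and move each one
# to the GPU only while it is in use (reuses `pipe` and `prompt` from above).
pipe.enable_model_cpu_offload()

video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
```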

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
- all
- __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipelineOutput