Here we provide an efficient MindSpore version of Open-Sora-Plan from Peking University. We would like to express our gratitude to their contributions! 👍
OpenSora-PKU is still under active development. Currently, we are in line with Open-Sora-Plan v1.2.0 (commit id).
Official News from OpenSora-PKU | MindSpore Support |
---|---|
[2024.10.16] 🎉 PKU released version 1.3.0, featuring: WFVAE, prompt refiner, data filtering strategy, sparse attention, and bucket training strategy. They also support 93x480p within 24G VRAM. More details can be found at their latest report. | 📝 Work in Progress |
[2024.07.24] 🔥🔥🔥 PKU launched Open-Sora Plan v1.2.0, utilizing a 3D full attention architecture instead of 2+1D. See their latest report. | ✅ V.1.2.0 CausalVAE inference & OpenSoraT2V multi-stage training |
[2024.05.27] 🚀🚀🚀 PKU launched Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out their latest report. | ✅ V.1.1.0 CausalVAE inference and LatteT2V inference & three-stage training (65x512x512, 221x512x512, 513x512x512) |
[2024.04.09] 🚀 PKU shared the latest exploration on metamorphic time-lapse video generation: MagicTime, and the training dataset (updating): Open-Sora-Dataset. | N.A. |
[2024.04.07] 🔥🔥🔥 PKU released Open-Sora-Plan v1.0.0. See their report. | ✅ CausalVAE+LatteT2V+T5 inference and three-stage training (17×256×256, 65×256×256, 65x512x512) |
[2024.03.27] 🚀🚀🚀 PKU released the report of VideoCausalVAE, which supports both images and videos. | ✅ CausalVAE training and inference |
[2024.03.10] 🚀🚀🚀 PKU supports training a latent size of 225×90×90 (t×h×w), which corresponds to training 1 minute of 1080P video at 30FPS (with 2× frame interpolation and 2× super resolution) under class-condition. | Frame interpolation and super-resolution are under development. |
[2024.03.08] PKU supports the training code of text-conditioned generation with 16 frames at 512x512. | ✅ CausalVAE+LatteT2V+T5 training (16x512x512) |
[2024.03.07] PKU supports training with 128 frames (when sample rate = 3, which is about 13 seconds) of 256x256, or 64 frames (which is about 6 seconds) of 512x512. | Class-conditioned training is under development. |
mindspore | ascend driver | firmware | cann toolkit/kernel |
---|---|---|---|
2.3.1 | 24.1RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |
The following videos are generated based on MindSpore and Ascend 910*.
29×1280×720 Text-to-Video Generation.
29x720x1280 (1.2s) |
---|
![]() |
A close-up of a woman’s face, illuminated by the soft light of dawn... |
29x720x1280 (1.2s) |
---|
![]() |
A young man in his 20s is sitting on a piece of cloud in the sky, reading a book... |
29x720x1280 (1.2s) |
---|
![]() |
A close-up of a woman with a vintage hairstyle and bright red lipstick... |
Videos are saved as .gif files for display.
- 📍 Open-Sora-Plan v1.2.0 with the following features
- ✅ CausalVAEModel_D4_4x8x8 inference. Supports video reconstruction.
- ✅ mT5-xxl TextEncoder model inference.
- ✅ Text-to-video generation up to 93 frames and 720x1280 resolution.
- ✅ Multi-stage training using Zero2 and Sequence parallelism.
- ✅ Acceleration methods: flash attention, recompute (gradient checkpointing), mixed precision, data parallelism, optimizer parallelism, etc.
- ✅ Evaluation metrics: PSNR and SSIM.
- Image-to-Video model [WIP].
- Scaling model parameters and dataset size [WIP].
- Evaluation of various metrics [WIP].
Your contributions are welcome.
Other useful documents and links are listed below.
- Use python>=3.8 [install]
- Please install MindSpore 2.3.1 according to the MindSpore official website and install CANN 8.0.RC2.beta1 as recommended by the official installation website.
- Install the requirements:
pip install -r requirements.txt
In case the decord package is not available, try pip install eva-decord.
For EulerOS, instructions on ffmpeg and decord installation are as follows.
How to install ffmpeg and decord
1. Install ffmpeg 4, referring to https://ffmpeg.org/releases
wget https://ffmpeg.org/releases/ffmpeg-4.0.1.tar.bz2 --no-check-certificate
tar -xvf ffmpeg-4.0.1.tar.bz2
mv ffmpeg-4.0.1 ffmpeg
cd ffmpeg
./configure --enable-shared # --enable-shared is needed for sharing libavcodec with decord
make -j 64
make install
2. Install decord, referring to https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source
git clone --recursive https://github.com/dmlc/decord
cd decord
rm -rf build && mkdir build && cd build  # remove any stale build directory first
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j 64
make install
cd ../python
python3 setup.py install --user
Please download the torch checkpoint of mT5-xxl from google/mt5-xxl, and download the opensora v1.2.0 models' weights from LanguageBind/Open-Sora-Plan-v1.2.0. Place them under examples/opensora_pku as shown below:
mindone/examples/opensora_pku
├───LanguageBind
│ └───Open-Sora-Plan-v1.2.0
│ ├───1x480p/
│ ├───29x480p/
│ ├───29x720p/
│ ├───93x480p/
│ ├───93x480p_i2v/
│ ├───93x720p/
│ └───vae/
└───google/
└───mt5-xxl/
├───config.json
├───generation_config.json
├───pytorch_model.bin
├───special_tokens_map.json
├───spiece.model
└───tokenizer_config.json
Currently, we can load .safetensors files directly in MindSpore, but not .bin or .ckpt files. We recommend converting the vae/checkpoint.ckpt and mt5-xxl/pytorch_model.bin files to .safetensors files manually by running the following commands:
python tools/model_conversion/convert_pytorch_ckpt_to_safetensors.py --src LanguageBind/Open-Sora-Plan-v1.2.0/vae/checkpoint.ckpt --target LanguageBind/Open-Sora-Plan-v1.2.0/vae/diffusion_pytorch_model.safetensors --config LanguageBind/Open-Sora-Plan-v1.2.0/vae/config.json
python tools/model_conversion/convert_pytorch_ckpt_to_safetensors.py --src google/mt5-xxl/pytorch_model.bin --target google/mt5-xxl/model.safetensors --config google/mt5-xxl/config.json
Once the checkpoint files have all been prepared, you can refer to the inference guidance below.
You can run the video-to-video reconstruction task using scripts/causalvae/rec_video.sh:
python examples/rec_video.py \
--ae_path LanguageBind/Open-Sora-Plan-v1.2.0/vae \
--video_path test.mp4 \
--rec_path rec.mp4 \
--device Ascend \
--sample_rate 1 \
--num_frames 65 \
--height 480 \
--width 640 \
--enable_tiling \
--tile_overlap_factor 0.125 \
--save_memory
Please change --video_path to the existing video file path and --rec_path to the reconstructed video file path. You can set --grid to save the original video and the reconstructed video in the same output file.
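For example, a minimal sketch of appending the flag to the reconstruction command above (the remaining arguments are unchanged):
python examples/rec_video.py \
--ae_path LanguageBind/Open-Sora-Plan-v1.2.0/vae \
--video_path test.mp4 \
--rec_path rec.mp4 \
--grid \
... # pass other arguments as shown above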
You can also run video reconstruction given an input video folder. See scripts/causalvae/rec_video_folder.sh.
You can run text-to-video inference on a single Ascend device using the script scripts/text_condition/single-device/sample_t2v_29x720p.sh:
python opensora/sample/sample_t2v.py \
--model_path LanguageBind/Open-Sora-Plan-v1.2.0/29x720p \
--num_frames 29 \
--height 720 \
--width 1280 \
--cache_dir "./" \
--text_encoder_name google/mt5-xxl \
--text_prompt examples/prompt_list_0.txt \
--ae CausalVAEModel_D4_4x8x8 \
--ae_path LanguageBind/Open-Sora-Plan-v1.2.0/vae \
--save_img_path "./sample_videos/prompt_list_0_29x720p" \
--fps 24 \
--guidance_scale 7.5 \
--num_sampling_steps 100 \
--enable_tiling \
--max_sequence_length 512 \
--sample_method EulerAncestralDiscrete \
--model_type "dit" \
You can change num_frames, height, and width to match the training shape of different checkpoints, e.g., 29x480p requires num_frames=29, height=480, and width=640. In case of OOM on your device, you can try appending --save_memory to the command above, which enables a more aggressive tiling strategy for the causal VAE.
If you want to run multi-device inference, e.g., on 8 cards, please use msrun and pass --use_parallel=True as in the example below:
# 8 NPUs
msrun --master_port=8200 --worker_num=8 --local_worker_num=8 --log_dir="output_log" \
python opensora/sample/sample_t2v.py \
--use_parallel True \
... # pass other arguments
The command above will run an 8-card inference and save the log files into "output_log". --master_port specifies the scheduler binding port number. --worker_num and --local_worker_num should be equal to the number of running devices, e.g., 8.
In case of the following error:
RuntimeError: Failed to register the compute graph node: 0. Reason: Repeated registration node: 0
Please edit master_port to a different port number in the range 1024 to 65535, and run the script again.
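For example, a minimal sketch of re-launching with a different port (other arguments unchanged):
msrun --master_port=9200 --worker_num=8 --local_worker_num=8 --log_dir="output_log" \
python opensora/sample/sample_t2v.py \
--use_parallel True \
... # pass other arguments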
See more examples of multi-device inference scripts under scripts/text_condition/multi-devices.
We support running inference with sequence parallelism. Please see sample_t2v_29x480p_sp.sh and sample_t2v_29x720p_sp.sh under scripts/text_condition/multi-devices/. If you set --sp_size 8 to run sequence parallelism on 8 NPUs, you should also edit the following:
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
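Below is a minimal sketch of an 8-NPU sequence-parallel inference launch, assuming the same msrun arguments as in the example above; please refer to the provided *_sp.sh scripts for the exact settings:
# 8 NPUs with sequence parallelism
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
msrun --master_port=8200 --worker_num=8 --local_worker_num=8 --log_dir="output_log" \
python opensora/sample/sample_t2v.py \
--use_parallel True \
--sp_size 8 \
... # pass other arguments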
Step 1: Downloading Datasets:
To train the causal VAE model, you need to prepare a video dataset. Open-Sora-Plan-v1.2.0 trains the VAE in two stages. In the first stage, the authors trained the VAE on the Kinetics-400 (K400) video dataset. Please download the K400 dataset from this repository. In the second stage, they trained the VAE on the Open-Sora-Dataset-v1.1.0 dataset. We provide a tutorial on how to download the v1.1.0 datasets. See the downloading tutorial.
Step 2: Converting Pretrained Weights:
As with v1.1.0, they initialized from the SD2.1 VAE using tail initialization for better convergence. Please download the torch weight file from the given URL.
After downloading the sd-vae-ft-mse weights, you can run:
python tools/model_conversion/convert_vae_2d.py --src path/to/diffusion.safetensors --target /path/to/sd-vae-ft-mse.ckpt
This converts the torch weight file into a MindSpore weight file. Then you can inflate the 2D VAE checkpoint into a 3D causal VAE initial weight file as follows:
python tools/model_conversion/inflate_vae2d_to_vae3d.py \
--src /path/to/sd-vae-ft-mse.ckpt \
--target pretrained/causal_vae_488_init.ckpt
In order to train the VAE with LPIPS loss, please also download lpips_vgg-426bf45c.ckpt and put it under pretrained/.
The first-stage training is conducted on 25-frame 256×256 videos of the K400 dataset. You can revise --video_path in the training script to the video folder path of your downloaded dataset. The training script will then load all video files under video_path recursively and use them as the training data. Make sure --load_from_checkpoint is set to the pretrained weight, e.g., pretrained/causal_vae_488_init.ckpt.
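For example, a minimal sketch of the revised launch command (both flags are described above; the remaining arguments follow scripts/causalvae/train_with_gan_loss.sh):
python opensora/train/train_causalvae.py \
--video_path /path/to/K400/videos \
--load_from_checkpoint pretrained/causal_vae_488_init.ckpt \
... # pass other arguments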
How to define the train and test set?
If you need to define the video files in the training set, please use a csv file with only one column, like:
"video"
folder_name/video_name_01.mp4
folder_name/video_name_02.mp4
...
Afterwards, you should revise the training script as below:
python opensora/train/train_causalvae.py \
--data_file_path path/to/train_set/csv/file \
--video_column "video" \
--video_path path/to/downloaded/dataset \
# pass other arguments
Similarly, you can create a csv file for the test set videos and pass it to --data_file_path in examples/rec_video_vae.py.
To launch a single-card training, please run:
bash scripts/causalvae/train_with_gan_loss.sh
Note:
- Resume training is supported by setting --resume_training_checkpoint True. The same applies to the multi-device training script.
For parallel training, please use msrun and pass --use_parallel=True.
# 8 NPUs
msrun --master_port=8200 --worker_num=8 --local_worker_num=8 --log_dir="output_log" \
python opensora/train/train_causalvae.py \
--use_parallel True \
... # pass other arguments, please refer to the single-device training script.
For more details, please take scripts/causalvae/train_with_gan_loss_multi_device.sh as an example.
After training, you will find the checkpoint files under the ckpt/ folder of the output directory. To evaluate the reconstruction quality of a checkpoint, you can take scripts/causalvae/rec_video_folder.sh and revise it like:
python examples/rec_video_folder.py \
--batch_size 1 \
--real_video_dir input_real_video_dir \
--generated_video_dir output_generated_video_dir \
--device Ascend \
--sample_fps 10 \
--sample_rate 1 \
--num_frames 65 \
--height 480 \
--width 640 \
--num_workers 8 \
--ae_path LanguageBind/Open-Sora-Plan-v1.2.0/vae \
--enable_tiling \
--tile_overlap_factor 0.125 \
--save_memory \
--ms_checkpoint /path/to/ms/checkpoint
Running this command will generate reconstructed videos under the given output_generated_video_dir. You can then evaluate common metrics (e.g., SSIM, PSNR) using the scripts under opensora/eval/script.
Here, we report the training performance and evaluation results on the UCF-101 dataset. Experiments are tested on Ascend 910* with MindSpore 2.3.1 graph mode.
model name | cards | batch size | resolution | precision | discriminator | sink | recompute | jit level | graph compile | s/step | img/s | psnr | ssim | recipe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CausalVAE | 8 | 1 | 25x256x256 | BF16 | FALSE | OFF | OFF | O0 | 3 mins | 4.21 | 47.51 | 28.92 | 0.87 | train |
CausalVAE | 8 | 1 | 25x256x256 | FP32 | TRUE | OFF | ON | O0 | 3 mins | 5.45 | 36.70 | 29.28 | 0.88 | train |
Step 1: Downloading Datasets:
The Open-Sora-Dataset-v1.2.0 contains annotation json files, which are listed below:
Panda70M_HQ1M.json
Panda70M_HQ6M.json
sam_image_11185255_resolution.json
v1.1.0_HQ_part1.json
v1.1.0_HQ_part2.json
v1.1.0_HQ_part3.json
Please check the readme doc for details of these annotation files. Open-Sora-Dataset-v1.2.0 contains the Panda70M (training full), SAM, and the data from Open-Sora-Dataset-v1.1.0. You can follow the instructions below on how to download Open-Sora-Dataset-v1.1.0.
How to download Open-Sora-Dataset-v1.1.0?
The Open-Sora-Dataset-v1.1.0 includes three image-text datasets and three video-text datasets. As reported in Report v1.1.0, the three image-text datasets are:
Name | Image Source | Text Captioner | Num pair |
---|---|---|---|
SAM-11M | SAM | LLaVA | 11,185,255 |
Anytext-3M-en | Anytext | InternVL-1.5 | 1,886,137 |
Human-160k | Laion | InternVL-1.5 | 162,094 |
The three video-text datasets are:
Name | Hours | Num frames | Num pair |
---|---|---|---|
Mixkit | 42.0h | 65 | 54,735 |
Mixkit | | 513 | 1,997 |
Pixabay | 353.3h | 65 | 601,513 |
Pixabay | | 513 | 51,483 |
Pexel | 2561.9h | 65 | 3,832,666 |
Pexel | | 513 | 271,782 |
Each video-text dataset has two annotation json files. For example, the mixkit dataset has video_mixkit_65f_54735.json (54,735 video-text pairs with 65 frames) and video_mixkit_513f_1997.json (1,997 video-text pairs with 513 frames). Each annotation item contains path corresponding to the video path, cap corresponding to the caption, and frame_idx corresponding to the frame index range. An example annotation json file is shown below:
[
{
"path": "Fish/mixkit-multicolored-coral-shot-with-fish-projections-4020.mp4",
"frame_idx": "0:513",
"cap": "The video presents a continuous exploration of a vibrant underwater coral environment,...",
}
...
]
To prepare the training datasets, please first download the video and image datasets in Open-Sora-Dataset-v1.1.0. We give a tutorial on how to download these datasets. See downloading tutorial.
You need to download at least one video dataset and one image dataset to enable video-image joint training. After downloading all datasets, you can place the images/videos under the folder datasets, which looks like:
datasets/
├───images/ # Human-160k
├───anytext3m/ # Anytext-3M-en
├───sam/ # SAM-11M
├───pixabay_v2/ # Pixabay
├───pexels/ # Pexel
└───mixkit/ # Mixkit
You can place the json files under the folder anno_jsons. The folder structure is:
anno_jsons/
├───video_pixabay_65f_601513.json
├───video_pixabay_513f_51483.json
├───video_pexel_65f_3832666.json
├───video_pexel_513f_271782.json
├───video_mixkit_65f_54735.json
├───video_mixkit_513f_1997.json
├───human_images_162094.json
├───anytext_en_1886137.json
└───sam_image_11185255.json
Step 2: Extracting Embedding Cache:
Next, please extract the text embeddings and save them to disk for training acceleration. For each json file, you need to run the following command accordingly and save the T5 embedding cache to the output_path:
python opensora/sample/sample_text_embed.py \
--data_file_path /path/to/caption.json \
--output_path /path/to/text_embed_folder
The text embeddings are extracted and saved under the specified output_path.
Step 3: Revising the Paths:
After extracting the embedding cache, you will have the following three paths ready:
images/videos path: e.g., datasets/panda70m/
text embedding path: e.g., datasets/panda70m_emb-len=512/
annotation json path: e.g., datasets/anno_jsons/Panda70M_HQ1M.json
In the dataset file, for example scripts/train_data/merge_data.txt, each line represents one dataset and includes three paths: the images/videos folder, the text embedding cache folder, and the path to the annotation json file. Please revise them according to the paths on your disk.
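For illustration, a merge_data.txt entry might look like the sketch below, assuming comma-separated paths and the folder layout shown above; please follow the format of the provided scripts/train_data/merge_data.txt:
datasets/mixkit,datasets/mixkit_emb-len=512,anno_jsons/video_mixkit_65f_54735.json
datasets/images,datasets/images_emb-len=512,anno_jsons/human_images_162094.json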
The training scripts are stored under scripts/text_condition. The single-device training scripts under the single-device folder are for demonstration. We recommend using the parallel training scripts under the multi-devices folder. Here we take one training script (train_video3d_nx480p_zero2.sh) as an example and explain the meanings of some experimental arguments.
Here is the major command of the training script:
NUM_FRAME=29
python opensora/train/train_t2v_diffusers.py \
--data "scripts/train_data/merge_data.txt" \
--num_frames ${NUM_FRAME} \
--max_height 480 \
--max_width 640 \
--attention_mode xformers \
--gradient_checkpointing \
--pretrained "path/to/ms-or-safetensors-ckpt/from/last/stage" \
--parallel_mode "zero" \
--zero_stage 2 \
# pass other arguments
Some key arguments are explained below:
- data: the text file of the video/image dataset. The text file should contain N lines corresponding to N datasets. Each line should have two or three items. If two items are available, they correspond to the video folder and the annotation json file. If three items are available, they correspond to the video folder, the text embedding cache folder, and the annotation json file.
- num_frames: the number of frames of each video sample.
- max_height and max_width: the maximum frame height and width.
- attention_mode: the attention mode, chosen from math or xformers. Note that we are not using the actual xformers library to accelerate training, but the MindSpore-native FlashAttentionScore. The xformers option is kept for compatibility and may be discarded in the future.
- gradient_checkpointing: refers to the MindSpore recomputation feature, which can save memory by recomputing the intermediate activations in the backward pass.
- pretrained: the pretrained checkpoint to be loaded as initial weights before training. If not provided, OpenSoraT2V uses random initialization. If provided, the path should be either a safetensors checkpoint directory or file path, e.g., "LanguageBind/Open-Sora-Plan-v1.2.0/1x480p" or "LanguageBind/Open-Sora-Plan-v1.2.0/1x480p/diffusion_pytorch_model.safetensors", or a MindSpore checkpoint path, e.g., "t2i-image3d-1x480p/ckpt/OpenSoraT2V-ROPE-L-122.ckpt".
- parallel_mode: the parallelism mode chosen from ["data", "optim", "zero"], which denote data parallelism, optimizer parallelism, and DeepSpeed-like ZeRO parallelism.
- zero_stage: the ZeRO stage, supporting stages 0, 1, 2, and 3, effective when parallel_mode is "zero".
For the stage 4 (29x720p) and stage 5 (93x720p) training scripts, please refer to train_video3d_29x720p_zero2_sp.sh and train_video3d_93x720p_zero2_sp.sh.
We also support running validation during training. This is enabled by editing the training script as follows:
- --data "scripts/train_data/merge_data.txt" \
+ --data "scripts/train_data/merge_data_train.txt" \
+ --val_data "scripts/train_data/merge_data_val.txt" \
+ --validate True \
+ --val_batch_size 1 \
+ --val_interval 1 \
These edits compute the loss on the validation set specified by merge_data_val.txt every epoch (defined by val_interval). merge_data_val.txt has the same format as merge_data_train.txt, but specifies a different subset from the training set. The validation loss will be recorded in result_val.log under the output directory. For an example training script, please refer to train_video3d_29x720p_zero2_sp_val.sh under scripts/text_condition/multi-devices/.
We also support training with sequence parallelism and zero2 parallelism together. This is enabled by setting --sp_size and --train_sp_batch_size. For example, with sp_size=8 and train_sp_batch_size=4, 2 NPUs are used for a single video sample. See train_video3d_29x720p_zero2_sp.sh under scripts/text_condition/multi-devices/ for detailed usage.
To align with the hyper-parameters of Open-Sora-Plan v1.2.0, we use the same learning rate (LR). If it does not fit your setup (for example, if training is unstable), you can consider the following adjustments:
- You can lower your LR or increase the effective batch size, for example, by increasing gradient_accumulation_steps or running multi-machine training.
- You can try a different LR scheduler, for example, you can change the current constant LR scheduler to polynomial decay by:
- --lr_scheduler="constant" \
+ --lr_scheduler="polynomial_decay" \
+ --lr_decay_steps=1000000 \
These edits set the polynomial_decay LR scheduler and decay the start LR to the end LR over 1,000,000 steps. You can adjust lr_decay_steps based on your max_train_steps. See other LR scheduler options in mindone/trainers/lr_schedule.py.
The training performance is tested on Ascend 910* with MindSpore 2.3.1 graph mode.
model name | cards | stage | batch size | num frames | resolution | graph compile | parallelism | recompute | sink | jit level | s/step | img/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|
OpenSoraT2V-ROPE-L-122 | 8 | 2 | 8 | 1 | 640x480 | 3mins | zero2 | ON | ON | O0 | 2.35 | 27.23 |
OpenSoraT2V-ROPE-L-122 | 8 | 3 | 1 | 29 | 640x480 | 6mins | zero2 | ON | ON | O0 | 3.68 | 63.04 |
OpenSoraT2V-ROPE-L-122 | 8 | 4 | 1 | 29 | 1280x720 | 10mins | zero2 + SP(sp_size=8) | OFF | ON | O0 | 4.32 | 6.71 |
OpenSoraT2V-ROPE-L-122 | 8 | 5 | 1 | 93 | 1280x720 | 15mins | zero2 + SP(sp_size=8) | ON | ON | O0 | 24.40 | 3.81 |
SP: sequence parallelism.
Stage refers to the multi-stage training as illustrated above.
batch size: the local batch size for a single card.
- Latte: The main codebase we built upon; a wonderful video generation model.
- PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- DiT: Scalable Diffusion Models with Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.