Here we provide an efficient MindSpore implementation of Open-Sora-Plan from Peking University. We would like to express our gratitude for their contributions! 👍
OpenSora-PKU is still under active development. Currently, we are aligned with Open-Sora-Plan v1.3.0 (commit id).
Official News from OpenSora-PKU | MindSpore Support |
---|---|
[2024.10.16] 🎉 PKU released version 1.3.0, featuring: WFVAE, prompt refiner, data filtering strategy, sparse attention, and bucket training strategy. They also support 93x480p within 24G VRAM. More details can be found at their latest report. | ✅ V.1.3.0 WFVAE and OpenSoraT2V: inference, multi-stage & multi-devices training |
[2024.07.24] 🔥🔥🔥 PKU launched Open-Sora Plan v1.2.0, utilizing a 3D full attention architecture instead of 2+1D. See their latest report. | ✅ V.1.2.0 CausalVAE inference & OpenSoraT2V multi-stage training |
[2024.05.27] 🚀🚀🚀 PKU launched Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out their latest report. | ✅ V.1.1.0 CausalVAE inference and LatteT2V inference & three-stage training (65x512x512 , 221x512x512 , 513x512x512 ) |
[2024.04.09] 🚀 PKU shared the latest exploration on metamorphic time-lapse video generation: MagicTime, and the training dataset (still updating): Open-Sora-Dataset. | N.A. |
[2024.04.07] 🔥🔥🔥 PKU released Open-Sora-Plan v1.0.0. See their report. | ✅ CausalVAE+LatteT2V+T5 inference and three-stage training (17×256×256 , 65×256×256 , 65x512x512 ) |
[2024.03.27] 🚀🚀🚀 PKU released the report of VideoCausalVAE, which supports both images and videos. | ✅ CausalVAE training and inference |
[2024.03.10] 🚀🚀🚀 PKU supports training with a latent size of 225×90×90 (t×h×w), which corresponds to training on 1 minute of 1080P video at 30 FPS (with 2× frame interpolation and 2× super resolution) under class conditions. | Frame interpolation and super-resolution are under development. |
[2024.03.08] PKU supports the training code of text-conditioned generation with 16 frames of 512x512. | ✅ CausalVAE+LatteT2V+T5 training (16x512x512 ) |
[2024.03.07] PKU supports training with 128 frames (about 13 seconds when the sample rate is 3) of 256x256, or 64 frames (about 6 seconds) of 512x512. | Class-conditioned training is under development. |
mindspore | ascend driver | firmware | cann toolkit/kernel |
---|---|---|---|
2.3.1 | 24.1RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |
The following videos are generated based on MindSpore and Ascend 910*.
93x352x640 Text-to-Video Generation.
93x352x640 (5.8s) |
---|
![]() |
A litter of golden retriever puppies playing in the snow... |
93x352x640 (5.8s) |
---|
![]() |
An extreme close-up of an gray-haired man with a beard in his 60s... |
93x352x640 (5.8s) |
---|
![]() |
Drone view of waves crashing against the rugged cliffs along Big Sur's garay point beach... |
Videos are saved to .gif for display.
- 📍 Open-Sora-Plan v1.3.0 with the following features
- ✅ WFVAE inference & multi-stage training.
- ✅ mT5-xxl TextEncoder model inference.
- ✅ Prompt Refiner Inference.
- ✅ Text-to-video generation up to 93 frames and 640x640 resolution.
- ✅ Multi-stage training using Zero2 and sequence parallelism.
- ✅ Acceleration methods: flash attention, recompute (gradient checkpointing), mixed precision, data parallelism, etc.
- ✅ Evaluation metrics: PSNR and SSIM.
- Image-to-Video model [WIP].
- Scaling model parameters and dataset size [WIP].
- Evaluation of various metrics [WIP].
Your contributions are welcome.
Other useful documents and links are listed below.
- Use Python >= 3.8 [install]
- Please install MindSpore 2.3.1 according to the MindSpore official website and install CANN 8.0.RC2.beta1 as recommended by the official installation website.
- Install the requirements:
pip install -r requirements.txt
In case the decord package is not available, try pip install eva-decord.
For EulerOS, instructions on ffmpeg and decord installation are as follows.
How to install ffmpeg and decord
1. install ffmpeg 4, referring to https://ffmpeg.org/releases
wget https://ffmpeg.org/releases/ffmpeg-4.0.1.tar.bz2 --no-check-certificate
tar -xvf ffmpeg-4.0.1.tar.bz2
mv ffmpeg-4.0.1 ffmpeg
cd ffmpeg
./configure --enable-shared # --enable-shared is needed for sharing libavcodec with decord
make -j 64
make install
2. install decord, referring to https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source
git clone --recursive https://github.com/dmlc/decord
cd decord
rm -rf build && mkdir build && cd build
cmake .. -DUSE_CUDA=0 -DCMAKE_BUILD_TYPE=Release
make -j 64
make install
cd ../python
python3 setup.py install --user
Please download the torch checkpoint of mT5-xxl from google/mt5-xxl, and download the Open-Sora-Plan v1.3.0 model weights from LanguageBind/Open-Sora-Plan-v1.3.0. Place them under examples/opensora_pku as shown below:
mindone/examples/opensora_pku
├───LanguageBind
│ └───Open-Sora-Plan-v1.3.0
│ ├───any93x640x640/
│ ├───any93x640x640_i2v/
│ ├───prompt_refiner/
│ └───vae/
└───google/
└───mt5-xxl/
├───config.json
├───generation_config.json
├───pytorch_model.bin
├───special_tokens_map.json
├───spiece.model
└───tokenizer_config.json
Currently, we can load .safetensors files directly in MindSpore, but not .bin or .ckpt files. We recommend converting the vae/checkpoint.ckpt and mt5-xxl/pytorch_model.bin files to .safetensors files manually by running the following commands:
python tools/model_conversion/convert_wfvae.py --src LanguageBind/Open-Sora-Plan-v1.3.0/vae/merged.ckpt --target LanguageBind/Open-Sora-Plan-v1.3.0/vae/diffusion_pytorch_model.safetensors --config LanguageBind/Open-Sora-Plan-v1.3.0/vae/config.json
python tools/model_conversion/convert_pytorch_ckpt_to_safetensors.py --src google/mt5-xxl/pytorch_model.bin --target google/mt5-xxl/model.safetensors --config google/mt5-xxl/config.json
In addition, please merge the multiple .safetensors files under any93x640x640/ into a merged checkpoint:
python tools/ckpt/merge_safetensors.py -i LanguageBind/Open-Sora-Plan-v1.3.0/any93x640x640/ -o LanguageBind/Open-Sora-Plan-v1.3.0/diffusion_pytorch_model.safetensors -f LanguageBind/Open-Sora-Plan-v1.3.0/any93x640x640/diffusion_pytorch_model.safetensors.index.json
Once the checkpoint files have all been prepared, you can refer to the inference guidance below.
You can run the video-to-video reconstruction task using scripts/causalvae/single-device/rec_video.sh:
python examples/rec_video.py \
--ae "WFVAEModel_D8_4x8x8" \
--ae_path LanguageBind/Open-Sora-Plan-v1.3.0/vae \
--video_path test.mp4 \
--rec_path rec.mp4 \
--device Ascend \
--sample_rate 1 \
--num_frames 65 \
--height 512 \
--width 512 \
--fps 30 \
--enable_tiling
Please change --video_path to an existing video file path and --rec_path to the path for the reconstructed video file. You can set --grid to save the original video and the reconstructed video in the same output file, as sketched below.
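For example, a minimal sketch based on the command above (same paths and arguments, with --grid appended):

```shell
# sketch: same arguments as the command above, plus --grid for a side-by-side comparison
python examples/rec_video.py \
    --ae "WFVAEModel_D8_4x8x8" \
    --ae_path LanguageBind/Open-Sora-Plan-v1.3.0/vae \
    --video_path test.mp4 \
    --rec_path rec.mp4 \
    --num_frames 65 \
    --height 512 \
    --width 512 \
    --enable_tiling \
    --grid
```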
You can also run video reconstruction given an input video folder. See scripts/causalvae/single-device/rec_video_folder.sh.
You can run text-to-video inference on a single Ascend device using the script scripts/text_condition/single-device/sample_t2v_93x640.sh.
# Single NPU
python opensora/sample/sample.py \
--model_path LanguageBind/Open-Sora-Plan-v1.3.0/any93x640x640 \
--version v1_3 \
--num_frames 93 \
--height 352 \
--width 640 \
--text_encoder_name_1 google/mt5-xxl \
--text_prompt examples/sora.txt \
--ae WFVAEModel_D8_4x8x8 \
--ae_path LanguageBind/Open-Sora-Plan-v1.3.0/vae \
--save_img_path "./sample_videos/sora_93x640_mt5" \
--fps 18 \
--guidance_scale 7.5 \
--num_sampling_steps 100 \
--max_sequence_length 512 \
--sample_method EulerAncestralDiscrete \
--seed 1234 \
--num_samples_per_prompt 1 \
--rescale_betas_zero_snr \
--prediction_type "v_prediction" \
--precision bf16
You can change num_frames, height, and width. Note that the DiT model is trained on arbitrary resolutions with a stride of 32, so keep the inference resolution a multiple of 32. num_frames must be of the form 4n+1, e.g., 93, 77, 61, 45, 29, 1. For example, 93x352x640 satisfies both constraints, since 93 = 4×23 + 1 and both 352 and 640 are multiples of 32.
If you want to run multi-device inference using data parallelism, please use scripts/text_condition/multi-devices/sample_t2v_93x640_ddp.sh. The script runs an 8-card inference and saves the log files into "parallel_logs/". --master_port specifies the scheduler binding port number. --worker_num and --local_worker_num should be equal to the number of running devices, e.g., 8.
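For reference, the launch inside the script follows the msrun pattern also used by the sequence-parallel example later in this document. A rough sketch (port, device list, and log directory are illustrative):

```shell
# illustrative 8-card data-parallel launch; sampling arguments are the same as the single-device command above
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
msrun --bind_core=True --worker_num=8 --local_worker_num=8 --master_port=9000 --log_dir="parallel_logs/" \
    opensora/sample/sample.py \
    ... # same sampling arguments as above
```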
In case of the following error:
RuntimeError: Failed to register the compute graph node: 0. Reason: Repeated registration node: 0
please edit master_port to a different port number in the range 1024 to 65535, and run the script again.
See more examples of multi-device inference scripts under scripts/text_condition/multi-devices.
If you want to run T2V inference with the caption refiner, append the following argument to the T2V inference command above:
--caption_refiner "LanguageBind/Open-Sora-Plan-v1.3.0/prompt_refiner/"
If you just want to run prompt refinement, please run:
python opensora/sample/caption_refiner.py
We support running inference with sequence parallelism. Please see sample_t2v_93x640_sp.sh under scripts/text_condition/multi-devices/. The script runs an 8-card inference with sp_size=8, which means each video tensor is sliced into 8 parts along the sequence dimension. If you want to try sp_size=4, you can revise it as below:
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
msrun --bind_core=True --worker_num=4 --local_worker_num=4 --master_port=9000 --log_dir="./sample_videos/sora_93x640_mt5_sp/parallel_logs/" \
opensora/sample/sample.py \
... \
--sp_size 4
Step 1: Downloading Datasets:
To train the causal VAE model, you need to prepare a video dataset. Please download the K400 dataset from this repository (as used in the Arxiv paper) or the UCF101 dataset from the official website (as used in this tutorial).
How to define the train and test set?
If you need to define the video files in the training set, please use a csv file with only one column, like:
"video"
folder_name/video_name_01.mp4
folder_name/video_name_02.mp4
...
Afterwards, you should revise the training script as below:
python opensora/train/train_causalvae.py \
--data_file_path path/to/train_set/csv/file \
--video_column "video" \
--video_path path/to/downloaded/dataset \
# pass other arguments
Similarly, you can create a csv file for the test set videos and pass it to --data_file_path in examples/rec_video_vae.py.
Step 2: Preparing Pretrained Weights:
Open-Sora-Plan-v1.3.0 trains WFVAE in multiple stages. The loss used for the first two stages is a weighted sum of multiple loss terms, whose weights are set by --disc_weight, --kl_weight, and --wavelet_weight. In the third stage, an LPIPS perceptual loss is added to further improve performance; its weight is set by --perceptual_weight. Please see opensora/train/train_causalvae.py for more arguments. These weights are passed as ordinary command-line flags, as sketched below.
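A minimal sketch (the weight values are placeholders, except --perceptual_weight 0.1, which follows the stage-3 row of the table below; check the training scripts for the official settings):

```shell
# sketch only: weight values are placeholders, not the official settings
python opensora/train/train_causalvae.py \
    --disc_weight 0.5 \
    --kl_weight 1e-6 \
    --wavelet_weight 0.1 \
    --perceptual_weight 0.1 \
    # pass other arguments
```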
In order to train the VAE with the LPIPS loss, please also download lpips_vgg-426bf45c.ckpt and put it under pretrained/.
Step 3: Hyper-Parameter Settings
As introduced in the Open-Sora Plan Arxiv paper, the hyper-parameters of each stage are summarized in the following table:
Stage | Resolution | Num of frames | FPS | Batch size | Train Steps | Discriminator | Perceptual weight |
---|---|---|---|---|---|---|---|
1 | 256x256 | 25 | Original fps | 8 | 800K | TRUE | - |
2 | 256x256 | 49 | Original fps / 2 | 8 | 200K | TRUE | - |
3 | 256x256 | 49 | Original fps / 2 | 8 | 200K | TRUE | 0.1 |
See the hyper-parameters in scripts/causalvae/multi-devices/train_stage_x.sh
Note:
- We support resuming training by setting --resume_from_checkpoint True. The same applies to the multi-device training script.
- We also provide a standalone (single-device) training script: scripts/causalvae/single-device/train.sh.
After training, you will find the checkpoint files under the ckpt/ folder of the output directory. To evaluate the reconstruction quality of a checkpoint, you can take scripts/causalvae/single-device/rec_video_folder.sh and revise it as below:
python examples/rec_video_folder.py \
--batch_size 1 \
--real_video_dir datasets/UCF-101/ \
--data_file_path datasets/ucf101_test.csv \
--generated_video_dir recons/ucf101_test/ \
--device Ascend \
--sample_fps 30 \
--sample_rate 1 \
--num_frames 25 \
--height 256 \
--width 256 \
--num_workers 8 \
--ae "WFVAEModel_D8_4x8x8" \
--ae_path LanguageBind/Open-Sora-Plan-v1.3.0/vae \
--ms_checkpoint path/to/ms/ckpt
Running this command will generate reconstructed videos under the given --generated_video_dir. You can then evaluate common metrics (e.g., SSIM, PSNR) using the scripts under opensora/eval/script.
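For illustration only, such an evaluation roughly follows the sketch below. The script name and flags are assumptions modeled after upstream Open-Sora-Plan and may differ in this repository, so please check opensora/eval/script for the exact usage:

```shell
# assumed invocation (script name and flags may differ); verify against the scripts under opensora/eval/script
python opensora/eval/eval_common_metric.py \
    --real_video_dir datasets/UCF-101/ \
    --generated_video_dir recons/ucf101_test/ \
    --num_frames 25 \
    --crop_size 256 \
    --metric ssim psnr
```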
Step 1: Downloading Datasets:
The Open-Sora-Dataset-v1.3.0 dataset is the same as the dataset used in Open-Sora-Dataset-v1.2.0, which contains the annotation json files listed below:
Panda70M_HQ1M.json
Panda70M_HQ6M.json
sam_image_11185255_resolution.json
v1.1.0_HQ_part1.json
v1.1.0_HQ_part2.json
v1.1.0_HQ_part3.json
Please check the readme doc for details of these annotation files. Open-Sora-Dataset-v1.2.0 contains Panda70M (training full), SAM, and the data from Open-Sora-Dataset-v1.1.0. You can follow the instructions below to download Open-Sora-Dataset-v1.1.0.
How to download Open-Sora-Dataset-v1.1.0?
The Open-Sora-Dataset-v1.1.0 includes three image-text datasets and three video-text datasets. As reported in Report v1.1.0, the three image-text datasets are:
Name | Image Source | Text Captioner | Num pair |
---|---|---|---|
SAM-11M | SAM | LLaVA | 11,185,255 |
Anytext-3M-en | Anytext | InternVL-1.5 | 1,886,137 |
Human-160k | Laion | InternVL-1.5 | 162,094 |
The three video-text datasets are:
Name | Hours | Num frames | Num pair |
---|---|---|---|
Mixkit | 42.0h | 65 | 54,735 |
Mixkit | | 513 | 1,997 |
Pixabay | 353.3h | 65 | 601,513 |
Pixabay | | 513 | 51,483 |
Pexel | 2561.9h | 65 | 3,832,666 |
Pexel | | 513 | 271,782 |
Each video-text dataset has two annotation json files. For example, the Mixkit dataset has video_mixkit_65f_54735.json, which contains 54,735 video-text pairs of 65 frames, and video_mixkit_513f_1997.json, which contains 1,997 video-text pairs of 513 frames. Each annotation item has three keys: path, corresponding to the video path; cap, corresponding to the caption; and frame_idx, corresponding to the frame index range. An example annotation json file is shown below:
[
    {
        "path": "Fish/mixkit-multicolored-coral-shot-with-fish-projections-4020.mp4",
        "frame_idx": "0:513",
        "cap": "The video presents a continuous exploration of a vibrant underwater coral environment,..."
    },
    ...
]
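For a quick sanity check of a downloaded annotation file, you can inspect the first entry, e.g. with jq (jq is an optional tool, not a repo requirement; the keys follow the example above):

```shell
# print the path, cap, and frame_idx of the first annotation entry
jq '.[0] | {path, cap, frame_idx}' video_mixkit_513f_1997.json
```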
To prepare the training datasets, please first download the video and image datasets in Open-Sora-Dataset-v1.1.0. We give a tutorial on how to download these datasets. See downloading tutorial.
You need to download at least one video dataset and one image dataset to enable video-image joint training. After downloading all datasets, you can place the images/videos under the folder datasets, which looks like:
datasets/
├───images/ # Human-160k
├───anytext3m/ # Anytext-3M-en
├───sam/ # SAM-11M
├───pixabay_v2/ # Pixabay
├───pexels/ # Pexel
└───mixkit/ # Mixkit
You can place the json files under the folder anno_jsons. The folder structure is:
anno_jsons/
├───video_pixabay_65f_601513.json
├───video_pixabay_513f_51483.json
├───video_pexel_65f_3832666.json
├───video_pexel_513f_271782.json
├───video_mixkit_65f_54735.json
├───video_mixkit_513f_1997.json
├───human_images_162094.json
├───anytext_en_1886137.json
└───sam_image_11185255.json
Step 2: Extracting Embedding Cache:
Next, please extract the text embeddings and save them to disk to accelerate training. For each json file, run the following command and save the mT5-xxl embedding cache in the output_path:
python opensora/sample/sample_text_embed.py \
--data_file_path /path/to/caption.json \
--output_path /path/to/text_embed_folder \
The text embeddings are extracted and saved under the specified output_path.
Step 3: Revising the Paths:
After extracting the embedding cache, you will have the following three paths ready:
- images/videos path: e.g., datasets/panda70m/
- text embedding path: e.g., datasets/panda70m_emb-len=512/
- annotation json path: e.g., datasets/anno_jsons/Panda70M_HQ1M.json
In the dataset file, for example scripts/train_data/merge_data.txt, each line represents one dataset and includes three paths: the images/videos folder, the text embedding cache folder, and the annotation json file. Please revise them according to the paths on your disk, as sketched below.
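A sketch of one line in such a dataset file, using the example paths above (the comma delimiter is an assumption; follow the format of the merge_data.txt shipped in the repository):

```text
datasets/panda70m/,datasets/panda70m_emb-len=512/,datasets/anno_jsons/Panda70M_HQ1M.json
```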
Step 4: Hyper-Parameter Settings
As introduced in the Open-Sora Plan Arxiv paper, the hyper-parameters of each stage are summarized in the following table:
Stage | Resolution | Num of frames | Datasets | Batch size | Train Steps | LR | Attention |
---|---|---|---|---|---|---|---|
1 (T2I) | 256x256 | 1 | SAM, AnyText, Human Images | 1024 | 150K (full-attention) + 100K (skiparse attention) | 2e-5 | Full 3D -> Skiparse |
2 (T2I&T2V) | up to 93×640×640 | 93 | SAM, Panda70M | 1024 | 200K | 2e-5 | Skiparse |
3 (T2V) | 93x352x640 | 93 | filtered Panda70M, high-quality data | 1024 | 100K~200K | 1e-5 | Skiparse |
The training scripts are stored under scripts/text_condition. The single-device training scripts under the single-device folder are for demonstration. We recommend using the parallel training scripts under the multi-devices folder.
Here we take one training script (train_t2i_stage1.sh) as an example and explain the meaning of some experimental arguments.
Here is the major command of the training script:
NUM_FRAME=1
WIDTH=256
HEIGHT=256
python opensora/train/train_t2v_diffusers.py \
--data "scripts/train_data/image_data_v1_2.txt" \
--num_frames ${NUM_FRAME} \
--force_resolution \
--max_height ${HEIGHT} \
--max_width ${WIDTH} \
--gradient_checkpointing \
--pretrained path/to/last/stage/ckpt \
--parallel_mode "zero" \
--zero_stage 2 \
# pass other arguments
There are some arguments related to the training dataset path:
- data: the text file listing the video/image datasets. The text file should contain N lines corresponding to N datasets. Each line has two or three items. With two items, they are the video folder and the annotation json file. With three items, they are the video folder, the text embedding cache folder, and the annotation json file.
- num_frames: the number of frames of each video sample.
- max_height and max_width: the maximum frame height and width.
- force_resolution: whether to train with a fixed resolution or dynamic resolutions. If force_resolution is True, all videos are cropped and resized to the resolution of args.max_height x args.max_width. If force_resolution is False, args.max_hxw must be provided, which determines the maximum token length of each video tensor (see the sketch after this list).
- gradient_checkpointing: refers to the MindSpore recomputation feature, which saves memory by recomputing intermediate activations in the backward pass.
- pretrained: the pretrained checkpoint to load as initial weights before training. If not provided, OpenSoraT2V uses random initialization.
- parallel_mode: the parallelism mode, chosen from ["data", "optim", "zero"], which denotes data parallelism, optimizer parallelism, and DeepSpeed ZeRO-like parallelism, respectively.
- zero_stage: when parallel_mode is "zero", selects the ZeRO stage (0, 1, 2, or 3), similar to DeepSpeed. By default, we use --zero_stage 2 for all training stages.
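As referenced in the force_resolution item above, a dynamic-resolution run might look like the sketch below (the --max_hxw value of 409600, i.e. 640x640, is an illustrative assumption rather than an official setting):

```shell
# dynamic-resolution sketch: drop --force_resolution and provide --max_hxw (the value here is illustrative)
python opensora/train/train_t2v_diffusers.py \
    --data "scripts/train_data/merge_data.txt" \
    --num_frames 93 \
    --max_height 640 \
    --max_width 640 \
    --max_hxw 409600 \
    --gradient_checkpointing \
    --parallel_mode "zero" \
    --zero_stage 2 \
    # pass other arguments
```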
For the stage 2 and stage 3 training scripts, please refer to train_t2v_stage2.sh and train_t2v_stage3.sh.
We also support running validation during training, which is enabled by editing the training script as follows:
- --data "scripts/train_data/merge_data.txt" \
+ --data "scripts/train_data/merge_data_train.txt" \
+ --val_data "scripts/train_data/merge_data_val.txt" \
+ --validate True \
+ --val_batch_size 1 \
+ --val_interval 1 \
These edits compute the loss on the validation set specified by merge_data_val.txt every epoch (defined by val_interval). merge_data_val.txt has the same format as merge_data_train.txt, but specifies a subset different from the training set. The validation loss is recorded in result_val.log under the output directory.
We also support training with sequence parallelism and zero2 parallelism together, which is enabled by setting --sp_size. See train_t2v_stage2.sh under scripts/text_condition/multi-devices/ for detailed usage.
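For a rough idea, the launch follows the same msrun pattern as the sequence-parallel inference example above; the sketch below is illustrative, and train_t2v_stage2.sh remains the reference:

```shell
# illustrative 8-card launch combining sequence parallelism (sp_size=8) with zero2
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
msrun --bind_core=True --worker_num=8 --local_worker_num=8 --master_port=9000 --log_dir="./train_logs/parallel_logs/" \
    opensora/train/train_t2v_diffusers.py \
    ... \
    --sp_size 8 \
    --parallel_mode "zero" \
    --zero_stage 2
```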
We evaluated the training performance on Ascend NPUs. All experiments were run in PyNative mode with MindSpore 2.3.1. The results are as follows.
model name | cards | stage | batch size (global) | video size | parallelism | recompute | data sink | jit level | step time (s) | train imgs/s |
---|---|---|---|---|---|---|---|---|---|---|
OpenSoraT2V_v1_3-2B/122 | 8 | 1 | 32 | 1x256x256 | zero2 | TRUE | FALSE | O0 | 4.37 | 7.32 |
OpenSoraT2V_v1_3-2B/122 | 8 | 2 | 1 | up to 93x640x640 | zero2 + SP(sp_size=8) | TRUE | FALSE | O0 | 22.4* | 4.15 |
OpenSoraT2V_v1_3-2B/122 | 8 | 3 | 8 | 93x352x640 | zero2 | TRUE | FALSE | O0 | 10.30 | 72.23 |
SP: sequence parallelism.
*: dynamic resolution using bucket sampler. The step time may vary across different batches due to the varied resolutions.
train imgs/s: computed as $\text{num frames} \times \text{global batch size} \div \text{per-step time}$.
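For example, for stage 3 in the table above: $93 \times 8 \div 10.30 \approx 72.2$ imgs/s.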
- Latte: The main codebase we built upon; a wonderful video generation model.
- PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- DiT: Scalable Diffusion Models with Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.