a tool for leveraging text2image tools such as StableDiffusion for generating videos.
Mainly based on PyTTI and deforum, but adding and focusing around:
- designed around a CLI interface with well-structured and very versatile YAML configuration
- audio-reactivity and other multi-modality to come
- PyTTI style functions for most generative parameters
- extensible design for adding arbitrary mechanisms for generating animations
- prerequisite: Install python3.10+
- create venv:
python3 -m venv venv
- source venv:
source venv/Scripts/activate
(orsource venv/bin/activate
on linux/macOS) - install requirements:
pip install -r requirements.txt
- get the remaining dependencies that don't come with wheels/proper build systems:
./init.sh
(or just copy the fewgit clone
commands from the file on windows)
- use https://github.com/AUTOMATIC1111/stable-diffusion-webui
- add the
--api
startup flag (in webui.sh or webui.bat or however you launch it) to expose the REST API (you can verify this by going to http://localhost:7860/docs and ensure there is a/sdapi/...
endpoint there) - Set your mechanism to
api
and configure thehost
parameter (if running locally it's always just http://localhost:7860) - ensure web-ui is running
- check out the examples or the doc and build your config
- make sure the venv is sourced (see above)
- Run your scenario with
python3 -cp config -cn=yourconfig main.py
where-cp
specifies the path to your config directory and-cn
specifies which config to run.
Once you have generated your frames, pyttv's job is done. Encoding to a video file is done by other tools.
A useful tool to use is Flowframes or similar tools that use RIFE interpolation or other decent interpolation mechanisms.
rife-ncnn-vulkan seems to work on macOS (M1).
If you just want to encode your frames directly to a video file, you can of course use ffmpeg. cd
into your output frame directory and run
cat *.png | ffmpeg -framerate 18 -f image2pipe -i - -c:v libx264 -pix_fmt yuv420p out.mp4
where 18
is to be replaced by your fps of course and out.mp4
is the output filename.
or to encode in a lossless format:
ffmpeg -framerate 18 -i directory\%05d.png -c:v copy output.mkv
ffmpeg -i out.mp4 -i audio.flac out_audio.mp4
Example: Append audio, but the audio gets offset by 55s (-ss
) and the audio duration is 5 seconds:
ffmpeg -i out.mp4 -ss 00:00:55.0 -t 00:00:05.0 -i audio.flac out_audio.mp4
If you created your video at a lower resolution, decent tools are RealESRGAN and UIs for it, e.g. cupscale.
Of course you can also use the WebUI for this, either by running a text conditioned upscale with batch img2img mode + SD upscale script or effectively only running ESRGAN by setting the denoising strength to 0.
If you use an M1 mac, use torch_device: cpu
in your configs. unfortunately the depth model currently does not work directly on the mps device.