TAEHV is a Tiny AutoEncoder for Hunyuan Video (& Wan 2.1). TAEHV can decode latents into videos more cheaply (in time & memory) than the full-size VAEs, at the cost of slightly lower quality.
Here's a comparison of the output & memory usage of the Full Hunyuan VAE vs. TAEHV:
See the profiling notebook for details on this comparison or the example notebook for a simpler demo.
Since Wan 2.1 uses the same input / output shapes as the Hunyuan VAE, you can also use TAEHV for Wan 2.1 decoding with the taew2_1.pth weights (see the Wan 2.1 example notebook).
For CogVideoX, try the taecvx.pth weights (see the example notebook).
You can disable temporal or spatial upscaling to get even cheaper decoding:

```python
# skip temporal upscaling, keep full spatial resolution
TAEHV(decoder_time_upscale=(False, False), decoder_space_upscale=(True, True, True))
# skip both temporal and spatial upscaling (cheapest)
TAEHV(decoder_time_upscale=(False, False), decoder_space_upscale=(False, False, False))
```
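To estimate what each flag costs you in output size, here's a small shape calculator. This is a hypothetical sketch, not code from the repo: it assumes each enabled stage doubles the corresponding dimension (so all-True matches the Hunyuan VAE's 8x spatial / 4x temporal compression) and the causal first-frame convention t_out = f * (t_in - 1) + 1; check taehv.py for the exact behavior.

```python
# Hypothetical helper: estimate the decoded video shape from TAEHV's
# upscaling flags. Assumes each enabled stage doubles its dimension and the
# causal frame convention t_out = f * (t_in - 1) + 1 (both assumptions,
# not verified against taehv.py).
def decoded_shape(t, h, w, time_upscale=(True, True), space_upscale=(True, True, True)):
    tf = 2 ** sum(time_upscale)   # temporal upscale factor (up to 4x)
    sf = 2 ** sum(space_upscale)  # spatial upscale factor (up to 8x)
    return (tf * (t - 1) + 1, h * sf, w * sf)

# Full decoding: 4 latent frames at 30x52 -> 13 frames at 240x416
print(decoded_shape(4, 30, 52))                               # (13, 240, 416)
# No temporal upscaling: same resolution, only 4 output frames
print(decoded_shape(4, 30, 52, time_upscale=(False, False)))  # (4, 240, 416)
```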
If you have a powerful GPU or are decoding at a reduced resolution, you can also set parallel=True in TAEHV.decode_video to decode all frames at once, which is faster but requires more memory.
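As a rough illustration of that tradeoff, here's a toy model (not TAEHV's actual code; decode_frame is a hypothetical stand-in for the per-frame decoder work): batching all frames maximizes parallelism but keeps every frame's activations resident at once, while sequential decoding holds only one frame at a time.

```python
# Toy sketch of the parallel-vs-sequential memory tradeoff.
# decode_frame is a hypothetical placeholder, NOT TAEHV's real decoder.
def decode_frame(latent):
    return [2 * x for x in latent]  # stand-in for per-frame decoder work

def decode_video(latents, parallel=False):
    if parallel:
        # one big batch: fast, but all frames' activations live at once
        frames = [decode_frame(z) for z in latents]
        peak_frames_resident = len(latents)
    else:
        # frame by frame: slower, but only one frame resident at a time
        frames = []
        for z in latents:
            frames.append(decode_frame(z))
        peak_frames_resident = 1
    return frames, peak_frames_resident

latents = [[1, 2], [3, 4], [5, 6]]
fast, peak_fast = decode_video(latents, parallel=True)
slow, peak_slow = decode_video(latents, parallel=False)
print(fast == slow, peak_fast, peak_slow)  # True 3 1
```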
TAEHV is still pretty experimental (specifically, it's a hacky finetune of TAEM1 :) using a fairly limited dataset) and I haven't tested it much yet. Please report quality / performance issues as you discover them.