Hunyuan Video is a new local, open-source video model with exceptional quality. It can generate a short video clip from a text prompt alone in a few minutes. It is ideal for content creators, such as YouTubers, who need B-rolls for their videos.
Below is an example of Hunyuan Video.
In this tutorial, I will show you how to use Hunyuan Video in the following modes.
- Text-to-video
- Text-to-image
Software
We will use ComfyUI, an alternative to AUTOMATIC1111. You can use it on Windows, Mac, or Google Colab. If you prefer using a ComfyUI service, Think Diffusion offers our readers an extra 20% credit.
Read the ComfyUI beginner’s guide if you are new to ComfyUI. See the Quick Start Guide if you are new to AI images and videos.
Take the ComfyUI course to learn how to use ComfyUI step-by-step.
What is Hunyuan video?
Tencent’s HunyuanVideo is an open-source AI model for text-to-video generation, distinguished by several key features and innovations:
- Large model: With 13 billion parameters, HunyuanVideo is the largest open-source text-to-video model. It is larger than Mochi (10 billion), CogVideoX (5 billion), and LTX (0.2 billion).
- Unified Image and Video Generation: HunyuanVideo employs a “dual-stream to single-stream” hybrid transformer model design. In the dual-stream phase, video and text tokens are processed independently, allowing them to learn modulation mechanisms without interference. In the single-stream phase, the model joins the video and text tokens to fuse the information, enhancing the generation of both images and videos.
- Multimodal LLM Text Encoder: Unlike other video models, Hunyuan uses a visual LLM as its text encoder for higher-quality image-text alignment.
- 3D VAE: Hunyuan uses CausalConv3D to compress videos and images into latent space. This compression significantly reduces the resource requirement while maintaining the causal relations in the video (see the sketch after this list).
- Prompt Rewrite Mechanism: To handle variability in user-provided prompts, HunyuanVideo includes a prompt rewrite model fine-tuned from the Hunyuan-Large model. It offers two modes: Normal and Master.
- Camera motion: The model is trained with many camera movements in the text prompt. You can use the following: zoom in, zoom out, pan up, pan down, pan left, pan right, tilt up, tilt down, tilt left, tilt right, around left, around right, static shot, handheld shot.
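To make the causal compression idea concrete, here is a minimal, hypothetical sketch of a causal 3D convolution in PyTorch. It is not Hunyuan's actual CausalConv3D implementation; it only illustrates how padding only the "past" side of the time axis keeps each frame from depending on future frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Illustrative causal 3D convolution (not the official Hunyuan code).

    Temporal padding is applied only before the first frame, so the output
    at frame t depends on frames <= t, never on future frames.
    """
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.pad_t = kernel - 1  # pad only the "past" side of the time axis
        # No temporal padding inside the conv; spatial dims keep same-size padding.
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):  # x: (batch, channels, time, height, width)
        # F.pad order for 5D input: (W_left, W_right, H_left, H_right, T_left, T_right)
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))
        return self.conv(x)

# Quick check: 8 RGB frames of a 64x64 video keep their temporal length.
video = torch.randn(1, 3, 8, 64, 64)
print(CausalConv3d(3, 16)(video).shape)  # torch.Size([1, 16, 8, 64, 64])
```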
Generation time
Hunyuan Video generates an 848 x 480 (480p) video with 73 frames in:
- 4.5 mins on my RTX 4090.
- 11 mins on Google Colab with an L4 runtime.
Hardware requirement
You will need an NVIDIA GPU card to run this workflow. People have reported running Hunyuan Video on ComfyUI with as little as 8 GB of VRAM. The workflows in this tutorial are tested on an RTX 4090 with 24 GB of VRAM.
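If you are unsure what your card offers, the snippet below is a quick, minimal check using PyTorch (which ComfyUI already installs); it prints the GPU name, total VRAM, and whether bf16 is supported.

```python
import torch

# Report the GPU, its total VRAM, and bf16 support before running the workflow.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
    print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA GPU detected. This workflow needs an NVIDIA card.")
```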
Hunyuan Text-to-video workflow
The following workflow generates a Hunyuan video in 480p (848 x 480 pixels) and saves it as an MP4 file.
The instructions below are for local installation. If you use my ComfyUI notebook, select the HunyuanVideo model and switch the runtime type to L4. Jump to Step 4 to load the workflow.
Step 0: Update ComfyUI
Before loading the workflow, make sure your ComfyUI is up-to-date. The easiest way to do this is to use ComfyUI Manager.
Click the Manager button on the top toolbar.
Select Update ComfyUI.
Restart ComfyUI.
Step 1: Download video model
Download the Hunyuan video text-to-video model and put it in ComfyUI > models > diffusion_models.
Step 2: Download text encoders
Download clip_l.safetensors and llava_llama3_fp8_scaled.safetensors.
Put them in ComfyUI > models > text_encoders.
Step 3: Download VAE
Download hunyuan_video_vae_bf16.safetensors and put it in ComfyUI > models > vae.
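If you prefer to script Steps 1 to 3, here is a sketch using the huggingface_hub library. The repository id and file paths are assumptions based on the Comfy-Org repackaged releases, so verify them against the download links above, and point COMFYUI at your own installation.

```python
import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

COMFYUI = Path("ComfyUI")                   # adjust to your ComfyUI install path
REPO = "Comfy-Org/HunyuanVideo_repackaged"  # assumed repo; check the links above

# Remote file -> ComfyUI model subfolder (file names are assumptions).
files = {
    "split_files/diffusion_models/hunyuan_video_t2v_720p_bf16.safetensors": "models/diffusion_models",
    "split_files/text_encoders/clip_l.safetensors": "models/text_encoders",
    "split_files/text_encoders/llava_llama3_fp8_scaled.safetensors": "models/text_encoders",
    "split_files/vae/hunyuan_video_vae_bf16.safetensors": "models/vae",
}

for remote, subdir in files.items():
    cached = hf_hub_download(repo_id=REPO, filename=remote)  # download to the HF cache
    dest = COMFYUI / subdir
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest / Path(remote).name)            # copy into the ComfyUI folder
    print("Placed", dest / Path(remote).name)
```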
Step 4: Load workflow
Download the Hunyuan video workflow JSON file below.
Drop it into ComfyUI.
Step 5: Install missing nodes
If you see red blocks, you don’t have the custom node that this workflow needs.
Click Manager > Install missing custom nodes and install the missing nodes.
Restart ComfyUI.
Step 6: Revise prompt
Revise the prompt to what you want to generate.
Step 7: Generate video
Click the Queue button to generate the video.
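If you would rather queue generations from a script than click the button, ComfyUI also accepts workflows over its local HTTP API. The minimal sketch below assumes ComfyUI is running on the default port 8188 and that you exported the workflow in API format (the filename is hypothetical).

```python
import json
import urllib.request

# Load a workflow that was exported in ComfyUI's API format.
with open("hunyuan_text_to_video_api.json") as f:
    prompt = json.load(f)

# POST it to the local ComfyUI server; a prompt_id is returned on success.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```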
Troubleshooting
RuntimeError: “replication_pad3d_cuda” not implemented for ‘BFloat16’
This error comes from an outdated PyTorch version. This can happen if you have an old ComfyUI installation.
You can see the PyTorch version during startup in the command console. It should be 2.4 or higher.
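You can also check it directly from the Python environment that ComfyUI uses:

```python
import torch

# The troubleshooting step above expects PyTorch 2.4 or higher.
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
```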
If you use the Windows Portable version, you can try updating ComfyUI by double-clicking the file ComfyUI_windows_portable > update > update_comfyui_and_python_dependencies.bat
If it still doesn’t work, you can install a new copy of ComfyUI.
Hunyuan text-to-image workflow
Like Stable Diffusion or Flux, the Hunyuan video model can generate static images. In the workflow, you set the number of frames to 1 and replace the final video-saving node with one that previews or saves an image.
For your convenience, I have done all that, and you can use the following workflow JSON file after following the setup in the text-to-video workflow.
Simply revise the prompt and click Queue.
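If you want to apply the frame-count part of that change to another Hunyuan workflow yourself, the sketch below shows the idea programmatically. The node type and widget order are assumptions about how the workflow JSON is laid out, and the filenames are hypothetical; changing the length field in the ComfyUI interface works just as well.

```python
import json

# Turn a text-to-video workflow into text-to-image by setting the latent
# video length to 1 frame. Assumes the latent node is EmptyHunyuanLatentVideo
# and that its widgets are ordered [width, height, length, batch_size].
with open("hunyuan_text_to_video.json") as f:
    workflow = json.load(f)

for node in workflow.get("nodes", []):
    if node.get("type") == "EmptyHunyuanLatentVideo":
        node["widgets_values"][2] = 1

with open("hunyuan_text_to_image.json", "w") as f:
    json.dump(workflow, f, indent=2)
```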
Can you please share an image-to-video workflow as well?
They didn't release an image-to-video model, but I will cover the IP-adapter workflow (ip2v), which uses an image for conditioning.
If your card supports bf16, you may be able to generate a video with 8 GB of VRAM; if it doesn't, you can't generate even with 12 GB.
That explains why it errors out on a Colab T4 instance.