Hunyuan Video is a local video model that turns a text description into a video. But what if you want to turn an image into a video? You need an image-to-video model, but Hunyuan has not released one (that would be game-changing). For now, you can use the Image-Prompt to Video (IP2V) workflow to achieve a similar effect.
This workflow converts an image and a prompt to a video. For example, you can give a background image, add a person with the prompt, and generate a video like the one below.
Software
We will use ComfyUI, an alternative to AUTOMATIC1111. You can use it on Windows, Mac, or Google Colab. If you prefer using a ComfyUI service, Think Diffusion offers our readers an extra 20% credit.
Read the ComfyUI beginner’s guide if you are new to ComfyUI. See the Quick Start Guide if you are new to AI images and videos.
Take the ComfyUI course to learn how to use ComfyUI step-by-step.
How does it work?
This Hunyuan IP2V workflow uses the Hunyuan Video text-to-video model. How does it use the image? Hunyuan preprocesses the prompt using a Large Language and Vision Assistant (LLaVA) model, which reads both text and images. This workflow taps into the unused power of Hunyuan's text encoder to read an image that supplements the prompt.
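To make this concrete, here is a minimal sketch of how a LLaVA-style vision-language encoder reads a prompt that contains an <image> token. The model id, image file, and prompt are assumptions for illustration; the ComfyUI-HunyuanVideoWrapper nodes perform the equivalent steps internally and may differ in detail.

```python
# Minimal sketch: a LLaVA-style encoder reading text and an image together.
# The model id, image path, and prompt are assumptions for illustration.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-llama-3-8b-v1_1-transformers"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("tunnel.png")  # hypothetical background image
prompt = "A woman walking towards the camera, <image>, camera zooming in."

# The processor replaces <image> with image tokens derived from the picture,
# so the encoder's output describes both the text and the scene in the image.
inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
text_image_embedding = outputs.hidden_states[-1]  # conditioning for the video model
```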
Difference between img2vid and IP2V
An image-to-video workflow uses the input image as the first frame of the video. The image-prompt-to-video workflow uses an image as part of the prompt. It uses the image concept but does not use it as the first frame.
Use cases
Let’s test the Hunyuan IP2V workflow to see what it is good at.
Use an Image as the background
Let’s use the following image of a tunnel as the background and add a person to the video with the prompt.
Prompt:
A fashionable beautiful woman with long blonde hair, black short skirt, white blouse, high heel, walking towards the camera, <image>, camera zooming in.
Note that you need to insert the image token <image> into the prompt to specify where the image prompt should be placed.
Here’s the output video.
The Hunyuan IP2V workflow does a good job of generating a person walking in a tunnel. I did not mention the tunnel in the prompt. The workflow uses the visual LLM to parse and put the background image into the video.
You may have noticed that the tunnels in the image and the video look alike but are not identical. This is an essential point in mastering the IP2V workflow. The visual LLM reads the image and converts it to image tokens. These tokens describe the scene and, in turn, influence the video. Unlike an IP-adapter, it does not replicate the image in the video.
Animate an image
Hunyuan Video is an excellent tool for content creators. Its exceptional video quality has great potential for generating B-rolls, the supplementary footage to the main video.
Royalty-free B-roll footage exists on royalty-free video sites, but the selection is limited compared to royalty-free images. Why not use the Hunyuan IP2V workflow to animate a royalty-free image and make a unique B-roll?
Suppose you are making a video about financial planning. The following image is an excellent fit for a B-roll.
Use this image as the input and only the image token as the text prompt.
<image>
You get a B-roll!
Step-by-step guide
Step 0: Update ComfyUI
Before loading the workflow, make sure your ComfyUI is up-to-date. The easiest way to do this is to use ComfyUI Manager.
Click the Manager button on the top toolbar.
Select Update ComfyUI.
Restart ComfyUI.
Step 1: Download video model
Download the hunyuan_video_FastVideo_720_fp8_e4m3fn.safetensors and put it in ComfyUI > models > diffusion_models.
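If you prefer scripting the download, here is a sketch using huggingface_hub. The repo_id is an assumption; use whichever repository the download link above points to, and adjust the local path to your install.

```python
# Sketch: fetch the fp8 video model into the ComfyUI models folder.
# The repo_id is an assumption; the local_dir assumes a standard ComfyUI layout.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Kijai/HunyuanVideo_comfy",  # assumed hosting repo
    filename="hunyuan_video_FastVideo_720_fp8_e4m3fn.safetensors",
    local_dir="ComfyUI/models/diffusion_models",
)
```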
Step 2: Download VAE
Download hunyuan_video_vae_bf16.safetensors.
This VAE file differs from the one released by ComfyUI that is used in the text-to-video tutorial. Rename it to hunyuan_video_vae_bf16-kj.safetensors.
Put it in ComfyUI > models > vae.
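If you would rather rename and move the file from Python than from the file explorer, here is a minimal sketch. The source and destination paths are examples; adjust them to your install.

```python
# Sketch: rename the downloaded VAE so it does not clash with the ComfyUI-org
# VAE, then move it into ComfyUI > models > vae. Paths are examples.
import shutil
from pathlib import Path

downloads = Path.home() / "Downloads"          # where the browser saved the file
vae_dir = Path("ComfyUI/models/vae")
vae_dir.mkdir(parents=True, exist_ok=True)

shutil.move(
    downloads / "hunyuan_video_vae_bf16.safetensors",
    vae_dir / "hunyuan_video_vae_bf16-kj.safetensors",
)
```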
Step 3: Load workflow
Download the Hunyuan video workflow JSON file below.
Drag and drop it into ComfyUI.
Step 4: Install missing nodes
If you see red blocks, you don’t have the custom node that this workflow needs.
Click Manager > Install missing custom nodes and install the missing nodes.
Restart ComfyUI.
Step 5: Run the workflow
Upload the following image to the Load Image node.
Click the Queue button to generate the video.
Running the workflow for the first time takes time because it will download some model files.
Adjusting the image prompt
Downsampling image tokens
How much the image influences the video is controlled by image_token_selection_expr. A value of ::4 downsamples the image tokens so that only one out of every four is used. Use a higher value to reduce the influence of the image, e.g., ::8 or ::16.
Increasing the downsampling factor to ::16 changes the background to an outdoor, run-down building with graffiti. The tunnel becomes a similar hallway. The woman is controlled by the prompt, so she is still wearing the same outfit.
Increasing the downsampling to ::256 eliminates the tunnel pathway. She is walking in an open space in a run-down building, but the graffiti is still everywhere.
Other options for passing the image tokens
The llava-llama-3 model has 576 image tokens. Instead of downsampling, you can experiment with passing only a portion of the tokens.
– :128 – First 128 tokens.
– -128: – Last 128 tokens.
– :128, -128: – First 128 tokens and last 128 tokens.
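These expressions work like Python slices over the 576 image tokens. Below is a rough sketch of how such a selection expression could be applied; the actual parser inside the HunyuanVideoWrapper nodes may differ, but the slicing idea is the same.

```python
# Rough sketch of applying an image_token_selection_expr such as "::4" or
# ":128, -128:" to the 576 LLaVA image tokens. Illustrative only.
import torch

image_tokens = torch.randn(576, 4096)  # 576 tokens; hidden size is illustrative

def select_tokens(tokens: torch.Tensor, expr: str) -> torch.Tensor:
    """Apply comma-separated Python-style slices, e.g. "::4" or ":128, -128:"."""
    parts = []
    for piece in expr.split(","):
        fields = [int(f) if f.strip() else None for f in piece.split(":")]
        start = fields[0]
        stop = fields[1] if len(fields) > 1 else None
        step = fields[2] if len(fields) > 2 else None
        parts.append(tokens[slice(start, stop, step)])
    return torch.cat(parts, dim=0)

print(select_tokens(image_tokens, "::4").shape)          # 144 tokens: one out of every four
print(select_tokens(image_tokens, "::16").shape)         # 36 tokens: weaker image influence
print(select_tokens(image_tokens, ":128, -128:").shape)  # 256 tokens: first and last 128
```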
Getting an error. The file path exists, but the file is not in the folder.
– **Node ID:** 71
– **Node Type:** DownloadAndLoadHyVideoTextEncoder
– **Exception Type:** FileNotFoundError
– **Exception Message:** No such file or directory: “C:\\directory\\ComfyUI_windows_portable_nvidia_cu118_or_cpu\\ComfyUI_windows_portable\\ComfyUI\\models\\LLM\\llava-llama-3-8b-v1_1-transformers\\model-00001-of-00004.safetensors”
Delete the llava-llama-3-8b-v1_1-transformers folder in the LLM folder and try again.
Hello,
I have this exception during generation :
“Only vision_languague models support image input”
I am using all the specified models and have no clue…
Thank you for your help !
You can try deleting the xtuner text encoder and letting it auto-download again.
Hello,
I have the following issue while trying your workflow :
HyVideoTextImageEncode
## Error Details
– **Node ID:** 73
– **Node Type:** HyVideoTextImageEncode
– **Exception Type:** TypeError
– **Exception Message:** unsupported operand type(s) for //: ‘int’ and ‘NoneType’
Any idea ?
Thanks in advance
See the solution in the other comment.
Thank you for your answer.
I did but nothing better.
Any other idea ?
Can you try reinstalling with "python.exe -m pip install transformers==4.47.0" and then clearing the dependency lib?
What is the recommended VRAM to run this? I’m getting a `torch.OutOfMemoryError` exception in the (Down)load TextEncoder node on 12GB.
You can reduce the video size and the number of frames until it fits. The default setting in the json file uses 20GB VRAM.
Hi! I'm getting an error in the VAE Loader node. The error message fills the whole monitor screen.
You need to download the VAE using the link in this tutorial. The VAE for this custom node is not the same as the one released by the ComfyUI org.
Hi! Please excuse my mistake. I created a Hunyuan folder in the VAE folder. I moved the file to the VAE folder and everything worked!
Thank you. Very helpful tutorial.
Hi! I got this issue:
HyVideoTextImageEncode
unsupported operand type(s) for //: ‘int’ and ‘NoneType’
Same here
This is a known issue: https://github.com/kijai/ComfyUI-HunyuanVideoWrapper/issues/269
Downgrade the transformers library to fix it.
Run “python.exe -m pip install transformers==4.47.0” in the python_embedded folder.
Downgraded to 4.47.0 and still getting the error.
# ComfyUI Error Report
## Error Details
– **Node ID:** 73
– **Node Type:** HyVideoTextImageEncode
– **Exception Type:** TypeError
– **Exception Message:** unsupported operand type(s) for //: ‘int’ and ‘NoneType’
Getting a "'TextEncoder' object has no attribute 'is_fp8'" error. I noticed there is no CLIP connected to the TextImageEncode node. Does it need to have an input?
Thanks for this breakdown!
There's no CLIP connection to the text encoder node. Try Update All in ComfyUI Manager.
File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\transformers\models\llava\processing_llava.py", line 160, in __call__
    num_image_tokens = (height // self.patch_size) * (
TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'
Update:
The problem has gone after I downgraded transformers from 4.48.0 to 4.47.0.
What is the best option currently available if I want to create anime/cartoon videos? Is it text-to-video or image-to-video? Also, which models would be best to achieve that? Any recommended workflow would be much appreciated.
Hunyuan and Mochi are currently the local models with the highest quality. Try prompting the model for the style. If you don't like the results, train LoRAs for Hunyuan to finetune it.
I got this issue: HyVideoTextImageEncode
The input provided to the model are wrong. The number of image tokens is 0 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
You need the <image> token in the prompt.