Flux Hunyuan Text-to-Video workflow (ComfyUI)


This workflow combines an image generation model (Flux) with a video generation model (Hunyuan). Here’s how it works:

  1. Generates an AI image using Flux
  2. Automatically inputs the AI image into Hunyuan, the video generator
    • The AI image becomes the first frame of the video

Benefits:

  1. Higher quality. Compared to the LTX model, the Flux and Hunyuan combination generates higher-quality videos.
  2. Faster to use. By merging the two separate processes of text-to-image and image-to-video, this workflow lets you accomplish both with a single click.

You need to be a member of this site to download the ComfyUI workflow.

Sci-fi spaceship generated using the Flux Hunyuan workflow

Software

We will use ComfyUI, an alternative to AUTOMATIC1111.

Read the ComfyUI installation guide and ComfyUI beginner’s guide if you are new to ComfyUI.

Take the ComfyUI course to learn ComfyUI step-by-step.

Workflow overview

  1. Generate an image with the Flux.1 Dev model
  2. Generate a video with the Hunyuan model

Step-by-step guide

Step 0: Update ComfyUI

Before loading the workflow, make sure your ComfyUI is up to date. The easiest way to do this is to use ComfyUI Manager.

Click the Manager button on the top toolbar.

Select Update ComfyUI.

Restart ComfyUI.

Step 1: Download the Flux AI model

Download the Flux1 dev FP8 checkpoint.

Put the model file in the folder ComfyUI > models > checkpoints.
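
If you prefer to script the download, below is a minimal sketch using the huggingface_hub library. The repo ID and filename are assumptions (based on the Comfy-Org repackaged release); adjust them to match your actual download link, and point local_dir at your own ComfyUI folder.

    # Sketch: fetch the Flux1 dev FP8 checkpoint with huggingface_hub.
    # repo_id and filename are assumptions; verify them against your download source.
    from huggingface_hub import hf_hub_download

    hf_hub_download(
        repo_id="Comfy-Org/flux1-dev",           # assumed repository
        filename="flux1-dev-fp8.safetensors",    # assumed filename
        local_dir="ComfyUI/models/checkpoints",  # adjust to your ComfyUI install path
    )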

Note: If you use my ComfyUI Colab Notebook, you don’t need to download the Flux model. Simply select the Flux1_dev model.

Step 2: Download the Hunyuan-Video model

You may already have some of these models if you have installed the Hunyuan Video text-to-video model.

Download hunyuan_video_image_to_video_720p_bf16.safetensors and put it in ComfyUI > models > diffusion_models.

Download clip_l.safetensors and llava_llama3_fp8_scaled.safetensors. Put them in ComfyUI > models > text_encoders.

Download hunyuan_video_vae_bf16.safetensors and put it in ComfyUI > models > vae.

Download llava_llama3_vision.safetensors and put it in ComfyUI > models > clip_vision.
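
If you would rather script these downloads, here is a hedged Python sketch using huggingface_hub. The repository ID and the split_files/ paths are assumptions about the repackaged Comfy-Org release; verify them before running, and set COMFY to your own install path.

    # Sketch: download the Hunyuan Video files into the ComfyUI folders listed above.
    # REPO and the split_files/ paths are assumptions; check the actual repository layout.
    import shutil
    from pathlib import Path
    from huggingface_hub import hf_hub_download

    REPO = "Comfy-Org/HunyuanVideo_repackaged"  # assumed repository
    COMFY = Path("ComfyUI")                     # adjust to your ComfyUI install path

    files = {
        "split_files/diffusion_models/hunyuan_video_image_to_video_720p_bf16.safetensors": "models/diffusion_models",
        "split_files/text_encoders/clip_l.safetensors": "models/text_encoders",
        "split_files/text_encoders/llava_llama3_fp8_scaled.safetensors": "models/text_encoders",
        "split_files/vae/hunyuan_video_vae_bf16.safetensors": "models/vae",
        "split_files/clip_vision/llava_llama3_vision.safetensors": "models/clip_vision",
    }

    for remote, dest in files.items():
        cached = hf_hub_download(repo_id=REPO, filename=remote)  # downloads to the HF cache
        target = COMFY / dest
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(cached, target / Path(remote).name)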

Google Colab

If you use my ComfyUI Colab notebook, you don’t need to install the model files. They will be downloaded automatically.

Select the HunyuanVideo models before starting the notebook.

In the top menu, select Runtime > Change runtime type > L4 GPU, then save the settings.

Google Colab: change the runtime type to L4 GPU.

Step 3: Load the workflow

Download the ComfyUI JSON workflow below.

Become a member of this site to see this content


Drag and drop the JSON file to ComfyUI.
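
If you want a quick sanity check before loading it, you can inspect the JSON in Python. This sketch simply lists the node types in the workflow; "flux_hunyuan_t2v.json" is a placeholder for whatever filename you saved the download under.

    # Sketch: list the node types in the downloaded workflow file.
    # "flux_hunyuan_t2v.json" is a placeholder filename.
    import json

    with open("flux_hunyuan_t2v.json") as f:
        workflow = json.load(f)

    # A UI-format ComfyUI workflow stores its nodes under the "nodes" key.
    for node in workflow.get("nodes", []):
        print(node.get("type"))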

Step 4: Install missing nodes

Click Manager > Install Missing Custom Nodes.

Install the nodes that are missing.

Restart ComfyUI.

Step 5: Revise the prompt

The prompt controls both the Flux image and the Hunyuan video. Change it to what you want to generate.

Step 6: Run the workflow

Press the Run button to generate a video.
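
If you prefer to queue jobs from a script, ComfyUI also exposes an HTTP API. Note that the API expects the workflow exported in API format (use ComfyUI's API export option), not the UI JSON you dragged in. A minimal sketch, assuming ComfyUI is running locally on the default port 8188 and the API export is saved as flux_hunyuan_api.json (a placeholder name):

    # Sketch: queue the workflow through ComfyUI's /prompt endpoint.
    # Assumes a local server on port 8188 and an API-format export named
    # "flux_hunyuan_api.json" (placeholder filename).
    import json
    import urllib.request

    with open("flux_hunyuan_api.json") as f:
        prompt = json.load(f)

    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())  # the response includes the queued prompt_id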

Usage tips

It is best to treat the video generation as a 2-step process.

  1. Refine the prompt to generate a good image.
  2. Change the video’s seed to refine the video.

You can use the Fast Groups Muter node to turn off video generation by muting the Hunyuan Video group, as shown below.

Revise the prompt and change the seed to get a good image.

When you are happy with the image, turn the Hunyuan Video group back on.
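
If you drive the workflow through the API as sketched above, the same two-step idea can be automated: keep the prompt and the image seed fixed, and re-queue with different video seeds. The node ID and input name below are assumptions; look them up in your own API-format export. Because ComfyUI caches node outputs that have not changed, only the video stage is recomputed on each re-queue.

    # Sketch: sweep the video seed while keeping the Flux image fixed.
    # VIDEO_SAMPLER_ID and the "seed" input name are assumptions; check your
    # exported API JSON for the real node id and input name (it may be "noise_seed").
    import json
    import random
    import urllib.request

    with open("flux_hunyuan_api.json") as f:  # placeholder filename
        prompt = json.load(f)

    VIDEO_SAMPLER_ID = "42"  # assumed node id of the Hunyuan video sampler

    for _ in range(4):
        prompt[VIDEO_SAMPLER_ID]["inputs"]["seed"] = random.randint(0, 2**32 - 1)
        payload = json.dumps({"prompt": prompt}).encode("utf-8")
        req = urllib.request.Request(
            "http://127.0.0.1:8188/prompt",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).close()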

Happy AI video generating!!!

By Sage

Sage is a highly accomplished programmer who has won awards for software combining LLMs, APIs, and BERT models. Sage is also a musician, philosopher, writer, and reader in his spare time.

3 comments

  1. Thanks, Sage, that fixed it.
    I added a Comfyroll Aspect Ratio node to the Flux image section to make selection easier and a TeaCache node to speed it up further.
    With 3.2x acceleration selected on the video TeaCache and the Colab A100 GPU, I'm getting generation times of just over a minute for a 3-second video, which is amazing!
    I guess the model might be based on the 720p broadcast standard and I found 30fps gives smoother and more natural motion with people. Going up to 60 fps made it worse, however.
    I discovered by accident that with a landscape format image, selecting width 720 and crop and fill rather than keep proportions in the resizer gives a closeup of the middle of the scene in portrait format. Not sure why this is but it has great detail and realistic motion.
    I’ve found the TTP video sampler node loads ok if the nightly version is selected and it is a bit quicker than the other one.
    Does the FluxGuidance in the video section need to be 6? I tried 3.5 as for the image generation and that seemed to work ok too.
    Slightly off topic but does the Wan2.1 workflow also need the resizer to be set at multiples of 16?

  2. Thanks, Sage, for making this complementary workflow to Andrew’s Wan2.1 insect one. The quality seems about the same and the generation time in Hunyuan is a bit quicker. But I couldn’t get it to do anything other than square videos – using the landscape or portrait image dimensions gave this error on the video sampler (I tried both versions):
    shape ‘[1, 12, 30, 45, 16, 1, 2, 2]’ is invalid for input of size 1071360
    Can you suggest what is going wrong?
    Thanks

    1. Hey, thanks for leaving a comment about that bug! It’s now fixed and you can download the updated JSON file. Specifically, the problem was with the “Image Resize” node. Initially, the dimensions were resized to be multiples of 8, but HunYuan’s video model only accepts dimensions which are multiples of 16, so we changed the value accordingly. Thanks for catching that!
