Z-image Wan 2.2 text-to-video

Published Categorized as Workflow Tagged , , , , , 2 Comments on Z-image Wan 2.2 text-to-video

Z-Image is the leading local model for generating realistic images. The possibility is endless when it is used with the Wan 2.2 model to animate the image.

This workflow delivers just that. The same prompt is used for Z-Image and Wan 2.2 to generate a video clip, with an option to add background audio.

With a single click, this workflow can generate high quality videos from a text prompt.

You must be a member of this site to download the workflow JSON file.

Software

We will use ComfyUI, a free AI image and video generator. You can use it on Windows, Mac, or Google Colab

Think Diffusion provides an online ComfyUI service. They offer an extra 20% credit to our readers.

Read the ComfyUI beginnerโ€™s guide if you are new to ComfyUI. See the Quick Start Guide if you are new to AI images and videos.

Take the ComfyUI course to learn how to use ComfyUI step by step.

Workflow overview

This workflow combines the Z-Image text-to-image model and the Wan 2.2 Image-to-Video model into a single workflow. You only need to provide a text prompt to generate a video, with the option to add a background video.

Image generation

The workflow first generates an image from text using the Z-Image model.

Video generation

The same prompt is fed into the Wan 2.2 Image-to-video model to complete the video generation.

Audio

You can optionally add an audio to this workflow.

Conforming video size

To avoid size mismatch between the image and the video, I used the Get Image Size node to read the size of the generated image and fed it into the Wan 2.2 video latent.

Same prompt for image and video

In order to use the same prompt for generating the image and the video, I used the multi-line text box (which is easier to read than the String Primitive node) from the Was Node Suite.

Note that passing the conditioning from Z-Image to Wan won’t work because they use different CLIP models (text encoders).

Step-by-step guide

Step 1: Install models

Install the Z-Image models:

  • Downloadย ae.safetensorsย and put it inย ComfyUIย >ย modelsย >ย vaeย folder.
  • Downloadย qwen_3_4b.safetensorsย and put it inย ComfyUIย >ย modelsย >ย text_encodersย folder.
  • Downloadย z_image_turbo_bf16.safetensorsย and put it inย ComfyUIย >ย modelsย >ย diffusion_modelsย folder.

Install the Wan 2.2 5B model:

Step 2: Load the workflow

Download the workflow below. Drop it into ComfyUI to load.

Become a member of this site to see this content

Already a member? Log in here.

Step 3: Install missing nodes

If you see nodes with red borders, you don’t have the custom nodes required for this workflow. You should have ComfyUI Manager installed before performing this step.

Click Manager > Install Missing Custom Nodes.

Install the nodes that are missing.

Restart ComfyUI.

Refresh the ComfyUI page.

Step 4: Revise the prompt

Revise the prompt in the Prompt node. It should describe the video. Don’t forget to add some action words.

Step 5: Upload an audio file

Upload an audio file in the Load Audio node for the background of the video.

If you don’t want background audio, you can disable the Audio group in the Fast Group Muter.

Step 6: Run the workflow

Click the Run button to run the workflow.

Tips

Refine the image before generating the video

The video generation takes a long time. Normally, you want to get a good image first before generating a video.

To generate the image only without the video, disable the Wan 2.2 group in the Fast Group Muter.

Change the prompt and the image seed to get a satisfactory image.

Image and video seeds

Change the image seed to generate an image variation.

Change the video seed to generate a video variation.

Video size

The video size is dictated by the image size.

The image size can be adjusted in the Empty SD3 Latent Image node.

The rule of thumb is width times height should be about 1 million pixels.

Video length

The video length, in frames, can be adjusted in the Wan22 Image To Video Latent node.

Andrew

By Andrew

Andrew is an experienced software engineer with a specialization in Machine Learning and Artificial Intelligence. He is passionate about programming, art, and education. He has a doctorate degree in engineering.

2 comments

  1. This is my new favorite workflow! It works SO much better than text to video with Wan2.2 and seems to work faster than first frame last frame. Z-image does really well with fantasy and abstract, which has been a struggle with other models. I’m working with a 4070 and got it pushed to 12 seconds at 768×1344 at 16 frames per second… I might try pushing it further!

Leave a comment

Your email address will not be published. Required fields are marked *