How to run Wan 2.1 Video on ComfyUI


Wan 2.1 Video is a series of open foundational video models. It supports a wide range of video-generation tasks. It can turn images or text descriptions into videos at 480p or 720p resolutions.

In this post:

  • An overview of Wan 2.1 models
  • How to use Wan 2.1 image-to-video in ComfyUI
  • How to use Wan 2.1 text-to-video in ComfyUI

Software

We will use ComfyUI, an alternative to AUTOMATIC1111. You can use it on Windows, Mac, or Google Colab. If you prefer using a ComfyUI service, Think Diffusion offers our readers an extra 20% credit.

Read the ComfyUI beginner’s guide if you are new to ComfyUI. See the Quick Start Guide if you are new to AI images and videos.

Take the ComfyUI course to learn how to use ComfyUI step-by-step.

Wan 2.1 sample videos

Unsurprisingly, 720p videos are higher quality than 480p for both image-to-video and text-to-video. However, some 480p videos have glitch artifacts.

720p videos

480p videos

a cat demonstrating kungfu in a traditional chinese temple

What are the Wan 2.1 models?

Released by Wan AI, the Wan 2.1 models are a collection of video models that can turn images or text descriptions into videos.

  • Text-to-video 14B model: Supports both 480p and 720p
  • Image-to-video 14B 720p model: Supports 720p
  • Image-to-video 14B 480p model: Supports 480p
  • Text-to-video 1.3B model: Supports 480p

The most interesting one is the image-to-video 720p model. We badly need a high-quality image-to-video model to pair with a high-quality text-to-image model, such as Flux AI.

Model architecture

Not much technical information was released. I will update this section when they publish more.

Highlights

The bullet points below are taken from the Wan 2.1 GitHub page.

  • State-of-the-art Performance: It is competitive with the best video models, such as Hunyuan Video.
  • Supports consumer-grade GPUs: The smallest model T2V-1.3B requires only 8.19 GB VRAM, making it compatible with almost all GPUs.
  • Multiple Tasks: Supports text-to-video, image-to-video, video editing, text-to-image, and video-to-audio.
  • Visual Text Generation: It can generate Chinese and English text.
  • Powerful Video VAE: The Wan-VAE can encode and decode 1080p videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.
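As a rough sanity check on those VRAM figures, weight memory alone is roughly parameters × bytes per parameter. This is my own back-of-envelope estimate, not a number from the Wan team; activations, the text encoder, and the VAE all add more on top, so treat these as lower bounds.

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return round(params_billion * bytes_per_param, 1)

print(weight_gb(14, 1))   # 14B model stored in fp8 (1 byte/param), ~14 GB
print(weight_gb(1.3, 2))  # 1.3B model stored in bf16 (2 bytes/param), ~2.6 GB
```

The 1.3B model's weights fit in a few GB even at bf16, which is consistent with the 8.19 GB total VRAM figure quoted above once working memory is included.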

Wan 2.1 Image-to-video workflow

This workflow turns an image into a 2.3-second video with 720p resolution (1280 x 720 pixels) in the MP4 format. To use it, you must supply an image and a text prompt.

It takes 23 minutes on my RTX 4090 (24 GB VRAM).

Step 1: Update ComfyUI

Before loading the workflow, make sure your ComfyUI is up-to-date. The easiest way to do this is to use ComfyUI Manager.

Click the Manager button on the top toolbar.

Select Update ComfyUI.

comfyui manager - update comfyui

Restart ComfyUI.

Step 2: Download model files

Download the diffusion model wan2.1_i2v_480p_14B_fp8_e4m3fn.safetensors and put it in ComfyUI > models > diffusion_models.

Download the text encoder model umt5_xxl_fp8_e4m3fn_scaled.safetensors and put it in ComfyUI > models > text_encoders.

Download the CLIP vision model clip_vision_h.safetensors and put it in ComfyUI > models > clip_vision.

Download the Wan VAE model wan_2.1_vae.safetensors and put it in ComfyUI > models > vae.
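If you prefer scripting the downloads over clicking through a browser, the files above can be fetched with huggingface_hub. The repo id comes from this post's reference section; the `split_files/<subfolder>/` in-repo layout is an assumption about the repackaged repo, so check the model file page if a path 404s.

```python
from pathlib import Path

REPO_ID = "Comfy-Org/Wan_2.1_ComfyUI_repackaged"

FILES = {  # filename -> ComfyUI models subfolder, as listed above
    "wan2.1_i2v_480p_14B_fp8_e4m3fn.safetensors": "diffusion_models",
    "umt5_xxl_fp8_e4m3fn_scaled.safetensors": "text_encoders",
    "clip_vision_h.safetensors": "clip_vision",
    "wan_2.1_vae.safetensors": "vae",
}

def target_paths(comfy_root: str) -> dict:
    """Map each filename to its destination under <comfy_root>/models."""
    return {name: Path(comfy_root) / "models" / sub / name
            for name, sub in FILES.items()}

# Actual download (requires `pip install huggingface_hub` and network access).
# Files land under local_dir mirroring the repo path, so move them into the
# folders from target_paths() afterward:
# from huggingface_hub import hf_hub_download
# for name, sub in FILES.items():
#     hf_hub_download(repo_id=REPO_ID,
#                     filename=f"split_files/{sub}/{name}",  # layout assumption
#                     local_dir="downloads")

for path in target_paths("ComfyUI").values():
    print(path)
```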

Step 3: Load the Wan 2.1 img2vid workflow

Download the workflow JSON file below and drop it onto the ComfyUI window to load it.
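If the drop appears to do nothing, you can sanity-check the downloaded file in a few lines: a ComfyUI workflow export is plain JSON containing `nodes` and `links` arrays. The inline example below is a hypothetical minimal fragment, not the real Wan workflow.

```python
import json

# Hypothetical fragment in the shape of a ComfyUI workflow export; a real
# export also carries fields such as last_node_id, groups, and version.
example = '''{
  "nodes": [{"id": 1, "type": "LoadImage"},
            {"id": 2, "type": "CLIPTextEncode"}],
  "links": []
}'''

workflow = json.loads(example)  # raises ValueError if the file is corrupt
node_types = [n["type"] for n in workflow.get("nodes", [])]
print(node_types)  # -> ['LoadImage', 'CLIPTextEncode']
```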

Step 4: Install missing nodes

If you see red blocks, you don’t have the custom nodes that this workflow needs.

Click Manager > Install missing custom nodes and install the missing nodes.

Restart ComfyUI.

Step 5: Set the input image

Upload an image you wish to use as the video’s initial frame. You can download my test image to follow along.

Step 6: Revise the prompt

Revise the positive prompt to describe the video you want to generate. Some tips:

  • Don’t just describe your input image. Describe what the later part of the video should do.
  • Add action words, e.g., laugh, run, fight, etc.
  • You can leave the boilerplate negative prompt unchanged.

Step 7: Generate the video

Click the Queue button to run the workflow.

queue button comfyui

You should get this video.

480p image-to-video workflow

You can use the workflow above to generate 480p videos (848 x 480 pixels).

Download the diffusion model wan2.1_i2v_480p_14B_bf16.safetensors and put it in ComfyUI > models > diffusion_models.

Set the video resolution to something close to 480p:

  • Width: 848
  • Height: 480
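These numbers follow from the 16:9 aspect ratio: 480 × 16⁄9 ≈ 853, and 848 is the closest value divisible by 16. The divisible-by-16 constraint is my assumption about how these models size their latents, not something stated in this post. A quick sketch:

```python
def nearest_width(height: int, aspect: float, step: int = 16) -> int:
    """Closest multiple of `step` to height * aspect."""
    return round(height * aspect / step) * step

print(nearest_width(480, 16 / 9))  # -> 848, the width used above
print(nearest_width(720, 16 / 9))  # -> 1280, the 720p width
```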

Wan 2.1 text-to-video workflow

This workflow turns a text description into a 2.3-second video with 720p resolution (1280 x 720 pixels) in the MP4 format.

Step 1: Update ComfyUI

Before loading the workflow, make sure your ComfyUI is up-to-date. The easiest way to do this is to use ComfyUI Manager.

Click the Manager button on the top toolbar.

Select Update ComfyUI.

comfyui manager - update comfyui

Restart ComfyUI.

Step 2: Download model files

Download the diffusion model wan2.1_t2v_14B_fp8_e4m3fn.safetensors and put it in ComfyUI > models > diffusion_models.

Download the text encoder model umt5_xxl_fp8_e4m3fn_scaled.safetensors and put it in ComfyUI > models > text_encoders.

Download the CLIP vision model clip_vision_h.safetensors and put it in ComfyUI > models > clip_vision.

Download the Wan VAE model wan_2.1_vae.safetensors and put it in ComfyUI > models > vae.

Step 3: Load the Wan 2.1 txt2vid workflow

Download the workflow JSON file below and drop it onto the ComfyUI window to load it.

Step 4: Install missing nodes

If you see red blocks, you don’t have the custom nodes that this workflow needs.

Click Manager > Install missing custom nodes and install the missing nodes.

Restart ComfyUI.

Step 5: Revise the prompt

Revise the positive prompt to describe the video you want to generate.

Step 6: Generate the video

Click the Queue button to run the workflow.

queue button comfyui

You should get a video like this:

a cat demonstrating kungfu in a traditional chinese temple

480p text-to-video workflow

This workflow uses the same model files as the 720p text-to-video workflow above.

a cat demonstrating kungfu in a traditional chinese temple

Fast 480p text-to-video workflow

This workflow uses a smaller diffusion model.

Download the diffusion model wan2.1_t2v_1.3B_bf16.safetensors and put it in ComfyUI > models > diffusion_models.

The result is interesting… but it only takes about a minute to generate.

Reference

ComfyUI Blog: Wan2.1 Video Model Native Support in ComfyUI!

Model file page: Comfy-Org/Wan_2.1_ComfyUI_repackaged · Hugging Face

GitHub page: Wan-Video/Wan2.1: Wan: Open and Advanced Large-Scale Video Generative Models


By Andrew

Andrew is an experienced software engineer with a specialization in Machine Learning and Artificial Intelligence. He is passionate about programming, art, and education. He has a doctorate degree in engineering.
