CogVideoX 5B: High-quality local video generator


CogVideoX is a state-of-the-art AI video generator similar to Kling, except that you can generate videos locally on your PC. In this article, you will learn how to use CogVideoX in ComfyUI.

Software

We will use ComfyUI, an alternative to AUTOMATIC1111.

Read the ComfyUI installation guide and ComfyUI beginner’s guide if you are new to ComfyUI. See the Quick Start Guide if you are new to AI images and videos.

Take the ComfyUI course to learn ComfyUI step-by-step.

What is CogVideoX?

CogVideoX is a significant advancement in text-to-video generation. Building on the success of text-to-image models like Stable Diffusion, it is designed specifically to generate coherent, high-quality videos from text prompts.

Model architecture

Here are some notable model design features.

  • CogVideoX uses the large T5 text encoder to convert the text prompt into embeddings, similar to Stable Diffusion 3 and Flux AI.
  • In Stable Diffusion, a VAE compresses an image to and from the latent space. CogVideoX generalizes this idea and uses a 3D causal VAE to compress a video into the latent space.
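
To get a feel for how much a 3D causal VAE shrinks a video, here is a rough sketch of the latent-size arithmetic. The compression factors (4x in time, 8x in height and width) and the 49-frame, 480x720 example clip are assumptions based on my understanding of the published CogVideoX design; they are not stated in this article, and `latent_shape` is a made-up helper name.

```python
def latent_shape(num_frames, height, width, t_factor=4, s_factor=8):
    """Estimate the latent grid (frames, height, width) for a video.

    Assumed behavior: a causal VAE keeps the first frame on its own and
    then compresses every `t_factor` following frames into one latent
    frame, while height and width are each divided by `s_factor`.
    """
    latent_frames = 1 + (num_frames - 1) // t_factor
    return latent_frames, height // s_factor, width // s_factor


if __name__ == "__main__":
    frames, h, w = 49, 480, 720  # assumed example clip size
    lt, lh, lw = latent_shape(frames, h, w)
    ratio = (frames * h * w) / (lt * lh * lw)
    print(f"pixel grid : {frames} x {h} x {w}")
    print(f"latent grid: {lt} x {lh} x {lw}")
    print(f"roughly {ratio:.0f}x fewer positions to denoise")
```

The point of the sketch is only that the diffusion model works on a much smaller grid than the raw video, which is what makes local generation feasible at all.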

Models available

CogVideoX models with 2B and 5B parameters are available. We will use the 5B version in this tutorial for higher-quality videos.

In a dimly lit bar, purplish light bathes the face of a mature man, his eyes blinking thoughtfully as he ponders in close-up, the background artfully blurred to focus on his introspective expression, the ambiance of the bar a mere suggestion of shadows and soft lighting.

How to use CogVideoX in ComfyUI

This workflow was tested on an RTX 4090 GPU. It takes about 15 minutes to generate a video, with peak VRAM usage of 16 GB.

Step 1: Load the CogVideoX workflow

Download the workflow JSON file below and drop it onto the ComfyUI window.
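
If you are curious which nodes a workflow uses before you load it, a small script can list them. This is only a sketch: it assumes the workflow file has a top-level "nodes" list whose entries carry a "type" field, which is how the ComfyUI workflow exports I have seen are laid out, and the file name `cogvideox_workflow.json` is a placeholder, not the actual download name.

```python
import json
from collections import Counter
from pathlib import Path

# Placeholder file name; change it to wherever you saved the workflow.
WORKFLOW_PATH = Path("cogvideox_workflow.json")


def node_types(workflow):
    """Count node types in a workflow dict.

    Assumes a top-level "nodes" list whose items have a "type" key;
    entries without one are skipped.
    """
    nodes = workflow.get("nodes", [])
    return Counter(n["type"] for n in nodes if isinstance(n, dict) and "type" in n)


if __name__ == "__main__":
    if WORKFLOW_PATH.is_file():
        with open(WORKFLOW_PATH, encoding="utf-8") as f:
            counts = node_types(json.load(f))
        for name, count in sorted(counts.items()):
            print(f"{count:3d}  {name}")
    else:
        print(f"{WORKFLOW_PATH} not found; edit WORKFLOW_PATH first")
```

Any node type that your ComfyUI install does not recognize will show up as missing when you load the workflow.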

Step 2: Install missing nodes

You will need the ComfyUI Manager for this step. Follow the link for instructions to install ComfyUI Manager.

Click Manager on the sidebar. Click Install missing custom nodes.

Install the ComfyUI CogVideoX Wrapper.

Restart ComfyUI.

Refresh the ComfyUI page.

Step 3: Download the T5 text encoder

Download the T5 text encoder using the link below. Put it in ComfyUI > models > clip.

t5xxl_fp8_e4m3fn.safetensors
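
Before moving on, you can confirm the file landed in the right place. The sketch below assumes your ComfyUI folder is `./ComfyUI` (change `COMFYUI_ROOT` to match your setup); `find_t5_encoder` is just an illustrative helper name.

```python
from pathlib import Path

COMFYUI_ROOT = Path("ComfyUI")  # assumption: adjust to your install location
T5_FILENAME = "t5xxl_fp8_e4m3fn.safetensors"


def find_t5_encoder(comfyui_root, filename=T5_FILENAME):
    """Return the path to the T5 encoder in models/clip, or None if missing."""
    candidate = Path(comfyui_root) / "models" / "clip" / filename
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_t5_encoder(COMFYUI_ROOT)
    if found is None:
        target = COMFYUI_ROOT / "models" / "clip"
        print(f"Not found: put {T5_FILENAME} in {target}")
    else:
        size_gb = found.stat().st_size / 1e9
        print(f"Found {found} ({size_gb:.1f} GB)")
```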

Step 4: Generate a video

Press Queue Prompt to generate a video.

The 5B CogVideoX model is downloaded automatically the first time you run the workflow. This takes a while, and it may look as if nothing is happening. You can confirm the download is in progress by checking that the folder models > CogVideos keeps getting larger.
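
If you would rather measure the download than guess, a few lines of Python can report the folder size. This is a rough sketch: `MODEL_DIR` is an assumed path built from the folder named above, so adjust it to your install, and `folder_size_bytes` is an illustrative helper.

```python
from pathlib import Path

# Assumption: ComfyUI lives in ./ComfyUI; folder name taken from the text above.
MODEL_DIR = Path("ComfyUI") / "models" / "CogVideos"


def folder_size_bytes(folder):
    """Sum the sizes of all regular files under `folder` (0 if it is missing)."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    total = 0
    for p in folder.rglob("*"):
        try:
            if p.is_file():
                total += p.stat().st_size
        except OSError:
            # A file can disappear mid-scan, e.g. a temp file being renamed.
            pass
    return total


if __name__ == "__main__":
    print(f"{folder_size_bytes(MODEL_DIR) / 1e9:.2f} GB")
```

Run it again after a minute or so. If the number has grown, the download is still making progress.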

After the download is complete, it will start the video generation.


By Andrew

Andrew is an experienced engineer with a specialization in Machine Learning and Artificial Intelligence. He is passionate about programming, art, photography, and education. He has a Ph.D. in engineering.

7 comments

  1. Highly dependent on prompt and skill – and *every* video generator out there is hit or miss as far as what works.

    I’ve used every text-to-vid, img-to-vid and vid-to-vid tool out there that I can find on Github and otherwise and I get mostly crap from all of them, with a couple good ones here and there once you stumble on something that works well.

    CogVideoX is actually pretty phenomenal, even when compared to the more well-known of the bunch like Kling and Runway.
    Not saying it’s better, it isn’t – but it is capable of giving you stuff that is nearly on par – and given the fact we have other AI tools to clean things up, it’s pretty amazing really.

    And don’t forget – this is the worst it’s going to get.
