TeaCache: 2x speed up in ComfyUI

Do you wish AI would run faster on your PC? TeaCache can speed up diffusion models with a negligible change in quality. It is simple to set up and is available in ComfyUI for image generation with Flux and video generation with Hunyuan Video.

In this tutorial, I will cover:

  • How TeaCache works
  • TeaCache workflow for Flux (2x speed-up)
  • TeaCache workflow for Hunyuan Video (2x speed-up)

What is TeaCache?

Timestep Embedding Aware Cache (TeaCache) speeds up the sampling steps of Diffusion Transformer (DiT) models, the architecture most new diffusion models use (e.g., Flux and Hunyuan Video).

Unlike distillation, TeaCache doesn’t require training a new model. So, all your LoRAs and ControlNets are compatible with this speed-up.

Distilled models are faster because they use fewer sampling steps, which significantly reduces quality. In contrast, TeaCache uses the same number of steps to preserve quality. You can adjust the caching threshold to attain a significant speed-up without quality loss.

This flexibility in trading quality for speed is what makes it shine. Companies can use this simple method to cut inference costs in half. Local users can fill up their local storage with those very useful AI videos twice as quickly…

How does TeaCache work?

TeaCache takes advantage of the observation that some neural network blocks don’t do much during sampling. Researchers have recognized that diffusion models generate image outlines in the initial sampling steps and fill in details in the late steps.

Diffusion models generate the image outline in the initial steps and details in the later steps. (Image: Chen et al.)

A Diffusion Transformer model is made of a series of transformer blocks. The front transformer blocks draw the outlines, and the rear transformer blocks draw the details. Since the outline doesn’t change in the later steps, the delta-DiT method cleverly uses the cached outputs of the front blocks instead of calculating them every time, speeding up the sampling steps.

TeaCache goes one step further: it intelligently determines when to use the cache during sampling. It reuses the cached output when the current input is similar to the one that produced the cache, and only recomputes when the input becomes substantially different. You control how often the cache is refreshed with a threshold value.
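
In pseudocode, the decision looks roughly like the sketch below. This is a simplified illustration, not the actual implementation: dit_blocks, x_mod, and the state fields are placeholder names, and the real method also rescales the distance with a fitted polynomial before accumulating it.

```python
# Minimal sketch of TeaCache's caching decision (illustrative only).
import numpy as np

class TeaCacheState:
    def __init__(self, threshold=0.4):
        self.threshold = threshold      # the "delta" knob discussed below
        self.prev_mod_input = None      # modulated input at the last full pass
        self.cached_residual = None     # (output - input) from the last full pass
        self.accumulated_distance = 0.0

def step(state, x, x_mod, dit_blocks):
    """Return the transformer-block output for one sampling step, reusing the cache when possible."""
    if state.prev_mod_input is not None:
        # Relative L1 change of the timestep-modulated input since the last full pass.
        change = np.abs(x_mod - state.prev_mod_input).mean() / (np.abs(state.prev_mod_input).mean() + 1e-8)
        state.accumulated_distance += change
    else:
        state.accumulated_distance = np.inf  # first step: always compute

    if state.accumulated_distance < state.threshold and state.cached_residual is not None:
        # Input is still similar enough: skip the transformer blocks and reuse the cache.
        return x + state.cached_residual

    # Input has drifted too far: run the full forward pass and refresh the cache.
    out = dit_blocks(x)
    state.cached_residual = out - x
    state.prev_mod_input = x_mod
    state.accumulated_distance = 0.0
    return out
```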

Using TeaCache with Flux

It all sounds great on paper. Let’s see how it does on a consumer-grade GPU card in real life.

The threshold parameter delta controls the caching. Setting it to 0 turns caching off. A higher delta value refreshes the cache less frequently and makes sampling faster.

delta threshold | Image generation time
0 (off)         | 13.5 sec
0.2             | 11.7 sec
0.4             | 7.4 sec
0.6             | 6.0 sec
0.8             | 5.3 sec
1.0             | 4.0 sec

Generation times for one Flux.1 Dev image on an RTX 4090.

Setting the threshold to 0.2 – 0.4 results in minimal change in the image. Interestingly, most changes are in the background instead of the subject.

It starts to show artifacts at 0.6 (look at her hands). The image changes quite a lot and becomes incoherent at 1.0.

Setting the threshold to 0.4 is a safe starting point. It speeds up generation by 2x without quality loss.
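
If you prefer to sweep the threshold programmatically instead of editing the node by hand, something like the sketch below works against ComfyUI's HTTP API, once you have exported the workflow from the steps below with Save (API Format). The filename, node id, and rel_l1_thresh input key are assumptions; check the TeaCache node's entry in your own export for the real names.

```python
# Rough sketch: queue the same workflow at several TeaCache thresholds via ComfyUI's HTTP API.
import copy
import json
import urllib.request

with open("flux_teacache_api.json") as f:          # hypothetical filename of your API-format export
    base_workflow = json.load(f)

for threshold in (0.0, 0.2, 0.4):
    wf = copy.deepcopy(base_workflow)
    wf["10"]["inputs"]["rel_l1_thresh"] = threshold  # placeholder node id and input key
    payload = json.dumps({"prompt": wf}).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",              # ComfyUI's default address
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
    print(f"Queued generation with threshold {threshold}")
```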

Follow the steps below to use TeaCache with Flux on ComfyUI.

Step 1: Download the Flux AI model

Download the Flux1 dev FP8 checkpoint.

Put the model file in the folder ComfyUI > models > checkpoints.
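
If ComfyUI doesn't list the model in the Load Checkpoint node, a quick sanity check like the one below can confirm the file is in the expected folder. The install path and filename are assumptions; adjust them to match your setup.

```python
# Quick check that the Flux checkpoint is where ComfyUI expects it.
from pathlib import Path

COMFYUI_ROOT = Path.home() / "ComfyUI"   # assumption: default install location
checkpoint = COMFYUI_ROOT / "models" / "checkpoints" / "flux1-dev-fp8.safetensors"  # adjust filename

if checkpoint.exists():
    print(f"Found {checkpoint} ({checkpoint.stat().st_size / 1e9:.1f} GB)")
else:
    print(f"Missing: {checkpoint}")
```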

Step 2: Load Flux TeaCache workflow

Download the Flux1 TeaCache workflow JSON file below.

Drop it into ComfyUI.

Step 3: Install missing nodes

This workflow uses the ComfyUI-TeaCache custom node. If you see some red blocks, you don’t have the custom node that this workflow needs.

Click Manager > Install missing custom nodes and install the missing nodes.

Restart ComfyUI.

Step 4: Generate an image

Update ComfyUI and reload the page if you see red boxes.

Press Queue Prompt to generate an image.

Flux image with TeaCache

Using TeaCache with Hunyuan Video

The popular ComfyUI-HunyuanVideoWrapper custom nodes support TeaCache natively. You only need to connect the HunyuanVideo TeaCache node to the HunyuanVideo Sampler node, as shown below.
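
In an API-format export, that connection looks roughly like the fragment below. The node ids, class_type strings, and input names are illustrative placeholders, not the wrapper's exact identifiers; inspect your own export for the real ones.

```python
# Illustrative fragment of an API-format workflow: the TeaCache node feeding the sampler.
workflow_fragment = {
    "12": {
        "class_type": "HunyuanVideo TeaCache",   # placeholder class name
        "inputs": {"rel_l1_thresh": 0.15},       # placeholder parameter name
    },
    "13": {
        "class_type": "HunyuanVideo Sampler",    # placeholder class name
        "inputs": {
            "teacache_args": ["12", 0],          # output 0 of the TeaCache node above
            # plus model, embeds, steps, seed, etc.
        },
    },
}
```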

Or you can use my workflow below. You can expect to cut down the generation time by half.

Threshold | Video generation time
0 (off)   | 140 secs
0.15      | 76 secs

Step 0: Update ComfyUI

Before loading the workflow, make sure your ComfyUI is up-to-date. The easiest way to do this is to use ComfyUI Manager.

Click the Manager button on the top toolbar.

Select Update ComfyUI.

Restart ComfyUI.

Step 1: Download video model

Download hunyuan_video_FastVideo_720_fp8_e4m3fn.safetensors and put it in ComfyUI > models > diffusion_models.

Step 2: Download VAE

Download hunyuan_video_vae_bf16.safetensors.

This VAE file differs from the one released by ComfyUI, which is used in the text-to-video tutorial. Rename it to hunyuan_video_vae_bf16-kj.safetensors to keep the two files distinct.

Put it in ComfyUI > models > vae.

Step 3: Load workflow

Download the Hunyuan video TeaCache workflow JSON file below.

Drop it into ComfyUI.

Step 4: Install missing nodes

If you see red blocks, you don’t have the custom node that this workflow needs.

Click Manager > Install missing custom nodes and install the missing nodes.

Restart ComfyUI.

Step 5: Run the workflow

Click the Queue button to generate the video.

Tip: Change the seed value in the HunyuanVideo Sampler node if you get a subpar video.

By Andrew

Andrew is an experienced software engineer with a specialization in Machine Learning and Artificial Intelligence. He is passionate about programming, art, and education. He has a doctorate degree in engineering.

1 comment

  1. Hi Andrew,
    On Colab L4, I found that the generation time of your 832×1216 image dropped from 47 secs without TeaCache to 31 secs at a threshold of 0.4, 26 secs at 0.5, and 24 secs at 0.6. It was only at 0.6 that I started to see the image change (the skulls on her jacket were distorted), and that was only on some images, which were otherwise still usable. At 0.7 (21 secs), hands were distorted and the images weren't usable. So your finding of halving the generation time was borne out by my test on Google Colab's GPUs. Thanks for bringing this really useful tool to us!
