How to generate OmniHuman-1 lip sync video


OmniHuman-1 is a human video generation model that can generate a lip sync video from a single image and an audio clip. The motion is highly realistic and matches the voice.

OmniHuman-1 is currently available only through an online service. In this tutorial, I will show you how to access the service and generate a lip sync video for free.

OmniHuman-1 video. (OmniHuman-lab)

OmniHuman-1 model

OmniHuman model architecture
OmniHuman-1 training and model architecture. (OmniHuman-lab)

OmniHuman-1 is built around a unified Diffusion Transformer backbone that learns to turn a static reference image and motion cues (such as audio) into a realistic human video. It starts from a pre-trained text-to-video model (Seaweed/MMDiT) and simultaneously handles multiple types of conditioning signals.

OmniHuman-1 integrates appearance, lip movement, gestures, and full-body motion in a single end-to-end network.

It employs an innovative three-stage “omni-conditions” training:

  1. Learning general motions from text
  2. Refining lip sync and head movements from audio
  3. Mastering full-body dynamics from pose information

This design allows the model to leverage massive datasets, resulting in lifelike videos that sync speech, gestures, and object interactions.
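The staged training described above can be sketched in code. The sketch below is a simplified, hypothetical illustration of how a three-stage "omni-conditions" schedule might gate which conditioning signals reach the model at each stage; the stage contents follow the list above, but all names and the filtering logic are my assumptions, not the published OmniHuman-1 implementation.

```python
# Hypothetical sketch of a three-stage "omni-conditions" training schedule.
# Stage contents follow the OmniHuman-1 paper's description; everything
# else (names, structure) is illustrative, not the authors' code.

STAGES = {
    1: {"text"},           # stage 1: learn general motions from text
    2: {"audio"},          # stage 2: refine lip sync / head motion from audio
    3: {"pose"},           # stage 3: master full-body dynamics from pose
}

def active_conditions(stage: int) -> set:
    """Conditions enabled up to and including the given stage."""
    enabled = set()
    for s in range(1, stage + 1):
        enabled |= STAGES[s]
    return enabled

def filter_sample(stage: int, sample: dict) -> dict:
    """Drop conditioning inputs not yet introduced at this stage.
    The diffusion transformer would train on the filtered sample."""
    allowed = active_conditions(stage)
    return {k: v for k, v in sample.items()
            if k in allowed or k == "video"}
```

For example, a training sample carrying text, audio, and pose signals would be reduced to text-only conditioning in stage 1, gain audio in stage 2, and use all three in stage 3.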

OmniHuman-1 lip sync videos

The advantages of the OmniHuman-1 model are:

  • Realistic human motion matching the audio input – speech, singing, etc.
  • Arbitrary video length.
  • Animating diverse image styles – realistic photo, anime, painting, etc.

Generate OmniHuman-1 videos

The OmniHuman-1 model is currently available on CapCut’s Dreamina video generation service.

Step 1: Access the video service

Visit Dreamina from CapCut.

Select Lip Sync under the AI Avatar generator.

You can create or use an existing CapCut account to log in.

You should have enough free credits to generate a lip sync video.

Step 2: Upload an image

Upload an image you want to animate to the Character image canvas.

Step 3: Enter the lip sync speech

Enter the speech you want the character to say. For example:

I thought a thought. But the thought I thought wasn’t the thought I thought I thought. If the thought I thought I thought had been the thought I thought, I wouldn’t have thought I thought.

Step 4: Select a voice

Choose a voice you like under text-to-speech.

Step 5: Generate a video

Click Generate.


By Andrew

Andrew is an experienced software engineer with a specialization in Machine Learning and Artificial Intelligence. He is passionate about programming, art, and education. He has a doctorate degree in engineering.
