ControlNet: Control human pose in Stable Diffusion

ControlNet is a Stable Diffusion model that can copy compositions and human poses. Seasoned Stable Diffusion users know how hard it is to generate the exact pose you want. Everything is kind of… random… like diffusion. ControlNet solves this problem. It is powerful and versatile, and you can use it with any Stable Diffusion model.

In this post, you will learn everything you need to know about ControlNet: what it is, how to install and use it in AUTOMATIC1111, what each setting means, and some common use cases.

What is ControlNet?

ControlNet is a modified Stable Diffusion model. The most basic form of Stable Diffusion model is text-to-image. It uses text prompts as the conditioning to steer image generation. ControlNet adds one more conditioning. Let me show you two ControlNet examples: (1) edge detection and (2) human pose detection.

Edge detection

In the workflow illustrated below, ControlNet takes an additional input image and detects its outlines using the Canny edge detector. The detected edges are saved as a control map and then fed into the ControlNet model as extra conditioning, in addition to the text prompt.

Stable Diffusion ControlNet workflow with Canny edge conditioning.

The process of extracting specific information (edges in this case) from the input image is called annotation (in the research article) or preprocessing (in the ControlNet extension).
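
If you are curious what such a control map looks like outside the webui, here is a minimal sketch of Canny preprocessing using OpenCV. The file name and the 100/200 thresholds are placeholders; the extension exposes the same low/high thresholds as sliders.

import cv2
import numpy as np
from PIL import Image

# Load the input image (placeholder file name) and convert it to grayscale.
img = np.array(Image.open("input.png").convert("L"))

# Canny edge detection; 100 and 200 are the low/high thresholds,
# the same knobs the ControlNet extension exposes as sliders.
edges = cv2.Canny(img, 100, 200)

# The control map is a black image with white edges.
Image.fromarray(edges).save("canny_control_map.png")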

Human pose detection

As you may have suspected, edge detection is not the only way an image can be preprocessed. OpenPose is a fast keypoint detection model that can extract human poses, such as the positions of the hands, legs, and head. See the example below.

Input image annotated with human pose detection using Openpose.

Below is the ControlNet workflow using OpenPose. Keypoints are extracted from the input image using OpenPose, and saved as a control map containing the positions of keypoints. It is then fed to Stable Diffusion as an extra conditioning together with the text prompt. Images are generated based on these two conditionings.

What’s the difference between using Canny edge detection and OpenPose? The Canny edge detector extracts the edges of the subject and background alike. It tends to translate the scene more faithfully. You can see that the dancing man became a woman, but the outline and hairstyle were preserved.

OpenPose only detects human keypoints, so the image generation is more liberal but still follows the original pose. In the example above, it generated a woman jumping up with her left foot pointing sideways, different from both the original image and the one in the Canny edge example. The reason is that OpenPose’s keypoint detection does not specify the orientation of the feet.
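
If you want to reproduce this keypoint extraction step outside the webui, here is a hedged sketch using the controlnet_aux helper library (a separate package, not the webui extension; the file name is a placeholder).

from controlnet_aux import OpenposeDetector
from PIL import Image

# Download the pretrained annotator weights hosted under lllyasviel/Annotators.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

# Detect keypoints in the input image and render them as a control map.
input_image = Image.open("dancing_man.png")
pose_map = openpose(input_image)
pose_map.save("openpose_control_map.png")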

ControlNet models available

Now let’s see what ControlNet models can do. You will learn exactly what they do when I explain how to use the ControlNet extension.

OpenPose detector

OpenPose detects human keypoints such as positions of the head, shoulders, hands, etc. It is useful for copying human poses but not other details such as outfits, hairstyles, and backgrounds.

OpenPose ControlNet model.

Canny edge detector

Canny edge detector is a general-purpose, old-school edge detector. It extracts outlines of an image. It is useful for retaining the composition of the original image.

Canny Edge detection (Image courtesy of ControlNet)

Straight line detector

ControlNet can be used with M-LSD (Mobile Line Segment Detection), a fast straight-line detector. It is useful for extracting outlines with straight edges like interior designs, buildings, street scenes, picture frames, and paper edges.

Straight-line detection (Image courtesy of ControlNet)

HED edge detector

HED (Holistically-Nested Edge Detection) is an edge detector good at producing outlines like an actual person would. According to ControlNet’s authors, HED is suitable for recoloring and restyling an image.

HED edge detection (Image courtesy of ControlNet)

Scribbles

ControlNet can also turn your scribbles into an image!

ControlNet Scribbles (Image courtesy of ControlNet)

Other models

Additional models include:

  • Human Pose – Use OpenPose to detect keypoints.
  • Semantic Segmentation – Generate images based on a segmentation map extracted from the input image.
  • Depth Map – Like depth-to-image in Stable diffusion v2, ControlNet can infer a depth map from the input image. ControlNet’s depth map has a higher resolution than Stable Diffusion v2’s.
  • Normal map – A normal map specifies the 3D orientation of an object’s surface. It is a common way to fake fine surface detail on low-polygon models. Of course, all you have is a 2D input image, so the normal map is estimated from the depth map.

Installing Stable Diffusion ControlNet

Let’s walk through how to install ControlNet in AUTOMATIC1111, a popular and full-featured (and free!) Stable Diffusion GUI. I will use this extension, which is the de facto standard, to enable ControlNet.

If you already have ControlNet installed, you can skip to the next section to learn how to use it.

Install ControlNet in Google Colab

It’s easy to use ControlNet with the 1-click Stable Diffusion Colab notebook in our Quick Start Guide.

In the Extensions section of the Colab notebook, check ControlNet.

Press the Play button to start AUTOMATIC1111. That’s it!

Install ControlNet on Windows PC or Mac

You can use ControlNet with AUTOMATIC1111 on Windows PC or Mac. Follow the instructions in these articles to install AUTOMATIC1111 if you have not already done so.

If you already have AUTOMATIC1111 installed, make sure your copy is up-to-date.

Install ControlNet extension (Windows/Mac)

To install the ControlNet extension, go to the Extensions tab and select the Available sub-tab. Press the Load from: button.

In the newly appeared list, find the row with the extension sd-webui-controlnet. Press Install.

Restart AUTOMATIC1111 webui.

If the extension is successfully installed, you will see a new collapsible section in the txt2img tab called ControlNet. It should be right above the Script drop-down menu.

This indicates the extension installation was successful.

Install ControlNet Models (Windows/Mac)

The authors of ControlNet have released a handful of pretrained ControlNet models. They can be found on this page.

The webui community hosts half-precision versions of ControlNet models, which have smaller file sizes. They are faster to download, and easier to store, so why not? You can find them here.

To use these models, download the model files and put them in the model folder of the extension. The path to the model folder is

stable-diffusion-webui/extensions/sd-webui-controlnet/models

You don’t need to download all models. If this is your first time using ControlNet, you can just download the openpose model.
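
If you prefer to script the download, here is a sketch using the huggingface_hub library. The repo id and file name are assumptions based on the half-precision community models mentioned above; verify them against the linked pages before running.

from huggingface_hub import hf_hub_download

# Assumed repo id and file name for the half-precision OpenPose model;
# check the links above for the actual locations.
hf_hub_download(
    repo_id="webui/ControlNet-modules-safetensors",
    filename="control_openpose-fp16.safetensors",
    local_dir="stable-diffusion-webui/extensions/sd-webui-controlnet/models",
)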

Using ControlNet in AUTOMATIC1111

You will learn one simple example of using the extension in this section. You will see a detailed explanation of each setting later.

You should have the ControlNet extension installed to follow this section. You can verify the installation by checking that the ControlNet section appears in the txt2img tab, as shown below.

Press the caret on the right to expand the ControlNet panel. It shows the full section of control knobs and an image upload canvas.

I will use the following image to show you how to use ControlNet. You can download the image using the download button to follow the tutorial.

Text-to-image settings

ControlNet needs to be used with a Stable Diffusion model. In the Stable Diffusion checkpoint dropdown menu, select the model you want to use with ControlNet. Select v1-5-pruned-emaonly.ckpt to use the v1.5 base model.

In the txt2img tab, write a prompt and, optionally, a negative prompt to be used by ControlNet. I will use the prompts below.

Prompt:

full-body, a young female, highlights in hair, dancing outside a restaurant, brown eyes, wearing jeans

Negative prompt:

disfigured, ugly, bad, immature

Set the image size for image generation. I will use width 512 and height 776 for my demo image. Note that the image size is set in the txt2img section, NOT in the ControlNet section.

The GUI should look like below.

ControlNet settings

Now I will talk about what you need to do in the ControlNet panel.

First upload an image to the image canvas.

Check the Enable checkbox.

You will need to select a preprocessor and a model. The preprocessor is just a different name for the annotator mentioned earlier, such as the OpenPose keypoint detector. Let’s select openpose as the Preprocessor.

The selected ControlNet model has to be consistent with the preprocessor. For OpenPose, you should select control_openpose-fp16 as the model.

The ControlNet panel should look like this.

That’s all. Now press Generate to start generating images using ControlNet.

You should see the generated images follow the pose of the input image. The last image is the output of the preprocessing step. In this case, it is the detected keypoints.

These are the basics of using ControlNet! When you are done, uncheck the Enable checkbox to disable the ControlNet extension.
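
If you ever want to script the same text-to-image + OpenPose workflow outside AUTOMATIC1111, the diffusers library provides an equivalent pipeline. The sketch below is an approximation, not what the webui runs internally; the model ids are common Hugging Face repos and may need adjusting, and the control image is assumed to be an OpenPose control map like the one above.

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Commonly used model ids on Hugging Face (treat them as placeholders).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# An OpenPose control map extracted from the input image (placeholder path).
pose_map = load_image("openpose_control_map.png")

image = pipe(
    prompt="full-body, a young female, highlights in hair, dancing outside a restaurant, brown eyes, wearing jeans",
    negative_prompt="disfigured, ugly, bad, immature",
    image=pose_map,
    width=512,
    height=776,
    num_inference_steps=20,
).images[0]
image.save("controlnet_result.png")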

ALL ControlNet settings explained

There are a lot of settings in the ControlNet extension! It can be a bit intimidating when you first use it, but let’s go through them one by one.

It’s going to be a deep dive. Take a break and go to the bathroom if you need to…

Input controls

Image Canvas: You can drag and drop the input image here. You can also click on the canvas and select a file using the file browser. The input image will be processed by the selected preprocessor in the Preprocessor dropdown menu. A control map will be created.

Pro tip: Use Ctrl-V (Windows) or Cmd-V (Mac) to paste an image to the ControlNet image canvas.

Camera icon: Take a picture using your device’s camera and use it as the input image. You will need to grant permission to your browser to access the camera.

Enable: Whether to enable ControlNet.

Invert input color: Swap black and white. It can be used when you upload a scribble, for example. ControlNet expects a black background with white scribbles. You must use this option if you create a scribble using external software with a white background. You don’t need to use this option if you create a scribble using ControlNet’s interface.

RGB to BGR: This changes the order of the color channels of the uploaded image, or the order of the coordinates of an uploaded normal map. You don’t need to check this box if you upload an image and use a preprocessor.

Low VRAM: For GPUs with less than 8GB of VRAM. It is an experimental feature. Check it if you are running out of GPU memory, or if you want to increase the number of images processed at a time.

Guess Mode: Also known as non-prompt mode. The image generation can be totally unguided by the text prompt. It forces the ControlNet encoder to follow the input control map (depth, edges, etc.) even if there is no prompt. Use more steps, e.g. 50, when using this mode. You normally don’t need to check this box.

Preprocessor and model

Preprocessor: The preprocessor (called annotator in the research article) for preprocessing the input image, such as detecting edges, depth, and normal maps. None uses the input image as the control map.

Model: ControlNet model to use. If you have selected a preprocessor, you would normally select the corresponding model. The ControlNet model is used together with the Stable Diffusion model selected at the top of AUTOMATIC1111 GUI.

Weight and guidance strength

I will use the following image to illustrate the effect of weight and guidance strength. It’s an image of a girl sitting down.

But in the prompt, I will ask to generate a woman standing up.

full body, a young female, highlights in hair, standing outside restaurant, blue eyes, wearing a dress, side light

Weight: How much emphasis to give the control map relative to the prompt. It is similar to keyword weight in the prompt but applies to the control map.

The following images are generated using ControlNet OpenPose preprocessor, and with the OpenPose model.

As you can see, ControlNet weight controls how much the control map is followed relative to the prompt. The lower the weight, the less ControlNet demands the image to follow the control map.

Guidance strength: This controls for how many of the sampling steps ControlNet is applied. It is analogous to denoising strength in image-to-image. If the guidance strength is 1, ControlNet is applied to 100% of the sampling steps. If the guidance strength is 0.7 and you are doing 50 steps, ControlNet is applied to the first 70% of the sampling steps, i.e. the first 35 steps.

Below are generated with guidance strength ranging from 0.1 to 1 and keeping weight as 1.

Since the initial steps set the global composition (the sampler removes the largest amount of noise in the early steps, starting from a random tensor in latent space), the pose is already set even if you only apply ControlNet to the first 20% of the sampling steps. Guidance strength needs to be set pretty low to have a visible effect.
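
For reference, in the diffusers pipeline sketched earlier, the closest analogues of these two sliders are the controlnet_conditioning_scale and control_guidance_end parameters. The mapping to the extension’s sliders is an approximation; pipe and pose_map are reused from that earlier sketch.

# Approximate analogues of Weight and Guidance strength in diffusers.
image = pipe(
    prompt="full body, a young female, highlights in hair, standing outside restaurant, blue eyes, wearing a dress, side light",
    image=pose_map,
    controlnet_conditioning_scale=0.7,  # ~ Weight: emphasis given to the control map
    control_guidance_end=0.7,           # ~ Guidance strength: ControlNet on the first 70% of steps
    num_inference_steps=50,             # so ControlNet acts on the first 35 steps
).images[0]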

Resize mode

Resize mode controls what to do when the size of the input image or control map is different from the size of the images to be generated. You don’t need to worry about these options if they have the same aspect ratio.

I will demonstrate the effect of resize modes by setting text-to-image to generate a landscape image, while the input image/control map is a portrait image.

Envelope (Outer Fit): Fits the image canvas to be within the control map. Crop the control map so that it is the same size as the canvas.

Because the control map is cropped at the top and the bottom, so is our girl.

Scale to Fit: Fit the whole control map to the image canvas. Extend the control map with empty values so that it is the same size as the image canvas.

Compared to the original input image, there is more empty space on the sides.

Just Resize: Scale the width and height of the control map independently to fit the image canvas. This will change the aspect ratio of the control map.

The girl now needs to lean forward to stay within the canvas. You can create some interesting effects with this mode.
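
The three resize modes roughly correspond to crop-to-fill, letterbox, and stretch. Here is a rough Pillow sketch of what each does to the control map (not the extension’s actual code; the file name and sizes are placeholders).

from PIL import Image, ImageOps

control_map = Image.open("control_map.png")  # portrait control map (placeholder)
target = (768, 512)                          # landscape generation size

# Envelope (Outer Fit): scale to cover the canvas, then crop the overflow.
envelope = ImageOps.fit(control_map, target)

# Scale to Fit: scale to fit inside the canvas, pad the rest with empty values.
scale_to_fit = ImageOps.pad(control_map, target, color=0)

# Just Resize: stretch width and height independently (aspect ratio changes).
just_resize = control_map.resize(target)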

Scribble canvas settings

You only need to worry about these settings if you use the ControlNet GUI for creating scribbles. These settings have no effect when you upload an input image and use a preprocessor.

Canvas Width and Canvas Height are the width and height of the blank canvas you create when you press the Create blank canvas button.

The up/down arrow button swaps the height and the width. It is useful when you confused the two and adjusted the height instead of the width (as I often do…).

Preview annotation

The ControlNet extension will show you a copy of the control map after each round of image generation. But sometimes you just want to see the control map and experiment with the parameters.

Preview annotator result: Generate control map based on the preprocessor setting. A control map will show right next to the input image.

Hide annotator result: Hide the control map.

Preprocessors

I will again use the same input image to demonstrate the control map produced with different preprocessors.

Canny

Canny is an old-school edge detector. It does a fair job of extracting both curved and straight lines, but it can also be susceptible to noise.

Control map generated by Canny edge detector.

The result is very decent, with the bokeh in the background nicely reproduced.

Depth

A depth map is useful for conveying how far away the objects in an image are. White means closer and dark means further away. This function is similar to depth-to-image in Stable Diffusion v2 and uses the same MiDaS depth estimator.

To be used with the control_depth model.

Depth control map.

The shape of the girl is well preserved.

Midas resolution slider: Changes the width of the control map. It is a way to control the resolution of the control map relative to the size of the input image.
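
Outside the webui, a depth control map can be estimated with an off-the-shelf depth model. Here is a hedged sketch using the transformers depth-estimation pipeline; the model id is one common MiDaS/DPT-family choice, not necessarily the exact annotator the extension bundles.

from transformers import pipeline
from PIL import Image

# A DPT/MiDaS-family depth estimator (illustrative model id).
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

result = depth_estimator(Image.open("input.png"))  # placeholder file name
result["depth"].save("depth_control_map.png")      # brighter = closer, darker = farther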

Depth LeRes

LeRes is a more recent depth estimation model designed for reconstructing a 3D scene from a single image. LeRes generally recovers more details than MiDaS and offers more parameters for tweaking the control map.

To be used with the control_depth model.

Depth LeRes

This is one of those times when a depth map with too much detail fails. Look at the legs and compare with the depth map above.

HED

HED (Holistically-Nested Edge Detection) is an edge detection AI model good at producing outlines from an image like a human would. It is functionally similar to Canny but less noisy.

To be used with control_hed model.

HED

I’m not entirely sure why, but the HED model produces artificial colorings like this. Perhaps it’s a bug.

mlsd

M-LSD (Mobile Line Segment Detection) is a fast straight-line detector. It is useful for extracting outlines with straight edges like interior designs, buildings, street scenes, picture frames, and paper edges.

To be used with control_mlsd model.

However, don’t use mlsd with portraits or subjects with curved lines. It won’t be able to extract them.

mlsd fails to extract the outline of the girl.

The outline of the girl didn’t transfer well, as expected from the mlsd control map.

Normal map

A normal map specifies the orientation of the normal vector of the surface that each pixel rests on (a normal is the vector perpendicular to a surface, used to represent its orientation). It is commonly used to fake texture or depth in 3D models. The RGB values of each pixel store the vector components instead of color.

Of course, there’s no 3D model here. ControlNet only gets a 2D image, so the normal map is estimated from the depth map.

Use with control_normal model.

Normal control map.

The 3D orientation of the input image is transferred faithfully to the new image with the normal map.
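
Since the normal map is estimated from the depth map, the core idea can be sketched with image gradients: the x/y gradients of the depth give the surface slope, and the normal is the unit vector built from them. This is a simplified illustration, not the annotator’s actual code; the input file is a placeholder.

import numpy as np
from PIL import Image

# Load an estimated depth map (placeholder file) as a float array.
depth = np.array(Image.open("depth_control_map.png").convert("L"), dtype=np.float32)

# Surface slope in x and y from the depth gradients.
dy, dx = np.gradient(depth)

# Build unit normal vectors; the constant z controls how flat the map looks.
z = np.full_like(depth, 10.0)
normal = np.stack([-dx, -dy, z], axis=-1)
normal /= np.linalg.norm(normal, axis=-1, keepdims=True)

# Map the [-1, 1] vector components to [0, 255] RGB values.
normal_rgb = ((normal + 1.0) * 0.5 * 255.0).astype(np.uint8)
Image.fromarray(normal_rgb).save("normal_control_map.png")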

OpenPose

OpenPose is a real-time human keypoint detection model. This option is for extracting human poses without copying other details such as hairstyle, outfit, and background.

Use with control_openpose model.

OpenPose control map.

Because there are no constraints on how the image needs to be generated other than the subject having a specific pose, the model can do its job freely and produce a natural-looking image:

Pidinet

pidinet (Pixel Difference Network) detects curved and straight edges. Its result is similar to HED but usually has cleaner lines with fewer details.

Use with control_hed model.

Scribble

Scribble is for preprocessing user-drawn scribbles. This preprocessor should not be used on a realistic image.

Use with control_scribble model.

Fake Scribble

Fake Scribble produces scribble-like outlines from the input image.

Use with control_scribble model.

Fake Scribble

Note the artifacts on the legs.

Segmentation

A segmentation map assigns different colors to different objects based on an educated guess. Of course, it could be wrong because this is a pretty difficult problem for computers.

Use with control_seg model.

Segmentation control map.

Image generation will also follow the regions of the segmentation map.

OK, now (hopefully) you know all the settings. Let’s explore some ideas to use ControlNet.

Some ideas to use ControlNet

Copying human pose

Perhaps the most common application of ControlNet is copying human poses. This is because it has traditionally been hard to control poses… until now! The input image can be an image generated by Stable Diffusion or one taken with a real camera.

OpenPose model

To use ControlNet for transferring human poses, follow the instructions to enable ControlNet in AUTOMATIC1111. Use the following settings.

Preprocessor: openpose

Model: control_openpose-fp16

Make sure you have checked Enable.

Here are a few examples.

Example 1: Copying pose from an image

As a basic example, let’s copy the pose of the following image of a woman admiring leaves.

Input image

Using various models and prompts, you can dramatically change the content but keep the pose.

Example 2: Remix a movie scene

You can recast the iconic dance scene in Pulp Fiction to some yoga exercises in the park.

This uses ControlNet with DreamShaper model.

Prompt: photo of women doing yoga, outside in a park. Negative prompt: disfigured, ugly, bad, immature

This is with the same prompt, but using Inkpunk Diffusion model. (You will need to add the activation keyword nvinkpunk to the prompt)

Same prompt with inkpunk diffusion model.

ControlNet model comparison for copying poses

Many ControlNet models work well for copying human poses, but with different strengths and weaknesses. Let’s use the painting of Beethoven as an example, and transform it with different ControlNet models.

Input image

I will use the following prompt with the DreamShaper model.

elegant snobby rich Aerith Gainsborough looks intently at you in wonder and anticipation. ultra detailed painting at 16K resolution and epic visuals. epically surreally beautiful image. amazing effect, image looks crazily crisp as far as it’s visual fidelity goes, absolutely outstanding. vivid clarity. ultra. iridescent. mind-breaking. mega-beautiful pencil shadowing. beautiful face. Ultra High Definition. process twice.

The HED model copies the original image most faithfully, down to the small details. Canny Edge, depth, and normal map do a decent job but with a bit more variation. As expected, OpenPose copies the pose but lets the DreamShaper model do the rest of the job, including hairstyle, face, and outfit. M-LSD was designed to detect straight lines and results in a total failure here (for copying poses).

Stylize image with ControlNet

Below are images generated with the v1.5 model but various prompts to achieve different styles. ControlNet with various preprocessors was used. It is best to experiment and see which one works best.

You can also use models to stylize images. Below are generated using the prompt “Painting of Beethoven” with Anythingv3, DreamShaper and OpenJourney models.

Controlling poses with Magic Poser

Sometimes you may be unable to find an image with the exact pose you want. You can create your custom pose using software tools like Magic Poser (credit).

Step 1: Go to the Magic Poser website.

Step 2: Move the keypoints of the model to customize the pose.

Step 3: Press Preview. Take a screenshot of the model. You should get an image like below.

Human pose from Magic Poser.

Step 4: Use OpenPose ControlNet model. Select the model and prompt of your choice to generate images.

Below are some images generated using the v1.5 model and the DreamShaper model. The pose was copied well in all cases.

Interior design ideas

You can use Stable Diffusion ControlNet’s straight-line detector M-LSD model to generate interior design ideas. Below are the ControlNet settings.

You can start with any interior design photos. Let’s use the one below as an example.

Input image for interior design.

Prompt:

award winning living room

Model: Stable Diffusion v1.5

Below are a few design ideas generated.

Alternatively, you can use the depth model. Instead of straight lines, it will emphasize preserving the depth information.

Settings:

Generated images:

Difference between Stable Diffusion depth model and ControlNet

Stability AI, the creator of Stable Diffusion, released a depth-to-image model. It shares a lot of similarities with ControlNet but there are important differences.

Let’s first talk about what’s similar.

  1. They are both Stable Diffusion models…
  2. They both use two conditionings (a preprocessed image and text prompt).
  3. They both use MiDaS to estimate the depth map.

The differences are:

  1. Depth-to-image model is a v2 model. ControlNet can be used with any v1 or v2 models. This point is huge because v2 models are notoriously hard to use. People have a hard time generating good images. The fact that ControlNet can use any v1 model not only opened up depth conditioning to the v1.5 base model, but also thousands of special models that were released by the community.
  2. ControlNet is more versatile. In addition to depth, it can also condition with edge detection, pose detection, and so on.
  3. ControlNet’s depth map has a higher resolution than depth-to-image’s.

How does ControlNet work?

This tutorial won’t be complete without explaining how ControlNet works under the hood.

ControlNet works by attaching trainable network modules to various parts of the U-Net (noise predictor) of the Stable Diffusion model. The weights of the Stable Diffusion model are locked so that they are unchanged during training. Only the attached modules are modified during training.

The model diagram from the research paper sums it up well. Initially, the weights of the zero convolution layers connecting the attached modules are all zero, so at the start of training the combined model behaves exactly like, and can take full advantage of, the trained and locked model.
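
To make the idea concrete, here is a toy PyTorch sketch of attaching a trainable copy to a locked block through zero convolutions. It is an illustration of the concept, not the actual ControlNet code; the class and variable names are made up.

import copy
import torch.nn as nn

class ZeroConv(nn.Module):
    # 1x1 convolution initialized to zero: at the start of training the
    # attached branch contributes nothing, so the locked model's behavior
    # is preserved exactly.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x):
        return self.conv(x)

class ControlledBlock(nn.Module):
    # One U-Net block with a ControlNet attachment: the original (locked)
    # block plus a trainable copy that also sees the encoded control map.
    def __init__(self, block, channels):
        super().__init__()
        self.locked = block                    # original weights, frozen
        self.trainable = copy.deepcopy(block)  # trainable copy, updated during training
        for p in self.locked.parameters():
            p.requires_grad = False
        self.zero_in = ZeroConv(channels)
        self.zero_out = ZeroConv(channels)

    def forward(self, x, control):
        # control: encoded control map (e.g. Canny edges or OpenPose keypoints)
        h = self.locked(x)
        h_ctrl = self.trainable(x + self.zero_in(control))
        return h + self.zero_out(h_ctrl)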

During training, two conditionings are supplied along with each training image: (1) a text prompt, and (2) an annotation such as OpenPose keypoints or Canny edges. This way, the ControlNet model learns to generate images based on these two inputs.

Each annotation method is trained independently.
