ControlNet is a Stable Diffusion model that lets you copy compositions and human poses from a reference image. Seasoned Stable Diffusion users know how hard it is to generate the exact pose you want. Everything is kind of… random… like diffusion. ControlNet solves this problem. It is powerful and versatile, and you can use it with any Stable Diffusion model.
In this post, you will learn everything you need to know about ControlNet: what it is, how to install and use it in AUTOMATIC1111, what each setting means, and some common use cases.
What is ControlNet?
ControlNet is a modified Stable Diffusion model. The most basic form of Stable Diffusion model is text-to-image. It uses text prompts as the conditioning to steer image generation. ControlNet adds one more conditioning. Let me show you two ControlNet examples: (1) edge detection and (2) human pose detection.
Edge detection
In the workflow illustrated below, ControlNet takes an additional input image and detects its outlines using the Canny edge detector. The detected edges are saved as a control map and then fed into the ControlNet model as extra conditioning, in addition to the text prompt.

The process of extracting specific information (edges in this case) from the input image is called annotation (in the research article) or preprocessing (in the ControlNet extension).
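If you are curious what this preprocessing step looks like in code, below is a minimal sketch of Canny edge detection, assuming the opencv-python and Pillow packages and a placeholder file named input.png. The extension does all of this for you; the snippet only illustrates what a control map is.

```python
import cv2
import numpy as np
from PIL import Image

# Load the input image as an RGB array (input.png is a placeholder file name).
input_image = np.array(Image.open("input.png").convert("RGB"))

# Detect edges. 100 and 200 are the low/high thresholds; the extension exposes
# similar threshold sliders for the canny preprocessor.
gray = cv2.cvtColor(input_image, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)

# The control map is a black image with white edges, stacked to three channels.
control_map = np.stack([edges] * 3, axis=-1)
Image.fromarray(control_map).save("canny_control_map.png")
```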
Human pose detection
As you may have suspected, edge detection is not the only way an image can be preprocessed. OpenPose is a fast keypoint detection model that can extract human poses, such as the positions of the hands, legs, and head. See the example below.

Below is the ControlNet workflow using OpenPose. Keypoints are extracted from the input image using OpenPose, and saved as a control map containing the positions of keypoints. It is then fed to Stable Diffusion as an extra conditioning together with the text prompt. Images are generated based on these two conditionings.

What’s the difference between using Canny edge detection and OpenPose? The Canny edge detector extracts the edges of the subject and background alike. It tends to translate the scene more faithfully. You can see the dancing man became a woman, but the outline and hairstyle were similar.
OpenPose only detects keypoints so the image generation is more liberal but follows the original pose. In the above example, it generated a woman jumping up with the left foot pointing sideways, different from the original image and the one in Canny edge example. The reason is that OpenPose’s keypoint detection does not specify the orientations of the feet.
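If you want to see what OpenPose preprocessing produces outside of the webui, here is a rough sketch using the controlnet_aux package (an assumption on my part; the extension ships its own copy of the detector, and dancer.png is a placeholder file name).

```python
from PIL import Image
from controlnet_aux import OpenposeDetector

# Download the keypoint detector weights from the annotator repository.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

# dancer.png is a placeholder name for the input image.
input_image = Image.open("dancer.png")

# The result is a stick-figure image of the detected keypoints, i.e. the control map.
control_map = openpose(input_image)
control_map.save("openpose_control_map.png")
```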
ControlNet models available
Now let’s see what ControlNet models can do. You will learn exactly what they do when I explain how to use the ControlNet extension.
OpenPose detector
OpenPose detects human keypoints such as positions of the head, shoulders, hands, etc. It is useful for copying human poses but not other details such as outfits, hairstyles, and backgrounds.

Canny edge detector
Canny edge detector is a general-purpose, old-school edge detector. It extracts outlines of an image. It is useful for retaining the composition of the original image.

Straight line detector
ControlNet can be used with M-LSD (Mobile Line Segment Detection), a fast straight-line detector. It is useful for extracting outlines with straight edges like interior designs, buildings, street scenes, picture frames, and paper edges.

HED edge detector
HED (Holistically-Nested Edge Detection) is an edge detector good at producing outlines like an actual person would. According to ControlNet’s authors, HED is suitable for recoloring and restyling an image.

Scribbles
ControlNet can also turn your scribbles into an image!

Other models
Additional models are:
- Human Pose – Use OpenPose to detect keypoints.
- Semantic Segmentation – Generate images based on a segmentation map extracted from the input image.
- Depth Map – Like depth-to-image in Stable Diffusion v2, ControlNet can infer a depth map from the input image. ControlNet’s depth map has a higher resolution than Stable Diffusion v2’s.
- Normal map – Normal map specifies the 3D orientation of an object’s surface. It is a common way to fake texture on low-resolution surfaces made of polygons. Of course, all you have is a 2D input image. The normal map is calculated from the depth map.
Installing Stable Diffusion ControlNet
Let’s walk through how to install ControlNet in AUTOMATIC1111, a popular and full-featured (and free!) Stable Diffusion GUI. I will use this extension, which is the de facto standard, to enable ControlNet.
If you already have ControlNet installed, you can skip to the next section to learn how to use it.
Install ControlNet in Google Colab
It’s easy to use ControlNet with the 1-click Stable Diffusion Colab notebook in our Quick Start Guide.
In the Extensions section of the Colab notebook, check ControlNet.

Press the Play button to start AUTOMATIC1111. That’s it!
Install ControlNet on Windows PC or Mac
You can use ControlNet with AUTOMATIC1111 on Windows PC or Mac. Follow the instructions in these articles to install AUTOMATIC1111 if you have not already done so.
If you already have AUTOMATIC1111 installed, make sure your copy is up-to-date.
Install ControlNet extension (Windows/Mac)
To install the ControlNet extension, go to the Extensions tab and select the Available sub-tab. Press the Load from button.

In the list that appears, find the row for the sd-webui-controlnet extension. Press Install.

Restart AUTOMATIC1111 webui.
If the extension is successfully installed, you will see a new collapsible section in the txt2img tab called ControlNet. It should be right above the Script drop-down menu.

This indicates the extension installation was successful.
Install ControlNet Models (Windows/Mac)
The authors of ControlNet have released a handful of pretrained ControlNet models. They can be found on this page.
The webui community hosts half-precision versions of ControlNet models, which have smaller file sizes. They are faster to download, and easier to store, so why not? You can find them here.
To use these models, download the model files and put them in the model folder of the extension. The path to the model folder is
stable-diffusion-webui/extensions/sd-webui-controlnet/models
You don’t need to download all models. If this is your first time using ControlNet, you can just download the openpose model.
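If you prefer to script the download, here is a hypothetical example using the huggingface_hub package. The repo id and file name are assumptions based on the naming used in this article; check the links above for the actual location.

```python
import shutil
from huggingface_hub import hf_hub_download

# Download the half-precision OpenPose ControlNet model (repo id and file name
# are assumptions; verify them on the model page linked above).
downloaded_path = hf_hub_download(
    repo_id="webui/ControlNet-modules-safetensors",
    filename="control_openpose-fp16.safetensors",
)

# Copy it into the extension's model folder.
shutil.copy(
    downloaded_path,
    "stable-diffusion-webui/extensions/sd-webui-controlnet/models/",
)
```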
Using ControlNet in AUTOMATIC1111
In this section, you will go through one simple example of using the extension. You will see a detailed explanation of each setting later.
You should have the ControlNet extension installed to follow this section. You can verify the installation by checking for the ControlNet section shown below.



Press the caret on the right to expand the ControlNet panel. It shows the full section of control knobs and an image upload canvas.


I will use the following image to show you how to use ControlNet. You can download the image using the download button to follow the tutorial.


Text-to-image settings
ControlNet needs to be used with a Stable Diffusion model. In the Stable Diffusion checkpoint dropdown menu, select the model you want to use with ControlNet. Select v1-5-pruned-emaonly.ckpt to use the v1.5 base model.


In the txt2img tab, write a prompt and, optionally, a negative prompt to be used with ControlNet. I will use the prompts below.
Prompt:
full-body, a young female, highlights in hair, dancing outside a restaurant, brown eyes, wearing jeans
Negative prompt:
disfigured, ugly, bad, immature
Set image size for image generation. I will use width 512 and height 776 for my demo image. Note that the image size is set in the txt2img section, NOT in the ControlNet section.
The GUI should look like the screenshot below.


ControlNet settings
Now I will talk about what you need to do in the ControlNet panel.
First upload an image to the image canvas.
Check the Enable checkbox.
You will need to select a preprocessor and a model. Preprocessor is just a different name for the annotator mentioned earlier, such as the OpenPose keypoint detector. Let’s select openpose as Preprocessor.
The selected ControlNet model has to be consistent with the preprocessor. For OpenPose, you should select control_openpose-fp16 as the model.
The ControlNet panel should look like this.


That’s all. Now press Generate to start generating images using ControlNet.
You should see the generated images follow the pose of the input image. The last image comes straight from the preprocessing step; in this case, it shows the detected keypoints.


This is the basics of using ControlNet! When you are done, uncheck the Enable checkbox to disable the ControlNet extension.
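If you ever want to reproduce this workflow outside of AUTOMATIC1111, here is a rough equivalent using the diffusers and controlnet_aux packages. This is only a sketch under the assumption that the listed Hugging Face repo ids are still available; it is not the extension’s own code.

```python
import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# 1. Preprocess: extract the pose control map from the input image.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
control_map = openpose(Image.open("input.png"))  # input.png is a placeholder

# 2. Load an OpenPose ControlNet and attach it to a v1.5 checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# 3. Generate, conditioned on both the text prompt and the control map.
image = pipe(
    "full-body, a young female, highlights in hair, dancing outside a restaurant, "
    "brown eyes, wearing jeans",
    negative_prompt="disfigured, ugly, bad, immature",
    image=control_map,
    width=512,
    height=776,
).images[0]
image.save("output.png")
```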
ALL ControlNet settings explained
You see a lot of settings in the ControlNet extension! It can be a bit intimidating when you first use it, but let’s go through them one by one.
It’s going to be a deep dive. Take a break and go to the bathroom if you need to…
Input controls


Image Canvas: You can drag and drop the input image here. You can also click on the canvas and select a file using the file browser. The input image will be processed by the selected preprocessor in the Preprocessor dropdown menu. A control map will be created.
Pro tip: Use Ctrl-V (Windows) or Cmd-V (Mac) to paste an image to the ControlNet image canvas.
Camera icon: Take a picture using your device’s camera and use it as the input image. You will need to grant permission to your browser to access the camera.
Enable: Whether to enable ControlNet.
Invert input color: Swap black and white. It can be used when you upload a scribble, for example. ControlNet expects a black background with white scribbles. You must use this option if you create a scribble using external software with a white background. You don’t need to use this option if you create a scribble using ControlNet’s interface.
RGB to BGR: Swaps the order of the color channels of the uploaded image, or the order of the coordinates of an uploaded normal map. You don’t need to check this box if you upload an image and use a preprocessor.
Low VRAM: For GPUs with less than 8GB of VRAM. It is an experimental feature. Check it if you run out of GPU memory or want to increase the number of images processed.
Guess Mode: Also known as non-prompt mode. The image generation can be totally unguided by the text prompt. It forces the ControlNet encoder to follow the input control map (depth, edges, etc.) even if there’s no prompt. Use more steps, e.g. 50, when using this mode. You normally don’t check this box.
Preprocessor and model
Preprocessor: The preprocessor (called annotator in the research article) for preprocessing the input image, such as detecting edges, depth, and normal maps. None uses the input image as the control map.
Model: ControlNet model to use. If you have selected a preprocessor, you would normally select the corresponding model. The ControlNet model is used together with the Stable Diffusion model selected at the top of AUTOMATIC1111 GUI.
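For reference, the preprocessor-to-model pairings used later in this article can be summarized as a small lookup table. The model files follow the fp16 naming used above, and the exact spelling of the preprocessor names may differ slightly between versions of the extension; the canny entry follows the same naming pattern as the others.

```python
# Which fp16 ControlNet model to pick for each preprocessor (per this article).
PREPROCESSOR_TO_MODEL = {
    "canny":         "control_canny-fp16",
    "depth":         "control_depth-fp16",
    "depth_leres":   "control_depth-fp16",
    "hed":           "control_hed-fp16",
    "pidinet":       "control_hed-fp16",
    "mlsd":          "control_mlsd-fp16",
    "normal_map":    "control_normal-fp16",
    "openpose":      "control_openpose-fp16",
    "scribble":      "control_scribble-fp16",
    "fake_scribble": "control_scribble-fp16",
    "segmentation":  "control_seg-fp16",
}

print(PREPROCESSOR_TO_MODEL["openpose"])  # control_openpose-fp16
```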
Weight and guidance strength


I will use the following image to illustrate the effect of weight and guidance strength. It’s an image of a girl sitting down.


But in the prompt, I will ask to generate a woman standing up.
full body, a young female, highlights in hair, standing outside restaurant, blue eyes, wearing a dress, side light
Weight: How much emphasis to give the control map relative to the prompt. It is similar to keyword weight in the prompt but applies to the control map.
The following images are generated using the ControlNet OpenPose preprocessor and the OpenPose model.








As you can see, the ControlNet weight controls how closely the control map is followed relative to the prompt. The lower the weight, the less ControlNet demands that the image follow the control map.
Guidance strength: The fraction of sampling steps ControlNet is applied to. It is analogous to denoising strength in image-to-image. If the guidance strength is 1, ControlNet is applied to 100% of the sampling steps. If the guidance strength is 0.7 and you are doing 50 steps, ControlNet is applied to the first 70% of the sampling steps, i.e. the first 35 steps.
Below are images generated with guidance strength ranging from 0.1 to 1, keeping the weight at 1.








Since the initial steps set the global composition (Stable Diffusion starts with a random tensor in latent space and removes the largest amount of noise in the early steps), the pose is set even if you only apply ControlNet to as few as the first 20% of the sampling steps. The guidance strength needs to be set pretty low to have a visible effect.
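To make the two knobs concrete, here is a back-of-the-envelope sketch of how guidance strength maps to sampling steps (the weight, by contrast, scales how strongly the control map is enforced at every applied step).

```python
def controlnet_steps(guidance_strength: float, total_steps: int = 50) -> int:
    """Number of initial sampling steps ControlNet is applied to."""
    return int(guidance_strength * total_steps)

print(controlnet_steps(1.0))  # 50 -> applied to every step
print(controlnet_steps(0.7))  # 35 -> the first 70% of 50 steps
print(controlnet_steps(0.2))  # 10 -> still enough to lock in the pose, per the images above
```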
Resize mode
Resize mode controls what to do when the size of the input image or control map is different from the size of the images to be generated. You don’t need to worry about these options if they have the same aspect ratio.


I will demonstrate the effect of resize modes by setting text-to-image to generate a landscape image, while the input image/control map is a portrait image.
Envelope (Outer Fit): Fits the image canvas within the control map, cropping the control map so that it is the same size as the canvas.
Because the control map is cropped at the top and the bottom, so is our girl.




Scale to Fit: Fits the whole control map into the image canvas, extending the control map with empty values so that it is the same size as the image canvas.
Compared to the original input image, there is more space on the sides.




Just Resize: Scales the width and height of the control map independently to fit the image canvas. This changes the aspect ratio of the control map.
The girl now needs to lean forward so that she’s still within the canvas. You can create some interesting effects with this mode.
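To summarize the three resize modes, here is a rough Pillow sketch of what each one does to the control map. The extension’s actual implementation may differ in details such as the padding color.

```python
from PIL import Image, ImageOps

control_map = Image.open("control_map.png")  # placeholder name, e.g. a portrait control map
canvas_size = (768, 512)                     # e.g. a landscape txt2img size

# Just Resize: stretch width and height independently (the aspect ratio changes).
just_resize = control_map.resize(canvas_size)

# Scale to Fit (Inner Fit): shrink to fit inside the canvas, then pad the sides.
scale_to_fit = ImageOps.pad(control_map, canvas_size, color="black")

# Envelope (Outer Fit): scale to cover the canvas, then crop the overflow.
envelope = ImageOps.fit(control_map, canvas_size)
```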




Scribble canvas settings


You only need to worry about these settings if you use the ControlNet GUI for creating scribbles. These settings have no effect when you upload an input image and use a preprocessor.
Canvas Width and Canvas Height are the width and height of the blank canvas you create when you press the Create blank canvas button.
The up-down arrow button swaps the height and the width. It is useful when you get confused and adjust the height when you meant the width (as I often do…).
Preview annotation


The ControlNet extension will show you a copy of the control map after each round of image generation. But sometimes you just want to see the control map and experiment with the parameters.
Preview annotator result: Generate control map based on the preprocessor setting. A control map will show right next to the input image.


Hide annotator result: Hide the control map.
Preprocessors
I will again use the same input image to demonstrate the control map produced with different preprocessors.


Canny
Canny is an old-school edge detector. It does a fair job of extracting both curved and straight lines, but it can be susceptible to noise.


The result is very decent, with the bokeh in the background nicely reproduced.


Depth
A depth map conveys how far away the objects in an image are: white means closer and dark means farther away. This function is similar to depth-to-image in Stable Diffusion v2 and uses the same MiDaS depth estimator.
To be used with the control_depth model.


The shape of the girl is well preserved.


MiDaS resolution slider: Changes the width of the control map. It is a way to control the resolution of the control map relative to the size of the input image.
Depth LeRes
LeRes is a more recent depth estimation model designed for reconstructing a 3D scene from a single image. LeRes generally recovers more detail than MiDaS, and it offers more control parameters to tweak the control map.
To be used with the control_depth model.


This is one of those times when a depth map with too much detail fails. Look at the legs and compare with the depth map above.


HED
HED (Holistically-Nested Edge Detection) is an edge detection AI model good at producing outlines from an image like a human would. It is functionally similar to Canny but less noisy.
To be used with the control_hed model.


I’m not entirely sure why, but the HED model produces artificial colorings like this. Perhaps it’s a bug.


M-LSD
M-LSD (Mobile Line Segment Detection) is a fast straight line detector. It is useful for extracting outlines with straight edges like interior designs, buildings, street scenes, picture frames, and paper edges.
To be used with the control_mlsd model.
However, don’t use mlsd with portraits or subjects with curved lines. It won’t be able to extract them.


The outline of the girl didn’t transfer well, as expected from the mlsd control map.


Normal map
A normal map specifies the orientation of the normal vector of the surface that each pixel rests on. It is commonly used to fake texture or depth in 3D models. The RGB values of each pixel store the vector components instead of color. A normal is the vector perpendicular to a surface; it represents the surface’s orientation.
Of course, there’s no 3D model. ControlNet only gets a 2D image. The normal map is estimated from the depth map.
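Since the normal map is derived from the depth map, a simplified version of that estimation can be written with image gradients. This is only an illustration of the idea; the actual preprocessor is more involved.

```python
import numpy as np

def depth_to_normal(depth: np.ndarray) -> np.ndarray:
    """depth: HxW float array (larger = closer). Returns an HxWx3 uint8 normal map."""
    # Surface slope in the y and x directions.
    dz_dy, dz_dx = np.gradient(depth.astype(np.float32))
    # The normal is perpendicular to the local surface slope.
    normal = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth, dtype=np.float32)])
    normal /= np.linalg.norm(normal, axis=2, keepdims=True)
    # Map from [-1, 1] to [0, 255] so it can be stored as an RGB image.
    return ((normal + 1.0) * 127.5).astype(np.uint8)
```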
Use with the control_normal model.


The 3D orientation of the input image is transferred faithfully to the new image with the normal map.


OpenPose
OpenPose is a real-time human keypoint detection software. This option is for extracting human poses without copying other details such as hair style, outfits, and background.
Use with the control_openpose model.


Because there are no constraints on how the image needs to be generated other than the subject having a specific pose, the model can do its job freely and produce a natural-looking image:


Pidinet
Pidinet (Pixel Difference Network) detects curved and straight edges. Its result is similar to HED’s but usually has cleaner lines with less detail.
Use with the control_hed model.


Scribble
Scribble is for preprocessing user-drawn scribbles. This preprocessor should not be used on a realistic image.
Use with the control_scribble model.
Fake Scribble
Fake Scribble produces scribble-like outlines from the input image.
Use with the control_scribble model.


Note the artifacts on the legs.


Segmentation
A segmentation map assigns different colors to different objects based on an educated guess. Of course, it can be wrong because this is a pretty difficult problem for computers.
Use with the control_seg model.


Image generation also follows the regions defined by the segmentation map.


OK, now (hopefully) you know all the settings. Let’s explore some ideas to use ControlNet.
Some ideas to use ControlNet
Copying human pose
Perhaps the most common application of ControlNet is copying human poses. This is because it is usually hard to control poses… until now! The input image can be an image generated by Stable Diffusion, or can be taken from a real camera.
OpenPose model
To use ControlNet for transferring human poses, follow the instructions to enable ControlNet in AUTOMATIC1111. Use the following settings.
Preprocessor: openpose
Model: control_openpose-fp16


Make sure you have checked Enable.
Here are a few examples.
Example 1: Copying pose from an image
As a basic example, let’s copy the pose of the following image of a woman admiring leaves.


Using various models and prompts, you can dramatically change the content but keep the pose.








Example 2: Remix a movie scene
You can recast the iconic dance scene in Pulp Fiction as some yoga exercises in the park.


This uses ControlNet with the DreamShaper model.


This is the same prompt but using the Inkpunk Diffusion model. (You will need to add the activation keyword nvinkpunk to the prompt.)


ControlNet model comparison for copying poses
Many ControlNet models work well for copying human poses, but with different strengths and weaknesses. Let’s use the painting of Beethoven as an example, and transform it with different ControlNet models.


I will use the following prompt with the DreamShaper model.
elegant snobby rich Aerith Gainsborough looks intently at you in wonder and anticipation. ultra detailed painting at 16K resolution and epic visuals. epically surreally beautiful image. amazing effect, image looks crazily crisp as far as it’s visual fidelity goes, absolutely outstanding. vivid clarity. ultra. iridescent. mind-breaking. mega-beautiful pencil shadowing. beautiful face. Ultra High Definition. process twice.












The HED model copies the original image most faithfully, down to small details. Canny Edge, depth, and normal map do a decent job but with a bit more variation. As expected, OpenPose copies the pose but lets the DreamShaper model do the rest of the job, including the hairstyle, face, and outfit. M-LSD was designed to detect straight lines and is a total failure here (for copying the pose).
Stylize image with ControlNet
Below are images generated with the v1.5 model but various prompts to achieve different styles. ControlNet with various preprocessors was used. It is best to experiment and see which one works best.








You can also use custom models to stylize images. Below are images generated using the prompt “Painting of Beethoven” with the Anything v3, DreamShaper, and OpenJourney models.








Controlling poses with Magic Poser
Sometimes you may be unable to find an image with the exact pose you want. You can create a custom pose using software tools like Magic Poser (credit).
Step 1: Go to the Magic Poser website.


Step 2: Move the keypoints of the model to customize the pose.
Step 3: Press Preview. Take a screenshot of the model. You should get an image like below.


Step 4: Use the OpenPose ControlNet model. Select the model and prompt of your choice to generate images.
Below are some images generated using the v1.5 model and the DreamShaper model. The pose was copied well in all cases.






Interior design ideas
You can use ControlNet’s straight-line detector, the M-LSD model, to generate interior design ideas. Below are the ControlNet settings.


You can start with any interior design photos. Let’s use the one below as an example.


Prompt:
award winning living room
Model: Stable Diffusion v1.5
Below are a few design ideas generated.








Alternatively, you can use the depth model. Instead of straight lines, it will emphasize preserving the depth information.
Settings:


Generated images:








Difference between Stable Diffusion depth model and ControlNet
Stability AI, the creator of Stable Diffusion, released a depth-to-image model. It shares a lot of similarities with ControlNet but there are important differences.
Let’s first talk about what’s similar.
- They are both Stable Diffusion models…
- They both use two conditionings (a preprocessed image and text prompt).
- They both use MiDaS to estimate the depth map.
The differences are
- Depth-to-image is a v2 model. ControlNet can be used with any v1 or v2 model. This point is huge because v2 models are notoriously hard to use, and people have a hard time generating good images with them. The fact that ControlNet can use any v1 model not only opens up depth conditioning to the v1.5 base model, but also to the thousands of special models released by the community.
- ControlNet is more versatile. In addition to depth, it can also condition with edge detection, pose detection, and so on.
- ControlNet’s depth map has a higher resolution than depth-to-image’s.
How does ControlNet work?
This tutorial won’t be complete without explaining how ControlNet works under the hood.
ControlNet works by attaching trainable network modules to various parts of the U-Net (the noise predictor) of the Stable Diffusion model. The weights of the Stable Diffusion model are locked so that they are unchanged during training. Only the attached modules are modified during training.
The model diagram from the research paper sums it up well. Initially, the weights of the attached network modules are all zero, so the new model can take full advantage of the trained and locked model.
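To make the zero-initialization idea concrete, here is a toy PyTorch sketch (my own simplification, not the paper’s exact architecture): the trainable copy is connected through a convolution whose weights start at zero, so at the beginning of training ControlNet adds nothing and the combined model behaves exactly like the original.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, in the spirit of the paper's 'zero convolution'."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """A locked U-Net block plus a trainable copy that sees the control signal."""

    def __init__(self, locked_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.locked_block = locked_block        # frozen Stable Diffusion weights
        self.trainable_copy = trainable_copy    # trainable ControlNet module
        self.zero_conv = zero_conv(channels)
        for p in self.locked_block.parameters():
            p.requires_grad_(False)             # lock the original weights

    def forward(self, x: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # Initially zero_conv outputs zeros, so the result equals the locked block's output.
        return self.locked_block(x) + self.zero_conv(self.trainable_copy(x + control))
```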


During training, two conditionings are supplied along with each training image: (1) the text prompt, and (2) an annotation such as OpenPose keypoints or Canny edges. This way, the ControlNet model learns to generate images based on these two inputs.
Each annotation method is trained independently.
More readings
- Some images generated with Magic Poser and OpenPose.
- Research article: Adding Conditional Control to Text-to-Image Diffusion Models (Feb 10, 2023)
- ControlNet Github page