DALL·E 3 is a text-to-image AI model you can use with ChatGPT. In this post, we will compare DALL·E 3 with Stable Diffusion XL to see what each model excels at.
- DALL·E 3
- Stable Diffusion
- DALL·E 3 vs Stable Diffusion XL
- Improving Stable Diffusion
- Using Stable Diffusion and DALL·E 3 together
What is DALL·E 3?
DALL·E 3 is a text-to-image generative AI that turns text descriptions into images. The training and model architecture are described in the paper “Improving Image Generation with Better Captions” by James Betker and coworkers.
The major improvement in DALL·E 3 is the ability to generate images that follow the prompt closely. The authors found that the current text-to-image models do not follow prompts well because the captions of the training images are noisy. By using highly descriptive captions generated by a captioning model, they were able to improve the prompt-following ability of DALL·E 3 significantly.
Note that DALL·E 3 has other undisclosed improvements over the previous version. So, better performance does not all come from better captioning in training.
How to use DALL·E 3?
You will need to subscribe to ChatGPT Plus to use DALL·E 3.
Follow these steps to use DALL·E 3:
- Open ChatGPT.
- Tell ChatGPT to “Create an image with….”. Type a description of the image. ChatGPT will revise and expand your description and display the images generated with DALL·E 3.
You can then tell ChatGPT which image you want to keep modifying. You can’t revise the prompt directly: ChatGPT acts as a middleman between you and DALL·E 3. Besides changing the prompt, you can ask ChatGPT to change the image’s aspect ratio.
What is Stable Diffusion?
Similar to DALL·E 3, Stable Diffusion is a text-to-image generative AI model. It is a latent diffusion model in which image synthesis occurs in a smaller latent space. It has the advantage of being smaller and can be run on a personal computer.
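To get a feel for how much smaller the latent space is, here is a rough back-of-the-envelope sketch. The figures are assumptions not stated in this post: a 1024×1024 RGB image, a VAE that downsamples each side by a factor of 8, and 4 latent channels, which are the values commonly cited for Stable Diffusion XL. The function names are made up for illustration.

```python
# Assumed figures (not from the post): 8x spatial downsampling per side
# and 4 latent channels, as commonly cited for the Stable Diffusion VAE.

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3, **kwargs):
    """Ratio of values in the pixel image to values in its latent."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)


shape = latent_shape(1024, 1024)
ratio = compression_ratio(1024, 1024)
print(shape)  # (4, 128, 128)
print(ratio)  # 48.0
```

Under these assumptions, the diffusion process works on about 48 times fewer values than it would in pixel space, which is a large part of why the model can run on a personal computer.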
How to use Stable Diffusion?
DALL·E 3 vs Stable Diffusion XL
We will compare DALL·E 3 and Stable Diffusion XL 1.0 in this section.
The main improvement in DALL·E 3 is its prompt-following ability. In my testing, DALL·E 3 generates images that match the prompt much more closely, by a wide margin. This agrees with the research article.
Test 1: Double helix
An aerial perspective of a vast forest landscape that forms a DNA double helix pattern, with rivers and clearings symbolizing its features.
Stable Diffusion XL doesn’t display a double helix pattern. The double helix pattern begins to appear when the keyword weight of “DNA double helix pattern” is increased to 1.2, but the blending is subpar.
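As a minimal sketch of what increasing the keyword weight looks like, the snippet below builds the test prompt with the `(phrase:weight)` emphasis syntax. This assumes an AUTOMATIC1111-style interface; the post does not say which software was used, and other front ends may use a different syntax. The `weight` helper is hypothetical and only formats a string.

```python
def weight(phrase, w):
    """Wrap a phrase in (phrase:weight) emphasis syntax.

    Assumption: AUTOMATIC1111-style prompt weighting. Other UIs may differ.
    """
    if w <= 0:
        raise ValueError("weight must be positive")
    return f"({phrase}:{w:g})"


prompt = (
    "An aerial perspective of a vast forest landscape that forms a "
    + weight("DNA double helix pattern", 1.2)
    + ", with rivers and clearings symbolizing its features."
)
print(prompt)
```

A weight of 1 leaves the phrase unchanged in emphasis, so 1.2 is a modest nudge. Pushing the weight much higher tends to distort the image rather than improve it.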
Test 2: Nuclear war
A photo of a young boy and girl holding hands, witnessing the aftermath of an atomic bomb detonation from an elevated vantage point.
Both images are faithful to the prompt, but I would rate DALL·E 3 as more accurate: the boy and girl stand at an elevated viewpoint, as the prompt specifies, and the explosion is closer to how the public imagines an atomic bomb.
Although Stable Diffusion XL represents a quantum leap in rendering text, it performs worse than DALL·E 3, in my opinion.
Test 1: Billboard
An illustration of a vibrant billboard sign emphasizing the message “Stable Diffusion XL is better than DALLE 3” with radiant light beams.
Test 2: Hiking sign
A female hiker triumphantly reaching the summit with a wooden sign reading “Get High”.
Stable Diffusion XL does better with short phrases than with long ones. This shouldn’t be a surprise to anyone. But I would still rate DALL·E 3 better.
Stable Diffusion has an advantage in rendering a variety of styles. It generates realistic photos better than DALLE 3 out of the box, not to mention you can use community-developed models fine-tuned for realistic images.
Even if the Stable Diffusion base model does not perform well, you can likely find a fine-tuned model that renders the style you want.
Here’s another comparison for an impressionist painting style.
Inpainting and outpainting
As of writing, DALL·E 3 can do neither. For example, asking DALL·E 3 to outpaint an image changes it completely. Stable Diffusion can do both and is clearly the winner here.
You cannot directly control the prompt for DALLE 3. You tell ChatGPT what you want to draw, and it edits the prompt for you. This is both good and bad. It is good for beginners because it delegates prompt engineering to ChatGPT. It is bad for expert users because it takes away the ability to fine-tune the prompt.
Stable Diffusion hasn’t been the same since the invention of ControlNet. Thanks to ControlNet, you can steal a pose, a composition, and colors. Of course, none of them are available with DALLE 3.
To sum up, DALLE 3 generates images that follow prompts much better than Stable Diffusion. This also applies to text rendering. It integrates with ChatGPT to improve your prompt before rendering. These translate to a high chance of getting a usable image the first time you try.
The downside of DALL·E 3, at least for now, is the inability to further dial in an image. It doesn’t support inpainting, outpainting, or ControlNet. Because it is a single model, the range of possible styles is more limited than with Stable Diffusion.
DALLE 3 excels at ease of use. I found it practical. Compared to Stable Diffusion and MidJourney, I can count on it to generate the image I need in the shortest time. On the other hand, Stable Diffusion is for artistic creation and fun, with the ability to refine every aspect of the image until it is perfect.
The ChatGPT middleman makes it challenging for expert users to fine-tune the image because they cannot modify the prompt directly. This limitation likely arises from liability concerns. ChatGPT incorporates an additional safety filter to eliminate any inappropriate content from the prompt. As a result, users cannot use the AI model with complete freedom.
Perhaps the biggest divide is in the business model: DALL·E 3 is a closed, proprietary service. Stable Diffusion is an open-source, downloadable model. The power of Stable Diffusion lies in thousands of users spending millions of hours building tools for it and fine-tuning it.
Governments and big corporations around the world are keen on regulating open-source AI models prematurely. The effort will likely stifle open-source developments that have led to thousands of custom models on Civitai and amazing tools like ControlNet.
Improving Stable Diffusion
Judging from the success of DALL·E 3, a quick improvement would be to fine-tune the Stable Diffusion XL model with highly accurate captions. This should significantly improve the out-of-the-box performance of Stable Diffusion, making it more useful as a text-to-image generator.
Using Stable Diffusion and DALL·E 3 together
Of course, Stable Diffusion and DALL·E 3 are not mutually exclusive. We can use them together and play to the strengths of each.
You can first generate an image in DALLE 3 and use Stable Diffusion for inpainting. This compensates for DALLE 3’s inability to inpaint.
Reference images for ControlNet
If you have trouble generating an image with Stable Diffusion, you can try DALLE 3. Then use the image as a reference for ControlNet Canny, for example, to steal the composition.