A First Look at DALL-E 2 — How It Works Under the Hood

Know why everyone is talking about it

Published in Better Programming, Jul 10, 2022

OpenAI has released its much-awaited model DALL-E 2, and it is making headlines everywhere. Big YouTube channels like Marques Brownlee and Vox are making videos about it too. So in this post, we will go through what DALL-E 2 is, what makes it special, and why everyone is talking about it.

What is DALL-E 2?

DALL-E 2 is the successor to OpenAI’s DALL-E model. The name DALL-E is a portmanteau of WALL-E (a sci-fi film by Pixar) and Salvador Dalí (a Spanish artist renowned for his surrealist paintings). The model generates photorealistic images from a given text description.

The model is not yet available to the public, but the OpenAI team has put a nice demo on their website. As you can see, these are images that would take an artist or graphic designer hours, if not days, to produce, but DALL-E 2 does it in a matter of minutes, and the results are impressive. It captures the important characteristics of the prompt it is given and tries to incorporate them into the image.

Enough fanboying over the impressive results; let's look under the hood to see how these images are generated. Fortunately, the OpenAI team also released the paper behind DALL-E 2, which gives us a look at its training process. DALL-E 2 combines two kinds of models: CLIP and diffusion models. CLIP generates a CLIP text embedding from a caption (shown above the dotted line in Fig. 2). The diffusion part (shown below the dotted line) first uses a diffusion prior to generate an image embedding from the CLIP text embedding. This image embedding is then decoded by a diffusion decoder to produce the final image.

These terms may sound complicated but the concepts behind them are easy to grasp. So to know more about it, let's dive deeper.

What is CLIP?

CLIP was released alongside the original DALL-E back in January 2021. CLIP stands for Contrastive Language-Image Pre-training. The basic idea behind CLIP is to take images and text as input and try to connect them. It does this by using an image encoder (ResNet/ViT) to generate image embeddings and a text encoder (Transformer) to generate text embeddings.

Based on these embeddings, we try to learn which image embedding corresponds to which text embedding in a contrastive manner. In simpler words, we try to maximize the dot product (the similarity) of each correct pair of image embedding and text embedding, and to minimize the dot product of all the incorrect combinations of image and text embeddings.
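To make the contrastive idea concrete, here is a minimal sketch of a CLIP-style symmetric loss in plain Python. It is not OpenAI's implementation: the helper names, the toy 3-dimensional embeddings, and the fixed temperature of 0.07 are all assumptions made for illustration. The loss is low when each image is most similar to its own caption and high when the pairs are mismatched.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Softmax cross-entropy for one row of logits (log-sum-exp for stability)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i is assumed to match text i."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled similarity between image i and text j
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image -> text: each row should give the highest score to its own column.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> image: each column should give the highest score to its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch of hypothetical embeddings (real ones have hundreds of dimensions).
aligned_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_texts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Same captions, but deliberately paired with the wrong images.
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

loss_aligned = clip_loss(aligned_images, aligned_texts)
loss_shuffled = clip_loss(aligned_images, shuffled_texts)
print(loss_aligned < loss_shuffled)  # True
```

Training nudges both encoders so that the loss on real image-caption batches moves toward the "aligned" case.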

Before running inference, we generate text prompts from the dataset, denoted as stage 2 of the overview. These are simple prompts generated from the labels of the dataset. The idea behind using prompts is that they contain more information than the label alone. During inference, when an image is given, the image encoder generates the image embedding and the model predicts which text prompt in the dataset is closest to this image embedding.
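The prompt-matching step can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings below are hand-written stand-ins for what the text and image encoders would produce, the function names are hypothetical, and the "a photo of a ..." template is one simple way to turn a label into a prompt.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_predict(image_embedding, prompts, prompt_embeddings):
    """Return the prompt whose embedding is most similar to the image embedding."""
    scores = [cosine_similarity(image_embedding, e) for e in prompt_embeddings]
    best = max(range(len(prompts)), key=lambda i: scores[i])
    return prompts[best], scores

# Build one prompt per dataset label.
labels = ["dog", "cat", "plane"]
prompts = [f"a photo of a {label}" for label in labels]

# Hypothetical stand-ins for the text encoder's output, one per prompt.
prompt_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
]
# Hypothetical stand-in for the image encoder's output on a cat picture.
image_embedding = [0.2, 0.8, 0.1]

prediction, scores = zero_shot_predict(image_embedding, prompts, prompt_embeddings)
print(prediction)  # a photo of a cat
```

No classifier is trained here: the prediction comes purely from comparing embeddings, which is why the same model can be pointed at a new set of labels just by writing new prompts.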

This link between text embeddings and image embeddings is one of the reasons that DALL-E 2 is able to generate such images from a text description.

What are Diffusion Models?

Diffusion models are a newer kind of generative model that is outperforming GANs in image synthesis and photorealism tasks. Surprise, surprise: the paper showing this also comes from OpenAI.

These diffusion models work by adding a small amount of noise to an existing image at each step. If we repeat this step enough times, we can safely assume that we will be left with just noise. The task of a diffusion model is to recover an image by learning to reverse this noise-adding process.
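The noise-adding (forward) half of this process is simple enough to sketch directly. The code below is an illustration, not DALL-E 2's code: the update rule and the linear noise schedule (1,000 steps, from 0.0001 to 0.02) are assumptions borrowed from the common DDPM formulation, and the "image" is just a flat list of 500 pixel values.

```python
import math
import random

def forward_diffusion(pixels, betas, rng):
    """Repeatedly apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise."""
    x = list(pixels)
    for beta in betas:
        keep = math.sqrt(1.0 - beta)   # how much of the previous step survives
        add = math.sqrt(beta)          # how much fresh Gaussian noise is mixed in
        x = [keep * v + add * rng.gauss(0.0, 1.0) for v in x]
    return x

rng = random.Random(0)
steps = 1000
# Assumed linear schedule: a tiny bit of noise at first, more at later steps.
betas = [1e-4 + (0.02 - 1e-4) * t / (steps - 1) for t in range(steps)]

image = [1.0] * 500  # toy "image": 500 pixels, all with value 1.0
noisy = forward_diffusion(image, betas, rng)

# Fraction of the original pixel value that survives all steps.
signal_kept = math.prod(math.sqrt(1.0 - b) for b in betas)
mean = sum(noisy) / len(noisy)
variance = sum((v - mean) ** 2 for v in noisy) / len(noisy)
print(f"signal kept: {signal_kept:.4f}, mean: {mean:.2f}, variance: {variance:.2f}")
```

After all the steps, almost nothing of the original image survives and the pixels look like standard Gaussian noise (mean near 0, variance near 1). A diffusion model is trained to undo these steps one at a time, which is the part that needs a neural network and is not shown here.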

DALL-E 2 does this by taking the CLIP-generated text embedding and passing it through a prior that converts it into an image embedding. This embedding is then fed to the diffusion decoder to produce those impressive images.
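The flow of data through the three stages can be written down as a skeleton. Everything in it is a placeholder: the function names, the embedding size, and the output size are hypothetical, and each stage returns random numbers instead of running a trained network. It only shows what goes in and what comes out at each step.

```python
import random

EMBED_DIM = 8    # hypothetical embedding size, kept tiny for readability
IMAGE_SIZE = 4   # hypothetical output resolution (4x4 pixels)

def clip_text_encoder(caption):
    """Placeholder for CLIP's text encoder: caption -> text embedding."""
    rng = random.Random(caption)  # seeded by the caption, so repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

def diffusion_prior(text_embedding):
    """Placeholder for the prior: text embedding -> image embedding."""
    rng = random.Random(1)
    return [v + 0.1 * rng.gauss(0.0, 1.0) for v in text_embedding]

def diffusion_decoder(image_embedding):
    """Placeholder for the decoder: image embedding -> grid of RGB pixels."""
    rng = random.Random(sum(image_embedding))
    return [
        [[rng.random() for _ in range(3)] for _ in range(IMAGE_SIZE)]
        for _ in range(IMAGE_SIZE)
    ]

def generate(caption):
    """Chain the three stages in the order the article describes."""
    text_embedding = clip_text_encoder(caption)
    image_embedding = diffusion_prior(text_embedding)
    return diffusion_decoder(image_embedding)

image = generate("a panda mad scientist")
print(len(image), len(image[0]), len(image[0][0]))  # 4 4 3
```

The point of the sketch is the shape of the pipeline: the caption never reaches the decoder directly here, only through the image embedding produced by the prior.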

Results

Enough of this tech-talk, now let's look at some images:

Wow, what an image of the panda mad scientist. Even more commendable is that DALL-E 2 also understands reflection, as we see a greenish tinge on the glasses.

This is also quite an impressive image of the abstract concept of making coffee out of human souls. DALL-E 2 understands what a coffee machine is and where the coffee should come out, and it also has an idea of what human souls should look like. This is just beyond impressive!

These results are cherry-picked by OpenAI to show only the most appealing ones. Still, they push the boundaries of image generation from text descriptions to the extreme. I'm really curious what models we will see in this space in the next few years.

Will there be a DALL-E 3?

DALL-E 2, though impressive, still has a number of shortcomings.

Spelling mistakes

For a multi-billion parameter model trained on images and text, it is pretty funny that DALL-E 2 makes spelling mistakes. These are some images generated by users who have access to DALL-E 2. The images below show what DALL-E 2 generated when given the prompt “A sign that says deep learning”.

“A sign that says deep learning.” Credit: OpenAI

Relationships

Based on the cherry-picked results that OpenAI showed us, DALL-E 2 appears to have a good understanding of the relationships between objects. However, it sometimes messes up simple prompts like “A red cube on top of a blue cube”.

“A red cube on top of a blue cube.” Credit: OpenAI

I do not want to undermine the results of DALL-E 2; the images it generates are impressive. However, its failure cases are still quite evident. These shortcomings make it likely that we will see a DALL-E 3 in the near future.

Why is everyone talking about it?

DALL-E 2 is the talk of the town because the results are impressive and popular media houses are covering it. However, the question everyone brings up is: “Is DALL-E 2 the end of the artist/graphic designer?” After all, YouTubers could now make their thumbnails with DALL-E 2 without needing a graphic designer.

Though these concerns are understandable, I do not find them convincing. A machine will never be as good as a human at an abstract skill like art. Artists often struggle to come up with a starting idea. DALL-E 2 can provide a good starting point that the artist or graphic designer can later modify and improve.

Conclusion

The results of DALL-E 2 are impressive; they have pushed the boundaries of text-to-image generation to the extreme. With this kind of research, we should always look at what it could enable in the next 2–3 years. It's hard to imagine that the original AlexNet paper arrived just 10 years ago, and ResNet, which is now widespread in computer vision architectures, just 6 years ago. Few fields of research are moving as fast as deep learning, and I’m really excited about what the state of the art will be in the next few years.