GANs, VAEs, and Diffusion Models
Before large language models dominated the headlines, three other generative architectures shaped the field. Understanding them gives you a richer picture of the design space and helps you choose the right tool for the right problem.
Generative Adversarial Networks (GANs)
Ian Goodfellow and colleagues introduced GANs in 2014; Goodfellow reportedly came up with the idea in a single evening. The idea is elegant: train two networks against each other.
The generator takes random noise as input and attempts to produce realistic outputs: images that look like photographs, for example.
The discriminator takes an input (either a real image from the training data or a generated image from the generator) and attempts to classify it as real or fake.
The two networks are trained simultaneously. The generator's goal is to fool the discriminator. The discriminator's goal is to not be fooled. As training progresses, the generator gets better at producing realistic images and the discriminator gets better at spotting fakes. In the end, a well-trained generator produces outputs that even a good discriminator cannot distinguish from real data.
This minimax game can be formalised as:
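Generator tries to minimise: E_z[log(1 - D(G(z)))]
Discriminator tries to maximise: E_x[log(D(x))] + E_z[log(1 - D(G(z)))]
Here z is random noise, G is the generator, x is real data, D(.) is the discriminator's estimated probability that its input is real, and E denotes an average over real samples x or noise samples z.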
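To make the two objectives concrete, here is a minimal sketch that evaluates them on a small batch. The discriminator outputs are made-up probabilities chosen for illustration, not values from a trained model, and the function names are hypothetical.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Batch average of log(D(x)) + log(1 - D(G(z))); the discriminator maximises this."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_objective(d_fake):
    """Batch average of log(1 - D(G(z))); the generator minimises this."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs: probability that each input is real.
# Case 1: the discriminator confidently separates real from generated samples.
confident_d = discriminator_objective(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# Case 2: the discriminator is fully fooled and outputs 0.5 for everything.
fooled_d = discriminator_objective(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

# The generator's objective for the same two cases.
generator_losing = generator_objective([0.1, 0.05])
generator_winning = generator_objective([0.5, 0.5])

print(f"discriminator: confident={confident_d:.3f}, fooled={fooled_d:.3f}")
print(f"generator: losing={generator_losing:.3f}, winning={generator_winning:.3f}")
```

When the discriminator is fooled everywhere, its objective falls to 2 * log(0.5), roughly -1.386, while the generator's objective drops as its samples become harder to reject. That is the tug-of-war the two formulas describe.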
The mode collapse problem
GANs are notoriously difficult to train. One of the best-known failure modes is mode collapse: the generator discovers a small set of outputs that consistently fool the discriminator and produces only those, regardless of the input noise. Instead of generating diverse images, it produces minor variations of the same few examples. This happens because the generator's objective rewards only convincing outputs, not diverse ones.
Researchers developed numerous techniques to combat mode collapse (Wasserstein loss, progressive growing, spectral normalisation), but it remained a fundamental challenge.
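One rough way to spot mode collapse is to count how many distinct regions ("modes") of the data a generator's samples actually reach. The sketch below does this on a toy one-dimensional problem. The mode positions, the two hand-written "generators" and the `radius` threshold are all illustrative assumptions, not part of any real GAN.

```python
import random

def covered_modes(samples, mode_centres, radius=0.5):
    """Count the modes that have at least one sample within `radius` of their centre."""
    return sum(
        any(abs(s - c) <= radius for s in samples)
        for c in mode_centres
    )

rng = random.Random(0)
mode_centres = [-4.0, -2.0, 0.0, 2.0, 4.0]  # hypothetical modes of the real data

# Toy stand-in for a healthy generator: lands near a randomly chosen mode.
diverse = [rng.choice(mode_centres) + rng.gauss(0.0, 0.1) for _ in range(200)]

# Toy stand-in for a collapsed generator: lands near the same mode every time.
collapsed = [2.0 + rng.gauss(0.0, 0.1) for _ in range(200)]

print("diverse generator covers", covered_modes(diverse, mode_centres), "modes")
print("collapsed generator covers", covered_modes(collapsed, mode_centres), "modes")
```

Every collapsed sample may look perfectly realistic on its own, which is why the discriminator alone does not penalise the lack of variety.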
When GANs excel: Image synthesis, style transfer, super-resolution (enhancing image resolution), face aging and editing, data augmentation for training other models.
Variational Autoencoders (VAEs)
VAEs take a different approach, rooted in probabilistic modelling rather than adversarial training.
An autoencoder has two parts: an encoder that compresses input data into a lower-dimensional representation (called a latent vector or code) and a decoder that reconstructs the original input from that latent representation. The goal is to learn a compressed representation that captures the essential structure of the data.
A variational autoencoder adds a crucial twist: instead of encoding each input to a single point in latent space, it encodes it to a probability distribution (specifically, a Gaussian described by a mean and a variance). During training, a regularisation term (the KL divergence between each encoded distribution and a standard Gaussian prior) keeps the latent space smooth and continuous, so that nearby points decode to similar outputs.
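The standard VAE formulation implements this with two ingredients: the "reparameterisation trick" for sampling from the encoded Gaussian, and a KL-divergence penalty that pulls each encoded distribution toward a standard Gaussian. The sketch below assumes that formulation and a diagonal Gaussian; the `mu` and `log_var` values are hypothetical encoder outputs, not results from a trained model.

```python
import math
import random

def reparameterise(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), independently per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]  # hypothetical encoder outputs for one input

z = reparameterise(mu, log_var, rng)    # the latent vector passed to the decoder
kl = kl_to_standard_normal(mu, log_var) # the penalty added to the reconstruction loss

print("sampled latent:", z)
print("KL penalty:", kl)
```

The penalty is zero only when the encoder outputs exactly a standard Gaussian (`mu = 0`, `log_var = 0`), so it discourages the encoder from scattering inputs into isolated, far-apart regions of latent space.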
Why does this matter for generation? Once the VAE is trained, you can sample a random point from the latent space and decode it to get a generated output. Because the latent space is smooth and well-structured, interpolating between two points in latent space smoothly interpolates between the corresponding outputs. You can walk through the latent space and watch faces gradually change expression or car designs smoothly transition between styles.
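Walking through latent space can be sketched as straight-line interpolation between two latent vectors, decoding each intermediate point. The decoder here is a hypothetical fixed linear map standing in for a trained network, and the two end points are made-up latent codes.

```python
def interpolate(z_a, z_b, num_points):
    """Evenly spaced points on the straight line from z_a to z_b, end points included."""
    path = []
    for i in range(num_points):
        t = i / (num_points - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: maps a 2-D latent to 3 output values."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

z_start, z_end = [-1.0, 0.0], [1.0, 2.0]  # made-up latent codes of two inputs
latent_path = interpolate(z_start, z_end, num_points=5)
decoded_path = [toy_decoder(z) for z in latent_path]

for z, output in zip(latent_path, decoded_path):
    print(z, "->", output)
```

With a real, non-linear decoder the outputs would not change linearly, but a well-regularised latent space should still give a gradual transition rather than abrupt jumps.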
When VAEs excel: Generating diverse samples, latent space manipulation and interpolation, semi-supervised learning, anomaly detection (an input that encodes to an unusual latent position is likely anomalous).
Diffusion Models
Diffusion models are the approach behind Stable Diffusion, DALL-E 2 and 3, Midjourney and most of the state-of-the-art image generation systems you interact with today.
The intuition comes from physics, specifically the diffusion processes studied in thermodynamics.
The forward process (adding noise): Take a real image and progressively add Gaussian noise over many steps (typically 1,000 steps) until the image is indistinguishable from pure random noise. This process is fixed and requires no learning.
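Because every forward step adds Gaussian noise, the noisy image at any step t can be sampled directly from the original image, without looping through the earlier steps. The sketch below shows this shortcut. The linear noise schedule and its end points (1e-4 to 0.02) follow a commonly used DDPM-style setup and are an assumption here, as is the tiny three-value "image".

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal variance left at each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)

x0 = [0.8, -0.3, 0.5]  # hypothetical 3-"pixel" image
slightly_noisy = add_noise(x0, 10, alpha_bars, rng)
almost_pure_noise = add_noise(x0, 999, alpha_bars, rng)

print("signal kept after first step:", alpha_bars[0])
print("signal kept after last step:", alpha_bars[-1])
```

Under this schedule almost none of the original signal survives the final step, which is what makes it reasonable to start generation from pure Gaussian noise.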
The reverse process (denoising): Train a neural network (typically a U-Net) to predict and remove the noise at each step. If you can learn to reverse the diffusion process, you can start from pure noise and iteratively denoise it into a realistic image.
At generation time, you start with a random noise sample and apply the learned denoising model repeatedly, step by step, until a coherent image emerges.
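The generation loop can be sketched as follows, assuming the DDPM-style update rule and a linear noise schedule (both assumptions, since the text does not fix a particular sampler). The function `predict_noise` is a placeholder for the trained network: for a toy "dataset" containing a single point, the correct noise can be computed exactly, which makes the loop easy to check.

```python
import math
import random

NUM_STEPS = 1000
# Assumed linear noise schedule and its cumulative signal fractions.
betas = [1e-4 + i * (0.02 - 1e-4) / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

TARGET = [1.0, -1.0]  # hypothetical dataset consisting of one data point

def predict_noise(x_t, t):
    """Placeholder for the trained denoiser: the exact noise for single-point data."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [(x - signal * x0) / noise_scale for x, x0 in zip(x_t, TARGET)]

def sample(rng):
    """Start from pure noise and denoise step by step, from the last step to the first."""
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for t in reversed(range(NUM_STEPS)):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Remove the predicted noise and undo the shrinkage applied by the forward step.
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t]) for xi, e in zip(x, eps)]
        # Fresh noise is added on every step except the final one.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

generated = sample(random.Random(0))
print("generated sample:", generated)
```

With a real model, `predict_noise` would be a neural network trained on many images, and each run from a different noise sample would land on a different plausible image rather than on one fixed point.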
Text conditioning: To generate images from text prompts, the denoising network is conditioned on a text embedding. At each denoising step, the model is guided by both the noisy image and the text description, steering the generation toward outputs consistent with the prompt.
Why diffusion models surpassed GANs:
- More stable training (no adversarial game, so far less prone to mode collapse)
- Better sample diversity
- More controllable generation via the conditioning mechanism
- Easier to scale to larger models and datasets
When diffusion models excel: High-quality image and video generation, inpainting (filling in missing regions of images), image editing, text-to-image generation.
Choosing the Right Approach
| Model | Strengths | Weaknesses |
|---|---|---|
| GAN | Fast inference, sharp images | Training instability, mode collapse |
| VAE | Smooth latent space, interpretable | Slightly blurry outputs |
| Diffusion | High quality, diverse, controllable | Slow inference (many steps) |
Quiz: Explain mode collapse in GANs and why it happens. What is the key insight behind how diffusion models generate images, starting from noise?