Image generation AI works nothing like you'd expect. It doesn't sketch an outline and fill it in. It starts with pure random noise, literally a static image, and gradually refines it until recognizable content emerges.
**The process:**
**Training**: Show the model millions of images. Progressively add Gaussian noise to each image over many steps until it's unrecognizable static. Train the model to predict and remove the noise at each step. After training, the model has learned to 'denoise' — to move from chaos toward coherent images.
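The forward noising step has a convenient closed form: you can jump straight to any step `t` by blending the original image with Gaussian noise. A minimal sketch, assuming a standard DDPM-style linear beta schedule (the schedule and step count are typical defaults, not from the text):

```python
import numpy as np

np.random.seed(0)

def add_noise(image, t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Forward diffusion: blend the image with Gaussian noise at step t,
    using the closed-form q(x_t | x_0) under a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)
    a = alphas_cumprod[t]
    noise = np.random.randn(*image.shape)
    noisy = np.sqrt(a) * image + np.sqrt(1.0 - a) * noise
    return noisy, noise  # the model is trained to predict `noise` from `noisy`

# A tiny 8x8 "image": an early step barely changes it,
# a late step leaves it close to unrecognizable static.
img = np.ones((8, 8))
early, _ = add_noise(img, t=0)
late, _ = add_noise(img, t=999)
```

The training target at each step is simply the `noise` returned here; learning to predict it is what lets the model later run the process in reverse.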
**Inference (generation)**:
1. Start with pure random noise
2. Feed in your text prompt (encoded as embeddings)
3. The model predicts what noise to remove
4. Repeat 20-50 times
5. A coherent image matching your prompt emerges
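The five steps above can be sketched as a simplified deterministic (DDIM-style) sampling loop. Here `toy_predictor` is a hypothetical stand-in for the trained network; in a real system it would be a U-Net or transformer conditioned on the prompt embeddings (that is where step 2 happens):

```python
import numpy as np

def generate(predict_noise, shape, num_steps=50):
    """Reverse diffusion sketch: start from pure noise and repeatedly
    remove the model's predicted noise (simplified DDIM-style update)."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)
    x = np.random.randn(*shape)                # step 1: pure random noise
    for t in reversed(range(num_steps)):       # step 4: repeat ~20-50 times
        a = alphas_cumprod[t]
        eps = predict_noise(x, t)              # step 3: model predicts the noise
        x0_est = (x - np.sqrt(1 - a) * eps) / np.sqrt(a)  # current clean-image estimate
        if t > 0:
            a_prev = alphas_cumprod[t - 1]
            x = np.sqrt(a_prev) * x0_est + np.sqrt(1 - a_prev) * eps
        else:
            x = x0_est                         # step 5: final image
    return x

# Toy "model" that always points toward a known target image; a real
# predictor is learned and steered by the prompt embeddings (step 2).
target = np.full((8, 8), 0.5)
def toy_predictor(x, t):
    a = np.cumprod(1 - np.linspace(1e-4, 0.02, 50))[t]
    return (x - np.sqrt(a) * target) / np.sqrt(1 - a)

out = generate(toy_predictor, (8, 8))
```

Because the toy predictor is exact, the loop recovers `target` regardless of the starting noise; a trained model only approximates this, which is why multiple refinement steps are needed.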
The text conditioning happens via **cross-attention** — the same mechanism as in LLMs, but applied to image patches. The prompt embeddings are attended to at every denoising step, steering the image toward the described content.
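A minimal sketch of that cross-attention step, with random matrices standing in for the learned projections (all shapes and dimensions here are illustrative assumptions): queries come from image patches, while keys and values come from the prompt tokens.

```python
import numpy as np

def cross_attention(image_patches, prompt_embeddings, d=16):
    """Cross-attention sketch: image patches query the prompt tokens,
    so each patch pulls in information about the described content."""
    rng = np.random.default_rng(0)
    # Projection matrices are learned in a real model; random here.
    Wq = rng.standard_normal((image_patches.shape[1], d))
    Wk = rng.standard_normal((prompt_embeddings.shape[1], d))
    Wv = rng.standard_normal((prompt_embeddings.shape[1], d))
    Q = image_patches @ Wq            # one query per image patch
    K = prompt_embeddings @ Wk        # one key per prompt token
    V = prompt_embeddings @ Wv
    scores = Q @ K.T / np.sqrt(d)     # patch-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ V                # each patch mixes in prompt information

patches = np.random.randn(64, 32)     # 64 image patches, feature dim 32
tokens = np.random.randn(8, 32)       # 8 prompt tokens, feature dim 32
out = cross_attention(patches, tokens)
```

The only difference from self-attention in an LLM is where the keys and values come from: here they are the prompt embeddings, so the attention weights say how much each patch should listen to each word at every denoising step.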
**Why it works so well**: The denoising process is grounded in solid probability theory: the model learns to reverse a well-defined forward noising process. It isn't making things up arbitrarily; it's navigating a learned probability distribution over 'what images look like.'
Key models: Stable Diffusion (open source, runs locally), DALL-E 3 (OpenAI), Midjourney, Imagen (Google), Flux (Black Forest Labs). Each uses variations on this core diffusion idea.
Video generation models (Sora, Runway, Kling) extend diffusion to the time dimension.
**Key takeaway:** Image AI starts with noise and removes it, guided by your text prompt — every step making the image more coherent.