DiffEdit: Text-Guided Image Editing via Diffusion

Implements the DiffEdit paper using Stable Diffusion. The key insight: you can locate what needs to change in an image by comparing the model's noise predictions under two different text prompts, so no manual masking is required.

[Figure: DiffEdit output]

Key ideas covered:

Mask generation: contrasting noise predictions from a source and target prompt to identify the edit region
DDIM inversion: encoding the input image back into latent noise to preserve unedited regions
Targeted denoising: applying the diffusion process only within the generated mask
Stable Diffusion internals: CLIP text encoding, UNet denoising, VAE decoding
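
To make the mask-generation and targeted-denoising ideas concrete, here is a minimal NumPy sketch on toy arrays. It is not this repository's code: the function names `diffedit_mask` and `blend_latents`, the 0.5 threshold default, and the array shapes are illustrative assumptions, and the random arrays stand in for real UNet noise predictions. The clipping of extreme values before rescaling is omitted for brevity.

```python
import numpy as np


def diffedit_mask(eps_source, eps_target, threshold=0.5):
    """Binary edit mask from two stacks of noise predictions.

    eps_source, eps_target: arrays of shape (n_samples, C, H, W), standing in
    for UNet noise predictions under the source and target prompts.
    """
    # Absolute difference, averaged over noise samples and latent channels.
    diff = np.abs(eps_target - eps_source).mean(axis=(0, 1))
    # Rescale to [0, 1]; the epsilon guards against a constant difference map.
    scaled = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    # Binarize: 1 marks the region to edit, 0 the region to preserve.
    return (scaled >= threshold).astype(np.float32)


def blend_latents(edited, inverted, mask):
    """Keep the edited latent inside the mask and the inverted latent outside."""
    return mask * edited + (1.0 - mask) * inverted


# Toy data: the two "predictions" disagree only in a 3x3 patch.
rng = np.random.default_rng(0)
eps_source = rng.normal(size=(10, 4, 8, 8))
eps_target = eps_source.copy()
eps_target[:, :, 2:5, 2:5] += 1.0

mask = diffedit_mask(eps_source, eps_target)

# Stand-ins for the latent being denoised under the target prompt and the
# DDIM-inverted latent of the original image at the same timestep.
edited = np.ones((4, 8, 8))
inverted = np.zeros((4, 8, 8))
blended = blend_latents(edited, inverted, mask)
```

In a real pipeline the blend would be applied at each denoising step, so that pixels outside the mask keep following the trajectory recovered by DDIM inversion while pixels inside it follow the target prompt.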