Implements the DiffEdit paper using Stable Diffusion. The key insight: the region of an image that needs to change can be located by comparing the model's noise predictions under two different text prompts, so no manual masking is required.

DiffEdit output

Key ideas covered:

  • Mask generation: contrasting noise predictions from a source and target prompt to identify the edit region
  • DDIM inversion: encoding the input image back into latent noise to preserve unedited regions
  • Targeted denoising: denoising with the target prompt inside the generated mask while keeping the source latents outside it
  • Stable Diffusion internals: CLIP text encoding, UNet denoising, VAE decoding
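
The mask generation and targeted denoising ideas above can be illustrated with a toy, dependency-free sketch. It is not this project's code: the function names `diffedit_mask` and `masked_blend`, the plain nested lists standing in for latent tensors, and the 0.5 threshold are all assumptions made for illustration. A real pipeline would take the noise predictions from the UNet and would typically average the difference over several noise samples before thresholding.

```python
# Toy sketch of two DiffEdit steps on 2D lists of floats.
# All names are hypothetical; real code operates on latent tensors.

def diffedit_mask(eps_source, eps_target, threshold=0.5):
    """Binary mask from noise predictions under the source and target prompts."""
    # Per-position absolute difference between the two predictions.
    diff = [[abs(t - s) for s, t in zip(row_s, row_t)]
            for row_s, row_t in zip(eps_source, eps_target)]
    # Rescale to [0, 1] so a single threshold works across inputs.
    lo = min(min(row) for row in diff)
    hi = max(max(row) for row in diff)
    span = (hi - lo) or 1.0  # avoid dividing by zero when the difference is constant
    return [[1 if (d - lo) / span >= threshold else 0 for d in row]
            for row in diff]

def masked_blend(mask, edited, original):
    """Keep `edited` values where the mask is 1 and `original` values elsewhere."""
    return [[e if m else o for m, e, o in zip(row_m, row_e, row_o)]
            for row_m, row_e, row_o in zip(mask, edited, original)]

# Assumed example values: the two prompts disagree mostly in the middle column.
eps_source = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
eps_target = [[0.1, 0.9, 0.1], [0.1, 0.8, 0.2]]
mask = diffedit_mask(eps_source, eps_target)
print(mask)  # [[0, 1, 0], [0, 1, 0]]
```

In a full pipeline the blend would be applied at every denoising step, with `original` being the DDIM-inverted latent of the input image at that step, so that regions outside the mask are reproduced rather than regenerated.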