This paper introduces the first gradient-based framework for prompt
optimization in text-to-image diffusion models. We formulate prompt engineering
as a discrete optimization problem over the language space. Two major
challenges arise in efficiently finding a solution to this problem: (1)
Enormous Domain Space: Taking the entire language space as the domain makes
the optimization process significantly harder. (2) Text Gradient:
Efficiently computing the text gradient is challenging, as it requires
backpropagating through the inference steps of the diffusion model and a
non-differentiable embedding lookup table. Beyond the problem formulation, our
main technical contributions lie in solving the above challenges. First, we
design a family of dynamically generated compact subspaces comprising only
the words most relevant to the user input, substantially restricting the domain
space. Second, we introduce "Shortcut Text Gradient" -- an effective
replacement for the text gradient that can be obtained with constant memory and
runtime. Empirical evaluation on prompts collected from diverse sources
(DiffusionDB, ChatGPT, COCO) suggests that our method can discover prompts that
substantially improve (prompt enhancement) or destroy (adversarial attack) the
faithfulness of images generated by the text-to-image diffusion model.
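
To make the idea of a compact subspace concrete, the following is a minimal sketch and not the paper's actual construction. It assumes that "relevance" is measured by cosine similarity between word embeddings, and it uses a random toy embedding table; the function name `build_subspace`, the parameter `k`, and the toy vocabulary are all hypothetical. Each prompt position keeps only its `k` nearest vocabulary entries, so the search space shrinks from `vocab_size ** prompt_length` to `k ** prompt_length`.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_subspace(prompt_ids, embeddings, k=3):
    """For each prompt token, keep the k vocabulary ids closest to it.

    Assumption: relevance is cosine similarity in embedding space.
    The token itself is always its own nearest neighbour, so the
    original prompt stays inside the restricted domain.
    """
    subspace = []
    for pid in prompt_ids:
        ranked = sorted(
            range(len(embeddings)),
            key=lambda j: cosine(embeddings[pid], embeddings[j]),
            reverse=True,
        )
        subspace.append(ranked[:k])
    return subspace


# Toy setup (hypothetical): 50 "words" with random 8-dimensional embeddings.
random.seed(0)
vocab_size, dim = 50, 8
embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]

prompt_ids = [4, 17, 23]  # a three-token "prompt"
candidates = build_subspace(prompt_ids, embeddings, k=3)

full_size = vocab_size ** len(prompt_ids)                 # unrestricted domain
restricted_size = math.prod(len(c) for c in candidates)   # compact subspace
print(full_size, restricted_size)
```

In this toy setting the domain shrinks from 125000 candidate prompts to 27. A real system would draw embeddings from the text encoder or another relevance source, and the subspaces would be regenerated for each user prompt.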