Technion – Israel Institute of Technology
The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. In this work we introduce FlowOpt - a zero-order (gradient-free) optimization framework that treats the entire diffusion/flow process as a black box, enabling optimization through the whole process without backpropagation through the model. Our method is both highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt's step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate the effectiveness of FlowOpt in the context of image editing, showcasing two use cases: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to the target text prompt. In both settings, our method achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods.
FlowOpt is a zero-order (gradient-free) optimization framework that enables optimization through the entire flow process, making it possible to perform image editing with pre-trained flow models.
FlowOpt treats the flow process as a black box function \(f\), as illustrated in the figure above.
This function takes an initial noise map \(\boldsymbol{z}_{1}\) and, optionally, a text prompt \(c\), and generates the image \(\boldsymbol{z}_{0}\)
at the end of the flow path, i.e., \(\boldsymbol{z}_{0} = f(\boldsymbol{z}_{1}, c)\).
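To make the black-box view concrete, here is a minimal sketch of such a function \(f\): a simple Euler integration of a learned flow ODE from noise to image. The names `velocity_model` and `flow_sample`, as well as the Euler discretization and step count, are illustrative assumptions and stand in for a pretrained flow model such as FLUX or SD3, not the exact sampler used in the paper.

```python
import torch

def flow_sample(velocity_model, z1, c, num_steps=8):
    """Black-box flow sampler z0 = f(z1, c).

    Integrates the learned ODE from t=1 (noise) to t=0 (image) with a
    simple Euler scheme. `velocity_model(z, t, c)` is a stand-in for a
    pretrained flow-matching network; the discretization is illustrative.
    """
    z = z1
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_model(z, t_cur, c)   # predicted velocity dz/dt
        z = z + (t_next - t_cur) * v      # Euler step toward t=0
    return z                              # z0 = f(z1, c)
```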
For any given source image \(\boldsymbol{y}\) one wishes to edit, FlowOpt can be used to solve the optimization problem
\[ \boldsymbol{z}_1^* = \arg\min_{\boldsymbol{z}_1} \; \frac{1}{2} \left\lVert f(\boldsymbol{z}_1, c) - \boldsymbol{y} \right\rVert^2 \]
without using the gradients of \(f\). This optimization problem can be used for both
inversion (recovering the initial noise \(\boldsymbol{z}_{1}\) that causes the flow process to generate the image \(\boldsymbol{y}\) when the text \(c\) describes that image) and
direct editing (generating a modified image that is as similar as possible to \(\boldsymbol{y}\) but which conforms to a text prompt \(c\) describing a desired edit).
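The sketch below shows one way such a gradient-free optimization can look: the flow is evaluated as a black box and \(\boldsymbol{z}_1\) is updated with a simple residual step, which allows monitoring the loss and stopping early. This is an illustrative zero-order scheme under stated assumptions, not necessarily the exact FlowOpt update rule; `flow_fn`, `step_size`, and `num_iters` are hypothetical names and values.

```python
import torch

def zero_order_optimize(flow_fn, y, c, z1_init, step_size=0.5, num_iters=10):
    """Gradient-free optimization of z1 so that f(z1, c) matches y.

    `flow_fn` is a black-box sampler like `flow_sample` above. The
    residual-based update below is an illustrative zero-order step
    (no backpropagation through the model); the step size must be small
    enough, e.g. satisfying a sufficient condition like the one proved
    in the paper, for convergence.
    """
    z1 = z1_init.clone()
    for k in range(num_iters):
        with torch.no_grad():               # the flow is treated as a black box
            z0 = flow_fn(z1, c)             # current reconstruction
        residual = z0 - y
        loss = 0.5 * residual.pow(2).sum()
        print(f"iter {k}: loss = {loss.item():.4f}")  # monitor / early-stop
        z1 = z1 - step_size * residual      # zero-order residual step
    return z1
```

For inversion, one would call this with a prompt \(c\) describing the source image \(\boldsymbol{y}\); for direct editing, with the target prompt describing the desired edit, so that the result stays close to \(\boldsymbol{y}\) while conforming to \(c\).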
A key advantage of FlowOpt is that, unlike most inversion methods, it can work with a small number of diffusion timesteps.
Combined with the fact that only a small number of optimization iterations are required to minimize the loss, this translates to a total number of NFEs comparable to that of existing methods.
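For a rough sense of the cost: if each black-box evaluation of \(f\) uses \(T\) sampling timesteps and the optimization runs for \(K\) iterations, the total cost is \(K \cdot T\) NFEs. The numbers below are purely illustrative and not figures from the paper:
\[ \text{total NFEs} = K \cdot T, \qquad \text{e.g. } K = 10,\; T = 5 \;\Rightarrow\; 50 \text{ NFEs}, \]
which is on par with a single standard sampling run of a comparable number of steps.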
The following figures show the reconstruction quality of FlowOpt on the inversion task for both FLUX and Stable Diffusion 3 (SD3), compared to other methods as a function of the number of NFEs, where RMSE is averaged over a dataset:
See our paper for more details.
Intermediate samples obtained during our zero-order optimization. As the iterations progress, the reconstruction converges to the ground-truth image.
Bibtex
@article{TBD }