Abstract: Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various down-stream image synthesis tasks, ...