1 City University of Hong Kong, Hong Kong SAR. 2 Tianjin University, China.
[Paper Link]. Supplementary materials can be found in the arXiv version.
Recently, text-to-image denoising diffusion probabilistic models (DDPMs) have demonstrated impressive image generation capabilities and have also been successfully applied to image inpainting. However, in practice, users often require more control over the inpainting process beyond textual guidance, especially when they want to composite objects with customized appearance, color, shape, and layout. Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. Furthermore, our Uni-paint is based on pretrained Stable Diffusion and does not require task-specific training on specific datasets, enabling few-shot generalizability to customized images. We have conducted extensive qualitative and quantitative evaluations that show our approach achieves comparable results to existing single-modal methods while offering multimodal inpainting capabilities not available in other methods.
conda env create -f environment.yaml
conda activate ldm
Download the pretrained Stable Diffusion v1.4 checkpoint from here and place it at ckpt/sd-v1-4-full-ema.ckpt. Please refer to the official SD repo for more details.
Download the pre-computed CLIP text embedding (see paper Eq. 6 for an explanation) from OneDrive and place it at ckpt/clip_emb_normalized(49407x768).pth. Alternatively, you can skip this download: the code generates the file if it is not found, which may take several minutes.
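Before opening the notebooks or the demo, it can help to confirm that the files are where the instructions above expect them. The following is a minimal sketch, not part of the repository: the helper name check_setup is hypothetical, and only the two paths come from the instructions above.

```python
from pathlib import Path

# Paths taken from the setup instructions above.
SD_CKPT = Path("ckpt/sd-v1-4-full-ema.ckpt")
CLIP_EMB = Path("ckpt/clip_emb_normalized(49407x768).pth")


def check_setup(root="."):
    """Return a dict mapping each expected file to whether it exists under root.

    Hypothetical helper for illustration; it is not shipped with the code.
    """
    root = Path(root)
    return {str(p): (root / p).is_file() for p in (SD_CKPT, CLIP_EMB)}


if __name__ == "__main__":
    for path, found in check_setup().items():
        # A missing CLIP embedding is not fatal: per the instructions,
        # the code regenerates it when it is not found.
        status = "found" if found else "missing"
        print(f"{status:8s}{path}")
```

Run it from the repository root. A missing Stable Diffusion checkpoint has to be downloaded manually, while a missing CLIP embedding only means a longer first run.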
Example notebooks: inpaint.ipynb and inpaint_with_exemplar.ipynb.
We also provide an interactive Gradio demo for convenient use. Here are the step-by-step guidelines:
Launch the demo script gradio_demo/demo.py.
By default, the demo is served at http://127.0.0.1:7860/; open that address in your browser. If you are running the model on a server, you can forward the demo to your local browser with the command: ssh username@xxx.xxx.xxx.xxx -p 22 -L 7860:localhost:7860
Input image: In the top-left section, provide the input image and draw the mask area.
[Optional] Exemplar image: In the second column, provide an exemplar image and check the Enable exemplar box.
Initialize: Click the Initialize button (this sets up the model and prepares your inputs).
Finetune: Click the Finetune button to launch finetuning on your inputs. Please wait until finetuning is finished (it takes about 1 minute; the button changes from Finetuning... back to Finetune when it is done).
Inference:
Unconditional inpainting: uncheck all boxes (Enable text, Enable exemplar, Enable stroke), then click the Inference button.
Text-driven inpainting: check Enable text, input your text prompt, then click the Inference button.
Exemplar-driven inpainting: check Enable exemplar, then click the Inference button.
Stroke-driven inpainting: check Enable stroke; the masked input is then displayed below. Use the color brush tool to draw the color stroke within the black masked area, or upload your own stroke image (the background needs to be black). Finally, click the Inference button.
Mixed inpainting: for example, to perform text + stroke inpainting, check both the Enable text and Enable stroke boxes, uncheck the Enable exemplar box, input your text prompt and draw the color stroke, then click the Inference button.
Note: you can use the stroke blending timestep slider to adjust the realism-faithfulness trade-off (a larger value leads to a more realistic but less aligned result).
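To build intuition for this trade-off, the sketch below uses the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and prints how much of the stroke survives at different timesteps. This is an illustration under assumptions, not the repository's implementation: it assumes that a larger slider value corresponds to noising the stroke to a later timestep, and it uses the beta schedule values from the Stable Diffusion v1 configuration (0.00085 to 0.012, 1000 steps). The function names are hypothetical.

```python
import math


def alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear beta schedule.

    Betas are spaced linearly in sqrt space and then squared (assumed to
    match the Stable Diffusion v1 config; an assumption of this sketch).
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        prod *= 1.0 - beta
        out.append(prod)
    return out


def stroke_signal_and_noise(t, acp):
    """Weights (s, n) in x_t = s * stroke + n * eps at 0-indexed timestep t."""
    return math.sqrt(acp[t]), math.sqrt(1.0 - acp[t])


if __name__ == "__main__":
    acp = alphas_cumprod()
    for t in (100, 300, 500, 700, 900):
        s, n = stroke_signal_and_noise(t, acp)
        print(f"t={t:3d}  stroke weight={s:.3f}  noise weight={n:.3f}")
```

Under these assumptions, the stroke weight shrinks and the noise weight grows as the timestep increases. The model therefore has more freedom to produce a realistic result, and the output follows the drawn stroke less closely.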
Other notes:
If something goes wrong after checking the Enable stroke box, this might be caused by an unknown Gradio bug; checking and unchecking the Enable stroke box several times can solve the issue.
@inproceedings{unipaint,
author = {Yang, Shiyuan and Chen, Xiaodong and Liao, Jing},
title = {Uni-Paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model},
year = {2023},
isbn = {9798400701085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3581783.3612200},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {3190–3199},
location = {Ottawa ON, Canada},
series = {MM '23}
}
The code is built on LDM and Textual Inversion.