zer0int / CLIP-fine-tune

Fine-tuning code for CLIP models
MIT License

Prompt-Tuning for text-to-image diffusion models (especially the CLIP text encoder) #12

Closed: AHHHZ975 closed this issue 2 months ago

AHHHZ975 commented 2 months ago

Hi, I need to fine-tune a Stable Diffusion model (e.g. "runwayml/stable-diffusion-v1-5", "CompVis/stable-diffusion-v1-4", or similar) using the prompt-tuning method described in this paper: https://huggingface.co/papers/2104.08691.

Specifically, fine-tuning such a Stable Diffusion model with this approach would amount to freezing all of the SD model's parameters (the UNet, the VAE, and the CLIP text encoder) and training only new soft tokens that are prepended to the embeddings of the input text prompt before they are fed to the CLIP text encoder, in the same way as described in the prompt-tuning paper. So I was wondering whether there is already an implementation for this purpose here, or more generally any prompt-tuning implementation for text-to-image diffusion models, similar to the prompt-tuning implementation that exists in the PEFT library for language models. A rough sketch of what I have in mind is below.
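
To make this concrete, here is a minimal sketch of the idea, assuming PyTorch and the diffusers StableDiffusionPipeline. The SoftPrompt module, the number of virtual tokens, and the learning rate are hypothetical names/values of mine, and the step that feeds the concatenated embeddings through the frozen text encoder is only indicated in a comment, since the stock CLIPTextModel forward takes token ids rather than embeddings:

```python
import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline

# Load SD 1.5 and freeze every pretrained component: UNet, VAE, CLIP text encoder.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
for frozen in (pipe.unet, pipe.vae, pipe.text_encoder):
    frozen.requires_grad_(False)

tokenizer = pipe.tokenizer
text_encoder = pipe.text_encoder
embed_dim = text_encoder.config.hidden_size  # 768 for SD 1.5


class SoftPrompt(nn.Module):
    """Hypothetical prompt-tuning module: N learnable 'virtual token' vectors
    that get prepended to the (frozen) token embeddings of the real prompt."""

    def __init__(self, num_virtual_tokens: int = 20, dim: int = 768):
        super().__init__()
        self.virtual_tokens = nn.Parameter(torch.randn(num_virtual_tokens, dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, dim) -> (batch, N + seq_len, dim)
        batch = token_embeds.shape[0]
        virtual = self.virtual_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([virtual, token_embeds], dim=1)


soft_prompt = SoftPrompt(num_virtual_tokens=20, dim=embed_dim)

# Only the virtual-token embeddings receive gradients.
optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)

# Look up the frozen token embeddings for a real prompt, then prepend the
# learnable vectors. The total length has to stay within CLIP's 77-token
# context window. Pushing these embeddings through the CLIP text encoder
# (and on through the usual noise-prediction loss against the frozen UNet)
# would need a custom or patched forward pass, since the stock CLIPTextModel
# forward expects token ids, not embeddings.
ids = tokenizer("a photo of a cat", return_tensors="pt").input_ids
token_embeds = text_encoder.get_input_embeddings()(ids)  # frozen lookup
inputs_embeds = soft_prompt(token_embeds)
```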

I would appreciate any help/guidance on this. Thanks!

zer0int commented 2 months ago

Sorry, I have no experience with that - but if anything, I think it would be best to ask around on repos that actually work with fine-tuning entire text-to-image generative models (I'm only familiar with CLIP, not T5 or the rest of the stack!). Here are some great repos I can recommend from personal experience; some also have a Discord, if I remember right. Good luck with your project! 👍

https://github.com/bghira/SimpleTuner

https://github.com/kohya-ss/sd-scripts

https://github.com/ostris/ai-toolkit

AHHHZ975 commented 2 months ago

Thank you for the suggestion and for sharing these repos. I took a quick look and they're genuinely helpful. Thanks for the good wishes, and I wish you the best as well 🙏