zer0int / CLIP-fine-tune

Fine-tuning code for CLIP models
MIT License

Instructions on how to use with huggingface/diffusers? #5

Closed voodoohop closed 5 months ago

voodoohop commented 5 months ago

As the title states: is it possible to drop this in to any Stable Diffusion diffusers pipeline?

zer0int commented 5 months ago

I have just uploaded the file "ViT-L-14-GmP-ft-TE-only-HF-format.safetensors" to huggingface: https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/tree/main It uses the exact same format (naming, dtype / precision) as the ViT-L text encoder (TE) that is wrapped inside e.g. SDXL. So it should now be possible to wrap it up with the U-Net, the VAE, and the second, big CLIP-G text encoder, and fine-tune the whole thing (assuming that's what you intend to do).

However, I have only tested this for inference (generating images with SDXL using the above TE instead of the standard ViT-L). If you encounter any freak accidents / unexpected glitches, please let me know!
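
For inference, swapping the fine-tuned TE into an SDXL diffusers pipeline could look roughly like the sketch below. This is an assumption-laden illustration, not code from this repo: the local weights path is hypothetical, and the stock `openai/clip-vit-large-patch14` config is assumed to match the fine-tuned state dict's naming, per the comment above.

```python
def load_finetuned_clip_text_encoder(
    weights_path="ViT-L-14-GmP-ft-TE-only-HF-format.safetensors",  # hypothetical local path
):
    """Load the fine-tuned ViT-L/14 text encoder (HF format) from safetensors."""
    # Imports kept inside the function so the sketch stays importable
    # even where transformers / safetensors are not installed.
    from safetensors.torch import load_file
    from transformers import CLIPTextModel

    # Start from the stock OpenAI ViT-L/14 text encoder, then overwrite
    # its weights with the fine-tuned state dict (same naming / dtype).
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder.load_state_dict(load_file(weights_path))
    return text_encoder


if __name__ == "__main__":
    import torch
    from diffusers import StableDiffusionXLPipeline

    # SDXL carries two text encoders; only the first (ViT-L) is replaced here,
    # the big CLIP-G one stays as shipped with the base model.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        text_encoder=load_finetuned_clip_text_encoder(),
        torch_dtype=torch.float16,
    ).to("cuda")
    pipe("a photo of a cat").images[0].save("cat.png")
```

Fine-tuning the whole assembled pipeline would build on the same idea, but as noted above, only inference has been tested.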

voodoohop commented 5 months ago


Actually it's just for inference at the moment