zer0int / CLIP-fine-tune

Fine-tuning code for CLIP models
MIT License

CLIP-G Training? #1

Closed · bash-j closed this 6 months ago

bash-j commented 6 months ago

Hi, thanks for providing this training script for the CLIP-L model. Is it possible to modify it to train CLIP-G? I tried, but the clip library doesn't include CLIP-G. Then I tried using open_clip, but it was using about 48 GB of VRAM just to do batch size 1, and I'm not even sure the modifications I made were working, haha.

While training CLIP-L does improve the model, wouldn't it make sense to also train CLIP-G on the same data to improve it even more? I found when fine-tuning the full SDXL model that CLIP-G picks up artist styles really well compared to CLIP-L.
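
(For reference, this is roughly what I mean by "using open clip"; a minimal sketch, with model/pretrained tags taken from the open_clip docs, which may not be exactly the checkpoint you'd want:)

```python
import open_clip
import torch

# "CLIP-G" as used by SDXL is ViT-bigG-14 in open_clip; the pretrained tag below is
# the LAION-2B checkpoint. Full fine-tuning means every one of its ~2.5B parameters
# gets a gradient and optimizer state, which is where the VRAM goes.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14",
    pretrained="laion2b_s39b_b160k",
)
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
model = model.cuda().train()
```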

zer0int commented 6 months ago

Hey there! What you have observed, unfortunately, sounds about right, as far as I know.

If the following model, CLIP ViT-L/14, requires 20 GB of VRAM for a given fine-tuning:

Vision Transformer - Number of Layers: 24, with 4096 MLP features / layer
(attn): MultiheadAttention -> (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
MLP: (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) -> (c_proj): Linear(in_features=4096, out_features=1024, bias=True)

Text Transformer - Number of Layers: 12, with 3072 MLP features / layer
(attn): MultiheadAttention -> (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
MLP: (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True) -> (c_proj): Linear(in_features=3072, out_features=768, bias=True)

How much VRAM will this CLIP ViT-G/14 model of the same architecture take? Model:

Vision Transformer - Number of Layers: 40, with 6144 MLP features / layer
(attn): MultiheadAttention -> (out_proj): NonDynamicallyQuantizableLinear(in_features=1408, out_features=1408, bias=True)
MLP: (ln_2): LayerNorm((1408,), eps=1e-05, elementwise_affine=True) -> (c_proj): Linear(in_features=6144, out_features=1408, bias=True)

Text Transformer - Number of Layers: 24, with 4096 MLP features / layer
(attn): MultiheadAttention -> (out_proj): NonDynamicallyQuantizableLinear(in_features=1024, out_features=1024, bias=True)
MLP: (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True) -> (c_proj): Linear(in_features=4096, out_features=1024, bias=True)

The memory usage of the self-attention and feedforward layers scales roughly quadratically with the feature dimension (ouch!), and ViT-G also has more layers on top of that.

Anyway, skipping right to the answer: if CLIP ViT-L/14 requires 20 GB of VRAM for a given fine-tune, CLIP ViT-G/14 will require ~500 GB of VRAM for the same fine-tune.
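
To make that concrete, here is a very rough way to estimate just the optimizer-state memory of a full fine-tune with AdamW in fp32 (weight + gradient + two Adam moments ≈ 16 bytes per parameter); the parameter counts are ballpark figures, and activations come on top and grow with batch size:

```python
# Rough full-fine-tune memory estimate (AdamW, fp32): per trainable parameter you hold
# the weight, its gradient, and two Adam moment tensors -> ~16 bytes/param.
# Activations are NOT included; they grow with batch size and sequence length.

def adamw_state_gb(n_params: float) -> float:
    return n_params * 16 / 1024**3

for name, n_params in [("CLIP ViT-L/14", 0.43e9), ("CLIP-G (SDXL's ViT-bigG-14)", 2.5e9)]:
    print(f"{name}: ~{adamw_state_gb(n_params):.0f} GB for weights + grads + optimizer state alone")
# CLIP ViT-L/14: ~6 GB, CLIP-G: ~37 GB -- consistent with the ~48 GB you saw at batch
# size 1 once activations are added, and that's before any useful batch size.
```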

So yeah, if you are willing to throw out your wardrobe and replace it with a small GPU compute cluster, it would absolutely make sense to fine-tune that CLIP in which the "G" stands for "intimidatingly GIGANTIC".

Sorry! 🙃

zer0int commented 6 months ago

Also, due to CLIP's contrastive learning (which basically pushes dissimilar things apart in the embedding space and pulls similar things together), a small batch size is really bad. It's even bad for ViT-L on 24 GB of VRAM, with overfitting being, in essence, a pre-determined outcome. If you look at research papers, they often propose CLIP fine-tuning with batch sizes in the range of 512 to 2048, haha. Not some batch size of 48 that you might squeeze into your high-end consumer GPU...
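
For intuition, this is (a minimal sketch of) the contrastive loss CLIP trains with; each image only has batch_size - 1 negatives to be pushed away from, which is why tiny batches give the loss almost nothing to work with:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    # image_emb, text_emb: [batch, dim], already L2-normalized
    logits = logit_scale * image_emb @ text_emb.t()   # [batch, batch] matrix of similarities
    targets = torch.arange(len(logits), device=logits.device)
    # each image must pick out its own caption among (batch - 1) negatives, and vice versa
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```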

And what happens in a training gone wrong (here: just a ViT-L overfitting massively) is, in essence, some super-dense black hole forming in the embeddings, into which everything collapses. So everything is similar to everything (cosine similarity) and the model is ruined. Now, in ViT-L, a well-ah-okay batch size and careful fine-tuning can prevent such dramatic outcomes (which I am hoping to achieve with the code I provided!). But in ViT-G, with a batch size of 1 or 2, you could pretty much expect this to happen, no matter what. Here's a ruined ViT-L fine-tune:

[image: PCA plot of the embeddings from a ruined ViT-L fine-tune]
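
(A quick way to check for this kind of collapse, besides eyeballing a PCA plot, is the average pairwise cosine similarity between embeddings of unrelated inputs; a rough sketch:)

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_pairwise_cosine(embeddings: torch.Tensor) -> float:
    # embeddings: [n, dim] for a set of *unrelated* images or texts
    e = F.normalize(embeddings, dim=-1)
    sims = e @ e.t()
    off_diag = sims[~torch.eye(len(e), dtype=torch.bool, device=e.device)]
    return off_diag.mean().item()

# A healthy model keeps this clearly below 1.0 for unrelated inputs; a collapsed one
# creeps toward 1.0, i.e. "everything is similar to everything".
```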

bash-j commented 6 months ago

Thanks for the explanation! I guess I'm sticking with existing tools like kohya, onetrainer, etc. to train CLIP-G, which I have to do very slowly so it doesn't ruin the model. Would be awesome to have this trainer in onetrainer. 🙂

How did you make that PCA chart?

I'm currently running the fine-tune on 90k images and it seems to be going well so far. A lot of the gradient norm charts have a single downward spike at a different spot each epoch.

[image: gradient norms, epoch 21 (log scale)]
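
(In case it's useful to anyone else: the per-parameter gradient norms behind a chart like this can be collected with a few lines after each backward pass; a rough sketch, not necessarily the repo's exact logging code:)

```python
import torch

def log_grad_norms(model: torch.nn.Module, step: int, history: dict) -> dict:
    # call after loss.backward() and before optimizer.step()
    for name, p in model.named_parameters():
        if p.grad is not None:
            history.setdefault(name, []).append((step, p.grad.norm().item()))
    return history
```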

zer0int commented 6 months ago

Yeah, fine-tuning the big CLIP-G together with the U-Net seems to be the best option if you are limited to non-professional GPUs / 24 GB VRAM, as that kind of "forces regularization" on CLIP like no other method does (when fine-tuned standalone, CLIP gets worse with high weight decay, and it just stops learning when you clip its gradients (sounds like a pun, but isn't), etc.).

So, indeed, I would go with slowly fine-tuning CLIP-G together with the U-Net (potentially continuing to train the U-Net separately with a frozen CLIP before CLIP gets ruined), then fine-tuning ViT-L standalone, and puzzling it all back together for inference.
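
The "puzzling back together" part could look roughly like this with diffusers, assuming you have already converted the fine-tuned encoders to Hugging Face format (the local paths are placeholders):

```python
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import CLIPTextModel, CLIPTextModelWithProjection

# placeholder paths to the converted, fine-tuned text encoders
te_l = CLIPTextModel.from_pretrained("./finetuned-vit-l-text-encoder", torch_dtype=torch.float16)
te_g = CLIPTextModelWithProjection.from_pretrained("./finetuned-clip-g-text-encoder", torch_dtype=torch.float16)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    text_encoder=te_l,      # the ViT-L fine-tuned standalone
    text_encoder_2=te_g,    # the CLIP-G trained together with the U-Net
    torch_dtype=torch.float16,
).to("cuda")
```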

Other than that, it's all better with a better dataset; multiple labels accurately describing the same image, like "a photo of fruit sitting in a bowl on a table" vs. "a bowl with apples and bananas are on a round wooden table", chosen at random during training, will be nice "textual noise" to prevent CLIP from going berserk into overfitting. A great task to delegate to GPT-4o, too! 👍
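
In code, the random-caption trick boils down to something like this (a minimal sketch; the JSON layout and class name are assumptions, not necessarily what the repo ships):

```python
import json
import random

from PIL import Image
from torch.utils.data import Dataset

class MultiCaptionDataset(Dataset):
    """Each image maps to a list of captions; one is drawn at random every time."""
    def __init__(self, labels_json, preprocess, tokenize):
        with open(labels_json) as f:
            # {"path/to/image.jpg": ["caption A", "caption B", ...], ...}
            self.items = list(json.load(f).items())
        self.preprocess, self.tokenize = preprocess, tokenize

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, captions = self.items[idx]
        caption = random.choice(captions)          # the "textual noise"
        image = self.preprocess(Image.open(path).convert("RGB"))
        return image, self.tokenize([caption])[0]
```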

PS: I just added the PCA script to the repo. =)

bash-j commented 6 months ago

You're awesome, thank you!

minienglish1 commented 5 months ago

> Other than that, it's all better with a better dataset; multiple labels accurately describing the same image, like "a photo of fruit sitting in a bowl on a table" vs. "a bowl with apples and bananas are on a round wooden table", chosen at random during training, will be nice "textual noise" to prevent CLIP from going berserk into overfitting. A great task to delegate to GPT-4o, too! 👍

Trying to figure out how to correctly structure my captions. From the examples, it looks like the captions were broken into sections/tags. I already have a 25k-image dataset with 5 captions (cog, llava, etc.) per image. I would want to break down each caption into sections, combine them into a list, and append that to dataset-labels.json, correct? Or would it be incorrect to just make a list of the 5 captions to use for training?

zer0int commented 5 months ago

Yes, each unique caption / way of describing an image should be separate. For example, the COCO-SPRIGHT dataset has the original short COCO labels as well as a long, spatially aware (SPatially RIGHT = SPRIGHT) caption. In this example, a random choice between the two would be made during training:

[image: COCO-SPRIGHT labels example]
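
The general shape of such a labels file is roughly the following (illustrative values only, not the actual COCO-SPRIGHT entries), reusing the fruit-bowl example from above:

```python
# hypothetical dataset-labels.json contents: one image -> a list of alternative captions,
# e.g. a short COCO-style label plus a long SPRIGHT-style spatial caption
labels = {
    "images/0001.jpg": [
        "a photo of fruit sitting in a bowl on a table",
        "a bowl with apples and bananas on a round wooden table",
    ],
}
```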

If you just want to train on the short labels, the easiest way is to make sure the "if" condition is not met:

[image: code snippet for training on the short labels only]

Or, for a CLIP+BLIP (CLIP interrogator) caption set:

[image: labels example for a CLIP+BLIP (CLIP Interrogator) caption set]

I hope this helps!

minienglish1 commented 5 months ago

Makes sense. Thanks for the help!