This repo is for fine-tuning CLIP from the command line. It does not add custom nodes to ComfyUI; however, you can easily use your fine-tune with ComfyUI:
First, fine-tune with ft-B-train-OpenAI-CLIP-ViT-L-14.py
Or, try the experimental and potentially superior exp-ft-B-GmP-finetune-OpenAI-ViT-L-14.py
If you used "exp-ft-B-GmP", use this to convert the model: exp-ft-C-convert-GmP-back-to-weight.py
Then, for both fine-tune scripts, use ft-C-convert-for-SDXL-comfyUI-OpenAI-CLIP.py
Now you have a state_dict you can plug into ComfyUI for use with SD / SDXL!
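Optional sanity check before dropping the file into ComfyUI - a minimal sketch (the filename is hypothetical; use whatever the conversion script actually wrote):

import torch

# Hypothetical output filename - adjust to what ft-C-convert-for-SDXL-comfyUI-OpenAI-CLIP.py produced.
sd = torch.load("ViT-L-14-finetune-for-comfyui.pt", map_location="cpu")
print(type(sd), len(sd))    # should be a plain dict of parameter tensors
print(list(sd.keys())[:5])  # peek at a few parameter names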
Added a new model saver: Saves either as GmP + full model object (default, legacy behavior)
Optional conversion to .weight (converting back with extra script no longer needed)
Option to save as full model, state_dict, or torch.jit.trace model (or all of these)
Check the top of the code, set True / False as desired to enable / disable!
Use test-models-new-saver.py to test the various flavors of saved models.
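For illustration, a minimal sketch of the three save flavors using the OpenAI clip package (the filenames and dummy inputs are my own assumptions, not the script's actual code; tracing the full model may emit TracerWarnings):

import torch
import clip

model, preprocess = clip.load("ViT-L/14", device="cpu")
image = torch.randn(1, 3, 224, 224)           # dummy image batch
text = clip.tokenize(["a photo of a cat"])    # dummy text batch

torch.save(model, "clip-ft-full.pt")                      # full model object (legacy behavior)
torch.save(model.state_dict(), "clip-ft-state_dict.pt")   # state_dict only
traced = torch.jit.trace(model, (image, text))            # traced TorchScript model
traced.save("clip-ft-traced.pt")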
Added folder Convert-for-HuggingFace-Spaces-etc
Includes the convert_clip_original_pytorch_to_hf.py script from HuggingFace + configuration .json files.
Includes optional code to subsequently extract the Text Encoder only model (e.g. for Flux.1 guidance)
Includes optional code to add metadata {"format": "pt"} - use it in case you get an error about 'pt'! (A minimal sketch of this follows below.)
See how-to-use.txt & code comments for details.
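For reference, adding that metadata boils down to something like this (the filename is hypothetical; the folder's scripts handle this for you):

from safetensors.torch import load_file, save_file

path = "ViT-L-14-finetune-TE-only.safetensors"  # hypothetical filename
tensors = load_file(path)
save_file(tensors, path, metadata={"format": "pt"})  # re-save with the 'pt' format tag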
Added a-loss-to-penalize-overfit-via-entropy.py
A custom loss with an entropy penalty term that penalizes over-confidence (overfit).
For a diverse and general (!) dataset, the results of fine-tuning are good, but slightly worse than without the entropy penalty (fine-tune on COCO-SPRIGHT), using a lambda_entropy factor of 1. Actual example from log_train.txt:
Epoch n:
Validation Acc: 0.9012, Validation F1: 0.8749
Training Loss: 0.6043, Validation Loss: 1.1853
Epoch n+1:
Validation Acc: 0.8942, Validation F1: 0.8652 <- decrease
Training Loss: 0.6018, Validation Loss: 1.1894 <- increase
Now, for the diverse dataset, this was overtraining, not overfitting; the model had already converged (good Acc/F1, low loss). In this case, early stopping (or saving checkpoints every epoch, then hand-selecting the best - here, an earlier - one) is recommended. However, with the entropy penalty I did not observe such an uptick over a few epochs of overtraining (albeit the model converged at a less ideal Validation Acc: 0.8902, Validation F1: 0.8600). So, give it a try if you see CLIP do this with your dataset (the example below is very extreme; better to check log_train.txt and catch it early!):
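For reference, a minimal sketch of what such an entropy (confidence) penalty can look like - the actual a-loss-to-penalize-overfit-via-entropy.py may differ in details:

import torch
import torch.nn.functional as F

def contrastive_loss_with_entropy_penalty(logits_per_image, logits_per_text, lambda_entropy=1.0):
    # Standard CLIP-style cross-entropy over the image<->text similarity matrix
    labels = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    ce = (F.cross_entropy(logits_per_image, labels) +
          F.cross_entropy(logits_per_text, labels)) / 2
    # Entropy of the image->text softmax distribution; low entropy = over-confident predictions
    probs = F.softmax(logits_per_image, dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    # Subtracting the (scaled) entropy penalizes over-confidence, i.e. discourages overfit
    return ce - lambda_entropy * entropy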
ft-C-convert-with-org-dtype-fp16.py -> Save with mixed precision as per OpenAI, model size ~900 MB
ft-C-convert-to-safetensors.py -> Should be obvious, but check code comments for details. :-)
exp-acts-ft-SMOOTH-finetune-OpenAI-CLIP-ViT-L-14-GmP.py 🥳
exp-acts-ft-finetune-OpenAI-CLIP-ViT-L-14-GmP-manipulate-neurons.py
It even further improved the results for the high-quality COCO-40K-SPRIGHT dataset to >91% accuracy, trained on 1x RTX 4090 🤯:
You can download this model on my HuggingFace if you don't want to reproduce the fine-tune with the provided code. :-)
Technical / code summary of changes:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Original contrastive loss (hard one-hot targets via nn.CrossEntropyLoss):
class ContrastiveLoss(nn.Module):
def __init__(self, temperature=0.07):
super(ContrastiveLoss, self).__init__()
self.temperature = temperature
self.criterion = nn.CrossEntropyLoss()
def forward(self, logits_per_image, logits_per_text):
# Normalize the features to avoid overflow or underflow
logits_per_image = F.normalize(logits_per_image, p=2, dim=1)
logits_per_text = F.normalize(logits_per_text, p=2, dim=1)
# Calculate logits
logits = torch.matmul(logits_per_image, logits_per_text.t()) / self.temperature
labels = torch.arange(logits.size(0), device=logits.device)
# Calculate loss as the mean of the two cross-entropy losses
loss_img = self.criterion(logits, labels)
loss_txt = self.criterion(logits.t(), labels)
return (loss_img + loss_txt) / 2
# Changed contrastive loss: label smoothing instead of hard one-hot targets:
class ContrastiveLoss(nn.Module):
def __init__(self, temperature=0.07, smoothing=0.1):
super(ContrastiveLoss, self).__init__()
self.temperature = temperature
self.smoothing = smoothing
def forward(self, logits_per_image, logits_per_text):
# Normalize the features to avoid overflow or underflow
logits_per_image = F.normalize(logits_per_image, p=2, dim=1)
logits_per_text = F.normalize(logits_per_text, p=2, dim=1)
# Calculate logits
logits = torch.matmul(logits_per_image, logits_per_text.t()) / self.temperature
labels = torch.arange(logits.size(0), device=logits.device)
# Apply label smoothing
N = logits.size(0)
smoothed_labels = torch.full_like(logits, self.smoothing / (N - 1))
smoothed_labels.scatter_(1, labels.unsqueeze(1), 1.0 - self.smoothing)
# Calculate loss manually using log-softmax and smoothed labels
log_probs = F.log_softmax(logits, dim=1)
loss_img = -(smoothed_labels * log_probs).sum(dim=1).mean()
log_probs = F.log_softmax(logits.t(), dim=1)
loss_txt = -(smoothed_labels * log_probs).sum(dim=1).mean()
return (loss_img + loss_txt) / 2
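Usage is essentially the same for both versions (drop the smoothing argument for the original); a quick sketch with dummy logits - in training these come from CLIP's forward pass:

criterion = ContrastiveLoss(temperature=0.07, smoothing=0.1)
logits_per_image = torch.randn(8, 8, requires_grad=True)  # batch of 8 image-text pairs
logits_per_text = logits_per_image.t()
loss = criterion(logits_per_image, logits_per_text)
loss.backward()
print(loss.item())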
⬇️ Download my best-performing fine-tune (see Update 12/June/24) here:
Background: I identified an "adverb neuron" in the vision transformer of ViT-L/14. When its activation value is scaled by a factor of 1000, CLIP's "opinion" about any image consists mainly of adverbs (see link above for code & details). I scaled the activation value of predominantly this penultimate-layer neuron by x1000 during fine-tuning on the usual general dataset (COCO-40K-SPRIGHT), expecting either overfit / "adverb CLIP" or destruction of the model. Initially, training seemed to converge toward the latter, with Validation Accuracy and Validation F1 in the 0.0X range while gradients truly exploded (reached inf) even after Epoch 0, despite an LR of 1e-7. As the scheduler kicked in and increased the learning rate up to 5e-7, a dramatic drop in training and validation loss was observed, with an immediate jump to Validation Acc 0.8 / Val F1 0.75, further improving with every additional epoch. The final model has an unprecedented ImageNet / ObjectNet accuracy of ~0.90 (original pre-trained model / OpenAI's CLIP: ~0.85). Apparently, the model compensated for those erratic, over-activated neurons and in turn found a better solution / minimum for generalizing text-image contrastive learning. It unexpectedly turned out to be my best-performing fine-tune thus far. I am sharing the code to reproduce the results (or to experimentally modify other neuron activations) as-is.
Preliminary results of GmP-CLIP for SDXL-TE repair fine-tune:
Results: The above model, used as SDXL TE again (center samples):
Added scripts to puzzle together a full CLIP text-vision transformer from the SDXL text encoder .safetensors file, as per this issue. See the readme in "merge-SDXL-TE-into-full-CLIP-model-object" for details. You can use this (full model object .pt) with all of my scripts as usual, but beware: if you fine-tuned the TE in SDXL (e.g. with kohya), it will be unaligned / misaligned with the vision transformer and thus with latent space.
In other words, the model will be completely bonkers (see below), but you can try fine-tuning it "back into alignment" (freeze TE, fine-tune with careful LR). Good luck!
"Normal" CLIP MLP (multi-layer perceptron):
(mlp): Sequential(
  (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  (gelu): QuickGELU()
  (c_proj): Linear(in_features=4096, out_features=1024, bias=True)
)
 |
 |-- visual.transformer.resblocks.0.mlp.c_fc.weight
 |-- visual.transformer.resblocks.0.mlp.c_fc.bias
 |
 |-- visual.transformer.resblocks.0.mlp.c_proj.weight
 |-- visual.transformer.resblocks.0.mlp.c_proj.bias
GmP CLIP MLP:
Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude (a minimal GeometricLinear sketch follows after the diagram below)
(mlp): Sequential(
  (c_fc): GeometricLinear()
  (gelu): QuickGELU()
  (c_proj): GeometricLinear()
)
 |
 |-- visual.transformer.resblocks.0.mlp.c_fc.r
 |-- visual.transformer.resblocks.0.mlp.c_fc.theta
 |-- visual.transformer.resblocks.0.mlp.c_fc.bias
 |
 |-- visual.transformer.resblocks.0.mlp.c_proj.r
 |-- visual.transformer.resblocks.0.mlp.c_proj.theta
 |-- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same thing for [text] transformer.resblocks)
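The repo's actual GeometricLinear lives in the GmP fine-tune scripts; purely to illustrate the decomposition above, here is a minimal sketch of how such a module could look (an assumption, not the repo's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Illustrative sketch only - not the repo's actual implementation."""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.xavier_uniform_(w)
        norm = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norm)          # radial component: per-row magnitude
        self.theta = nn.Parameter(w / norm)  # angular component: direction
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x):
        # Re-compose the weight matrix from (r, theta) on every forward pass
        weight = self.r * F.normalize(self.theta, p=2, dim=1)
        return F.linear(x, weight, self.bias)

    @classmethod
    def from_linear(cls, linear):
        # Decompose a pre-trained nn.Linear (e.g. mlp.c_fc) into r and theta
        gm = cls(linear.in_features, linear.out_features, bias=linear.bias is not None)
        with torch.no_grad():
            norm = linear.weight.norm(dim=1, keepdim=True)
            gm.r.copy_(norm)
            gm.theta.copy_(linear.weight / norm)
            if linear.bias is not None:
                gm.bias.copy_(linear.bias)
        return gm

# Quick check: the recomposed weights reproduce the original layer's output
x = torch.randn(2, 1024)
fc = nn.Linear(1024, 4096)
gfc = GeometricLinear.from_linear(fc)
print(torch.allclose(fc(x), gfc(x), atol=1e-5))  # True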
Example of catastrophic overfitting: the embeddings collapse and "everything is similar to everything" (cosine similarity ~1). When you see something like this, decrease the learning rate, increase the batch size, and/or make a better dataset with multiple text labels to choose from:
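A quick way to check for this collapse is the mean pairwise cosine similarity of a validation batch's embeddings - a minimal sketch (the helper name is my own, not from this repo):

import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_offdiag_cosine(embeddings):
    # embeddings: [batch, dim] image or text embeddings from a validation batch
    e = F.normalize(embeddings, p=2, dim=1)
    sim = e @ e.t()  # pairwise cosine similarity
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return sim[mask].mean().item()  # values near 1.0 indicate collapsed embeddings

# Healthy random embeddings give a value near 0
print(mean_offdiag_cosine(torch.randn(32, 768)))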
Optimized for: ViT-L/14 (the Text Encoder of SD / SDXL) + "I have 1 NVIDIA GPU with 24 GB VRAM available...". But you can train any OpenAI/CLIP model with this (just remember to tweak batch_size etc. for smaller models, if applicable!).