zer0int / CLIP-fine-tune

Fine-tuning code for CLIP models
MIT License

After fine-tuning, how to correctly save the text encoder for use with StableDiffusionXLPipeline.from_pretrained? #15

Open minienglish1 opened 5 days ago

minienglish1 commented 5 days ago

First, thanks for all your work on this repo, it's great stuff!

After fine-tuning, how do I correctly save the text encoder for use with CLIPTextModel.from_pretrained and StableDiffusionXLPipeline.from_pretrained?

  1. I trained CLIP with "exp-ft-B-GmP-finetune-OpenAI-ViT-L-14.py".
  2. Converted it back to weights with "exp-ft-C-convert-GmP-back-to-weight.py".

After converting back, I tried:

text_encoder = original_model.transformer
text_encoder_state_dict = text_encoder.state_dict()
torch.save(text_encoder_state_dict, 'ft-checkpoints/text_encoder_state_dict.pth')

But when I loaded the state_dict onto the text_encoder from the SDXL pipeline, I got:

[rank0]: RuntimeError: Error(s) in loading state_dict for CLIPTextModel:
[rank0]: Missing key(s) in state_dict: ...

Any additional information you want to provide would be appreciated. My goal is to use the fine-tuned CLIP-ViT-L when I train the SDXL UNet (and maybe CLIP-G), then save the final fine-tuned model in diffusers/safetensors format. I'm using a custom accelerate FSDP script I wrote to train SDXL.

Thanks for the great repo!

Also, have you thought about using accelerate FSDP cpu_offload to increase the batch size? I ran some quick tests on my SDXL trainer, and AdaBelief works fine with FSDP cpu_offload and sharding the UNet. It should only take some easy changes to your script to add cpu_offload for an increased batch size & sharding for multi-GPU training. Once I can get the fine-tuned CLIP-ViT-L working in my SDXL training script, I'll test out adding FSDP to your CLIP training script.

zer0int commented 5 days ago

Yeah, the naming of the keys and the way they are converted / expected for HuggingFace (diffusers/transformers) is pretty "delicate"; fortunately, I recently discovered that the HF team updated their conversion script, so it works with recent versions of "transformers"!

https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/convert_clip_original_pytorch_to_hf.py

Next, to extract the text encoder only and to ensure I include the correct keys, I used a trick:

  1. Download a CLIP-L model.safetensors from HuggingFace, e.g. the SDXL CLIP-L text encoder. This ensures it's the "right CLIP".
  2. Compare the fine-tune to this "right CLIP" model.safetensors, keep all keys that are present in the original, discard the rest, and save:
import torch
from safetensors.torch import load_file, save_file

# Load the original and fine-tuned models
original_state_dict = load_file("model.safetensors")
finetuned_state_dict = load_file("finetune.safetensors")

# Create a new dictionary for the text encoder
filtered_state_dict = {k: v for k, v in finetuned_state_dict.items() if k in original_state_dict}

# Save the filtered state dictionary
save_file(filtered_state_dict, "finetune_TE-only.safetensors")

# Load the saved text encoder model
filtered_loaded_state_dict = load_file("finetune_TE-only.safetensors")

# Compare two model state dictionaries by key, shape, and dtype.
def compare_models(model1, model2):
    print(f"{'Key':<50} {'Model 1 Shape':<30} {'Model 2 Shape':<30} {'Match'}")
    print("-" * 130)
    for key in model1.keys() | model2.keys():
        shape1 = model1.get(key, None)
        shape2 = model2.get(key, None)
        if shape1 is not None and shape2 is not None:
            match = shape1.shape == shape2.shape and shape1.dtype == shape2.dtype
            print(f"{key:<50} {str(shape1.shape):<30} {str(shape2.shape):<30} {match}")
        else:
            print(f"{key:<50} {'N/A' if shape1 is None else str(shape1.shape):<30} "
                  f"{'N/A' if shape2 is None else str(shape2.shape):<30} {'No'}")

# Perform comparison
compare_models(original_state_dict, filtered_loaded_state_dict)
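
For the original question (using the fine-tune with StableDiffusionXLPipeline.from_pretrained), here is a minimal sketch, assuming the filtered keys already follow the HF text_model.* naming (i.e. the fine-tune went through the HF conversion script first); the repo ID and file names are just placeholders:

import torch
from safetensors.torch import load_file
from transformers import CLIPTextModel
from diffusers import StableDiffusionXLPipeline

# Start from the stock SDXL CLIP-L text encoder, then overwrite it with the filtered fine-tune
text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder"
)
state_dict = load_file("finetune_TE-only.safetensors")
missing, unexpected = text_encoder.load_state_dict(state_dict, strict=False)
print("missing:", missing)        # ideally empty
print("unexpected:", unexpected)  # ideally empty

# Hand the patched text encoder to the pipeline
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    text_encoder=text_encoder.to(torch.float16),
    torch_dtype=torch.float16,
)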

PS: I'm curious about your results with multi-GPU training for CLIP alone; technically, a larger batch_size (larger than 24 GB VRAM allows) should be great for CLIP. Geometric Parametrization manages to offset the otherwise catastrophic effects of tiny batch sizes, but 'more' should still be better, in theory - I never tried, I only have 1 GPU.

Wishing you much success! :)

minienglish1 commented 5 days ago

Thanks for the response!

I tried "convert_clip_original_pytorch_to_hf.py", but kept getting EOF errors, even when I trained with "ft-B-train-OpenAI-CLIP-ViT-L-14_test_0.py". So I gave up and instead modified "ft-B-train-OpenAI-CLIP-ViT-L-14_test_0.py" to use transformers. Training loss looks similar and I can save with save_pretrained. Just tested it out with sdxl pipeline, works easily. It's enough to start testing CLIP training. If needed later, I'll try to figure out what's going on with converting to huggingface.

Also, accelerate FSDP cpu_offload works with 1 GPU, and with little impact on training speed. It's really easy to set up if you want to try:

  1. Modify the script to use accelerate (like 10 lines of code for the accelerator object, gradient_accumulation, and mixed_precision); just follow the accelerate tutorial - a rough sketch is below.
  2. Configure accelerate (for fsdp_config, use fsdp_sharding_strategy: NO_SHARD & fsdp_offload_params: true).
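
For step 1, the "10 lines" are roughly the following (a minimal sketch with dummy stand-ins, not the actual training script):

import torch
from torch import nn
from accelerate import Accelerator

# dummy stand-ins so the sketch is self-contained; swap in the real CLIP model/dataloader
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_dataloader = torch.utils.data.DataLoader(torch.randn(32, 8), batch_size=4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=2)

# with FSDP, prepare the model first, then everything else (see further down in the thread)
model = accelerator.prepare(model)
optimizer, train_dataloader, scheduler = accelerator.prepare(optimizer, train_dataloader, scheduler)

for batch in train_dataloader:
    with accelerator.accumulate(model):
        loss = model(batch).pow(2).mean()  # placeholder for the real CLIP loss
        accelerator.backward(loss)         # replaces loss.backward() / GradScaler logic
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()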

When I get some more time to sit down and test out CLIP training, I'll try to add accelerate FSDP in. If you try to add it yourself before then, and run into any problems, just let me know.

zer0int commented 4 days ago

Thank you for the tip, too! Modifying the script to use accelerate was the easy part - but how on earth did you get to configure it at all? :-)

  1. accelerate config -> edit -> that's not working as intended -> print "accelerator.state" -> some arbitrary default
  2. pip uninstall, reinstall
  3. accelerate config -> yes, default yaml is created, just set "num_processes: 5" so I can see a diff -> nope, arbitrary default =1
  4. delete environmental variables with HF, nothing
  5. replace any args that used to say "none" for config (in source) with my absolute path to yaml, nothing
  6. sys.settrace -> ah, it ends up in 'if self.backend is None:' in 'state.py'!
  7. let's try and force it to go & use FSDP from there... wait. maybe I should ask @minienglish1 before I try that!

Are you not simply using "accelerator = Accelerator()" (can't pass a config there either, haha - I tried!) and "model, optimizer, train_dataloader, val_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, val_dataloader, scheduler)", all inside the Python script, maybe? What's this sorcery? 🙃

Thanks a lot for your help at this point! 😀

minienglish1 commented 3 days ago

I got it to work: on a single RTX 4090 I ran a bsz of 220 for 1 epoch. It took 2.2x as much clock time, gradients exploded everywhere, the loss was terrible, it OOM-crashed when the 2nd epoch started, and I couldn't get the PCA analysis to work inside the training loop. But it ran, so that's a starting point.

To set up your accelerate config file: in a terminal, activate the venv if needed, then type "accelerate config" and follow the instructions; it'll put a config yaml in ~/.cache/huggingface/accelerate/. After that, you can copy the file to another location, edit it, and use it when launching accelerate, such as in my launch.sh:

source venv/bin/activate
accelerate launch --config_file "default_config.yaml" clip_finetune_3.py

Here's my accelerate config "default_config.yaml" that I used:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: NO_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: NO_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

(Everything from fsdp_activation_checkpointing through fsdp_use_orig_params is nested under fsdp_config.)

Yes, I use "accelerator = Accelerator()", but you put things like mixed_precision or the gradient accumulation stuff there, like in my SDXL training script:

accelerator = Accelerator(
    gradient_accumulation_plugin=gradient_accumulation_plugin,
    mixed_precision=metadata["accelerate_mixed_precision"],
)

For this: "model, optimizer, train_dataloader, val_dataloader, scheduler = accelerator.prepare(model, optimizer, train_dataloader, val_dataloader, scheduler)" when using FSDP, prepare the model first, then everything else. Saves memory with how the optimizer is processed or something

when using FSDP, prepare model first

model = accelerator.prepare(model)

then prepare everything else

optimizer, train_dataloader, val_dataloader, scheduler = accelerator.prepare(optimizer, train_dataloader, val_dataloader, scheduler)

Best of luck!

zer0int commented 3 days ago

Thank you very much, I will try that! I actually had indentation etc. for (after) fsdp_config:, plus there are a few other things you did differently, as you mentioned; I'll see if it works this time!

PS: I also just committed Convert-for-HuggingFace-Spaces-etc - if you want to have a go at your original issue again.

minienglish1 commented 3 days ago

Thanks for Convert-for-HuggingFace-Spaces-etc. When I finish these two projects, I'll go back and try again.

As I said before, my friend pointed me at this to see how multi-gpu was used to create huge batch sizes when training CLIP: https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/loss.py

Initially, I was thinking I could emulate multi-GPU by storing the data from each batch and collecting it together to calculate the loss. But then I found this pull request, which does exactly that (assuming I understand it correctly): https://github.com/mlfoundations/open_clip/pull/267

I think it may be a better alternative to FSDP, since it allows for an effectively unlimited batch size, assuming everything fits in memory. Accelerate's distributed state could then be used instead for multi-GPU, which is much simpler.

I copied some of the pull request's code changes, and after some very very very long discussions with ChatGPT, I think I got it working. I tested it with --accum-freq of 2 and bsz 40, for a few epochs, and things look correct. I'm using the huggingface version of CLIP, but the core of the code is below:

    accum_image_features = []
    accum_text_features = []
    accum_data = []
    accum_logit_scales = []

    progress_bar = tqdm(enumerate(train_dataloader), total=len(train_dataloader), desc=f'Epoch {epoch + 1}/{EPOCHS}', leave=True)
    for batch_idx, (images, texts) in progress_bar:
        images = [Image.open(image).convert("RGB") for image in images]
        inputs = processor(
            images=images,
            text=texts,
            return_tensors="pt",
            padding=True,
            truncation=True
        ).to(device)

        with torch.no_grad():
            with autocast():  # Apply autocast even during accumulation without gradients

                outputs = model(**inputs)

                image_embeds = outputs.image_embeds
                text_embeds = outputs.text_embeds
                logit_scale = model.logit_scale.exp()

                accum_image_features.append(image_embeds)
                accum_text_features.append(text_embeds)
                accum_data.append(inputs)
                accum_logit_scales.append(logit_scale)

        # if accum_freq reached, or this is the final batch:
        # process the accumulated batches
        if (batch_idx + 1) % args.accum_freq == 0 or (batch_idx + 1) == len(train_dataloader):

            # Concatenate accumulated features
            all_image_features = torch.cat(accum_image_features, dim=0)
            all_text_features = torch.cat(accum_text_features, dim=0)
            logit_scale = accum_logit_scales[-1]  # Use the latest logit_scale

            #optimizer.zero_grad() #redundant, zero_grad after scheduler.step()

            # Recompute the forward pass for the accumulated batches with gradient tracking
            for j, inputs in enumerate(accum_data):
                with autocast():  # Autocast for mixed precision during forward pass with gradients
                    outputs = model(**inputs)
                    image_embeds = outputs.image_embeds
                    text_embeds = outputs.text_embeds
                    logit_scale = model.logit_scale.exp()

                    # Replace the cached features with the recomputed ones
                    all_image_features = torch.cat(
                        accum_image_features[:j] + [image_embeds] + accum_image_features[j+1:], dim=0
                    )
                    all_text_features = torch.cat(
                        accum_text_features[:j] + [text_embeds] + accum_text_features[j+1:], dim=0
                    )

                    # Compute logits over the accumulated features
                    logits_per_image = logit_scale * all_image_features @ all_text_features.t()
                    logits_per_text = logits_per_image.t()

                    # Compute the loss
                    total_loss = contrastive_loss(logits_per_image, logits_per_text)

                    #append logits
                    batch_logits_images.append(outputs.logits_per_image.mean().item())
                    batch_logits_texts.append(outputs.logits_per_text.mean().item())

                # Backpropagate the scaled loss
                scaler.scale(total_loss).backward()

            # Step the optimizer with scaled gradients
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()

            # Reset accumulators
            accum_image_features, accum_text_features = [], []
            accum_data, accum_logit_scales = [], []
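
One thing not shown above is contrastive_loss; presumably it's the standard symmetric CLIP objective over the accumulated logits, i.e. something like this sketch:

import torch
import torch.nn.functional as F

def contrastive_loss(logits_per_image, logits_per_text):
    # symmetric cross-entropy against the diagonal (each image matches its own caption)
    labels = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    loss_img = F.cross_entropy(logits_per_image, labels)
    loss_txt = F.cross_entropy(logits_per_text, labels)
    return (loss_img + loss_txt) / 2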

I've put this code, the launch script, and the stuff I toyed with for accelerate FSDP at: https://github.com/minienglish1/clip_stuff. Fair warning, it's all a hot mess. I'm basically just banging rocks together trying to get things working as quickly as possible for testing, with complete disregard for code cleanliness & organization.

I modified all the dataset-related stuff so I could use my already-prepared SDXL training dataset, so you'll need to put yours back in. Hope you find something useful.

I'm going to re-check this accum_freq code to make sure it's working correctly, then try to understand that GmP stuff you're doing. Then I'll go back and test that Convert-for-HuggingFace-Spaces-etc you put up.

Again, thanks for this awesome CLIP training repo!

zer0int commented 2 days ago

It's definitely interesting; with Flux.1, there's now a thing called block swapping (in reference to the diffusion transformer), and that's how you pull off fine-tuning 12 billion parameters on 24 GB of VRAM: https://github.com/bmaltais/kohya_ss/tree/sd3-flux.1.

I really wish I could clone myself into 10 AI agents that can go and explore all that is interesting in AI, then come back with an implementation for CLIP (because fine-tuning Big-G is another of the open issues on this repo, and I'd very much be interested in doing that, too!).

Meanwhile, I am hand-compiling PyTorch to even get this to work without downgrading - there's no libuv on Windows in PyTorch. I compiled that, and the good GPT-4o fixed the C++ includes that did not apply to Windows. Now it compiles without gloo (but with MPI), so AI & I just need to fix gloo! (Which has its own value - being able to compile PyTorch and its dependencies - so I'm still gonna pursue this. But thank you for sharing the code & info & the new approach, I'll be sure to try that as well! No worries about it being a 'hot mess' - I am very used to that, haha!)

Re: The GmP stuff, I also mentioned that in the now-quite-lengthy readme.md - but just in case, here's the link to the paper that inspired CLIP-GmP: https://arxiv.org/abs/2305.15912v4

It mentions ImageNet and ReLU, but why wouldn't it work for CLIP + GELU?! -> It does, I found out - after a lengthy discussion with GPT-4*, just like you did. :-)

minienglish1 commented 2 days ago

[Screenshot from 2024-10-24 23-18-13: training metric charts]

I tested accum_freq 1 through 32, with bsz 40-45 & lr 5e-7, for 10 epochs. The f1 & logits charts are missing values due to using the wrong scale or being added later, but loss/val_loss/val_acc tell a good enough story. For the 5e-7 learning rate, it looks like bsz 45 * accum_freq 4 (effective bsz 180) is best.

I tested bsz 45 with accum_freq 32, which gave a CUDA OOM error, so it's still memory bound. But bsz 40 with accum_freq 32 (effective bsz 1024) worked, so I think an effective bsz of a couple thousand may be possible. At effective bsz 1024, it started occasionally giving an exploding-gradient warning on a single layer, so I'll need to learn to deal with those (one option is sketched below). But having a bsz in the thousands could be a real benefit to CLIP training.
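
For the exploding gradients, the usual fix is gradient clipping; with the GradScaler loop from the earlier snippet, that would look roughly like this (max_norm=1.0 is just an example value, not something I tuned):

import torch

# replaces the scaler.step / scaler.update section in the accumulation loop above
scaler.unscale_(optimizer)  # gradients must be unscaled before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
scheduler.step()
optimizer.zero_grad()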

After I'm done with CLIP-L, I'll look at CLIP-G. Maybe a PEFT method of some kind, with FSDP cpu_offload to stretch the available VRAM? You mentioned the VRAM requirements were crazy huge.

Best of luck getting it to work on Windows. My coding skills aren't good enough to troubleshoot library issues on Windows. After I realized I was going to be training models and learning deep learning for the foreseeable future, I slowly made the transition to a standalone Ubuntu training box.