zer0int / CLIP-fine-tune

Fine-tuning code for CLIP models
MIT License

Can't convert to HF with convert_clip_original_pytorch_to_hf.py #17

Open betterftr opened 3 weeks ago

betterftr commented 3 weeks ago

(I trained with ft-B-train-OpenAI-CLIP-ViT-L-14, then used ft-C-convert-for-SDXL-comfyUI-OpenAI-CLIP, and then tried to convert to HF and extract the text encoder (TE); I am trying to drop it in as SD3.5L's tenc1.)

convert_clip_original_pytorch_to_hf.py", line 157, in convert_clip_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path) File "C:\OneTrainer\venv\lib\site-packages\torch\utils_contextlib.py", line 116, in decorate_context return func(*args, kwargs) File "C:\OneTrainer\CLIP-fine-tune\Convert-for-HuggingFace-Spaces-etc\convert_clip_original_pytorch_to_hf.py", line 121, in convert_clip_checkpoint ptmodel, = load(checkpoint_path, device="cpu", jit=False) File "C:\OneTrainer\venv\lib\site-packages\clip\clip.py", line 136, in load state_dict = torch.load(opened_file, map_location="cpu") File "C:\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 1384, in load return _legacy_load( File "C:\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 1628, in _legacy_load magic_number = pickle_module.load(f, pickle_load_args) EOFError: Ran out of input

As the title says; additionally, the extract-TE step outputs a 1 KB file: [image]
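For context: "EOFError: Ran out of input" from torch's legacy pickle loader usually means the file handed to torch.load is empty or truncated, which would also be consistent with a ~1 KB output containing no actual tensors. A minimal sanity check before running the converter (a sketch; the checkpoint path below is illustrative):

import os
import torch

ckpt = "C:/OneTrainer/CLIP-fine-tune/ft-checkpoints/my-finetune.pt"  # illustrative path
print(os.path.getsize(ckpt))  # 0 or a few hundred bytes would explain "Ran out of input"

obj = torch.load(ckpt, map_location="cpu")
print(type(obj))  # full nn.Module, or a plain state_dict?
if isinstance(obj, dict):
    print(list(obj)[:5])  # peek at the first few keys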

Update: after some experimentation I managed to do the conversion like this, and the result is now loadable with SD3.5:

import torch
from transformers import CLIPTextModelWithProjection, CLIPTextConfig

# Load the fine-tuned model and extract the state_dict
full_model = torch.load("C:/OneTrainer/CLIP-fine-tune/ft-checkpoints/my-finetune.pt", map_location="cpu")
state_dict = full_model.state_dict() if hasattr(full_model, "state_dict") else full_model

# Load the configuration and create the model
config = CLIPTextConfig.from_pretrained("C:/train/sd3.5/text_encoder/config.json")
fine_tuned_model = CLIPTextModelWithProjection(config)

# Load the state_dict into the fine-tuned model
# (strict=False ignores keys that don't match the text encoder, e.g. vision-tower weights)
fine_tuned_model.load_state_dict(state_dict, strict=False)

# Save only the text encoder part
fine_tuned_model.save_pretrained("C:/OneTrainer/CLIP-fine-tune/ft-checkpoints/")
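To confirm the saved encoder actually contains weights (and is not another 1 KB shell), a quick check along these lines should work; the tokenizer id here is just an assumption, any stock CLIP-L tokenizer will do:

from transformers import CLIPTextModelWithProjection, CLIPTokenizer

te = CLIPTextModelWithProjection.from_pretrained("C:/OneTrainer/CLIP-fine-tune/ft-checkpoints/")
print(sum(p.numel() for p in te.parameters()))  # roughly 124M for a CLIP-L text tower + projection

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
out = te(**tok("a test prompt", return_tensors="pt"))
print(out.text_embeds.shape)  # expect torch.Size([1, 768])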

Interestingly, the converted, extracted text encoder works with Stable Diffusion 3.5 (CLIPTextModelWithProjection) but not with Flux (switching the class to CLIPTextModel).
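One possible explanation (an assumption, not verified here): CLIPTextModel and CLIPTextModelWithProjection expect slightly different key sets (CLIPTextModel has no text_projection), and with strict=False any mismatch is silently dropped, so a broken conversion can still "succeed". Printing the load report makes any mismatch visible (same illustrative paths as above):

import torch
from transformers import CLIPTextModel, CLIPTextConfig

full_model = torch.load("C:/OneTrainer/CLIP-fine-tune/ft-checkpoints/my-finetune.pt", map_location="cpu")
state_dict = full_model.state_dict() if hasattr(full_model, "state_dict") else full_model

config = CLIPTextConfig.from_pretrained("C:/train/sd3.5/text_encoder/config.json")
model = CLIPTextModel(config)
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys[:10])      # weights the model expected but didn't get
print("unexpected keys:", result.unexpected_keys[:10])  # checkpoint keys that were silently ignored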