wkcn / TinyCLIP

[ICCV2023] TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
https://github.com/microsoft/Cream/tree/main/TinyCLIP

I want to replace the CLIP model weights with TinyCLIP model weights for initialization; how should I change the network architecture? #3

Closed: loserking111 closed this issue 5 days ago

loserking111 commented 2 months ago

I want to replace the CLIP model weights with TinyCLIP model weights for initialization. How should I change the network architecture?

wkcn commented 2 months ago

Hi @loserking111 , thanks for your interest in our work!

For the TinyCLIP models trained with manual inheritance, the changes are the number of layers and the hidden size. You can refer to the model configs: https://github.com/wkcn/TinyCLIP/tree/main/src/open_clip/model_configs
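
A minimal loading sketch, assuming the open_clip fork bundled with this repo is importable and that a local checkpoint file (the file name below is illustrative) matches the chosen config:

```python
import torch
import open_clip  # the open_clip fork bundled with the TinyCLIP repo

# Build the architecture from the TinyCLIP config instead of the original CLIP one.
model, _, preprocess = open_clip.create_model_and_transforms("TinyCLIP-ViT-39M-16-Text-19M")

# Load the released weights (checkpoint file name is illustrative).
checkpoint = torch.load("TinyCLIP-ViT-39M-16-Text-19M.pt", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)
# Strip a possible "module." prefix left over from DistributedDataParallel training.
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}

missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```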

loserking111 commented 2 months ago

Okay, thank you. I would also like to ask: would it be better to replace CLIP with TinyCLIP in the ReID field?

wkcn commented 2 months ago

The model architecture of TinyCLIP is the same as that of CLIP. You can diff their model configs.
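
For example, a quick way to diff two configs (the file names below are assumptions; see the model_configs directory linked above):

```python
import json

def load_cfg(path):
    with open(path) as f:
        return json.load(f)

# Paths are relative to the repo root; adjust to the two configs you want to compare.
clip_cfg = load_cfg("src/open_clip/model_configs/ViT-B-16.json")
tiny_cfg = load_cfg("src/open_clip/model_configs/TinyCLIP-ViT-39M-16-Text-19M.json")

# open_clip configs have an embed_dim plus nested vision_cfg / text_cfg sections.
for section in ("embed_dim", "vision_cfg", "text_cfg"):
    a, b = clip_cfg.get(section), tiny_cfg.get(section)
    if isinstance(a, dict) or isinstance(b, dict):
        for key in sorted(set(a or {}) | set(b or {})):
            if (a or {}).get(key) != (b or {}).get(key):
                print(f"{section}.{key}: {(a or {}).get(key)} -> {(b or {}).get(key)}")
    elif a != b:
        print(f"{section}: {a} -> {b}")
```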

loserking111 commented 2 months ago

My original model used CLIP, and I only changed the model config. When I use the TinyCLIP-ViT-39M-16-Text-19M model, it seems the parameters are not loaded successfully.

wkcn commented 2 months ago

You can check whether the shape of each weight matches.
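
For instance, a small helper for that check (a generic sketch that works on any PyTorch module and checkpoint state dict):

```python
import torch

def report_weight_mismatches(model: torch.nn.Module, state_dict: dict) -> None:
    """Print checkpoint keys that are missing, unexpected, or have a different shape."""
    model_sd = model.state_dict()
    for name, tensor in state_dict.items():
        if name not in model_sd:
            print("unexpected key:", name)
        elif model_sd[name].shape != tensor.shape:
            print(f"shape mismatch {name}: model {tuple(model_sd[name].shape)} "
                  f"vs checkpoint {tuple(tensor.shape)}")
    for name in model_sd.keys() - state_dict.keys():
        print("missing key:", name)
```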

willswordh commented 2 months ago

@wkcn Great work! About how much smaller are TinyCLIP's model size and memory usage compared to the original CLIP model? Thanks!

wkcn commented 2 months ago

Hi @willswordh , thanks for your interest in our work! Here is the comparison with the original CLIP model. The column named #Params (M) shows the model size. I did not record the specific memory usage, but TinyCLIP uses less memory than the original CLIP model since it has fewer layers and narrower channels.

(image: table comparing #Params (M) of the TinyCLIP and CLIP models)
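
If you want to cross-check the #Params (M) column locally, a rough count for an open_clip-style model (assuming a .visual submodule) is:

```python
def count_params_m(module) -> float:
    """Number of parameters, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

total = count_params_m(model)
visual = count_params_m(model.visual)
print(f"image encoder: {visual:.1f}M  text encoder (rest): {total - visual:.1f}M  total: {total:.1f}M")
```
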
willswordh commented 1 month ago

@wkcn Hey Jackie! Thanks for your response! I have actually tested the TinyCLIP model, and its processing speed for images is similar to the original CLIP's. Is that expected? I thought fewer parameters and lower memory usage would speed up TinyCLIP's processing. Thanks!

wkcn commented 1 month ago

Hi @willswordh , I have uploaded the script to measure the throughput.

https://github.com/wkcn/TinyCLIP/blob/main/measure_throughput.py

Example:

python3 measure_throughput.py --model-name ViT-B-32
python3 measure_throughput.py --model-name TinyCLIP-ViT-61M-32-Text-29M

The model names can be found in https://github.com/wkcn/TinyCLIP/tree/main/src/open_clip/model_configs
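
For reference, the loop below is not the repo script, just a minimal sketch (it assumes an open_clip-style model with encode_image and 224x224 inputs) of how image-encoder throughput can be measured:

```python
import time
import torch

@torch.no_grad()
def image_throughput(model, batch_size=32, iters=50, device="cuda"):
    """Return images per second for the image encoder on random inputs."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(10):  # warm-up
        model.encode_image(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model.encode_image(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```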

willswordh commented 1 month ago

@wkcn Thanks Jackie! I tested their throughput, and I noticed that TinyCLIP is not a huge improvement over the original ViT-B/32. I want to run CLIP on an edge device efficiently with good speed; do you have any advice? Maybe quantize the TinyCLIP model? Thanks a lot for your sincere help!

wkcn commented 1 month ago

@willswordh Did you use an inference framework like TensorRT or ONNX Runtime? I did not try to quantize the TinyCLIP models.
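
As a reference point, here is a minimal sketch of exporting the image encoder to ONNX and running it with ONNX Runtime (the file names, input shape, and opset version are assumptions; TensorRT would be analogous):

```python
import torch
import onnxruntime as ort

# `model` is an open_clip-style CLIP/TinyCLIP model; export only the image tower.
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model.visual, dummy, "image_encoder.onnx",
    input_names=["image"], output_names=["embedding"],
    dynamic_axes={"image": {0: "batch"}, "embedding": {0: "batch"}},
    opset_version=14,
)

sess = ort.InferenceSession("image_encoder.onnx", providers=["CPUExecutionProvider"])
(embedding,) = sess.run(None, {"image": dummy.numpy()})
print(embedding.shape)
```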

willswordh commented 1 month ago

@wkcn Yes, I used ONNX Runtime. The processing speed is still quite low even though I have tried quantizing it to FP16.

wkcn commented 1 month ago

@willswordh

Did you compare CLIP-ViT-B/32 and TinyCLIP-ViT-40M-32-Text-19M?

Could you please provide more information, such as the batch size, the device, and the inference time of the image encoder and the text encoder?

You can increase the batch size as much as possible to measure the speed. In my test code, the batch size is 32.
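
A companion sketch for timing the text encoder separately (it assumes the fork exposes open_clip.tokenize; the batch size mirrors the 32 used above):

```python
import time
import torch
import open_clip

@torch.no_grad()
def text_encoder_latency(model, batch_size=32, iters=50, device="cuda"):
    """Return average seconds per batch for the text encoder."""
    model = model.to(device).eval()
    tokens = open_clip.tokenize(["a photo of a cat"] * batch_size).to(device)
    for _ in range(10):  # warm-up
        model.encode_text(tokens)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model.encode_text(tokens)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters
```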

willswordh commented 1 month ago

@wkcn I compared the throughput of ViT-B/32 with the smallest TinyCLIP model. I wonder what I can do to speed up the model's inference on a device like an Android phone. Maybe quantize it to int8? Thanks!

wkcn commented 1 month ago

I did not benchmark TinyCLIP on an edge device. Quantization can accelerate inference, but I am not sure whether TinyCLIP will be significantly faster than ViT-B/32.
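
I did not try this myself, but a minimal sketch of int8 dynamic quantization with ONNX Runtime would look like the following (the ONNX file names are assumptions; on Android the execution provider, e.g. NNAPI, also matters):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the exported image encoder's weights to int8; activations are handled dynamically.
quantize_dynamic(
    "image_encoder.onnx",
    "image_encoder_int8.onnx",
    weight_type=QuantType.QInt8,
)
```

Dynamic quantization mainly shrinks and speeds up the MatMul/Gemm-heavy parts on CPU, so it is worth benchmarking both models after quantization rather than assuming a fixed speedup.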