Closed: LiamTTT closed this issue 1 year ago
Hi, thanks for your interest. I'm not quite clear about what "resized token embedding" means. Could you point to the corresponding code with a link?
Oh, sorry. I mean: why is 'clip_embedding_tokens' or 'blip_embedding_tokens' different when the dataset changes?
https://github.com/Flash-321/ARLDM/blob/94725e13e7c790ec1025fd5d485771becc367f02/config.yaml#L32-L56 https://github.com/Flash-321/ARLDM/blob/94725e13e7c790ec1025fd5d485771becc367f02/main.py#L119-L121
Thanks for your comment! This is because we added some new tokens for the characters in these two datasets.
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/config.yaml#L35
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/config.yaml#L42
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/datasets/flintstones.py#L36-L39
https://github.com/Flash-321/ARLDM/blob/34b30703a2caeeb2364bdfb161345027217785c6/datasets/pororo.py#L37-L40
As a result, the vocab size of the tokenizer changes, and we need to resize the token embeddings so that the embedding layer can still encode the sentences. The numbers can be obtained by printing `len(clip_tokenizer)` and `len(blip_tokenizer)` after adding those new tokens.
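The mechanism can be sketched in plain Python with toy stand-ins (this is not the actual ARLDM code; a Hugging Face `transformers` workflow would use `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, and the character names below are only illustrative):

```python
# Toy sketch: adding new character tokens grows the tokenizer's vocab,
# so the embedding table must be resized to match len(tokenizer).

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = {tok: i for i, tok in enumerate(vocab)}

    def add_tokens(self, new_tokens):
        for tok in new_tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)

    def __len__(self):
        return len(self.vocab)


class ToyEmbedding:
    """Stands in for nn.Embedding: one weight row per vocab entry."""
    def __init__(self, num_embeddings, dim):
        self.dim = dim
        self.weight = [[0.0] * dim for _ in range(num_embeddings)]

    def resize(self, new_num_embeddings):
        # Keep existing rows, append freshly initialized rows for new tokens.
        while len(self.weight) < new_num_embeddings:
            self.weight.append([0.0] * self.dim)


tokenizer = ToyTokenizer(["a", "boy", "runs"])
embedding = ToyEmbedding(len(tokenizer), dim=4)

# Character tokens in the style of the FlintstonesSV dataset (illustrative names).
tokenizer.add_tokens(["fred", "barney", "wilma", "betty", "pebbles", "dino", "slate"])
embedding.resize(len(tokenizer))

print(len(tokenizer))         # vocab size after adding tokens
print(len(embedding.weight))  # embedding rows now match the new vocab size
```

Without the resize step, any sentence containing one of the new tokens would index past the end of the embedding table.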
Got it! Thanks! This is fantastic work! Looking forward to more work from you.
Thanks! Feel free to open an issue if you have any further questions!
Hi!
I am reproducing this work, and I noticed that the token embedding is resized when training on the PororoSV or FlintstonesSV datasets. My question is:
BTW, thanks for open-sourcing the code! Looking forward to your reply :)