zh460045050 / V2L-Tokenizer


Error when running step3_global_codebook_filtering.py #8

Open ddghjikle opened 3 months ago

ddghjikle commented 3 months ago

Hi, thanks for sharing this wonderful work. While reproducing the released code, I ran into an error in step3_global_codebook_filtering.py.

I have checked step2_generate_codebook_embedding.py, which generates "Subword_Bigram_Trigram_Embedding.pth". This .pth appears to be three times larger than "Subword_Bigram_Trigram_Vocabulary.npy". As a result, the indices generated from "Subword_Bigram_Trigram_Embedding.pth" cannot be used to select tokens in "Subword_Bigram_Trigram_Vocabulary.npy".
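For reference, a minimal check that reproduces the mismatch (hypothetical snippet; it assumes the .pth stores one embedding row per vocabulary entry, and uses the file names produced by step 2):

```python
import numpy as np
import torch

# Load the outputs of step 2; names as generated by the released scripts.
embedding = torch.load("Subword_Bigram_Trigram_Embedding.pth")
vocabulary = np.load("Subword_Bigram_Trigram_Vocabulary.npy", allow_pickle=True)

print(embedding.shape)   # roughly three times as many rows ...
print(vocabulary.shape)  # ... as there are vocabulary entries
```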

Do you have any suggestions?

[screenshot of the error]

hastaluegoph commented 3 months ago

I ran into the same error.

hastaluegoph commented 3 months ago


I found that Subword_Bigram_Trigram_Vocabulary.npy is saved in step 1 as a dict, like {"1": value, "2": value, "3": value}. So you may try to extract the values and concatenate them into one numpy array whose dimension matches the effective_index.
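Something like this (untested sketch; the key names follow the dict above, and effective_index is whatever step 3 computes, shown here with a placeholder so the snippet runs on its own):

```python
import numpy as np

# Flatten the dict saved in step 1 into one array whose row order matches
# the embedding rows, so effective_index can index it directly.
vocab = np.load("Subword_Bigram_Trigram_Vocabulary.npy", allow_pickle=True).item()
flat_vocab = np.concatenate([np.asarray(vocab[k]) for k in ("1", "2", "3")])

# effective_index is produced by step3_global_codebook_filtering.py;
# the value below is a placeholder for illustration only.
effective_index = np.array([0, 1, 2])
selected_tokens = flat_vocab[effective_index]
```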