Hi Dr. Wang,
I notice that the padding positions for sentences in each batch are filled with index 0,
but the pre-trained embedding matrix is declared with its last row set to all zeros (which I assume is meant for the padding word?), while all preceding rows are read from the pre-trained embedding file.
So during the embedding lookup step, padding positions actually receive the embedding of the first word stored in the pre-trained embedding file; I think they should map to the all-zero last row instead.
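To make the mismatch concrete, here is a minimal numpy sketch of what I believe happens (made-up sizes and names, not the repo's actual code):

```python
import numpy as np

# Minimal sketch with made-up sizes, just to illustrate the index mismatch;
# none of these names come from the repo.
vocab_size, word_dim = 3, 4

# Rows 0 .. vocab_size-1 are read from the pre-trained file;
# the extra all-zero row is appended at the END (index vocab_size).
pretrained = np.random.rand(vocab_size, word_dim)
word_vecs = np.vstack([pretrained, np.zeros((1, word_dim))])

# Padding positions in a batch are encoded as index 0 ...
padded_sentence = np.array([2, 1, 0, 0])  # last two slots are padding

# ... so the lookup for a padding slot returns pretrained[0], the embedding
# of the first word in the file, rather than the all-zero last row.
looked_up = word_vecs[padded_sentence]
print(np.allclose(looked_up[2], pretrained[0]))       # True
print(np.allclose(looked_up[2], np.zeros(word_dim)))  # almost surely False
```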
The subroutine can be traced here:
(1) word_vocab is constructed with 'txt3' in https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchTrainer.py#L128, where cur_index starts from 0, and self.word_vecs is declared with one extra last row of zeros in https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/vocab_utils.py#L118-L143.
(2) The sequence of word indices is then passed for embedding lookup:
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchDataStream.py#L53-L54
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/vocab_utils.py#L285
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/vocab_utils.py#L264
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchTrainer.py#L264-L265
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchModelGraph.py#L37-L38
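If my reading is right, one way to fix it (just a sketch of the idea, not a patch against the actual code) would be to reserve index 0 for the all-zero row when building the vocabulary, so padding naturally maps to zeros; alternatively, padding could use index vocab_size, where the zero row currently lives, instead of 0:

```python
import numpy as np

# Illustrative sizes and names only; not the repo's variables.
vocab_size, word_dim = 3, 4
pretrained = np.random.rand(vocab_size, word_dim)

# Option A (assumption): put the all-zero row FIRST, so padding index 0
# maps to the zero vector; pre-trained word i then lives at row i + 1.
word_vecs = np.vstack([np.zeros((1, word_dim)), pretrained])
padded_sentence = np.array([3, 2, 0, 0])  # word indices shifted by one; 0 = padding
print(np.allclose(word_vecs[padded_sentence][2], np.zeros(word_dim)))  # True

# Option B (assumption): keep the current layout and pad with vocab_size,
# the index of the trailing zero row, instead of 0.
word_vecs_b = np.vstack([pretrained, np.zeros((1, word_dim))])
padded_sentence_b = np.array([2, 1, vocab_size, vocab_size])
print(np.allclose(word_vecs_b[padded_sentence_b][2], np.zeros(word_dim)))  # True
```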
Please correct me if I'm wrong; I hope to hear from you. Thank you!