Hi Dr. Wang,
I notice that the padding positions for sentences in each batch are filled with index 0,
but the pre-trained embedding matrix is declared with its last row set to all zeros (which I assume is meant for the padding word?), while all preceding rows are read from the pre-trained embedding file.
So during the embedding lookup step, padding positions actually receive the embedding of the first word stored in the pre-trained embedding file; I think they should map to the all-zero last row instead.
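To make the mismatch concrete, here is a minimal numpy sketch of what I believe happens (made-up sizes and names, not the repo's actual code):

```python
import numpy as np

# Minimal sketch with made-up sizes, just to illustrate the index mismatch;
# none of these names come from the repo.
vocab_size, word_dim = 3, 4

# Rows 0 .. vocab_size-1 are read from the pre-trained file;
# the extra all-zero row is appended at the END (index vocab_size).
pretrained = np.random.rand(vocab_size, word_dim)
word_vecs = np.vstack([pretrained, np.zeros((1, word_dim))])

# Padding positions in a batch are encoded as index 0 ...
padded_sentence = np.array([2, 1, 0, 0])  # last two slots are padding

# ... so the lookup for a padding slot returns pretrained[0], the embedding
# of the first word in the file, rather than the all-zero last row.
looked_up = word_vecs[padded_sentence]
print(np.allclose(looked_up[2], pretrained[0]))       # True
print(np.allclose(looked_up[2], np.zeros(word_dim)))  # almost surely False
```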
The subroutine can be traced here:
(1) word_vocab is constructed with 'txt3' in https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchTrainer.py#L128, where cur_index starts from 0, and self.word_vecs is declared with one extra last row of zeros in https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/vocab_utils.py#L118-L143.
(2) The sequence of word indices is then passed for embedding lookup:
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchDataStream.py#L53-L54
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/vocab_utils.py#L285
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/vocab_utils.py#L264
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchTrainer.py#L264-L265
https://github.com/zhiguowang/BiMPM/blob/7052c19acb83452ad077da14512bcac19a00c3d0/src/SentenceMatchModelGraph.py#L37-L38
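If my reading is right, one way to fix it (just a sketch of the idea, not a patch against the actual code) would be to reserve index 0 for the all-zero row when building the vocabulary, so padding naturally maps to zeros; alternatively, padding could use index vocab_size, where the zero row currently lives, instead of 0:

```python
import numpy as np

# Illustrative sizes and names only; not the repo's variables.
vocab_size, word_dim = 3, 4
pretrained = np.random.rand(vocab_size, word_dim)

# Option A (assumption): put the all-zero row FIRST, so padding index 0
# maps to the zero vector; pre-trained word i then lives at row i + 1.
word_vecs = np.vstack([np.zeros((1, word_dim)), pretrained])
padded_sentence = np.array([3, 2, 0, 0])  # word indices shifted by one; 0 = padding
print(np.allclose(word_vecs[padded_sentence][2], np.zeros(word_dim)))  # True

# Option B (assumption): keep the current layout and pad with vocab_size,
# the index of the trailing zero row, instead of 0.
word_vecs_b = np.vstack([pretrained, np.zeros((1, word_dim))])
padded_sentence_b = np.array([2, 1, vocab_size, vocab_size])
print(np.allclose(word_vecs_b[padded_sentence_b][2], np.zeros(word_dim)))  # True
```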
Please correct me if I'm wrong; I hope to hear from you. Thank you!