yoonkim / lstm-char-cnn

LSTM language model with CNN over characters
MIT License

Reconstruction of table 6 from paper - Dealing with OOV words #13

Open ThorJonsson opened 8 years ago

ThorJonsson commented 8 years ago

Hi, thank you very much for this.

I wanted to ask whether you could elaborate on how Table 6 is constructed; I am having some difficulty reconstructing it after training on the PTB data, specifically for OOV words.

I think I understand how to compute the cosine similarity between two words that exist in the word_vecs lookup table. However, when I compute the nearest-neighbor words based on cosine similarity, I get results that differ from those described in the paper:

th> get_sim_words('his',5,cpchar,word2idx,idx2word)                                         
{
  1 : 
   {
      1 : "his"
      2 : 1
    }
  2 : 
    {
      1 : "my"
      2 : 0.67714271790195
    }
  3 : 
    {
      1 : "your"
      2 : 0.67532773464339
    }
  4 : 
    {
      1 : "its"
      2 : 0.63439247861717
    }
  5 : 
    {
      1 : "her"
      2 : 0.62416681420755
    }
}

Here I am simply using the lookup table found in checkpoint.protos.rnn.modules[2].weight:double(). I take the row of the lookup table corresponding to the word whose nearest neighbors I want, compute the matrix-vector product against all rows, and sort by similarity.
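For concreteness, here is a minimal sketch of what my get_sim_words does (cpchar in the call above is the checkpoint loaded with torch.load; word2idx/idx2word are my own word/row mappings, and the module path is the one mentioned above):

require 'torch'

-- Nearest neighbors by cosine similarity over the word lookup table.
local function get_sim_words(word, k, checkpoint, word2idx, idx2word)
  local W = checkpoint.protos.rnn.modules[2].weight:double()
  -- normalize rows so that a plain dot product equals cosine similarity
  local norms = W:norm(2, 2):add(1e-8)      -- small epsilon avoids dividing by zero
  local Wn = torch.cdiv(W, norms:expandAs(W))
  local v = Wn[word2idx[word]]
  local sims = Wn * v                       -- matrix-vector product: one score per word
  local sorted, idx = sims:sort(1, true)    -- sort descending
  local result = {}
  for i = 1, k do
    result[i] = {idx2word[idx[i]], sorted[i]}
  end
  return result
end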

I assume that for the nearest neighbors of OOV words you are using the character embedding space? Any help or tips on how you did this would be much appreciated.
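For the OOV case, here is a rough sketch of the kind of thing I have in mind (not necessarily what the paper does): run the characters of the OOV word through the character-composition part of the network (character lookup + CNN + highway) to get a word representation, then rank vocabulary words by cosine similarity against representations built the same way. The names char_cnn, char2idx, and max_word_l are assumptions here, as are the padding index and the start/end-of-word markers; the exact module path into the checkpoint depends on how it was saved.

-- Build the padded character-index tensor for one word.
local function word2chars(word, char2idx, max_word_l)
  local chars = torch.ones(1, max_word_l)   -- assumed zero-padding index 1
  chars[1][1] = char2idx['{']               -- assumed start-of-word marker
  local i = 2
  for c in word:gmatch('.') do
    if char2idx[c] ~= nil and i < max_word_l then
      chars[1][i] = char2idx[c]
      i = i + 1
    end
  end
  chars[1][i] = char2idx['}']               -- assumed end-of-word marker
  return chars
end

-- Rank vocabulary words by cosine similarity to an OOV word's representation.
local function oov_neighbors(word, k, char_cnn, char2idx, idx2word, max_word_l)
  local q = char_cnn:forward(word2chars(word, char2idx, max_word_l)):clone():squeeze()
  q:div(q:norm() + 1e-8)                    -- normalize the query representation
  local sims = {}
  for _, w in ipairs(idx2word) do
    local v = char_cnn:forward(word2chars(w, char2idx, max_word_l)):clone():squeeze()
    sims[#sims + 1] = {w, q:dot(v) / (v:norm() + 1e-8)}
  end
  table.sort(sims, function(a, b) return a[2] > b[2] end)
  local result = {}
  for i = 1, k do result[i] = sims[i] end
  return result
end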

Thanks,

bqcao commented 8 years ago

Any progress to share, please?

yoonkim commented 8 years ago

There is randomness built into the models (due to initialization), so you shouldn't expect the nearest neighbors to be exactly the same. Your nearest neighbors seem to make sense (and are close to the ones in the paper as well).
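If you want repeatable runs, fix the random seeds before building the model; a minimal sketch (torch.manualSeed is the standard Torch call, and 3435 is just an arbitrary seed value):

require 'torch'
torch.manualSeed(3435)        -- makes parameter initialization deterministic on CPU
-- if training on GPU, also seed cutorch:
-- require 'cutorch'
-- cutorch.manualSeed(3435)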