Hi,
Thanks :) In my tests, the performance is EM 55% on the test set. What's missing is at least:
Then there are things to be verified:
What I'll not do
As for performance, I plan to get it to at least 65% EM in the coming month, but it may not rival the original implementation from the paper.
Are you looking at this model as well?
Hey,
Thanks for your detailed answer! I'm planning to implement it in TensorFlow, and your version is a really good baseline.
I'll submit issues if I see them! Best,
Hi again,
Concerning your NaN problem, it seems to come from tf.exp(alpha_flat), which returns inf values. I fixed it by setting stddev=0.1 on the weights for the ReLU, so the values stay small enough that exp doesn't overflow to inf.
W = tf.Variable(tf.random_normal(shape=[input_size, input_size], mean=0.0, stddev=0.1, dtype=tf.float32, name='ReLU_weight'))
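Another option, if alpha_flat only feeds a softmax, could be to subtract the row-wise max before exponentiating, so tf.exp can't overflow no matter how big the weights are. Just a sketch on my side, assuming the scores are along the last axis; I haven't tested it in your code:
# numerically stable softmax over the last axis (sketch, not from the repo)
alpha_max = tf.reduce_max(alpha_flat, axis=-1, keep_dims=True)
alpha_exp = tf.exp(alpha_flat - alpha_max)
alpha = alpha_exp / tf.reduce_sum(alpha_exp, axis=-1, keep_dims=True)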
I have some questions if you have the time to answer them.
Why aren't you using biases? I see that you commented them out, but I fail to see the reason behind it.
What is the purpose of doc_input_size? I don't see it being used later on. Also, what does use_qemb mean? Why would you add the embedding dimension? Is this for the aligned question embedding? If so, I guess it's False by default because of the NaNs you're getting?
doc_input_size = opt['embedding_dim'] + opt['num_features']
if opt['use_qemb']:
    doc_input_size += opt['embedding_dim']
if opt['pos']:
    doc_input_size += opt['pos_dim']
if opt['ner']:
    doc_input_size += opt['ner_dim']
Can you justify the initialization of the different variables? I never know how to initialize them.
with tf.variable_scope('BilinearSeqAttention'):
    W = tf.Variable(tf.truncated_normal([y_size, x_size], dtype=tf.float32))
and
with tf.variable_scope('SeqAttnMatch'):
    W = tf.Variable(tf.random_normal(shape=[input_size, input_size], dtype=tf.float32))
I don't understand this either, and if I leave the flags set to True, my code fails to build.
if opt['concat_rnn_layers']:
    doc_hidden_size *= opt['doc_layers']
    question_hidden_size *= opt['question_layers']
Sorry to bother you with all of these questions! Thanks again for your code, it helped me quite a lot!
Hi, great that you've looked into the NaN issue. As for your questions:
I commented out the bias because I looked at torch.Linear and also at how the original DrQA works; the behavior is just a matrix multiplication. Yes, we should verify it and try with a bias.
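Roughly what I mean, with made-up sizes (just a sketch, not the exact repo code): without the bias, the layer is nothing more than a matrix multiplication, like torch.Linear with bias=False.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128])                # [batch, input_size]
W = tf.Variable(tf.random_normal([128, 128], stddev=0.1))
y = tf.matmul(x, W)                                        # no "+ b" term, pure matmul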
doc_input_size is the size of the input to the bidirectional LSTM. Yes, the size can sometimes be found by just looking at the Tensor, so this variable isn't strictly necessary.
Yes, use_qemb is for the aligned question embedding. Currently when it's true, I'm getting NaNs, so that's why it's false... Setting it to true and solving the NaN problem will improve the performance.
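For reference, the aligned question embedding is roughly the following (a sketch, assuming doc_emb is [batch, doc_len, dim], quest_emb is [batch, q_len, dim] and W is the SeqAttnMatch weight above; padding would still need to be masked before the softmax). The softmax/exp step is exactly where the NaNs show up, and tf.nn.softmax subtracts the max internally, which should help:
doc_proj = tf.nn.relu(tf.tensordot(doc_emb, W, axes=[[2], [0]]))    # shared ReLU projection
q_proj = tf.nn.relu(tf.tensordot(quest_emb, W, axes=[[2], [0]]))
scores = tf.matmul(doc_proj, q_proj, transpose_b=True)              # [batch, doc_len, q_len]
alpha = tf.nn.softmax(scores)                                       # attention over question words
q_aligned = tf.matmul(alpha, quest_emb)                             # [batch, doc_len, dim]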
I remember from somewhere that truncated_normal helps to avoid the NaN problem (I'm not sure).
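If I recall correctly, truncated_normal re-draws any sample that lands more than two standard deviations from the mean, so there are no extreme initial weights; another common choice (not what the repo does) would be Xavier initialization, for example:
W = tf.Variable(tf.truncated_normal([input_size, input_size], stddev=0.1))  # tails cut at 2*stddev
W_xavier = tf.get_variable('W_xavier', shape=[input_size, input_size],
                           initializer=tf.contrib.layers.xavier_initializer())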
This concat_rnn_layers flag is what I was talking about regarding the difference between Torch and TF: in Pytorch you get the outputs of every layer of a multi-layer RNN, but in TF, with the MultiRNNCell wrapper, I only get the output of the top layer. Changing the number of layers doesn't change the size of the MultiRNNCell output, so there's no point in setting opt['concat_rnn_layers']=True.
But this is also something to be verified; if I'm doing it wrong here, it could be another place where we can improve the accuracy.
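One way to mimic the Pytorch behavior (a sketch, not what the repo currently does) is to skip MultiRNNCell and run each bidirectional layer separately, keeping every layer's output so they can be concatenated when opt['concat_rnn_layers'] is true:
import tensorflow as tf

def stacked_brnn(inputs, length, hidden_size, num_layers, concat_layers=True):
    # inputs: [batch, time, dim]; length: [batch] true sequence lengths
    outputs = [inputs]
    for layer in range(num_layers):
        with tf.variable_scope('brnn_layer_%d' % layer):
            fw = tf.nn.rnn_cell.LSTMCell(hidden_size)
            bw = tf.nn.rnn_cell.LSTMCell(hidden_size)
            (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
                fw, bw, outputs[-1], sequence_length=length, dtype=tf.float32)
            outputs.append(tf.concat([out_fw, out_bw], axis=2))
    if concat_layers:
        return tf.concat(outputs[1:], axis=2)   # outputs of every layer, like Pytorch DrQA
    return outputs[-1]                          # top layer only, like MultiRNNCell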
No problem at all! Great that it's helping you. I find it extremely helpful to re-write the Pytorch version in Tensorflow; I was looking for an opportunity to understand how attention works, and the articles and diagrams on the internet are still far from actually implementing it and understanding the math behind it.
Hi,
Thanks for your answer!
I'm having trouble running the code, as the batch size is pretty small. In fact, I have a total of approximately 2500 batches per epoch. Is this an issue for you? I ran the code last night and it didn't complete a single epoch. However, if I force the code to do only one batch, it gets through it without problems. How many batches do you have, and how long does it take you to train the model?
I tried something for the top-k word embedding; no idea if it works, but it should:
embedding_trainable = tf.get_variable("word_embedding_trainable", shape=[params.tune_partial, params.embedding_dim], trainable=True)
embedding_fixed = tf.get_variable("word_embedding_fixed", shape=[params.fvocab_size - params.tune_partial, params.embedding_dim], trainable=False)
# the assign ops only take effect if they are actually run, so keep them and
# run them once (sess.run(init_embeddings)) after the variable initializer
init_embeddings = [embedding_trainable.assign(params.embedding[:params.tune_partial]),
                   embedding_fixed.assign(params.embedding[params.tune_partial:])]
embedding = tf.concat([embedding_trainable, embedding_fixed], axis=0, name="word_embed")
Concerning the dropout, I think that this should work
doc_emb = tf.nn.dropout(doc_emb, params.dropout_rnn, name='dropout_doc_emb')
quest_emb = tf.nn.dropout(quest_emb, params.dropout_rnn, name='dropout_quest_emb')
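One thing I'm not sure about: tf.nn.dropout takes a keep probability, and it should probably be turned off at evaluation time, so maybe something like this instead (just a guess on my side, assuming params.dropout_rnn is the drop rate):
keep_prob = tf.placeholder_with_default(1.0, shape=[], name='keep_prob')
doc_emb = tf.nn.dropout(doc_emb, keep_prob, name='dropout_doc_emb')
quest_emb = tf.nn.dropout(quest_emb, keep_prob, name='dropout_quest_emb')
# feed {keep_prob: 1.0 - params.dropout_rnn} while training, nothing at eval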
Finally, why do you compute the length of the different components? For example, in StackedBRNN, this is computed:
with tf.name_scope("doc_length"):
    words_used_in_sent = tf.sign(tf.reduce_max(tf.abs(input_data), reduction_indices=2))
    length = tf.cast(tf.reduce_sum(words_used_in_sent, reduction_indices=1), tf.int32)
Wouldn't it be possible to simply use the parameters (opt) rather than dedicating some computation to getting the length? Or am I missing something here?
I agree that it is really helpful to re-write stuff! I started on attention mechanisms with Dynamic Memory Networks, but I still have some trouble proving to myself that the math behind it works. This paper is what I focused on: https://arxiv.org/abs/1506.07285; on p. 8 there is a figure showing the attention mechanism in action.
Hi, I've finally finished some personal stuff and am free to come back to this project. To your points:
Yes, the batch size is quite small, only 32 questions per batch. When you say it didn't complete an epoch, is there a bug or is it just too slow? It takes about 1.5 seconds (or 3, I don't remember) per batch on an AWS p2.xlarge machine with a K80 GPU, but yes, Tensorflow might not be very fast here (and this implementation can also be optimized).
Yes, I think what you've written for partial training on the embedding works :)
I think your dropout on embedding works too.
The doc_length variable is used to find the actual length of every document in the batch, which the RNN needs because documents are padded to a common length; those per-example lengths can't come from the fixed parameters in the opt object.
And I really have a feeling that the attention part is not right; it doesn't look like what is described here: https://distill.pub/2016/augmented-rnns/. Yeah, I've seen the paper you linked somewhere too. There are so many interesting articles...
Hi, recently I was implementing DrQA in tensorflow, but my EM on the test set is 56%. Is there some trick I'm missing? Btw, I found it takes 50 min for a single epoch in tensorflow, which is quite slow. Thanks
Hi, indeed, I can also only reach about 56% EM. It's really strange; that's why I've marked this project as "in progress". I wanted to reach at least 60+% EM.
How are you implementing your version? I am converting the Pytorch version to Tensorflow, and I think I've got most of the technical details right... I'm not clear on why it's only 56% EM...
By the way, yes, it's slow for me as well. But I've also read some comparisons between Tensorflow and Pytorch, and TF doesn't seem to be that slow, so I'm still sticking with TF.
Hi, my version is still vanilla, e.g. it doesn't fix most of the embeddings and fine-tune only the top 1000 words. It is a little weird.
Yeah, 2 months ago I thought the 56% was because I didn't have the question-document aligned embedding, or the fine-tuning of the top 100 words; then I fixed the issue with the RNN and added the aligned embedding, and it's still around 56%.
I think it is better to use DrQA's way of cleaning the training data. In the data, some answers start in the middle of a word instead of at the beginning of a word.
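Something along these lines (a hypothetical helper, not the exact DrQA code) can keep only the answers that line up with token boundaries:
def char_span_to_token_span(token_offsets, ans_start, ans_end):
    # token_offsets: list of (char_start, char_end) for every document token
    starts = [i for i, (s, _) in enumerate(token_offsets) if s == ans_start]
    ends = [i for i, (_, e) in enumerate(token_offsets) if e == ans_end]
    if starts and ends:
        return starts[0], ends[0]
    return None  # answer starts or ends inside a token: drop or fix the example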
Hi,
Thanks a lot for porting this to Tensorflow. Your README says that it is not completed; could you tell me what is missing? If you are planning to complete it, could you tell me when?
Thanks again!