Hi,
Thanks :) In my tests, the performance is EM 55% on the test set. What's missing is at least:
Then there are things to be verified:
What I'll not do
As for performance, I plan to get it to at least 65% EM in the coming month, but it may not rival the original implementation from the paper.
Are you looking at this model as well?
Hey,
Thanks for your detailed answer! I'm planning to implement it in TensorFlow, and your version is a really good baseline.
I'll submit issues if I see them! Best,
Hi again,
Concerning your NaN problem, it seems to come from tf.exp(alpha_flat), which returns inf values. I fixed it by setting stddev=0.1 on the weights for the ReLU, so the values stay small enough that exp doesn't overflow to inf.
W = tf.Variable(tf.random_normal(shape=[input_size, input_size], mean=0.0, stddev=0.1, dtype=tf.float32, name='ReLU_weight'))
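Another option, if alpha_flat only feeds a softmax, could be to subtract the row-wise max before exponentiating, so tf.exp can't overflow no matter how big the weights are. Just a sketch on my side, assuming the scores are along the last axis; I haven't tested it in your code:
# numerically stable softmax over the last axis (sketch, not from the repo)
alpha_max = tf.reduce_max(alpha_flat, axis=-1, keep_dims=True)
alpha_exp = tf.exp(alpha_flat - alpha_max)
alpha = alpha_exp / tf.reduce_sum(alpha_exp, axis=-1, keep_dims=True)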
I have some questions if you have the time to answer them.
Why aren't you using biases? I see that you commented them out, but I fail to see the reason behind it.
What is the purpose of doc_input_size? I don't see it being used later on. Also, what does use_qemb mean? Why would you add the embedding dimension? Is this for the aligned question embedding? If so, I guess it's False by default because of the NaNs you're getting?
doc_input_size = opt['embedding_dim'] + opt['num_features']
if opt['use_qemb']:
    doc_input_size += opt['embedding_dim']
if opt['pos']:
    doc_input_size += opt['pos_dim']
if opt['ner']:
    doc_input_size += opt['ner_dim']
Can you justify the initialization of the different variables? I never know how to initialize them.
with tf.variable_scope('BilinearSeqAttention'):
    W = tf.Variable(tf.truncated_normal([y_size, x_size], dtype=tf.float32))
and
with tf.variable_scope('SeqAttnMatch'):
    W = tf.Variable(tf.random_normal(shape=[input_size, input_size], dtype=tf.float32))
I don't understand this either, and if I leave the flags set to True, my code fails to build.
if opt['concat_rnn_layers']:
    doc_hidden_size *= opt['doc_layers']
    question_hidden_size *= opt['question_layers']
Sorry to bother you with all of these questions! Thanks again for your code, it helped me quite a lot!
Hi, great that you've looked into the NaN issue. As for your questions:
I commented out the bias because I looked at torch.Linear and also at how the original DrQA works; the behavior is just a matrix multiplication. Yes, we should verify it and try with a bias.
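Roughly what I mean, with made-up sizes (just a sketch, not the exact repo code): without the bias, the layer is nothing more than a matrix multiplication, like torch.Linear with bias=False.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 128])                # [batch, input_size]
W = tf.Variable(tf.random_normal([128, 128], stddev=0.1))
y = tf.matmul(x, W)                                        # no "+ b" term, pure matmul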
doc_input_size is the size of the input to the bidirectional LSTM. Yes, the size can sometimes be found by just looking at the Tensor, so this variable isn't strictly necessary.
Yes, use_qemb is for the aligned question embedding. Currently when it's true, I'm getting NaNs, so that's why it's false... Setting it to true and solving the NaN problem will improve the performance.
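For reference, the aligned question embedding is roughly the following (a sketch, assuming doc_emb is [batch, doc_len, dim], quest_emb is [batch, q_len, dim] and W is the SeqAttnMatch weight above; padding would still need to be masked before the softmax). The softmax/exp step is exactly where the NaNs show up, and tf.nn.softmax subtracts the max internally, which should help:
doc_proj = tf.nn.relu(tf.tensordot(doc_emb, W, axes=[[2], [0]]))    # shared ReLU projection
q_proj = tf.nn.relu(tf.tensordot(quest_emb, W, axes=[[2], [0]]))
scores = tf.matmul(doc_proj, q_proj, transpose_b=True)              # [batch, doc_len, q_len]
alpha = tf.nn.softmax(scores)                                       # attention over question words
q_aligned = tf.matmul(alpha, quest_emb)                             # [batch, doc_len, dim]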
I remember from somewhere that truncated_normal helps to avoid the NaN problem (I'm not sure).
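If I recall correctly, truncated_normal re-draws any sample that lands more than two standard deviations from the mean, so there are no extreme initial weights; another common choice (not what the repo does) would be Xavier initialization, for example:
W = tf.Variable(tf.truncated_normal([input_size, input_size], stddev=0.1))  # tails cut at 2*stddev
W_xavier = tf.get_variable('W_xavier', shape=[input_size, input_size],
                           initializer=tf.contrib.layers.xavier_initializer())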
This concat_rnn_layers flag is what I was talking about regarding the difference between Torch and TF: in Pytorch you get the outputs of every layer of a multi-layer RNN, but in TF, with the MultiRNNCell wrapper, I only get the output of the top layer. Changing the number of layers doesn't change the size of the MultiRNNCell output, so there's no point in setting opt['concat_rnn_layers']=True.
But this is also something to be verified; if I'm doing it wrong here, it could be another place where we can improve the accuracy.
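One way to mimic the Pytorch behavior (a sketch, not what the repo currently does) is to skip MultiRNNCell and run each bidirectional layer separately, keeping every layer's output so they can be concatenated when opt['concat_rnn_layers'] is true:
import tensorflow as tf

def stacked_brnn(inputs, length, hidden_size, num_layers, concat_layers=True):
    # inputs: [batch, time, dim]; length: [batch] true sequence lengths
    outputs = [inputs]
    for layer in range(num_layers):
        with tf.variable_scope('brnn_layer_%d' % layer):
            fw = tf.nn.rnn_cell.LSTMCell(hidden_size)
            bw = tf.nn.rnn_cell.LSTMCell(hidden_size)
            (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
                fw, bw, outputs[-1], sequence_length=length, dtype=tf.float32)
            outputs.append(tf.concat([out_fw, out_bw], axis=2))
    if concat_layers:
        return tf.concat(outputs[1:], axis=2)   # outputs of every layer, like Pytorch DrQA
    return outputs[-1]                          # top layer only, like MultiRNNCell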
No problem at all! Great that it's helping you. I find it extremely helpful to re-write the Pytorch version in Tensorflow; I was looking for an opportunity to understand how attention works, and the articles and diagrams on the internet are still far from actually implementing it and understanding the math behind it.
Hi,
Thanks for your answer!
I'm having trouble running the code, as the batch size is pretty small. In fact, I have a total of approximately 2500 batches per epoch. Is this an issue for you? I ran the code last night and it didn't complete a single epoch. However, if I force the code to do only one batch, it gets through it without problems. How many batches do you have, and how long does it take you to train the model?
I tried something for the top-k word embedding; no idea if it works, but it should:
embedding_trainable = tf.get_variable("word_embedding_trainable", shape=[params.tune_partial, params.embedding_dim], trainable=True)
embedding_fixed = tf.get_variable("word_embedding_fixed", shape=[params.fvocab_size - params.tune_partial, params.embedding_dim], trainable=False)
# the assign ops only take effect if they are actually run, so keep them and
# run them once (sess.run(init_embeddings)) after the variable initializer
init_embeddings = [embedding_trainable.assign(params.embedding[:params.tune_partial]),
                   embedding_fixed.assign(params.embedding[params.tune_partial:])]
embedding = tf.concat([embedding_trainable, embedding_fixed], axis=0, name="word_embed")
Concerning the dropout, I think that this should work
doc_emb = tf.nn.dropout(doc_emb, params.dropout_rnn, name='dropout_doc_emb')
quest_emb = tf.nn.dropout(quest_emb, params.dropout_rnn, name='dropout_quest_emb')
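One thing I'm not sure about: tf.nn.dropout takes a keep probability, and it should probably be turned off at evaluation time, so maybe something like this instead (just a guess on my side, assuming params.dropout_rnn is the drop rate):
keep_prob = tf.placeholder_with_default(1.0, shape=[], name='keep_prob')
doc_emb = tf.nn.dropout(doc_emb, keep_prob, name='dropout_doc_emb')
quest_emb = tf.nn.dropout(quest_emb, keep_prob, name='dropout_quest_emb')
# feed {keep_prob: 1.0 - params.dropout_rnn} while training, nothing at eval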
Finally, why do you compute the length of the different components? For example, in StackedBRNN, this is computed:
with tf.name_scope("doc_length"):
    words_used_in_sent = tf.sign(tf.reduce_max(tf.abs(input_data), reduction_indices=2))
    length = tf.cast(tf.reduce_sum(words_used_in_sent, reduction_indices=1), tf.int32)
Wouldn't it be possible to simply use the parameters (opt) rather than dedicating some computation to getting the length? Or am I missing something here?
I agree that it is really helpful to re-write stuff! I started on attention mechanisms with Dynamic Memory Networks, but I still have some trouble proving to myself that the math behind it works. This paper is what I focused on: https://arxiv.org/abs/1506.07285; on p. 8 there is a figure showing the attention mechanism in action.
Hi, I've finally finished some personal stuff and am free to come back to this project. To your points:
Yes, the batch size is quite small, only 32 questions per batch. When you say it didn't complete an epoch, is there a bug or is it just too slow? It takes about 1.5 seconds (or 3, I don't remember) per batch on an AWS p2.xlarge machine with a K80 GPU, but yes, Tensorflow might not be very fast here (and this implementation can also be optimized).
Yes, I think what you've written for partial training on the embedding works :)
I think your dropout on embedding works too.
The doc_length variable is used to find the actual length of every document in the batch, which the RNN needs because documents are padded to a common length; those per-example lengths can't come from the fixed parameters in the opt object.
And I really have a feeling that the attention part is not right; it doesn't look like what is described here: https://distill.pub/2016/augmented-rnns/. Yeah, I've seen the paper you linked somewhere too. There are so many interesting articles...
Hi, recently I was implementing DrQA in tensorflow, but my EM on the test set is 56%. Is there some trick I'm missing? Btw, I found it takes 50 min for a single epoch in tensorflow, which is quite slow. Thanks
Hi, indeed, I can also only reach about 56% EM. It's really strange; that's why I've marked this project as "in progress". I wanted to reach at least 60+% EM.
How are you implementing your version? I am converting the Pytorch version to Tensorflow, and I think I've got most of the technical details right... I'm not clear on why it's only 56% EM...
By the way, yes, it's slow for me as well. But I've also read some comparisons between Tensorflow and Pytorch, and TF doesn't seem to be that slow, so I'm still sticking with TF.
Hi, my version is still vanilla, e.g. it doesn't fix most of the embeddings and fine-tune only the top 1000 words. It is a little weird.
Yeah, 2 months ago I thought the 56% was because I didn't have the question-document aligned embedding, or the fine-tuning of the top 100 words; then I fixed the issue with the RNN and added the aligned embedding, and it's still around 56%.
I think it is better to use DrQA's way of cleaning the training data. In the data, some answers start in the middle of a word instead of at the beginning of a word.
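Something along these lines (a hypothetical helper, not the exact DrQA code) can keep only the answers that line up with token boundaries:
def char_span_to_token_span(token_offsets, ans_start, ans_end):
    # token_offsets: list of (char_start, char_end) for every document token
    starts = [i for i, (s, _) in enumerate(token_offsets) if s == ans_start]
    ends = [i for i, (_, e) in enumerate(token_offsets) if e == ans_end]
    if starts and ends:
        return starts[0], ends[0]
    return None  # answer starts or ends inside a token: drop or fix the example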
Hi,
Thanks a lot for porting this to Tensorflow. Your README says that it is not completed; could you tell me what is missing? If you are planning to complete it, could you tell me when?
Thanks again!