GuangChen2016 opened this issue 7 years ago
@GuangChen2016 What a mistake, I didn't even notice it! As you can see in my code, there are two lines for setting a 'PAD' tag. Obviously, the second line was meant to set the 'unk' tag, not 'PAD'. Thanks for pointing it out; I've already changed it :)
@zakizhou So tag 0 is only used for 'PAD', and the actual number of classes, including 'unk', is TAGS_SIZE - 1? Am I right? Another question: why do you use two different implementations of the loss for the LSTM and the BLSTM?
@zakizhou And if it is a sequence regression task, how can I modify the loss so that it does not include the loss contributed by the padding? Could you give some suggestions? Thank you!
I think TAGS_SIZE (including PAD and UNK) is predefined before the implementation starts. Here I chose PAD = 0 and UNK = TAGS_SIZE - 1, so that in total there are TAGS_SIZE tags, of which TAGS_SIZE - 2 are actually meaningful.
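Just to illustrate that scheme (the helper and the tag names below are mine, not taken from the repo), the id assignment could look like:

# hypothetical tag vocabulary: PAD gets id 0, the real tags get 1 .. TAGS_SIZE - 2, UNK gets TAGS_SIZE - 1
def build_tag2id(real_tags):
    tag2id = {'PAD': 0}
    for i, tag in enumerate(real_tags):
        tag2id[tag] = i + 1
    tag2id['UNK'] = len(real_tags) + 1  # == TAGS_SIZE - 1
    return tag2id

tag2id = build_tag2id(['NN', 'VB', 'JJ'])  # TAGS_SIZE == 5, of which 3 tags are meaningful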
When I started this repo I chose a single-directional LSTM, but then I realized that the backward information is also helpful, because we humans determine the POS of words in a sentence only after it is completely written down (unlike a language model).
@zakizhou Thank you, but I think you have misunderstood my question. You explained the difference between the LSTM and the BiLSTM, but what I want to know is why the implementations of the loss function for the LSTM and the BiLSTM are different: in the LSTM you compute the loss in another way instead of with the "mask" used in the BiLSTM. Thank you.
I don't understand what you mean by "use another way"; in my code I did "mask" the loss in the BiLSTM. After dynamic_bidirectional_rnn I get two tensors shaped [batch_size, max_steps, forward_units] and [batch_size, max_steps, backward_units]. I concat them along the third dim into a [batch_size, max_steps, forward_units + backward_units] tensor, do a fully connected layer and a softmax with the labels, and finally get a loss tensor shaped [batch_size, max_steps] whose elements are floats. I also create a mask tensor shaped [batch_size, max_steps] whose elements are 0 or 1 (1 means this loss should exist, 0 means it should not); an element-wise product then keeps only the true loss.
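Not the repo's exact code, but a rough sketch of those steps with TF 1.x-style names (the cell sizes, placeholder names and the use of tf.sequence_mask are my own choices; the repo itself builds the mask from the tags with tf.sign):

import tensorflow as tf

# word_embeddings: [batch_size, max_steps, embed_dim], tags: [batch_size, max_steps] int ids,
# sequence_lengths: [batch_size]; all names here are illustrative
cell_fw = tf.nn.rnn_cell.LSTMCell(forward_units)
cell_bw = tf.nn.rnn_cell.LSTMCell(backward_units)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, word_embeddings,
    sequence_length=sequence_lengths, dtype=tf.float32)

outputs = tf.concat([out_fw, out_bw], axis=2)   # [batch_size, max_steps, forward_units + backward_units]
logits = tf.layers.dense(outputs, TAGS_SIZE)    # fully connected over the last dim

# per-position cross entropy: [batch_size, max_steps]
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tags, logits=logits)

# 0/1 mask: 1 for a real token, 0 for padding
mask = tf.cast(tf.sequence_mask(sequence_lengths, tf.shape(tags)[1]), tf.float32)
loss = tf.reduce_sum(losses * mask) / tf.reduce_sum(mask)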
@zakizhou Sure, I understand your BiLSTM implementation. What I don't understand is the implementation of the loss function in the single-direction LSTM, which looks like the following:

with tf.name_scope("loss"):
    targets = [tf.squeeze(tag_, [1]) for tag_ in tf.split(1, max_steps, inputs.tags)]
    weights_bool = [tf.greater_equal(inputs.sequence_lengths, step) for step in range(1, max_steps + 1)]
    weights = [tf.cast(weight, tf.float32) for weight in weights_bool]
    cross_entropy_per_example = tf.nn.seq2seq.sequence_loss_by_example(logits=logits, targets=targets, weights=weights)

Why don't you use masking here?
What I implemented in the single-directional LSTM is indeed also a mask; I just named it "weights_bool", and it is likewise a tensor of 0/1 elements. In that class I only wanted to try out the seq2seq library in tf.nn, but I found that library is not as convenient as I expected, so I eventually abandoned it; you can see I didn't even import that class in train.py.
Your second question:
If you want to do sequence regression, the idea is the same: after you get a tensor shaped [batch_size, max_steps, forward_units + backward_units], create a weight shaped [forward_units + backward_units], do the fully connected layer, and you get a [batch_size, max_steps] tensor. Your targets are also shaped [batch_size, max_steps] (the core problem is that 'PAD' is 0 in this tensor, so you can hardly tell whether an element is a pad or a true float value close to 0). Then calculate the SSE, mask out the padded positions, and minimize it.
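As a purely illustrative sketch of that projection step (the names are mine, and batch_size/max_steps are assumed to be known Python ints; the masked SSE itself is shown in the code further down in this thread):

# outputs: [batch_size, max_steps, units] from the BiLSTM, with units = forward_units + backward_units
w = tf.get_variable("regression_w", [units, 1], dtype=tf.float32)
b = tf.get_variable("regression_b", [1], dtype=tf.float32)
flat = tf.reshape(outputs, [-1, units])                                    # [batch_size * max_steps, units]
predictions = tf.reshape(tf.matmul(flat, w) + b, [batch_size, max_steps])  # one float per position
# targets is also [batch_size, max_steps]; the padded positions are removed by the mask later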
If you cannot create a mask from your targets (like using tf.sign for classification), I think the best way to create it is from the sequence lengths: supposing that after parsing the tfrecords file each example contains a scalar tensor sequence_length, your goal is to create a mask shaped [batch_size, max_steps] from a batch of those scalars. But I haven't found such a function in tf, because at graph-construction time you don't know what max_steps is (it is only fully known at run time).
Personally I have spent a lot of time studying dynamic_rnn in tf, but so far I haven't found it convenient to use. Look at the official seq2seq tutorial and you will find it quite awkward. Maybe after release v1.0 it will get better.
Thank you for your reply. I have also spent a lot of time on Seq2Seq modeling; I am using v0.12 now. As you suggested, I can get a [batch_size, max_steps, Dim_for_each_frame] tensor, and after padding my targets are also shaped [batch_size, max_steps, Dim_for_each_frame]. And I have two questions:
From what I have learnt about rnn in tf, if you manually pad all the sequences to the same max_steps (all sequences, not only those within a single batch), then you can do something like this:
# outputs and targets are the two tensors you mentioned
fake_loss_per_example_per_step = tf.reduce_sum(tf.square(outputs - targets), axis=2)  # [batch_size, max_steps]
# create a mask
# sequence_lengths is a vector concatenated from the sequence_length of each example in the batch
mask = tf.cast(tf.sequence_mask(sequence_lengths, max_steps), tf.float32)  # [batch_size, max_steps]
masked_loss_per_example = tf.reduce_sum(tf.mul(fake_loss_per_example_per_step, mask), axis=1) / tf.cast(sequence_lengths, tf.float32)
sse = tf.reduce_mean(masked_loss_per_example)
opt = ...
train_op = opt.minimize(sse)
If we don't have max_steps (whether as a Python int or as a scalar int tensor), I don't know how to implement it :(
Thank you very much, I got it. How did you get familiar with and remember all the functions provided by tensorflow, like tf.sequence_mask and the use of axis? I simply didn't know how to generate a mask from (sequence_lengths, max_steps), because I didn't know tensorflow already had such a function, and I often make mistakes when using "axis". Do you look through all the functions provided by tensorflow first? Could you give some suggestions for learning tensorflow, especially for the Seq2Seq model? Thank you.
I think the Seq2Seq model in TF is currently not complete; if you take a close look at it you will find that all the inputs are lists of tensors (like the input of the static tf.nn.rnn), which can't handle dynamic padding, so I chose not to use it (the good news is that functions like dynamic_rnn_decoder are in progress).
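Roughly, the difference looks like this (a sketch with the rnn signatures of that era; the variable names are mine):

# static rnn: the input must be an unrolled Python list of max_steps tensors,
# each shaped [batch_size, input_dim], so max_steps has to be fixed up front
inputs_list = tf.unpack(inputs, axis=1)
outputs_list, state = tf.nn.rnn(cell, inputs_list, sequence_length=sequence_lengths, dtype=tf.float32)

# dynamic rnn: a single [batch_size, max_steps, input_dim] tensor,
# and max_steps may differ from batch to batch
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=sequence_lengths, dtype=tf.float32)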
I didn't look at all the functions, just the ones I need to use. Besides, not all functions are listed in the api; I think the docs on the official site should be updated as soon as possible, but they are focusing on the core part of TF and don't have much time for the docs :(
Yeah, I have also noticed that not all functions are listed in the api. So how do you know whether TF has the function you want or not? And do you have any other resources for learning more about Seq2Seq, including implementations in other applications?
Sorry, I don't know any other resources. I followed the usual steps for learning nlp and found seq2seq for auto-reply and translation, but when I realized the official api was not complete, I decided to study other things in nlp (like knowledge graphs and relation extraction) and wait for release v1.0. As far as I can see from the commits of the core team of tf, there will be big changes to the rnn apis. So just be patient and wait for v1.0. If you are eager to learn it, why not read some papers about seq2seq instead of the implementations in TF?
I have read enough papers on Seq2Seq; I used to use kaldi to implement Seq2Seq, but it is not that convenient, so I want to move to TF. Thank you very much; I hope we can keep communicating and discussing in the future. Thank you for your help again.
@zakizhou I have looked into the code of your BLSTM implementation for sequence labeling, and when calculating the loss for the BLSTM you use "mask = tf.cast(tf.sign(inputs.tags), dtype=tf.float32)" to identify the real part of each sequence. I have a problem with this, because the padding id for inputs.tags is tag2id['PAD'] = TAGS_SIZE - 1 rather than 0. So I want to know whether this is a mistake or not? Thank you very much.
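For reference, a toy illustration (made-up numbers, not repo code) of why a tf.sign mask is only correct when the PAD id is 0, which is exactly the mix-up acknowledged at the top of this thread:

# suppose 0 is the PAD id: tf.sign zeroes out exactly the padded positions
tags = tf.constant([[3, 7, 0, 0]])
mask_ok = tf.cast(tf.sign(tags), tf.float32)       # [[1., 1., 0., 0.]]

# but if PAD were TAGS_SIZE - 1 (say 9), tf.sign never produces a 0
tags_bad = tf.constant([[3, 7, 9, 9]])
mask_bad = tf.cast(tf.sign(tags_bad), tf.float32)  # [[1., 1., 1., 1.]] -- the padding is not masked out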