yanggeng1995 / vae_tacotron


Need some clarification regarding Reference encoder architecture #6

Open rishikksh20 opened 5 years ago

rishikksh20 commented 5 years ago

@yanggeng1995 In the paper (section 2.2, ReferenceEncoder) the authors state that the output of the GRU layer is passed through two separate fully connected layers, but in this implementation the last GRU state is what gets passed to the two separate FC layers:

def ReferenceEncoder(inputs, input_lengths, filters, kernel_size, strides, is_training, scope='reference_encoder'):
    with tf.variable_scope(scope):
        reference_output = tf.expand_dims(inputs, axis=-1)
        # CNN stack
        for i, channel in enumerate(filters):
            reference_output = conv2d(reference_output, channel, kernel_size,
                                      strides, tf.nn.relu, is_training, 'conv2d_{}'.format(i))

        # Collapse the frequency and channel axes: [N, T, F, C] -> [N, T, F*C]
        shape = shape_list(reference_output)
        reference_output = tf.reshape(reference_output, shape[:-2] + [shape[2] * shape[3]])

        # GRU
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
            cell=GRUCell(128),
            inputs=reference_output,
            sequence_length=input_lengths,
            dtype=tf.float32
        )
        return encoder_state

As you can see, encoder_state is returned instead of encoder_outputs. On the other hand, the authors mention in the same section that they used the same reference encoder as GST Tacotron, and when I went through the best gst-tacotron implementation on GitHub, i.e. https://github.com/syang1993/gst-tacotron, its reference encoder is built from encoder_outputs and works fine:

def reference_encoder(inputs, filters, kernel_size, strides, encoder_cell, is_training, scope='ref_encoder'):
  with tf.variable_scope(scope):
    ref_outputs = tf.expand_dims(inputs,axis=-1)
    # CNN stack
    for i, channel in enumerate(filters):
      ref_outputs = conv2d(ref_outputs, channel, kernel_size, strides, tf.nn.relu, is_training, 'conv2d_%d' % i)

    shapes = shape_list(ref_outputs)
    ref_outputs = tf.reshape(
      ref_outputs,
      shapes[:-2] + [shapes[2] * shapes[3]])
    # RNN (note: no sequence_length is passed here)
    encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
      encoder_cell,
      ref_outputs,
      dtype=tf.float32)

    # Take the last output step and project it through a 128-unit tanh layer
    reference_state = tf.layers.dense(encoder_outputs[:,-1,:], 128, activation=tf.nn.tanh) # [N, 128]
    return reference_state

The interesting thing is that in the GST Tacotron paper the authors mention using the last GRU state as the reference embedding. Please take note and clarify whether to take encoder_outputs or encoder_state as the output of reference_encoder.
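Worth noting: for a GRU, each step's output is that step's state, so on full-length sequences the final output and encoder_state are identical. They differ only when sequence_length is passed and the batch is padded, in which case encoder_state holds the state at each sequence's true last step while encoder_outputs[:, -1, :] can land on a zeroed padding frame. A quick toy check (TF 1.x, matching the snippets above; shapes and sizes are arbitrary):

import numpy as np
import tensorflow as tf

# Toy check: for a GRUCell, dynamic_rnn's final state equals the output at
# each sequence's last *valid* step, while outputs[:, -1, :] can be padding.
inputs = tf.placeholder(tf.float32, [None, 5, 8])
lengths = tf.placeholder(tf.int32, [None])
outputs, state = tf.nn.dynamic_rnn(
    tf.nn.rnn_cell.GRUCell(4), inputs, sequence_length=lengths, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x = np.random.randn(2, 5, 8).astype(np.float32)
    o, s = sess.run([outputs, state], {inputs: x, lengths: [5, 3]})
    print(np.allclose(o[0, -1], s[0]))  # True: sequence 0 uses all 5 steps
    print(np.allclose(o[1, -1], s[1]))  # False: steps 4-5 of sequence 1 are zero padding
    print(np.allclose(o[1, 2], s[1]))   # True: state matches the last valid step

So with padded batches, returning encoder_state is arguably the safer of the two choices.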

Thanks

yanggeng1995 commented 5 years ago

Sorry, I am on the Spring Festival holiday, so I was a little late to see this. I use encoder_state as the output of reference_encoder; the paper does not specify whether to use the state or the output, so this needs to be verified by experiments. As for "On paper author mentioned on ReferenceEncoder (section 2.2) the output of the GRU layers passed through two separate Fully connected layers", you can find that implementation here: https://github.com/yanggeng1995/vae_tacotron/blob/b0288f1caa776a98195dd94d1e8ea7ca6ec05f57/models/modules.py#L5-L20
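For readers who don't follow the link: the two fully connected layers live in the VAE module rather than in ReferenceEncoder itself. A minimal sketch of that pattern, assuming the usual VAE head over the reference embedding (the function name, latent size, and variable names below are illustrative, not copied from the repo):

import tensorflow as tf

def vae_head(reference_embedding, latent_dim=16, scope='vae'):
    # Two separate FC layers project the reference embedding to the mean and
    # log variance of the latent distribution (hypothetical names/sizes).
    with tf.variable_scope(scope):
        mu = tf.layers.dense(reference_embedding, latent_dim, name='mean')
        log_var = tf.layers.dense(reference_embedding, latent_dim, name='log_var')
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        eps = tf.random_normal(tf.shape(mu), dtype=tf.float32)
        z = mu + tf.exp(0.5 * log_var) * eps
        return z, mu, log_var

The mu and log_var outputs would also feed the KL-divergence term of the VAE loss.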