zysszy / TreeGen

A Tree-Based Transformer Architecture for Code Generation. (AAAI'20)
MIT License

How is the query vector masked and shifted so that the model does not cheat when predicting grammar rules? #16

Open brando90 opened 3 years ago

brando90 commented 3 years ago

I realized that the query vector has to be masked and shifted - otherwise the model can cheat. Just right-shifting the query will not work for a general grammar, because if the input to the decoder contains the entire parse tree, one can reverse-engineer the rules from the non-terminals, e.g.

pair -> pair "," pair

then if the query vector contains [start, pair, ",", pair] (assuming BFS ordering) but is not masked or shifted, the model can cheat. So what the model should see at the first step is only [start, mask, mask, mask].

Note, I decided not to use the path for simplicity of exposition, but you could do [start->start, start->mask, start->mask, start->mask].
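
To make the concern concrete, here is a minimal sketch of the step-wise masking I have in mind, assuming BFS ordering as above (the names `visible_query` and `produced_at_step` are made up for illustration; this is not TreeGen code):

```python
def visible_query(nodes, produced_at_step, t, mask_token="mask"):
    """Mask every node that was produced at step t or later."""
    return [n if s < t else mask_token for n, s in zip(nodes, produced_at_step)]

# Toy partial tree after applying rule 0 (pair -> pair "," pair) at the root:
# node list in BFS order, plus the step whose rule produced each node.
nodes            = ["start", "pair", '","', "pair"]
produced_at_step = [-1, 0, 0, 0]   # "start" exists before any rule is applied

# When predicting rule 0, none of its products may be visible:
print(visible_query(nodes, produced_at_step, t=0))  # ['start', 'mask', 'mask', 'mask']
# When predicting rule 1, the children created by rule 0 become visible:
print(visible_query(nodes, produced_at_step, t=1))  # ['start', 'pair', '","', 'pair']
```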

zysszy commented 3 years ago

The query vector is a path from the root node to the next node to be expanded in a partial AST. It doesn't let the model cheat (all nodes are selected from a partial AST, and we don't use the self-attention mechanism). Thus, we didn't mask the query vector.
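
Roughly, it looks like the following sketch (toy code with made-up names, not the exact TreeGen implementation). The path only touches nodes that already exist in the partial tree, so no future information is involved:

```python
class Node:
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []

def path_to_root(node):
    """Labels on the path root -> ... -> node, using only already-generated nodes."""
    path = []
    while node is not None:
        path.append(node.label)
        node = node.parent
    return list(reversed(path))

# Toy partial AST after applying  pair -> pair "," pair  at the root "pair":
root = Node("start")
pair = Node("pair", parent=root); root.children.append(pair)
left = Node("pair", parent=pair); pair.children.append(left)

print(path_to_root(left))   # ['start', 'pair', 'pair']
```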

Zeyu

brando90 commented 3 years ago

> The query vector is a path from the root node to the next node to be expanded in a partial AST. It doesn't let the model cheat (all nodes are selected from a partial AST, and we don't use the self-attention mechanism). Thus, we didn't mask the query vector.
>
> Zeyu

Hi Zeyu,

Thanks for the reply! I always appreciate it. However, it didn't address my concern. I am not worried about the way the paths are generated; I am worried about what the input to the transformer decoder is (the query vector itself). That is what I am worried about, and the decoder does seem to use multi-head self-attention.

Perhaps if I phrase it this way it will be clearer. The input to the AST reader is the true rule sequence. Each time one executes the "current rule", one generates a set of non-terminals. This makes the query vector longer than the target rule sequence the model is learning. Thus, a simple right shift and a mask do not work the same way on the query vector as they do on the input to the AST reader. Note this assumes the path embeddings have already been generated correctly, seeing only previously generated nodes.
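
To put toy numbers on that mismatch (illustrative only; the second rule below is made up and not taken from the paper's grammar):

```python
# Each applied rule can create several nodes, so the BFS node sequence grows
# faster than the rule sequence: a one-position right shift over nodes does
# not hide exactly one rule.
rule_rhs = {
    "pair -> pair ',' pair": ["pair", "','", "pair"],
    "pair -> STRING ':' value": ["STRING", "':'", "value"],   # hypothetical rule
}

node_seq = ["start"]   # BFS node sequence of the partial tree
rule_seq = []          # rules applied so far

for rule in ["pair -> pair ',' pair", "pair -> STRING ':' value"]:
    rule_seq.append(rule)
    node_seq.extend(rule_rhs[rule])
    print(f"rules applied: {len(rule_seq)}, nodes in BFS query: {len(node_seq)}")
# rules applied: 1, nodes in BFS query: 4
# rules applied: 2, nodes in BFS query: 7
```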

Did that make sense? Of course, I might be overlooking something - hence my opening this discussion.

Cheers!

zysszy commented 3 years ago

Sorry, maybe I do not fully understand.

> That is what I am worried about, and the decoder does seem to use multi-head self-attention.

We don't use multi-head self-attention in the decoder. We only use multi-head attention to model the interaction between the decoder (query) and the AST / input NL.
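
In a rough single-head numpy sketch (illustrative only, not the exact TreeGen code), that interaction looks like the following: the queries come from the decoder's features, while the keys and values come from the AST reader or the NL reader, so there is no query-to-query attention to mask. TreeGen splits this into multiple heads.

```python
import numpy as np

def cross_attention(query_feats, memory_feats):
    """query_feats: (L_q, d) decoder features; memory_feats: (L_m, d) AST or NL encodings."""
    d = query_feats.shape[-1]
    scores = query_feats @ memory_feats.T / np.sqrt(d)        # (L_q, L_m)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over memory positions
    return weights @ memory_feats                             # (L_q, d)

rng = np.random.default_rng(0)
q   = rng.standard_normal((5, 16))    # 5 decoder (query) positions
ast = rng.standard_normal((12, 16))   # 12 encoded AST nodes (or NL tokens)
print(cross_attention(q, ast).shape)  # (5, 16)
```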

Zeyu

brando90 commented 3 years ago

Hi Zeyu,

Once again, thanks for the reply.

Do you mind clarifying, then, what "NL attention" or "AST attention" in the decoder means? There are three arrows in the diagram, so it looks like normal multi-head attention, but instead of being self-attention, the input is what the usual transformer would take in the decoder phase. Is that what it is?