How is the query vector masked and shifted so that the model does not cheat when predicting grammar rules? #16
Comments
The query vector is a path from the root node to the next node to be expanded in a partial AST. It doesn't let the model cheat (all nodes are selected from a partial AST and we don't use the self-attention mechanism). Thus, we didn't mask the query vector. Zeyu
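For readers following along, here is a minimal sketch of what such a root-to-node path could look like; the `Node` class and its fields are illustrative assumptions for this thread, not the repository's actual data structures.

```python
# Sketch: extracting the root-to-node path for the next node to expand in a
# partial AST. Names are illustrative assumptions, not TreeGen's code.

class Node:
    def __init__(self, symbol, parent=None):
        self.symbol = symbol      # grammar symbol at this node
        self.parent = parent      # None for the root
        self.children = []        # filled in as rules are applied

def root_to_node_path(node):
    """Return the list of symbols from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node.symbol)
        node = node.parent
    return list(reversed(path))

# Example: start -> pair, and we are about to expand the `pair` node.
root = Node("start")
pair = Node("pair", parent=root)
root.children.append(pair)
print(root_to_node_path(pair))   # ['start', 'pair']
```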
Hi Zeyu,

Thanks for the reply! I always appreciate it. However, it didn't address my concern. I am not worried about the way paths are generated; I am worried about what the input (the query vector itself) to the transformer decoder is, and the decoder does seem to use multi-head self-attention.

Perhaps if I phrase it this way it will be clearer. The input to the AST reader is the true rule sequence. Each time one executes the "current rule", one generates a set of non-terminals. This results in the query vector being longer than the target rule sequence the model is learning. Thus, a simple right shift and a mask do not work the same on the query vector as they do on the input to the AST reader. Note this assumes the path embeddings have already been generated correctly, seeing only previously generated nodes.

Did that make sense? Of course I might be overlooking something - hence my initiative for a discussion. Cheers!
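To make the alignment issue concrete, here is a hedged sketch (in PyTorch) of one way the query positions could be masked per decoding step. The `created_by` bookkeeping and the function name are assumptions introduced only for illustration; they are not part of TreeGen's code.

```python
# Each query position (a node in the partial AST) is created by some rule in
# the target rule sequence, and one rule can create several nodes, so the
# query is longer than the rule sequence and a plain "shift right by one"
# causal mask does not line up. One possible fix: make node j visible at
# decoding step t only if the rule that created it comes strictly before t.
# `created_by` is an illustrative assumption, not TreeGen's API.

import torch

def query_visibility_mask(created_by, num_rules):
    """created_by[j] = index of the rule that produced query node j (-1 = root).
    Returns a (num_rules, num_nodes) boolean mask: True = visible at that step."""
    created_by = torch.as_tensor(created_by)          # (num_nodes,)
    steps = torch.arange(num_rules).unsqueeze(1)      # (num_rules, 1)
    return created_by.unsqueeze(0) < steps            # visible iff its rule < t

# Example: rule 0 (start -> pair "," pair) creates nodes 1..3 of the query
# [start, pair, ",", pair]; the root exists before any rule is applied.
mask = query_visibility_mask([-1, 0, 0, 0], num_rules=2)
print(mask)
# step 0 sees only [start, ., ., .]; step 1 sees the full query
```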
Sorry, maybe I do not fully understand.
We don't use multi-head self-attention in the decoder. We only use multi-head attention to achieve the interaction between the decoder (query) and the AST/input NL. Zeyu
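A minimal sketch of the attention pattern described above, assuming standard PyTorch multi-head attention: the decoder's query attends to the NL reader's and AST reader's outputs (cross-attention) rather than to itself. The module names, shapes, and two-stage ordering are illustrative assumptions, not the repository's actual architecture.

```python
# Cross-attention sketch: decoder query attends over reader outputs.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
nl_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ast_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

query      = torch.randn(1, 4, d_model)    # decoder query (one per node)
nl_states  = torch.randn(1, 12, d_model)   # NL reader outputs
ast_states = torch.randn(1, 7, d_model)    # AST reader outputs

# "NL attention": query attends over the NL reader's outputs.
h, _ = nl_attn(query, nl_states, nl_states)
# "AST attention": the result then attends over the AST reader's outputs.
h, _ = ast_attn(h, ast_states, ast_states)
```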
Hi Zeyu, Once again thanks for the reply. Do you mind clarifying, then, what "NL attention" or "AST attention" in the decoder means? There are three arrows in the diagram, so it looks like normal multi-head attention, but instead of being self-attention, the input is what the usual transformer would take in the decoder phase. Is that what it is?
I realized that the query vector has to be masked and shifted - otherwise the model can cheat. Just right-shifting the query will not work for a general grammar, because if the input to the decoder takes in the entire parse tree, then one can reverse-engineer the rules from the non-terminals. E.g., if the query vector contains

[start, pair, ",", pair]

(assuming BFS ordering) but it is not masked or shifted, the model can cheat. So what the model has to see for the first step is only

[start, mask, mask, mask]

Note, I decided not to use the path for simplicity of exposition, but you can do

[start->start, start->mask, start->mask, start->mask]
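A small sketch of the stepwise masking proposed here, at the symbol level; the `masked_query` helper and the `created_by` bookkeeping are hypothetical, introduced only to illustrate the example above.

```python
# At step t, nodes produced by rules >= t are replaced by a MASK token.
# The query [start, pair, ",", pair] and BFS ordering follow the example.

MASK = "mask"

def masked_query(query, created_by, step):
    """created_by[j] = index of the rule that produced node j (-1 for root)."""
    return [sym if created_by[j] < step else MASK
            for j, sym in enumerate(query)]

query      = ["start", "pair", ",", "pair"]
created_by = [-1, 0, 0, 0]   # rule 0 expands start into pair "," pair

print(masked_query(query, created_by, step=0))  # ['start', 'mask', 'mask', 'mask']
print(masked_query(query, created_by, step=1))  # ['start', 'pair', ',', 'pair']
```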