
How is the query vector masked and shifted so that the model does not cheat when predicting grammar rules? #16

Open
brando90 opened this issue Jul 9, 2021 · 4 comments

@brando90 commented Jul 9, 2021

I realized that the query vector has to be masked and shifted; otherwise the model can cheat. Just right-shifting the query will not work for a general grammar: if the decoder takes the entire parse tree as input, one can reverse-engineer the rules from the non-terminals. E.g., given the rule

pair -> pair "," pair

then if the query vector contains [start, pair, ",", pair] (assuming BFS ordering) but it is not masked or shifted, the model can cheat. So what the model has to see at the first step is only [start, mask, mask, mask].

Note: I left out the path representation for simplicity of exposition, but the same idea applies with paths: [start->start, start->mask, start->mask, start->mask].
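
To make this concrete, here is a minimal sketch of the masking I have in mind (hypothetical token names and helper, not TreeGen's actual code):

```python
# Minimal sketch of step-wise query masking (hypothetical tokens/helper,
# not TreeGen's actual implementation).
MASK = "<mask>"

def masked_query(nodes, visible):
    """Keep the first `visible` nodes; replace the rest with MASK."""
    return nodes[:visible] + [MASK] * (len(nodes) - visible)

nodes = ["start", "pair", ",", "pair"]  # BFS ordering of the partial tree
print(masked_query(nodes, 1))           # ['start', '<mask>', '<mask>', '<mask>']
```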

@zysszy (Owner) commented Jul 12, 2021

The query vector is a path from the root node to the next node to be expanded in a partial AST. It doesn't let the model cheat (all nodes are selected from a partial AST, and we don't use the self-attention mechanism). Thus, we didn't mask the query vector.
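
For illustration, a minimal sketch of what such a root-to-node path looks like (hypothetical Node class, not the code in this repository):

```python
# Minimal sketch of extracting a root-to-node path from a partial AST
# (hypothetical Node class; not this repository's code).
class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent

def root_path(node):
    """Return the labels on the path from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node.label)
        node = node.parent
    return list(reversed(path))

root = Node("start")
child = Node("pair", parent=root)
print(root_path(child))  # ['start', 'pair']
```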

Zeyu

@brando90 (Author) commented Jul 12, 2021

> The query vector is a path from the root node to the next node to be expanded in a partial AST. It doesn't let the model cheat (all nodes are selected from a partial AST, and we don't use the self-attention mechanism). Thus, we didn't mask the query vector.
>
> Zeyu

Hi Zeyu,

Thanks for the reply! I always appreciate it. However, it didn't address my concern. I am not worried about the way the paths are generated; I am worried about what the input to the transformer decoder (the query vector itself) is. That is what I am worried about, and the decoder does seem to use multi-head self-attention.

Perhaps if I phrase it this way it will be clearer. The input to the AST reader is the true rule sequence. Each time one executes the "current rule", one generates a set of non-terminals. This makes the query vector longer than the target rule sequence the model is learning. Thus, a simple right shift and mask does not work the same way on the query vector as it does on the input to the AST reader. Note this assumes the path embeddings have already been generated correctly, i.e., seeing only previously generated nodes.
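
Here is a minimal sketch of the length mismatch I am describing (toy grammar and a hypothetical encoding, not TreeGen's actual code):

```python
# Toy grammar: pair -> pair "," pair. Each application of the rule adds
# three nodes (two non-terminals and one terminal) to the partial tree,
# so the node count outgrows the rule count. (Hypothetical encoding,
# not TreeGen's actual code.)
RHS = ["pair", ",", "pair"]

nodes = ["start", "pair"]  # after applying start -> pair
rules_applied = 1
for _ in range(2):         # apply pair -> pair "," pair twice
    nodes += RHS
    rules_applied += 1

print(rules_applied, len(nodes))  # 3 rules applied, but 8 query positions
# A right shift by one aligns a length-3 rule sequence, but the query has
# one entry per node (8 here), so the same shift + mask cannot align both.
```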

Did that make sense? Of course I might be overlooking something, hence my opening this discussion.

Cheers!

@zysszy (Owner) commented Jul 23, 2021

Sorry, maybe I do not fully understand.

> That is what I am worried about, and the decoder does seem to use multi-head self-attention.

We don't use multi-head self-attention in the decoder. We only use multi-head attention to model the interaction between the decoder (the query) and the AST / input NL.
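
For illustration, a minimal PyTorch sketch of a decoder step built this way (dimensions, ordering, and names are assumptions, not this repository's code):

```python
# Minimal sketch: the decoder query cross-attends to the AST and NL
# encodings; there is no self-attention over the query itself.
# (Dimensions/ordering/names are assumptions, not this repo's code.)
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
ast_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
nl_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

query = torch.randn(1, 5, d_model)     # path embeddings, one per step
ast_mem = torch.randn(1, 20, d_model)  # AST reader output
nl_mem = torch.randn(1, 12, d_model)   # NL reader output

h, _ = ast_attn(query, ast_mem, ast_mem)  # "AST attention": K/V from the AST
h, _ = nl_attn(h, nl_mem, nl_mem)         # "NL attention": K/V from the NL
# Each query position only ever attends to the encoder memories, never to
# other (future) query positions, so no causal mask is needed here.
```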

Zeyu

@brando90 (Author) commented Aug 4, 2021

Hi Zeyu,

Once again, thanks for the reply.

Do you mind clarifying, then, what "NL attention" and "AST attention" in the decoder mean? There are three arrows in the diagram, so it looks like normal multi-head attention, except that instead of being self-attention, the input is what the usual transformer would take in the decoder phase. Is that what it is?
