
Confusing notation/source for AttentionalAggregation #5400

Closed
RafiBrent opened this issue Sep 9, 2022 · 3 comments · Fixed by #5449


RafiBrent commented Sep 9, 2022

📚 Describe the documentation issue

I may be misunderstanding the documentation/code, and if so please correct me. However, I believe there are a few issues with the documentation for AttentionalAggregation.

The first confusing aspect is the use of the Hadamard product symbol between the softmaxed output of h_gate (shape [-1, 1]) and the output of h_theta (shape [-1, out_channels]). As I understand it, the mathematical convention is that this symbol is only used between arrays of the same size, so if what actually happens is that each row of the output of h_theta is scalar-multiplied by the corresponding entry of h_gate, I believe there is a clearer way to express this.

Secondly, and more importantly, I believe that this module performs a fundamentally different aggregation from the one in the paper cited in the documentation. Despite the superficial similarity of the formulas, Equation 3 in the “Gated Graph Sequence Neural Networks” paper applies a neural network to the feature vector of a single node, outputs a vector (rather than a scalar), and then softmaxes across this modified feature vector. Thus, instead of a single softmaxed vector of size num_nodes, the neural network from the paper generates num_nodes different softmaxed vectors, each of which is independently multiplied element-wise by the corresponding output of the second neural network. Fundamentally, the operation in the paper applies attentional weights to the channels of each (post-neural-network) feature vector, while the operation in PyG applies attentional weights to the set of nodes as a whole, since all channels of a given node are multiplied by the same scalar output of h_gate.

Please let me know if this interpretation is correct; if so, it would be helpful to modify the citation in some way to avoid the confusion. Thanks so much for your help.
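To make the distinction concrete, here is a rough sketch in plain PyTorch (the names h_gate and h_theta just stand in for the two networks, and the shapes are illustrative only):

```python
import torch

num_nodes, in_channels, out_channels = 5, 16, 8
x = torch.randn(num_nodes, in_channels)              # node features of one graph

h_gate = torch.nn.Linear(in_channels, 1)             # scalar score per node
h_theta = torch.nn.Linear(in_channels, out_channels)

# What I understand AttentionalAggregation to compute: one softmax over the
# node dimension, so every channel of a node is scaled by the same scalar.
alpha = torch.softmax(h_gate(x), dim=0)              # [num_nodes, 1]
out_node_level = (alpha * h_theta(x)).sum(dim=0)     # [out_channels]

# My reading of the cited paper: the gate network outputs a full vector per
# node, the softmax is taken within each node's vector, and the result is
# multiplied element-wise by the second network's output for that node
# (summed over nodes here for the graph-level readout).
h_gate_vec = torch.nn.Linear(in_channels, out_channels)
alpha_vec = torch.softmax(h_gate_vec(x), dim=-1)     # [num_nodes, out_channels]
out_feature_level = (alpha_vec * h_theta(x)).sum(dim=0)  # [out_channels]
```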

Suggest a potential alternative/fix

No response

rusty1s (Member) commented Sep 15, 2022

Really sorry for the late reply. Interestingly, we use the implementation from https://arxiv.org/pdf/1904.12787.pdf (Eq. 3), which cites the initial work of Li et al., 2015. I will change the reference in the documentation.
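For reference, the node-level readout the module currently documents is roughly

$$\mathbf{r}_i = \sum_{n=1}^{N_i} \mathrm{softmax}\left(h_{\mathrm{gate}}(\mathbf{x}_n)\right) \odot h_{\mathbf{\Theta}}(\mathbf{x}_n),$$

where the softmax is taken over the nodes of graph $i$ and $h_{\mathrm{gate}}$ maps each node to a single score, i.e., the node-level gating discussed above.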

rusty1s (Member) commented Sep 15, 2022

#5448

rusty1s (Member) commented Sep 15, 2022

Also added support for feature-level gating, see #5449. Thank you!
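For reference, a rough usage sketch of the two modes, assuming the post-#5449 behavior that gate_nn may map either to a single score (node-level gating) or to out_channels scores per node (feature-level gating):

```python
import torch
from torch_geometric.nn.aggr import AttentionalAggregation

in_channels, out_channels = 16, 8
x = torch.randn(6, in_channels)           # 6 nodes in total
index = torch.tensor([0, 0, 0, 1, 1, 1])  # two graphs with 3 nodes each

# Node-level gating: one attention score per node.
node_level = AttentionalAggregation(
    gate_nn=torch.nn.Linear(in_channels, 1),
    nn=torch.nn.Linear(in_channels, out_channels),
)
out = node_level(x, index)                # [2, out_channels]

# Feature-level gating: one attention score per node and channel.
feature_level = AttentionalAggregation(
    gate_nn=torch.nn.Linear(in_channels, out_channels),
    nn=torch.nn.Linear(in_channels, out_channels),
)
out = feature_level(x, index)             # [2, out_channels]
```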
