[Model] Improve GAT models #348
Conversation
def edge_attention(self, edges):
    # an edge UDF to compute unnormalized attention values from src and dst
    a = self.leaky_relu(edges.src['a1'] + edges.dst['a2'])
    a = torch.exp(a).clamp(-10, 10)  # use clamp to avoid overflow
A caveat is that the softmax is technically still not the "correct and numerically stable" softmax. For example, if the pre-softmax scores are [5, 10, 5, 5], the result becomes uniform, since the exponentials are all clipped to a maximum of 10.
We should leave a comment there.
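To make the caveat concrete, here is a small standalone check (not part of the PR) comparing the clamped exponential against the usual max-subtraction softmax:

import torch

scores = torch.tensor([5., 10., 5., 5.])

# current approach: exponentiate, then clamp to [-10, 10]
clamped = torch.exp(scores).clamp(-10, 10)   # exp(5) ~ 148 and exp(10) ~ 22026 both saturate at 10
print(clamped / clamped.sum())               # tensor([0.25, 0.25, 0.25, 0.25]) -- uniform

# numerically stable softmax: subtract the max before exponentiating
shifted = torch.exp(scores - scores.max())
print(shifted / shifted.sum())               # ~tensor([0.0066, 0.9802, 0.0066, 0.0066])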
I see. Is this a reasonable approximation? I found that in the torch implementation
https://github.com/Diego999/pyGAT/blob/3301fd68cbf349925d7458d9d47766e48021cf78/layers.py#L111
they even use torch.exp(-leaky_relu(...)). Not sure whether this is equivalent.
-leaky_relu is actually better (although not perfect due to underflow), since its exponential would always be no greater than 1. The opposite is not, since the result would easily overflow.
But is it still correct? It seems to me that they change the activation function to -leaky_relu.
The value range of the pre-softmax values doesn't matter if we disregard numerical stability and no clipping is involved. So we only need to prevent the exponential from overflowing and underflowing. But clipping it in the current way is bad: the exponential easily goes above 10, so after clipping many edges end up saturated at the same value, and the model will have trouble learning a sharp attention this way. -leaky_relu without clipping doesn't have this issue.
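For illustration only, a sketch of what the pyGAT-style variant would look like in this edge UDF (assuming the same 'a1'/'a2' fields and that the result is returned under the key 'a'):

def edge_attention_neg(self, edges):
    # negate the leaky-relu output before exponentiating; positive pre-activations
    # then map to exponentials <= 1, so no clamp is needed (whether this is the
    # right scoring function is exactly what is being debated here)
    a = self.leaky_relu(edges.src['a1'] + edges.dst['a2'])
    return {'a': torch.exp(-a)}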
Hmm, I think -leaky_relu is just a little better than leaky_relu, because the negative slope (the author used 0.2) is less steep than the positive slope (which is 1.0). It can still go beyond 10, right?
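A quick numeric check of both points (a standalone snippet, not the PR's code):

import torch
import torch.nn.functional as F

x = torch.tensor([-100., -50., 50., 100.])
print(torch.exp(-F.leaky_relu(x, negative_slope=0.2)))  # exp(20), exp(10), exp(-50), exp(-100):
                                                        # can still exceed 10, but only via the 0.2 slope
print(torch.exp(F.leaky_relu(x, negative_slope=0.2)))   # exp(-20), exp(-10), exp(50), exp(100):
                                                        # overflows to inf far more easily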
Oops, I was thinking of relu, my bad. Still, I think it will be hard for the value to explode in the negative case.
The code looks good to me. It seems that the accuracy on pubmed and citeseer hasn't matched the paper's.
If this is because of the clamping in the softmax implementation, I actually wonder how much it would cost to have an extra update_all round just to calculate the max of the pre-softmax attention values?
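A rough sketch of what that extra round could look like with Python UDFs (degree bucketing, no fused max kernel; the field names 's' and 'smax' and the function names are hypothetical):

import torch
import torch.nn.functional as F

def message_score(edges):
    # send the raw pre-softmax score (after leaky-relu) as a message
    return {'s': F.leaky_relu(edges.src['a1'] + edges.dst['a2'], negative_slope=0.2)}

def reduce_max(nodes):
    # per destination node, take the max over incoming scores (degree bucketing)
    return {'smax': nodes.mailbox['s'].max(dim=1)[0]}

def edge_attention_stable(edges):
    s = F.leaky_relu(edges.src['a1'] + edges.dst['a2'], negative_slope=0.2)
    # subtract the destination's max before exponentiating: exp(...) <= 1, no clamp needed
    return {'a': torch.exp(s - edges.dst['smax'])}

# g.update_all(message_score, reduce_max)  # the extra round just for the max
# g.apply_edges(edge_attention_stable)     # then the usual attention computation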
@ylfdq1118 There are two available methods to avoid the overflow.
If my understanding is right, in the first method we compute the max value in a degree-bucketing way, and then we could use SPMV to accelerate the following steps. (Is that why you mentioned adding an extra update_all round?)
According to PyTorch's implementation, they subtract the max before doing the softmax.
You guys are right. We don't have kernel support for max reduction yet. Previously, I tried scipy to calculate the max outside DGL, but that costs extra communication between host and device and may not be doable since edge features are not scalars anymore.
One thing I noticed from the TensorFlow baseline: their loss value is much larger than ours (all the hyperparameters are the same) and the training accuracy is lower than the validation accuracy. Do you guys know why this happens?
Figured out what happens: there is a weird dropout. Adding that and multiple output attention heads gives better results. Updated.
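For reference, one way the multi-head part could be wired up (a hypothetical sketch; the feature/attention dropout is assumed to live inside each head, as in the TF baseline):

import torch
import torch.nn as nn

class MultiHeadGATLayer(nn.Module):
    def __init__(self, make_head, num_heads, merge='cat'):
        super().__init__()
        # make_head() is assumed to build one single-head GAT layer (an nn.Module)
        self.heads = nn.ModuleList([make_head() for _ in range(num_heads)])
        self.merge = merge

    def forward(self, g, h):
        head_outs = [head(g, h) for head in self.heads]
        if self.merge == 'cat':
            return torch.cat(head_outs, dim=1)             # hidden layers: concatenate
        return torch.mean(torch.stack(head_outs), dim=0)   # output layer: average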
Description
A more efficient GAT implementation:
Results
Checklist
@zheng-da @eric-haibin-lin Cannot implement it in MXNet, because MXNet does not support autograd for sparse-matrix matmul. Based on the experience from GCN, if MXNet were to support this, MXNet should be faster.