Why use WordConv (separable convolution) in NL encoder and not the usual Feedforward NN (like original transformer)? #10
Hi,
I was wondering: why use WordConv (separable convolution) in the NL encoder and not the usual feed-forward NN (like the original Transformer)? Is it mainly because separable conv is easier to train? Did you compare to the original Transformer encoder?
Thanks in advance for the paper and code! :)

Comments
Thanks for your attention, and sorry for the late reply. We use word convolution in the NL encoder mainly because it can extract local features from the surrounding words. Separable conv actually has fewer parameters and is easier to train.

"Did you compare to original transformer encoder?"

Zeyu
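As an illustration of the parameter saving, here is a minimal sketch in PyTorch (an assumed setup, not the repository's own code) comparing a depthwise-separable 1-D convolution over word embeddings with a full convolution of the same window size:

```python
import torch
import torch.nn as nn

d_model, k = 256, 3  # hypothetical embedding size and word-window width

# Full 1-D convolution over the sequence: every output channel looks at
# every input channel across the k-word window.
full_conv = nn.Conv1d(d_model, d_model, k, padding=k // 2)

# Depthwise-separable variant: a per-channel (depthwise) conv over the window,
# followed by a 1x1 pointwise conv that mixes channels.
separable_conv = nn.Sequential(
    nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model),
    nn.Conv1d(d_model, d_model, 1),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(full_conv))       # 196864 = 256*256*3 + 256
print(n_params(separable_conv))  #  66816 = (256*3 + 256) + (256*256 + 256)

x = torch.randn(8, d_model, 40)            # (batch, channels, sequence length)
assert separable_conv(x).shape == x.shape  # local window, output length preserved
```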
Thanks for the reply, Zeyu! In the normal Transformer, they use fully connected layers (dense layers) [linking the paper because I double-checked just to make sure I was right: https://arxiv.org/pdf/1706.03762.pdf]. So just to clarify: when you compared your experiments against having a dense layer, the word convolution performed pretty much the same, but the (separable) convolution was easier to train, so you went with that? Thanks for the reply again, and really nice work.
Perhaps this would clarify (sorry for the spam): your decoder has a fully connected (dense) layer but your encoders do not. Why is that? Thanks for your time and attention again!
"having a dense layer the word convolution performed pretty much the same but the (separable) convolution was easier to train so you went with that?"

"Your decoder has a fully connected layer (dense) but your encoders do not. Why is that?"

In the Transformer, the most widely used feed-forward layers are fully connected layers or convolutional layers. In our experiment, we compared TreeGen with a Transformer that uses convolutional layers as the feed-forward layers in the encoder.

Zeyu
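To make that baseline concrete, here is a rough sketch (again PyTorch, with illustrative sizes assumed here, not the exact configuration from the paper) of an encoder feed-forward sublayer where the two dense projections are replaced by 1-D convolutions:

```python
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Feed-forward sublayer built from 1-D convolutions instead of dense layers.
    Sizes are illustrative, not the paper's hyper-parameters."""
    def __init__(self, d_model=256, d_ff=1024, k=3):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_ff, k, padding=k // 2)
        self.conv2 = nn.Conv1d(d_ff, d_model, k, padding=k // 2)
        self.relu = nn.ReLU()

    def forward(self, x):            # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)        # Conv1d expects (batch, channels, seq_len)
        h = self.conv2(self.relu(self.conv1(h)))
        return h.transpose(1, 2)     # back to (batch, seq_len, d_model)
```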
That makes sense, thanks Zeyu! I appreciate your response. One more comment: I was checking the original Transformer, and it seems their feed-forward (FF) layer is a dense layer. I assume so because of the phrase "fully connected feed-forward network". Just a comment, not that it matters for your work if you found convolution works better. I appreciate your time, it was very useful!
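For reference, the position-wise feed-forward network in the original paper is FFN(x) = max(0, xW1 + b1)W2 + b2, i.e. two dense layers with a ReLU in between, applied identically at every position. A minimal sketch with the paper's sizes:

```python
import torch.nn as nn

# Position-wise feed-forward network from "Attention Is All You Need":
# two dense layers with a ReLU, applied independently at each sequence position.
dense_ffn = nn.Sequential(
    nn.Linear(512, 2048),  # d_model = 512, d_ff = 2048 in the original paper
    nn.ReLU(),
    nn.Linear(2048, 512),
)
```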
Yes, the feed-forward layer is a dense layer in the original Transformer. However, using a convolutional layer as the feed-forward layer is also common in various tasks. I'm glad if my answer helps.

Zeyu
Thanks Zeyu, I think I understand better now and appreciate your responses. Thanks for sharing the code and paper, and for kindly answering my questions :)