
Why use WordConv (separable convolution) in the NL encoder and not the usual feed-forward NN (like the original Transformer)? #10

Open
brando90 opened this issue Jun 4, 2021 · 7 comments

Comments

brando90 commented Jun 4, 2021

Hi,

I was wondering: why use WordConv (separable convolution) in the NL encoder and not the usual feed-forward NN (as in the original Transformer)? Is it mainly because separable convolution is easier to train? Did you compare it to the original Transformer encoder?

Thanks in advance for the paper and code! :)

zysszy (Owner) commented Jun 6, 2021

Thanks for your interest, and sorry for the late reply.

We use word convolution in the NL encoder mainly because it can extract local features from the surrounding words.

Separable convolution also has fewer parameters and is easier to train.

"Did you compare it to the original Transformer encoder?"
Separable convolution performs very similarly to the convolutional layer used in the original Transformer encoder.
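For illustration, here is a minimal sketch (assuming PyTorch; the layer sizes and names are made up, and this is not TreeGen's actual code) of a position-wise dense feed-forward network next to a depthwise-separable word convolution, which makes the parameter difference concrete:

```python
import torch.nn as nn

d_model, d_ff, kernel = 256, 1024, 3  # illustrative sizes only

# Position-wise dense feed-forward network, as in the original Transformer.
dense_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

# Depthwise-separable convolution over the word dimension: a per-channel
# (depthwise) conv followed by a 1x1 (pointwise) conv. Conv1d expects input
# shaped (batch, channels, words), i.e. the embedding axis moved to dim 1.
separable_conv = nn.Sequential(
    nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model),
    nn.Conv1d(d_model, d_model, kernel_size=1),
    nn.ReLU(),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(dense_ffn))       # ~526k parameters
print(n_params(separable_conv))  # ~67k parameters
```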

Zeyu

brando90 (Author) commented Jun 7, 2021

Thanks for the reply, Zeyu!

In the standard Transformer, they use fully connected (dense) layers [linking the paper because I double-checked just to make sure I was right: https://arxiv.org/pdf/1706.03762.pdf]. So just to clarify: when you compared against a dense layer, the word convolution performed about the same, but the (separable) convolution was easier to train, so you went with that?

Thanks again for the reply, and really nice work.

brando90 (Author) commented Jun 7, 2021

Perhaps this will clarify what I mean (sorry for the spam): your decoder has a fully connected (dense) layer, but your encoders do not. Why is that?

Thanks for your time and attention again!

zysszy (Owner) commented Jun 8, 2021

In the Transformer, the most widely used feed-forward layers are fully connected layers or convolutional layers. In our experiments, we compared TreeGen with a Transformer that uses convolutional layers as the feed-forward layers in the encoder.

"When you compared against a dense layer, the word convolution performed about the same, but the (separable) convolution was easier to train, so you went with that?"
Yes. Convolutional layers perform much better than dense layers but similarly to separable convolutional layers. Separable convolutional layers have fewer parameters, which makes them easier to train.
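For a rough sense of the parameter gap (illustrative numbers only, not TreeGen's actual configuration): with model width d = 512, a kernel-3 convolution mapping d to d has 3 * 512 * 512 ≈ 0.79M weights, its depthwise-separable counterpart needs only 3 * 512 + 512 * 512 ≈ 0.26M, and the dense 512 → 2048 → 512 feed-forward network of the base Transformer has about 2.1M.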

"Your decoder has a fully connected layer (dense) but your encoders do not. Why is that?"
Convolutional layers can capture local features from the word before and after. However, in the decoder, we should ensure that a rule cannot use the features extracted from its next rule. For example, for a ground-truth rule sequence "1 2 5", when generating rule 2, we should ensure that it cannot use the features of rule 5. This is the reason why we don't use convolutional layers in the decoder.
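To make that leakage concrete, here is a tiny sketch (assuming PyTorch; the sizes are made up and this is not TreeGen's code). A symmetric kernel-3 convolution lets position t read position t+1, while a left-padded (causal) variant does not:

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 8, 5)   # (batch, channels, rule positions)
x[0, :, 4] = 1.0           # signal only at the last rule (position 4)

# Symmetric padding: the output at position 3 is computed from positions 2, 3, 4.
sym_conv = nn.Conv1d(8, 8, kernel_size=3, padding=1, bias=False)
print(sym_conv(x)[0, :, 3].abs().sum())        # nonzero: position 3 sees the future

# Causal alternative: pad only on the left, so position t uses t-2, t-1, t.
causal_conv = nn.Conv1d(8, 8, kernel_size=3, padding=0, bias=False)
print(causal_conv(nn.functional.pad(x, (2, 0)))[0, :, 3].abs().sum())  # exactly zero
```

TreeGen simply avoids convolution in the decoder rather than relying on such a causal variant; the sketch only shows what would otherwise have to be prevented.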

Zeyu

brando90 (Author) commented Jun 8, 2021

That makes sense, thanks Zeyu! I appreciate your response.

One more comment: I was checking the original Transformer paper, and it seems their feed-forward (FF) layer is a dense layer:

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

I assume so because of the phrase "fully connected feed-forward network". Just a comment; it doesn't really matter for your work if you found convolution works better.

I appreciate your time; it was very useful!

zysszy (Owner) commented Jun 9, 2021

Yes, the feed-forward layer is a dense layer in the original Transformer. However, using a convolutional layer as the feed-forward layer is also common in a variety of tasks.

I'm glad if my answers are helpful to you.

Zeyu

brando90 (Author)

Thanks Zeyu, I think I understand better now and appreciate your responses.

Thanks for sharing the code and paper, and for kindly answering my questions :)
