
Why use WordConv (separable convolution) in the NL encoder and not the usual feed-forward NN (like the original Transformer)? #10

Open
brando90 opened this issue Jun 4, 2021 · 7 comments

Comments

brando90 commented Jun 4, 2021

Hi,

I was wondering: why use WordConv (separable convolution) in the NL encoder and not the usual feed-forward NN (as in the original Transformer)? Is it mainly because separable convolution is easier to train? Did you compare it to the original Transformer encoder?

Thanks in advance for the paper and code! :)

zysszy (Owner) commented Jun 6, 2021

Thanks for your interest, and sorry for the late reply.

We use word convolution in the NL encoder mainly because it can extract local features from the surrounding words.

Separable convolution also has fewer parameters and is easier to train.

"Did you compare it to the original Transformer encoder?"
Separable convolution performs very similarly to the convolutional layer used in the original Transformer encoder.
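For illustration, here is a minimal sketch (assuming PyTorch; the layer sizes and names are made up, and this is not TreeGen's actual code) of a position-wise dense feed-forward network next to a depthwise-separable word convolution, which makes the parameter difference concrete:

```python
import torch.nn as nn

d_model, d_ff, kernel = 256, 1024, 3  # illustrative sizes only

# Position-wise dense feed-forward network, as in the original Transformer.
dense_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

# Depthwise-separable convolution over the word dimension: a per-channel
# (depthwise) conv followed by a 1x1 (pointwise) conv. Conv1d expects input
# shaped (batch, channels, words), i.e. the embedding axis moved to dim 1.
separable_conv = nn.Sequential(
    nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model),
    nn.Conv1d(d_model, d_model, kernel_size=1),
    nn.ReLU(),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(dense_ffn))       # ~526k parameters
print(n_params(separable_conv))  # ~67k parameters
```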

Zeyu

brando90 (Author) commented Jun 7, 2021

Thanks for the reply, Zeyu!

In the standard Transformer, they use fully connected (dense) layers [linking the paper because I double-checked just to make sure I was right: https://arxiv.org/pdf/1706.03762.pdf]. So just to clarify: when you compared against a dense layer, the word convolution performed about the same, but the (separable) convolution was easier to train, so you went with that?

Thanks again for the reply, and really nice work.

brando90 (Author) commented Jun 7, 2021

Perhaps this will clarify what I mean (sorry for the spam): your decoder has a fully connected (dense) layer, but your encoders do not. Why is that?

Thanks for your time and attention again!

zysszy (Owner) commented Jun 8, 2021

In the Transformer, the most widely used feed-forward layers are fully connected layers or convolutional layers. In our experiments, we compared TreeGen with a Transformer that uses convolutional layers as the feed-forward layers in the encoder.

"When you compared against a dense layer, the word convolution performed about the same, but the (separable) convolution was easier to train, so you went with that?"
Yes. Convolutional layers perform much better than dense layers but similarly to separable convolutional layers. Separable convolutional layers have fewer parameters, which makes them easier to train.
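For a rough sense of the parameter gap (illustrative numbers only, not TreeGen's actual configuration): with model width d = 512, a kernel-3 convolution mapping d to d has 3 * 512 * 512 ≈ 0.79M weights, its depthwise-separable counterpart needs only 3 * 512 + 512 * 512 ≈ 0.26M, and the dense 512 → 2048 → 512 feed-forward network of the base Transformer has about 2.1M.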

"Your decoder has a fully connected layer (dense) but your encoders do not. Why is that?"
Convolutional layers can capture local features from the word before and after. However, in the decoder, we should ensure that a rule cannot use the features extracted from its next rule. For example, for a ground-truth rule sequence "1 2 5", when generating rule 2, we should ensure that it cannot use the features of rule 5. This is the reason why we don't use convolutional layers in the decoder.
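To make that leakage concrete, here is a tiny sketch (assuming PyTorch; the sizes are made up and this is not TreeGen's code). A symmetric kernel-3 convolution lets position t read position t+1, while a left-padded (causal) variant does not:

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 8, 5)   # (batch, channels, rule positions)
x[0, :, 4] = 1.0           # signal only at the last rule (position 4)

# Symmetric padding: the output at position 3 is computed from positions 2, 3, 4.
sym_conv = nn.Conv1d(8, 8, kernel_size=3, padding=1, bias=False)
print(sym_conv(x)[0, :, 3].abs().sum())        # nonzero: position 3 sees the future

# Causal alternative: pad only on the left, so position t uses t-2, t-1, t.
causal_conv = nn.Conv1d(8, 8, kernel_size=3, padding=0, bias=False)
print(causal_conv(nn.functional.pad(x, (2, 0)))[0, :, 3].abs().sum())  # exactly zero
```

TreeGen simply avoids convolution in the decoder rather than relying on such a causal variant; the sketch only shows what would otherwise have to be prevented.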

Zeyu

brando90 (Author) commented Jun 8, 2021

That makes sense, thanks Zeyu! I appreciate your response.

One more comment: I was checking the original Transformer paper, and it seems their feed-forward (FF) layer is a dense layer:

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

I assume so because of the phrase "fully connected feed-forward network". Just a comment; it doesn't really matter for your work if you found convolution works better.

I appreciate your time; it was very useful!

zysszy (Owner) commented Jun 9, 2021

Yes, the feed-forward layer is a dense layer in the original Transformer. However, using a convolutional layer as the feed-forward layer is also common in a variety of tasks.

I'm glad if my answers are helpful to you.

Zeyu

brando90 (Author)

Thanks Zeyu, I think I understand better now and appreciate your responses.

Thanks for sharing the code and paper, and for kindly answering my questions :)
