Make a clear plan to support Transformer on Fluid. #6876
@lcy-seso Thanks so much for the writeup. This is very helpful. As discussed in the Hi group, let's pick the items one by one, discuss each of them, and figure out a plan.
The Transformer also follows the encoder-decoder architecture. The encoder and decoder are stacks of many identical modules. The computations/operators each part requires:
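For context, one encoder layer is roughly the following composition (a hedged, PyTorch-style sketch; the callables are illustrative placeholders, not existing Fluid operators). The decoder layer additionally inserts an encoder-decoder attention sub-layer between these two.

def encoder_layer(x, self_attn, feed_forward, layer_norm, dropout):
    # sub-layer 1: multi-head self-attention (queries, keys, and values all come
    # from x), followed by a residual connection and layer normalization
    x = layer_norm(x + dropout(self_attn(x, x, x)))
    # sub-layer 2: position-wise feed-forward network, again followed by a
    # residual connection and layer normalization
    x = layer_norm(x + dropout(feed_forward(x)))
    return x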
Because the Transformer does not depend on any recurrence or convolutions, there are not many operators that we have not implemented yet. I think the only one is layer normalization. But one difficulty, I think, is how to implement self-attention efficiently. Theoretically, this step can be highly parallelized, and I guess we can make use of that. The architecture of the Transformer is quite simple. I guess there will be many tricks to tune it once we can successfully run the model.
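To make the missing operator concrete, layer normalization itself is a small computation over the feature dimension. A minimal sketch of the forward pass (PyTorch here only for illustration; the Fluid operator would also need the backward pass and learnable scale/shift parameters):

import torch

def layer_norm(x, gain, bias, eps=1e-6):
    # normalize every position over the last (feature) dimension,
    # then apply a learned elementwise gain and bias
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True)
    return gain * (x - mean) / (std + eps) + bias

x = torch.randn(2, 5, 8)                          # (batch, seq_len, d_model)
y = layer_norm(x, torch.ones(8), torch.zeros(8))  # same shape as x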
I add a brief TODO list first for our discussion. We cannot move directly into the Transformer; the basic functional requirements it depends on also need to be tested and debugged first. The core idea of the Transformer (high parallelization, dispensing with recurrence and convolution) is also used in ConvS2S. The list still needs priorities and a time schedule. The items can be done in parallel.
The above steps help us debug the framework and guarantee that the basic functional requirements are ready.
Not a top priority right now.
The following picture is from the Google blog, and it demonstrates well how the Transformer works. A nice PPT about the Transformer: https://nlp.stanford.edu/seminar/details/lkaiser.pdf
Currently, I have surveyed some implementations of the Transformer.
We still need a better design for the Transformer. @pkuyym Since you have already surveyed some implementations of the Transformer, I have one question: how do they implement self-attention in a way that is both time- and memory-efficient? Does it need some special operator, such as "broadcast"? Is self-attention built from elementary operators, or can we just write a very specific operator for it?
@lcy-seso Yes, I only surveyed the PyTorch version carefully to check the details of the Transformer. The relevant part of its multi-head attention forward pass looks like this:
q_s = q.repeat(n_head, 1, 1).view(n_head, -1, d_model) # n_head x (mb_size*len_q) x d_model
k_s = k.repeat(n_head, 1, 1).view(n_head, -1, d_model) # n_head x (mb_size*len_k) x d_model
v_s = v.repeat(n_head, 1, 1).view(n_head, -1, d_model) # n_head x (mb_size*len_v) x d_model
# treat the result as a (n_head * mb_size) size batch
q_s = torch.bmm(q_s, self.w_qs).view(-1, len_q, d_k) # (n_head*mb_size) x len_q x d_k
k_s = torch.bmm(k_s, self.w_ks).view(-1, len_k, d_k) # (n_head*mb_size) x len_k x d_k
v_s = torch.bmm(v_s, self.w_vs).view(-1, len_v, d_v) # (n_head*mb_size) x len_v x d_v
outputs, attns = self.attention(q_s, k_s, v_s, attn_mask=attn_mask.repeat(n_head, 1, 1))
The implementation of the attention itself (the self.attention call above) is:
attn = torch.bmm(q, k.transpose(1, 2)) / self.temper
if attn_mask is not None:
assert attn_mask.size() == attn.size(), \
'Attention mask shape {} mismatch ' \
'with Attention logit tensor shape ' \
'{}.'.format(attn_mask.size(), attn.size())
attn.data.masked_fill_(attn_mask, -float('inf'))
attn = self.softmax(attn)
attn = self.dropout(attn)
output = torch.bmm(attn, v)
As we can see, the key operation is torch.bmm (batched matrix multiplication).
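For readers unfamiliar with it, torch.bmm multiplies two batches of matrices in a single call, which is what lets every head and every sentence in the mini-batch be processed together. A tiny shape illustration (the sizes here are made up):

import torch

q = torch.randn(16, 10, 64)               # (n_head * mb_size, len_q, d_k)
k = torch.randn(16, 12, 64)               # (n_head * mb_size, len_k, d_k)
scores = torch.bmm(q, k.transpose(1, 2))  # (16, 10, 12): one score matrix per head per sentence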
I think I understand the point. Both ConvS2S and the Transformer use "dot-product attention", and this kind of attention is special: for a single sequence, the "dot product" step can be implemented with an outer-product-style matrix multiplication, so it is highly efficient. But when it comes to batch computation over variable-length sequences, it cannot be directly implemented that way. Whether we need padding, I will think more about this. ConvS2S has the same problem.
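To make the batching issue concrete, below is a hedged sketch (modern PyTorch, not the repository code quoted above) of batched dot-product self-attention over padded, variable-length sequences; a mask built from the true lengths keeps padded positions from receiving any attention weight. For simplicity the queries, keys, and values are all x itself, with no learned projections.

import torch

def padded_self_attention(x, lengths):
    # x: (batch, max_len, d_model); lengths: (batch,) true length of each sequence
    scores = torch.bmm(x, x.transpose(1, 2)) / (x.size(-1) ** 0.5)  # (batch, max_len, max_len)
    pos = torch.arange(x.size(1)).unsqueeze(0)   # (1, max_len)
    pad = pos >= lengths.unsqueeze(1)            # (batch, max_len), True at padded positions
    # mask padded keys so they get zero weight after the softmax
    scores = scores.masked_fill(pad.unsqueeze(1), float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return torch.bmm(weights, x)

x = torch.randn(2, 6, 8)                         # two sequences padded to length 6
out = padded_self_attention(x, torch.tensor([4, 6]))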
Thanks to @guoshengCS, I think I now better understand how to batch-compute the self-attention and the attention in ConvS2S.
It seems that the implementation based on
The padding cannot be removed from the Transformer and ConvS2S.
I will close this issue for now. We can discuss any problems we find in a new issue, or reopen this one if needed. Thanks, everyone.
We are going to support popular NMT models on Fluid, including but not limited to RNN search, ConvS2S, and Transformer.
I think the first important thing for us is to understand and figure out the problems.
We choose Google's Transformer as our starting point. Here I list some questions that should be answered:
A tensor2tensor implementation: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
About the Transformer:
About Fluid (This part is relatively open currently)
At the end of this step, we will share our notes with everyone, both about the Transformer and about Fluid, and we can try to make them part of the documentation. All of us should:
Related issue: #6821