pytorch-transformers returns output of 13 layers? #1332

Closed
3 tasks done
BramVanroy opened this issue Sep 25, 2019 · 9 comments · Fixed by #1346
Comments

@BramVanroy
Collaborator

BramVanroy commented Sep 25, 2019

📚 Migration

Model I am using (Bert, XLNet....): BertModel

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • my own modified scripts: (give details)

The task I am working on is:

  • my own task or dataset: (give details)

Details of the issue:

I am using pytorch-transformers for the rather unconventional task of regression (one output). In my research I use BERT and I'm planning to try out the other transformers as well. When I started, I got good results with pytorch-pretrained-bert. However, running the same code with pytorch-transformers gives me results that are a lot worse.

In the original code, I use the output of the model and concatenate the last four layers, as was proposed in the BERT paper. The architecture that I used looks like this:

from pytorch_pretrained_bert.modeling import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased')
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        all_bert_layers, _ = self.bert_model(bert_ids, attention_mask=bert_mask)
        print('hidden_states', len(all_bert_layers))
        # concat last four layers
        out = torch.cat(tuple([all_bert_layers[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())

        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())

        # First item ['CLS'] is sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())

        out = self.pre_classifier(out)
        print('pre_classifier', out.size())

        out = self.dropout(out)
        print('dropout', out.size())

        out = self.classifier(out)
        print('classifier', out.size())

        return out

When porting this to pytorch-transformers, the main change was that we now get a tuple back from the model and have to explicitly ask for all hidden states. As such, the converted code looks like this:

from pytorch_transformers import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
        hidden_states = out[2]
        print('hidden_states', len(hidden_states))

        out = torch.cat(tuple([hidden_states[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())

        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())

        # First item ['CLS'] is sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())

        out = self.pre_classifier(out)
        print('pre_classifier', out.size())

        out = self.dropout(out)
        print('dropout', out.size())

        out = self.classifier(out)
        print('classifier', out.size())

        return out

As I said before, this leads to very different results. Seeding cannot be the issue, since I set all seeds manually in both cases, like this:

import os
import random

import numpy as np
import torch


def set_seed():
    torch.manual_seed(3)
    torch.cuda.manual_seed_all(3)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(3)
    random.seed(3)
    os.environ['PYTHONHASHSEED'] = str(3)

I have added the print statements as a form of debugging, and I quickly found that there is a fundamental difference between the two architectures: the hidden_states print statement yields 12 for pytorch-pretrained-bert but 13 for pytorch-transformers! I am not sure how that relates to the worse results, but I assume this is a good place to start looking.

I have tried comparing the created models, but in both cases the encoder consists of 12 layers, so I am not sure why pytorch-transformers returns 13. What is the extra one?

Going through the source code, it seems that the first hidden_state (= last hidden_state from the embeddings) is included. Is that true?

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L340-L352

Even so, since the embeddings would be the first item in all_hidden_states, the last four layers should still be the same. Therefore, I am not sure why there is such a big difference between the results of the two versions above. If you spot any faults, please advise.
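To illustrate my reasoning with a toy example (these are just labels, not the actual tensors): prepending one extra element only affects index 0, so negative indices -1 to -4 still select the last four encoder layers.

twelve = ['layer_{}'.format(i) for i in range(1, 13)]   # pytorch-pretrained-bert: 12 states
thirteen = ['embeddings'] + twelve                      # pytorch-transformers: 13 states
print([twelve[i] for i in [-1, -2, -3, -4]])            # ['layer_12', 'layer_11', 'layer_10', 'layer_9']
print([thirteen[i] for i in [-1, -2, -3, -4]])          # the same four layers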

Environment

  • OS: Win 10
  • Python version: 3.7
  • PyTorch version: 1.2
  • PyTorch Transformers version (or branch):
  • Using GPU? Yes, CUDA 10
  • Distributed or parallel setup? No

Checklist

  • I have read the migration guide in the readme.
@kexinhuang12345

I am looking at this too and I believe (might be wrong) that the embedding layer sits in the last position. So I guess you should do [-2:-5]

@BramVanroy
Collaborator Author

Hm, I don't think so. The embedding state is passed to the forward function, and that state is used to initialize the all_hidden_states variable. Then you iterate over all layers and append to the tuple sequentially.

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L337-L359

@thomwolf
Member

Hi Bram,

Please read the details of BertModel's outputs in the docstring or the doc here: https://huggingface.co/pytorch-transformers/model_doc/bert.html#pytorch_transformers.BertModel

The first element of the output tuple of Bert is always the last hidden-state and the full list of hidden-states is the last element of the output tuple in your case.

These lines:

out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]

should be changed to:

model_outputs = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = model_outputs[-1]
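Equivalently, a minimal sketch (same variable names as your snippet, assuming output_hidden_states=True was passed to from_pretrained and attentions are not requested) that unpacks the full tuple explicitly:

# With output_hidden_states=True, BertModel returns
# (last_hidden_state, pooler_output, hidden_states)
last_hidden_state, pooler_output, hidden_states = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
print(len(hidden_states))  # 13: the embedding output plus the 12 encoder layers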

@BramVanroy
Collaborator Author

Hi Thomas, thank you for your time.

Apparently a mistake crept into my comment on GitHub. In my code, I do have the correct version, i.e.

out = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]

The question I have is: when you print the length of those hidden states, you get different numbers.

print(len(hidden_states))
# 13 for pytorch_transformers, 12 for pytorch_pretrained_bert

Going through the source code, it seems that the input hidden state (final hidden state of the embeddings) is included when using pytorch_transformers, but not for pytorch_pretrained_bert.

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L337-L352

I couldn't find this documented anywhere, but I am curious about the reasoning behind it: since the embedding state is not an encoder state, it might not be what one expects to get back from the model. On the other hand, it does make it easy for users to get the embeddings.
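For completeness, here is the small standalone check I would run (assuming bert-base-uncased; variable names are my own) to see where the extra element sits:

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Hello world")])
with torch.no_grad():
    last_hidden_state, pooler_output, hidden_states = model(input_ids)

print(len(hidden_states))                                 # 13
print(torch.equal(hidden_states[-1], last_hidden_state))  # True: the last element is the final encoder layer
print(hidden_states[0].shape)                             # same shape as the others: the embedding output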

@thomwolf
Member

Hi Bram,
It's written in the link to the doc that I've sent you above and also in the docstring of the model:
[screenshot of the BertModel docstring describing the hidden_states output]
I'll see if I can find a way to make it more visible.

There are a few reasons we did that. One is this great paper by Tenney et al. (http://arxiv.org/abs/1905.05950), which uses the output of the embeddings as well as the hidden states to study BERT's performance. Another is to have easy access to the embeddings, as you mention.

@kexinhuang12345

# Add last layer
if self.output_hidden_states:
    all_hidden_states = all_hidden_states + (hidden_states,)

But on lines 350-352, it adds the "hidden states" (last layer of embedding) to the "all_hidden_states", so the last item is the embedding output.

@BramVanroy
Collaborator Author

No, by that time the initial hidden_states variable has already been reassigned in the for loop. So at each step hidden_states is:

  • on entering the function: the embedding output
  • on each iteration of the loop: `hidden_states = layer_outputs[0]`

Perhaps the not-so-intuitive part is that hidden_states is appended to all_hidden_states as the first thing in the loop. That means that at the end of the first iteration, all_hidden_states consists only of the embeddings, and at the end of the last iteration it does not yet contain the last hidden state (because appending happens before getting the layer_outputs). Therefore, the hidden states of the last layer (iteration) still have to be added manually, on the lines that you mentioned.
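A toy sketch of that append-before-update pattern (this mimics the control flow only, it is not the library code):

all_hidden_states = ()
hidden_states = 'embedding_output'
for i in range(12):                                            # 12 encoder layers
    all_hidden_states = all_hidden_states + (hidden_states,)   # append the *previous* state first
    hidden_states = 'layer_{}_output'.format(i + 1)            # then compute the layer output
all_hidden_states = all_hidden_states + (hidden_states,)       # "Add last layer"
print(len(all_hidden_states))   # 13
print(all_hidden_states[0])     # embedding_output
print(all_hidden_states[-1])    # layer_12_output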

@kexinhuang12345

You are right, thanks for the clarification!

@BramVanroy
Collaborator Author

BramVanroy commented Sep 26, 2019

@thomwolf Thanks for the clarification. I was looking in all the wrong places, it appears. In particular, I had expected this in the migration section of the README. If you want, I can do a small doc pull request for that.

Re-opened. Will close after doc change if requested.
