pytorch-transformers returns output of 13 layers? #1332

Closed
3 tasks done
BramVanroy opened this issue Sep 25, 2019 · 9 comments · Fixed by #1346
Comments

@BramVanroy
Collaborator

BramVanroy commented Sep 25, 2019

📚 Migration

Model I am using (Bert, XLNet....): BertModel

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • my own modified scripts: (give details)

The task I am working on is:

  • my own task or dataset: (give details)

Details of the issue:

I am using pytorch-transformers for the rather unconventional task of regression (one output). In my research I use BERT and I'm planning to try out the other transformers as well. When I started, I got good results with pytorch-pretrained-bert. However, running the same code with pytorch-transformers gives me results that are a lot worse.

In the original code, I use the output of the model and concatenate the last four layers, as was proposed in the BERT paper. The architecture that I used looks like this:

from pytorch_pretrained_bert.modeling import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased')
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        all_bert_layers, _ = self.bert_model(bert_ids, attention_mask=bert_mask)
        print('hidden_states', len(all_bert_layers))
        # concat last four layers
        out = torch.cat(tuple([all_bert_layers[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())

        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())

        # First item ['CLS'] is sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())

        out = self.pre_classifier(out)
        print('pre_classifier', out.size())

        out = self.dropout(out)
        print('dropout', out.size())

        out = self.classifier(out)
        print('classifier', out.size())

        return out

When porting this to pytorch-transformers, the main change was that we now get a tuple back from the model and have to explicitly ask for all hidden states. As such, the converted code looks like this:

from pytorch_transformers import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
        hidden_states = out[2]
        print('hidden_states', len(hidden_states))

        out = torch.cat(tuple([hidden_states[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())

        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())

        # First item ['CLS'] is sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())

        out = self.pre_classifier(out)
        print('pre_classifier', out.size())

        out = self.dropout(out)
        print('dropout', out.size())

        out = self.classifier(out)
        print('classifier', out.size())

        return out

As I said before, this leads to very different results. Seeding cannot be the issue, since I set all seeds manually in both cases, like this:

import os
import random

import numpy as np
import torch


def set_seed():
    torch.manual_seed(3)
    torch.cuda.manual_seed_all(3)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(3)
    random.seed(3)
    os.environ['PYTHONHASHSEED'] = str(3)

I have added the print statements as a form of debugging, and I quickly found that there is a fundamental difference between the two architectures: the hidden_states print statement yields 12 for pytorch-pretrained-bert but 13 for pytorch-transformers! I am not sure how that relates to the worse results, but I assume this is a good place to start looking.

I have tried comparing the created models, but in both cases the encoder consists of 12 layers, so I am not sure why pytorch-transformers returns 13. What is the extra one?

Going through the source code, it seems that the first hidden_state (= last hidden_state from the embeddings) is included. Is that true?

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L340-L352

Even so, since the embeddings would be the first item in all_hidden_states, the last four layers should still be the same. Therefore, I am not sure why there is such a big difference between the results of the two versions above. If you spot any faults, please advise.
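To illustrate my reasoning with a toy example (these are just labels, not the actual tensors): prepending one extra element only affects index 0, so negative indices -1 to -4 still select the last four encoder layers.

twelve = ['layer_{}'.format(i) for i in range(1, 13)]   # pytorch-pretrained-bert: 12 states
thirteen = ['embeddings'] + twelve                      # pytorch-transformers: 13 states
print([twelve[i] for i in [-1, -2, -3, -4]])            # ['layer_12', 'layer_11', 'layer_10', 'layer_9']
print([thirteen[i] for i in [-1, -2, -3, -4]])          # the same four layers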

Environment

  • OS: Win 10
  • Python version: 3.7
  • PyTorch version: 1.2
  • PyTorch Transformers version (or branch):
  • Using GPU? Yes, CUDA 10
  • Distributed or parallel setup? No

Checklist

  • I have read the migration guide in the readme.
@kexinhuang12345

I am looking at this too and I believe (might be wrong) that the embedding layer sits in the last position. So I guess you should do [-2:-5]

@BramVanroy
Collaborator Author

Hm, I don't think so. The embedding state is passed to the forward function, and that state is used to initialize the all_hidden_states variable. Then you iterate over all layers and append to the tuple sequentially.

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L337-L359

@thomwolf
Member

Hi Bram,

Please read the details of BertModel's outputs in the docstring or the doc here: https://huggingface.co/pytorch-transformers/model_doc/bert.html#pytorch_transformers.BertModel

The first element of the output tuple of Bert is always the last hidden-state and the full list of hidden-states is the last element of the output tuple in your case.

These lines:

out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]

should be changed to:

model_outputs = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = model_outputs[-1]
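Equivalently, a minimal sketch (same variable names as your snippet, assuming output_hidden_states=True was passed to from_pretrained and attentions are not requested) that unpacks the full tuple explicitly:

# With output_hidden_states=True, BertModel returns
# (last_hidden_state, pooler_output, hidden_states)
last_hidden_state, pooler_output, hidden_states = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
print(len(hidden_states))  # 13: the embedding output plus the 12 encoder layers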

@BramVanroy
Collaborator Author

Hi Thomas, thank you for your time.

Apparently a mistake crept into my comment on GitHub. In my code, I do have the correct version, i.e.

out = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]

The question I have is: when you print the length of those hidden states, you get different numbers.

print(len(hidden_states))
# 13 for pytorch_transformers, 12 for pytorch_pretrained_bert

Going through the source code, it seems that the input hidden state (final hidden state of the embeddings) is included when using pytorch_transformers, but not for pytorch_pretrained_bert.

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L337-L352

I couldn't find this documented anywhere, but I am curious about the reasoning behind it: since the embedding state is not an encoder state, it might not be what one expects to get back from the model. On the other hand, it does make it easy for users to get the embeddings.
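For completeness, here is the small standalone check I would run (assuming bert-base-uncased; variable names are my own) to see where the extra element sits:

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

input_ids = torch.tensor([tokenizer.encode("Hello world")])
with torch.no_grad():
    last_hidden_state, pooler_output, hidden_states = model(input_ids)

print(len(hidden_states))                                 # 13
print(torch.equal(hidden_states[-1], last_hidden_state))  # True: the last element is the final encoder layer
print(hidden_states[0].shape)                             # same shape as the others: the embedding output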

@thomwolf
Member

Hi Bram,
It's written in the link to the doc that I've sent you above and also in the docstring of the model:
[screenshot of the BertModel docstring describing the hidden_states output]
I'll see if I can find a way to make it more visible.

There are a few reasons we did that. One is this great paper by Tenney et al. (http://arxiv.org/abs/1905.05950), which uses the output of the embeddings as well as the hidden states to study BERT's performance. Another is to have easy access to the embeddings, as you mention.

@kexinhuang12345

# Add last layer
if self.output_hidden_states:
    all_hidden_states = all_hidden_states + (hidden_states,)

But on lines 350-352, it adds the "hidden states" (last layer of embedding) to the "all_hidden_states", so the last item is the embedding output.

@BramVanroy
Collaborator Author

No, by that time the initial hidden_states variable has already been reassigned in the for loop. So at each step hidden_states is:

  • on entering the function: the embedding output
  • on each iteration of the loop: `hidden_states = layer_outputs[0]`

Perhaps the not-so-intuitive part is that hidden_states is appended to all_hidden_states as the first thing in the loop. That means that at the end of the first iteration, all_hidden_states consists only of the embeddings, and at the end of the last iteration it does not yet contain the last hidden state (because appending happens before getting the layer_outputs). Therefore, the hidden states of the last layer (iteration) still have to be added manually, on the lines that you mentioned.
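A toy sketch of that append-before-update pattern (this mimics the control flow only, it is not the library code):

all_hidden_states = ()
hidden_states = 'embedding_output'
for i in range(12):                                            # 12 encoder layers
    all_hidden_states = all_hidden_states + (hidden_states,)   # append the *previous* state first
    hidden_states = 'layer_{}_output'.format(i + 1)            # then compute the layer output
all_hidden_states = all_hidden_states + (hidden_states,)       # "Add last layer"
print(len(all_hidden_states))   # 13
print(all_hidden_states[0])     # embedding_output
print(all_hidden_states[-1])    # layer_12_output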

@kexinhuang12345

You are right, thanks for the clarification!

@BramVanroy
Collaborator Author

BramVanroy commented Sep 26, 2019

@thomwolf Thanks for the clarification. I was looking in all the wrong places, it appears. In particular, I had expected this in the migration section of the README. If you want, I can do a small doc pull request for that.

Re-opened. Will close after doc change if requested.
