[PaddlePaddle Hackathon] Task 51 (PaddlePaddle#1115)
* add bert japanese

* fix model-weight files position

* add weights files url

* create package: bert_japanese

* update weights readme

* update weights files

* update config pretrain weights https

* fix weight config files

* retest CI

* update

* update

* fix docstring

* update

* update pretrained weights

* update weights readme

* remove weights url in codes

* update...

* update...

* update weights readme

* update

* update

* update docstring

* clean up redundant code

Co-authored-by: yingyibiao <yyb0576@163.com>
iverxin and yingyibiao authored Oct 28, 2021
1 parent 20acd16 commit 48e58f0
Showing 15 changed files with 785 additions and 7 deletions.
64 changes: 64 additions & 0 deletions community/iverxin/bert-base-japanese-char-whole-word-masking/README.md
@@ -0,0 +1,64 @@


# BERT base Japanese (character tokenization, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.
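
For reference, here is a minimal sketch of the corresponding configuration as a Python dict. The field names are assumptions following standard BERT config keys; only the values stated in this card (12 layers, 768 hidden dimensions, 12 heads, a 4000-entry vocabulary, 512-token training instances) come from the text above.

```python
# Hypothetical configuration sketch; field names assume standard BERT config keys.
bert_base_japanese_char_wwm_config = {
    "vocab_size": 4000,              # character-level vocabulary (see Tokenization)
    "hidden_size": 768,              # dimension of hidden states
    "num_hidden_layers": 12,         # transformer encoder layers
    "num_attention_heads": 12,       # attention heads per layer
    "intermediate_size": 3072,       # assumed feed-forward size for BERT base
    "max_position_embeddings": 512,  # matches the 512-token training instances
}
```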

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized with the [MeCab](https://taku910.github.io/mecab/) morphological parser using the IPA dictionary and then split into characters.

The vocabulary size is 4000.
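
To make the two-stage tokenization concrete, here is a small illustrative sketch (not the library implementation); the word list below simply stands in for MeCab output so the example runs without the IPA dictionary installed.

```python
def char_tokenize(words, vocab, unk_token="[UNK]"):
    """Split word-level tokens into characters, mapping out-of-vocab characters to [UNK]."""
    tokens = []
    for word in words:
        tokens.extend(ch if ch in vocab else unk_token for ch in word)
    return tokens

# Pretend MeCab already produced these word-level tokens for "こんにちは世界".
mecab_words = ["こんにちは", "世界"]
toy_vocab = {"こ", "ん", "に", "ち", "は", "世", "界"}
print(char_tokenize(mecab_words, toy_vocab))
# ['こ', 'ん', 'に', 'ち', 'は', '世', '界']
```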

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the MLM (masked language modeling) objective, we introduced **whole word masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0 license](https://creativecommons.org/licenses/by-sa/3.0/).

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [1, 5, 32000]
```
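
Building on the snippet above, here is a short sketch of using the MLM head to fill in a masked position; it assumes the tokenizer exposes the standard `convert_tokens_to_ids`/`convert_ids_to_tokens` helpers and the `[MASK]` token.

```python
# Continue from the example above: mask one position and predict it with the MLM head.
token_ids = tokenizer(text1)["input_ids"]                  # plain Python list of ids
token_ids[2] = tokenizer.convert_tokens_to_ids("[MASK]")   # mask the token at index 2

with paddle.no_grad():
    logits = model(input_ids=paddle.to_tensor([token_ids]))

predicted_id = int(paddle.argmax(logits[0, 2]))
print(tokenizer.convert_ids_to_tokens([predicted_id]))     # most likely replacement token
```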

## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese-char-whole-word-masking/files.json
@@ -0,0 +1,6 @@
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/vocab.txt"
}
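
The entries above are the files needed to load this community checkpoint. As a rough sketch (not the loader's actual logic), they could also be fetched manually with `get_path_from_url`, the same helper used by the conversion script later in this commit; the local cache directory below is only an example.

```python
from paddle.utils.download import get_path_from_url

# Example: fetch the files listed in files.json into a local cache directory.
base = ("https://paddlenlp.bj.bcebos.com/models/transformers/community/"
        "iverxin/bert-base-japanese-char-whole-word-masking/")
for filename in ["model_config.json", "model_state.pdparams", "vocab.txt"]:
    local_path = get_path_from_url(base + filename, "./bert-base-japanese-char-wwm")
    print(local_path)
```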
60 changes: 60 additions & 0 deletions community/iverxin/bert-base-japanese-char/README.md
@@ -0,0 +1,60 @@


# BERT base Japanese (character tokenization)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized with the [MeCab](https://taku910.github.io/mecab/) morphological parser using the IPA dictionary and then split into characters.

The vocabulary size is 4000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0 license](https://creativecommons.org/licenses/by-sa/3.0/).

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```

## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese-char
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese-char/files.json
@@ -0,0 +1,6 @@
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/vocab.txt"
}
63 changes: 63 additions & 0 deletions community/iverxin/bert-base-japanese-whole-word-masking/README.md
@@ -0,0 +1,63 @@


# BERT base Japanese (IPA dictionary, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized with the [MeCab](https://taku910.github.io/mecab/) morphological parser using the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the MLM (masked language modeling) objective, we introduced **whole word masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
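
A minimal sketch of the idea (not the original pretraining code): masking decisions are made per MeCab word, so every subword of a selected word is masked together.

```python
import random

def whole_word_mask(words, subword_fn, mask_prob=0.15, mask_token="[MASK]"):
    """Mask all subword tokens of randomly selected words at once."""
    output = []
    for word in words:
        subwords = subword_fn(word)
        if random.random() < mask_prob:
            output.extend([mask_token] * len(subwords))  # mask the whole word
        else:
            output.extend(subwords)
    return output

# Toy subword splitter standing in for the WordPiece tokenizer.
toy_subwords = {"東京": ["東京"], "大学": ["大", "##学"], "に": ["に"], "行く": ["行", "##く"]}
random.seed(0)
print(whole_word_mask(["東京", "大学", "に", "行く"], lambda w: toy_subwords[w], mask_prob=0.5))
# e.g. ['東京', '大', '##学', '[MASK]', '[MASK]', '[MASK]'] -- all subwords of a masked word go together
```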

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0 license](https://creativecommons.org/licenses/by-sa/3.0/).

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.

## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```

## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese-whole-word-masking/files.json
@@ -0,0 +1,6 @@
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/vocab.txt"
}
59 changes: 59 additions & 0 deletions community/iverxin/bert-base-japanese/README.md
@@ -0,0 +1,59 @@
# BERT base Japanese (IPA dictionary)

This is a [BERT](https://github.com/google-research/bert) model pretrained on texts in the Japanese language.

This version of the model processes input texts with word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization.

The code for pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).

## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768-dimensional hidden states, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain texts from a dump file of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.

## Tokenization

The texts are first tokenized with the [MeCab](https://taku910.github.io/mecab/) morphological parser using the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.
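
As an illustration of the subword stage, here is a simplified greedy longest-match-first WordPiece split (a toy sketch, not the library implementation) applied to a single MeCab word token:

```python
def wordpiece(word, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first subword split, as in the original BERT WordPiece."""
    if len(word) > max_chars:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the '##' prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk_token]  # no prefix of the remainder is in the vocabulary
        tokens.append(cur)
        start = end
    return tokens

# Toy vocabulary standing in for the real 32,000-entry WordPiece vocab.
toy_vocab = {"言語", "##処理", "##理", "処"}
print(wordpiece("言語処理", toy_vocab))  # ['言語', '##処理']
```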

## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0 license](https://creativecommons.org/licenses/by-sa/3.0/).

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.


## Usage
```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
```


## Weights source
https://huggingface.co/cl-tohoku/bert-base-japanese
6 changes: 6 additions & 0 deletions community/iverxin/bert-base-japanese/files.json
@@ -0,0 +1,6 @@
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/vocab.txt"
}
1 change: 1 addition & 0 deletions paddlenlp/transformers/__init__.py
@@ -18,6 +18,7 @@

from .bert.modeling import *
from .bert.tokenizer import *
from .bert_japanese.tokenizer import *
from .ernie.modeling import *
from .ernie.tokenizer import *
from .gpt.modeling import *
15 changes: 8 additions & 7 deletions paddlenlp/transformers/bert/tokenizer.py
@@ -14,16 +14,17 @@
# limitations under the License.

import copy
import io
import json
import os
import six
import unicodedata

from .. import PretrainedTokenizer
from ..tokenizer_utils import convert_to_unicode, whitespace_tokenize, _is_whitespace, _is_control, _is_punctuation

__all__ = ['BasicTokenizer', 'BertTokenizer', 'WordpieceTokenizer']
__all__ = [
'BasicTokenizer',
'BertTokenizer',
'WordpieceTokenizer',
]


class BasicTokenizer(object):
@@ -290,9 +291,9 @@ class BertTokenizer(PretrainedTokenizer):
.. code-block::
from paddlenlp.transformers import BertTokenizer
berttokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = berttokenizer.tokenize('He was a puppeteer')
inputs = tokenizer('He was a puppeteer')
print(inputs)
'''
@@ -554,7 +555,7 @@ def create_token_type_ids_from_sequences(self,
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |
If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
Args:
token_ids_0 (List[int]):
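For context on the `create_token_type_ids_from_sequences` docstring touched above, here is a short sketch (assuming the tokenizer call API shown in the updated usage example) of how the 0/1 `token_type_ids` mask separates a sequence pair:

```python
from paddlenlp.transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encoding a sequence pair: token_type_ids is 0 over the first sequence
# (including [CLS] and its [SEP]) and 1 over the second sequence.
inputs = tokenizer('He was a puppeteer', 'He liked puppets')
print(inputs['token_type_ids'])
# a run of 0s for the first sequence followed by a run of 1s for the second
```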
Empty file.
@@ -0,0 +1,69 @@
import paddle
import torch
from paddle.utils.download import get_path_from_url

model_names = [
    "bert-base-japanese", "bert-base-japanese-whole-word-masking",
    "bert-base-japanese-char", "bert-base-japanese-char-whole-word-masking"
]

for model_name in model_names:
    # Download the original PyTorch checkpoint from the Hugging Face Hub.
    torch_model_url = "https://huggingface.co/cl-tohoku/%s/resolve/main/pytorch_model.bin" % model_name
    torch_model_path = get_path_from_url(torch_model_url, '../bert')
    torch_state_dict = torch.load(torch_model_path, map_location="cpu")

    paddle_model_path = "%s.pdparams" % model_name
    paddle_state_dict = {}

    # state_dict key mapping: torch parameter names -> paddle parameter names
    keys_dict = {
        # embeddings
        "embeddings.LayerNorm.gamma": "embeddings.layer_norm.weight",
        "embeddings.LayerNorm.beta": "embeddings.layer_norm.bias",

        # encoder layers
        'encoder.layer': 'encoder.layers',
        'attention.self.query': 'self_attn.q_proj',
        'attention.self.key': 'self_attn.k_proj',
        'attention.self.value': 'self_attn.v_proj',
        'attention.output.dense': 'self_attn.out_proj',
        'attention.output.LayerNorm.gamma': 'norm1.weight',
        'attention.output.LayerNorm.beta': 'norm1.bias',
        'intermediate.dense': 'linear1',
        'output.dense': 'linear2',
        'output.LayerNorm.gamma': 'norm2.weight',
        'output.LayerNorm.beta': 'norm2.bias',

        # cls predictions (MLM head)
        'cls.predictions.transform.dense': 'cls.predictions.transform',
        'cls.predictions.decoder.weight': 'cls.predictions.decoder_weight',
        'cls.predictions.transform.LayerNorm.gamma':
        'cls.predictions.layer_norm.weight',
        'cls.predictions.transform.LayerNorm.beta':
        'cls.predictions.layer_norm.bias',
        'cls.predictions.bias': 'cls.predictions.decoder_bias'
    }

    for torch_key in torch_state_dict:
        paddle_key = torch_key
        for k in keys_dict:
            if k in paddle_key:
                paddle_key = paddle_key.replace(k, keys_dict[k])

        # torch nn.Linear stores weights as [out_features, in_features], while
        # paddle nn.Linear stores them as [in_features, out_features], so the
        # 2-D weight matrices of linear/projection layers are transposed here.
        if ('linear' in paddle_key) or ('proj' in paddle_key) or (
                'vocab' in paddle_key and 'weight' in paddle_key) or (
                    "dense.weight" in paddle_key) or (
                        'transform.weight' in paddle_key) or (
                            'seq_relationship.weight' in paddle_key):
            paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[
                torch_key].cpu().numpy().transpose())
        else:
            paddle_state_dict[paddle_key] = paddle.to_tensor(torch_state_dict[
                torch_key].cpu().numpy())

        print("torch: ", torch_key, "\t", torch_state_dict[torch_key].shape)
        print("paddle: ", paddle_key, "\t", paddle_state_dict[paddle_key].shape,
              "\n")

    paddle.save(paddle_state_dict, paddle_model_path)
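
As a follow-up, here is a rough numerical sanity check for a converted checkpoint, written as a sketch under assumptions: the converted weights plus a matching `model_config.json` and `vocab.txt` have been placed in a local `./bert-base-japanese/` directory, and both `transformers` and `paddlenlp` are installed. It feeds the same token ids through both frameworks and compares the MLM logits.

```python
import numpy as np
import paddle
import torch
from transformers import BertForMaskedLM as HFBertForMaskedLM
from paddlenlp.transformers import BertForMaskedLM as PDBertForMaskedLM

# Arbitrary token ids, fed identically to both frameworks.
input_ids = [[2, 100, 200, 300, 3]]

# Reference logits from the original Hugging Face checkpoint.
hf_model = HFBertForMaskedLM.from_pretrained("cl-tohoku/bert-base-japanese")
hf_model.eval()
with torch.no_grad():
    hf_logits = hf_model(input_ids=torch.tensor(input_ids)).logits.numpy()

# Logits from the converted Paddle weights (assumed to live in ./bert-base-japanese/).
pd_model = PDBertForMaskedLM.from_pretrained("./bert-base-japanese")
pd_model.eval()
with paddle.no_grad():
    pd_logits = pd_model(input_ids=paddle.to_tensor(input_ids)).numpy()

# Small numerical differences between frameworks are expected.
print("max abs diff:", np.abs(hf_logits - pd_logits).max())
```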