forked from PaddlePaddle/Paddle
[PaddlePaddle Hackathon] Task 51 (PaddlePaddle#1115)

* add bert japanese
* fix model-weight files position
* add weights files url
* create package: bert_japanese
* update weights readme
* update weights files
* update config pretrain weights https
* fix the weight configuration files
* retest CI
* update
* update
* fix docstring
* update
* update pretrained weights
* update weights readme
* remove weights url in codes
* update...
* update...
* update weights readme
* update
* update
* update docstring
* clean up redundant code

Co-authored-by: yingyibiao <yyb0576@163.com>
1 parent 20acd16, commit 48e58f0
Showing 15 changed files with 785 additions and 7 deletions.
64 changes: 64 additions & 0 deletions in community/iverxin/bert-base-japanese-char-whole-word-masking/README.md
# BERT base Japanese (character tokenization, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on Japanese text.

This version of the model processes input text with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768 hidden dimensions, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain text from a dump of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.
## Tokenization

The text is first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.
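To make the two-stage pipeline concrete, here is a minimal, dependency-free sketch (not taken from the model's actual tokenizer code); the word list stands in for real MeCab output and is assumed purely for illustration.

```python
# Illustration of "word-level tokenization followed by character-level tokenization".
# `mecab_words` stands in for real MeCab output (an assumption for this sketch);
# the real tokenizer also handles special tokens and out-of-vocabulary characters.
def char_tokenize(mecab_words):
    tokens = []
    for word in mecab_words:
        # Each word produced by the morphological parser is split into characters.
        tokens.extend(list(word))
    return tokens

mecab_words = ["日本語", "は", "難しい"]   # hypothetical word segmentation
print(char_tokenize(mecab_words))
# ['日', '本', '語', 'は', '難', 'し', 'い']
```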
## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
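For a sense of scale, this schedule processes roughly 512 × 256 × 1,000,000 ≈ 131B token positions over the course of pretraining, which corresponds to many passes over the roughly 17M-sentence corpus.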
For the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
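The sketch below (not the original training code) illustrates the idea: the masking decision is made per word, and when a word is selected, all of its sub-tokens are masked together. It omits the 80/10/10 mask/random/keep split used by standard BERT MLM, and the word groupings are assumed for illustration.

```python
import random

# Whole word masking sketch: each inner list holds the sub-tokens of one word
# (as MeCab would segment it); when a word is chosen, every one of its
# sub-tokens is masked at once rather than independently.
def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]"):
    masked = []
    for sub_tokens in words:
        if random.random() < mask_prob:
            masked.extend([mask_token] * len(sub_tokens))  # mask the whole word
        else:
            masked.extend(sub_tokens)
    return masked

words = [["日", "本", "語"], ["は"], ["難", "し", "い"]]   # hypothetical segmentation
print(whole_word_mask(words))
```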
## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [batch_size, sequence_length, vocab_size]; the vocabulary size is 4000 for this character-level model
```
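As a follow-up to the snippet above, the MLM head's scores can be mapped back to tokens. This is only a sketch, not part of the original README; it assumes the PaddleNLP tokenizer exposes `convert_ids_to_tokens` the way its Hugging Face counterpart does, and with no `[MASK]` in the input it mostly just reproduces the input tokens.

```python
# Sketch: take the highest-scoring vocabulary id at every position
# and map the ids back to tokens.
pred_ids = paddle.argmax(output, axis=-1)   # shape: [1, sequence_length]
pred_tokens = tokenizer.convert_ids_to_tokens(pred_ids[0].numpy().tolist())
print(pred_tokens)
```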
## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking
6 changes: 6 additions & 0 deletions in community/iverxin/bert-base-japanese-char-whole-word-masking/files.json
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char-whole-word-masking/vocab.txt"
}
60 changes: 60 additions & 0 deletions in community/iverxin/bert-base-japanese-char/README.md
# BERT base Japanese (character tokenization)

This is a [BERT](https://github.com/google-research/bert) model pretrained on Japanese text.

This version of the model processes input text with word-level tokenization based on the IPA dictionary, followed by character-level tokenization.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768 hidden dimensions, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain text from a dump of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.
## Tokenization

The text is first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into characters.

The vocabulary size is 4000.
## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-char/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"
text2 = "櫓を飛ばす"  # contains the uncommon kanji 櫓; see the note below

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [batch_size, sequence_length, vocab_size]; the vocabulary size is 4000 for this character-level model
```
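`text2` above contains the relatively uncommon kanji 櫓. The sketch below (not from the original README) inspects the tokenizer output directly; it assumes `tokenize()` behaves like the Hugging Face tokenizer of the same name, and a character missing from the 4,000-entry vocabulary may surface as the unknown token.

```python
# Sketch: character-level tokenization splits every word into single characters,
# so a rare kanji never drags a whole word into [UNK].
print(tokenizer.tokenize(text2))
# e.g. ['櫓', 'を', '飛', 'ば', 'す'] (an out-of-vocabulary character would appear as [UNK] instead)
```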
## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-char
6 changes: 6 additions & 0 deletions in community/iverxin/bert-base-japanese-char/files.json
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-char/vocab.txt"
}
63 changes: 63 additions & 0 deletions in community/iverxin/bert-base-japanese-whole-word-masking/README.md
# BERT base Japanese (IPA dictionary, whole word masking enabled)

This is a [BERT](https://github.com/google-research/bert) model pretrained on Japanese text.

This version of the model processes input text with word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization.

Additionally, the model is trained with whole word masking enabled for the masked language modeling (MLM) objective.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768 hidden dimensions, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain text from a dump of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.
## Tokenization

The text is first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.
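For intuition, WordPiece-style subword splitting can be sketched as a greedy longest-match against the vocabulary. The toy vocabulary below is invented for illustration and is far smaller than the model's 32,000 entries; the real algorithm also enforces a maximum word length.

```python
# Toy greedy longest-match subword split in the spirit of WordPiece
# (illustrative only; the vocabulary here is made up).
def wordpiece_split(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            # Continuation pieces are prefixed with "##", as in BERT's WordPiece.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]          # no matching piece: fall back to the unknown token
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"日本", "##語", "難し", "##い"}
print(wordpiece_split("日本語", toy_vocab))   # ['日本', '##語']
print(wordpiece_split("難しい", toy_vocab))   # ['難し', '##い']
```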
## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

For the MLM (masked language modeling) objective, we introduced **Whole Word Masking**, in which all of the subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese-whole-word-masking/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [batch_size, sequence_length, vocab_size]; the vocabulary size is 32000 for this model
```
## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking
6 changes: 6 additions & 0 deletions in community/iverxin/bert-base-japanese-whole-word-masking/files.json
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese-whole-word-masking/vocab.txt"
}
59 changes: 59 additions & 0 deletions in community/iverxin/bert-base-japanese/README.md
# BERT base Japanese (IPA dictionary)

This is a [BERT](https://github.com/google-research/bert) model pretrained on Japanese text.

This version of the model processes input text with word-level tokenization based on the IPA dictionary, followed by WordPiece subword tokenization.

The code for the pretraining is available at [cl-tohoku/bert-japanese](https://github.com/cl-tohoku/bert-japanese/tree/v1.0).
## Model architecture

The model architecture is the same as the original BERT base model: 12 layers, 768 hidden dimensions, and 12 attention heads.

## Training Data

The model is trained on Japanese Wikipedia as of September 1, 2019.

To generate the training corpus, [WikiExtractor](https://github.com/attardi/wikiextractor) is used to extract plain text from a dump of Wikipedia articles.

The text files used for training are 2.6 GB in size, consisting of approximately 17M sentences.
## Tokenization

The text is first tokenized by the [MeCab](https://taku910.github.io/mecab/) morphological parser with the IPA dictionary and then split into subwords by the WordPiece algorithm.

The vocabulary size is 32000.
## Training

The model is trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.
## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license.

## Acknowledgments

For training the models, we used Cloud TPUs provided by the [TensorFlow Research Cloud](https://www.tensorflow.org/tfrc/) program.
## Usage

```python
import paddle
from paddlenlp.transformers import BertJapaneseTokenizer, BertForMaskedLM

path = "iverxin/bert-base-japanese/"
tokenizer = BertJapaneseTokenizer.from_pretrained(path)
model = BertForMaskedLM.from_pretrained(path)
text1 = "こんにちは"

model.eval()
inputs = tokenizer(text1)
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
output = model(**inputs)
print(output.shape)
# [batch_size, sequence_length, vocab_size]; the vocabulary size is 32000 for this model
```
## Weights source

https://huggingface.co/cl-tohoku/bert-base-japanese
6 changes: 6 additions & 0 deletions in community/iverxin/bert-base-japanese/files.json
{
  "model_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_config.json",
  "model_state": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/model_state.pdparams",
  "tokenizer_config_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/tokenizer_config.pdparams",
  "vocab_file": "https://paddlenlp.bj.bcebos.com/models/transformers/community/iverxin/bert-base-japanese/vocab.txt"
}
Empty file.
69 changes: 69 additions & 0 deletions in paddlenlp/transformers/bert_japanese/convert_bert_japanese_params.py
# Convert the cl-tohoku BERT Japanese checkpoints released for PyTorch
# into PaddlePaddle *.pdparams state dicts.
import paddle
import torch
from paddle.utils.download import get_path_from_url

model_names = [
    "bert-base-japanese", "bert-base-japanese-whole-word-masking",
    "bert-base-japanese-char", "bert-base-japanese-char-whole-word-masking"
]

for model_name in model_names:
    torch_model_url = "https://huggingface.co/cl-tohoku/%s/resolve/main/pytorch_model.bin" % model_name
    torch_model_path = get_path_from_url(torch_model_url, '../bert')
    torch_state_dict = torch.load(torch_model_path)

    paddle_model_path = "%s.pdparams" % model_name
    paddle_state_dict = {}

    # State-dict key mapping: from torch names to paddle names.
    keys_dict = {
        # embeddings
        "embeddings.LayerNorm.gamma": "embeddings.layer_norm.weight",
        "embeddings.LayerNorm.beta": "embeddings.layer_norm.bias",

        # encoder layers
        'encoder.layer': 'encoder.layers',
        'attention.self.query': 'self_attn.q_proj',
        'attention.self.key': 'self_attn.k_proj',
        'attention.self.value': 'self_attn.v_proj',
        'attention.output.dense': 'self_attn.out_proj',
        'attention.output.LayerNorm.gamma': 'norm1.weight',
        'attention.output.LayerNorm.beta': 'norm1.bias',
        'intermediate.dense': 'linear1',
        'output.dense': 'linear2',
        'output.LayerNorm.gamma': 'norm2.weight',
        'output.LayerNorm.beta': 'norm2.bias',

        # cls predictions (MLM head)
        'cls.predictions.transform.dense': 'cls.predictions.transform',
        'cls.predictions.decoder.weight': 'cls.predictions.decoder_weight',
        'cls.predictions.transform.LayerNorm.gamma': 'cls.predictions.layer_norm.weight',
        'cls.predictions.transform.LayerNorm.beta': 'cls.predictions.layer_norm.bias',
        'cls.predictions.bias': 'cls.predictions.decoder_bias'
    }

    for torch_key in torch_state_dict:
        paddle_key = torch_key
        for k in keys_dict:
            if k in paddle_key:
                paddle_key = paddle_key.replace(k, keys_dict[k])

        # Linear-layer weights are stored as (out_features, in_features) in torch
        # but (in_features, out_features) in paddle, so they are transposed here;
        # everything else (biases, embeddings, layer norms) is copied as-is.
        if ('linear' in paddle_key) or ('proj' in paddle_key) or (
                'vocab' in paddle_key and 'weight' in paddle_key) or (
                    'dense.weight' in paddle_key) or (
                    'transform.weight' in paddle_key) or (
                    'seq_relationship.weight' in paddle_key):
            paddle_state_dict[paddle_key] = paddle.to_tensor(
                torch_state_dict[torch_key].cpu().numpy().transpose())
        else:
            paddle_state_dict[paddle_key] = paddle.to_tensor(
                torch_state_dict[torch_key].cpu().numpy())

        print("torch: ", torch_key, "\t", torch_state_dict[torch_key].shape)
        print("paddle: ", paddle_key, "\t", paddle_state_dict[paddle_key].shape, "\n")

    paddle.save(paddle_state_dict, paddle_model_path)
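A converted checkpoint could then be loaded into the PaddleNLP model roughly as follows. This is a sketch, not part of the conversion script; it assumes model_config.json and vocab.txt for the model are already present in the local directory passed to `from_pretrained`.

```python
# Sketch: load a converted checkpoint into the PaddleNLP model.
import paddle
from paddlenlp.transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("./bert-base-japanese/")   # local config + vocab assumed
state_dict = paddle.load("bert-base-japanese.pdparams")            # produced by the script above
model.set_state_dict(state_dict)
```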