
[Pre-Training] Add tutorial for clue small 14g dataset (#1555)
* add tutorial for clue small 14g.

* add pre-train weight to community.

* fix typos.

* fix typo.

* add dataset link.

* change name to ernie-1.0-cluecorpussmall

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>
ZHUI and ZeyuChen authored Jan 15, 2022
1 parent cf51c8a commit a5f8a3e
Showing 7 changed files with 234 additions and 11 deletions.
48 changes: 48 additions & 0 deletions community/zhui/ernie-1.0-cluecorpussmall/README.md
@@ -0,0 +1,48 @@
# Detailed Introduction
These weights were trained on the CLUECorpusSmall 14g dataset using the [ERNIE-1.0 pre-training tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/ernie-1.0) provided by PaddleNLP.

The model architecture is identical to ernie-1.0. It was trained with the configuration `batch_size=512, max_steps=1000000`, and the weights are used in exactly the same way as the original ernie-1.0 weights.

For the full pre-training pipeline, see: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/ernie-1.0/README.md

# Usage Examples

Example 1:
```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
model = ErnieForMaskedLM.from_pretrained('zhui/ernie-1.0-cluecorpussmall')

tokens = ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
masked_ids = paddle.to_tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = paddle.to_tensor([[0] * len(tokens)])

outputs = model(masked_ids, token_type_ids=segment_ids)
prediction_scores = outputs
# Index 3 is the position of the [MASK] token.
prediction_index = paddle.argmax(prediction_scores[0, 3]).item()
predicted_token = tokenizer.convert_ids_to_tokens([prediction_index])[0]
print(tokens)
# ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
print(predicted_token)
# the token predicted for the [MASK] position
```

Example 2:
```python
import paddle
from paddlenlp.transformers import (AutoTokenizer, AutoModel,
                                    AutoModelForSequenceClassification,
                                    AutoModelForTokenClassification,
                                    AutoModelForQuestionAnswering)

tokenizer = AutoTokenizer.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
text = tokenizer('自然语言处理')

# Semantic representation
model = AutoModel.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text classification & sentence-pair matching
model = AutoModelForSequenceClassification.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
# Sequence labeling
model = AutoModelForTokenClassification.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
# Question answering
model = AutoModelForQuestionAnswering.from_pretrained('zhui/ernie-1.0-cluecorpussmall')
```
6 changes: 6 additions & 0 deletions community/zhui/ernie-1.0-cluecorpussmall/files.json
@@ -0,0 +1,6 @@
{
  "model_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/model_config.json",
  "model_state": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/model_state.pdparams",
  "tokenizer_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/tokenizer_config.json",
  "vocab_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/ernie-1.0-cluecorpussmall/vocab.txt"
}
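For reference, `files.json` is just a map from file roles to download URLs, which `from_pretrained('zhui/ernie-1.0-cluecorpussmall')` resolves automatically. A minimal sketch for fetching the files manually with the standard library (local filenames here are hypothetical, derived from the URL tails):

```python
import json
import urllib.request

# Load the role -> URL map shown above (saved locally as files.json).
with open("files.json", encoding="utf-8") as f:
    files = json.load(f)

for role, url in files.items():
    filename = url.rsplit("/", 1)[-1]  # e.g. vocab.txt; local naming is arbitrary
    urllib.request.urlretrieve(url, filename)
    print(f"{role}: downloaded {filename}")
```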
52 changes: 50 additions & 2 deletions examples/language_model/data_tools/README.md
@@ -131,7 +131,7 @@ chinese words:
                      Optional. Whether the WWM (whole word masking) strategy is needed. Generally, BERT/ERNIE-style models need it and GPT does not.
--cn_seg_func {lac,seg,jieba}
Words segment function for chinese words.
-                     Default lac; jieba is faster.
+                     Default jieba; jieba is faster, while the lac model is more accurate but computationally more expensive.
--cn_splited          Whether the Chinese corpus has already been split into words.
                      Optional. When this option is set, cn_seg_func has no effect.
                      For example, a pre-segmented text string: "百度 手机助手 是 Android 手机 的 权威 资源平台"
@@ -148,7 +148,7 @@ common config:
--workers WORKERS Number of worker processes to launch
                      Number of worker processes used to convert text to token ids.
```
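As a rough illustration of what `--cn_seg_func` selects, the sketch below segments the sample sentence from the help text with jieba, the new default; `lac` and `seg` are drop-in alternatives that trade speed for accuracy:

```python
import jieba

sentence = "百度手机助手是Android手机的权威资源平台"
# jieba: fast, dictionary/HMM-based segmentation (the default cn_seg_func).
print(" ".join(jieba.cut(sentence)))
# Expected output along the lines of: 百度 手机助手 是 Android 手机 的 权威 资源平台
```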
Running the script below produces the processed pre-training data: token ids `baike_sample_ids.npy` and article index information `baike_sample_idx.npz`.
```
python -u create_pretraining_data.py \
    --model_name ernie-1.0 \
    ...
```
@@ -190,3 +190,51 @@ sh run_static.sh
## References

Note: Most of the data pipeline is adapted from [Megatron](https://github.com/NVIDIA/Megatron-LM), to which we express our thanks.


# Appendix

## CLUECorpusSmall Dataset Processing Tutorial
**Dataset overview**: Usable for language modeling, pre-training, generative tasks, and more. It contains over 14 GB of data in nearly 4,000 well-formed txt files, about 5 billion characters in total, drawn mainly from the nlp_chinese_corpus project.
It comprises the following sub-corpora (14 GB in total): news corpus [news2016zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/6bac09db4e6d4857b6d680d34447457490cb2dbdd8b8462ea1780a407f38e12b?responseContentDisposition=attachment%3B%20filename%3Dnews2016zh_corpus.zip), community interaction corpus [webText2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/83da03f7b4974871a52348b41c16c7e3b34a26d5ca644f558df8435be4de51c3?responseContentDisposition=attachment%3B%20filename%3DwebText2019zh_corpus.zip), Wikipedia corpus [wiki2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/d7a166408d8b4ffdaf4de9cfca09f6ee1e2340260f26440a92f78134d068b28f?responseContentDisposition=attachment%3B%20filename%3Dwiki2019zh_corpus.zip), and comment corpus [comment2019zh_corpus.zip](https://bj.bcebos.com/v1/ai-studio-online/b66ddd445735408383c42322850ac4bb82faf9cc611447c2affb925443de7a6d?responseContentDisposition=attachment%3B%20filename%3Dcomment2019zh_corpus.zip)

**Dataset download**
The dataset can be downloaded from the official GitHub page, https://github.com/CLUEbenchmark/CLUECorpus2020. For convenience, we also provide AI Studio download links: [part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598), [part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357). After downloading the AI Studio version, you can verify the md5 checksums:
```shell
> md5sum ./*
8a8be341ebce39cfe9524fb0b46b08c5 ./comment2019zh_corpus.zip
4bdc2c941a7adb4a061caf273fea42b8 ./news2016zh_corpus.zip
fc582409f078b10d717caf233cc58ddd ./webText2019zh_corpus.zip
157dacde91dcbd2e52a60af49f710fa5 ./wiki2019zh_corpus.zip
```
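On platforms without `md5sum`, a small Python equivalent can verify the same checksums (paths assume the zip files sit in the current directory):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Expected value taken from the md5sum listing above.
assert file_md5("./news2016zh_corpus.zip") == "4bdc2c941a7adb4a061caf273fea42b8"
```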
Unzip the files:
```shell
unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus
unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus
unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus
unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus
```
Convert the txt files to jsonl format:
```
python trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl
```
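To sanity-check the conversion before the long preprocessing step, you can peek at the first record; the field names are an assumption here, so inspect `trans_to_json.py` for the authoritative schema:

```python
import json

# Read only the first line of the ~14 GB jsonl file.
with open("clue_corpus_small_14g.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())
print(record.keys())  # presumably a "text" field holding the raw document
```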
We now have the dataset in jsonl format. Next, the data is prepared for the training task; ERNIE is used as the example here:
```
python -u create_pretraining_data.py \
    --model_name ernie-1.0 \
    --tokenizer_name ErnieTokenizer \
    --input_path clue_corpus_small_14g.jsonl \
    --split_sentences \
    --chinese \
    --cn_whole_word_segment \
    --cn_seg_func jieba \
    --output_prefix clue_corpus_small_14g_20220104 \
    --workers 48 \
    --log_interval 10000
```
There are about `15702702` documents in total. Word segmentation is time-consuming, so processing takes roughly one hour and produces the data needed for training in the current directory:
```
clue_corpus_small_14g_20220104_ids.npy
clue_corpus_small_14g_20220104_idx.npz
```
This data can then be used for the pre-training task.
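As a quick sanity check before launching training, both output files can be inspected with numpy; the exact array layout is internal to the data pipeline, so treat this as a peek only:

```python
import numpy as np

# Token ids, memory-mapped to avoid loading everything into RAM.
ids = np.load("clue_corpus_small_14g_20220104_ids.npy", mmap_mode="r")
# Document index information.
idx = np.load("clue_corpus_small_14g_20220104_idx.npz")

print(ids.shape, ids.dtype)  # total number of tokens and the id dtype
print(idx.files)             # names of the index arrays stored in the npz
```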
2 changes: 1 addition & 1 deletion examples/language_model/data_tools/create_pretraining_data.py
@@ -86,7 +86,7 @@ def get_args():
    group.add_argument(
        '--cn_seg_func',
        type=str,
-       default='lac',
+       default='jieba',
        choices=['lac', 'seg', 'jieba'],
        help='Words segment function for chinese words.')
    group.add_argument(
38 changes: 35 additions & 3 deletions examples/language_model/ernie-1.0/README.md
@@ -40,12 +40,12 @@ python -u -m paddle.distributed.launch \
    --use_recompute false \
    --max_lr 0.0001 \
    --min_lr 0.00001 \
-   --max_steps 4000000 \
+   --max_steps 1000000 \
    --save_steps 50000 \
    --checkpoint_steps 5000 \
-   --decay_steps 3960000 \
+   --decay_steps 990000 \
    --weight_decay 0.01 \
-   --warmup_rate 0.0025 \
+   --warmup_rate 0.01 \
    --grad_clip 1.0 \
    --logging_freq 20 \
    --num_workers 2 \
@@ -82,6 +82,32 @@ python -u -m paddle.distributed.launch \
- In general, `global_batch_size = micro_batch_size * sharding_degree * dp_degree`; see the sketch after this list. You can enlarge `global_batch_size` via gradient accumulation: when `global_batch_size` is set to an integer multiple of this theoretical value, gradient accumulation is enabled by default.
- To resume training after an interruption, simply relaunch; the program finds the latest checkpoint and restarts training from it.
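A minimal sketch of the batch-size relation above, with hypothetical parallelism degrees:

```python
# Hypothetical values; the degrees come from the distributed launch config.
micro_batch_size = 32
sharding_degree = 4
dp_degree = 4

global_batch_size = micro_batch_size * sharding_degree * dp_degree
print(global_batch_size)  # 512, the batch size used for the CLUECorpusSmall run
```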


### Training Results on CLUECorpusSmall

For data preparation, see the appendix of [data_tools](../data_tools/) and follow that documentation to create the clue_corpus_small_14g dataset.
This training script was run with batch_size=512 and max_steps=1000000; for detailed training logs see: https://www.paddlepaddle.org.cn/paddle/visualdl/service/app/scalar?id=b0e19e554d68b9165a55901f0eb92812

Final training loss:

|Loss | Train | Validation |
|-|-|-|
|loss |2.72 | 2.60 |
|lm_loss|2.60 | 2.50 |
|sop_loss|0.12 | 0.10 |

The training-set lm_loss is around 2.60 and the validation-set lm_loss around 2.50.

Using the trained model parameters, we fine-tune on downstream tasks (the static-graph parameters must first be converted to dynamic-graph format; see the model parameter conversion section below). Fine-tuning results on several datasets:

| Dataset | Dev | Test |
|--|--|--|
| XNLI-CN | 0.79269 | 0.78339 |
| ChnSentiCorp | 0.94495 | 0.95496 |
| PeoplesDailyNer | 0.95128 | 0.94035 |
| CMRC2018 (EM/F1) | 72.05/85.67 | - |


### Miscellaneous
#### Model Parameter Conversion
This example trains with static-graph scripts, while Paddle today is mainly used in dynamic-graph mode. The example therefore provides a script to convert static-graph parameters into dynamic-graph parameters:
@@ -93,6 +119,12 @@ python converter/params_static_to_dygraph.py --model ernie-1.0 --path ./output/t
```
The converted parameters `ernie-1.0_converted.pdparams` appear in the current directory; you can also set the script's `--output_path` argument to specify a different output path.
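A minimal sketch of loading the converted file into a dynamic-graph model; it assumes the converted state dict matches `ErnieForPretraining`, the architecture the converter script instantiates:

```python
import paddle
from paddlenlp.transformers import ErnieForPretraining

# Build the dynamic-graph model, then overwrite its weights with the
# converted static-graph parameters.
model = ErnieForPretraining.from_pretrained('ernie-1.0')
state_dict = paddle.load('ernie-1.0_converted.pdparams')
model.set_state_dict(state_dict)
model.eval()
```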

#### Contributing Pre-trained Parameters to PaddleNLP
PaddleNLP provides the [community](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/community) module so that developers can upload models they have trained and open-source them for other users.
Training on the CLUECorpusSmall dataset with the configuration given in this document produces the [zhui/ernie-1.0-cluecorpussmall](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/community/zhui/ernie-1.0-cluecorpussmall) parameters, which can be used directly via the link.

For how to contribute a pre-trained model, see the [contributing pre-trained model weights](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/community/contribute_models/contribute_awesome_pretrained_models.rst) tutorial.


### References
- [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/pdf/1904.09223.pdf)
4 changes: 2 additions & 2 deletions examples/language_model/ernie-1.0/converter/params_static_to_dygraph.py
@@ -1,6 +1,6 @@
import argparse
import paddle
-from paddlenlp.transformers import AutoModel
+from paddlenlp.transformers import AutoModelForPretraining
from paddlenlp.utils.log import logger

paddle.set_device("cpu")
@@ -25,7 +25,7 @@ def init_dygraph_with_static(model, static_params_path):

def main(args):
    logger.info("Loading model: %s" % args.model)
-   model = AutoModel.from_pretrained(args.model)
+   model = AutoModelForPretraining.from_pretrained(args.model)
    logger.info("Loading static params and converting parameters...")
    model_dict = init_dygraph_with_static(model, args.path)
    save_name = args.output_path
95 changes: 92 additions & 3 deletions paddlenlp/transformers/ernie/modeling.py
@@ -19,9 +19,14 @@
from .. import PretrainedModel, register_base_model

__all__ = [
-    'ErnieModel', 'ErniePretrainedModel', 'ErnieForSequenceClassification',
-    'ErnieForTokenClassification', 'ErnieForQuestionAnswering',
-    'ErnieForPretraining', 'ErniePretrainingCriterion'
+    'ErnieModel',
+    'ErniePretrainedModel',
+    'ErnieForSequenceClassification',
+    'ErnieForTokenClassification',
+    'ErnieForQuestionAnswering',
+    'ErnieForPretraining',
+    'ErniePretrainingCriterion',
+    'ErnieForMaskedLM',
]


@@ -770,3 +775,87 @@ def forward(self, prediction_scores, seq_relationship_score,
        next_sentence_loss = F.cross_entropy(
            seq_relationship_score, next_sentence_labels, reduction='none')
        return paddle.mean(masked_lm_loss), paddle.mean(next_sentence_loss)


class ErnieOnlyMLMHead(nn.Layer):
    def __init__(self, hidden_size, vocab_size, activation, embedding_weights):
        super().__init__()
        self.predictions = ErnieLMPredictionHead(
            hidden_size=hidden_size,
            vocab_size=vocab_size,
            activation=activation,
            embedding_weights=embedding_weights)

    def forward(self, sequence_output, masked_positions=None):
        prediction_scores = self.predictions(sequence_output, masked_positions)
        return prediction_scores


class ErnieForMaskedLM(ErniePretrainedModel):
    """
    Ernie Model with a `masked language modeling` head on top.

    Args:
        ernie (:class:`ErnieModel`):
            An instance of :class:`ErnieModel`.
    """

    def __init__(self, ernie):
        super(ErnieForMaskedLM, self).__init__()
        self.ernie = ernie
        self.cls = ErnieOnlyMLMHead(
            self.ernie.config["hidden_size"],
            self.ernie.config["vocab_size"],
            self.ernie.config["hidden_act"],
            embedding_weights=self.ernie.embeddings.word_embeddings.weight)

        self.apply(self.init_weights)

    def forward(self,
                input_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None):
        r"""
        Args:
            input_ids (Tensor):
                See :class:`ErnieModel`.
            token_type_ids (Tensor, optional):
                See :class:`ErnieModel`.
            position_ids (Tensor, optional):
                See :class:`ErnieModel`.
            attention_mask (Tensor, optional):
                See :class:`ErnieModel`.

        Returns:
            Tensor: Returns tensor `prediction_scores`, the scores of masked token prediction.
            Its data type should be float32 and its shape is [batch_size, sequence_length, vocab_size].

        Example:
            .. code-block::

                import paddle
                from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

                tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
                model = ErnieForMaskedLM.from_pretrained('ernie-1.0')

                inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!")
                inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}

                logits = model(**inputs)
                print(logits.shape)
                # [1, 17, 18000]
        """

        outputs = self.ernie(
            input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)
        sequence_output = outputs[0]
        prediction_scores = self.cls(sequence_output, masked_positions=None)
        return prediction_scores
