[Pre-Training] Add tutorial for clue small 14g dataset #1555

Merged: 9 commits, Jan 15, 2022

Changes from 2 commits
46 changes: 46 additions & 0 deletions community/zhui/cluecorpussmall_ernie-1.0/README.md
@@ -0,0 +1,46 @@
# Detailed Introduction

These weights were obtained with the ernie pre-training tutorial provided by PaddleNLP, trained on the clue corpus small 14g dataset.

> **Member:** ernie -> ERNIE. Documentation should distinguish the model's official name from API parameter names; the formal name is ERNIE / ERNIE-1.0.
> Likewise, use the official name instead of "clue corpus small 14g".
>
> **Collaborator (Author):** done

The model architecture is identical to ernie-1.0; the weights were trained with batch_size=512 and max_steps=1,000,000. Usage is the same as for the original ernie-1.0 weights.

# Usage Examples

Example 1:
```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
model = ErnieForMaskedLM.from_pretrained('zhui/cluecorpussmall_ernie-1.0')

tokens = ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
masked_ids = paddle.to_tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = paddle.to_tensor([[0] * len(tokens)])

outputs = model(masked_ids, token_type_ids=segment_ids)
prediction_scores = outputs
# Index 3 is the position of '[MASK]' in `tokens`.
prediction_index = paddle.argmax(prediction_scores[0, 3]).item()
predicted_token = tokenizer.convert_ids_to_tokens([prediction_index])[0]
print(tokens)
# ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
print(predicted_token)
# 猫
```

> **Member** (on the `from_pretrained` lines above): Rename it to ernie-1.0-cluecorpus2020? Please double-check whether the official name of the corpus used is CLUECorpus2020: https://github.com/CLUEbenchmark/CLUE
>
> **Collaborator (Author):** https://github.com/CLUEbenchmark/CLUECorpus2020 — CLUECorpus2020 is the 100G dataset and requires an application; what we used is CLUECorpusSmall, which is only 14G. They are two different datasets. I will rename it to ernie-1.0-cluecorpussmall.
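Building on Example 1, a small sketch that lists the top-5 candidate tokens for the masked position (reusing the `model`, `tokenizer`, `masked_ids` and `segment_ids` objects defined above):

```python
import paddle

# Top-5 predictions for the '[MASK]' at position 3 (see Example 1).
scores = model(masked_ids, token_type_ids=segment_ids)
values, indices = paddle.topk(scores[0, 3], k=5)
print(tokenizer.convert_ids_to_tokens(indices.tolist()))
```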

Example 2:
```python
import paddle
from paddlenlp.transformers import (AutoTokenizer, AutoModel,
                                    AutoModelForSequenceClassification,
                                    AutoModelForTokenClassification,
                                    AutoModelForQuestionAnswering)

tokenizer = AutoTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
text = tokenizer('自然语言处理')

# Semantic representation
model = AutoModel.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text classification & sentence-pair matching
model = AutoModelForSequenceClassification.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
# Sequence labeling
model = AutoModelForTokenClassification.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
# Question answering
model = AutoModelForQuestionAnswering.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
```
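As a quick check on the `AutoModel` call above (a sketch; the hidden size of 768 assumes an ernie-1.0-sized model):

```python
print(sequence_output.shape)  # [1, seq_len, 768]: token-level representations
print(pooled_output.shape)    # [1, 768]: sentence-level representation
```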
6 changes: 6 additions & 0 deletions community/zhui/cluecorpussmall_ernie-1.0/files.json
@@ -0,0 +1,6 @@
{
  "model_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/model_config.json",
  "model_state": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/model_state.pdparams",
  "tokenizer_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/tokenizer_config.json",
  "vocab_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/vocab.txt"
}
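For illustration, a minimal sketch (standard library only) of fetching one of the files listed above; the URL layout comes from files.json itself, while the config field names printed are assumptions:

```python
import json
from urllib.request import urlopen

BASE = ("https://bj.bcebos.com/paddlenlp/models/transformers/"
        "community/zhui/cluecorpussmall_ernie-1.0")

# Download the hosted model config and print a couple of (assumed) fields.
with urlopen(f"{BASE}/model_config.json") as f:
    config = json.load(f)
print(config.get("hidden_size"), config.get("vocab_size"))
```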
52 changes: 50 additions & 2 deletions examples/language_model/data_tools/README.md
@@ -131,7 +131,7 @@ chinese words:
                        Optional. Whether to use the WWM (whole word masking) strategy. In general, BERT/ERNIE-style models need it; GPT does not.
--cn_seg_func {lac,seg,jieba}
Words segment function for chinese words.
-                        Default is lac; jieba is faster.
+                        Default is jieba; jieba is faster, while the lac model is more complex.
> **Member:** The adjective "complex" is not accurate phrasing here. It should say that the lac segmentation model is more accurate but computationally more expensive.
>
> **Collaborator (Author):** done
--cn_splited            Whether the Chinese corpus has already been split into words.
                        Optional. If set, cn_seg_func has no effect.
                        Example of a pre-split text string: "百度 手机助手 是 Android 手机 的 权威 资源平台"
@@ -148,7 +148,7 @@ common config:
--workers WORKERS Number of worker processes to launch
                        Number of processes used for converting text to token ids.
```
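To illustrate what `--cn_seg_func jieba` does, a minimal sketch (assuming the jieba package is installed; the exact segmentation can vary by version):

```python
import jieba

text = "百度手机助手是Android手机的权威资源平台"
# Join the segmented words with spaces, matching the --cn_splited input format.
print(" ".join(jieba.cut(text)))
# e.g. 百度 手机 助手 是 Android 手机 的 权威 资源平台
```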
-同过下面脚本转化…(typo: 同过 → 通过)
+Converting with the script below, we obtain the processed pre-training data: token ids (`baike_sample_ids.npy`) and the article index (`baike_sample_idx.npz`).
```shell
python -u create_pretraining_data.py \
    --model_name ernie-1.0 \
    ...
```

@@ -190,3 +190,51 @@ sh run_static.sh
## References

Note: most of the data pipeline is adapted from [Megatron](https://github.com/NVIDIA/Megatron-LM), to which we express our thanks.


# Appendix

## Clue corpus small dataset processing tutorial

**Dataset overview**: suitable for language modeling, pre-training, generative tasks and more. Over 14G of data in nearly 4,000 well-formed txt files, about 5 billion characters in total, largely drawn from the nlp_chinese_corpus project.
It contains the following sub-corpora (14G in total): the news corpus news2016zh_corpus, the community-interaction corpus webText2019zh_corpus, the Wikipedia corpus wiki2019zh_corpus, and the comments corpus comments2019zh_corpus.

**Dataset download**:

You can download it from the official githu [sic] page, https://github.com/CLUEbenchmark/CLUE. For convenience, we also provide AI Studio download links: [part1](https://aistudio.baidu.com/aistudio/datasetdetail/60598), [part2](https://aistudio.baidu.com/aistudio/datasetdetail/124357). After downloading the AI Studio version, you can verify the md5 checksums:
> **Member:** "github" is missing its "b".
>
> **Collaborator (Author):** done

```shell
> md5sum ./*
8a8be341ebce39cfe9524fb0b46b08c5 ./comment2019zh_corpus.zip
4bdc2c941a7adb4a061caf273fea42b8 ./news2016zh_corpus.zip
fc582409f078b10d717caf233cc58ddd ./webText2019zh_corpus.zip
157dacde91dcbd2e52a60af49f710fa5 ./wiki2019zh_corpus.zip
```
Unzip the files:
```shell
unzip comment2019zh_corpus.zip -d clue_corpus_small_14g/comment2019zh_corpus
unzip news2016zh_corpus.zip -d clue_corpus_small_14g/news2016zh_corpus
unzip webText2019zh_corpus.zip -d clue_corpus_small_14g/webText2019zh_corpus
unzip wiki2019zh_corpus.zip -d clue_corpus_small_14g/wiki2019zh_corpus
```
Convert the txt files to jsonl format:
```shell
python trans_to_json.py --input_path ./clue_corpus_small_14g --output_path clue_corpus_small_14g.jsonl
```
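The exact jsonl schema is determined by `trans_to_json.py`; as a rough sanity check, assuming one JSON object per line with a `text` field, the output can be inspected like this:

```python
import json

# Print the keys and a short text snippet from the first two records.
with open("clue_corpus_small_14g.jsonl", "r", encoding="utf-8") as f:
    for _ in range(2):
        record = json.loads(f.readline())
        print(list(record.keys()), str(record.get("text", ""))[:50])
```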
We now have the dataset in jsonl format. Next we convert it into training data; here ernie serves as the example.
```shell
python -u create_pretraining_data.py \
    --model_name ernie-1.0 \
    --tokenizer_name ErnieTokenizer \
    --input_path clue_corpus_small_14g.jsonl \
    --split_sentences \
    --chinese \
    --cn_whole_word_segment \
    --cn_seg_func jieba \
    --output_prefix clue_corpus_small_14g_20220104 \
    --workers 48 \
    --log_interval 10000
```
The dataset contains roughly `15702702` documents. Word segmentation is the slow part; processing completes in about an hour and produces the training data in the current directory:
```
clue_corpus_small_14g_20220104_ids.npy
clue_corpus_small_14g_20220104_idx.npz
```
You can now use this data for the pre-training task.
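As a quick sanity check on these files — a sketch assuming the usual Megatron-style layout (all token ids in one flat array, document boundaries in the index file; the exact npz keys depend on create_pretraining_data.py):

```python
import numpy as np

ids = np.load("clue_corpus_small_14g_20220104_ids.npy", mmap_mode="r")
idx = np.load("clue_corpus_small_14g_20220104_idx.npz")
print("token id array shape:", ids.shape)
print("index arrays:", idx.files)
```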
@@ -86,7 +86,7 @@ def get_args():
    group.add_argument(
        '--cn_seg_func',
        type=str,
-        default='lac',
+        default='jieba',
        choices=['lac', 'seg', 'jieba'],
        help='Words segment function for chinese words.')
    group.add_argument(
32 changes: 29 additions & 3 deletions examples/language_model/ernie-1.0/README.md
@@ -40,12 +40,12 @@ python -u -m paddle.distributed.launch \
    --use_recompute false \
    --max_lr 0.0001 \
    --min_lr 0.00001 \
-    --max_steps 4000000 \
+    --max_steps 1000000 \
    --save_steps 50000 \
    --checkpoint_steps 5000 \
-    --decay_steps 3960000 \
+    --decay_steps 990000 \
    --weight_decay 0.01 \
-    --warmup_rate 0.0025 \
+    --warmup_rate 0.01 \
    --grad_clip 1.0 \
    --logging_freq 20 \
    --num_workers 2 \
@@ -82,6 +82,32 @@
- In general, `global_batch_size = micro_batch_size * sharding_degree * dp_degree`. Gradient accumulation can be used to enlarge `global_batch_size`; when `global_batch_size` is set to an integer multiple of this theoretical value, gradient accumulation is enabled by default (see the sketch after this list).
- To resume training after an interruption, simply relaunch: the program finds the latest checkpoint and restarts training from it.
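A worked example of the batch-size relation above, with purely hypothetical values:

```python
micro_batch_size, sharding_degree, dp_degree = 64, 2, 4  # hypothetical values
theoretical = micro_batch_size * sharding_degree * dp_degree  # 64 * 2 * 4 = 512

# Setting global_batch_size to an integer multiple of the theoretical value
# implies gradient accumulation over the ratio:
global_batch_size = 1024
accumulate_steps = global_batch_size // theoretical
print(theoretical, accumulate_steps)  # 512 2
```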


### Training results on the Clue corpus small dataset
> **Member:** CLUECorpus2020 Small?
>
> **Collaborator (Author):** done

For data preparation, see the appendix of [data_tools](../data_tools/); following that document, build the clue_corpus_small_14g training dataset.
Training used this script with batch_size=512 and max_steps=1,000,000; for the detailed training log, see: https://www.paddlepaddle.org.cn/paddle/visualdl/service/app/scalar?id=b0e19e554d68b9165a55901f0eb92812

Final training loss:

|Loss | Train | Validation |
|-|-|-|
|loss |2.72 | 2.60 |
|lm_loss|2.60 | 2.50 |
|sop_loss|0.12 | 0.10 |

Training-set lm_loss is around 2.60; validation-set lm_loss is around 2.50.

Using the trained parameters, we fine-tune on downstream tasks (the static-graph parameters must first be converted to dynamic-graph ones; see the model parameter conversion section). Fine-tuning results on a few datasets:

|Dataset | Dev | Test|
|--|--|--|
|XNLI-CN | 0.79269 | 0.78339 |
|ChnSentiCorp | 0.94495 | 0.95496 |
|PeoplesDailyNer | 0.95128 | 0.94035 |
|CMRC2018 | 72.05/85.67 | - |


### Miscellaneous
#### Model parameter conversion
This example provides a static-graph training script, while Paddle is now primarily used in dynamic-graph mode. The example therefore includes a script to convert static-graph parameters into dynamic-graph parameters:
@@ -1,6 +1,6 @@
import argparse
import paddle
-from paddlenlp.transformers import AutoModel
+from paddlenlp.transformers import AutoModelForPretraining
from paddlenlp.utils.log import logger

paddle.set_device("cpu")
@@ -25,7 +25,7 @@ def init_dygraph_with_static(model, static_params_path):

def main(args):
    logger.info("Loading model: %s" % args.model)
-    model = AutoModel.from_pretrained(args.model)
+    model = AutoModelForPretraining.from_pretrained(args.model)
    logger.info("Loading static params and trans paramters...")
    model_dict = init_dygraph_with_static(model, args.path)
    save_name = args.output_path
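A hypothetical invocation of this conversion script — the script file name and paths below are illustrative assumptions, with flag names inferred from `args.model`, `args.path` and `args.output_path` in the code above:

```shell
# Hypothetical example -- substitute the real script name and checkpoint paths.
python convert_static_to_dygraph.py \
    --model ernie-1.0 \
    --path ./static_checkpoint/model_state.pdparams \
    --output_path ./ernie-1.0-dygraph.pdparams
```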
95 changes: 92 additions & 3 deletions paddlenlp/transformers/ernie/modeling.py
@@ -19,9 +19,14 @@
from .. import PretrainedModel, register_base_model

__all__ = [
-    'ErnieModel', 'ErniePretrainedModel', 'ErnieForSequenceClassification',
-    'ErnieForTokenClassification', 'ErnieForQuestionAnswering',
-    'ErnieForPretraining', 'ErniePretrainingCriterion'
+    'ErnieModel',
+    'ErniePretrainedModel',
+    'ErnieForSequenceClassification',
+    'ErnieForTokenClassification',
+    'ErnieForQuestionAnswering',
+    'ErnieForPretraining',
+    'ErniePretrainingCriterion',
+    'ErnieForMaskedLM',
]


@@ -770,3 +775,87 @@ def forward(self, prediction_scores, seq_relationship_score,
        next_sentence_loss = F.cross_entropy(
            seq_relationship_score, next_sentence_labels, reduction='none')
        return paddle.mean(masked_lm_loss), paddle.mean(next_sentence_loss)


class ErnieOnlyMLMHead(nn.Layer):
    def __init__(self, hidden_size, vocab_size, activation, embedding_weights):
        super().__init__()
        self.predictions = ErnieLMPredictionHead(
            hidden_size=hidden_size,
            vocab_size=vocab_size,
            activation=activation,
            embedding_weights=embedding_weights)

    def forward(self, sequence_output, masked_positions=None):
        prediction_scores = self.predictions(sequence_output, masked_positions)
        return prediction_scores


class ErnieForMaskedLM(ErniePretrainedModel):
    """
    Ernie Model with a `masked language modeling` head on top.

    Args:
        ernie (:class:`ErnieModel`):
            An instance of :class:`ErnieModel`.

    """

    def __init__(self, ernie):
        super(ErnieForMaskedLM, self).__init__()
        self.ernie = ernie
        self.cls = ErnieOnlyMLMHead(
            self.ernie.config["hidden_size"],
            self.ernie.config["vocab_size"],
            self.ernie.config["hidden_act"],
            embedding_weights=self.ernie.embeddings.word_embeddings.weight)

        self.apply(self.init_weights)

    def forward(self,
                input_ids,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None):
        r"""

        Args:
            input_ids (Tensor):
                See :class:`ErnieModel`.
            token_type_ids (Tensor, optional):
                See :class:`ErnieModel`.
            position_ids (Tensor, optional):
                See :class:`ErnieModel`.
            attention_mask (Tensor, optional):
                See :class:`ErnieModel`.

        Returns:
            Tensor: Returns tensor `prediction_scores`, the scores of masked token prediction.
            Its data type should be float32 and its shape is [batch_size, sequence_length, vocab_size].

        Example:
            .. code-block::

                import paddle
                from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

                tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
                model = ErnieForMaskedLM.from_pretrained('ernie-1.0')

                inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!")
                inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}

                logits = model(**inputs)
                print(logits.shape)
                # [1, 17, 18000]

        """

        outputs = self.ernie(
            input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            attention_mask=attention_mask)
        sequence_output = outputs[0]
        # Score every position (masked_positions=None), matching the documented
        # [batch_size, sequence_length, vocab_size] output shape.
        prediction_scores = self.cls(sequence_output, masked_positions=None)
        return prediction_scores