[Pre-Training] Add tutorial for clue small 14g dataset #1555
Changes from 7 commits
@@ -0,0 +1,48 @@
# Detailed description
These weights were obtained by following the ERNIE pre-training tutorial provided by PaddleNLP and training on the CLUE corpus small 14g dataset.

The model architecture is identical to ernie-1.0; training used batch_size=512 and max_steps=1,000,000. The weights are used in exactly the same way as the original ernie-1.0 weights.

For the full pre-training pipeline, see: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/ernie-1.0/README.md
# Usage examples

Example 1:
```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
model = ErnieForMaskedLM.from_pretrained('zhui/cluecorpussmall_ernie-1.0')

tokens = ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
masked_ids = paddle.to_tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = paddle.to_tensor([[0] * len(tokens)])

# The model returns prediction scores over the vocabulary for every position.
prediction_scores = model(masked_ids, token_type_ids=segment_ids)
# Pick the most likely token for the [MASK] position (index 3).
prediction_index = paddle.argmax(prediction_scores[0, 3]).item()
predicted_token = tokenizer.convert_ids_to_tokens([prediction_index])[0]
print(tokens)
# ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
print(predicted_token)
# 猫
```

Review comment (on the checkpoint name `zhui/cluecorpussmall_ernie-1.0`): Rename it to ernie-1.0-cluecorpus2020? Please double-check whether the official name of the corpus used is CLUECorpus2020.
Reply: https://github.com/CLUEbenchmark/CLUECorpus2020. I will change it to …
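As a small extension of Example 1, the prediction scores can also be ranked to show several candidate tokens for the `[MASK]` position instead of only the argmax. This is a minimal sketch using the same checkpoint; the choice of `k=5` is arbitrary:

```python
import paddle
from paddlenlp.transformers import ErnieForMaskedLM, ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
model = ErnieForMaskedLM.from_pretrained('zhui/cluecorpussmall_ernie-1.0')

tokens = ['[CLS]', '我', '的', '[MASK]', '很', '可', '爱', '。', '[SEP]']
masked_ids = paddle.to_tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = paddle.to_tensor([[0] * len(tokens)])

# Scores over the vocabulary for every position; position 3 is the [MASK] token.
prediction_scores = model(masked_ids, token_type_ids=segment_ids)

# Rank the five highest-scoring candidates for the masked position.
topk_scores, topk_ids = paddle.topk(prediction_scores[0, 3], k=5)
print(tokenizer.convert_ids_to_tokens(topk_ids.tolist()))
```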
Example 2:
```python
import paddle
from paddlenlp.transformers import (
    AutoModel,
    AutoModelForQuestionAnswering,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoTokenizer,
)

tokenizer = AutoTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
text = tokenizer('自然语言处理')

# Semantic representation
model = AutoModel.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text classification & sentence-pair matching
model = AutoModelForSequenceClassification.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
# Sequence labeling
model = AutoModelForTokenClassification.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
# Question answering
model = AutoModelForQuestionAnswering.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
```
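As a follow-up to Example 2, the sketch below runs the sequence-classification head on a sentence pair. The classification head is newly initialized until the model is fine-tuned, so the resulting probabilities are only illustrative; `num_classes=2` and the example sentences are arbitrary choices:

```python
import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('zhui/cluecorpussmall_ernie-1.0')
# The classification head is randomly initialized here; it only becomes
# meaningful after fine-tuning on a labeled dataset.
model = AutoModelForSequenceClassification.from_pretrained(
    'zhui/cluecorpussmall_ernie-1.0', num_classes=2)

# Encode a sentence pair for text matching.
encoded = tokenizer('今天天气不错', text_pair='今天天气很好')
input_ids = paddle.to_tensor([encoded['input_ids']])
token_type_ids = paddle.to_tensor([encoded['token_type_ids']])

logits = model(input_ids, token_type_ids=token_type_ids)
print(F.softmax(logits, axis=-1))  # shape [1, num_classes]
```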
@@ -0,0 +1,6 @@
{
  "model_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/model_config.json",
  "model_state": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/model_state.pdparams",
  "tokenizer_config_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/tokenizer_config.json",
  "vocab_file": "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0/vocab.txt"
}
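For reference, these four files are exactly what `from_pretrained` needs when pointed at a local directory. A hedged sketch of downloading them manually is shown below; the local directory name is arbitrary, and in normal use simply passing the community name 'zhui/cluecorpussmall_ernie-1.0' downloads them automatically:

```python
import os
import urllib.request

from paddlenlp.transformers import ErnieModel, ErnieTokenizer

BASE = "https://bj.bcebos.com/paddlenlp/models/transformers/community/zhui/cluecorpussmall_ernie-1.0"
FILES = ["model_config.json", "model_state.pdparams", "tokenizer_config.json", "vocab.txt"]

local_dir = "./cluecorpussmall_ernie-1.0"  # arbitrary local directory
os.makedirs(local_dir, exist_ok=True)
for name in FILES:
    urllib.request.urlretrieve(f"{BASE}/{name}", os.path.join(local_dir, name))

# Loading from the local directory is equivalent to loading by community name.
model = ErnieModel.from_pretrained(local_dir)
tokenizer = ErnieTokenizer.from_pretrained(local_dir)
```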
@@ -40,12 +40,12 @@ python -u -m paddle.distributed.launch \
     --use_recompute false \
     --max_lr 0.0001 \
     --min_lr 0.00001 \
-    --max_steps 4000000 \
+    --max_steps 1000000 \
     --save_steps 50000 \
     --checkpoint_steps 5000 \
-    --decay_steps 3960000 \
+    --decay_steps 990000 \
     --weight_decay 0.01 \
-    --warmup_rate 0.0025 \
+    --warmup_rate 0.01 \
     --grad_clip 1.0 \
     --logging_freq 20 \
     --num_workers 2 \
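For intuition, the updated flags describe a warmup-then-decay learning-rate curve. The sketch below rebuilds an equivalent schedule with Paddle's built-in schedulers, under the assumption that `warmup_rate` is interpreted as a fraction of `max_steps` and that the decay is linear from `max_lr` to `min_lr` over `decay_steps`; the training script's actual scheduler may differ in detail:

```python
import paddle

# Values from the updated flags above.
max_lr, min_lr = 0.0001, 0.00001
max_steps, decay_steps, warmup_rate = 1000000, 990000, 0.01
warmup_steps = int(warmup_rate * max_steps)  # 10,000 steps under this assumption

# Linear decay from max_lr down to min_lr over decay_steps ...
decay = paddle.optimizer.lr.PolynomialDecay(
    learning_rate=max_lr, decay_steps=decay_steps, end_lr=min_lr, power=1.0)
# ... preceded by a linear warmup from 0 to max_lr.
lr_scheduler = paddle.optimizer.lr.LinearWarmup(
    learning_rate=decay, warmup_steps=warmup_steps, start_lr=0.0, end_lr=max_lr)

# In a training loop, call lr_scheduler.step() once per optimizer step.
for _ in range(warmup_steps):
    lr_scheduler.step()
print(lr_scheduler.get_lr())  # roughly max_lr right after warmup completes
```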
@@ -82,6 +82,32 @@ python -u -m paddle.distributed.launch \
- In general, `global_batch_size = micro_batch_size * sharding_degree * dp_degree`. Gradient accumulation can be used to increase `global_batch_size`; when `global_batch_size` is set to an integer multiple of this theoretical value, gradient accumulation is enabled automatically (see the short sketch after this list).
- To resume training from a checkpoint, simply relaunch the script; the program finds the latest checkpoint and restarts training from it.
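A short worked example of the `global_batch_size` note above, with illustrative numbers rather than the tutorial's defaults:

```python
# Illustrative values only.
micro_batch_size = 16
sharding_degree = 8
dp_degree = 4

theoretical = micro_batch_size * sharding_degree * dp_degree  # 16 * 8 * 4 = 512
global_batch_size = 1024                                       # an integer multiple of 512

# The multiple is realized through gradient accumulation.
accumulate_steps = global_batch_size // theoretical
print(theoretical, accumulate_steps)  # 512 2
```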
### Training results on the Clue corpus small dataset

Review comment (on this heading): CLUECorpus2020 Small?
Reply: done

For data preparation, refer to the appendix of [data_tools](../data_tools/) and build the clue_corpus_small_14g training dataset as described there.
Training with this script used batch_size=512 and max_steps=1,000,000; the full training logs are available at: https://www.paddlepaddle.org.cn/paddle/visualdl/service/app/scalar?id=b0e19e554d68b9165a55901f0eb92812
Final training loss:

| Loss     | Train | Validation |
|----------|-------|------------|
| loss     | 2.72  | 2.60       |
| lm_loss  | 2.60  | 2.50       |
| sop_loss | 0.12  | 0.10       |
The training-set lm_loss is around 2.60 and the validation-set lm_loss is around 2.50.

Using the trained parameters, the model can be fine-tuned on downstream tasks (the static-graph parameters must first be converted to dynamic-graph format; see the parameter conversion section). Fine-tuning results on a few datasets are reported below:
| Dataset         | Dev         | Test    |
|-----------------|-------------|---------|
| XNLI-CN         | 0.79269     | 0.78339 |
| ChnSentiCorp    | 0.94495     | 0.95496 |
| PeoplesDailyNer | 0.95128     | 0.94035 |
| CMRC2018        | 72.05/85.67 | -       |
### Other
#### Model parameter conversion
This example provides static-graph training scripts, but dynamic graph is now the primary way to use Paddle. The example therefore also provides a script that converts static-graph parameters to dynamic-graph parameters:
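The conversion script itself is not shown in this diff excerpt. As an illustration only, the sketch below maps a static-graph parameter file onto the dygraph ERNIE `state_dict` by position; the path is hypothetical, the position-based mapping is an assumption that should be verified against shapes, and the repository's own conversion script should be preferred:

```python
import paddle
from paddlenlp.transformers import ErnieModel

# Hypothetical path to parameters saved by the static-graph training script.
static_state = paddle.load("./output/model_last/static_vars.pdparams", return_numpy=True)

model = ErnieModel.from_pretrained("ernie-1.0")
dygraph_state = model.state_dict()

# Assumption: parameters appear in the same order in both graphs.
# Verify shapes before trusting this mapping; a mismatch means the order differs.
converted = {}
for dy_key, (st_key, value) in zip(dygraph_state.keys(), static_state.items()):
    assert tuple(dygraph_state[dy_key].shape) == tuple(value.shape), (dy_key, st_key)
    converted[dy_key] = paddle.to_tensor(value)

paddle.save(converted, "model_state.pdparams")
```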
Review comment: ernie -> ERNIE. The documentation should distinguish the model's official name from the API parameter name; the official name is ERNIE/ERNIE-1.0. Likewise for "clue corpus small 14g": use the official name.
Reply: done