About This Fork

This is a fork of gpt2-ml, a wonderful project that is no longer maintained. Hope @imcaspar is doing well. This fork fixes some download links and makes the pretrained model persistent, so you don't need to download the pretrained file every time.
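
The "download once, reuse afterwards" idea can be sketched roughly as follows. This is only an illustration, not the notebook's actual code: the function name `ensure_pretrained` and the `download` callable are hypothetical, and the real demo may persist the file differently.

```python
import os


def ensure_pretrained(local_path, download):
    """Fetch the checkpoint only if it is not already stored at local_path.

    `download` is a hypothetical callable that writes the checkpoint to the
    path it is given. Returns True if a download happened, False otherwise.
    """
    if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
        return False  # already cached, nothing to download

    parent = os.path.dirname(local_path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    # Write to a temporary name first so an interrupted download is never
    # mistaken for a complete checkpoint on the next run.
    tmp_path = local_path + ".part"
    download(tmp_path)
    os.replace(tmp_path, local_path)
    return True
```

Pointing `local_path` at storage that survives between sessions is what makes the second and later runs skip the download.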

Credit

GPT2 for Multiple Languages

Try it now: Open In Colab
If it fails to run, check the dependencies: Check the Dependency Versions of Your Colab

Chinese README (中文说明) | English

  • Simplified GPT2 training scripts (based on Grover, supporting TPUs)
  • Ported BERT tokenizer, compatible with multilingual corpora
  • 1.5B GPT2 pretrained Chinese model (~15G corpus, 100,000 steps)
  • Batteries-included Colab demo
  • 1.5B GPT2 pretrained Chinese model (~30G corpus, 220,000 steps)

Pretrained Model

| Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |
| --- | --- | --- | --- | --- | --- | --- |
| 1.5B Params | Chinese | ~30G | CLUE (8021 tokens) | Google Drive | Baidu Pan (ffz6) | e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482 |
| 1.5B Params | Chinese | ~15G | Bert (21128 tokens) | Google Drive | Baidu Pan (q9vr) | 4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e |
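
After downloading, the file can be checked against the SHA256 column in the table above. The snippet below is a minimal sketch using only the standard library. It assumes the published hash covers the downloaded file exactly as served, and the dictionary keys (`clue-30g`, `bert-15g`) are labels made up here for illustration.

```python
import hashlib

# SHA256 values copied from the table above; the keys are illustrative labels.
EXPECTED_SHA256 = {
    "clue-30g": "e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482",
    "bert-15g": "4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a multi-gigabyte checkpoint
    never has to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_hex):
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `verify_checkpoint(downloaded_path, EXPECTED_SHA256["clue-30g"])` should return `True` for an intact copy of the ~30G-corpus model, where `downloaded_path` is wherever you saved the file.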

Corpus from THUCNews and nlp_chinese_corpus

Trained for 220,000 steps on a Cloud TPU Pod v3-256

[Figure: training loss curve]

Google Colab

Because Google has reduced free-tier GPU performance on Colab, the Colab demo may stall or stop responding unless you have a paid account. Paid accounts are billed by Google; neither I nor any of the contributors receive any money from them, since this is a completely free and open-source project.

With just 2 clicks (not including Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:

[Colab Notebook]

Train

Disclaimer

The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.

Citation

@misc{GPT2-ML,
  author = {Zhibo Zhang and zxkmm},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019, 2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}

Reference

https://github.com/google-research/bert

https://github.com/rowanz/grover

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

Press

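[Synced (机器之心)] With just three clicks, let Chinese GPT-2 generate a custom story for you

[Scientific Spaces (科学空间)] You can now play with Chinese GPT2 in Keras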