---
language: es
tags:
- causal-lm
- text-generation
datasets:
- oscar
---

GPT-2 Spanish

GPT-2 model pre-trained from scratch using the Spanish portion of OSCAR during the Flax x Hugging Face community event by @mariagrandury, @mrm8488, @pablogps, @daveni, @srisweet, @jdposa, @shpotes, and @jorgealro.

Model description

The model used for training is OpenAI's GPT-2, introduced in the paper "Language Models are Unsupervised Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.

This model is available on the 🤗 Model Hub.

Intended uses & limitations

How to use (TODO)

Limitations and bias (TODO)

Training data

The model was trained on the Spanish portion of OSCAR (Open Super-large Crawled ALMAnaCH coRpus), a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

This corpus is available in the 🤗 Datasets library.

Training procedure (TODO)

Eval results (TODO)
