BETO: Spanish BERT

BETO is a BERT model trained on a large Spanish corpus. It is similar in size to BERT-Base and was trained with the Whole Word Masking technique. Below you will find TensorFlow and PyTorch checkpoints for the uncased and cased versions, along with results on Spanish benchmarks comparing BETO with Multilingual BERT and with other (non-BERT-based) models.

Download

Model	HuggingFace Model Repository
BETO uncased	dccuchile/bert-base-spanish-wwm-uncased
BETO cased	dccuchile/bert-base-spanish-wwm-cased

All models use a vocabulary of about 31k BPE subwords constructed using SentencePiece and were trained for 2M steps.
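To see the subword vocabulary in practice, the sketch below loads a BETO tokenizer through the Transformers library and returns its vocabulary size together with the subwords produced for a sentence. The helper names `beto_model_id` and `inspect_beto_tokenizer` are hypothetical and introduced only for illustration. Passing `do_lower_case` explicitly is a precaution, an assumption on our part that the setting should match the checkpoint's casing.

```python
# Repository names taken from the Download table above.
BETO_MODELS = {
    True: "dccuchile/bert-base-spanish-wwm-cased",
    False: "dccuchile/bert-base-spanish-wwm-uncased",
}


def beto_model_id(cased):
    """Return the Hugging Face repository name for the cased or uncased BETO."""
    return BETO_MODELS[bool(cased)]


def inspect_beto_tokenizer(text, cased=False):
    """Return (vocabulary size, subword tokens of `text`) for a BETO tokenizer."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoTokenizer

    # Downloads the tokenizer files on first use and caches them locally.
    tokenizer = AutoTokenizer.from_pretrained(
        beto_model_id(cased),
        do_lower_case=not cased,  # precaution: keep casing behaviour explicit
    )
    return tokenizer.vocab_size, tokenizer.tokenize(text)
```

Calling `inspect_beto_tokenizer("Me gusta aprender idiomas.")` requires the `transformers` package and network access on the first run.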

Benchmarks

The following table shows BETO results on the Spanish version of each task. We compare BETO (cased and uncased) with the best Multilingual BERT results that we found in the literature (as of October 2019). The table also lists some alternative methods for the same tasks (not necessarily BERT-based). References for all methods are listed in the References section.

Task	BETO-cased	BETO-uncased	Best Multilingual BERT	Other results
POS	98.97	98.44	97.10 [2]	98.91 [6], 96.71 [3]
NER-C	88.43	82.67	87.38 [2]	87.18 [3]
MLDoc	95.60	96.12	95.70 [2]	88.75 [4]
PAWS-X	89.05	89.55	90.70 [8]
XNLI	82.01	80.15	78.50 [2]	80.80 [5], 77.80 [1], 73.15 [4]
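As a reading aid for the table, the following sketch computes, for each task, the margin between the better of the two BETO variants and the best Multilingual BERT result. The scores are copied from the table; the function name `margins` is hypothetical. On these numbers BETO leads on four of the five tasks and trails on PAWS-X.

```python
# Scores copied from the table: (BETO-cased, BETO-uncased, best Multilingual BERT).
SCORES = {
    "POS": (98.97, 98.44, 97.10),
    "NER-C": (88.43, 82.67, 87.38),
    "MLDoc": (95.60, 96.12, 95.70),
    "PAWS-X": (89.05, 89.55, 90.70),
    "XNLI": (82.01, 80.15, 78.50),
}


def margins(scores):
    """Per task: best BETO score minus best Multilingual BERT score."""
    return {
        task: round(max(cased, uncased) - mbert, 2)
        for task, (cased, uncased, mbert) in scores.items()
    }


for task, delta in margins(SCORES).items():
    # A positive margin means BETO is ahead on that task.
    print(f"{task:7s} {delta:+.2f}")
```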

Example of use

For further details on how to use BETO, see the 🤗 Hugging Face Transformers library documentation, starting with the Quickstart section. The BETO models can be loaded through the Transformers library as 'dccuchile/bert-base-spanish-wwm-cased' and 'dccuchile/bert-base-spanish-wwm-uncased'. An example of how to use the models on this page can be found in this colab notebook.
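A minimal sketch of masked-word prediction with BETO, using the `fill-mask` pipeline from the Transformers library. This is not taken from the colab notebook; the helper names `build_masked_text` and `predict_masked_word` and the `{mask}` placeholder convention are assumptions made for illustration.

```python
BETO_CASED = "dccuchile/bert-base-spanish-wwm-cased"


def build_masked_text(template, mask_token):
    """Replace the `{mask}` placeholder in `template` with the model's mask token."""
    return template.format(mask=mask_token)


def predict_masked_word(template, model_id=BETO_CASED, top_k=3):
    """Return the `top_k` (token, score) predictions for the masked position."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import pipeline

    # Downloads the model weights on first use and caches them locally.
    fill_mask = pipeline("fill-mask", model=model_id)
    text = build_masked_text(template, fill_mask.tokenizer.mask_token)
    predictions = fill_mask(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For example, `predict_masked_word("Santiago es la capital de {mask}.")` returns the three highest-scoring candidate tokens with their scores. It needs the `transformers` package, a backend such as PyTorch, and network access on the first run.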

Acknowledgments

We thank Adereso for kindly providing support for training BETO-uncased, and the Millennium Institute for Foundational Research on Data for providing support for training BETO-cased. We also thank Google for its help through the TensorFlow Research Cloud program.

Citation

Spanish Pre-Trained BERT Model and Evaluation Data

To cite this resource in a publication please use the following:

@inproceedings{CaneteCFP2020,
  title={Spanish Pre-Trained BERT Model and Evaluation Data},
  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
  booktitle={PML4DC at ICLR 2020},
  year={2020}
}

License Disclaimer

The CC BY 4.0 license best describes our intentions for our work. However, we are not sure that all the datasets used to train BETO have licenses compatible with CC BY 4.0 (especially for commercial use). Please use at your own discretion and verify that the licenses of the original text resources match your needs.
