Baseline model, assuming everything works here :)
Also good for seeing how much time training actually takes with HF
- BERT WordPiece tokenizer with proper pre-tokenization
- trained on Latin OSCAR + Wikipedia
- training script based on run_mlm.py
- maybe add DeepSpeed, sbatch scripts, etc.
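A minimal sketch of the tokenizer step, using `BertWordPieceTokenizer` from the Hugging Face `tokenizers` library, which bundles BERT-style normalization with whitespace/punctuation pre-tokenization. The corpus file, vocab size, and output path here are placeholders, not the actual training setup:

```python
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

# Tiny stand-in corpus; the real run would point at the Latin
# OSCAR + Wikipedia text files.
Path("corpus.txt").write_text("Gallia est omnis divisa in partes tres.\n" * 100)

# Lowercasing plus BERT pre-tokenization (split on whitespace/punctuation)
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=1000,       # placeholder; a real vocab would be larger
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

enc = tokenizer.encode("Gallia est omnis divisa")
print(enc.tokens)
```

The saved vocab (`tokenizer.save_model(...)`) can then be passed to the training script via `--tokenizer_name`.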
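For the "maybe" items, a rough sketch of what an sbatch script launching run_mlm.py through the DeepSpeed launcher could look like; the job name, resource numbers, file paths, and the `ds_config.json` are all placeholders to be filled in for the actual cluster:

```shell
#!/bin/bash
#SBATCH --job-name=latin-bert-mlm   # placeholder job name
#SBATCH --gres=gpu:4                # placeholder resources
#SBATCH --time=48:00:00

# DeepSpeed launcher spawns one process per GPU; run_mlm.py picks up
# the --deepspeed flag via its TrainingArguments.
deepspeed run_mlm.py \
    --model_type bert \
    --tokenizer_name ./tokenizer_out \
    --train_file ./latin_corpus.txt \
    --do_train \
    --output_dir ./latin-bert-baseline \
    --deepspeed ds_config.json
```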