This project contains scripts and configs for building a Russian punctuation/capitalization dataset from transcriptions, training a BERT model on that dataset, and running inference examples with the trained model.
An instance of the already collected dataset is located here: https://huggingface.co/datasets/denis-berezutskiy-lad/ru_transcription_punctuation
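For a quick look at the data, the dataset can presumably be pulled straight from the Hub. This is a minimal sketch assuming the repo is loadable with the standard `datasets` API; if it only ships raw text/label files, `huggingface_hub.snapshot_download` may be needed instead:

```python
# Minimal sketch: pull the published dataset from the Hugging Face Hub.
# Assumption: the repo works with the standard `datasets` loader.
from datasets import load_dataset

ds = load_dataset("denis-berezutskiy-lad/ru_transcription_punctuation")
print(ds)  # inspect the available splits and columns
```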
An instance of the already trained model is located here: https://huggingface.co/denis-berezutskiy-lad/lad_transcription_bert_ru_punctuator
The underlying base model is https://huggingface.co/DeepPavlov/rubert-base-cased-conversational
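As a hedged sketch of how the trained punctuator above might be loaded for inference, assuming the model repo ships a `.nemo` checkpoint (the file name below is hypothetical; check the repo's file list) and the NeMo revision mentioned below is installed:

```python
# Hedged sketch: load the trained checkpoint with NeMo and run inference.
# Assumptions: the HF repo contains a .nemo checkpoint (file name below is
# hypothetical) and a compatible NeMo revision is installed.
from huggingface_hub import hf_hub_download
from nemo.collections.nlp.models import PunctuationCapitalizationModel

ckpt = hf_hub_download(
    repo_id="denis-berezutskiy-lad/lad_transcription_bert_ru_punctuator",
    filename="punctuator.nemo",  # hypothetical file name
)
model = PunctuationCapitalizationModel.restore_from(ckpt)

# add_punctuation_capitalization() is NeMo's standard inference entry point;
# the custom labels (-, —, T) still need the extra handling described below.
print(model.add_punctuation_capitalization(["как дела у тебя сегодня"]))
```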
To train the model, the NeMo script collection has to be used: https://github.com/NVIDIA/NeMo
Please note that the NeMo revision at the time of training was fcfc0ebb23b428a9bee6d847d1e0b37ca0784ba5; a later revision may not be compatible with the required library versions (and vice versa).
Please also note that torchmetrics is pinned to an older version, because the latest version is not compatible with the mentioned NeMo revision.
The idea behind the project is to train on long, continuous, professionally produced transcriptions rather than on short low-quality samples of 1-2 sentences (typical of the most popular Russian datasets). Our experiments show significant improvements compared to BERTs trained on the standard Russian datasets (social media comments, Omnia Russica, etc.). That is why we mainly use transcriptions published by Russian legislatures (Gosduma, Mosgorduma), with some film subtitles from the OpenSubtitles project added.
Please note that some of the new labels are not supported by the NeMo scripts out of the box (-, —, T), so special handling has to be added for them. See the inference notebook for details; a simplified sketch of how the labels are applied follows the label lists below.
Punctuation labels: O , . ? ! : ; … ⁈ - —
Capitalization labels: O U T
(T denotes an abbreviation: the whole word is uppercase)
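For illustration, NeMo's format assigns each word a two-character label: first the punctuation mark that follows the word (O for none), then the capitalization class. The sketch below shows how such predicted labels could be applied to raw words, including the custom T class; the sentence, labels, and helper function are made up for illustration and are not the actual code from the inference notebook:

```python
# Illustrative post-processing of word-level predictions (not the actual
# notebook code). Label format (NeMo convention, extended by this project):
#   1st char = punctuation that follows the word (O = none),
#   2nd char = capitalization: O = keep lowercase, U = capitalize first letter,
#              T = uppercase the whole word (abbreviations such as "СССР").
def apply_labels(words, labels):
    restored = []
    for word, label in zip(words, labels):
        punct, capit = label[0], label[1]
        if capit == "U":
            word = word.capitalize()
        elif capit == "T":  # custom "total uppercase" label
            word = word.upper()
        if punct != "O":
            word += punct
        restored.append(word)
    return " ".join(restored)

# Made-up example: restoring "Делегация СССР прибыла в Москву."
words = ["делегация", "ссср", "прибыла", "в", "москву"]
labels = ["OU", "OT", "OO", "OO", ".U"]
print(apply_labels(words, labels))  # -> Делегация СССР прибыла в Москву.
```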