This repository contains NLP models for the Russian language, implemented in Hugging Face 🤗 Transformers.
Model | Task | Type | Tokenizer | Dict size | Num Parameters | Training Data Volume
---|---|---|---|---|---|---
ruBERT-base | mask filling | encoder | BPE | 120 138 | 178 M | 30 GB
ruBERT-large | mask filling | encoder | BPE | 120 138 | 427 M | 30 GB
ruRoBERTa-large | mask filling | encoder | BBPE | 50 257 | 355 M | 250 GB
ruT5-base | text2text generation | encoder-decoder | BPE | 32 101 | 222 M | 300 GB
ruT5-large | text2text generation | encoder-decoder | BPE | 32 101 | 737 M | 300 GB
Papers:
- ruT5 (text2text generation task): T5 paper
- ruRoBERTa (fill-mask task): RoBERTa paper
- ruBERT (fill-mask task): BERT paper
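The ruT5 models listed above handle text2text generation, but no usage example is shown for them. A minimal sketch follows; the Hugging Face id `sberbank-ai/ruT5-base` is an assumption, inferred from the `sberbank-ai/ruRoberta-large` naming used elsewhere in this README.

```python
# Sketch of ruT5 span denoising. The model id below is assumed by analogy
# with the sberbank-ai/ruRoberta-large id used in the examples in this README.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'sberbank-ai/ruT5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# T5-style models are pretrained with span corruption:
# <extra_id_0> marks a masked span for the model to fill in.
text = 'Лето было <extra_id_0>, а зима холодной.'
input_ids = tokenizer(text, return_tensors='pt').input_ids
output_ids = model.generate(input_ids, max_length=20)
result = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(result)
```

Any text2text task (summarization, translation-style rewriting, etc.) follows the same encode-generate-decode pattern after fine-tuning.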
Use the examples below to explore the models or run them on your machine. First install the dependencies:

```bash
pip install -r requirements.txt
```
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="sberbank-ai/ruRoberta-large")
unmasker("Евгений Понасенков назвал <mask> величайшим маэстро.", top_k=1)
```
```python
# ruRoberta-large example: load the model and tokenizer explicitly,
# then pass them to a fill-mask pipeline
from transformers import RobertaForMaskedLM, RobertaTokenizer, pipeline

model = RobertaForMaskedLM.from_pretrained('sberbank-ai/ruRoberta-large')
tokenizer = RobertaTokenizer.from_pretrained('sberbank-ai/ruRoberta-large')
unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("Стоит чаще писать на Хабр про <mask>.")
```
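The ruBERT models from the table support the same fill-mask task, though no example is given for them above. A minimal sketch, assuming the Hugging Face id `sberbank-ai/ruBert-base` by analogy with the ruRoberta-large id used here:

```python
# Sketch of fill-mask with ruBERT; the model id below is an assumption,
# inferred from the sberbank-ai/ruRoberta-large id used in this README.
from transformers import pipeline

unmasker = pipeline('fill-mask', model='sberbank-ai/ruBert-base')

# BERT-style tokenizers use [MASK], not RoBERTa's <mask>.
results = unmasker('Столица России называется [MASK].', top_k=3)
for r in results:
    print(r['token_str'], round(r['score'], 3))
```

Each result dict contains the filled-in token (`token_str`), its probability (`score`), and the completed sequence.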
Roberta attention visualization with bertviz `model_view`:

```python
from transformers import RobertaModel, RobertaTokenizer
from bertviz import model_view

model_version = 'sberbank-ai/ruRoberta-large'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)

sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
attention = model(input_ids)[-1]  # attentions are the last output when output_attentions=True
input_id_list = input_ids[0].tolist()  # batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
model_view(attention, tokens)
```