Welcome to the Model Zoo!

Here you can find NLP models for Russian, implemented with HF transformers 🤗.

See Examples In Colab!

Models:

| Model | Task | Type | Tokenizer | Dict size | Num parameters | Training data volume |
|---|---|---|---|---|---|---|
| ruBERT-base | mask filling | encoder | BPE | 120 138 | 178 M | 30 GB |
| ruBERT-large | mask filling | encoder | BPE | 120 138 | 427 M | 30 GB |
| ruRoBERTa-large | mask filling | encoder | BBPE | 50 257 | 355 M | 250 GB |
| ruT5-base | text2text generation | encoder-decoder | BPE | 32 101 | 222 M | 300 GB |
| ruT5-large | text2text generation | encoder-decoder | BPE | 32 101 | 737 M | 300 GB |

ruT5

Text2text generation task. See the T5 paper.

Model parameters are listed in the table above.
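
The usage examples below only cover fill-mask, so here is a minimal text2text sketch for ruT5. This is a hedged illustration, not the authors' recipe: it assumes the checkpoint exposes the standard T5 interface in transformers, the "sberbank-ai/ruT5-base" id mirrors the naming used in the examples below, and the prompt is arbitrary.

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Hedged sketch: standard transformers T5 API; the checkpoint id follows the
# "sberbank-ai/..." naming used elsewhere in this README.
model_name = 'sberbank-ai/ruT5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The released checkpoint is pretrained on a denoising objective and is
# normally fine-tuned before use; this call only demonstrates the interface.
input_ids = tokenizer("Стоит чаще писать на Хабр про нейросети.", return_tensors='pt').input_ids
outputs = model.generate(input_ids, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))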

ruRoBERTa

Fill-mask task. See the RoBERTa paper.

ruBERT

Fill-mask task. See the BERT paper.
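
The examples below all load ruRoBERTa; for the ruBERT checkpoints the same fill-mask pipeline applies, except that BERT tokenizers use the [MASK] token instead of RoBERTa's <mask>. A minimal sketch (the "sberbank-ai/ruBert-base" id follows the naming convention used below; the prompt is illustrative):

from transformers import pipeline

# Hedged sketch: same fill-mask pipeline as below, with a BERT-style [MASK] token.
unmasker = pipeline("fill-mask", model="sberbank-ai/ruBert-base")
unmasker("Пушкин родился в [MASK].", top_k=1)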

How to:

Use this Colab notebook to explore the models, or run them on your own machine.

Model setup:

pip install -r requirements.txt

Pipeline usage

from transformers import pipeline

# The fill-mask pipeline returns the top-k candidates for the <mask> slot,
# each as a dict with the filled sequence, token, and score.
unmasker = pipeline("fill-mask", model="sberbank-ai/ruRoberta-large")
unmasker("Евгений Понасенков назвал <mask> величайшим маэстро.", top_k=1)

Classical usage

# ruRoberta-large example
from transformers import RobertaForMaskedLM, RobertaTokenizer, pipeline

# Load the model and tokenizer explicitly, then hand them to the pipeline.
model = RobertaForMaskedLM.from_pretrained('sberbank-ai/ruRoberta-large')
tokenizer = RobertaTokenizer.from_pretrained('sberbank-ai/ruRoberta-large')

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("Стоит чаще писать на Хабр про <mask>.")

Use BertViz to obtain attention visualizations

Roberta model_view:


from transformers import RobertaModel, RobertaTokenizer
from bertviz import model_view

model_version = 'sberbank-ai/ruRoberta-large'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)

sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
attention = model(input_ids)[-1]  # with output_attentions=True, attentions are the last output
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
model_view(attention, tokens)
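
model_view renders the full layer-by-head grid inside a Jupyter or Colab notebook. BertViz also ships head_view for a single interactive attention map; a minimal follow-up reusing the variables computed above:

from bertviz import head_view

# Renders one interactive attention map from the attentions computed above.
head_view(attention, tokens)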