A curated list of text libraries, tools, and datasets for the Persian language.
In each section, tools, libraries, models, and datasets related to the main topic of that section are listed.
- Multi-purpose libs
- Grapheme to Phoneme
- Word Analyzing
- Sentiment Analysis
- Informal Persian
- Numbers <> Words
- Embeddings
- Benchmark
- QA
- Dependency Parsing
- Entailment
- Datasets (classification)
- NER
- Unlabeled and Raw Text
- Toxic Text
- Stop Word List
- Spell Checking
- Normalization
- Transliteration
- Encyclopedia and Word Set
- Poetry and Literature
- Audio Dataset
- Crawl Suite
- POS Tagging
- Various
- Base Models
- Mocking
- UI/UX
- OCR
- Spam
- Image Captioning
- Translation
- Knowledge Graph
- Summary
- Paraphrase
- WSD
- Generation
A Language Processing Toolkit for Persian
- Normalizer / Tokenizer (sentences / words)
- Stemmer
- POS Tagger
- Chunker
- Dependency Parser
- Spell Checker
Persian NLP Toolkit
- Normalizer / Tokenizer
- Lemmatizer
- POS Tagger
- Chunker
- Dependency Parser
- Word / Sentence Embedding
- Different Corpora reader
The all-in-one AI library for Persian, supporting a wide variety of tasks and modalities!
- POS Tagger
- Text Classification (sentiment analysis, categorization, etc)
- Sequence Labeling (POS, NER, etc.)
- Mask Filling
- Speech Recognition
- Text Detection
- Image to Text (OCR)
- Image to Text (License Plate Recognition)
- Image to Text (Image Captioning)
- Word Embeddings
- FastText
- Word2Vec (Skip-gram)
- Word2Vec (CBOW)
- Datasets
Multilingual text (NLP) processing toolkit. Consists of some useful Persian functionalities:
- Tokenizer (Sentence / Word)
- Named Entity Recognition
- Morpheme Extractor
- Language Detector
A tool for translating Persian text to IPA (International Phonetic Alphabet).
A Grapheme to Phoneme model using LSTM, implemented in PyTorch
Persian Grapheme-to-Phoneme (G2P) converter
List of Persian word pronunciations
A convolutional sequence-to-sequence model based on Tachibana et al., with modifications. This repo consists of notebooks for training and inference and provides suitable datasets for both.
Persian Grapheme-to-Phoneme (G2P) converter
The G2P algorithm is used to generate the most probable pronunciation for a word not contained in the lexicon dictionary. It can be used as a preprocessing step in a text-to-speech system to generate pronunciations for OOV words.
Tihu-dict is a pronouncing dictionary of Persian.
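As a rough illustration of the lexicon-plus-fallback flow described above for OOV words, a minimal Python sketch might look like the following. The lexicon entry, the `TOY_LETTER_MAP` table, the romanized phoneme strings and the function names are all hypothetical; none of them come from the listed resources. The letter-by-letter fallback is only a stand-in for a trained G2P model, and it cannot recover the short vowels that Persian script leaves unwritten.

```python
from typing import Callable, Dict

# Hypothetical pronouncing lexicon (word -> phoneme string), for illustration only.
TOY_LEXICON: Dict[str, str] = {"سلام": "salAm"}

# Hypothetical letter-to-sound table standing in for a trained G2P model.
TOY_LETTER_MAP: Dict[str, str] = {
    "ب": "b", "د": "d", "ر": "r", "م": "m", "س": "s", "ل": "l", "ا": "A",
}


def toy_g2p(word: str) -> str:
    """Naive fallback: map each known letter and silently drop unknown ones."""
    return "".join(TOY_LETTER_MAP.get(ch, "") for ch in word)


def pronounce(
    word: str,
    lexicon: Dict[str, str] = TOY_LEXICON,
    fallback: Callable[[str], str] = toy_g2p,
) -> str:
    """Return the lexicon pronunciation if listed, else ask the G2P fallback."""
    if word in lexicon:
        return lexicon[word]
    return fallback(word)
```

Calling `pronounce` with an empty lexicon forces every word through the fallback, which makes the difference between the two paths easy to see.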
Informal and Formal Persian word analyzer (inflection with FST)
This dataset includes 45300 Persian word forms which are manually segmented into sequences of morphemes.
Universal Derivations (UDer) is a collection of harmonized lexical networks capturing word-formation, especially derivation, in a cross-linguistically consistent annotation scheme for many languages including Persian (semi-automatically). Consists of 7k families, 43k lexemes and 35k relations. Article. Dataset files.
A morpheme extractor for 135 languages including Persian.
PARSEME is a verbal multiword expressions (VMWEs) corpus for Farsi. All the annotated data come from a subset of the Farsi section of the MULTEXT-East "1984" annotated corpus 4.0. In addition to the LEMMA, UPOS, XPOS, FEATS, HEAD and DEPREL columns, there is also a PARSEME:MWE column, which is manually annotated.
Universal Segmentations (UniSegments) is a collection of lexical resources capturing morphological segmentations harmonised into a cross-linguistically consistent annotation scheme for many languages including Persian. The annotation scheme consists of simple tab-separated columns that store a word and its morphological segmentations, including pieces of information about the word and the segmented units, e.g., part-of-speech categories, type of morphs/morphemes etc. It also has a Python library for creating such data from text. This dataset consists of 45k Persian words.
Persian stemmer and morphological analyzer
Consists of two stemming sets: 1) 4k words from Bootstrapping the Development of an HPSG-based Treebank for Persian and 2) 27k words from A syntactic valency lexicon for Persian verbs: The first steps towards Persian dependency treebank.
A stemmer for Persian based on A new hybrid stemming method for persian language
Awesome Persian Sentiment Analysis Resources - resources related to sentiment analysis in the Persian language
- Consists of following datasets:
- Deep Neural Networks in Persian Sentiment Analysis
- Sentiment Analysis Challenges
- Sentiment Lexicon
- Sentiment Tagged Corpus (dataset)
- HesNegar: Persian Sentiment WordNet
Consists of data (3K) and code (notebook) to create an LSTM model for Sentiment Analysis.
Sentiment analysis using ML and DL models on Persian texts
A Sentiment Analysis Lexicon for Persian. Consists of 4k words
Persian book comment ratings dataset. Consists of about 70k comments about 11k books.
The Digikala (comments & products) dataset offers a comprehensive glimpse into the vast online marketplace of Digikala, comprising over 1.2 million products and more than 6 million comments.
3k comments with score and ratings.
93k Digikala product comments with manual labeling.
20k tweets with emotion identification labels.
A Dataset of 30,000 emotion labeled Persian Tweets.
Consists of 5.56K tweets with labels (sadness, anger, happiness, hatred, wonder and fear) describing their emotions.
Consists of 7k docs with 6 emotion label types (sadness, anger, happiness, hatred, wonder, fear).
Snappfood (an online food delivery company) user comments containing 70,000 comments with two labels (i.e. polarity classification): Happy, Sad.
It is the Persian translation of the NRC Emotion Lexicon, which is a list of English words with their associated basic emotions in eight categories (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust).
Consists of 10k samples, each focusing on one aspect (e.g. camera, screen resolution, etc.) of a comment about a cell phone. A comment may appear in more than one sample, depending on the number of aspects it covers.
Consists of 1500 words with their degrees of polarity.
Utilizes the SentiPers dataset, which consists of 7,400 sentences, and enhances it with various embeddings to develop both LSTM and CNN models. All the original and newly transformed data, along with the notebooks used to create the models, are available in this repository.
A BERT-based transformer fine-tuned on various sentiment analysis datasets like Digikala, SnappFood, SentiPers and Taaghche.
Persian NLP team trained various mt5 models on their sentiment analysis dataset.
Shekasteh is an evaluation dataset for Persian colloquial text. It comes from different genres, including blog posts, movie subtitles, and forum chats.
Informal and Formal Persian word analyzer (inflection with FST)
Persian Slang Words (dataset)
Informal Persian Universal Dependency Treebank, consisting of 3000 sentences and 54,904 tokens, is an open source collection of colloquial informal texts from Persian blogs.
Converts numbers to words.
Read me this number python -- Convert number to Persian
Convert numbers to Persian words.
Describe PERsian Numbers
A normalizer that does a lot with numbers, in both directions.
Handling various number types in Persian text (like National ID, Sheba, etc)
Persian text -> integer, integer -> text converter
Takes a number and converts it to Persian word form
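To make the number-to-words task concrete, here is a small Python sketch that spells out non-negative integers below one million in Persian. It is an independent illustration, not the API of any library listed above; the table names and `to_persian_words` are hypothetical, and writing 1000 as "هزار" rather than "یک هزار" is a convention chosen for this example.

```python
ONES = ["", "یک", "دو", "سه", "چهار", "پنج", "شش", "هفت", "هشت", "نه"]
TEENS = ["ده", "یازده", "دوازده", "سیزده", "چهارده",
         "پانزده", "شانزده", "هفده", "هجده", "نوزده"]
TENS = ["", "", "بیست", "سی", "چهل", "پنجاه", "شصت", "هفتاد", "هشتاد", "نود"]
HUNDREDS = ["", "صد", "دویست", "سیصد", "چهارصد",
            "پانصد", "ششصد", "هفتصد", "هشتصد", "نهصد"]
THOUSAND = "هزار"
ZERO = "صفر"
JOINER = " و "  # Persian "and", placed between the parts of a number


def _below_1000(n: int) -> str:
    """Spell out 1..999; returns an empty string for 0."""
    parts = []
    if n >= 100:
        parts.append(HUNDREDS[n // 100])
        n %= 100
    if 10 <= n <= 19:
        parts.append(TEENS[n - 10])
    else:
        if n >= 20:
            parts.append(TENS[n // 10])
        if n % 10:
            parts.append(ONES[n % 10])
    return JOINER.join(parts)


def to_persian_words(n: int) -> str:
    """Spell out an integer in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only supports 0..999999")
    if n == 0:
        return ZERO
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands == 1:
        parts.append(THOUSAND)  # assumption: bare "هزار" for exactly one thousand
    elif thousands > 1:
        parts.append(_below_1000(thousands) + " " + THOUSAND)
    if rest:
        parts.append(_below_1000(rest))
    return JOINER.join(parts)
```

The reverse direction (words to integer) and special formats such as national IDs need their own parsing and validation rules, which this sketch does not attempt.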
Pre-trained word vectors of 157 languages including Persian, trained on CommonCrawl and Wikipedia using CBOW.
A tutorial on how to use 3 word embeddings; a) Downloading and using fasttext Persian word embeddings. b) How to get word embeddings of ParsBERT base model itself. c) How to get word embeddings of ParsGPT model.
A Persian Word2Vec model trained on Wikipedia articles
Three similar models based on fine-tuning ParsBERT base model on 3 different entailment datasets. Each of these models can be used for Semantic Search, Clustering, Summarization, Information Retrieval and Topic Modeling tasks.
A comprehensive suite of high-level NLP tasks for the Persian language. The dataset consists of the following tasks: Text entailment, Query paraphrasing, Reading comprehension, Multiple-choice QA, Machine translation and Sentiment analysis. The authors have also fine-tuned mt5 models on these datasets, resulting in various Persian models.
ParsBench provides toolkits for benchmarking LLMs based on the Persian language tasks.
- ParsiNLU all tasks
- Persian NER
- Persian Math
- ConjNLI Entailment
- Persian MMLU (Khayyam Challenge)
Benchmarking ChatGPT for Persian: A Preliminary Study
- Elementary school
- Mathematical problems dataset
Persian (Farsi) Question Answering Dataset, with two models: bert-base-fa-qa (162M parameters), fine-tuned on this dataset, and xlm-roberta-large-fa-qa (558M parameters), fine-tuned on this dataset and the SQuAD2.0 (English) dataset.
A medical question answering dataset consisting of 15k dialogs in 70 specialities.
26k questions and answers with related excerpts extracted from Persian Wikipedia. By design, some of the questions cannot be answered from the given excerpt (like SQuAD2.0).
Persian Question Answering Dataset based on Machine Translation of SQuAD 2.0
Consists of 30K questions and answers of various Persian crossword puzzles.
Persian NLP team trained various mt5 and BERT models on their multiple-choice QA dataset.
It consists of 266k legal questions, answers and related tags.
Persian translation of 35k records of Stanford Alpaca Instruction dataset (52K records). There is also a version with different formatting.
This dataset contains 5900 Persian language question-answer pairs generated using the PersianAnswerGenerator class from answer.py. The answers are produced by an AI assistant leveraging the GPT-4o model through the Avala API service.
MauxiMix is a carefully curated dataset of 1,000 high-quality Persian conversations, translated from the SmolTalk dataset using advanced language models. This dataset is specifically designed for training and fine-tuning Large Language Models (LLMs) with Supervised Fine-Tuning (SFT) techniques, contributing to the development of open-source Persian language models.
The Persian Universal Dependency Treebank (Seraji) is based on Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to the Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.
The Persian Universal Dependency Treebank (PerUDT) is the result of automatic conversion of the Persian Dependency Treebank (PerDT) with extensive manual corrections. Consists of 29k sentences.
PARSEME is a verbal multiword expressions (VMWEs) corpus for Farsi. All the annotated data come from a subset of the Farsi section of the MULTEXT-East "1984" annotated corpus 4.0. In addition to the LEMMA, UPOS, XPOS, FEATS, HEAD and DEPREL columns, there is also a PARSEME:MWE column, which is manually annotated.
UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files.
Informal Persian Universal Dependency Treebank, consisting of 3000 sentences and 54,904 tokens, is an open source collection of colloquial informal texts from Persian blogs.
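The treebanks in this section are distributed as CoNLL-U files, which store one token per line in ten tab-separated columns, with blank lines between sentences and `#` lines for comments. A minimal reader, sketched here with the standard library only, could look like this. The function name `read_conllu` and the two-token sample sentence are made up for illustration, and the sketch does no special handling of multiword-token ranges or empty nodes.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment / metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:                        # file may not end with a blank line
        sentences.append(current)
    return sentences


# Made-up sample ("I went"); the annotations are illustrative, not from a treebank.
SAMPLE = "\n".join([
    "# text = من رفتم",
    "\t".join(["1", "من", "من", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "رفتم", "رفت", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for token in read_conllu(SAMPLE)[0]:
        print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

Corpora that add an eleventh column, such as the PARSEME files, would need `CONLLU_FIELDS` extended accordingly.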
10k pairs with entailment label.
Utilizes the FarsTail dataset for fine-tuning its ParsBERT model, while also incorporating two other entailment datasets: Wiki Triplet and Wiki D/Similar.
Persian NLP team trained various mt5 and BERT models on their entailment dataset.
This could be a nice tool for Persian writers or bloggers to automatically pick a suggested hashtag or even a subject for their articles. One could also collect data from Google Trends for each hashtag or 'label' used in an article. Consists of 11k+ articles.
The file contains 3780 news articles published by BBC Persian. The articles mostly belong to the year 1399 and 1400, and are published before Aban 18th, 1400. Columns are: title, publish_name, link, related_topics, body, category.
Consists of 63k News articles with following columns: category, title, abstract, body, time.
Yearly collection of the Farsnews agency (1398). Contains 294k News article with following columns: title, abstract, paragraphs, cat, subcat, tags, link.
A total of 8,515 articles scraped from Digikala Online Magazine. This dataset includes seven different classes: Video Games, Shopping Guide, Health Beauty, Science Technology, General, Art Cinema and Books Literature.
Contains about 3K tweets, with each one of them labeled as either ironic or not.
4K records of stance detection in headlines and bodies of news articles.
Consists of 5.5K pairs of tweets in which the stance of the reply tweet toward the main tweet has been marked as against, support or neither.
Consists of 3.8K tweets in which the type of each claim has been identified. It does not, however, show where the claim is located in the tweet.
Named Entity Recognition (NER) on the Persian Twitter dataset. Consists of 6 entity types, including: event, location, nationality, organization and pog (political organizations and historical dynasties). 12k Named Entities in 232k tokens.
Extends PEYMA corpus (300k tokens), with another 600k tokens. Consists of 16 entity types including: date, location, percent number, money, time, person and organization. 48k NEs in 884k tokens.
The dataset includes 250,015 tokens and 7,682 Persian sentences in total. Consists of 6 NE types including: facility, organization, location, event, person and proper noun. 37K NEs in 749k tokens.
Crowd-sourced NE dataset with 5 NE types. 2.2M NEs in 25M tokens.
This dataset is a mixed NER dataset collected from ARMAN, PEYMA, and WikiANN that covers ten types of entities: Date, Event, Facility, Location, Money, Organization, Percent, Person, Product and Time. 140K NEs in 40k sentences.
It is a large Multilingual Dataset for Entity Linking containing data in 53 languages including Persian. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. Paper. For this project UDPipe has been used.
XTREME is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models that covers 40 typologically diverse languages and includes nine tasks. For Persian, however, it only includes:
- Wikiann named entity recognition
- Universal Dependencies part-of-speech tagging (Rasooli et al.)
Persian real SMS Dataset
More than 3k articles crawled from the Tarjoman website.
27M tweets. Although these texts have been labeled or translated using various NLP toolkits, they have never been supervised.
Consists of 8M words with following columns: title, date, url and body.
219K abstracts collected from Ensani.ir papers.
We created a dataset of 33338 Persian tweets, of which 10% contained Abusive words and 90% were non-Abusive.
Remove Persian (Farsi) Swear Words
Persian Swear Dataset - you can use it in production to filter unwanted content. A dataset of inappropriate and offensive Persian words for filtering text.
A collection of Persian stopwords. Consists of:
- persian-stop-word
- persian-stopwords
- and 5 other lists.
Consists of about 2k stop words.
Complete instructions for training a Persian spell checker and a language model based on SymSpell and KenLM, using a Wikipedia dataset. Tokens that are not in the vocabulary and have a very low frequency are considered misspelled and are replaced with the vocabulary entry that maximizes the probability of the sentence.
FASpell dataset was developed for the evaluation of spell checking algorithms. It contains a set of pairs of misspelled Persian words and their corresponding corrected forms similar to the ASpell dataset used for English. The dataset consists of two parts: a) faspell_main: list of 5050 pairs collected from errors made by elementary school pupils and professional typists. b) faspell_ocr: list of 800 pairs collected from the output of a Farsi OCR system.
Created data for hunspell library for spell checking and morphology analyzing.
Consists of some lists of misspelled words and some dictionaries of Persian word entries.
A comprehensive parallel dataset designed for the task of spell checking in Persian. Misspelled sentences together with the correct form are produced using a massive confusion matrix, which is gathered from many sources. This dataset contains informal sentences in addition to the formal sentences, and contains texts from diverse topics. Both non-word and real-word errors are collected in the dataset.
Code and data for detecting and correcting one special kind of cognitive misspelling error in informal Persian.
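The detect-then-replace idea behind the SymSpell/KenLM entry in this section can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it uses a plain Levenshtein distance instead of SymSpell's symmetric-delete index, and raw unigram counts instead of a KenLM sentence probability. The frequency table, thresholds and function names are hypothetical.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correct_token(token: str, freq: Dict[str, int],
                  min_freq: int = 2, max_dist: int = 1) -> str:
    """Keep frequent tokens; otherwise pick the most frequent close vocabulary word."""
    if freq.get(token, 0) >= min_freq:
        return token
    candidates = [w for w, c in freq.items()
                  if c >= min_freq and edit_distance(token, w) <= max_dist]
    if not candidates:
        return token  # nothing close enough: leave the token untouched
    # Assumption: unigram frequency stands in for the language-model score.
    return max(candidates, key=lambda w: freq[w])


def correct_sentence(sentence: str, freq: Dict[str, int]) -> str:
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_token(t, freq) for t in sentence.split())


# Hypothetical word counts, for illustration only.
TOY_FREQ: Dict[str, int] = {"کتاب": 50, "کباب": 20, "خوب": 40}
```

Scoring each token on its own cannot catch real-word errors; that is where a sentence-level language model is needed to choose between candidates in context.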
Standardize your Persian text: Preprocessing, Embedding, and more!
Simple Farsi normalizer
Cleaning up Persian text! (Ruby)
Virastar is a Persian text cleaner (JS).
A Persian normalization and tokenization tool, constructed as a plugin for Elasticsearch.
A normalizer that does a lot with numbers, in both directions.
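For readers new to Persian text normalization, the sketch below shows a few steps that such tools commonly perform: unifying Arabic and Persian variants of the same letter, converting Arabic-Indic digits to their Persian forms, and collapsing whitespace. The mapping and the function name `normalize_persian` are assumptions made for this example and do not reproduce the behavior of any specific tool listed above.

```python
import re

# Code point -> code point mapping (an illustrative subset, not a complete list).
_CHAR_MAP = {
    0x064A: 0x06CC,  # ARABIC LETTER YEH           -> ARABIC LETTER FARSI YEH
    0x0649: 0x06CC,  # ARABIC LETTER ALEF MAKSURA  -> ARABIC LETTER FARSI YEH
    0x0643: 0x06A9,  # ARABIC LETTER KAF           -> ARABIC LETTER KEHEH
}
# Arabic-Indic digits (U+0660..U+0669) -> Extended Arabic-Indic digits (U+06F0..U+06F9).
_CHAR_MAP.update({0x0660 + d: 0x06F0 + d for d in range(10)})

_WHITESPACE = re.compile(r"\s+")


def normalize_persian(text: str) -> str:
    """Unify letter and digit variants, then collapse runs of whitespace."""
    text = text.translate(_CHAR_MAP)
    return _WHITESPACE.sub(" ", text).strip()
```

Production normalizers typically go further, for example handling the zero-width non-joiner, diacritics and punctuation spacing, which this sketch leaves untouched.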
Tajik-to-Persian transliteration model
Farsi to Finglish, a Persian transliterator
24k ASCII transliterated Persian words
An attempt to make a transliterator of Farsi (Persian) web page to Tajiki (Cyrillic) with a bookmarklet.
Consists of the following sets:
- Words of Sareh Dictionary (Purified Persian Words)
- Farhangestan chosen words for non-Persian equivalents.
- Farhange Emlaee (A dictionary of Persian orthography and spelling)
- A part of Ganjoor's website poetry repos.
- Farhange Motaradef va Motazad (A dictionary of Persian synonyms and antonyms)
- Farhange Teyfi (Persian Thesaurus)
Persian names dataset
A Python package for generating random Persian (Farsi) names.
A SQL database that includes a dictionary of 494,286 Persian words.
A database of meaningful Persian words in JSON format.
850k categorized Persian words.
Pre-calculated list of similar Persian words ordered by rating and best match
List of ~240,000 Persian words
Useful Persian dictionary and more. Consists of:
- Dehkhoda dictionary (36k)
- Synonyms (20k)
- Arabic to Persian dictionary (113k)
- Persian to Arabic dictionary(32k)
- Abjad Persian to Arabic dictionary (42k)
- Arabic to Persian dictionary (8k)
- Quran Mofradat (1.6k)
- Arabic monolingual dictionary (4.6k)
- Intermediate Arabic dictionary (41k)
- Alamsal - Arabic proverbs dictionary (4.5k)
The "Iranian Job Title" dataset offers a comprehensive compilation of various job titles prevalent in Iran across diverse industries and sectors.
Moeen dictionary based Thesaurus for Persian.
It's an enhanced version of the Flexicon word list with syllables, IPA pronunciation and some refinements to the word list itself.
A simple Telegram bot implemented in Python.
Useful Persian dictionary and more. Consists of:
- Persian poetry of Iranian poets:
- Ahmad Shamlou
- Baba-Taher
- Parvin E'tesami
- Hafez
- Khayyam
- Rahi-Moayeri
- Roodaki
- Sa'di
- Sohrab Sepehri
- Shahriar
- Saeb Tabrizi
- Onsori
- Ferdowsi
- Forugh Farrokhzad
- Mehdi Akhavan Sales
- Mowlavi
- Nezami
- Nima Yushij
- Quran Database
- Quran Surahs (114)
- Quran Verses (6236)
- Quran Verses Translation by Gomshe'i (6326)
- Quran Translation Word by word (83668)
- Reading voice of Famous Readers (48)
Collection of Persian Modernist Poetry from Iranian contemporary poets
Crawled Ganjoor for poems of 48 poets.
This model is ParsGPT2 fine-tuned on the Chronological Persian poetry dataset; it can generate poems given the name of a poet.
Dataset of poetry of 67 Persian poets of different times.
Persian spoken digit recognition
Simple Persian Questions aimed to use in a voice assistant in 4 Categories. Labeled NEs in command utterances (in text).
About 60 hours of audio produced by various users reading sentences. Counting duplicated sentences, the total exceeds 500 hours.
A ~2.5-hour single-speaker speech corpus.
A semi-natural database which contains emotional speech samples of Persian speakers. The database includes 3000 semi-natural utterances, equivalent to 3 h and 25 min of speech data extracted from online radio plays.
A deep-learning-based Persian speech recognition system. It uses various ASR platforms to create ASR models, and various datasets including Mozilla CommonVoice and the authors' own dataset, which consists of 300h+ of audio and transcriptions.
Phoneme based speech dataset.
Open-source tool for speech recognition for various platforms and OSes, supporting 20 languages including Persian.
It is a wav2vec model fine-tuned on the Mozilla CommonVoice Persian dataset. The model and the notebook to recreate the model with extra data are available.
A search engine for crawling news from the web, storing in a structured way, and querying through the stored documents for finding the most relevant results using Machine Learning and Information Retrieval techniques.
A crawler to fetch the latest news from Iranian (Persian) news agencies.
Open source crawler for Persian websites including Asriran, fa-Wikipedia, Tasnim, Isna.
A Persian POS tagger trained on the Persian Universal Dependency Treebank (Persian UD) with TensorFlow
PARSEME is a verbal multiword expressions (VMWEs) corpus for Farsi. All the annotated data come from a subset of the Farsi section of the MULTEXT-East "1984" annotated corpus 4.0. In addition to the LEMMA, UPOS, XPOS, FEATS, HEAD and DEPREL columns, there is also a PARSEME:MWE column, which is manually annotated.
Scripts and models developed for POS Tagging and Dependency Parsing Persian based on TurboParser.
RDRPOSTagger supports pre-trained UPOS, XPOS and morphological tagging models for about 80 languages including Persian. Java version.
This is another Persian POS tagger
A keyphrase extractor for Persian
The first intelligent Persian reverse dictionary. Consists of various models for this task and datasets of Amid, Moeen, Dehkhoda, Persian Wikipedia and Persian Wordnet (Farsnet).
A Persian dataset for Joint Intent Detection and Slot Filling.
Persian NLP team trained various mt5 models on their reading comprehension dataset.
Family of ParsBERT models including BERT, DistilBERT, ALBERT and RoBERTa, all of which are transformer-based, encoder-only models.
Multilingual BERT model covering 104 languages including Persian.
A BERT-based model trained on the Divan dataset (proprietary). This model has 46.6M parameters. Its evaluation on NER and Sentiment Analysis is reported.
A BERT-based model trained on the Divan dataset (proprietary). This model has 124M parameters. Its evaluation on NER and Sentiment Analysis is reported.
Is a Persian BERT model trained on various Persian texts.
Is a Persian BERT model trained on various Persian texts.
Is a Persian BERT model trained on various Persian texts with 123M parameters. There is also a large version of this model with 353M parameters.
Do you need some fake data?
Persian-Badge is a website for having metadata badges in the Persian language
This is a dataset of handwritten cities in Iran in Arabic/Persian that has been used in my Master project. This dataset is collected for sorting postal packages.
Hand-written / typed names of different cities of Iran in image format.
50*50 Images of Persian letters (without dots) with 32 Different Fonts.
Consists of about 20k images of Persian subwords in different fonts and sizes to be used in OCR models.
Persian SMS spam words
COCO 2017 translated to Persian. 91k images with captions in Persian.
Dataset of Farsi License Plate Characters (83k).
The VQA dataset consists of almost 11k images and 28.5k question and answer pairs with short and long answers usable for both classification and generation VQA.
A dataset consisting of 16M records of images and their corresponding texts. It also includes a model trained on 400k records of this dataset for searching images based on text and image.
Consists of about 26K records of images with their descriptive captions in Persian.
Persian language movies dataset from imvbox.com. 14k movies with storyline translated from Persian to English.
Quran ayat with translation in 21 languages.
A multilingual parallel corpus created from translations of the Bible, in 100 languages including Persian.
A set of corpora for 120 languages including Persian, automatically collected from Wikipedia and the web.
Persian NLP team trained various mt5 models on their translation dataset.
2.7k relations between entities, with translation and relation type.
It is a large Multilingual Dataset for Entity Linking containing data in 53 languages including Persian. DaMuEL consists of two components: a knowledge base that contains language-agnostic information about entities, including their claims from Wikidata and named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED); and Wikipedia texts with entity mentions linked to the knowledge base, along with language-specific text from Wikidata such as labels, aliases, and descriptions, stored separately for each language. Paper. For this project UDPipe has been used.
It is a knowledge graph platform designed for extracting information from Wikipedia, tables, and unstructured texts. A portion of its data is also available for download.
Open information extraction from Persian web.
The Persian Simple Question Answering Dataset and System over Knowledge Graph. It consists of 36k records.
It is a dataset for Persian fact extraction and verification, developed in accordance with FEVER guidelines.
Consists of 63k News articles with following columns: category, title, abstract, body, time.
Yearly collection of the Farsnews agency (1398). Contains 294k News article with following columns: title, abstract, paragraphs, cat, subcat, tags, link.
95k documents with body and summary extracted from Persian Wikipedia articles. There is also a notebook to create and test models for summarization.
Statistical and Semantical Text Summarizer in Persian Language
A well-structured summarization dataset for the Persian language consists of 93,207 records. It is prepared for Abstractive/Extractive tasks (like cnn_dailymail for English). It can also be used in other scopes like Text Generation, Title Generation, and News Category Classification.
Consists of similar models fine-tuned on ParsBERT using three different datasets. These models can be utilized for various applications, including text summarization.
MirasText has more than 2.8 million articles and over 1.4 billion content words. Consists of following columns: content, summary, keywords, title, url.
Paraphrase data for Persian. It consists of 2.3M sentence pairs, of which 1M are paraphrases and 1.3M are not paraphrases of each other.
Persian NLP team trained various mt5 models on their query paraphrase dataset.
Consists of 800 pairs of Persian sentences which are paraphrases of each other.
SBU-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation.
The Dorna models are a family of decoder-only models, specifically trained/fine-tuned on Persian data. This model is built using the Meta Llama 3 Instruct model. There are also quantized versions of this model.
With 13 billion parameters, this model has been fine-tuned using the Persian Alpaca dataset on Llama 2 to excel at executing detailed instructions and delivering tailored outputs. There is also PersianLLaMA 13B, which is fine-tuned on Persian Wikipedia.
Persian version of the GPT2 model fine-tuned on Persian poetry and the ParsiNLU sentiment analysis dataset.
Fine-tuned versions of Mistral 7B and Llama 3 for Persian. The Persian resources used for these models are not known.
Is a Mistral 7B based model trained on Alpaca Persian Instruction dataset.
Thanks to Awesome Persian NLP and Awesome Iranian Datasets for providing some elements of this long list.