Hebrew NLP Resources

An online interface of this resource index is also available HERE.

This repository collects resources for NLP in Hebrew, as part of the NLPH project, which you can read more about here. Resources are divided into folders by type. If you have a resource you can contribute, to be released under some open license, please submit a pull request, or contact us at contact@nlph.org.il. See here for a list of companies operating in the field.

This specific document is meant to be a list of Hebrew NLP resources, both for general use and to be used as reference when discussing what existing tools can be opened, adapted or integrated to help create a good open source foundation for NLP in Hebrew, as part of the NLPH Project.

When contributing to the list, please add a link to the license for all non-paper resources, e.g. {AGPL-3.0}, {?} for an unknown license or {X} for unreleased/closed/copyrighted resources. For code resources, please also add the main language in which the tool is written, e.g. [Python] or [?] for an unknown programming language. Please add hosting mirrors with pointy brackets, e.g. <Zenodo mirror>.
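
As an illustration of the marker convention described above, here is a minimal Python sketch that splits a single entry into its name, [language] and {license} markers, <mirror> markers and description. The function name `parse_entry` and the regular expression are assumptions made for this example; they are not part of the repository's tooling, and entries without any marker are simply reported as `None`.

```python
import re

# One entry = name, then a run of [language] / {license} markers,
# then an optional "- description" that may end with <mirror> markers.
ENTRY_RE = re.compile(
    r"^(?P<name>[^\[{]+?)\s*"
    r"(?P<markers>(?:(?:\[[^\]]*\]|\{[^}]*\})\s*)+)"
    r"(?:-\s*)?(?P<description>.*)$"
)


def parse_entry(line):
    """Split one resource entry into its name, markers and description."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None  # the entry carries no [language] or {license} marker
    markers = match.group("markers")
    description = match.group("description")
    return {
        "name": match.group("name"),
        "languages": re.findall(r"\[([^\]]*)\]", markers),
        "licenses": re.findall(r"\{([^}]*)\}", markers),
        # Mirrors are looked up in the description, where they usually trail.
        "mirrors": re.findall(r"<([^<>]+)>", description),
        "description": description,
    }


entry = parse_entry(
    "RFTokenizer [Python] {Apache License 2.0} - A highly accurate morphological segmenter"
)
print(entry["name"], entry["languages"], entry["licenses"])
```

A check like this could be run over new entries before a pull request to catch a missing license marker; {?} and {X} are returned as ordinary license values and can be filtered afterwards.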

Contents

1 Corpora
2 Lexical Resources
3 Models and Tools
- 3.1 Models and Tools by Task
- 3.2 Models by Type
4 Commercial and Online Services
5 Annotation Tools
6 Evaluation
- 6.1 Benchmark Datasets
- 6.2 Evaluation Metrics
7 Labs & Researchers
- 7.1 Academia
- 7.2 Non-Profit
- 7.3 Industry
8 Courses, Presentations and Meetups
- 8.1 Meetup & Discussion Groups
- 8.2 Specific Talks

HeDC4 used for HeRo {Apache License 2.0} - A Hebrew Deduplicated and Cleaned Common Crawl Corpus. A thoroughly cleaned and approximately deduplicated dataset for unsupervised learning.
Wikipedia Corpora used for AlephBERT {Apache License 2.0} - The text of all of Hebrew Wikipedia was extracted to pre-train OnlpLab's AlephBERT, using Attardi's Wikiextractor.
JPress {Custom Terms of Use} - The National Library offers a collection of Jewish newspapers published in various countries, languages, and time periods, including digital versions and full-text search. The texts are published under a Custom Terms of Use document that prohibits commercial use, and additionally requires checking the copyright status and receiving permission from the copyright-holder of the work for any use requiring such permission according to the Copyright Law.
The SVLM Hebrew Wikipedia Corpus {CC-BY-SA 3.0} - A corpus of 50K sentences from Hebrew Wikipedia chosen to ensure phoneme coverage for the purpose of a sentence recording project.

1.1.2 Specialized Corpora

Sefaria {Each text is licensed separately} - Structured Jewish texts and metadata with free public licenses, exported from Sefaria's database. A Living Library of Jewish Texts. 3,000 years of Jewish texts in Hebrew and English translation.
Hebrew Songs Lyrics {CC BY-SA 4.0} - ~15,000 Israeli songs by 167 different singers, scraped from the Shironet website. Contains only Hebrew characters.
1001 Israeli Pop Songs Dataset {CC BY-NC-ND 4.0} - Manual analyses of 1001 Israeli pop songs from 1967-2017.
Supreme Court of Israel {OpenRAIL} - This dataset represents a 2022 snapshot of the Supreme Court of Israel public verdicts and decisions supported by rich metadata. The 5.31 GB dataset represents 751,194 documents. Overall, the dataset contains 2.68 GB of text.

1.1.3 Crawls and Dumps

Hebrew Wikipedia Dumps {CC-BY-SA 3.0} - Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
HeWikiBooks Dumps {CC0 1.0} - Wikimedia dump service.
Project Ben Yehuda Public Dumps {Public Domain} - A repository containing dumps of thousands of public domain works in Hebrew, from Project Ben-Yehuda, in plaintext UTF-8 files, with and without diacritics (nikkud), and in HTML files.

1.2 Multilingual Corpora

OSCAR {CC BY 4.0} - OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
CC100 {MIT} - This corpus is an attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages, including Hebrew. This was constructed using the urls and paragraph indices provided by the CC-Net repository by processing January-December 2018 Commoncrawl snapshots.
Old Newspapers {CC0 1.0} - The HC Corpora was a great resource that contains natural language text from various newspapers, social media posts and blog pages in multiple languages. This is a cleaned version of the raw data from the newspaper subset of the HC corpus.
TED Talks Transcripts for NLP {CC BY-NC 4.0} - Transcripts and more in 12 languages including Hebrew.

1.3 Annotated Datasets by Task

1.3.1 Dependency Treebanks

Knesset 2004-2005 {Public Domain} - A corpus of transcriptions of Knesset (Israeli parliament) meetings between January 2004 and November 2005. Includes tokenized and morphologically tagged versions of most of the documents in the corpus. <MILA> <Zenodo>
The Hebrew Treebank {GPLv3} - The Hebrew Treebank Version 2.0 contains 6500 hand-annotated sentences of news items from the MILA HaAretz Corpus, with full word segmentation and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, like roots, templates and patterns, are not analyzed. This resource can be used freely for research purposes only. (temporarily down)
UD Hebrew Treebank {CC BY-NC-SA 4.0} - The Hebrew Universal Dependencies Treebank.
IAHLT-HTB {CC BY-NC-SA 4.0} - IAHLT version of the UD Hebrew Treebank. This is a revised fork of the Universal Dependencies version of the Hebrew Treebank, with some important changes and a consistency overhaul involving substantial manual corrections. The dataset was prepared as part of the Hebrew & Arabic Corpus Linguistics Infrastructure project at the Israeli Association of Human Language Technologies (IAHLT).
Modern Hebrew Dependency Treebank V.1 {GPLv3} - This is the Modern Hebrew Dependency Treebank which was created and used in Yoav Goldberg's PhD thesis.
UD Hebrew IAHLTwiki {CC-BY-SA 4.0} - Publicly available subset of the IAHLT UD Hebrew Treebank's Wikipedia section. The UD Hebrew-IAHLTWiki treebank consists of 5,000 contemporary Hebrew sentences representing a variety of texts originating from Wikipedia entries, compiled by the Israeli Association of Human Language Technology. It includes various text domains, such as: biography, law, finance, health, places, events and miscellaneous.
UD Hebrew - IAHLTKnesset {CC BY 4.0} - A Universal Dependencies treebank with named entities for contemporary Hebrew covering Knesset protocols.
The Hebrew Language Corpus - Morphological Annotation (קורפוס השפה העברית - תיוג מורפולוגי) {Open} - An annotated Hebrew database published as part of the Hebrew Language Corpus Project of Israel National Digital Agency and The Academy of the Hebrew Language.
The MILA corpora collection {GPLv3} - The MILA center has 20 different corpora available for free for non-commercial use. All are available in plain text format, and most have tokenized, morphologically-analyzed, and morphologically-disambiguated versions available too. (temporarily down)

1.3.2 Named Entity Recognition (NER)

NEMO {CC BY 4.0} - Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, and more. The following entity types are tagged: Person, Organization, Geo-Political Entity, Location, Facility, Work-of-Art, Event, Product, Language.
MDTEL {MIT} - A dataset of posts from www.camoni.co.il, tagged with medical entities from the UMLS, and code that recognizes medical entities in Hebrew text.
Ben-Mordecai and Elhadad's Corpus {?} - Newspaper articles in different fields: news, economy, fashion and gossip. The following entity types are tagged: entity names (person, location, organization), temporal expression (date, time) and number expression (percent, money). Demo
UD Hebrew - IAHLTKnesset {CC BY 4.0} - A Universal Dependencies treebank with named entities for contemporary Hebrew covering Knesset protocols.

1.3.3 Question Answering (QA)

HeQ {CC BY 4.0} - a question answering dataset in Modern Hebrew, consisting of 30,147 questions. The dataset follows the format and crowdsourcing methodology of SQuAD (Stanford Question Answering Dataset) and the original ParaShoot. A team of crowdworkers formulated and answered reading comprehension questions based on random paragraphs in Hebrew. The answer to each question is a segment of text (span) included in the relevant paragraph. The paragraphs are sourced from two different platforms: (1) Hebrew Wikipedia, and (2) Geektime, an online Israeli news channel specializing in technology.
ParaShoot {?} - A Hebrew question and answering dataset in the style of SQuAD, created by Omri Keren and Omer Levy. ParaShoot is based on articles scraped from Wikipedia. The dataset contains 3K crowdsource-annotated pairs of questions and answers, in a setting suitable for few-shot learning.
HebWiki QA {?} - The SQuAD dataset translated from English to Hebrew using the Google Translate API. The translation process included fixing and removing bad translations.
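
The SQuAD-style datasets listed above store each answer as a span of the source paragraph. The sketch below shows how such a record can be sanity-checked in Python. The field names (`context`, `question`, `answers`, `text`, `answer_start`) follow the layout commonly used for SQuAD-style data and are an assumption here; the actual files of HeQ, ParaShoot or HebWiki QA may be organized differently. The Hebrew sentence is a made-up illustration, not taken from any of the datasets.

```python
def answer_spans_are_valid(record):
    """Return True if every answer text occurs in the context at its stated offset."""
    context = record["context"]
    answers = record["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        # Offsets are character offsets into the context string.
        if context[start:start + len(text)] != text:
            return False
    return True


# Hypothetical record, invented for this example.
example = {
    "context": "ירושלים היא בירת ישראל.",
    "question": "מהי בירת ישראל?",
    "answers": {"text": ["ירושלים"], "answer_start": [0]},
}

print(answer_spans_are_valid(example))
```

A check of this kind is mainly useful for machine-translated data, where the translated answer may no longer appear verbatim in the translated paragraph.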

1.3.4 Sentiment Analysis

Hebrew-Sentiment-Data Amram et al. {?} - A sentiment analysis benchmark (positive, negative and neutral sentiment) for Hebrew, based on 12K social media comments, containing two instances of input items: token-based and morpheme-based. A cleaned version of the Hebrew Sentiment dataset - a test-train data leakage was cleaned.
Emotion User Generated Content (UGC) {MIT} - Collected for the HeBERT model; includes comments posted on news articles from 3 major Israeli news sites between January 2020 and August 2020. The total size of the data is ~150 MB, including over 7 million words and 350K sentences. ~2000 sentences were annotated by crowd members (3-10 annotators per sentence) for overall sentiment (polarity) and eight emotions.
Sentiment HebrewDataset {MIT} - The sentiment analysis dataset contains 75,152 tagged sentences from 3 categories: economy, news (mostly politics) and sport. All the sentences were annotated by crowd members (2-5 annotators) for sentiment: positive, negative or neutral. This dataset was created by the SUMIT-AI company, thanks to joint funding from the NNLP-IL.

1.3.5 Emotion Detection

Emotion User Generated Content (UGC) {MIT} - Collected for the HeBERT model; includes comments posted on news articles from 3 major Israeli news sites between January 2020 and August 2020. The total size of the data is ~150 MB, including over 7 million words and 350K sentences. ~2000 sentences were annotated by crowd members (3-10 annotators per sentence) for overall sentiment (polarity) and eight emotions: anger, disgust, expectation, fear, happiness, sadness, surprise and trust.

1.3.6 Topic Classification

Knesset Topic Classification {?} - This data was collected as a part of Nitzan Barzilay's project and contains about 2,700 quotes from Knesset meetings, manually classified into eight topics: education, Covid-19, welfare, economic, women and LGBT, health, security, internal security.
ThinkIL {CC-BY-SA 3.0} - An archive of the writings of Zvi Yanai.

The HUJI Corpus of Spoken Hebrew {CC BY 4.0} - The corpus project, created by Dr Michal Marmorstein, Nadav Matalon, Amir Efrati, Itamar Folman and Yuval Geva, and hosted by the Hebrew University of Jerusalem (HUJI), aims at documenting naturally occurring speech and interaction in Modern Hebrew. Data come from telephone conversations recorded during the years 2020–2021. Data annotation followed standard methods of Interactional Linguistics (Couper-Kuhlen and Selting 2018). Audio files and transcripts were made freely accessible online.
CoSIH - The Corpus of Spoken Hebrew {?} - The Corpus of Spoken Israeli Hebrew (CoSIH) is a database of recordings of spoken Israeli Hebrew
MaTaCOp {?} - a corpus of Hebrew dialogues within the Map Task framework (allowed for non-commercial research and teaching purposes only)
HaArchion {?} - Recording of various Hebrew prose and poetry being read. (temporarily down)
Robo-Shaul (רובו-שאול) {?} - Transcribed audio recordings (30 hours) of an Israeli economics podcast (חיות כיס).

The BGU morphological lexicon (not yet released)
The morphological lexicon of the Israeli National Institute for Testing and Evaluation (not yet released)
The MILA lexicon of Hebrew words {GPLv3} - The lexicon was designed mainly for usage by morphological analyzers, but is being constantly extended to facilitate other applications as well. The lexicon contains about 25,000 lexicon items and is extended regularly. Free for non-commercial use. (temporarily down)
MILA's Verb Complements Lexicon {GPLv3}
Hebrew Psychological Lexicons {CC-BY-SA 4.0} - Natalie Shapira's large collection of Hebrew psychological lexicons and word lists. Useful for various psychology applications such as detecting emotional state, well being, relationship quality in conversation, identifying topics (e.g., family, work) and many more.

2.1.2 Bilingual/Multilingual Lexicons

Hebrew WordNet {GPLv3} - Hebrew WordNet uses the MultiWordNet methodology and is aligned with the one developed at IRST (and therefore is aligned with English, Italian and Spanish). Free for non-commercial use. (temporarily down)
Sentiment lexicon {GPLv3} - Sentiment analysis, the task of automatically detecting whether a piece of text is positive or negative, generally relies on a hand-curated list of words with positive sentiment (good, great, awesome) and negative sentiment (bad, gross, awful). This dataset contains both positive and negative sentiment lexicons for 81 languages.
word2word {Apache License 2.0} - Easy-to-use word-to-word translations for 3,564 language pairs. Hebrew is one of the 62 supported languages, and thus word-to-word translation to/from Hebrew is supported for 61 languages.

2.2 Dictionaries & Word Lists

Eran Tomer's Digital Vocalized Text Corpus {Apache License 2.0} - A corpus of digital vocalized Hebrew texts created by Eran Tomer as part of his Master thesis. The corpus is found in the resources folder.
MILA's Hebrew Stopwords List {GPLv3} - An Excel XLSX file containing 23,327 Hebrew tokens in descending order of frequency.
Tapuz Hebrew Stop Words - a list of the 500 most common words (stop words) computed from discussions from the Tapuz People website, on a variety of subjects. (Data files © Original Authors)
Stop words {GPLv2} - Stop words in 28 languages.
Hebrew verb lists {CC-BY 4.0} - Created by Eran Tomer (erantom@gmail.com).
Hebrew name lists {CC-BY 4.0} - Lists of street, company, given and last names. Created by Guy Laybovitz.
Most Common Hebrew Verbs on Twitter - 1000 most frequent words in Hebrew tweets during (roughly) 2018.
KIMA - the Historical Hebrew Gazetteer - Place Names in the Hebrew Script. An open, attestation based, historical database. Kima currently holds 27,239 Places, with 94,650 alternate variants of their names and 236,744 attestations of these variants.
Wikidata Lexemes {CC0 1.0} - over 500K conjugations with morphological analysis, mainly based on Hspell. Can be queried using http://query.wikidata.org/ - Uploaded by Uziel302
Most Common Hebrew Words on Twitter - Hebrew most common words by Twitter based on tweets from March 2018 to March 2019.
Hebrew WordLists {AGPL-3.0} - Useful word lists extracted from Hspell 1.4 by Eyal Gruss.
Hebrew stop word base on the UD {CC-BY-SA 4.0} - List of stop words in Hebrew produced by using the Universal Dependencies data of the Israeli Association of Human Language Technologies (IAHLT).
The Word-Frequency Database for Printed Hebrew - supplies the frequency of occurrence of any Hebrew letter cluster (mean occurrence per million). The corpus was assembled throughout the year 2001, and consists of text downloaded from 914 editions of the three major daily online Hebrew newspapers (Haaretz, Maariv, and Yediot Acharonot). After removing abbreviations, single characters, forms with counts that are less than 3 (mostly typos), and splitting hyphenated forms (vast majority were two words), the corpus totals 554,270 types and 619,835,788 tokens. (©The Hebrew University of Jerusalem)
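
For readers unfamiliar with the "occurrence per million" unit used by the Word-Frequency Database above, the following Python sketch shows the usual normalization: a raw count divided by the corpus size and scaled to one million tokens. Reading the database's figures this way is an assumption, and the raw count of 61,984 is a hypothetical value chosen for illustration; only the total token count is taken from the entry.

```python
# Total number of tokens in the corpus, as reported in the entry above.
TOTAL_TOKENS = 619_835_788


def per_million(count, total_tokens=TOTAL_TOKENS):
    """Convert a raw occurrence count into occurrences per million tokens."""
    return count / total_tokens * 1_000_000


# Hypothetical raw count, chosen so that the result is close to 100 per million.
print(round(per_million(61_984), 2))

# Forms with counts below 3 were removed, so 3 is the smallest count that can occur.
print(per_million(3))
```

Under this reading, the count threshold of 3 corresponds to a floor of roughly 0.005 occurrences per million.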

2.3 Word Embeddings

fastText pre-trained word vectors for Hebrew {CC-BY-SA 3.0} - Trained on Wikipedia using fastText. Comes in both the binary and text default formats of fastText: binary+text, text. In the text format, each line contains a word followed by its embedding; Each value is space separated; Words are ordered by their frequency in a descending order.
hebrew-word2vec pre-trained word vectors {Apache License 2.0} - Trained on data from Twitter. Developed by Ron Shemesh in Bar-Ilan University's NLP lab under the instruction of Dr. Yoav Goldberg. Contains vectors for over 1.4M words (as of January 2018). Comes in a zip with two files: a text file with a word list and a NumPy array file (npy file).
CoNLL17 word2vec word embeddings {CC BY 4.0} - Trained on the Hebrew CoNLL17 corpus using Word2Vec continuous skipgram, with a vector dimension of 100 and a window size of 10. The vocabulary includes 672,384 words.
CoNLL17 ELMO word embeddings {GPLv3} - Trained on the Hebrew CoNLL17 corpus using ELMO. NOTE: The link at the repository might not work. To download a concrete version of the Hebrew embeddings, press here.
Hebrew Word Embeddings by Lior Shkiller - Read more in this blog post.
Hebrew Subword Embeddings
LASER Language-Agnostic SEntence Representations {CC BY-NC 4.0} - LASER is a library to calculate and use multilingual sentence embeddings.
hebrew-w2v {Apache License 2.0} - Iddo Yadlin and Itamar Shefi's word2vec model for Hebrew, trained on a corpus consisting only of the Hebrew Wikipedia dump, tokenized with HebPipe.
BEREL {?} - BERT Embeddings for Rabbinic-Encoded Language - DICTA's pre-trained language model (PLM) for Rabbinic Hebrew.
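
The text format described in the fastText entry above (one word per line followed by space-separated values, most frequent words first) can be read with a few lines of Python. This is a minimal sketch using only the standard library; the function name `read_vectors` is hypothetical. fastText text files usually begin with a header line holding the vocabulary size and the dimension, so the sketch skips a first line made of exactly two integers; treat that as an assumption and check the file you download. The tiny two-dimensional sample is invented for illustration.

```python
import io
from typing import Dict, Iterable, List, Optional


def read_vectors(lines: Iterable[str], limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read word vectors from text lines of the form 'word v1 v2 ... vn'."""
    vectors: Dict[str, List[float]] = {}
    for index, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if not parts[0]:
            continue  # skip empty lines
        if index == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # assumed header line: "<vocabulary size> <dimension>"
        vectors[parts[0]] = [float(value) for value in parts[1:]]
        # Words are sorted by descending frequency, so stopping early
        # keeps the most frequent ones.
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


# Invented sample in the described format (3 words, 2 dimensions).
sample = "3 2\nשל 0.1 0.2\nאת 0.3 0.4\nעל 0.5 0.6\n"
top_two = read_vectors(io.StringIO(sample), limit=2)
print(list(top_two))
```

With a real file, the same function can be called on an open file handle, e.g. `read_vectors(open(path, encoding="utf-8"), limit=50000)`, where `path` points to the downloaded text file.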

Yonti Levin's Hebrew Tokenizer [Python] {MIT} - A very simple python tokenizer for Hebrew text. No batteries included - No dependencies needed!
Hebrew Tokenizer {?} - Eyal Gruss's Hebrew tokenizer. A field-tested Hebrew tokenizer for dirty texts (ben-yehuda project, bible, cc100, mc4, opensubs, oscar, twitter) focused on multi-word expression extraction.
RFTokenizer [Python] {Apache License 2.0} - A highly accurate morphological segmenter to break up complex word forms

3.1.1.2 Morphological Analysis

The MILA Morphological Analysis Tool [?] {GPLv3} - Takes as input undotted Hebrew text (formatted either as plain text or as tokenized XML following MILA's standards). The Analyzer then returns, for each token, all the possible morphological analyses of the token, reflecting part of speech, transliteration, gender, number, definiteness, and possessive suffix. Free for non-commercial use. (temporarily down)
The MILA Morphological Disambiguation Tool [?] {GPLv3} - Takes as input morphologically-analyzed text and uses a Hidden Markov Model (HMM) to assign scores for each analysis, considering contextual information from the rest of the sentence. For a given token, all analyses deemed impossible are given scores of 0; all n analyses deemed possible are given positive scores. Free for non-commercial use. (temporarily down)
BGU Tagger: Morphological Tagging of Hebrew [Java] {?} - Morphological Analysis, Disambiguation.
AlephBERT {Apache License 2.0} - A large, publicly available pre-trained language model for Modern Hebrew, pre-trained on OSCAR, Hebrew tweets and all of Hebrew Wikipedia, published by the OnlpLab team. This model obtains state-of-the-art results on the tasks of segmentation and Part of Speech Tagging. Github: https://github.com/OnlpLab/AlephBERT
AlephBERTGimmel {CC0 1.0} - a new Hebrew pre-trained language model, trained on the same dataset as the previous SOTA Hebrew PLM AlephBERT, consisting of approximately 2 billion words of text but with a substantially increased vocabulary of 128,000 word pieces. Published as a collaboration of the OnlpLab team and Dicta. Github: https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
TavBERT {MIT} - a BERT-style masked language model over character sequences, published by Omri Keren, Tal Avinari, Prof. Reut Tsarfaty and Dr. Omer Levy.
Verb Inflector [Java] {Apache License 2.0} - A generation mechanism, created as part of Eran Tomer's (erantom@gmail.com) Master thesis, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base-form and an indication of which pattern the verb follows.
HebPipe [Python] {Apache License 2.0} - End-to-end pipeline for Hebrew NLP using off the shelf tools, including morphological analysis, tagging, lemmatization, parsing and more.
YAP morpho-syntactic parser [Go] {Apache License 2.0} - Morphological Analysis, disambiguation and dependency Parser. Morphological Analyzer relies on the BGU Lexicon. [original repository] Demo
SPMRL to UD {Apache License 2.0} - Converts YAP's output from the SPMRL scheme to UD v2.
HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.
Hspell [?] {AGPL-3.0} - Free Hebrew linguistic project including spell checker and morphological analyzer. HspellPy [Python] {AGPL-3.0} - Python wrapper for Hspell.

3.1.1.3 Part-of-speech (POS) Tagging

AlephBERT {Apache License 2.0} - A large, publicly available pre-trained language model for Modern Hebrew, pre-trained on OSCAR, Hebrew tweets and all of Hebrew Wikipedia, published by the OnlpLab team. This model obtains state-of-the-art results on the tasks of segmentation and Part of Speech Tagging. Github: https://github.com/OnlpLab/AlephBERT
AlephBERTGimmel {CC0 1.0} - a new Hebrew pre-trained language model, trained on the same dataset as the previous SOTA Hebrew PLM AlephBERT, consisting of approximately 2 billion words of text but with a substantially increased vocabulary of 128,000 word pieces. Published as a collaboration of the OnlpLab team and Dicta. Github: https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
TavBERT {MIT} - a BERT-style masked language model over character sequences, published by Omri Keren, Tal Avinari, Prof. Reut Tsarfaty and Dr. Omer Levy.
The MILA Morphological Analysis Tool [?] {GPLv3} - Takes as input undotted Hebrew text (formatted either as plain text or as tokenized XML following MILA's standards). The Analyzer then returns, for each token, all the possible morphological analyses of the token, reflecting part of speech, transliteration, gender, number, definiteness, and possessive suffix. Free for non-commercial use. (temporarily down)
HebPipe [Python] {Apache License 2.0} - End-to-end pipeline for Hebrew NLP using off the shelf tools, including morphological analysis, tagging, lemmatization, parsing and more
YAP morpho-syntactic parser [Go] {Apache License 2.0} - Morphological Analysis, disambiguation and dependency Parser. Morphological Analyzer relies on the BGU Lexicon. [original repository] Demo

3.1.1.4 Stemming and Lemmatization

HebPipe [Python] {Apache License 2.0} - End-to-end pipeline for Hebrew NLP using off the shelf tools, including morphological analysis, tagging, lemmatization, parsing and more.
YAP morpho-syntactic parser [Go] {Apache License 2.0} - Morphological Analysis, disambiguation and dependency Parser. Morphological Analyzer relies on the BGU Lexicon. [original repository] Demo

3.1.1.5 Spell Checking and Correction

Shtey Shekel {MIT} - Wikiproject for correcting grammar mistakes. (Heuristic) positive annotations can be derived from query.
Hspell [?] {AGPL-3.0} - Free Hebrew linguistic project including spell checker and morphological analyzer. HspellPy [Python] {AGPL-3.0} - Python wrapper for Hspell.

3.1.1.6 Diacritization/Vocalization

Nakdan (Paper) - Tool for Automatic and semi-automatic Nikud for Hebrew texts. Avi Shmidman, Shaltiel Shmidman, Moshe Koppel, and Yoav Goldberg. 2020. Nakdan: Professional Hebrew diacritizer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 197–203, Online. Association for Computational Linguistics.
Nakdimon (Paper, code, data) - Hebrew diacritizer. Elazar Gershuni and Yuval Pinter: Restoring Hebrew Diacritics Without a Dictionary. Demo in Replicate.
UNIKUD {MIT} - Morris Alper's open-source tool for adding vowel signs (Nikud) to Hebrew text, uses no rule-based logic, built with a CANINE transformer network. An interactive demo is available at https://huggingface.co/spaces/malper/unikud Blog post: https://towardsdatascience.com/unikud-adding-vowels-to-hebrew-text-with-deep-learning-powered-by-dagshub-56d238e22d3f .
Hebrew OCR with Nikud [Python] {?} - A program to convert Hebrew text files (without Nikud) to text files with the correct Nikud. Developed by Adi Oz and Vered Shani.

3.1.1.7 Stopwords Removal

3.1.1.8 Language modeling

Legal-HeBERT {?} - a BERT model for Hebrew legal and legislative domains. It is intended to improve legal NLP research and tool development in Hebrew. Avichay Chriqui, Dr. Inbal Yahav Shenberger and Dr. Ittai Bar-Siman-Tov released two versions of Legal-HeBERT: the first version is a fine-tuned model of HeBERT applied to legal and legislative documents. The second version uses HeBERT's architecture guidelines to train a BERT model from scratch.

HeRo {?} - A RoBERTa-based language model for Hebrew, presenting state-of-the-art results on sentiment analysis, named entity recognition and question answering.

3.1.2.2 Sentiment Analysis

HeRo {?} - A RoBERTa-based language model for Hebrew, presenting state-of-the-art results on sentiment analysis, named entity recognition and question answering.
AlephBERT {Apache License 2.0} - A large, publicly available pre-trained language model for Modern Hebrew, pre-trained on OSCAR, Hebrew tweets and all of Hebrew Wikipedia, published by the OnlpLab team. Github: https://github.com/OnlpLab/AlephBERT
AlephBERTGimmel {CC0 1.0} - a new Hebrew pre-trained language model, trained on the same dataset as the previous SOTA Hebrew PLM AlephBERT, consisting of approximately 2 billion words of text but with a substantially increased vocabulary of 128,000 word pieces. Published as a collaboration of the OnlpLab team and Dicta. Github: https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
Neural Sentiment Analyzer for Modern Hebrew [?] {MIT} - This code and dataset provide an established benchmark for neural sentiment analysis for Modern Hebrew.
HeBERT {MIT} - HeBERT is a Hebrew pretrained language model for Polarity Analysis and Emotion Recognition, published by Dr. Inbal Yahav Shenberger and Avichay Chriqui. It is based on Google's BERT architecture and uses the BERT-Base configuration. HeBERT was trained on three datasets: OSCAR, a Hebrew dump of Wikipedia, and Emotion User Generated Content (UGC) data that was collected for the purpose of this study. The model was evaluated on two downstream tasks: emotion recognition (the HebEMO model) and sentiment analysis. (https://huggingface.co/avichr/heBERT?fbclid=IwAR2Lo9pkN5HLZmtFiFwcIDWyXR9gyP646pyFzNSUUP_djalAkewvB9p8E_o)

3.1.2.3 Emotion Detection

Hebrew Psychological Lexicons {Apache License 2.0} - Easy-to-use Python interface for Hebrew clinical psychology text analysis. Useful for various psychology applications such as detecting emotional state, well being, relationship quality in conversation, identifying topics (e.g., family, work) and many more.
HeBERT {MIT} - HeBERT is a Hebrew pretrained language model for Polarity Analysis and Emotion Recognition, published by Dr. Inbal Yahav Shenberger and Avichay Chriqui. It is based on Google's BERT architecture and uses the BERT-Base configuration. HeBERT was trained on three datasets: OSCAR, a Hebrew dump of Wikipedia, and Emotion User Generated Content (UGC) data that was collected for the purpose of this study. The model was evaluated on two downstream tasks: emotion recognition (the HebEMO model) and sentiment analysis. (https://huggingface.co/avichr/heBERT?fbclid=IwAR2Lo9pkN5HLZmtFiFwcIDWyXR9gyP646pyFzNSUUP_djalAkewvB9p8E_o)

3.1.2.4 Text Summarization

3.1.2.5 Text Classification

LongHeRo {?} - State-of-the-art Longformer language model for Hebrew.
Legal-HeBERT {?} - a BERT model for Hebrew legal and legislative domains. It is intended to improve legal NLP research and tool development in Hebrew. Avichay Chriqui, Dr. Inbal Yahav Shenberger and Dr. Ittai Bar-Siman-Tov released two versions of Legal-HeBERT: the first version is a fine-tuned model of HeBERT applied to legal and legislative documents. The second version uses HeBERT's architecture guidelines to train a BERT model from scratch.
Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew - The weights (i.e. a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model. Trained on the Hebrew Wikipedia corpus.

3.1.2.6 Topic Classification

Hebrew Psychological Lexicons {Apache License 2.0} - Easy-to-use Python interface for Hebrew clinical psychology text analysis. Useful for various psychology applications such as detecting emotional state, well being, relationship quality in conversation, identifying topics (e.g., family, work) and many more.

3.1.2.7 Topic Modeling

BGU NLP - LemLDA: an LDA Package for Hebrew [?] {GPLv3} - The package is based on Heinrich's java implementation of collapsed Gibbs sampling with an extra variable to model the generative nature of lemmas in Hebrew.

HeRo {?} - A RoBERTa-based language model for Hebrew, presenting state-of-the-art results on sentiment analysis, named entity recognition and question answering.
AlephBERT {Apache License 2.0} - A large, publicly available pre-trained language model for Modern Hebrew, pre-trained on OSCAR, Hebrew tweets and all of Hebrew Wikipedia, published by the OnlpLab team. This model obtains state-of-the-art results on the tasks of segmentation, Part of Speech Tagging, Named Entity Recognition, and Sentiment Analysis. Github: https://github.com/OnlpLab/AlephBERT
AlephBERTGimmel {CC0 1.0} - a new Hebrew pre-trained language model, trained on the same dataset as the previous SOTA Hebrew PLM AlephBERT, consisting of approximately 2 billion words of text but with a substantially increased vocabulary of 128,000 word pieces. Published as a collaboration of the OnlpLab team and Dicta. Github: https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
TavBERT {MIT} - a BERT-style masked language model over character sequences, published by Omri Keren, Tal Avinari, Prof. Reut Tsarfaty and Dr. Omer Levy.
Neural Modeling for Named Entities and Morphology (NEMO2) {Apache License 2.0} - OnlpLab's code and models for neural modeling of Hebrew NER. Described in the TACL paper Neural Modeling for Named Entities and Morphology (NEMO2).
MDTEL {?} - Yonatan Bitton's code that recognizes medical entities in a Hebrew text.
HebSpacy {MIT} - A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including GPE, PER, LOC and ORG.
HebSafeHarbor {MIT} - A de-identification toolkit for clinical text in Hebrew. Demo

3.1.3.2 Semantic Role Labeling (SRL)

3.1.3.3 Temporal Information Extraction

HebSafeHarbor {MIT} - A de-identification toolkit for clinical text in Hebrew. Demo

3.1.3.4 Event Extraction

3.1.3.5 Coreference Resolution

HebPipe [Python] {Apache License 2.0} - End-to-end pipeline for Hebrew NLP using off the shelf tools, including morphological analysis, tagging, lemmatization, parsing and more.

The Automatic Hebrew Transcriber - Automatically transcribes text from Hebrew audio and video files. (down, link not found)

3.1.4.5 Optical Character Recognition (OCR)

Text-Fabric [Python] {CC BY-NC 4.0} - A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.
Hebrew OCR with Nikud [Python] {?} - A program to convert Hebrew text files (without Nikud) to text files with the correct Nikud. Developed by Adi Oz and Vered Shani.

3.1.4.6 Language Generation

Verb Inflector [Java] {Apache License 2.0} - A generation mechanism, created as part of Eran Tomer's (erantom@gmail.com) Master's thesis, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an indication of which pattern the verb follows.
HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.

3.1.4.7 Machine Translation

word2word {Apache License 2.0} - Easy-to-use Python interface for accessing top-k word translations and for building a new bilingual lexicon from a custom parallel corpus.

3.2 Models by Type

3.2.1 Pre-Trained Language Models

AlephBERT {Apache License 2.0} - A large, publicly available pre-trained language model for Modern Hebrew, pre-trained on OSCAR, a corpus of Hebrew tweets, and all of Hebrew Wikipedia, published by the OnlpLab team. The model obtains state-of-the-art results on segmentation, part-of-speech tagging, named entity recognition, and sentiment analysis. GitHub: https://github.com/OnlpLab/AlephBERT
AlephBERTGimmel {CC0 1.0} - A Hebrew pre-trained language model trained on the same dataset as AlephBERT, the previous state-of-the-art Hebrew PLM, consisting of approximately 2 billion words of text, but with a substantially larger vocabulary of 128,000 word pieces. Published as a collaboration of the OnlpLab team and Dicta. GitHub: https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
HeBERT {MIT} - A Hebrew pre-trained language model for polarity analysis and emotion recognition, published by Dr. Inbal Yahav Shenberger and Avichay Chriqui. It is based on Google's BERT architecture, in the BERT-Base configuration. HeBERT was trained on three datasets: OSCAR, a Hebrew dump of Wikipedia, and emotion user-generated content (UGC) data collected for the purpose of the study. The model was evaluated on two downstream tasks: emotion recognition (the HebEMO model) and sentiment analysis. (https://huggingface.co/avichr/heBERT?fbclid=IwAR2Lo9pkN5HLZmtFiFwcIDWyXR9gyP646pyFzNSUUP_djalAkewvB9p8E_o)
TavBERT {MIT} - a BERT-style masked language model over character sequences, published by Omri Keren, Tal Avinari, Prof. Reut Tsarfaty and Dr. Omer Levy.
BEREL {?} - BERT Embeddings for Rabbinic-Encoded Language - DICTA's pre-trained language model (PLM) for Rabbinic Hebrew.
Legal-HeBERT {?} - A BERT model for the Hebrew legal and legislative domains, intended to improve legal NLP research and tool development in Hebrew. Avichay Chriqui, Dr. Inbal Yahav Shenberger and Dr. Ittai Bar-Siman-Tov released two versions of Legal-HeBERT: the first is HeBERT fine-tuned on legal and legislative documents; the second follows HeBERT's architecture guidelines to train a BERT model from scratch.
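
As a rough sketch of how one of the masked language models listed above might be queried, the snippet below wraps a Hugging Face `transformers` fill-mask pipeline. The model id `avichr/heBERT` is taken from the HeBERT link above; the helper names `load_fill_mask` and `top_predictions` are hypothetical and used only for illustration. It assumes the `transformers` package is installed, that the model can be downloaded, and that the input contains exactly one `[MASK]` token.

```python
from typing import Callable, List, Tuple


def load_fill_mask(model_id: str = "avichr/heBERT"):
    """Build a fill-mask pipeline (downloads the model on first use)."""
    # Imported lazily so that top_predictions() can be used and tested
    # without transformers being installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id, tokenizer=model_id)


def top_predictions(fill_mask: Callable, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return up to k (token, score) pairs for the single [MASK] in `text`.

    `fill_mask` is any callable that behaves like a transformers
    fill-mask pipeline: it accepts `top_k` and returns a list of dicts
    with "token_str" and "score" keys.
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], float(r["score"])) for r in results]
```

Calling `top_predictions(load_fill_mask(), "...")` with a Hebrew sentence containing `[MASK]` would return the model's highest-scoring fillers; the actual tokens and scores depend on the model version, so none are shown here.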

3.2.2 Fine-Tuned Language Models

TavBERT {MIT} - a BERT-style masked language model over character sequences, published by Omri Keren, Tal Avinari, Prof. Reut Tsarfaty and Dr. Omer Levy.
Legal-HeBERT {?} - A BERT model for the Hebrew legal and legislative domains, intended to improve legal NLP research and tool development in Hebrew. Avichay Chriqui, Dr. Inbal Yahav Shenberger and Dr. Ittai Bar-Siman-Tov released two versions of Legal-HeBERT: the first is HeBERT fine-tuned on legal and legislative documents; the second follows HeBERT's architecture guidelines to train a BERT model from scratch.
Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew - The weights (i.e. a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model, trained on the Hebrew Wikipedia corpus.
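
The TavBERT entry above describes a masked language model over character sequences. To illustrate what masking at the character level (rather than at the word-piece level) can look like, here is a minimal sketch. It is not TavBERT's actual pre-training procedure; the function name `mask_characters`, the `[MASK]` symbol and the default 15% rate are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # placeholder symbol, assumed for this sketch


def mask_characters(text: str, mask_rate: float = 0.15, seed: int = 0) -> Tuple[List[str], List[int]]:
    """Split `text` into characters and replace a random subset with MASK.

    Returns the masked character sequence and the sorted masked positions.
    """
    chars = list(text)
    if not chars:
        return [], []
    rng = random.Random(seed)
    # Mask at least one character, and never more than the text contains.
    n_mask = min(len(chars), max(1, round(len(chars) * mask_rate)))
    positions = sorted(rng.sample(range(len(chars)), n_mask))
    for i in positions:
        chars[i] = MASK
    return chars, positions


masked, positions = mask_characters("שלום עולם", mask_rate=0.3, seed=1)
print(masked, positions)
```

A model trained this way predicts the original character at each masked position, so it needs no word-piece vocabulary at all.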

3.2.3 Multilingual Models

BERT's multilingual model - Trained (also) on Hebrew.
Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew - The weights (i.e. a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model, trained on the Hebrew Wikipedia corpus.

3.2.4 Pipelines and Parsers

HebPipe [Python] {Apache License 2.0} - End-to-end pipeline for Hebrew NLP using off-the-shelf tools, including morphological analysis, tagging, lemmatization, parsing and more.
YAP morpho-syntactic parser [Go] {Apache License 2.0} - Morphological Analysis, disambiguation and dependency Parser. Morphological Analyzer relies on the BGU Lexicon. [original repository] Demo
SPMRL to UD {Apache License 2.0} - Converts YAP's output from the SPMRL scheme to UD v2.
HebSpacy {MIT} - A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including GPE, PER, LOC and ORG.
HebSafeHarbor {MIT} - A de-identification toolkit for clinical text in Hebrew. Demo

3.2.5 Causal Language Models (CLM)

Hebrew GPT neo {MIT} - Doron Adler's Hebrew text generation model based on EleutherAI's gpt-neo.

4 Commercial and Online Services

DICTA {CC-BY-SA 4.0} - Analytical tools for Jewish texts. They also have a GitHub organization.
wordfreq 3.0.3 {MIT} - wordfreq is a Python library for looking up the frequencies of words in 44 languages, including Hebrew. The Hebrew data is based on Wikipedia, OPUS OpenSubtitles 2018 and SUBTLEX, Google Books Ngrams 2012, Web text from OSCAR and Twitter.
Eyfo - A commercial engine for search and entity tagging in Hebrew.
Melingo's ICA (Intelligent Content Analysis) - A text analysis and textual categorized entity extraction API for Hebrew, Arabic and Farsi texts.
Genius - Automatic analysis of free text in Hebrew.
AlmaReader - Online text-to-speech service for Hebrew.
Amnon The Transcriber - A WhatsApp bot that receives a voice note and transcribes it to text.
Callee - A WhatsApp bot that receives a voice note, transcribes it to text, and also summarizes it (as text).
verbit.ai - Transcription.
Text Analytics for health containers
Hebrew-Nlp
HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.

5 Annotation Tools

LightTag - A tool for managing annotation projects. Handles right-to-left and part-of-word marking. Tutorial video: https://www.youtube.com/watch?v=eTlrTC_n_yg
Recogito [Scala, JavaScript, HTML] {Apache License 2.0} - A tool for linked data annotation.
CATMA [HTML, Java] {unclear} - A web-based tool for research and collaboration over text data. Handles right-to-left and part-of-word marking. See the system itself here: http://portal.catma.de/catma/, and the code here: https://github.com/mpetris/catma
WebAnno [Java] {Apache License 2.0} - A web-based annotation tool. Supports right-to-left text and project management. Repository: https://github.com/webanno/webanno
Arethusa: Annotation Environment [JavaScript] {MIT} - A backend-independent client-side annotation framework. Repository here.
rasa-nlu-trainer [JavaScript] {MIT} - A tool to edit training examples for rasa NLU. Handles right-to-left and part-of-word marking.
brat [Python, JavaScript] {MIT} - An online environment for collaborative text annotation. Does not support right-to-left. Repository here.
openNLP [Java] {Apache License 2.0} - OpenNLP has a tagging tool.
opeNER [Ruby, HTML, Java, Python] - opeNER has a tagging tool.
pybossa [Python] {AGPL-3.0} - A framework for crowdsourcing of data analysis and enrichment tasks. GitHub.
TextThrasher [JavaScript, Python] - A crowdsourced text annotator. Built with React and Redux (possibly also with pybossa).
SHEBANQ - System for HEBrew Text: ANnotations for Queries and Markup. SHEBANQ is an online environment for studying the Hebrew Bible.
doccano {MIT} - an open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence to sequence tasks. So, you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.

6 Evaluation

6.1 Benchmark Datasets

Hebrew SimLex-999 - A Hebrew version of the Simlex-999 resource for the evaluation of models that learn the meaning of words and concepts. A copy can also be found in the Attract-Repel repository. Another copy is found in this repository.

7 Labs & Researchers

7.1 Academia

Bar Ilan University:
- The ONLP Lab
  - Prof. Reut Tsarfaty - Head of the ONLP Lab.
  - Dan Bareket - Data Scientist.
- The Natural Language Processing Lab at Bar Ilan University [Twitter]:
- Prof. Moshe Koppel
- Dr. Avi Shmidman
The Open University of Israel
- The Open Media and Information Lab (OMILab) at the Open University of Israel - An interdisciplinary center for research and for teaching in new media and related areas, such as big data, information science, network cultures and digital sociology.
  - Dr. Vered Silber-Varod - Director of the Open Media and Information Lab (OMILab). Research interests and publications focus on various aspects of speech sciences, with expertise in speech prosody, acoustic phonetics, and speech communication and text analytics.
- Dr. Anat Lerner, Senior Lecturer - Interested in speech prosody analyses, combinatorial auctions and computer Networks (especially Ad-Hoc networks, mobile and cellular networks).
Ben-Gurion University:
- Natural Language Processing Lab at Ben Gurion University
- Dr. Oren Tzur
University of Haifa:
- Prof. Shuly Wintner
- Dr. Einat Minkov - Works on information extraction and semantics, as well as other natural language processing applications. Also interested in machine learning and its application to NLP problems.
Tel Aviv University:
- Prof. Jonathan Berant
The Technion:
- Dr. Yonatan Belinkov - Assistant Professor at the faculty of Computer Science. Focus: interpretability and robustness.
- Prof. Alon Itai (retired)
- Prof. Roi Reichart - An Assistant Professor at the faculty of Industrial Engineering and Management of the Technion, working on Natural Language Processing (NLP). Interested in language learning in its context and in designing models that integrate domain and world knowledge with data-driven methods.
- Prof. Joseph (Yossi) Keshet
The Hebrew University of Jerusalem:
- Prof. Ronen Feldman - Feldman's main areas of research are natural language processing, entity extraction and text relations, text sentiment analysis, and language processing for algorithmic trading. He is one of the founders of the discipline of text mining.
- Prof. Ari Rappoport - Prof. Rappoport's main contribution is in neuroscience, where he developed a comprehensive theory of the brain. His computer science area of interest is language (computational linguistics and natural language processing), from cognitive science and machine learning perspectives.
- Prof. Omri Abend - Fields of interest are computational linguistics and natural language processing, specifically semantic (meaning) representation from a computational perspective. His research is tightly linked to statistical learning, language technology (such as machine translation and information extraction), and computational modeling of child language acquisition.
- Prof. Dafna Shahaf - Prof. Shahaf's research focuses on helping people make sense of the world. She designs algorithms that help people understand the underlying structure of complex topics, and connect the dots between different pieces. She also likes to formalize intuitive notions; see recent work on Computational Humor.
- The Neurolinguistics Laboratory at the Edmond and Lily Safra Center for Brain Sciences (ELSC):
  - Prof. Yosef Grodzinsky - Research fields: functional anatomy of language, linguistic theory (syntax, semantics), language acquisition, aphasia, individual variation.

7.2 Non-Profit

Allen Institute for AI - Israel
- Prof. Yoav Goldberg
- Dr. Jonathan Berant
