האיגוד הישראלי לטכנולוגיות שפת אנוש
الرابطة الإسرائيلية لتكنولوجيا اللغة البشرية
The Israeli Association of Human Language Technologies
https://www.iahlt.org
A Universal Dependencies treebank with named entities for contemporary Hebrew, covering Knesset protocols. It is released under CC-BY-4.0; see COPYING for details.
The Universal Dependencies (UD) Hebrew-IAHLTKnesset treebank is a work in progress. The dataset consists of 2619 sentences, with one annotation per sentence (45538 tokens and 4382 unique lemmas in total), annotated for dependency syntax, part-of-speech, lemmatization and morphological analysis. The texts were sampled from Knesset protocols.
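The counts above can be recomputed from a CoNLL-U file with a few lines of Python. The following is a minimal sketch, not the script used to produce the figures in this README: it assumes the standard 10-column CoNLL-U layout and counts syntactic words (lines with a plain integer ID), so its totals may differ from the numbers above if those were counted differently. The function name `treebank_stats` and the sample data are hypothetical.

```python
def treebank_stats(conllu_text):
    """Count sentences, syntactic words and unique lemmas in CoNLL-U text."""
    sentences = 0
    words = 0
    lemmas = set()
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level metadata comment, not a token line.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; skip it in this sketch.
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        # Plain integer IDs are syntactic words; ranges such as "1-2"
        # (multiword tokens) and decimals such as "1.1" (empty nodes) are skipped.
        if cols[0].isdigit():
            words += 1
            lemmas.add(cols[2])
    return {"sentences": sentences, "words": words, "unique_lemmas": len(lemmas)}


# Tiny made-up sample (not taken from the treebank).
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "# text = ab",
    "\t".join(["1-2", "ab", "_", "_", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["1", "a", "a", "ADP", "_", "_", "2", "case", "_", "_"]),
    "\t".join(["2", "b", "b", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
    "# sent_id = demo-2",
    "\t".join(["1", "b", "b", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
])

print(treebank_stats(SAMPLE))
```

To run it on a real file, read the file with `encoding="utf-8"` and pass its contents to `treebank_stats`.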
Note that although the sentnumber values are not necessarily consecutive, the sentences are in fact sorted according to the original order.
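One way to check this ordering is sketched below. It assumes the sentence metadata has already been loaded as a list of dictionaries, that parnumber and sentnumber hold integer values, and that "original order" means increasing (parnumber, sentnumber) within each doc_id. These are assumptions for illustration, and the function name `is_in_document_order` is hypothetical.

```python
def is_in_document_order(records):
    """Return True if, within each document, (parnumber, sentnumber) strictly increases.

    Gaps in the numbering are allowed; only the relative order is checked.
    """
    last_seen = {}
    for rec in records:
        doc = rec["doc_id"]
        position = (int(rec["parnumber"]), int(rec["sentnumber"]))
        if doc in last_seen and position <= last_seen[doc]:
            return False
        last_seen[doc] = position
    return True


# Made-up metadata: sentnumber skips values but the order is preserved.
ordered = [
    {"doc_id": "d1", "parnumber": "3", "sentnumber": "1"},
    {"doc_id": "d1", "parnumber": "3", "sentnumber": "4"},
    {"doc_id": "d1", "parnumber": "7", "sentnumber": "2"},
    {"doc_id": "d2", "parnumber": "1", "sentnumber": "5"},
]

print(is_in_document_order(ordered))
print(is_in_document_order(list(reversed(ordered))))
```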
The UD Hebrew-IAHLT treebank consists of texts originating from Knesset protocols. The schema for the UD Hebrew-IAHLT treebank is based on the conversion of the Hebrew Treebank (HTB) into UD V2 and is checked against the Universal Dependencies validator as of UD release V2.8.
The HTB was initially converted automatically, then a subset of the converted data was manually validated and adopted as a gold standard for training the model for UD parsing used in Hebrew-IAHLT.
The parsed data has been manually edited to correct parsing errors. Quality assurance (QA) scripts were used to apply corrections following updates in the schema. All sentences in this release pass level-5 validation of the Universal Dependencies validator.
- fields for technical use:
  - sent_id - a unique identifier for the tree within this release
  - text - the (Hebrew) text of the original sentence
  - url - the link for the source entry/article
  - source - the source of the sentence
  - doc_id - a unique identifier for the source document
  - protocol - the source protocol file
  - parnumber - the paragraph sequence number within the source document
  - sentnumber - the sentence sequence number within the source paragraph
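The fields listed above can be read with a short parser. This is a minimal sketch that assumes they are stored as standard CoNLL-U sentence comments of the form `# key = value`; the function name `read_metadata` and all sample values are hypothetical and not taken from the treebank.

```python
def read_metadata(conllu_text):
    """Collect the '# key = value' comments of each sentence into a dict."""
    records = []
    current = {}
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            # Split on the first '=' only, so values such as URLs may contain '='.
            key, sep, value = line[1:].partition("=")
            if sep:
                current[key.strip()] = value.strip()
        elif not line.strip():
            # A blank line ends the sentence; store what was collected.
            if current:
                records.append(current)
                current = {}
        # Token lines are ignored in this sketch.
    if current:
        records.append(current)
    return records


# Made-up example with placeholder values.
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "# text = placeholder sentence",
    "# url = https://example.org/protocol?id=1",
    "# doc_id = demo-doc",
    "# parnumber = 12",
    "# sentnumber = 3",
    "\t".join(["1", "placeholder", "placeholder", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
])

for record in read_metadata(SAMPLE):
    print(record["sent_id"], record["parnumber"], record["sentnumber"])
```

Values are returned as strings; convert parnumber and sentnumber with `int()` if numeric comparison is needed.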
The annotation guidelines can be found at https://github.com/ivrit/IAHLT-HTB-GUIDELINES
We would like to thank all the people who contributed to this corpus:
Emmanuelle Ko, Israel Landau, Nick Howell, Noam Ordan, Omer Strass, Shira Wigderson, Yifat Ben Moshe