A corpus of session protocols of the Knesset (Israeli parliament) between January 2004 and November 2005.
The Knesset Meetings Corpus 2004-2005 is made up of two components:
- Raw texts - 282 files made up of 867,725 lines together. These can be downloaded in two formats:
  - As `doc` files, encoded using `windows-1255` encoding:
    - `kneset16.zip` - Contains 164 text files made up of 543,228 lines together. [MILA host] [Github Mirror]
    - `kneset17.zip` - Contains 118 text files made up of 324,497 lines together. [MILA host] [Github Mirror]
  - As `txt` files, encoded using `utf8` encoding:
    - `kneset.tar.gz` - An archive of all the raw text files, divided into two folders: [Github mirror]
      - `16` - Contains 164 text files made up of 543,228 lines together.
      - `17` - Contains 118 text files made up of 324,497 lines together.
    - `knesset_txt_16.tar.gz` - Contains 164 text files made up of 543,228 lines together. [MILA host] [Github Mirror]
    - `knesset_txt_17.zip` - Contains 118 text files made up of 324,497 lines together. [MILA host] [Github Mirror]
- Tokenized and morphologically tagged texts - Tagged versions exist only for the files in the `16` folder. The texts are encoded using MILA's XML schema for corpora. These can be downloaded in two ways:
  - `knesset_tagged_16.tar.gz` - An archive of all tokenized and tagged files. [MILA host] [Archive.org mirror]
  - By cloning this repository, as the unarchived version of these files can be found under the `knesset_tagged` folder.
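This README does not list the element names used by MILA's XML schema, so a reasonable first step with the tagged files is to take a census of the tags that actually occur. The following is a minimal sketch that makes no assumptions about the schema; the function name `count_element_tags` is hypothetical, and any path you pass to it is your own.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_element_tags(xml_path):
    """Return a Counter mapping each element tag to its number of occurrences."""
    counts = Counter()
    # iterparse streams the file, so a large corpus file need not fit in memory.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        counts[elem.tag] += 1
        # Children were already counted on their own "end" events,
        # so the finished element can be emptied to save memory.
        elem.clear()
    return counts
```

If the files declare an XML namespace, the tags are reported in ElementTree's `{namespace}name` form.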
The `txt` files are plain `utf8`-encoded text and can be processed like any simple text file.
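As a minimal sketch of such processing, the snippet below counts the files and lines in a folder of the utf8 `txt` files, which lets you compare a local copy against the totals listed above. The function name `count_files_and_lines` and the folder path are assumptions made for illustration, and the result may differ slightly from the published totals depending on how line endings are counted.

```python
from pathlib import Path


def count_files_and_lines(folder, encoding="utf-8"):
    """Return (number of .txt files, total number of lines) under folder."""
    n_files = 0
    n_lines = 0
    for path in sorted(Path(folder).rglob("*.txt")):
        n_files += 1
        with open(path, "rt", encoding=encoding) as f:
            # Iterating over the file object yields one line at a time.
            n_lines += sum(1 for _ in f)
    return n_files, n_lines


if __name__ == "__main__":
    # Hypothetical location of the extracted "16" folder; adjust as needed.
    folder = Path("/data/knesset/16")
    if folder.is_dir():
        print(count_files_and_lines(folder))
```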
Reading the `doc`-formatted raw text files requires software that supports the `windows-1255` encoding.
For example, you can process these files using Python with the following code:
fpath = "/data/knesset/16/00133504.txt"
with open(fpath, 'rt', encoding='windows-1255') as f:
    line = f.readline()
    # or with "for line in f", etc...
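If you would rather keep a utf8 copy of a `windows-1255` file, the same decoding step can be paired with a utf8 write. This is only a sketch: the function name `convert_to_utf8` and both paths are hypothetical, and it assumes the source file really is `windows-1255`-encoded.

```python
def convert_to_utf8(src, dst, src_encoding="windows-1255"):
    """Decode src using src_encoding and write the same text to dst as UTF-8.

    Returns the number of characters converted.
    """
    # newline="" leaves the original line endings untouched on read and write.
    with open(src, "rt", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "wt", encoding="utf-8", newline="") as fout:
        fout.write(text)
    return len(text)
```

A call such as `convert_to_utf8("in.txt", "out.txt")` reads the whole file into memory, which should be acceptable for individual protocol files but is worth revisiting for anything much larger.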
This repository is a mirror of the dataset found on MILA's website.
Zenodo mirror: https://zenodo.org/record/2707356
All Knesset meeting protocols are in the public domain (רשות הציבור) by law. These files are thus in the public domain and do not require any license or public domain dedication to set their status.