A corpus of transcriptions of Knesset (Israeli parliament) meetings between January 2004 and November 2005.

NLPH/knesset-2004-2005

The Knesset Meetings Corpus 2004-2005

A corpus of session protocols of the Knesset (Israeli parliament) between January 2004 and November 2005.

Contents

The Knesset Meetings Corpus 2004-2005 is made up of two components:

  • Raw texts - 282 files totaling 867,725 lines. These can be downloaded in two formats:
    • As doc files, encoded using windows-1255 encoding:
    • As txt files, encoded using utf8 encoding:
      • kneset.tar.gz - An archive of all the raw text files, divided into two folders: [Github mirror]
        • 16 - Contains 164 text files totaling 543,228 lines.
        • 17 - Contains 118 text files totaling 324,497 lines.
      • knesset_txt_16.tar.gz - Contains 164 text files totaling 543,228 lines. [MILA host] [Github mirror]
      • knesset_txt_17.zip - Contains 118 text files totaling 324,497 lines. [MILA host] [Github mirror]
  • Tokenized and morphologically tagged texts - Tagged versions exist only for the files in the 16 folder. The texts are encoded using MILA's XML schema for corpora. These can be downloaded in two ways:
    • knesset_tagged_16.tar.gz - An archive of all tokenized and tagged files. [MILA host] [Archive.org mirror]
    • By cloning this repository: the unarchived files are in the knesset_tagged folder.
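
Since the element names of MILA's XML schema are not described here, a reasonable first step with a tagged file is to survey which elements it actually contains. The following is a minimal sketch that assumes nothing about the schema; the function name `tag_census` and the example path are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_census(xml_path):
    """Count how often each element name occurs in one XML file.

    The file is parsed incrementally, so large files do not have to
    fit in memory as a full tree.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's content once it has been counted
    return counts


# Hypothetical usage; point the path at a file inside knesset_tagged:
# for tag, n in tag_census("knesset_tagged/some_file.xml").most_common():
#     print(tag, n)
```

If the files declare an XML namespace, ElementTree reports element names in the form `{namespace}name`.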

Use

txt format

These can be processed like any simple text file.
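
As a minimal sketch, the extracted txt files can be read with Python's built-in `open` using the utf-8 encoding. The helper below counts files and lines under a folder, which gives a rough check against the figures listed under Contents. The function name `count_lines` and the example path are hypothetical, and the totals may differ slightly depending on how the original counts treated trailing lines.

```python
from pathlib import Path


def count_lines(folder, pattern="*.txt"):
    """Return (number of files, total number of lines) under `folder`."""
    n_files = 0
    n_lines = 0
    # rglob also descends into sub-folders such as 16 and 17.
    for path in sorted(Path(folder).rglob(pattern)):
        with open(path, "rt", encoding="utf-8") as f:
            n_lines += sum(1 for _ in f)
        n_files += 1
    return n_files, n_lines


# Hypothetical usage; adjust the path to where the archive was extracted:
# print(count_lines("/data/knesset_txt/16"))
```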

doc format

Reading the doc-formatted raw text files requires software that supports the windows-1255 encoding.

For example, you can process these files using Python with the following code:

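# Path to a local copy of one of the raw files (adjust to your setup).
fpath = "/data/knesset/16/00133504.txt"
# windows-1255 is the Windows code page for Hebrew.
with open(fpath, 'rt', encoding='windows-1255') as f:
    line = f.readline()
    # or with "for line in f", etc...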
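
If downstream tools expect UTF-8, a windows-1255 file can be re-encoded once and then handled like the txt files. This is a minimal sketch; the function name `reencode` and the example paths are hypothetical.

```python
def reencode(src_path, dst_path,
             src_encoding="windows-1255", dst_encoding="utf-8"):
    """Copy a text file line by line, changing only its character encoding."""
    # newline="" turns off newline translation on both sides, so the
    # original line endings are written back unchanged.
    with open(src_path, "rt", encoding=src_encoding, newline="") as src, \
         open(dst_path, "wt", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical usage:
# reencode("/data/knesset/16/00133504.txt", "/data/knesset_utf8/00133504.txt")
```

Bytes that are not valid windows-1255 raise a `UnicodeDecodeError`, which indicates that the source file uses a different encoding.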

Mirrors

This repository mirrors the dataset hosted on MILA's website.

Zenodo mirror: https://zenodo.org/record/2707356

License

All Knesset meeting protocols are in the public domain (רשות הציבור) by law. These files are thus in the public domain and do not require any license or public domain dedication to establish their status.
