Releases: amir-zeldes/gum
V10.1.0 - corrections and minor updates
This is a corrected version of GUM series 10 (no additional documents since V10.0.0)
- Added ExtPos to multiword fixed expressions
- Revised Cxn annotations to follow latest UCxn standard for construction annotation
- Content-identical with UD v2.14
V10.0.0 - added court, essay, letter and podcast genres
This is the first release of GUM series 10, with 16 genres in total.
- Four new growing genres:
  - court: courtroom transcripts
  - essay: argumentative essays
  - letter: personal and professional correspondence on paper (not e-mails)
  - podcast: podcasts on various topics
- Many corrections to all annotation layers
Note on document names compared to V9:
- With the addition of the court genre, one conversation from GUM V9 which is actually from courtroom proceedings has been moved to the new court genre (GUM_conversation_court -> GUM_court_carpet)
- To compensate for the removed conversation, an additional conversation has been added in V10: GUM_conversation_toys
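For scripts that still refer to V9 document names, the rename above can be handled with a small lookup. This is only a sketch: the helper names `v10_name` and `genre_of` are hypothetical, and the `GUM_<genre>_<title>` naming pattern is an assumption inferred from the document names listed in this note.

```python
# The single rename stated in the V10.0.0 release note.
V9_TO_V10 = {"GUM_conversation_court": "GUM_court_carpet"}


def v10_name(v9_name):
    """Map a V9 document name to its V10 name (unchanged if not renamed)."""
    return V9_TO_V10.get(v9_name, v9_name)


def genre_of(doc_name):
    """Extract the genre, assuming names look like GUM_<genre>_<title>."""
    return doc_name.split("_")[1]


if __name__ == "__main__":
    for name in ("GUM_conversation_court", "GUM_conversation_toys"):
        new_name = v10_name(name)
        print(name, "->", new_name, "(genre: " + genre_of(new_name) + ")")
```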
V9.2.0 - RST++, MSeg and CxG
This is the final release of the GUM 9.X series, which is the basis for the contents of the equivalent Universal Dependencies release v2.13. New in this version:
- Enhanced Rhetorical Structure Theory annotations using RST++:
  - Additional, tree-breaking secondary discourse relations
  - Annotation of connectives and many other signaling devices for discourse relations
- Morphological segmentation based on UniMorph in the MSeg annotation (e.g. un-break-able)
- Construction Grammar annotation of constructions in the Cxn annotation
- A second human-written summary for each document in the test set
- Numerous corrections and consistency improvements bringing this corpus and the English Web Treebank (EWT) closer
V9.1.0 - Numerous corrections
- Numerous corrections to all layers
- Consistency improved with other LDC and UD English corpora
- Added xpos tag GW for goeswith handling as in EWT
- MWT fixed for "let's"
- Label consistency with EWT for assigning iobj without obj
- Many RST corrections for the DISRPT shared task
- Data in this version is even with the UD v2.12 release
V9.0.0 - new data, summaries and entity salience
- 20 documents added including more conversational data (total tokens: 203,879)
- Abstractive summaries for each document in metadata
- Annotations for most salient entities in each document
- Foreign language tags identify individual source languages
- New process for reconstructing Reddit text data in top-level folders (see README.md)
- Many corrections to all annotation layers
V8.1.0 - final version of GUM series 8
- Added centering theory annotations (ranked cf, cb, sentence transition types)
- Numerous corrections
- Final version of GUM V8.X ahead of V9 release
V8.0.0 - new data and new RST relations
- 25 documents added including more conversational data (total tokens: 180,849)
- New RST discourse relations, now covering 32 labels in a two-level hierarchy, as discourse constituent and dependency trees
- More consistent UD syntax, including a new obl:agent relation for passive agents
- New Wikidata identifiers for wikification layer (including nested and pronominal mentions; see #97)
- Many corrections to all annotation layers
V7.3.0 - HYPH tokens, RST depth, 6-way infstat, pred/disc coref, MIN spans and XML in deps
Stable version 7.3.0, corresponds to UD version 2.9. Same 168 documents as in 7.2.0 but substantial changes to some annotations and tokenization, leading to more total tokens (152,308).
Changes:
- tokenization now follows EWT and recent LDC corpora in separating hyphenated compounds (e.g. "data-driven" is three tokens)
- new xpos/extended PTB tag for such tokens: HYPH
- added RST depth to discourse relations in .conllu and .rsd files, allowing deterministic conversion of discourse dependencies to fully hierarchical RST constituent trees
- added # newpar comments to conllu files expressing potentially nested block elements, such as paragraphs, headings or bulleted lists
- added a MISC annotation XML to .conllu files expressing all other XML markup in the corpus
- shortened entity bracket format in .conllu files to consolidate with Coref UD data / Universal Anaphora initiative
- removed accessible-generic information status annotations for countries and absolute date expressions
- added information status categories closer to SFB632 guidelines, including in conllu files. Now a six-way distinction: giv:act, giv:inact, acc:inf, acc:com, acc:aggr and new
- added pred and disc coref edge types for indefinite predication and discourse deixis respectively
- added MIN spans and coreference type to entity annotations in .conllu files
- many corrections and additional validations
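To illustrate why the hyphen tokenization change raises the token count, here is a naive sketch that splits a hyphenated compound and tags each hyphen HYPH. The function name `split_hyphenated` is hypothetical, and the real corpus tokenization may treat some hyphenated forms differently. Here every hyphen is split unconditionally, and the other parts are left untagged (`None`).

```python
import re


def split_hyphenated(token):
    """Split a token at hyphens, returning (form, xpos) pairs.

    Hyphens receive the xpos tag HYPH; other parts get None because
    their tags would be assigned by a separate tagging step.
    """
    parts = re.split(r"(-)", token)  # the capture group keeps the hyphens
    return [(p, "HYPH" if p == "-" else None) for p in parts if p]


if __name__ == "__main__":
    print(split_hyphenated("data-driven"))
```

Under this sketch, "data-driven" yields three tokens (data, -, driven), matching the example in the list above.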
V7.2.0 - OntoGUM coreference version and corrections
- Added separate OntoGUM version of coreference annotations following the OntoNotes scheme, in addition to the more comprehensive GUM coreference annotations
- Numerous corrections
V7.1.0 - enhanced dependencies, consistency overhaul and more
(Note: this version contains the content-identical superset of annotations producing UD_English-GUM in Universal Dependencies V2.8)
- Massive round of consistency corrections and harmonization with English Web Treebank, PTB and OntoNotes
- Added enhanced dependencies
- More error validations
- Added multiword tokens to CoNLL-U format (caution: token IDs like 1-2 now in use!)
- Added reconstructed ellipsis tokens to CoNLL-U format (caution: token IDs like 8.1 now in use!)
- Added metadata to CoNLL-U files
- Better escape characters in Wikification
- ANNIS conversion support for null nodes to accommodate ellipsis tokens
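Since token IDs such as 1-2 and 8.1 now appear in the CoNLL-U files, code that assumes every ID is an integer may need adjusting. The following is a minimal sketch of one way to tell the three ID shapes apart. The names `classify_id` and `basic_words` are hypothetical, and the sample sentence is invented for illustration and is not taken from the corpus.

```python
import re

# ID shapes in CoNLL-U: plain integers for words, "a-b" ranges for
# multiword tokens, and "a.b" decimals for empty (ellipsis) nodes.
WORD_ID = re.compile(r"^[0-9]+$")
MWT_ID = re.compile(r"^[0-9]+-[0-9]+$")
EMPTY_ID = re.compile(r"^[0-9]+\.[0-9]+$")


def classify_id(token_id):
    """Return 'word', 'multiword', 'empty', or 'unknown' for an ID string."""
    if WORD_ID.match(token_id):
        return "word"
    if MWT_ID.match(token_id):
        return "multiword"
    if EMPTY_ID.match(token_id):
        return "empty"
    return "unknown"


def basic_words(conllu_sentence):
    """Collect FORM values of ordinary word lines, skipping comment lines,
    multiword range lines, and empty nodes."""
    forms = []
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if classify_id(cols[0]) == "word":
            forms.append(cols[1])
    return forms


# Invented sample: only ID and FORM are filled in, other columns are "_".
SAMPLE = "\n".join([
    "# text = Let's go",
    "\t".join(["1-2", "Let's"] + ["_"] * 8),
    "\t".join(["1", "Let"] + ["_"] * 8),
    "\t".join(["2", "'s"] + ["_"] * 8),
    "\t".join(["3", "go"] + ["_"] * 8),
])

if __name__ == "__main__":
    print(basic_words(SAMPLE))
```

Filtering on the ID shape rather than calling `int()` on the first column avoids errors on range and decimal IDs. In practice, an established CoNLL-U parsing library would likely be a more robust choice.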