A gold-standard Universal Dependencies corpus for Ukrainian, originally developed for UD by the Institute for Ukrainian, an NGO. [in Ukrainian]
UD Ukrainian comprises 122K tokens in 7060 sentences of fiction, news, opinion articles, Wikipedia, legal documents, letters, posts, and comments, drawn from the last 15 years as well as from the first half of the 20th century.
Consider using the latest version on the `dev` branch on GitHub. It contains the latest stable improvements, while the official releases can be up to 6 months old [discussion].
Major contributors: Natalia Kotsyba, Bohdan Moskalevskyi, Mykhailo Romanenko.
A large portion of the annotation was done by Halyna Samoridna, Ivanka Kosovska, Olha Lytvyn, Oksana Orlenko and by students of the Ukrainian language department of Kyiv-Mohyla Academy (headed by Liudmyla Dyka): Hanna Brovko, Bohdana Matushko, Natalia Onyshchuk, Valeriia Pareviazko, Yaroslava Rychyk, Anastasiia Stetsenko, Snizhana Umanets.
We thank Prof. Larysa Masenko for guidance.
Project homepage (in Ukrainian)
You can also browse the entire treebank in Brat.
set | sentences | ~tokens |
---|---|---|
train | 5496 | 92K |
dev | 672 | 13K |
test | 892 | 17K |
TOTAL | 7060 | 122K |
See stats.xml for details.
Morphology is annotated using a 2+1 scheme. Syntax is annotated in a single pass, followed by a supervisor's check. Consistency is further enforced by ~300 validation and autofix rules (see warnings page) and by investigating errors made by a trained parser.
The data is split into train/dev/test linearly by hand at roughly 75%/10%/15%, balancing genre and complexity. Some large documents are divided across the sets.
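As a quick sanity check, the split proportions can be recomputed from the table above. This is only an illustrative sketch: the counts are copied from that table, the token figures are rounded to the nearest thousand, and the helper name `shares` is made up for this example.

```python
# Sentence counts and approximate token counts copied from the table above.
splits = {
    "train": (5496, 92_000),
    "dev": (672, 13_000),
    "test": (892, 17_000),
}


def shares(column):
    """Return each set's percentage of the total for the given column index."""
    total = sum(counts[column] for counts in splits.values())
    return {
        name: round(100 * counts[column] / total, 1)
        for name, counts in splits.items()
    }


sentence_shares = shares(0)  # share of sentences per set
token_shares = shares(1)     # share of (rounded) tokens per set

print("sentences:", sentence_shares)
print("tokens:   ", token_shares)
```

With these rounded figures, the token shares come out closer to 75%/10%/15% than the sentence shares do, which suggests the stated ratio refers to tokens rather than sentences.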
UD Ukrainian data conforms to CoNLL-U format with the following specifics:
- Sentence-level comments:
  - Document boundaries as `# newdoc id = ...`.
  - Sentence-level paragraph boundaries as `# newpar id = ...`.
  - Document titles as `# doc_title = ...`.
  - Document authors as `# author = ...`.
  - Document sources as `# source = ...`.
  - Czech-style transliteration as `# translit = ...`.
  - Gaps in the text are marked on the sentence following the gap as:
    - `# annotation_gap` for sentences not exported to CoNLL-U because the annotator was unable to parse them with confidence (e.g. new guidelines need to be created);
    - `# gap` for intentional gaps in texts (selected fragments).
- XPOSTAG column contains MTE tag with `U` for punctuation. UPOS+FEATS contain all the information in XPOSTAG and more. XPOSTAG is intended for legacy applications.
- DEPS column contains Enhanced Dependencies.
- MISC column:
  - Token-level paragraph boundaries as `NewPar=Yes`.
  - Token ids as `Id=xxxx`.
  - `SpaceAfter=No` markers are present.
  - Form (`Translit`) and lemma (`LTranslit`) transliterations are present.
  - The pipe (`|`) character is escaped with `\p`. Backslash is `\\`. See issue #569.
- Document, paragraph, sentence, and token ids are 4-character base-32 numbers. They survive treebank updates.
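
The conventions above can be read with a few lines of plain Python. The following is a minimal sketch, not the project's own tooling: the function names are invented for illustration, the id values in the usage lines are made up, and it assumes comments follow the usual `# key = value` shape and that `\p` and `\\` are the only escapes in MISC.

```python
import re


def parse_comment(line):
    """Split a '# key = value' comment into (key, value).

    Bare flags such as '# annotation_gap' get a value of None.
    """
    body = line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    return key.strip(), (value.strip() if sep else None)


def unescape_misc(value):
    r"""Undo the MISC escaping: '\p' becomes '|' and '\\' becomes '\'."""
    table = {"p": "|", "\\": "\\"}
    # Scanning left to right keeps '\\p' (escaped backslash, then 'p') intact.
    return re.sub(r"\\(.)", lambda m: table.get(m.group(1), m.group(0)), value)


def parse_misc(misc):
    """Parse a MISC cell such as 'Id=...|SpaceAfter=No' into a dict."""
    if misc == "_":
        return {}
    fields = {}
    for item in misc.split("|"):  # safe: literal pipes are escaped as \p
        key, _, value = item.partition("=")
        fields[key] = unescape_misc(value)
    return fields


# Usage with made-up values (not taken from the treebank):
print(parse_comment("# newdoc id = 0abc"))
print(parse_comment("# annotation_gap"))
print(parse_misc(r"Id=0abc|SpaceAfter=No|Translit=x\py"))
```

Splitting MISC on `|` before unescaping works precisely because literal pipes never appear unescaped in the column.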
- Empty (null) nodes for elided predicates. Elided predicates are manually reconstructed with word forms and full morphological info. Coverage: only ~200 instances done.
- Propagation of incoming dependencies to conjuncts. Propagated automatically. For heterogeneous conjuncts, a relation guesser is employed. Coverage: full.
- Propagation of outgoing dependencies from conjuncts. Dependents of first conjuncts are propagated only if they are manually marked as shared. Coverage: ~75% of the sentences.
- Additional subject relations for control and raising constructions. All `xcomp` subjects are annotated manually as `nsubj:xsubj`/`csubj:x`. Subjects of `xcomp:pred` (secondary predication) are `nsubj:pred`/`csubj:pred`. The latter are also used for the subjects of `advcl:pred` (see #476). Coverage: full.
- Coreference in relative clause constructions. All relative clauses are manually annotated with enhanced dependencies. This includes all types mentioned in the universal docs plus Ukrainian clauses that use personal pronouns as relativizers: вузол, що його не переріжеш “the-knot, that it.Acc not you-can-cut”. Coverage: full.
- Case information. We don’t case-mark relation names because this doesn’t bring any new information [discussion].
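
To illustrate how the enhanced relations listed above can be read from the DEPS column, here is a small sketch. It relies only on the standard CoNLL-U layout (ten tab-separated columns, DEPS as `head:relation` pairs joined by `|`). The function names are hypothetical and the sample sentence is invented with English placeholder words; it is not taken from the treebank.

```python
def parse_deps(deps):
    """Parse a DEPS cell into a list of (head, relation) pairs.

    Heads stay strings because empty nodes use ids such as '8.1'.
    """
    if deps == "_":
        return []
    pairs = []
    for item in deps.split("|"):
        head, _, relation = item.partition(":")  # split at the first colon only
        pairs.append((head, relation))
    return pairs


def enhanced_edges(sentence_lines):
    """Yield (token_id, head, relation) for every enhanced dependency."""
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        token_id, deps = columns[0], columns[8]
        for head, relation in parse_deps(deps):
            yield token_id, head, relation


def row(*columns):
    """Join column values into one tab-separated CoNLL-U line."""
    return "\t".join(columns)


# Invented example with a control construction ("She wants to sleep").
sample = [
    row("1", "She", "_", "PRON", "_", "_", "2", "nsubj", "2:nsubj|4:nsubj:xsubj", "_"),
    row("2", "wants", "_", "VERB", "_", "_", "0", "root", "0:root", "_"),
    row("3", "to", "_", "PART", "_", "_", "4", "mark", "4:mark", "_"),
    row("4", "sleep", "_", "VERB", "_", "_", "2", "xcomp", "2:xcomp", "_"),
]

xsubj_edges = [edge for edge in enhanced_edges(sample) if edge[2] == "nsubj:xsubj"]
print(xsubj_edges)
```

Splitting each pair at the first colon keeps subtyped relations such as `nsubj:xsubj` whole.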
Data files are built from sources at mova-institute/zoloto, where the actual development happens.
The data is licensed under CC BY-NC-SA 4.0 and is free for non-commercial use. For a commercial license, please contact us at org@mova.institute.
- 2022-11-15 v2.9 (upcoming)
  - Reanalyzed large numerals like thousand, million, and above. See the discussion.
  - Brought back `Hyph` and `Bull` `PunctType`s.
  - Renamed `:sp` relation subtypes to `:pred`.
  - Fixed errors.
  - Added sentences.
- 2021-05-15 v2.8
  - Undocumented `PunctType` `Ndash`, `Hyph`, `Bull` converted to `Dash`.
- 2019-05-15 v2.4
  - Closed many annotation gaps: 116K→122K.
  - Fixed annotation errors.
  - Shared more dependents of a first conjunct.
  - Improved consistency by extending annotation guidelines to rarer phenomena.
  - Switched from `ccomp` to `xcomp` where `nsubj:xsubj` is a phantom object.
  - Made clauses with `ADV` relativizers `:relcl`.
  - Added `Polarity=Neg` for conjunctions.
  - Escaped the pipe (`|`) character in `MISC` as `\p`. `\\` is now a backslash.
- 2018-11-15 v2.3
  - Added all types of enhanced dependencies except for case-marking, see Enhanced Dependencies section.
  - Closed many annotation gaps and added new texts: 100→115K.
  - Fixed ~450 annotation errors including його/її/їх `PRON` vs `DET` ambiguity.
  - Improved consistency by extending annotation guidelines to many rarer phenomena.
  - Introduced multitokens for ні́кого, ні́де etc.
  - Split words with fused пів- numerals (e.g. півкласу) to multitokens.
  - Introduced `flat:abs`, `flat:sibl`, `flat:range`, `advmod:det`, `acl:adv`, `parataxis:rel`, `vocative:cl`.
  - Specified `acl:relcl`.
  - Removed `:pass` subtype from relations as it currently can be inferred from the morphology.
  - Added transliteration.
  - Fixed missing `# annotation_gap`s.
  - Updated readme with more description, links.
- 2018-04-15 v2.2
  - Renamed the repository from UD_Ukrainian to UD_Ukrainian-IU to match the new UD naming convention.
  - Fixed some validation errors.
  - Added a couple of new sentences.
  - `Orth=Khark` feature renamed to `Orth=Alt`.
- 2017-11-15 v2.1
  - Quadrupled the amount of data up to 100K, mostly with nonfiction; improved consistency.
  - Re-split train/dev/test.
- 2017-02-15 v2.0
  - Replaced v1.4 data with 25K tokens of misc genres, mostly fiction.
- 2016-11-01 v1.4
  - An initial experimental release containing 1.6K tokens of grammar examples and fiction.
=== Machine-readable metadata =================================================
Data available since: UD v1.4
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: blog email fiction grammar-examples legal news reviews social web wiki
Lemmas: manual native
UPOS: manual native
XPOS: manual native
Features: manual native
Relations: manual native
Contributors: Kotsyba, Natalia; Moskalevskyi, Bohdan; Romanenko, Mykhailo
Contributing: elsewhere
Contact: org@mova.institute
===============================================================================