This repository contains the data, models and system output for replicating the results from the following NAACL paper.
Sebastian Schuster, Joakim Nivre, and Christopher D. Manning. 2018. Sentences with Gapping: Parsing and Reconstructing Elided Predicates. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018).
The repository contains the manually annotated gapping constructions (corresponding to section 4.1 and 4.4), the parser outputs (corresponding to sections 4.2 and 4.4), and the post-processed dependency graphs (corresponding to section 4.3).
Note that this directory only contains the additional data (what we refer to as the GAPPING dataset in the paper) from the WSJ, the Brown Corpus, the GENIA treebank, and the linguistic literature. The latest version of the EWT can be downloaded from
https://github.com/UniversalDependencies/UD_English
The directory structure is as follows.
-
data
: manually annotated dataenhanced
: gold standard enhanced graphsorphan-basic
: gold standard dependency trees annotated according to UD v2 guidelinescomposite-basic
: gold standard dependency trees annotated with composite relationsconstituency
: Constituency trees annotated according to PTB guidelines
-
parser-output
: outputs of dependency and constituency parserorphan-basic
: predicted dependency trees annotated with UD v2 relationscomposite-basic
: predicted dependency trees annotated with composite-relationskummerfeld-klein
: predicted constituency trees as output by the parser of Kummerfeld and Klein (2017).
-
enhanced-graphs
: post-processed dependency graphsoracle
: dependency graphs obtained from gold standard dependency treesorphan
: graphs obtained by applying the orphan procedure (section 3.2)composite
: graphs obtained by applying the composite procedure (section 3.1)
end-to-end
: dependency graphs obtained from predicted dependency treesorphan
: graphs obtained by applying the orphan procedure (section 3.2)composite
: graphs obtained by applying the composite procedure (section 3.1)
The models
directory contains the models for the Dozat and Manning (2016) parser and the Dozat et al. (2017) tagger. We used the following version of the parser for training and parsing:
https://github.com/tdozat/Parser-v2/commit/d5159a1c03c4ce92eb3bdb77ac1723d0bdbc052c
The parser and tagger requires the pre-trained embeddings from the CoNLL 2017 Shared task, which can be downloaded at
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989
The enhancement code is built on top of Stanford CoreNLP. It uses some unreleased extensions that allow one to work with UD v2 code. If you want to use the enhancer, download the following two jar files:
- Special version of Stanford CoreNLP (also contains the source files)
- EJML 0.23
Use the following command to run the enhancer:
java -cp javanlp-core-src.jar:ejml-0.23.jar edu.stanford.nlp.trees.ud.EnglishUDGappingEnhancer INPUT_FILE.conllu GOLD_STANDARD.conllu -embeddings embeddings.txt > OUTPUT_FILE.conllu
This will add the enhancements to a treebank annotated with orphan
relations and output the graphs to OUTPUT_FILE.conllu
. It will also compute the labeled and unlabeled precision and recall metrics by comparing the ouput to the graphs in GOLD_STANDARD.conllu
.
Please contact Sebastian Schuster (sebschu@stanford.edu) if you have questions about the data or running the code.