Data Preprocessing for AMR-to-Text Generation

Assuming that you're working on AMR 2.0 (LDC2017T10), unzip the corpus to data/AMR/LDC2017T10, and make sure it has the following structure:

data/AMR/LDC2017T10
├── data
│   ├── alignments
│   ├── amrs
│   └── frames
├── docs
│   ├── AMR-alignment-format.txt
│   ├── amr-guidelines-v1.2.pdf
│   ├── file.tbl
│   ├── frameset.dtd
│   ├── PropBank-unification-notes.txt
│   └── README.txt
└── index.html
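
Before running the scripts, it can help to confirm that the unzipped corpus matches the layout above. The following is a minimal sketch, not part of the repository: `check_amr_layout` is a hypothetical helper name, and it only checks for the four directories shown in the tree (it does not validate the files inside them).

```shell
#!/usr/bin/env bash
# Sketch: check that the unzipped corpus contains the directories listed above.
# check_amr_layout is a hypothetical helper, not a script shipped with this repo.
check_amr_layout() {
    # Corpus root; defaults to the path used in this README.
    local root="${1:-data/AMR/LDC2017T10}"
    local missing=0
    local sub
    for sub in data/alignments data/amrs data/frames docs; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing directory: $root/$sub" >&2
            missing=1
        fi
    done
    # Returns 0 when every directory exists, 1 otherwise.
    return "$missing"
}

# Example call:
#   check_amr_layout data/AMR/LDC2017T10 && echo "layout OK"
```

The function reports each missing directory on stderr and returns a non-zero status, so it can gate the later steps in a script.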
  1. Download artifacts:
./scripts/download_artifacts.sh
  2. Prepare the training/dev/test data:
./scripts/prepare_data.sh -v 2 -p data/AMR/LDC2017T10
  3. We use Stanford CoreNLP (version 3.9.2) for tokenization. First, start a CoreNLP server, then annotate the AMR sentences:
sh run_standford_corenlp_server.sh
./scripts/annotate_features.sh data/AMR/amr_2.0
  4. Preprocess the data:
./scripts/preprocess_2.0.sh
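
The steps above can be chained so that a failure in one step stops the rest. This is a rough sketch under stated assumptions: `run_step` and `run_pipeline` are hypothetical helper names, the `DRY_RUN` variable is an illustrative addition, and the CoreNLP server is assumed to have been started separately (for example in another terminal), since this README does not say whether the server script returns or stays in the foreground.

```shell
#!/usr/bin/env bash
# Sketch: run the steps above in order, stopping at the first failure.
# run_step and run_pipeline are hypothetical helpers, not part of this repo.

# Print the command, then run it unless DRY_RUN=1 is set.
run_step() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

run_pipeline() {
    # Corpus root; defaults to the path used in this README.
    local corpus="${1:-data/AMR/LDC2017T10}"
    run_step ./scripts/download_artifacts.sh || return 1
    run_step ./scripts/prepare_data.sh -v 2 -p "$corpus" || return 1
    # Assumption: the CoreNLP server is already running at this point
    # (started beforehand with: sh run_standford_corenlp_server.sh).
    run_step ./scripts/annotate_features.sh data/AMR/amr_2.0 || return 1
    run_step ./scripts/preprocess_2.0.sh || return 1
}

# Example call (prints the commands without running them):
#   DRY_RUN=1 run_pipeline data/AMR/LDC2017T10
```

Running with `DRY_RUN=1` first is a cheap way to confirm the paths before starting the longer preprocessing steps.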

Acknowledgements: A large part of the AMR preprocessing code is from sheng-z/stog.