ProQE

This repository contains binary patch data for reconstructing a dataset for quality estimation of grammatical error correction (GEC).

Requirements

bsdiff/bspatch
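
Before running the setup script, it may help to confirm that both tools are on the PATH. The following is a minimal sketch; the helper name `check_tools` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): report whether each
# given command is available on the PATH. Returns non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
      missing=1
    fi
  done
  return "$missing"
}

check_tools bsdiff bspatch || echo "Install the missing tools before running scripts/setup.sh"
```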

Data

The following commands clone the repository and run the setup script, which automatically downloads the W&I+LOCNESS dataset and applies the patch.

$ git clone https://github.com/tmu-nlp/ProQE.git
$ cd ProQE
$ bash ./scripts/setup.sh

scripts/setup.sh generates one file per proficiency level in the data/tsv directory.

$ ls ./data/tsv
a.tsv b.tsv c.tsv n.tsv

The first line of each generated file is a header.

$ head -n 1 ./data/tsv/a.tsv
source	output	scores	ave_score

The header contains the following columns:

source: source sentence.
output: GEC system output sentence.
scores: scores assigned by three annotators.
ave_score: average of scores.

Note: The source and output sentences are detokenized so that annotators do not judge tokenization as an error. We used nltk for detokenization.
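
Because each file starts with the header shown above, a column can be pulled out by name rather than by position. The sketch below assumes the files are plain tab-separated text with no tabs or newlines embedded inside a field; the helper name `tsv_column` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): print the values of one
# column, looked up by its header name, from a tab-separated file.
# Usage: tsv_column FILE COLUMN_NAME
tsv_column() {
  awk -F '\t' -v name="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    col     { print $col }
  ' "$1"
}

# Example: count the data rows for one proficiency level
# (assumes scripts/setup.sh has already been run).
if [ -f ./data/tsv/a.tsv ]; then
  tsv_column ./data/tsv/a.tsv ave_score | wc -l
fi
```

If the requested column name does not appear in the header, the helper prints nothing.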

Citation

If you use our dataset, please cite our LREC paper:

Yujin Takahashi, Masahiro Kaneko, Masato Mita and Mamoru Komachi. Proficiency Matters Quality Estimation in Grammatical Error Correction. 13th Edition of Language Resources and Evaluation Conference (LREC 2022). May, 2022.
