Code for: Calibrated Interpretation: Confidence Estimation in Semantic Parsing
Author: Elias Stengel-Eskin
Personal Email:
This repo is a fork of this repo, which is itself a fork of a fork of MISO, a semantic parsing codebase released with Joint Universal Syntactic and Semantic Parsing.
MISO was built over the course of the following publications:
- AMR Parsing as Sequence-to-Graph Transduction, Zhang et al., ACL 2019
- Broad-Coverage Semantic Parsing as Transduction, Zhang et al., EMNLP 2019
- Universal Decompositional Semantic Parsing, Stengel-Eskin et al., ACL 2020
- Joint Universal Syntactic and Semantic Parsing, Stengel-Eskin et al., TACL 2021
- When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems, Stengel-Eskin et al., EMNLP 2022
MISO is a flexible sequence-to-graph parsing framework built on top of AllenNLP.
Most models in Calibrated Interpretation: Confidence Estimation in Semantic Parsing were run via BenchClamp. This repo contains the analysis scripts and MISO, which was one of the models considered. All other scripts and model code are in this fork of BenchClamp:
The directory data_subsets contains the easy and hard splits of TreeDST and SMCalFlow described in the paper.
All dependencies can be installed with ./
The first step in replicating the experiments is to download the data and GloVe embeddings.
From the project home directory:
mkdir -p data
cd data
# This may take some time
tar -xzvf data_clean.tar.gz
mv data_clean/* .
rm -r data_clean
Important directories:
: contains all the parsing code for the different MISO models
scripts: contains helper scripts for analysis and creating config files/data splits
experiments: contains bash files for running the MISO parser (see for more details)
The main change between different .jsonnet files is the data path at the top, which points the model to the correct data split. For example, data/smcalflow_samples_curated/FindManager/5000_100/ points the model to the 5000-example training subset containing 100 FindManager examples.
The assumption is that each experiment has its own .jsonnet file. For example, the experiment that trains a transformer model with seed=12 on the 5000-100 FindManager split corresponds to the file miso/training_configs/calflow_transformer/FindManager/12_seed/5000_100.jsonnet
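The naming scheme in the example above can be sketched as a small helper. This is only an illustration: `config_path` and its parameter names are hypothetical and not part of the repo, and the layout is inferred from the single example path given here.

```python
from pathlib import Path


def config_path(
    fxn: str,
    seed: int,
    n_train: int,
    n_fxn: int,
    root: str = "miso/training_configs/calflow_transformer",
) -> Path:
    """Build the .jsonnet config path for one experiment.

    Assumed layout (inferred from the README example, not verified
    against every config): <root>/<fxn>/<seed>_seed/<n_train>_<n_fxn>.jsonnet
    """
    return Path(root) / fxn / f"{seed}_seed" / f"{n_train}_{n_fxn}.jsonnet"


# Seed 12, 5000 training examples, 100 of them FindManager examples.
print(config_path("FindManager", 12, 5000, 100).as_posix())
```

A helper like this can be handy when looping over seeds or splits, since only the arguments change between experiments.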
In the released configs, the data dir argument is an environment variable
: Data is assumed to be pre-processed according to the Task-Oriented Dialogue as Dataflow Synthesis instructions. The version used here modifies the instructions in that README to include agent utterances and previous user turns.
experiments/: main training/testing commands for calflow
experiments/: main training/testing commands for TreeDST
Models can be trained locally using experiments/
expects the following environment variables to be set: CHECKPOINT_DIR
is the location where you downloaded the data.
The former points to a directory where the model will store checkpoints. The latter is a .jsonnet config that will be read by AllenNLP.
Optionally, the FXN variable can also be set for function-specific evaluation.
Model checkpoints and logs will be written to CHECKPOINT_DIR/ckpt. Decoded outputs will be written to CHECKPOINT_DIR/translate_output/<split>.tgt
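To make the output layout concrete, here is a minimal sketch that derives both locations from CHECKPOINT_DIR. The function name `output_locations` and the split name used in the usage line are hypothetical; only the `ckpt` and `translate_output/<split>.tgt` structure comes from the description above.

```python
import os
from pathlib import Path
from typing import Mapping, Tuple


def output_locations(
    split: str, env: Mapping[str, str] = os.environ
) -> Tuple[Path, Path]:
    """Return (checkpoint directory, decoded-output file) for a split.

    Reads CHECKPOINT_DIR from `env`, which defaults to the process
    environment; raises if the variable is not set.
    """
    try:
        checkpoint_dir = Path(env["CHECKPOINT_DIR"])
    except KeyError:
        raise RuntimeError("CHECKPOINT_DIR must be set") from None
    ckpt_dir = checkpoint_dir / "ckpt"
    decoded_file = checkpoint_dir / "translate_output" / f"{split}.tgt"
    return ckpt_dir, decoded_file


# Usage with an explicit mapping; "test" is a placeholder split name.
ckpt_dir, decoded_file = output_locations("test", {"CHECKPOINT_DIR": "/tmp/run"})
print(ckpt_dir.as_posix(), decoded_file.as_posix())
```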
For additional details, see miso_docs/
The following environment variables need to be set:
: the directory containing a subdirectory ckpt, which contains an archive model.tar.gz. If training is interrupted or canceled, the archive may be missing. It can be created manually by the following commands:
tar -czvf model.tar.gz config.json vocabulary
TEST_DATA is the path to the test data without the extension. An example would be TEST_DATA=data/
FXN is the function of interest. Example: FXN=FindManager
The model can then be tested using ./experiments/ -a eval_fxn
The output at the end will have the following rows:
Exact Match: The overall exact match accuracy of produced and reference programs.
FXN Coarse: Of the programs whose reference contains FXN, the percentage whose predicted program also contains FXN, regardless of whether the two programs match exactly.
FindManager Fine: The percentage of programs with FXN in the reference where the predicted program is an exact match.
FindManager Precision: Of the predicted programs that contain FXN, the percentage whose reference program also contains FXN.
FindManager Recall: Same as Coarse
FindManager F1: Harmonic mean of precision and recall.
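The row definitions above can be written out as a short sketch. This is not the repo's evaluation code: `fxn_metrics` is a hypothetical name, programs are assumed to be plain strings, and a program is assumed to "contain" FXN when FXN appears as a whitespace-separated token. The actual script may tokenize or match differently.

```python
from typing import Dict, List


def fxn_metrics(preds: List[str], refs: List[str], fxn: str) -> Dict[str, float]:
    """Compute the six reported rows from parallel lists of programs."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    pairs = list(zip(preds, refs))

    def has(program: str) -> bool:
        # Assumption: FXN is a standalone whitespace-separated token.
        return fxn in program.split()

    def pct(num: int, den: int) -> float:
        return 100.0 * num / den if den else 0.0

    ref_has = [(p, r) for p, r in pairs if has(r)]    # FXN in reference
    pred_has = [(p, r) for p, r in pairs if has(p)]   # FXN in prediction

    exact = pct(sum(p == r for p, r in pairs), len(pairs))
    coarse = pct(sum(has(p) for p, _ in ref_has), len(ref_has))
    fine = pct(sum(p == r for p, r in ref_has), len(ref_has))
    precision = pct(sum(has(r) for _, r in pred_has), len(pred_has))
    recall = coarse  # same quantity as Coarse, per the definitions above
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    return {
        "Exact Match": exact,
        f"{fxn} Coarse": coarse,
        f"{fxn} Fine": fine,
        f"{fxn} Precision": precision,
        f"{fxn} Recall": recall,
        f"{fxn} F1": f1,
    }
```

Note that Fine is computed only over examples whose reference contains FXN, so it can differ substantially from the overall Exact Match when FXN is rare.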
To get the predicted token logits under a forced decode, see the log_losses function in experiments/
To get token-level predicted probabilities without a forced decode, use eval_calibrate