Code and data for the paper "Unobserved Local Structures Make Compositional Generalization Hard".
COVR is a synthetic semantic parsing dataset used to evaluate sequence-to-sequence models for compositional generalization. COVR-10 contains 10 compositional splits; in each split, the test set contains programs of a particular kind that is not seen during training.
Split # | Acc.<sup>1</sup> (FT) BART/T5 | Acc.<sup>2</sup> (ICL) GPT-3 | 2-ULSs (unobserved local structures<sup>3</sup>) | Example |
---|---|---|---|---|
8 | 0.34 | 0.51 | eq+triangle eq+brown eq+gray eq+round eq+query_attr[color] eq+black eq+white eq+query_attr[shape] eq+square | Both the color of cat that is chasing black triangle mouse that is playing with ... and (🟠eq (🔵query_attr [color] (with_relation (find (cat), chasing, with_relation (... |
25 | 0.59 | 0.23 | and+some none+filter filter+scene some+filter most+filter exists+filter all+filter | None of square square cat are playing with dog that is looking at white animal... 🟠none (🔵filter (square, filter (square, find (cat))), with_relation (scene (), pla... |
34 | 0.35 | 0.38 | all+with_relation with_relation+scene exists+with_relation none+with_relation most+with_relation some+with_relation | Either the number of white animal that is looking at square brown animal that is... or (eq (count (🔵with_relation (filter (white, find (animal)), looking at, ...), 4... |
43 | 0.2 | 0.11 | and+some and+most or+all and+all or+none and+none or+most or+some | Both the color of cat is equal to brown and some of cat are brown ... 🟠and (eq (query_attr [color] (find (cat)), brown), 🔵some (find (cat), filter (brow... |
48 | 0 | 0.85 | `<s>+query_attr[shape]` `<s>+query_attr[color]` | What is the shape of square cat that is looking at black brown animal that is lo... 🟤query_attr [shape] (with_relation (filter (square, find (cat)), looking at, with... |
51 | 0.64 | 0.35 | | Either the color of mouse that is playing with mouse that is chasing triangle br... or (eq (query_attr [color] (with_relation (find (mouse), playing with, with_rela... |
99 | 0 | 0.89 | `<s>+count` | What is the number of gray animal that is chasing gray mouse that is playing wit... 🟤count (with_relation (filter (gray, find (animal)), chasing, with_relation (filt... |
100 | 0.02 | 0.18 | and+exists exists+find or+exists | Both the shape of cat is equal to white and there is triangle black cat ... 🟠and (eq (query_attr [shape] (find (cat)), white), 🔵exists (filter (triangle, filt... |
110 | 0.18 | 0.33 | with_relation+filter | Either the number of animal is equal to the number of round dog that is chasing ... or (eq (count (find (animal)), count (🟠with_relation (🔵filter (round, find (dog)),... |
115 | 0.28 | 0.05 | all+with_relation with_relation+scene none+with_relation most+with_relation some+with_relation | Either all of cat that is chasing triangle triangle cat that is playing with mou... or (🟠all (🔵with_relation (find (cat), chasing, with_relation (filter (triangle, fi... |
🟠 and 🔵 represent an unseen pair of symbols in a given example. 🟤 represents a symbol that was unseen as a first token in the output sequence.
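
The 2-ULSs column can be read as a set difference over parent-child symbol pairs. The sketch below is a minimal illustration, not the paper's code: it assumes programs are already parsed into nested tuples of the form `(symbol, child, ...)` (an encoding chosen here for brevity), and the helper names `local_structures` and `unobserved` are hypothetical. Following the table's notation, the root symbol is paired with a start token `<s>`.

```python
from typing import Iterable, Set, Union

# Assumed encoding: a leaf is a string, an inner node is (symbol, child, ...).
Tree = Union[str, tuple]


def symbol(node: Tree) -> str:
    """Return the symbol at the root of a (sub)tree."""
    return node if isinstance(node, str) else node[0]


def local_structures(tree: Tree) -> Set[str]:
    """Collect all parent+child pairs, plus <s>+root for the first symbol."""
    pairs = {f"<s>+{symbol(tree)}"}
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue  # leaves have no children
        for child in node[1:]:
            pairs.add(f"{symbol(node)}+{symbol(child)}")
            stack.append(child)
    return pairs


def unobserved(train: Iterable[Tree], test_tree: Tree) -> Set[str]:
    """Pairs that occur in the test program but in no training program."""
    seen: Set[str] = set()
    for tree in train:
        seen |= local_structures(tree)
    return local_structures(test_tree) - seen


# Toy programs (made up for this example, not taken from the dataset files).
train = [
    ("and", ("exists", ("find", "cat")), ("exists", ("find", "dog"))),
    ("some", ("find", "cat"), ("filter", "brown", ("find", "cat"))),
]
test = ("and",
        ("exists", ("find", "cat")),
        ("some", ("find", "cat"), ("filter", "brown", ("find", "cat"))))

print(unobserved(train, test))  # {'and+some'}
```

Here `and` and `some` each appear in training, but never as parent and child, which is the situation the 🟠/🔵 markers highlight.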
Splits are created using the synchronous context-free grammar (SCFG) rules that generated this dataset, by holding out sets of rules that are not seen together during training.
- For details on this splitting method, see our paper (Appendix B.2).
- You can see the set of unseen grammar rules for each split, along with training and test examples, by clicking on Details for any desired split.
- See the list of all grammar splits, which includes splits that were not selected for COVR-10. This list only includes grammar splits and not n-LS splits.
- Download COVR-10
<sup>1</sup> Average exact-match accuracy for BART-Base, BART-Large, T5-Base and T5-Large, fine-tuned (FT) separately on each split (see implementation details in the paper).
<sup>2</sup> Exact-match accuracy of GPT-3 (engine `text-davinci-002`), using the OpenAI API. For each split, we evaluated on a subset of 100 test examples. We use in-context learning (ICL): for each test instance, we randomly sample 10 examples from the training set and add their source and target to the prompt. Click on the GPT-3 accuracy to see samples of prompts and outputs.
<sup>3</sup> Unobserved local structures of size 2 (2-LS), considering only parent-child relations.
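
The in-context learning setup and the exact-match metric described in the footnotes can be sketched as follows. This is an illustration under assumptions: the `source:`/`target:` prompt template, the function names `build_prompt` and `exact_match`, and the fixed seed are made up here and are not taken from the paper or its code. The call to the model itself is omitted.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (source utterance, target program)


def build_prompt(train: Sequence[Example], test_source: str,
                 k: int = 10, seed: int = 0) -> str:
    """Sample k training examples and format them as demonstrations.

    The template below is an assumption, not the one used in the paper.
    """
    rng = random.Random(seed)
    demos = rng.sample(list(train), k)  # sampling without replacement
    lines: List[str] = []
    for source, target in demos:
        lines.append(f"source: {source}")
        lines.append(f"target: {target}")
    # The test instance comes last, with the target left for the model.
    lines.append(f"source: {test_source}")
    lines.append("target:")
    return "\n".join(lines)


def exact_match(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions identical to the target (ignoring outer whitespace)."""
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(targets)


# Placeholder data, only to show the call pattern.
train = [(f"utterance {i}", f"program {i}") for i in range(50)]
prompt = build_prompt(train, "What is the number of cat ?")
print(prompt)
```

Sampling a fresh set of demonstrations per test instance, as the footnote describes, would correspond to calling `build_prompt` with a different seed (or a shared `random.Random` instance) for each instance.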
Dataset | Split Method | # Splits | Download Dataset and splits | Comments |
---|---|---|---|---|
COVR-10 | Grammar | 10 | covr10.zip | |
COVR | Grammar / n-LS | 124 / 22 | covr.zip | |
Overnight | Template | 5 (per domain) | overnight.zip | |
Schema2QA | Template | 5 | s2q.zip | Both utterances and targets are normalized for better evaluation, and are anonymized to resolve column ambiguity |
Atis | Template | 5 | atis.zip | Normalized variables for better evaluation |