Added sentence reordering transformation #48

Merged: 17 commits, Jul 5, 2021
Changes from 15 commits
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
-checklist==0.0.10
+checklist==0.0.11
spacy==2.2.4

# for back_translation
48 changes: 48 additions & 0 deletions transformations/sentence_reordering/README.md
@@ -0,0 +1,48 @@
# Sentence reordering
This perturbation adds noise to any multi-sentence text source (paragraph, document, etc.) by randomly shuffling the sentences in the input text. Coreference resolution is applied before shuffling to reduce the ambiguity that reordering would otherwise introduce.

Author name: Zijian Wang (zijwang@hotmail.com)

## What type of a transformation is this?
This transformation shuffles the order of the sentences in the input text, which can be used to test a model's robustness to sentence order.
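
The shuffling step on its own can be sketched as follows. This is a minimal illustration rather than the code in this PR: it assumes a naive regex sentence splitter in place of spaCy, skips coreference resolution entirely, and the name `shuffle_sentences` is hypothetical.

```python
import random
import re


def shuffle_sentences(text: str, seed: int = 42) -> str:
    """Shuffle the sentences of `text` (hypothetical helper, no coref step)."""
    # Naive split after '.', '!' or '?' followed by whitespace. This is an
    # assumption that stands in for a real sentence tokenizer such as spaCy.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rng = random.Random(seed)  # local RNG, global random state is untouched
    rng.shuffle(sentences)
    return " ".join(sentences)


original = (
    "John is a great person. He resides in Australia. "
    "Peter is also a great person."
)
shuffled = shuffle_sentences(original)
print(shuffled)
```

Without a coreference step, "He resides in Australia." can end up detached from "John", which is the kind of ambiguity the resolution step is meant to reduce.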

## What tasks does it intend to benefit?
This perturbation would benefit text classification and text generation tasks.

Benchmark results:

- Sentiment analysis: we run sentiment analysis on a 1% sample of the IMDB dataset. The original accuracy is 95.6 and the perturbed accuracy is 95.2.
- Text summarization: we run text summarization on a 1% sample of the XSum dataset. The original BLEU is 15.99 and the perturbed BLEU is 9.75.
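
To put the summarization numbers in relative terms, the drop can be computed as a fraction of the original score. The helper name `relative_drop` is hypothetical; only the two BLEU values come from the results above.

```python
def relative_drop(original: float, perturbed: float) -> float:
    """Return the drop from `original` to `perturbed` as a fraction of `original`."""
    return (original - perturbed) / original


bleu_drop = relative_drop(15.99, 9.75)
print(f"BLEU relative drop: {bleu_drop:.1%}")  # prints "BLEU relative drop: 39.0%"
```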

## Related work

This is very similar to the `Sentence Permutation` noising method in the BART paper.

```bibtex
@inproceedings{lewis2020bart,
title={BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension},
author={Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
pages={7871--7880},
year={2020}
}
```

The coreference resolution model is from the following paper:

```bibtex
@inproceedings{lee2018higher,
title={Higher-Order Coreference Resolution with Coarse-to-Fine Inference},
author={Lee, Kenton and He, Luheng and Zettlemoyer, Luke},
booktitle={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)},
pages={687--692},
year={2018}
}
```

We use its [AllenNLP implementation](https://demo.allennlp.org/coreference-resolution).
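
As a rough picture of what resolving coreferences means here, the sketch below replaces every later mention in a cluster with the cluster's first mention. It is a simplified illustration and not the AllenNLP implementation, which is more careful than this. The function name `resolve_mentions`, the hand-written cluster, and the `[start, end]` inclusive token-span format are assumptions made for the example, and mention spans are assumed not to overlap.

```python
from typing import List

Span = List[int]  # [start, end] token indices, end inclusive (assumed format)


def resolve_mentions(tokens: List[str], clusters: List[List[Span]]) -> str:
    """Replace each later mention in a cluster with the cluster's first mention."""
    replacements = []
    for cluster in clusters:
        first_start, first_end = cluster[0]
        antecedent = tokens[first_start : first_end + 1]
        for start, end in cluster[1:]:
            replacements.append((start, end, antecedent))

    resolved = list(tokens)
    # Apply right to left so earlier indices stay valid after each splice.
    for start, end, antecedent in sorted(
        replacements, key=lambda r: r[0], reverse=True
    ):
        resolved[start : end + 1] = antecedent
    return " ".join(resolved)


tokens = "Intertoys is a Dutch store-chain . It is headquartered in Amsterdam .".split()
clusters = [[[0, 0], [6, 6]]]  # hand-written cluster, not model output
print(resolve_mentions(tokens, clusters))
# prints "Intertoys is a Dutch store-chain . Intertoys is headquartered in Amsterdam ."
```

Once "It" has been replaced by "Intertoys", the second sentence no longer depends on the first, so it stays interpretable wherever the shuffle places it.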


## What are the limitations of this transformation?

This transformation only changes input text that contains more than one sentence.
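
A small sketch of this limitation, assuming the sentences have already been split and using a hypothetical `reorder` helper: a single sentence always comes back unchanged, and a random shuffle of a short input can also happen to return the original order.

```python
import random


def reorder(sentences, seed=42):
    """Return a shuffled copy of `sentences` (hypothetical helper)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


single = ["Intertoys is headquartered in Amsterdam."]
assert reorder(single) == single  # one sentence: nothing to reorder

pair = ["A.", "B."]
unchanged = sum(reorder(pair, seed=s) == pair for s in range(1000))
print(f"{unchanged} of 1000 seeds left a two-sentence input unchanged")
```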
1 change: 1 addition & 0 deletions transformations/sentence_reordering/__init__.py
@@ -0,0 +1 @@
from .transformation import *
3 changes: 3 additions & 0 deletions transformations/sentence_reordering/requirements.txt
@@ -0,0 +1,3 @@
# for sentence_reordering
allennlp==2.5.0
allennlp-models==2.5.0
71 changes: 71 additions & 0 deletions transformations/sentence_reordering/test.json
@@ -0,0 +1,71 @@
{
"type": "sentence_reordering",
"test_cases": [
{
"class": "SentenceReordering",
"inputs": {
"sentence": "The Novikov conjecture is one of the most important unsolved problems in topology. It is named for Sergei Novikov who originally posed the conjecture in 1965. The Novikov conjecture concerns the homotopy invariance of certain polynomials in the Pontryagin classes of a manifold, arising from the fundamental group. According to the Novikov conjecture, the higher signatures, which are certain numerical invariants of smooth manifolds, are homotopy invariants."
},
"outputs": [
{
"sentence": "The Novikov conjecture concerns the homotopy invariance of certain polynomials in the Pontryagin classes of a manifold, arising from the fundamental group. The Novikov conjecture is named for Sergei Novikov who originally posed The Novikov conjecture in 1965. According to The Novikov conjecture, the higher signatures, which are certain numerical invariants of smooth manifolds, are homotopy invariants. The Novikov conjecture is one of the most important unsolved problems in topology."
}
]
},
{
"class": "SentenceReordering",
"inputs": {
"sentence": "Albany Theatre is a historic theater in Albany, Georgia. It was added to the National Register of Historic Places on August 21, 2006. The Albany Theatre opened on September 12, 1927. The theatre is no longer in operation. It is located at 107 North Jackson Street."
},
"outputs": [
{
"sentence": "Albany Theatre is no longer in operation. Albany Theatre was added to the National Register of Historic Places on August 21, 2006. Albany Theatre opened on September 12, 1927. Albany Theatre is located at 107 North Jackson Street. Albany Theatre is a historic theater in Albany, Georgia."
}
]
},
{
"class": "SentenceReordering",
"inputs": {
"sentence": "Intertoys is a Dutch store-chain founded in 1976 that specialised in toys, multimedia and electronics. It is headquartered in Amsterdam."
},
"outputs": [
{
"sentence": "Intertoys is headquartered in Amsterdam. Intertoys is a Dutch store-chain founded in 1976 that specialised in toys, multimedia and electronics."
}
]
},
{
"class": "SentenceReordering",
"inputs": {
"sentence": "QuantumScape is an American company that does research about solid state lithium metal batteries for electric cars. The company is headquartered in San Jose, California and employs around 200 people. Investors include Bill Gates and Volkswagen."
},
"outputs": [
{
"sentence": "QuantumScape is headquartered in San Jose, California and employs around 200 people. QuantumScape is an American company that does research about solid state lithium metal batteries for electric cars. Investors include Bill Gates and Volkswagen."
}
]
},
{
"class": "SentenceReordering",
"inputs": {
"sentence": "Sousmoulins is a commune in the Charente-Maritime department in southwestern France. The Seugne forms part of the commune's northeastern border."
},
"outputs": [
{
"sentence": "The Seugne forms part of a commune in the Charente-Maritime department in southwestern France's northeastern border. Sousmoulins is a commune in the Charente-Maritime department in southwestern France."
}
]
},
{
"class": "SentenceReordering",
"inputs": {
"sentence": "John is a great person. He resides in Australia. Peter is also a great person. He resides in India."
},
"outputs": [
{
"sentence": "Peter is also a great person. John resides in Australia. Peter resides in India. John is a great person."
}
]
}
]
}
61 changes: 61 additions & 0 deletions transformations/sentence_reordering/transformation.py
@@ -0,0 +1,61 @@
import random
from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType

# for sent tokenizer
import spacy

nlp = spacy.load("en_core_web_sm")

# coref resolution from allennlp
# ref: https://demo.allennlp.org/coreference-resolution
import allennlp_models.tagging
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"
)


def sentence_reordering(text, seed, coref_model):
    """Resolve coreferences in `text`, then shuffle its sentences."""
    random.seed(seed)
    # resolve coref so that shuffled sentences keep explicit referents
    text = coref_model.coref_resolved(document=text)

    # tokenize and shuffle
    text_split = [i.text for i in nlp(text).sents]
    random.shuffle(text_split)
    return " ".join(text_split)


class SentenceReordering(SentenceOperation):
    """Shuffle sentence order."""

    tasks = [
        TaskType.TEXT_CLASSIFICATION,
        TaskType.TEXT_TO_TEXT_GENERATION,
    ]
    languages = ["en"]

    def __init__(self, seed=42, max_output=1):
        super().__init__(seed)
        self.max_output = max_output
        self.coref_model = Predictor.from_path(
            "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz"
        )

    def generate(self, sentence: str):
        perturbed = [
            sentence_reordering(
                text=sentence, seed=self.seed, coref_model=self.coref_model
            )
        ]
        return perturbed
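
One design note on seeding: `random.seed(seed)` resets Python's global generator on every call, so other code that uses `random` is affected as well. A possible alternative, sketched here under the assumption that sentences are already split, keeps a private `random.Random` instance. `SeededShuffler` is a hypothetical name and not part of this PR.

```python
import random
from typing import List


class SeededShuffler:
    """Hypothetical helper: keeps its own RNG instead of reseeding `random`."""

    def __init__(self, seed: int = 42) -> None:
        self._rng = random.Random(seed)

    def shuffle(self, sentences: List[str]) -> List[str]:
        shuffled = list(sentences)  # copy so the caller's list is not mutated
        self._rng.shuffle(shuffled)
        return shuffled


sentences = ["A.", "B.", "C.", "D."]
first = SeededShuffler(seed=42).shuffle(sentences)
second = SeededShuffler(seed=42).shuffle(sentences)
assert first == second  # same seed, same order
assert sorted(first) == sentences  # nothing added or dropped
```

Two fresh instances with the same seed reproduce the same order, while the global `random` state is left alone.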