
Added sentence reordering transformation #48

Merged (17 commits) on Jul 5, 2021

Conversation

zijwang (Contributor) commented Jun 20, 2021

No description provided.

kaustubhdhole (Collaborator)

Hi @zijwang , thank you very much for your changes and your PR!

Sentence shuffling seems to be a great addition to NL-Augmenter, but I believe it can be made stronger. Currently it can easily change the meaning of the text when applied (especially in narration, news articles, stories, etc.). So I would like to suggest one change: at least substitute pronouns with their actual referents, so that cases like the one below can be handled:

John is a great person. He resides in Australia. Peter is a wonderful person. He resides in India.

There is a wonderful recent work on sentence decontextualisation that is very relevant here: it attempts to convert sentences into decontextualised counterparts that make sense without their surrounding context. https://arxiv.org/pdf/2102.05169.pdf

I would recommend

  1. either using a coref model (probably neural-coref https://github.com/huggingface/neuralcoref to keep consistency with spacy; others should also be fine)
  2. or using this sort of decontextualising model
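As a toy illustration of the pronoun-substitution idea (a minimal sketch: the coreference clusters here are hand-written for the example above, whereas a real implementation would obtain them from a coref model; the function and variable names are mine, not any library's API):

```python
# Toy sketch: substitute pronouns with their antecedents before shuffling,
# using precomputed coreference information (a real implementation would
# get this from a coref model such as neuralcoref or an AllenNLP predictor).

def resolve_pronouns(sentences, coref_map):
    """Replace each pronoun in sentence i with its antecedent.

    coref_map maps a sentence index to a list of (pronoun, antecedent)
    pairs; only the first occurrence of each pronoun is replaced.
    """
    resolved = []
    for i, sent in enumerate(sentences):
        for pronoun, antecedent in coref_map.get(i, []):
            sent = sent.replace(pronoun, antecedent, 1)
        resolved.append(sent)
    return resolved

sentences = [
    "John is a great person.",
    "He resides in Australia.",
    "Peter is a wonderful person.",
    "He resides in India.",
]
# Hand-written clusters, for illustration only.
coref_map = {1: [("He", "John")], 3: [("He", "Peter")]}
print(resolve_pronouns(sentences, coref_map))
# → ['John is a great person.', 'John resides in Australia.',
#    'Peter is a wonderful person.', 'Peter resides in India.']
```

After this substitution, shuffling the sentences no longer detaches the pronouns from their referents.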

The following is my overall review:

Correctness: All checks have passed - please pull "main" into your branch once, since we added a facility to allow multiple generations. There's a minor change in the syntax of the outputs.
Interface: I think you've chosen the correct interface.
Specificity: The change is not task-specific, but it is generic and useful.
Novelty: Sentence shuffling or any similar approach has not been implemented in NL-Augmenter so far.
Adding New Libraries: The new libraries need to be added to a requirements.txt inside the transformation's own folder.
Test Cases: 5 test cases have been added.
Evaluating Robustness: Robustness evaluation has been conducted.

Thanks again for your changes!

zijwang (Contributor, Author) commented Jun 27, 2021

Thanks for the review and suggestions, @kaustubhdhole !

Adding coreference resolution makes sense. I added NeuralCoref to the code and updated the README and the requirements file. Please let me know what you think.

Note that the NeuralCoref package seems to have compatibility issues with certain versions of SpaCy (see a few issues in NeuralCoref's repo). Following this github issue, I added the source code link to the requirements file, which hopefully resolves the problem.
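For reference, pip requirements files can point at a git repository directly, which is what such a "source code link" amounts to; an entry of roughly this shape (the exact ref/commit pin used in the PR is not shown here):

```text
# requirements.txt (illustrative): install neuralcoref from source
git+https://github.com/huggingface/neuralcoref.git
```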

zijwang (Contributor, Author) commented Jun 29, 2021

Hi @kaustubhdhole , I tried to add NeuralCoref to the dependency list, but it looks like the pipeline has failed multiple times. Could you take a look and let me know what the correct config is? Thanks!

kaustubhdhole (Collaborator) commented Jun 29, 2021

The pipeline seems to be failing because neuralcoref only works on spacy==2.1.0, from what I read. So we will have to skip neuralcoref until they start supporting later versions. Would you like to try the decontextualizing model instead, or any HF coref model?

https://colab.research.google.com/github/google-research/language/blob/master/language/decontext/decontextualization_demo.ipynb#scrollTo=KUGNUKuZ5Vky

They seem to be using tensorflow-text.

kaustubhdhole (Collaborator)

Adding @vgtomahawk here since he had used neuralcoref earlier.

zijwang (Contributor, Author) commented Jun 29, 2021

@kaustubhdhole Actually, I have got both NeuralCoref + SpaCy 2.2.x and NeuralCoref + SpaCy 2.3.x working locally. Could you check whether the pipeline has any special settings?

Also, I have seen the error Fatal Python error: Segmentation fault locally with the same setup as the pipeline. This could be related to memory/disk limits. Could you also take a look at that?

zijwang (Contributor, Author) commented Jul 3, 2021

Hey @kaustubhdhole and @vgtomahawk ! Just want to ping you again to see whether you have any idea about the cause of the pipeline error. Let me know if there is anything I can help with. Thanks!

kaustubhdhole (Collaborator)

Hi @zijwang

The error is due to the spacy and neuralcoref versions clashing. NeuralCoref seems to work only with spacy 2.1.0, and if we downgrade spacy, then checklist would not work. I would suggest using a different coref resolver than neuralcoref.

zijwang (Contributor, Author) commented Jul 3, 2021

Hi @kaustubhdhole ! Sure, I will look into alternatives and let you know once I have updated the code.

Meanwhile, as mentioned above, I did get neuralcoref working with spacy 2.2.x and 2.3.x locally on an Ubuntu machine. It does require compiling from source, though. Let me know if you know a way I could do that using easy_install, and I could also try that out.

zijwang (Contributor, Author) commented Jul 4, 2021

Hi @kaustubhdhole ! I updated the coref model to this one using AllenNLP. The pipeline seems to be working now. Could you take a look?
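Whichever coref backend ends up being used, the reordering step itself is simple and backend-independent; a minimal stdlib sketch of that step (function and variable names are mine, not the PR's actual code):

```python
import random

def reorder_sentences(sentences, seed=0):
    """Return a shuffled copy of the sentence list.

    A seeded Random instance makes the perturbation reproducible,
    which matters for repeatable robustness evaluations.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

# Coref-resolved sentences survive reordering with their meaning intact.
doc = [
    "John is a great person.",
    "John resides in Australia.",
    "Peter is a wonderful person.",
    "Peter resides in India.",
]
print(" ".join(reorder_sentences(doc, seed=42)))
```

The full transformation would run coreference resolution (and sentence splitting) before this shuffle, as discussed above.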

kaustubhdhole (Collaborator) commented Jul 4, 2021

Thank you @zijwang for your continuous efforts! I think this PR is qualitatively in great shape! There are three minor changes, mostly to improve speed:

  1. To avoid conflicts with others' requirements.txt, you can add your requirements in a new file inside the "sentence_reordering" folder, like this.

  2. The loading of the coref model (predictor = Predictor.from_path..) and other heavy loading can be moved into the constructor, so that the transformation does not take up memory during import.

  3. The NLTK sentence splitter can be replaced with the spacy one, which is already present in NL-Augmenter.

This snippet has been taken from here.

```python
import spacy

nlp = spacy.load('en_core_web_sm')

text = 'My first birthday was great. My 2. was even better.'
sentences = [sent for sent in nlp(text).sents]
```
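Point 2 above (moving heavy loading into the constructor) follows a standard pattern; a stubbed sketch where the expensive model load is replaced by a placeholder (class and method names are illustrative, not the PR's actual code):

```python
class SentenceReordering:
    """Transformation whose heavy resources load in __init__, not at import.

    Importing the module stays cheap; the cost is paid only when the
    transformation object is actually constructed.
    """

    def __init__(self):
        # In the real transformation this would be something like
        # Predictor.from_path(...) and spacy.load(...); stubbed here.
        self.predictor = self._load_heavy_model()

    def _load_heavy_model(self):
        return object()  # stand-in for an expensive model load

    def generate(self, text):
        # Placeholder for coref resolution + sentence shuffling.
        return [text]

transformation = SentenceReordering()
print(transformation.generate("A simple example."))
```

Keeping model loads out of module scope means that merely importing the transformations package (e.g. to list available transformations) does not pull every model into memory.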

Also, you might want to update the README file, which still mentions "neuralcoref".

Thanks again for your efforts!

zijwang (Contributor, Author) commented Jul 4, 2021

Thanks for the quick reply @kaustubhdhole ! I have updated the code and README.

vgtomahawk self-requested a review July 4, 2021 23:15

vgtomahawk (Collaborator) left a comment


Hi @zijwang , thanks once again for going above and beyond to incorporate changes for this PR! I didn't notice @kaustubhdhole had tagged me here earlier, but I had faced similar issues with the neuralcoref lib and never managed to get it to work above that specific version of spacy - it's commendable that you even managed to pull that off locally.

I think it's nice that we finally did not take the decontextualization route (though it is a very pertinent thing to discuss in this context), since it may have enlarged the sentences and damaged coherence/naturalness. Though it would indeed be super-interesting future work to think about how we can use decontextualization to do some doc-level stuff [not necessarily shuffling].

I don't think I have much else to add, really.

My overall review:

Correctness: I think things seem good on this front.
Interface: I think the interface chosen here is apt and pertinent.
Specificity: The change is applicable to any task as long as the inputs are multi-sentence, i.e. non-trivial documents.

Novelty: It is novel w.r.t. NL-Augmenter. Even outside NL-Augmenter, I've rarely seen implementations that take care to fix coreference after doing the shuffling.

Btw, just FYI - I think this PR has already been heavy and I am not asking you or anyone to address this, but there are still certain phenomena other than coref that can go wrong after one does shuffling [as you might know]. One prominent one is ellipsis:

S: 1. All of the Ross family has red hair, except Henry. 2. Henry has blonde hair that is very curly. 3. Henry's father often teases Henry's mother about the mailman. 4. The mailman has blonde, curly hair, but he is very ugly. 5. His dad's teasing makes Henry feel bad.

S': 1. All of the Ross family has red hair, except Henry. 2. His dad's teasing about the mailman makes Henry feel very bad. 3. This is because the mailman has blonde, curly hair, but he is very ugly. 4. Henry also has blonde hair that is very curly. 5. Henry's father often teases Henry's mother about the mailman.

(Note that this example is not my own but from an arxiv preprint I know.)

You can see that "about the mailman", which was elided in the original, had to be added back in here. However, unfortunately, even basic ellipsis detection tools are poorly developed today, IMO.

Adding New Libraries: This has already been discussed in the previous review, nothing new to add here

Test Cases: 6 test cases are present, sufficient per the requirements.

Evaluating Robustness: Appropriate robustness evaluation has been carried out.

Thanks again, and sorry for not chiming into the discussion earlier!!

zijwang (Contributor, Author) commented Jul 5, 2021

Thanks a lot @vgtomahawk for the review! The ellipsis problem is indeed hard to resolve. I added a note on this point to the README.

It seems this task has already got 2 approvals. Is there anything I should do to merge it into master?

vgtomahawk merged commit fec76ce into GEM-benchmark:main on Jul 5, 2021
vgtomahawk (Collaborator) commented Jul 5, 2021

> Thanks a lot @vgtomahawk for the review! The ellipsis problem is indeed hard to resolve. I added a note on this point to the README.

I think building a complete sentence shuffler that addresses all such doc-level phenomena would be an amazing open problem to work on - I basically know just these two [ellipsis/coref], but there are many more. Thanks for updating the limitations!!

> It seems this task has already got 2 approvals. Is there anything I should do to merge it into master?

Done! Merged!

zijwang (Contributor, Author) commented Jul 10, 2021

FYI @vgtomahawk there is a new ACL paper on a very similar topic: https://arxiv.org/pdf/2107.03448.pdf

kaustubhdhole (Collaborator)

Thank you very much for your changes. Your changes are merged. As NLP has largely focussed on high-resource languages and domains, we also encourage you, if possible, to add a transformation in your mother tongue and/or a low-resource language, which might need more attention from the NLP community. This is not mandatory.
