Added sentence reordering transformation #48
Conversation
Hi @zijwang , thank you very much for your changes and your PR! Sentence shuffling seems to be a great addition to NL-Augmenter, but I believe it can be made stronger. Currently it can easily change the meaning of the text when applied (especially in narration, news articles, stories, etc.). So I would like to suggest one change: at least substitute pronoun mentions with their actual referents, so that cases like the ones below can be handled:

John is a great person. He resides in Australia.
Peter is a wonderful person. He resides in India.

There is a wonderful recent work on sentence decontextualisation which is very relevant here: it attempts to convert sentences into decontextualised counterparts so that they make sense without the surrounding context. https://arxiv.org/pdf/2102.05169.pdf I would recommend taking a look.
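[Not part of the original comment.] The suggestion above can be illustrated with a minimal, stdlib-only sketch: resolve pronouns to their referents first, then shuffle, so sentence order no longer matters. The `resolve_pronouns` helper and its single-antecedent heuristic are purely hypothetical toy code; a real implementation would use a proper coreference resolver, as discussed later in this thread.

```python
import random
import re

def resolve_pronouns(sentences):
    """Toy heuristic: replace 'He'/'She' with the most recent
    capitalized-name token seen so far, so each sentence survives
    reordering. (Illustrative only; real coref is needed in practice.)"""
    resolved, last_name = [], None
    for sent in sentences:
        out = []
        for w in sent.split():
            if w in ("He", "She") and last_name:
                out.append(last_name)
            else:
                out.append(w)
                # remember proper-noun-looking tokens as candidate antecedents
                if re.fullmatch(r"[A-Z][a-z]+\.?", w) and w.rstrip(".") not in ("He", "She"):
                    last_name = w.rstrip(".")
        resolved.append(" ".join(out))
    return resolved

sentences = ["John is a great person.", "He resides in Australia.",
             "Peter is a wonderful person.", "He resides in India."]
resolved = resolve_pronouns(sentences)
random.Random(0).shuffle(resolved)  # order no longer breaks the pronouns
```

After resolution, shuffling cannot accidentally make "He" point at the wrong person, which is exactly the failure mode described above.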
The following is my overall review: Correctness: All checks have passed. Please pull "main" into your branch once, since we added a facility to allow multiple generations; there is a minor change in the syntax of the outputs. Thanks again for your changes!
Thanks for the review and suggestions, @kaustubhdhole ! Adding coreference resolution makes sense. I added NeuralCoref to the code and updated the README and the requirements file. Please let me know what you think now. Note that the NeuralCoref package seems to have compatibility issues with different versions of spaCy (see a few issues in NeuralCoref's repo). I added the source-code link to the requirements file following this GitHub issue, which hopefully resolves the problem.
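[Aside, not in the original comment.] The exact requirements line used in the PR isn't visible here, but pinning a package to its source repository in `requirements.txt` generally uses pip's direct-reference syntax, roughly like this hypothetical line:

```text
neuralcoref @ git+https://github.com/huggingface/neuralcoref.git
```

Installing from source like this forces a local Cython build, which is also why the spaCy-version constraints of the compiled wheel can sometimes be worked around.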
Hi @kaustubhdhole , I tried to add
The pipeline seems to be failing because neuralcoref only works with spacy==2.1.0, from what I read. So we will have to skip neuralcoref until they support later versions. Would you like to try the decontextualizing model instead, or any HF coref model? They seem to be using tensorflow-text.
Adding @vgtomahawk here since he had used neuralcoref earlier.
@kaustubhdhole Actually, I have got both NeuralCoref + spaCy 2.2.x and NeuralCoref + spaCy 2.3.x working locally. Could you check whether there is any special setting in the pipeline? Also, I have seen the error
Hey @kaustubhdhole and @vgtomahawk ! Just wanted to ping you again to see whether you have any idea about the cause of the pipeline error. Let me know if there is anything I can help with. Thanks!
Hi @zijwang The error is due to clashing spacy and neuralcoref versions. NeuralCoref seems to only work with spacy 2.1.0, and if we downgrade spacy, then checklist would not work. I would suggest using a different coref resolver than neuralcoref.
Hi @kaustubhdhole ! Sure, I will look into alternatives and let you know once I have updated the code. Meanwhile, as mentioned above, I did get neuralcoref working with spaCy 2.2.x and 2.3.x locally on an Ubuntu machine. It does require compiling from source, though. Let me know if you know a way I could do that using easy_install, and I could try that out as well.
Hi @kaustubhdhole ! I updated the coref model to this one using AllenNLP. The pipeline seems to be working now. Could you take a look?
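[Aside, not in the original comment.] For readers unfamiliar with how such a resolver is consumed: span-based coref models typically return the tokenized document plus clusters of `[start, end]` token spans. Below is a stdlib-only sketch of turning that output shape into resolved text; the sample data and the `resolve_clusters` helper are illustrative, not the PR's actual code.

```python
def resolve_clusters(tokens, clusters):
    """Replace every later mention in a cluster with the text of the
    cluster's first mention (inclusive [start, end] token spans)."""
    replacement = {}  # mention start index -> (mention end index, head tokens)
    for cluster in clusters:
        first_start, first_end = cluster[0]
        head = tokens[first_start:first_end + 1]
        for start, end in cluster[1:]:
            replacement[start] = (end, head)
    out, i = [], 0
    while i < len(tokens):
        if i in replacement:
            end, head = replacement[i]
            out.extend(head)   # substitute the first mention's text
            i = end + 1
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = ["John", "is", "great", ".", "He", "lives", "in", "Australia", "."]
clusters = [[[0, 0], [4, 4]]]  # one cluster: {John, He}
resolved = resolve_clusters(tokens, clusters)
# resolved == "John is great . John lives in Australia ."
```

Once pronouns are rewritten this way, the sentences can be shuffled independently without losing their referents.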
Thank you @zijwang for your continuous efforts! I think this PR is qualitatively in great shape! There are 3 minor changes to improve speed:
This snippet has been taken from here.

```python
import spacy

nlp = spacy.load('en_core_web_sm')
text = 'My first birthday was great. My 2. was even better.'
sentences = [i for i in nlp(text).sents]
```

Also, you might want to update the README file too, which still mentions "neuralcoref". Thanks again for your efforts!
Thanks for the quick reply @kaustubhdhole ! I have updated the code and README.
Hi @zijwang , thanks once again for going above and beyond to incorporate changes for this PR! I didn't notice @kaustubhdhole had tagged me here earlier, but I had faced similar issues with the neuralcoref lib and never managed to get it to work above that specific version of spaCy; it's commendable that you even managed to pull that off locally.
I think it's nice that we ultimately did not take the decontextualization route (though it is a very pertinent thing to discuss in this context), since it may have enlarged the sentences and damaged coherence/naturalness. Though it would indeed be super interesting as future work to think about how decontextualization could be used for document-level transformations [not necessarily shuffling].
I don't think I have much else to add, really.
My overall review:
Correctness: I think things seem good on this front
Interface: I think the interface chosen here is apt and pertinent
Specificity: The change is applicable to any task as long as the inputs are multi-sentence, i.e. non-trivial documents.
Novelty: It is novel w.r.t NL-Augmenter. Even outside NL-Augmenter, I've rarely seen implementations that take care to fix coreference after doing the shuffling.
Btw, just FYI: I think this PR has already been heavy and I am not asking you or anyone to address this, but there are still certain phenomena other than coref that can go wrong after shuffling [as you might know]. One prominent one is ellipsis:
S:
1. All of the Ross family has red hair, except Henry.
2. Henry has blonde hair that is very curly.
3. Henry’s father often teases Henry’s mother about the mailman.
4. The mailman has blonde, curly hair, but he is very ugly.
5. His dad’s teasing makes Henry feel bad.

S’:
1. All of the Ross family has red hair, except Henry.
2. His dad’s teasing about the mailman makes Henry feel very bad.
3. This is because the mailman has blonde, curly hair, but he is very ugly.
4. Henry also has blonde hair that is very curly.
5. Henry’s father often teases Henry’s mother about the mailman.
(Note that this example is not my own but comes from an arXiv preprint I know.)
You can see that "about the mailman", which was originally elided, had to be added back in here. Unfortunately, even basic ellipsis-detection tools are poorly developed today, IMO.
Adding New Libraries: This has already been discussed in the previous review, nothing new to add here
Test Cases: 6 test cases are present, which is sufficient per the requirements.
Evaluating Robustness: Appropriate robustness evaluation has been carried out.
Thanks again, and sorry for not chiming in on the discussion earlier!!
Thanks a lot @vgtomahawk for the review! This ellipsis problem is indeed hard to resolve; I added a note on this point to the README. It seems this task already has 2 approvals. Is there anything I should do to merge it into master?
I think building a complete sentence shuffler which addresses all such document-level phenomena would be an amazing open problem to work on. I also basically know just these two [ellipsis/coref], but there are many more. Thanks for updating the limitations!!
Done! Merged!
FYI @vgtomahawk there is a new ACL paper on a very similar topic: https://arxiv.org/pdf/2107.03448.pdf
Thank you very much for your changes. Your changes are merged. As NLP has largely focused on high-resource languages and domains, if possible we also encourage you to add a transformation in your mother tongue and/or a low-resource language, which might need more attention from the NLP community. This is not mandatory.