Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.not() is destructive to punctuation #1065

Closed
Fdawgs opened this issue Dec 1, 2023 · 4 comments
Closed

.not() is destructive to punctuation #1065

Fdawgs opened this issue Dec 1, 2023 · 4 comments

Comments

@Fdawgs
Copy link
Contributor

Fdawgs commented Dec 1, 2023

Node version: 20.10.0
Compromise version: 14.10.1

As title states, punctuation is removed if it is next to what is being removed by .not().
In the examples below, you can see the brackets have been removed in the first two, and the exclamation mark with the last.

Reproduction:

const nlp = require('compromise')

const text = 'The leftorium sells left-handed products (Ned Flanders is the owner)'
const result1 = nlp(text).not('#Person').text()
console.log(result1)
/**
 * outputs:
 * The leftorium sells left-handed products is the owner)
 */

const text2 = 'The leftorium sells left-handed products (the owner is Ned Flanders)'
const result2 = nlp(text2).not('Ned Flanders').text()
console.log(result2)
/**
 * outputs:
 * The leftorium sells left-handed products (the owner is
 */

const text3 = 'The leftorium sells left-handed products, the owner is Ned Flanders!'
const result3 = nlp(text3).not('Ned Flanders').text()
console.log(result3)
/**
 * outputs:
 * The leftorium sells left-handed products, the owner is
 */

Potentially related to #1022.

@track0x1
Copy link
Contributor

track0x1 commented Feb 6, 2024

I noticed this happens too.
Seems to be when not() is omitting text that occurs before punctuation. For me this has manifested when I pair not() with parentheses().

// Works as expected
> nlp('this is (kinda) messy').not('messy').parentheses().out('array')
[ '(kinda)' ]

// No results for parentheses()
> nlp('this is (kinda) messy').not('this').parentheses().out('array')
[]

// Multiple terms in parentheses are lost
> nlp('this is (kinda really) messy').not('this').parentheses().out('array')
[ '(kinda' ]

@spencermountain
Copy link
Owner

spencermountain commented Feb 7, 2024

hey, yep that's right - compromise is tokenizing punctuation into pre-text, and post-text, and has some opinions on what term a punctuation should be on, or if it should hang on the left or the right of a term.

There's also the guesswork in .text() if it should print leading or trailing punctuation - sometimes it should and sometimes it shouldn't, and it decides based on how chopped-up the match is. I think that's what's happening in Ned Flanders examples.

That being said, this does appear to be a bug:

nlp('this is (kinda) messy').not('this').parentheses()

I'll take a look at all of these, if I can, today.
cheers

@spencermountain
Copy link
Owner

spencermountain commented Feb 12, 2024

hey @Fdawgs - you may want to try .remove() which mutates the document, instead of .not(), which just changes the current match. It's a subtle difference, but .remove() will do some of the things you seek, regarding repairing sentence-punctuation, and things:

const text3 = 'The leftorium sells left-handed products, the owner is Ned Flanders!'
const result3 = nlp(text3).remove('Ned Flanders').text()
console.log(result3)
//The leftorium sells left-handed products, the owner is!

Please let me know if you spot .remove() mangling punctuation in unexpected ways. I don't think it's been tested very well. It would be fun to improve.

@track0x1 fix is on dev, will be part of next release, likely this week.
cheers

@spencermountain
Copy link
Owner

fixed in 14.12.0

nlp('this is (kinda) messy').not('this').parentheses() //'(kinda)'

cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants