Epithets starting with non are not parsed correctly #211

Closed
tobymarsden opened this issue Nov 16, 2021 · 4 comments

@tobymarsden

Currently names such as Hyacinthoides non-scripta have to be special-cased because non is a stopword.

There are also a bunch of these names which are not currently handled:

Artocarpus altilis var. non-seminiferus
Artocarpus incisus var. non-seminiferus
Asarum maculatum var. non-maculatum
Asarum versicolor var. non-versicolor
Hyacinthus non-scriptus
Hylomenes non-scripta
Grossularia non-scripta
Scilla non-scripta subsp. hispanica
Usteria non-scripta
Anthericum non-ramosum
Anthericum non-scriptum
Endymion non-scriptus
Streptanthera cuprea var. non-picta
Scilla non-scripta subsp. cernua
Torreya grandis f. non-apiculata
Rosa ×pouzinii subsp. nonhispida
Cotoneaster non-shan
Ribes non-scriptum

The most conservative fix would be to change the non stopword to non\s. This would retain the current behavior for inputs such as Xiphipops fisheri (non Snyder, 1904) while allowing epithets starting with non- to be parsed.
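
To illustrate the difference, here is a minimal sketch comparing a word-boundary stopword with the whitespace variant. The two patterns and the stripTail helper are simplified, hypothetical stand-ins, not gnparser's actual code, and "Aus bus non Linnaeus" is a made-up input.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified, hypothetical stand-ins for the stopword pattern.
// These are assumptions for illustration, not gnparser's real regexes.
var (
	nonWordBoundary = regexp.MustCompile(`(?i)\s+non\b.*$`) // treats "non-" as a stopword
	nonWhitespace   = regexp.MustCompile(`(?i)\s+non\s.*$`) // requires whitespace after "non"
)

// stripTail removes everything from the matched stopword to the end of the string.
func stripTail(re *regexp.Regexp, name string) string {
	return re.ReplaceAllString(name, "")
}

func main() {
	names := []string{
		"Hyacinthoides non-scripta", // hyphenated epithet
		"Aus bus non Linnaeus",      // made-up name with a "non" annotation
	}
	for _, n := range names {
		fmt.Printf("%q: \\b -> %q, \\s -> %q\n",
			n, stripTail(nonWordBoundary, n), stripTail(nonWhitespace, n))
	}
}
```

With \b the hyphen counts as a word boundary, so the epithet non-scripta is cut off; with \s it is left intact, while the annotation after a free-standing non is still removed.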

@abubelinha

Hold on. There is something odd here.

Hyacinthoides non-scripta was reported as one of these cases, but the current version of the online parser (v1.5.5) already resolves it correctly (quality 1).

But the other names @tobymarsden lists are getting quality 4 (unparsed tails).
What explains gnparser's different behaviour for such similar epithets?

@dimus
Member

dimus commented Nov 17, 2021

For these specific names I guess we need a look-ahead with '-'.
non\b matches when non is the last word in a name string, or when it is followed by a space or by some other non-letter character (',', '.', ':', etc.).

There is a broader situation where names like "Aus bus (non Linnaeus)" would benefit from properly parsed "non", but it can be addressed in a separate issue.

@tobymarsden
Author

@dimus considering the absence of lookarounds in golang's regex, this is ugly but appears to work:

var notesRe = regexp.MustCompile(
	`(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,	
)

Have I missed anything?

(non is already in the lastWordJunkRe regex so ignoring that here).
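
As a quick check, the pattern above can be run against a few of the names from this issue. This is only a sketch: notesRe is copied from the comment, while stripNotes is a hypothetical helper and the two "Aus bus" inputs are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// notesRe is the pattern proposed in the comment above.
var notesRe = regexp.MustCompile(
	`(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,
)

// stripNotes is a hypothetical helper (not part of gnparser) that drops
// everything from the first matched annotation word to the end of the string.
func stripNotes(name string) string {
	return notesRe.ReplaceAllString(name, "")
}

func main() {
	for _, name := range []string{
		"Hyacinthus non-scriptus",          // hyphenated epithet, should survive
		"Torreya grandis f. non-apiculata", // infraspecific epithet, should survive
		"Rosa ×pouzinii subsp. nonhispida", // "non" followed by a letter, should survive
		"Aus bus non Linnaeus",             // made-up annotation, should be stripped
		"Aus bus nec Linnaeus",             // made-up annotation, should be stripped
	} {
		fmt.Printf("%q -> %q\n", name, stripNotes(name))
	}
}
```

Note that non[^a-zA-Z-] needs a character after non, so a name string ending in a bare non is not matched here; per the comment above, that case is left to lastWordJunkRe.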

@dimus
Member

dimus commented Nov 18, 2021

Yes, let's try it this way. It looks like lookahead is not included for performance reasons.

@dimus dimus closed this as completed in 8f1fffe Nov 19, 2021