Epithets starting with `non` are not parsed correctly #211

tobymarsden · 2021-11-16T22:00:56Z

Currently names such as Hyacinthoides non-scripta have to be special-cased because non is a stopword.

There are also a bunch of these names which are not currently handled:

Artocarpus altilis var. non-seminiferus
Artocarpus incisus var. non-seminiferus
Asarum maculatum var. non-maculatum
Asarum versicolor var. non-versicolor
Hyacinthus non-scriptus
Hylomenes non-scripta
Grossularia non-scripta
Scilla non-scripta subsp. hispanica
Usteria non-scripta
Anthericum non-ramosum
Anthericum non-scriptum
Endymion non-scriptus
Streptanthera cuprea var. non-picta
Scilla non-scripta subsp. cernua
Torreya grandis f. non-apiculata
Rosa ×pouzinii subsp. nonhispida
Cotoneaster non-shan
Ribes non-scriptum

The most conservative way of handling this would be to change the non stopword into non\s -- this would retain the current behavior in the case of inputs such as Xiphipops fisheri (non Snyder, 1904) but allow epithets starting with non- to be parsed.
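As a rough illustration of the proposal, the sketch below compares a word-boundary form of the stopword with the whitespace form on the two inputs mentioned above. Both patterns and the `hasStopword` helper are simplified, hypothetical stand-ins, not the expressions gnparser actually uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical, simplified stand-ins for the stopword pattern; the real
// gnparser expressions contain more alternatives.
var (
	nonWordBoundary = regexp.MustCompile(`\bnon\b`)
	nonWhitespace   = regexp.MustCompile(`\bnon\s`)
)

// hasStopword reports whether re finds a "non" stopword anywhere in name.
func hasStopword(re *regexp.Regexp, name string) bool {
	return re.MatchString(name)
}

func main() {
	names := []string{
		"Hyacinthoides non-scripta",
		"Xiphipops fisheri (non Snyder, 1904)",
	}
	for _, n := range names {
		fmt.Printf("%q: non\\b=%v non\\s=%v\n",
			n, hasStopword(nonWordBoundary, n), hasStopword(nonWhitespace, n))
	}
}
```

In Go, `\b` is an ASCII word boundary and `-` is not a word character, so `non\b` fires inside `non-scripta`, while `non\s` only fires when whitespace follows.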


abubelinha · 2021-11-17T18:28:14Z

Hold on. There is something odd here.

Hyacinthoides non-scripta was reported as one of these cases, but the current version of the online parser (v1.5.5) already resolves it correctly (quality 1).

But the others @tobymarsden mentions are now getting quality 4 (unparsed tails).
What's the explanation for this different behaviour of gnparser with similar epithets?

dimus · 2021-11-17T19:59:47Z

For these specific names I guess we need a look-ahead with '-'.
non\b can be the last word in a name string, a word followed by a space, or a word followed by some other non-letter (',', '.', ':', etc.).

There is a broader situation where names like "Aus bus (non Linnaeus)" would benefit from properly parsed "non", but it can be addressed in a separate issue.

tobymarsden · 2021-11-17T23:39:37Z

@dimus considering the absence of lookarounds in golang's regex, this is ugly but appears to work:

var notesRe = regexp.MustCompile(
	`(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,
)

Have I missed anything?

(non is already in the lastWordJunkRe regex so ignoring that here).
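A quick way to check the expression above is to run it over a few of the listed names. In this sketch `notesRe` is copied from the snippet above, while `stripNotes` and the inputs `Aus bus non Linnaeus`, `Aus bus nec Cus` and `Aus bus non` are hypothetical, made up for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

// notesRe is copied verbatim from the comment above.
var notesRe = regexp.MustCompile(
	`(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,
)

// stripNotes is a hypothetical helper that removes whatever notesRe matches.
func stripNotes(name string) string {
	return notesRe.ReplaceAllString(name, "")
}

func main() {
	names := []string{
		"Hyacinthoides non-scripta",
		"Scilla non-scripta subsp. hispanica",
		"Rosa ×pouzinii subsp. nonhispida",
		"Aus bus non Linnaeus", // made-up annotation
		"Aus bus nec Cus",      // made-up annotation
		"Aus bus non",          // trailing "non": no character follows it
	}
	for _, n := range names {
		fmt.Printf("%q -> %q\n", n, stripNotes(n))
	}
}
```

Because `non[^a-zA-Z-]` has to consume one character after `non`, a trailing `non` at the very end of the string is left untouched by this expression, which is consistent with leaving that case to `lastWordJunkRe`.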

dimus · 2021-11-18T03:51:14Z

Yes, let's try it this way. It looks like lookahead is not included for performance reasons.
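For reference, a minimal sketch of the constraint being discussed: Go's `regexp` package uses RE2 syntax, which guarantees matching in time linear in the input and does not accept lookaround, so a negated character class stands in for the look-ahead. The `compiles` helper and the sample inputs are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts expr.
func compiles(expr string) bool {
	_, err := regexp.Compile(expr)
	return err == nil
}

// nonNotEpithet emulates "non not followed by a letter or hyphen" by
// consuming one extra character instead of looking ahead.
var nonNotEpithet = regexp.MustCompile(`\bnon[^a-zA-Z-]`)

func main() {
	// The look-ahead form is rejected at compile time.
	fmt.Println("lookahead compiles:", compiles(`\bnon(?![a-zA-Z-])`))
	// The character-class form is accepted.
	fmt.Println("char class compiles:", compiles(`\bnon[^a-zA-Z-]`))
	// The match includes the character after "non".
	fmt.Printf("%q\n", nonNotEpithet.FindString("Aus bus non Linnaeus"))
	// With nothing after "non" there is no match at all.
	fmt.Printf("%q\n", nonNotEpithet.FindString("Aus bus non"))
}
```

The trade-off is that the character class consumes the following character and cannot match at the end of the string, which a true look-ahead would handle.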

dimus closed this as completed in 8f1fffe Nov 19, 2021
