
Parsing the dagger symbol? #85

Closed
dimus opened this issue Dec 18, 2020 · 4 comments

@dimus
Member

dimus commented Dec 18, 2020

created by @gdower at https://gitlab.com/gogna/gnparser/-/issues/85

Names often include the dagger symbol (†) to indicate that the taxon is extinct. It might be useful to remove the dagger from the name and add an extinct boolean.

@dimus dimus added the enhancement New feature or request label Dec 18, 2020
@dimus dimus self-assigned this Dec 18, 2020
@dimus
Member Author

dimus commented Dec 18, 2020

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/43

1. Henriksenopterix†

2. Henriksenopterix† paucistriata (Henriksen, 1922)

3. Heteralocha acutirostris (Gould, 1837) Huia N E†

4. Oncorhynchus nerka (Walbaum, 1792) Sockeye salmon F A †? 

5. Ostomalynus Kireichuk & Ponomarenko, 1990. Type 
   species: †  Ostomalynus ovalis Kireichuk & 
   Ponomarenko, 1990, by original designation.

Cases 1-3: pos will work fine if the dagger is substituted with a space.

Cases 4-5: These are problematic. I guess what I can do is remember where the daggers occurred and, if all of them were in the unparsed tail, ignore them.

@dimus
Member Author

dimus commented Dec 18, 2020

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/44

@gdower do you have examples of where you see the dagger symbol in the wild? If it is always at the end, the pos part of the parsed data will not get broken.

@dimus
Member Author

dimus commented Dec 18, 2020

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/45

It does make sense. I can imagine 2 ways to solve it.

  1. Add a preprocessing step that detects and removes the dagger symbol. This approach has, in my view, 2 problems:
  • It would run for every string, while the dagger symbol is pretty rare. If the implementation is a regex, it will cost 1-2% of speed. However, if it is done by scanning every symbol, the slowdown will be negligible.
  • It will modify the name. However, we already change it, for example when we remove HTML tags, and we normalize the name anyway.
  2. If we have an unparsed tail, we scan it for the dagger symbol. We keep the dagger in the unparsed tail and set the extinct flag to true. In this case the search for the dagger would rarely be needed. Possible problems:
  • The name in this case is marked as quality 3, even though marking extinct taxa with the dagger symbol is a commonly accepted practice.

I think the first approach is better. After looking at "dagger" names in the wild, the 2nd approach is not going to work at all.
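
To make the two detection options from the first approach concrete, here is a minimal Go sketch. The function names hasDaggerRegex and hasDaggerScan are hypothetical and not taken from gnparser; the sketch only shows that both checks give the same answer, and makes no claim about their relative speed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// daggerRe is compiled once; the pattern matches the dagger (U+2020).
var daggerRe = regexp.MustCompile(`†`)

// hasDaggerRegex is a hypothetical regex-based check.
func hasDaggerRegex(name string) bool {
	return daggerRe.MatchString(name)
}

// hasDaggerScan is a hypothetical check that scans the string
// rune by rune without using a regular expression.
func hasDaggerScan(name string) bool {
	return strings.ContainsRune(name, '†')
}

func main() {
	names := []string{
		"Henriksenopterix†",
		"Heteralocha acutirostris (Gould, 1837) Huia N E†",
		"Oncorhynchus nerka (Walbaum, 1792)",
	}
	for _, n := range names {
		fmt.Println(hasDaggerRegex(n), hasDaggerScan(n), n)
	}
}
```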

@dimus
Member Author

dimus commented Nov 10, 2021

Solution:

  1. The dagger is detected during preprocessing and substituted with 3 spaces, to keep the same number of bytes (the dagger is 0xE2 0x80 0xA0 in UTF-8).
  2. The HasDagger flag is set to true.
  3. Parsing proceeds as usual.
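
The steps above can be sketched in Go roughly as follows. This is an illustration under assumptions, not the code from the closing commit: the preprocessed type and the replaceDagger function are hypothetical names, and only the HasDagger flag name comes from this thread.

```go
package main

import (
	"bytes"
	"fmt"
)

// dagger is U+2020, which UTF-8 encodes as 0xE2 0x80 0xA0 (3 bytes).
var dagger = []byte("†")

// preprocessed is a hypothetical result type for this sketch.
type preprocessed struct {
	Body      []byte // name with every dagger replaced by 3 spaces
	HasDagger bool   // true if at least one dagger was found
}

// replaceDagger swaps each 3-byte dagger for 3 one-byte spaces,
// so the byte length of the name does not change.
func replaceDagger(name string) preprocessed {
	body := []byte(name)
	if !bytes.Contains(body, dagger) {
		return preprocessed{Body: body}
	}
	body = bytes.ReplaceAll(body, dagger, []byte("   "))
	return preprocessed{Body: body, HasDagger: true}
}

func main() {
	in := "Henriksenopterix† paucistriata (Henriksen, 1922)"
	res := replaceDagger(in)
	fmt.Printf("%q\n", res.Body)
	fmt.Println(res.HasDagger, len(in) == len(res.Body))
}
```

Because the replacement has the same byte length as the dagger, any byte offsets that fall after the dagger stay the same as in the input string.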

This approach generates a warning about too many spaces, and we cannot tell whether the warning was triggered by the dagger character or by genuine extra spaces in the input.

Solution: remove the extra spaces silently. I think removing extra spaces is similar to removing a comma before the year; it can probably be done without issuing a warning.
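
As a rough illustration of silent removal, the sketch below collapses whitespace runs without recording a warning. The collapseSpaces helper is a hypothetical name; the thread does not say how gnparser does this internally, and a real parser that reports positions would have to handle offsets separately.

```go
package main

import (
	"fmt"
	"strings"
)

// collapseSpaces is a hypothetical helper: it trims leading and trailing
// whitespace and reduces every run of whitespace to a single space.
// It emits no warning, which is the behavior proposed in this thread.
func collapseSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	// Input as it would look after a dagger was replaced by 3 spaces.
	fmt.Printf("%q\n", collapseSpaces("Henriksenopterix    paucistriata (Henriksen, 1922)"))
}
```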

@dimus dimus closed this as completed in f37d469 Nov 10, 2021