Botanical names #11

Open
drdhaval2785 opened this issue Jun 18, 2020 · 13 comments

Comments

@drdhaval2785

WIL has many botanical names.
They produce a lot of false positives when we check for English spelling errors.
We need to give them a separate tag, as in SNP.

@drdhaval2785
Author

\(([A-Z][^.]*)[.]\) seems to be a good regex to identify the majority of botanical names.
There are only a few false positives, which are either markup errors or usages like (In the Astronomy.).
They can be easily weeded out.
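
As a rough illustration of the regex above, here is a minimal Python sketch. The function name `extract_candidates` and the sample sentence are made up for the example; only the pattern itself comes from this thread.

```python
import re

# Pattern from the comment above: a parenthesised phrase that starts with
# a capital letter, contains no period, and ends with "." right before ")".
BOT_RE = re.compile(r"\(([A-Z][^.]*)[.]\)")

def extract_candidates(text):
    """Return the captured phrases (without parentheses or final period)."""
    return BOT_RE.findall(text)

# Invented sample text, not taken from wil.txt.
sample = "A plant (Abrus precatorius.) and a term (In the Astronomy.) here."
print(extract_candidates(sample))
# ['Abrus precatorius', 'In the Astronomy']
```

The second hit shows the kind of false positive mentioned above, which still has to be weeded out afterwards.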

@drdhaval2785
Author

wil_botany.txt

This is the extracted data. Someone needs to look into it.

@drdhaval2785
Author

A general guide can be

  1. Ignore items starting with 'In '.
  2. Remove items having more than 2 spaces in between.
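
The two rules above could be sketched as a small filter. This assumes "more than 2 spaces" means more than two space characters in the item (that is, four or more words); the helper name `keep_candidate` and the sample items are invented for illustration.

```python
def keep_candidate(item):
    """Apply the two rules of thumb to one extracted item."""
    # Rule 1: ignore items starting with 'In ' (e.g. 'In the Astronomy').
    if item.startswith("In "):
        return False
    # Rule 2 (assumed reading): drop items containing more than 2 spaces.
    if item.count(" ") > 2:
        return False
    return True

# Invented sample items, not taken from wil_botany.txt.
candidates = [
    "Abrus precatorius",
    "In the Astronomy",
    "A kind of potherb or grass",
    "Acacia Arabica",
]
kept = [c for c in candidates if keep_candidate(c)]
print(kept)
# ['Abrus precatorius', 'Acacia Arabica']
```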

@gasyoun
Member

gasyoun commented Jun 18, 2020

> This is the extracted data. Someone needs to look into it.

I can ask a student of mine. Want to weed out all non-flora?

@drdhaval2785
Author

We need to weed out non-fauna.
To reduce human labour, we need a list of scientific names of trees and plants. Then we can compare computationally, which should cut the manual work to a great extent.

@funderburkjim
Contributor

mw_bot.txt lists all the 'bot' tags in mw. It can be compared to wil_botany.txt.
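
One possible way to do such a comparison is a plain set intersection and difference. The sketch below assumes each line holds a name, optionally followed by a tab and a count; the actual layout of mw_bot.txt and wil_botany.txt may differ, and the helper names and sample lines are hypothetical.

```python
def names_from_lines(lines):
    """Collect the name part (text before the first tab) of each non-empty line."""
    names = set()
    for line in lines:
        name = line.split("\t")[0].strip()
        if name:
            names.add(name)
    return names

def compare(wil_lines, mw_lines):
    """Return (names found in both lists, names found only in the first)."""
    wil = names_from_lines(wil_lines)
    mw = names_from_lines(mw_lines)
    return sorted(wil & mw), sorted(wil - mw)

# Invented sample lines standing in for the two files.
wil_lines = ["Abrus precatorius\t9", "Acacia sirisa\t1", ""]
mw_lines = ["Abrus precatorius\t12", "Acacia Arabica\t3"]
shared, only_wil = compare(wil_lines, mw_lines)
print(shared)    # ['Abrus precatorius']
print(only_wil)  # ['Acacia sirisa']
```

Names that appear only in the first list are the ones that would still need a human look.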

@gasyoun
Member

gasyoun commented Jun 19, 2020

> we need list of scientific names of trees, plants.

We have done some preliminary work. The problem is that MW uses outdated terminology, and so does WIL.

@funderburkjim
Contributor

<bot> and <bio> tags now added to Wilson (wil.txt).

See:

@funderburkjim
Contributor

I think this completes what was requested by @drdhaval2785 in the first comment.

@gasyoun
Member

gasyoun commented Jun 21, 2020

Regarding "There still remain spelling variations which likely need to be corrected in the scientific names": what approach would you suggest for seeing them close to each other? Just read wil_bio.txt line by line?

@funderburkjim
Contributor

> Just read wil_bio.txt line by line?

Yes.

Read wil_bot.txt and wil_bio.txt

The output could identify the lines that need to be reviewed manually for corrections.
For example, a copy of wil_bio.txt could add an asterisk to those lines that are likely
candidates for correction. Here is the start of such a marked-up copy of wil_bio.txt. Note that those
lines without an asterisk can be ignored -- they don't need to be examined further.

Abrus precatorios	1 *
Abrus precatorious	1 *
Abrus precatorius	9 *
Acacia Arabica	2
Acacia sirisa	1 *
Acacia Sirisa	1 *
Acacia Sirisha	1 *
Acacia suma	1
Acheranthes aspera	1
Achyranthes	1

This could be done in an hour or two by a student, I think.
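
For a first pass before manual review, the asterisk marking could perhaps be approximated automatically. The sketch below flags a name when some other name in the list is a near match under `difflib.SequenceMatcher`. The 0.85 threshold, the case-insensitive comparison, and the function name `flag_variants` are assumptions for illustration; this is not necessarily how the asterisks in the listing above were produced.

```python
from difflib import SequenceMatcher

def flag_variants(names, threshold=0.85):
    """Return one boolean per name: True if another entry is a near match."""
    lowered = [n.lower() for n in names]
    flags = []
    for i, a in enumerate(lowered):
        # Compare against every other entry; quadratic, fine for short lists.
        close = any(
            i != j and SequenceMatcher(None, a, b).ratio() >= threshold
            for j, b in enumerate(lowered)
        )
        flags.append(close)
    return flags

# The ten names from the listing above (counts omitted).
names = [
    "Abrus precatorios", "Abrus precatorious", "Abrus precatorius",
    "Acacia Arabica", "Acacia sirisa", "Acacia Sirisa", "Acacia Sirisha",
    "Acacia suma", "Acheranthes aspera", "Achyranthes",
]
for name, flagged in zip(names, flag_variants(names)):
    print(name + (" *" if flagged else ""))
```

A lower threshold would catch more distant variants at the cost of more lines to review, so the value is worth tuning against a hand-checked sample.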

@gasyoun
Member

gasyoun commented Jun 24, 2020

> Note that those lines without an asterisk can be ignored -- they don't need to be examined further

Thanks, crystal clear and will be done.

@drdhaval2785
Author

@Amygdalus
There are two files: wil_bot.txt (thought to be flora) and wil_bio.txt (thought to be fauna).
Can you go through them and let us know if there are any spelling errors, or if some flora ended up as fauna or vice versa?
See #11 (comment) for the files.
