Identifying correctly spelled headwords #254

Open
funderburkjim opened this issue Feb 24, 2016 · 12 comments

Comments

@funderburkjim
Contributor

Thus far, in our search for headword spelling errors, the general tactic has been to search for rare substrings in headwords. Earlier, we used deviations from alphabetical orderings as a source of candidates for spelling errors. Such instances have provided fertile ground for finding errors.

A complementary approach would be to identify headwords for which there is evidence of correctness.
By excluding the probably correct spellings, we will narrow the search field for finding further errors.

I think there are numerous avenues open to us in identifying the correctly spelled headwords.

@gasyoun
Member

gasyoun commented Feb 24, 2016

White-listing is a different approach. For some purposes we treat MW as the ideal list. I do not quite see what would change in practice with this approach.

@funderburkjim
Contributor Author

One fact in favor of the correctness of a particular headword spelling in a particular dictionary is that an equivalently spelled headword occurs in some other dictionary.

In the approach called hwnorm1, we have identified several rules for spelling equivalence.
See here for a description of principles.

The hwnorm1 program implements many of these rules. At the moment, I'm not sure if it implements all of them.

The documentation of the rules implemented by the program is repeated here:

hwnorm1 normalization rules

These rules are independent of the dictionary.

Use homorganic nasal rather than anusvara
normalize so that 'rxx' is 'rx' (similarly, fxx is fx)
ending 'aM' is 'a'
ending 'aH' is 'a'
ending 'uH' is 'u'
ending 'iH' is 'i'
'ttr' is 'tr' (pattra v. patra)
ending 'ant' is 'at'
'cC' is 'C' (Jan 27, 2015)
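
The rules listed above could be sketched in Python roughly as follows. This is an illustrative sketch, not the hwnorm1 program itself: it assumes headwords are in SLP1 transliteration (where `M` is anusvara and `H` is visarga), and the function name `normalize`, the consonant classes, and the order in which the rules are applied are all assumptions.

```python
import re

# Assumption: SLP1 transliteration. Stops are grouped under their class nasal.
STOPS_BY_NASAL = {
    "N": "kKgG",  # velar
    "Y": "cCjJ",  # palatal
    "R": "wWqQ",  # retroflex
    "n": "tTdD",  # dental
    "m": "pPbB",  # labial
}
NASAL_FOR_STOP = {s: n for n, stops in STOPS_BY_NASAL.items() for s in stops}
CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def normalize(headword):
    """Return a normalized spelling of an SLP1 headword (hypothetical helper)."""
    hw = headword
    # Anusvara before a stop -> homorganic nasal (saMkara -> saNkara).
    hw = re.sub("M([" + "".join(NASAL_FOR_STOP) + "])",
                lambda m: NASAL_FOR_STOP[m.group(1)] + m.group(1), hw)
    # Doubled consonant after r or f -> single consonant (karmma -> karma).
    hw = re.sub("([rf])([" + CONSONANTS + r"])\2", r"\1\2", hw)
    # ttr -> tr (pattra -> patra) and cC -> C.
    hw = hw.replace("ttr", "tr").replace("cC", "C")
    # Word-final aM/aH -> a, iH -> i, uH -> u, ant -> at.
    hw = re.sub(r"a[MH]$", "a", hw)
    hw = re.sub(r"([iu])H$", r"\1", hw)
    hw = re.sub(r"ant$", "at", hw)
    return hw
```

With this sketch, `normalize("pattra")` and `normalize("patra")` both return `"patra"`, so the two spellings would count as one headword.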

@funderburkjim
Contributor Author

Based on this implementation, we can derive the following statistics regarding how many equivalent spellings occur in only one dictionary, in two dictionaries, etc.:

200045 headwords occur in 01 dictionaries
 52357 headwords occur in 02 dictionaries
 38033 headwords occur in 03 dictionaries
 24936 headwords occur in 04 dictionaries
 16138 headwords occur in 05 dictionaries
 10088 headwords occur in 06 dictionaries
  8204 headwords occur in 07 dictionaries
  8676 headwords occur in 08 dictionaries
  5469 headwords occur in 09 dictionaries
  4383 headwords occur in 10 dictionaries
  3623 headwords occur in 11 dictionaries
  3111 headwords occur in 12 dictionaries
  2246 headwords occur in 13 dictionaries
  1825 headwords occur in 14 dictionaries
  1665 headwords occur in 15 dictionaries
  1494 headwords occur in 16 dictionaries
  1334 headwords occur in 17 dictionaries
  1155 headwords occur in 18 dictionaries
   891 headwords occur in 19 dictionaries
   734 headwords occur in 20 dictionaries
   526 headwords occur in 21 dictionaries
   362 headwords occur in 22 dictionaries
   256 headwords occur in 23 dictionaries
   177 headwords occur in 24 dictionaries
   117 headwords occur in 25 dictionaries
    64 headwords occur in 26 dictionaries
    29 headwords occur in 27 dictionaries
    12 headwords occur in 28 dictionaries
     1 headwords occur in 30 dictionaries
387951 headwords in total

This data uses sanhw1.txt as a basis; in that list there are currently 433787 distinct headwords, NOT taking into account equivalence of spelling. So 45836 of the sanhw1 spellings have equivalent alternate spellings, slightly over 10%.
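
A tally like the one above could be produced along these lines. This is a sketch under assumptions: the input is taken to be (headword, dictionary codes) pairs, since the layout of sanhw1.txt is not described here, and `dictionary_count_histogram`, the toy data, and the one-rule toy normalizer are all hypothetical.

```python
from collections import Counter, defaultdict


def dictionary_count_histogram(entries, normalize):
    """Count normalized headwords by the number of dictionaries they occur in.

    entries: iterable of (headword, iterable of dictionary codes).
    normalize: function mapping a spelling to its equivalence-class key.
    """
    dicts_by_key = defaultdict(set)
    for headword, codes in entries:
        # Equivalent spellings share a key, so their dictionaries are merged.
        dicts_by_key[normalize(headword)].update(codes)
    return Counter(len(codes) for codes in dicts_by_key.values())


# Toy data (not taken from sanhw1.txt); the normalizer applies only ttr -> tr.
toy_entries = [
    ("pattra", ["MW"]),
    ("patra", ["AP", "PW"]),
    ("Pala", ["MW"]),
    ("xyz", ["AP"]),
]
hist = dictionary_count_histogram(toy_entries, lambda hw: hw.replace("ttr", "tr"))
for n in sorted(hist):
    print(f"{hist[n]:6d} headwords occur in {n:02d} dictionaries")
print(f"{sum(hist.values()):6d} headwords in total")
```

In the toy data, "pattra" and "patra" collapse to one key found in three dictionaries, which is how adding an equivalence rule moves a headword out of the single-dictionary row.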

@funderburkjim
Contributor Author

From the table above, we see that 51.6% (200045 out of 387951) occur in only one dictionary. Looking at this from the other side, 48.4% have confirmation of spelling correctness, in that the spelling occurs in more than 1 dictionary.
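
The percentages quoted in this thread can be rechecked from the figures already given. The snippet below uses only numbers from the table and from the sanhw1.txt count; the variable names are illustrative.

```python
# Figures quoted in this thread.
distinct_spellings = 433787   # distinct headwords in sanhw1.txt
normalized_total = 387951     # headwords after merging equivalent spellings
single_dictionary = 200045    # normalized headwords found in only one dictionary

merged_away = distinct_spellings - normalized_total
single_pct = 100 * single_dictionary / normalized_total
confirmed_pct = 100 - single_pct
merged_pct = 100 * merged_away / distinct_spellings

print(merged_away)                 # 45836
print(round(single_pct, 1))        # 51.6
print(round(confirmed_pct, 1))     # 48.4
print(round(merged_pct, 1))        # 10.6
```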

@funderburkjim
Contributor Author

One way to add to the pool of spellings that we infer to be correct is to add to the rules for spelling equivalency.

Thus far, the hwnorm1 program has only applied rules globally, that is, without taking into account the dictionary in which the word occurs.

In browsing through the list of words occurring in only one dictionary, I noticed that 4382 of them (a) occur only in the AP dictionary and (b) end in 'am'. Examining a small random sample of these words led to the conclusion that, in this AP dictionary, neuter nouns ending in 'a' are presented in their nominative singular form. That form ends in 'am', whereas other dictionaries presumably list the stem in 'a'.

So, I propose that we add this dictionary specific spelling normalization rule to the rules of the hwnorm1 program.
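
A dictionary-specific rule of this kind might look like the sketch below. It is not the hwnorm1 code: the function names, the use of "AP" as the dictionary code, and the toy data are assumptions made for illustration.

```python
def normalize_for_dictionary(headword, dictcode):
    """Apply the proposed AP-only rule: word-final 'am' -> 'a' (hypothetical)."""
    if dictcode == "AP" and headword.endswith("am"):
        return headword[:-1]
    return headword


def ap_only_am_candidates(dicts_by_headword):
    """List headwords that occur only in AP and end in 'am'."""
    return sorted(
        hw for hw, codes in dicts_by_headword.items()
        if set(codes) == {"AP"} and hw.endswith("am")
    )


# Toy data, not taken from sanhw1.txt.
toy = {"Palam": ["AP"], "Pala": ["MW", "PW"], "kaTam": ["AP", "MW"]}
candidates = ap_only_am_candidates(toy)
print(candidates)
print([normalize_for_dictionary(hw, "AP") for hw in candidates])
```

Note that a bare suffix test would also rewrite indeclinables ending in 'am' (such as kaTam) if they happened to occur only in AP, so a real rule may need an exception list or a check that the resulting stem exists elsewhere.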

@funderburkjim
Contributor Author

@drdhaval2785 If you agree, I'll put the hwnorm1 programs into the https://github.com/sanskrit-lexicon/hwnorm1 repository.

Another reasonable place would be in the CORRECTIONS repository, since the program currently depends only on the sanhw1.txt file, which is in the CORRECTIONS repository.

@gasyoun
Member

gasyoun commented Feb 25, 2016

> So 45836 of the sanhw1 spellings have equivalent alternate spellings, slightly over 10%.

Adorable stats. If only we could state the correspondence of dhatus - around 2000 words would find neighbors.

> in this AP dictionary, neuter nouns ending in 'a' are presented in their nominative singular form.

Indeed, browsing some of the bigger lists could yield such fruitful rules. How could we get similar lists of around 5000 unique members? Any clue, Jim?

> So, I propose that we add this dictionary specific spelling normalization rule to the rules of the hwnorm1 program.

Agree.

> I'll put the hwnorm1 programs into the https://github.com/sanskrit-lexicon/hwnorm1 repository.

Must! CORRECTIONS - possible.

@funderburkjim
Contributor Author

I have added the programs to the hwnorm1 repository here.

Further work will be done there. This issue there documents the initial status of that work.

@drdhaval2785
Contributor

drdhaval2785 commented Feb 26, 2016 via email

This was referenced Mar 21, 2016
@gasyoun
Member

gasyoun commented Sep 9, 2016

https://gandhari.org/n_dictionary.php might be of interest, especially

Sanskrit-Wörterbuch der buddhistischen Texte aus den Turfan-Funden (headword index)

@godruma-vihari-dasa

Strange diacritics: diaeresis (dieresis, umlaut).



@drdhaval2785
Contributor

drdhaval2785 commented Sep 10, 2016 via email
