Identifying correctly spelled headwords #254
Comments
White-listing is a different approach. For some purposes, we propose that MW is the ideal list. I hardly understand what should change in practice concerning this.
One fact in favor of the correctness of a particular headword spelling in a particular dictionary is that an equivalently spelled headword occurs in some other dictionary. In the approach called hwnorm1, we have identified several rules for spelling equivalence. The hwnorm1 program implements many of these rules. At the moment, I'm not sure if it implements all of them. The documentation of the rules implemented by the program is repeated here:
Based on this implementation, we can derive the following statistics regarding how many equivalent spellings occur in only one dictionary, in two dictionaries, etc.:
This data uses sanhw1.txt as a basis; in that list there are currently 433787 distinct headwords, NOT taking into account equivalence of spelling. So 45836 of the sanhw1 spellings have equivalent alternate spellings, slightly over 10%.
From the table above, we see that 51.6% of the normalized spellings (200045 out of 387951) occur in only one dictionary. Looking at this from the other side, 48.4% have confirmation of spelling correctness, in that the spelling occurs in more than one dictionary.
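The counting behind these statistics can be sketched roughly as follows. This is only an illustration, not the hwnorm1 program: the function names, the `(headword, dictionary codes)` input shape, the sample entries, and the `toy_rule` normalization are all assumptions made up for this example. The arithmetic at the end uses only the figures quoted above.

```python
from collections import Counter, defaultdict


def dictionary_count_histogram(entries, normalize):
    """Map 'number of dictionaries' -> 'number of normalized spellings'.

    entries: iterable of (headword, set_of_dictionary_codes) pairs
             (an assumed input shape, not necessarily that of sanhw1.txt).
    normalize: function mapping a headword to its normalized spelling.
    """
    dicts_by_spelling = defaultdict(set)
    for headword, dict_codes in entries:
        # Equivalent spellings share one key, so their dictionaries are pooled.
        dicts_by_spelling[normalize(headword)].update(dict_codes)
    return Counter(len(codes) for codes in dicts_by_spelling.values())


def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def toy_rule(headword):
    # Hypothetical stand-in for the real equivalence rules, for illustration only.
    return headword.replace("Mk", "Nk")


# Made-up sample entries; the dictionary codes are placeholders.
sample = [
    ("aMka", {"MW"}),
    ("aNka", {"PW", "AP"}),
    ("agni", {"MW", "AP"}),
    ("deva", {"SKD"}),
]
histogram = dictionary_count_histogram(sample, toy_rule)
print(sorted(histogram.items()))

# Arithmetic on the figures quoted in this comment.
print(433787 - 387951)                   # spellings absorbed by equivalence
print(percent(200045, 387951))           # share found in only one dictionary
print(percent(387951 - 200045, 387951))  # share found in more than one
```

Spellings that land in the "one dictionary" bucket are the ones without cross-dictionary confirmation, so they are the natural pool to search for errors.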
One way to add to the pool of spellings that we infer to be correct is to add to the rules for spelling equivalency. Thus far, the hwnorm1 program has only applied rules globally, that is, without taking into account the dictionary in which the word occurs. In browsing through the list of words occurring in only one dictionary, I noticed that 4382 cases are of words that (a) occur only in the AP dictionary, and (b) end in 'am'. Examining a small random sample of these words led to the conclusion that, in the AP dictionary, neuter nouns ending in 'a' are presented in their nominative singular form. So, I propose that we add this dictionary-specific spelling normalization rule to the rules of the hwnorm1 program.
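A minimal sketch of what such a dictionary-specific rule might look like is given here. It is not taken from the hwnorm1 program: the function name, the `"AP"` code string, and the optional `known_elsewhere` guard are assumptions for illustration. The guard is one possible way to avoid stripping the final 'm' from words that end in 'am' but are not neuter nominatives.

```python
def normalize_for_dictionary(headword, dict_code, known_elsewhere=None):
    """Apply an AP-only rule: treat a final 'am' as the stem-final 'a'.

    headword: the spelling as it appears in the dictionary.
    dict_code: code of the dictionary the headword comes from (assumed "AP").
    known_elsewhere: optional set of spellings found in other dictionaries;
        when given, the rule applies only if the shortened form is in it.
    """
    if dict_code != "AP" or not headword.endswith("am"):
        return headword  # rule is dictionary-specific; leave others alone
    candidate = headword[:-1]  # drop the final 'm', keeping the 'a'
    if known_elsewhere is not None and candidate not in known_elsewhere:
        return headword  # no confirmation elsewhere, so keep the original
    return candidate


print(normalize_for_dictionary("Palam", "AP"))
print(normalize_for_dictionary("Palam", "MW"))
print(normalize_for_dictionary("svayam", "AP", known_elsewhere={"Pala"}))
```

Without the guard, every AP headword ending in 'am' is shortened, which matches the proposal as stated. With the guard, the rule only fires when it actually produces a match in another dictionary.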
@drdhaval2785 If you agree, I'll put the hwnorm1 programs into the https://github.com/sanskrit-lexicon/hwnorm1 repository. Another reasonable place would be in the CORRECTIONS repository, since the program currently depends only on the sanhw1.txt file, which is in the CORRECTIONS repository.
Adorable stats. If only we could state the correspondence of dhatus - around 2000 words would find neighbors.
Indeed, browsing some of the bigger lists could yield such fruitful rules. How could we get similar 5000-member unique lists? Any clue, Jim?
Agree.
Must!
Have added to the hwnorm1 repository. Further work will be done there. An issue there documents the initial status of that work.
Sorry for late reply. I agree to shifting post facto.
https://gandhari.org/n_dictionary.php might be of interest, especially
Strange diacritics: diaeresis (dieresis, umlaut).
Welcome aboard the dictionary corrections effort.
Thus far, in our search for headword spelling errors, the general tactic has been to search for rare substrings in headwords. Earlier, we used deviations from alphabetical orderings as a source of candidates for spelling errors. Such instances have provided fertile ground for finding errors.
A complementary approach would be to identify headwords for which there is evidence of correctness.
By excluding the probably correct spellings, we will narrow the search field for finding further errors.
I think there are numerous avenues open to us in identifying the correctly spelled headwords.