Identifying correctly spelled headwords #254

funderburkjim · 2016-02-24T22:26:31Z

Thus far, in our search for headword spelling errors, the general tactic has been to search for rare substrings in headwords. Earlier, we used deviations from alphabetical orderings as a source of candidates for spelling errors. Such instances have provided fertile ground for finding errors.

A complementary approach would be to identify headwords for which there is evidence of correctness.
By excluding the probably correct spellings, we will narrow the search field for finding further errors.

I think there are numerous avenues open to us in identifying the correctly spelled headwords.

gasyoun · 2016-02-24T22:30:51Z

White-listing is a different approach. For some purposes we treat MW as the ideal list. I don't quite see what would change in practice relative to that accepted approach.

funderburkjim · 2016-02-24T22:40:25Z

One piece of evidence that a particular headword spelling in a particular dictionary is correct is that an equivalently spelled headword occurs in some other dictionary.

In the approach called hwnorm1, we have identified several rules for spelling equivalence.
See here for a description of principles.

The hwnorm1 program implements many of these rules. At the moment, I'm not sure if it implements all of them.

The documentation of the rules implemented by the program is repeated here:

hwnorm1 normalization rules

These rules are independent of the dictionary.

Use homorganic nasal rather than anusvara
normalize so that 'rxx' is 'rx' (similarly, fxx is fx)
ending 'aM' is 'a'
ending 'aH' is 'a'
ending 'uH' is 'u'
ending 'iH' is 'i'
'ttr' is 'tr' (pattra v. patra)
ending 'ant' is 'at'
'cC' is 'C' (Jan 27, 2015)
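To make the rules above concrete, here is a minimal sketch of how they might be applied to a headword in SLP1 transliteration. This is not the hwnorm1 program itself: the function name `normalize`, the order in which the rules are applied, and my reading of 'rxx' as "a doubled consonant after r or f" are all assumptions made for illustration.

```python
import re

# SLP1 stops grouped by class, each mapped to the nasal of the same class.
_NASAL_BY_CLASS = {"kKgG": "N", "cCjJ": "Y", "wWqQ": "R", "tTdD": "n", "pPbB": "m"}
_NASAL_FOR_STOP = {s: n for stops, n in _NASAL_BY_CLASS.items() for s in stops}
_STOPS = "".join(_NASAL_BY_CLASS)
_CONSONANTS = _STOPS + "NYRnmyrlvSzsh"


def normalize(headword):
    """Return a normalized key for an SLP1 headword (illustrative sketch only)."""
    key = headword
    # Anusvara 'M' before a stop becomes the homorganic nasal.
    key = re.sub(
        f"M([{_STOPS}])",
        lambda m: _NASAL_FOR_STOP[m.group(1)] + m.group(1),
        key,
    )
    # 'rxx' -> 'rx' and 'fxx' -> 'fx': drop a doubled consonant after r or f.
    key = re.sub(rf"([rf])([{_CONSONANTS}])\2", r"\1\2", key)
    # Ending 'aM' or 'aH' -> 'a'.
    key = re.sub(r"a[MH]$", "a", key)
    # Ending 'uH' -> 'u', ending 'iH' -> 'i'.
    key = re.sub(r"([ui])H$", r"\1", key)
    # 'ttr' -> 'tr' (pattra v. patra).
    key = key.replace("ttr", "tr")
    # Ending 'ant' -> 'at'.
    key = re.sub(r"ant$", "at", key)
    # 'cC' -> 'C'.
    key = key.replace("cC", "C")
    return key


for word in ["pattra", "SaMkara", "Darmma", "agniH", "Bagavant", "gacCati"]:
    print(word, "->", normalize(word))
```

Two headwords are then treated as equivalent spellings when `normalize` returns the same key for both.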

funderburkjim · 2016-02-24T22:46:21Z

Based on this implementation, we can derive the following statistics regarding how many equivalent spellings occur in only one dictionary, in two dictionaries, etc.:

200045 headwords occur in 01 dictionaries
 52357 headwords occur in 02 dictionaries
 38033 headwords occur in 03 dictionaries
 24936 headwords occur in 04 dictionaries
 16138 headwords occur in 05 dictionaries
 10088 headwords occur in 06 dictionaries
  8204 headwords occur in 07 dictionaries
  8676 headwords occur in 08 dictionaries
  5469 headwords occur in 09 dictionaries
  4383 headwords occur in 10 dictionaries
  3623 headwords occur in 11 dictionaries
  3111 headwords occur in 12 dictionaries
  2246 headwords occur in 13 dictionaries
  1825 headwords occur in 14 dictionaries
  1665 headwords occur in 15 dictionaries
  1494 headwords occur in 16 dictionaries
  1334 headwords occur in 17 dictionaries
  1155 headwords occur in 18 dictionaries
   891 headwords occur in 19 dictionaries
   734 headwords occur in 20 dictionaries
   526 headwords occur in 21 dictionaries
   362 headwords occur in 22 dictionaries
   256 headwords occur in 23 dictionaries
   177 headwords occur in 24 dictionaries
   117 headwords occur in 25 dictionaries
    64 headwords occur in 26 dictionaries
    29 headwords occur in 27 dictionaries
    12 headwords occur in 28 dictionaries
     1 headwords occur in 30 dictionaries
387951 headwords in total

This data uses sanhw1.txt as a basis; in that list there are currently 433787 distinct headwords, NOT taking into account equivalence of spelling. So 45836 of the sanhw1 spellings have equivalent alternate spellings, slightly over 10%.
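A tally like the one above can be sketched as follows. The input shape (pairs of headword and dictionary code), the function names, and the sample data are assumptions for illustration; the real program reads sanhw1.txt, whose layout is not shown here, and `toy_normalize` is only a stand-in for the full hwnorm1 rules.

```python
from collections import Counter, defaultdict


def dictionary_count_histogram(entries, normalize):
    """Map 'number of dictionaries' -> 'number of normalized headwords'.

    entries: iterable of (headword, dictionary_code) pairs (assumed shape).
    normalize: function returning the equivalence key for a headword.
    """
    dicts_by_key = defaultdict(set)
    for headword, dict_code in entries:
        dicts_by_key[normalize(headword)].add(dict_code)
    return Counter(len(codes) for codes in dicts_by_key.values())


def toy_normalize(headword):
    # Stand-in for the full rule set: only the 'ttr' -> 'tr' rule.
    return headword.replace("ttr", "tr")


# Made-up sample data, not taken from sanhw1.txt.
sample = [
    ("pattra", "AP"),
    ("patra", "MW"),
    ("patra", "PW"),
    ("agni", "MW"),
]

hist = dictionary_count_histogram(sample, toy_normalize)
for n in sorted(hist):
    print(f"{hist[n]:6d} headwords occur in {n:02d} dictionaries")
print(f"{sum(hist.values()):6d} headwords in total")
```

Grouping by the normalized key is what makes 'pattra' in one dictionary count as confirmation for 'patra' in another.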

funderburkjim · 2016-02-24T22:50:49Z

From the table above, we see that 51.6% of the normalized headwords (200045 out of 387951) occur in only one dictionary. Looking at this from the other side, 48.4% have some confirmation of spelling correctness, in that an equivalent spelling occurs in more than one dictionary.
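For anyone who wants to recheck the percentages, the arithmetic can be reproduced in a few lines. The variable names are mine; the three input figures are the ones quoted in this thread.

```python
total_normalized = 387951      # headwords after spelling normalization
only_one_dictionary = 200045   # normalized headwords found in a single dictionary
distinct_sanhw1 = 433787       # distinct spellings in sanhw1.txt before normalization

share_single = 100 * only_one_dictionary / total_normalized
share_confirmed = 100 - share_single
merged = distinct_sanhw1 - total_normalized
share_merged = 100 * merged / distinct_sanhw1

print(f"only one dictionary:     {share_single:.1f}%")
print(f"more than one:           {share_confirmed:.1f}%")
print(f"spellings merged away:   {merged} ({share_merged:.1f}% of sanhw1)")
```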

funderburkjim · 2016-02-24T22:58:14Z

One way to add to the pool of spellings that we infer to be correct is to add to the rules for spelling equivalency.

Thus far, the hwnorm1 program has only applied rules globally, that is, without taking into account the dictionary in which the word occurs.

In browsing through the list of words occurring in only one dictionary, I noticed that 4382 of them (a) occur only in the AP dictionary and (b) end in 'am'. Examining a small random sample of these words led to the conclusion that, in the AP dictionary, neuter nouns ending in 'a' are presented in their nominative singular form.

So, I propose that we add this dictionary-specific spelling normalization rule to the rules of the hwnorm1 program.
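One possible shape for such a rule, as a sketch only: keep a table of extra rules keyed by dictionary code and apply them in addition to the global rules. The names `DICT_RULES` and `normalize_for_dictionary` are hypothetical and do not come from the hwnorm1 program. Note also that a bare "ending 'am' becomes 'a'" rule would touch indeclinables such as 'kaTam' as well, so a real implementation might need exceptions.

```python
import re

# Extra normalization rules that apply only to headwords from one dictionary.
# Each rule is (compiled pattern, replacement). Hypothetical table.
DICT_RULES = {
    "AP": [(re.compile(r"am$"), "a")],  # neuter nominative singular -> stem
}


def normalize_for_dictionary(headword, dict_code):
    """Apply the dictionary-specific rules for dict_code, if there are any."""
    key = headword
    for pattern, replacement in DICT_RULES.get(dict_code, []):
        key = pattern.sub(replacement, key)
    return key


print(normalize_for_dictionary("Palam", "AP"))  # rule applies
print(normalize_for_dictionary("Palam", "MW"))  # no AP rule for MW
```

The result would then be passed on to the global, dictionary-independent normalization.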

funderburkjim · 2016-02-24T23:11:52Z

@drdhaval2785 If you agree, I'll put the hwnorm1 programs into the https://github.com/sanskrit-lexicon/hwnorm1 repository.

Another reasonable place would be in the CORRECTIONS repository, since the program currently depends only on the sanhw1.txt file, which is in the CORRECTIONS repository.

gasyoun · 2016-02-25T15:49:56Z

> So 45836 of the sanhw1 spellings have equivalent alternate spellings, slightly over 10%.

Adorable stats. If only we could state the correspondence of dhatus, around 2000 words would find neighbors.

> in this AP dictionary, neuter nouns ending in 'a' are presented in their nominative singular form.

Indeed, browsing some of the bigger lists could yield such fruitful rules. How could we get similar lists of around 5000 unique members? Any clue, Jim?

> So, I propose that we add this dictionary specific spelling normalization rule to the rules of the hwnorm1 program.

Agree.

> I'll put the hwnorm1 programs into the https://github.com/sanskrit-lexicon/hwnorm1 repository.

Must! CORRECTIONS - possible.

funderburkjim · 2016-02-25T21:21:52Z

I have added the programs to the hwnorm1 repository here

Further work will be done there. This issue there documents the initial status of that work.

drdhaval2785 · 2016-02-26T01:26:59Z

Sorry for late reply. I agree to shifting post facto.

gasyoun · 2016-09-09T19:03:26Z

https://gandhari.org/n_dictionary.php might be of interest, especially

Sanskrit-Wörterbuch der buddhistischen Texte aus den Turfan-Funden (headword index)

godruma-vihari-dasa · 2016-09-10T07:58:08Z

Strange diacritics: diaeresis (dieresis, umlaut).


drdhaval2785 · 2016-09-10T10:44:49Z

Welcome aboard the dictionary corrections effort.

funderburkjim mentioned this issue Feb 25, 2016

ejf/hwnorm1c sanskrit-lexicon/hwnorm1#4

Open

This was referenced Mar 3, 2016

Foreign headwords in IEG #255

Open

Burnouf verbs and verbforms #256

Closed

This was referenced Mar 21, 2016

BHS verbforms #260

Open

CCS headword errors #261

Closed

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) sanskrit-lexicon/COLOGNE#325

Open
