Different conventions of Sanskrit dictionaries #43

drdhaval2785 · 2014-12-01T04:08:44Z

This originates from discussion at #42 (comment) and @gasyoun's remark at point 4 of #42 (comment). Kept here, so that it is not lost in general discussion there.

@funderburkjim
We need to devise a reconciliation by which users of both conventions land on the same data.
'nt' or 't' both should end up on the same page.

funderburkjim · 2014-12-04T21:32:48Z

This discussion is related to one at sanskrit-lexicon/COLOGNE#41.

I suggest that before time is spent devising a reconciliation of the user interfaces of the dictionaries,
we need to develop (by means of a list), a specific body of knowledge regarding the spelling peculiarities of the various dictionaries. The Cologne/issues/41 makes a modest proposal of how to begin gathering this knowledge.

When we have made explicit enough of these specifics, we could develop a simple web interface to sanhw1 to help refine the ideas.

drdhaval2785 · 2014-12-05T09:33:42Z

@gasyoun said:

In the collection of the list there I propose four big groups.

Indian

European

Modern

Older / Bengal

If a dictionary was printed in India in 1850, it has a set of similar conventions. If it's 2004 and we look up the Sanskrit-Catalan dictionary, it uses the orthography and the ways of writing the dhatus that are popular in Europe nowadays. It might be too general, but so be it. Not sure if the classification has enough practical meaning.

drdhaval2785 · 2014-12-05T09:39:13Z

@gasyoun and @funderburkjim

I have seen some words in these dictionaries.
Gasyoun's approach seems to be dictionarywise.
I intend to approach this issue problemwise.

e.g.
There are broadly six places where dictionaries diverge as far as headwords are concerned. More can be added if need be.

1 Treatment of anusvAra
2 Duplication of letters after 'r'
3 Convention of writing words which have 't' at end but get converted to 'n' in declension.
4 Whether the dictionary shows the uninflected form or the inflected form of a word.
5 Convention regarding anusvAra of verb
6 Convention regarding handling 'f' at the end of a word

We should group our dictionaries on basis of these parameters.
I will try to note the clusters for each of these groups of conventions.

drdhaval2785 · 2014-12-05T12:01:36Z

N.B. - I am using sanhw1.txt as base. Not sure of dictionaries not in them.

1 Treatment of anusvAra

Option 1 - Treat them as M when occurring inside a word (other than the cases where the first member of a compound ends with m). e.g. caMcalaM:AP90
AP90

Option 2 - Treat them as fifth letter of each varga when in between a word. e.g. caYcala
AP,BEN,BOP,BUR,CAE,CCS,MD,MW,MW72,PW,PWG,SCH,SHS,STC,VCP,WIL,YAT

Option 3 - Use M at the end of a word to denote neuter gender. e.g. aMSukaM
SKD,AP90,BHS,WIL,PW,PWG,VCP

Option 4 - Use M at the end of a word (not to denote neuter gender, but to denote avyayas mostly) where 'm' is supposed to be. e.g. anukAmaM, anudiSaM etc
YAT

Option 5 - Treat m as M when occurring in the cases where the first member of a compound ends with m. e.g. saMgIta (sam+gIta)
AP,AP90,CAE,CCS,MD,MW,PW,PWG,STC

Option 6 - Treat m as the fifth letter of a varga when occurring in the cases where the first member of a compound ends with m. e.g. saNgIta (sam+gIta)
BUR,MW72,SHS,VCP,WIL,YAT,SKD
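A minimal sketch of how options 1/5 (anusvAra) could be mapped onto options 2/6 (fifth letter of the varga), assuming SLP1 headwords as in the examples above. The function name `anusvara_to_nasal` is hypothetical, and the sketch only touches an M that stands directly before a varga stop; an M before a sibilant, a semivowel or at the end of the word is left alone.

```python
import re

# SLP1 stops of each varga, mapped to that varga's fifth letter (its nasal).
VARGA_NASAL = {}
for stops, nasal in (("kKgG", "N"), ("cCjJ", "Y"), ("wWqQ", "R"),
                     ("tTdD", "n"), ("pPbB", "m")):
    for stop in stops:
        VARGA_NASAL[stop] = nasal

# An anusvAra 'M' immediately followed by a varga stop.
ANUSVARA_BEFORE_STOP = re.compile(r"M(?=[kKgGcCjJwWqQtTdDpPbB])")


def anusvara_to_nasal(headword):
    """Replace M before a varga stop with the nasal of that stop's varga."""
    def replace(match):
        following = match.string[match.end()]  # the stop right after M
        return VARGA_NASAL[following]
    return ANUSVARA_BEFORE_STOP.sub(replace, headword)


print(anusvara_to_nasal("caMcala"))   # option 1 spelling
print(anusvara_to_nasal("saMgIta"))   # option 5 spelling
print(anusvara_to_nasal("aMSuka"))    # M before a sibilant stays as it is
```

Because the rule is purely mechanical, it cannot tell a compound boundary (sam+gIta) from a word-internal nasal, which is the same limitation discussed for options 5 and 6.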

drdhaval2785 · 2014-12-05T12:05:04Z

@funderburkjim and @gasyoun Once we are OK with understanding this point 1, we move to next.

The coding implications for options 1 and 2 -
As we can see that the preferred mode is to have fifth letter of each varga (option 2), let's modify the display of AP90 to suit option 2 i.e. If a user enters caYcala in AP90, he should land on caMcalaM data page.

The coding implications for option 3 and 4 -
The most preferred mode is to have 'm' at the end of a headword and not 'M'.
Let's change the frontend of the dictionaries in options 3 and 4 so that a person entering 'm' at the end lands on page with 'M' at end. e.g. aMSukam should land on aMSukaM and anukAmam should land on anukAmaM data page.
The second implication is that users usually don't enter inflected forms; they usually enter only the base. e.g. a user will most probably enter aMSuka and never aMSukam. We will have to see which words in the dictionaries of option 3 are of neuter gender. We can safely code their uninflected counterparts to land on the inflected counterparts. e.g. If a user enters aMSuka in SKD, he should land on the aMSukaM page.

Coding implications of options 5 and 6 -
This is difficult choice to make, but I will go for option 6 as default (Because it is virtually impossible to determine mechanically the difference in options 1/2 and 5/6 i.e. whether the M is the end of first word of compound or not. So we will have to stick to 6 which is counterpart of 2).
Set dictionaries having convention 5 to land on them with even fifth letter as input. e.g. If a person enters saNgIta in MW - he should land on saMgIta data page in MW.

Hope I am clear in what I want in this section.
If some dictionaries are missing out, flag it up. We will try to locate the conventions in them.
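One way to picture the "land on the same page" idea for options 3 and 4 is a lookup index keyed on a normalized form rather than on the printed headword. The sketch below is only an illustration: `ending_key`, `build_index` and `lookup` are hypothetical names, the sample headword list is made up from the examples in this comment, and dropping a final 'm' after 'a' is an assumption that would need an exception list for genuine stems in 'am'.

```python
from collections import defaultdict


def ending_key(word):
    """Hypothetical lookup key: drop a final 'M' or 'm' that follows 'a'."""
    if word.endswith(("aM", "am")):
        return word[:-1]
    return word


def build_index(headwords):
    """Map each key to the headwords (as printed) that share it."""
    index = defaultdict(list)
    for headword in headwords:
        index[ending_key(headword)].append(headword)
    return index


def lookup(index, query):
    """Return the printed headwords a user query should land on."""
    return index.get(ending_key(query), [])


# Sample headwords in the 'M at the end' convention (illustrative only).
sample_headwords = ["aMSukaM", "anukAmaM", "agni"]
index = build_index(sample_headwords)

for query in ("aMSuka", "aMSukam", "aMSukaM", "anukAmam"):
    print(query, "->", lookup(index, query))
```

The printed headwords are never rewritten here; only the key is normalized, so the data page keeps the spelling of the dictionary.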

gasyoun · 2014-12-05T13:29:34Z

@drdhaval2785 the approach and classification problemwise at this stage is more than enough and is scientific. I understand and agree with №1. The thing is that there are mixed cases. Like exclusions when AP90 is in MW mode and vice versa.
See https://groups.google.com/forum/#!topic/bvparishat/pnYp3GxBRvs and https://groups.google.com/forum/#!topic/bvparishat/Eyz0lSNDk-s and https://groups.google.com/forum/#!topic/bvparishat/IaCEYDmLmbI

I guess https://groups.google.com/forum/#!topic/bvparishat/tMSkglQYmfU is out of scope. So it's not only by rules, but exclusion and inclusion lists are wanted as well.

drdhaval2785 · 2014-12-13T12:21:17Z

@gasyoun

The thing is that there are mixed cases. Like exclusions when AP90 is in MW mode and vice versa

I agree. But first we have to agree on general conventions. Exceptions can come in when we correct our headwords for each rule. To find exceptions we will need files which have fewer headword duplications.

drdhaval2785 · 2014-12-13T16:59:52Z

2 Duplication of letters after 'r'

Option 1 - Duplication done
SKD,VCP,SHS,WIL,YAT,PD

Option 2 - Duplication not done
Rest all dictionaries.
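A rough sketch of collapsing the duplication, assuming SLP1 headwords. The name `undouble_after_r` is hypothetical. It only handles a consonant written twice identically after 'r'; a doubling written as unaspirated plus aspirated letter is not covered and would need its own rule.

```python
import re

# All SLP1 consonants; a doubled one after 'r' is reduced to a single one.
SLP1_CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
DOUBLED_AFTER_R = re.compile(r"r([" + SLP1_CONSONANTS + r"])\1")


def undouble_after_r(headword):
    """Turn 'rxx' into 'rx', where x is any single SLP1 consonant."""
    return DOUBLED_AFTER_R.sub(r"r\1", headword)


print(undouble_after_r("karmma"))
print(undouble_after_r("sarvva"))
print(undouble_after_r("karma"))   # already in the 'not duplicated' form
```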

drdhaval2785 · 2014-12-13T17:14:50Z

3 Convention of writing words which have 't' at end but get converted to 'n' in declension.

Option 1 - Keep verb form as 'at'
SHS,WIL,GST,MW,MW72,PD,SCH,MD

Option 2 - Keep verb form as 'ant'
BEN,CAE,CCS,PW,PWG,STC,SCH,PD,MD,BHS,

Some dictionaries seem to follow two conventions. Need a closer look.

Option 3 - Keep 'vat' / 'mat'
AP,AP90,BOP,BUR,GRA,GST,MD,MW,PD,SHS,VCP,WIL,YAT

Option 4 - Keep 'vant' / 'mant'
PW,PWG,SCH,STC,CAE,CCS,BEN,BHS
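For lookup purposes, one option is to expand a query into both spellings and try each of them, instead of rewriting the dictionaries. A small sketch under that assumption follows; `at_ant_variants` is a hypothetical name. It over-generates for words in 'at' that never take 'ant', which is harmless as long as the extra variant simply finds nothing.

```python
def at_ant_variants(headword):
    """Return the 'at' and 'ant' spellings of a headword ending in either."""
    if headword.endswith("ant"):
        stem = headword[:-3]
    elif headword.endswith("at"):
        stem = headword[:-2]
    else:
        return {headword}          # convention 3 does not apply
    return {stem + "at", stem + "ant"}


print(sorted(at_ant_variants("Bagavat")))
print(sorted(at_ant_variants("Bavant")))
print(sorted(at_ant_variants("agni")))
```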

drdhaval2785 · 2014-12-13T17:19:26Z

4 Whether the dictionary shows the uninflected form or the inflected form of a word.

Option 1 Inflected form
AP,AP90,SKD,VCP

Option 2 Uninflected form
Rest all dictionaries.

drdhaval2785 · 2014-12-13T17:23:13Z

5 Convention regarding anusvAra of verb

Option 1 - As in dhAtupATha (stanBa)
SKD,VCP,PD,

Option 2 - With removal of anubandhas and with conversion to fifth letter. (stamB)
AP,BEN,BOP,BUR,CAE,CCS,MD,MW,MW72,PW,PWG,SHS,STC

Option 3 - With removal of anubandha but without conversion to fifth letter - with anusvAra (staMB)
AP90

drdhaval2785 · 2014-12-13T17:24:10Z

This sums up my bird's-eye view classification.
Now @gasyoun and @funderburkjim may like to add their comments and exceptions to this broad classification, or some other addition / alteration

drdhaval2785 · 2014-12-16T11:57:18Z

6 Convention regarding handling 'f' at the end of a word

This originates from #49 (comment), where the word prabandDar drew my attention to the tendency in PWG to write 'ar' instead of 'f' at the end of a word that is supposed to end in 'f'.
Same holds true for PW.

Option 1 - Uses 'ar' instead of 'f' at the end. (e.g. kartar )
CCS,PW,PWG,SCH,

Option 2 - Uses 'f' at the end. (e.g. kartf)
AP,AP90,BEN,BOP,BUR,CAE,GRA,MD,MW,MW72,STC
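A sketch of mapping option 1 onto option 2, with an exclusion list for words that really do end in 'ar'. The name `ar_to_f` is hypothetical, and the three indeclinables in the default exclusion set are just illustrative; a real list would have to be collected from the data.

```python
# Indeclinables that genuinely end in 'ar' (illustrative, not exhaustive).
INDECLINABLES_IN_AR = frozenset({"antar", "punar", "prAtar"})


def ar_to_f(headword, exceptions=INDECLINABLES_IN_AR):
    """Rewrite a final 'ar' as 'f' unless the word is a listed exception."""
    if headword.endswith("ar") and headword not in exceptions:
        return headword[:-2] + "f"
    return headword


print(ar_to_f("kartar"))
print(ar_to_f("prabandDar"))
print(ar_to_f("punar"))   # excluded, stays unchanged
```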

drdhaval2785 · 2014-12-16T12:25:19Z

@funderburkjim
High time we concentrate on this substantial issue of normalizing different dictionaries, rather than correcting individual headwords.

Reasons -
1 Once we normalize this, we will have around 60000 fewer headwords than at present.
2 Also the dictionaries can communicate with each other better. e.g. aMSu would be able to communicate with its aMSuH counterpart in another dictionary.
3 Till now we did not have proper documentation of the different conventions. Now we have some propositions, which I noted above. Based on these, correction / addition / exception lists etc. can be worked out by computer.

Let's get started in this precarious but useful area.
What's your vote @gasyoun ?

I give it the highest priority.

gasyoun · 2014-12-16T14:52:00Z

I guess out of top 10 tasks

normalizing different dictionaries
correcting individual headwords
both are the most important ones. Because normalizing is a rather big task I would vote for finishing first round of dictionary cleanup first.

drdhaval2785 · 2014-12-16T15:45:40Z

Considering Marcis's feedback - I guess we will have to do both side by side.

funderburkjim · 2014-12-30T01:50:56Z

How would you handle this case (data from sanhw1.txt):

aMS:AP,AP90,BEN,BOP,BUR,GST,MW,MW72,PD
aMSa:BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MD,MW,MW72,PD,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT
aMSaH:AP,AP90,SKD

In WIL, there are two entries for aMSa, one a verb and one a noun:

[L=5] [p= 001]  .aMSa¦ r. 10th cl. (aMSayati) To separate or divide. See aMsa.
[L=6]   .aMSa¦ m. (-SaH)
1 A share or portion.
2 A part.
3 A shoulder, the shoulder blade.
4 (In arithmetic) a fraction.
5 The numerator of a fraction.
6 A degree of latitude or longitude, &c. See aMsa.
E. aMSa to divide, ac affix.

L=5 aMSa (WIL) <--> aMS in MW
L=6 aMSa (WIL) <--> aMSa in MW & aMSaH in AP.

Maybe as a first approximation, we could say all these are 'equal'.
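To experiment with such equivalences, the three lines above can be parsed and grouped under a normalized key. The sketch assumes the `headword:DICT,DICT,...` layout seen in those lines; `parse_sanhw1`, `group_by_key` and `strip_visarga` are hypothetical names, and the key used here only removes a final visarga, so aMS still stays apart from aMSa.

```python
from collections import defaultdict

SAMPLE = """\
aMS:AP,AP90,BEN,BOP,BUR,GST,MW,MW72,PD
aMSa:BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MD,MW,MW72,PD,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT
aMSaH:AP,AP90,SKD
"""


def parse_sanhw1(text):
    """Parse 'headword:DICT,DICT,...' lines into {headword: [dict codes]}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        headword, _, codes = line.partition(":")
        entries[headword] = codes.split(",")
    return entries


def strip_visarga(headword):
    """Very small normalization: drop a final visarga 'H'."""
    return headword[:-1] if headword.endswith("H") else headword


def group_by_key(entries, key):
    """Group printed headwords (with their dictionaries) under key(headword)."""
    groups = defaultdict(dict)
    for headword, dictionaries in entries.items():
        groups[key(headword)][headword] = dictionaries
    return dict(groups)


groups = group_by_key(parse_sanhw1(SAMPLE), strip_visarga)
for key, members in groups.items():
    print(key, "<-", sorted(members))
```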

gasyoun · 2014-12-30T22:09:51Z

If we would have a special markup for roots - including WIL, that would be a good starting point. Near "duplicate" entries are not as bad as they seem - not more than 10k in each dictionary and half of them taggable semi-automatically, I guess. I've had a few battles with them.

funderburkjim · 2015-01-01T20:58:59Z

Here is a simple application that may help us progress in thinking about headword normalization.

Have I left out any normalizations ?

drdhaval2785 · 2015-01-03T14:13:18Z

How would you handle this case (data from sanhw1.txt):
(aMS / aMSa case)

My approach would be to cull out some information from the description and ascertain whether aMSa is a verb or a noun.
If it is a verb - aMSa should be normalised with aMS
If it is a noun - keep it as it is.
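A sketch of that idea for the two WIL entries quoted above. It assumes, from those two lines alone, that the headword sits between '.' and '¦' and that a following "r." marks a root; both are guesses about the WIL markup, not documented facts, and `normalized_wil_headword` is a hypothetical name.

```python
import re

# '.headword¦' followed by the first token of the entry body (e.g. 'r.' or 'm.').
WIL_ENTRY = re.compile(r"\.(?P<hw>[^\s¦]+)¦\s*(?P<tag>\S+)")


def normalized_wil_headword(line):
    """Return the headword, dropping a final 'a' when the entry looks like a root."""
    match = WIL_ENTRY.search(line)
    if match is None:
        return None
    headword, tag = match.group("hw"), match.group("tag")
    if tag == "r." and headword.endswith("a"):
        return headword[:-1]   # verb: aMSa is normalised to aMS
    return headword            # noun (or anything else): keep as it is


verb_line = "[L=5] [p= 001]  .aMSa¦ r. 10th cl. (aMSayati) To separate or divide. See aMsa."
noun_line = "[L=6]   .aMSa¦ m. (-SaH)"
print(normalized_wil_headword(verb_line))
print(normalized_wil_headword(noun_line))
```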

drdhaval2785 · 2015-01-03T14:15:24Z

@funderburkjim
normalization rules seem to be oversimplified.
Can we go by the conventions enumerated in this thread one by one and test for positive and negative lists?

1 Treatment of anusvAra
2 Duplication of letters after 'r'
3 Convention of writing words which have 't' at end but get converted to 'n' in declension.
4 Whether the dictionary shows the uninflected form or the inflected form of a word.
5 Convention regarding anusvAra of verb
6 Convention regarding handling 'f' at the end of a word

gasyoun · 2015-01-03T17:04:45Z

Which rules seem too simple?

These rules are independent of the dictionary.

Use homorganic nasal rather than anusvara
normalize so that 'rxx' is 'rx' (similarly, fxx is fx)
ending 'aM' is 'a'
ending 'aH' is 'a'
ending 'uH' is 'u'
ending 'iH' is 'i'
'ttr' is 'tr' (pattra v. patra)
ending 'ant' is 'at'
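Dictionary-independent rules like these can be kept as an ordered table of patterns, which makes it easy to switch a single rule on or off while testing positive and negative lists. The sketch below covers only the ending rules and the 'ttr' rule from the list, to keep it short; `RULES` and `apply_rules` are hypothetical names and not taken from any existing script.

```python
import re

# (pattern, replacement) pairs, applied in order to an SLP1 headword.
RULES = [
    (re.compile(r"a[MH]$"), "a"),   # ending 'aM' / 'aH' is 'a'
    (re.compile(r"uH$"), "u"),      # ending 'uH' is 'u'
    (re.compile(r"iH$"), "i"),      # ending 'iH' is 'i'
    (re.compile(r"ttr"), "tr"),     # 'ttr' is 'tr' (pattra v. patra)
]


def apply_rules(headword, rules=RULES):
    """Apply every rule once, in table order, and return the result."""
    for pattern, replacement in rules:
        headword = pattern.sub(replacement, headword)
    return headword


for word in ("aMSaH", "aMSukaM", "aMSuH", "agniH", "pattra"):
    print(word, "->", apply_rules(word))
```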

gasyoun · 2015-10-25T20:27:37Z

@drdhaval2785 what about
pitR - MW
pitA - SKD
What would you call it? Would it not deserve to be 8th?
An SKD user (a lover of the printed book and scan) said he will not use the OCR version because he can't find pitR in SKD, and asked whether there is another way to write "father". I opened the scan, and although this might not be a very common issue, it still is one. The orthographic conventions are very important, and in most cases they concern word endings.

funderburkjim · 2015-10-26T21:48:01Z

pitA is nominative singular of pitR (HK transliteration)

gasyoun · 2015-11-18T12:30:13Z

@drdhaval2785 I do not insist. If problemwise seems a better solution, Dhaval, let's go for it.
But anyway

ending 'aM' is 'a'
ending 'aH' is 'a'
ending 'uH' is 'u'
ending 'iH' is 'i'

Should be 7th (Option 7), the most important step. In it I would include the killing of Anusvara at the end ['aM' is 'a']. 'rxx' is 'rx' will bring far less. Killing of H and M at end - a treasure chest it's time to open.

sanskrit-lexicon/CORRECTIONS#43

funderburkjim · 2016-12-08T22:11:33Z

In updating the CORRECTIONS repository today, I noticed the file '61267-Sanskrit-Catalan-Words-List.txt',
dated 9/6/2016.

Wondered who uploaded this, and what is the source?

drdhaval2785 · 2016-12-09T02:50:21Z

Gasyoun uploaded it.

gasyoun · 2016-12-09T21:18:03Z

Wondered who uploaded this, and what is the source?

http://www.llibres.cat/llibres-de-filologia-i-linguistica/231453-sanskrit-catalan-dictionary.html

gasyoun · 2017-03-17T13:06:00Z

@drdhaval2785 does sanskrit-lexicon/hwnorm1#9 (comment) deserve to become 7, or is it too small?

gasyoun · 2018-03-22T19:07:11Z

@drdhaval2785 what is the link to the final paper in .pdf and text format?

drdhaval2785 · 2018-03-23T01:51:39Z

https://www.academia.edu/30717917/Normalizing_headwords_of_Cologne_digital_dictionaries

gasyoun · 2021-01-27T13:24:09Z

to alldicts in https://github.com/sanskrit-lexicon/hwnorm1/blob/master/hwnorm1.py at least LAN can be added, right @drdhaval2785 ?

funderburkjim mentioned this issue Dec 4, 2014

Sabdakalpadruma headword normalization dropped sanskrit-lexicon/COLOGNE#41

Closed

drdhaval2785 changed the title ~~'nt' at end of dictionaries - way to handle them~~ Different conventions of dictionaries - way to handle them Dec 5, 2014

gasyoun changed the title ~~Different conventions of dictionaries - way to handle them~~ Different conventions of Sanskrit dictionaries Dec 5, 2014

gasyoun added the help wanted label Dec 15, 2014

drdhaval2785 added the wontfix label Apr 16, 2015

gasyoun mentioned this issue Nov 20, 2015

Dictionaries with prefixed forms under the root headword #161

Open

This was referenced Nov 21, 2015

eleven odd members of AP90 #162

Closed

anusvAra convention violations in AP90 #163

Closed

drdhaval2785 added a commit to sanskrit-lexicon/hwnorm1 that referenced this issue Nov 21, 2015

Convention 1 raw files

44f02cb

sanskrit-lexicon/CORRECTIONS#43

drdhaval2785 mentioned this issue Dec 2, 2015

Todo list as of December 2015 #181

Open

drdhaval2785 added Documentation and removed help wanted wontfix labels Dec 2, 2015

gasyoun mentioned this issue Mar 17, 2017

fem. singular/plurals should be joined sanskrit-lexicon/hwnorm1#9

Open

gasyoun assigned drdhaval2785 May 31, 2017

gasyoun mentioned this issue Jul 26, 2017

simple-sanskrit search sanskrit-lexicon/COLOGNE#156

Open

gasyoun added the enhancement label Jan 27, 2021
