Different conventions of Sanskrit dictionaries #43

Open
drdhaval2785 opened this issue Dec 1, 2014 · 32 comments

Comments

@drdhaval2785
Contributor

This originates from discussion at #42 (comment) and @gasyoun's remark at point 4 of #42 (comment). Kept here, so that it is not lost in general discussion there.

@funderburkjim
We need to devise a reconciliation by which users of either convention land on the same data.
'nt' or 't' both should end up on the same page.

@funderburkjim
Contributor

This discussion is related to one at sanskrit-lexicon/COLOGNE#41.

I suggest that before time is spent devising a reconciliation of the user interfaces of the dictionaries,
we need to develop (by means of a list) a specific body of knowledge regarding the spelling peculiarities of the various dictionaries. COLOGNE issue 41 makes a modest proposal of how to begin gathering this knowledge.

When we have made explicit enough of these specifics, we could develop a simple web interface to sanhw1 to help refine the ideas.

@drdhaval2785
Contributor Author

@gasyoun said:

In the collection of the list there I propose four big groups.

Indian

European

Modern

Older / Bengal

If a dictionary was printed in India in 1850 it has a set of similar conventions. If it's 2004 and we look up the Sanskrit-Catalan dictionary, it uses the orthography and the ways of writing the dhatus that are now popular in Europe. It might be too general, but so be it. Not sure if the classification has enough practical meaning.

@drdhaval2785
Contributor Author

@gasyoun and @funderburkjim

I have seen some words in these dictionaries.
Gasyoun's approach seems to be dictionarywise.
I intend to approach this issue problemwise.

e.g.
There are broadly six places where dictionaries diverge as far as headwords are concerned. More can be added if need be.

1 Treatment of anusvAra
2 Duplication of letters after 'r'
3 Convention of writing words which have 't' at the end but get converted to 'n' in declension.
4 Whether the dictionary shows the uninflected form or the inflected form of a word.
5 Convention regarding anusvAra of verb
6 Convention regarding handling 'f' at the end of a word

We should group our dictionaries on basis of these parameters.
I will try to note the clusters for each of these groups of conventions.

@drdhaval2785 drdhaval2785 changed the title 'nt' at end of dictionaries - way to handle them Different conventions of dictionaries - way to handle them Dec 5, 2014
@drdhaval2785
Contributor Author

N.B. - I am using sanhw1.txt as the base. I am not sure about dictionaries that are not in it.

1 Treatment of anusvAra

Option 1 - Treat them as M when occurring inside a word (other than the cases where the first member of a compound ends with m). e.g. caMcalaM:AP90
AP90

Option 2 - Treat them as fifth letter of each varga when in between a word. e.g. caYcala
AP,BEN,BOP,BUR,CAE,CCS,MD,MW,MW72,PW,PWG,SCH,SHS,STC,VCP,WIL,YAT

Option 3 - Use M at the end of a word to denote neuter gender. e.g. aMSukaM
SKD,AP90,BHS,WIL,PW,PWG,VCP

Option 4 - Use M at the end of a word (not to denote neuter gender, but to denote avyayas mostly) where 'm' is supposed to be. e.g. anukAmaM, anudiSaM etc
YAT

Option 5 - Treat m as M when occurring in the cases where the first member of a compound ends with m. e.g. saMgIta (sam+gIta)
AP,AP90,CAE,CCS,MD,MW,PW,PWG,STC

Option 6 - Treat m as the fifth letter of a varga when occurring in the cases where the first member of a compound ends with m. e.g. saNgIta (sam+gIta)
BUR,MW72,SHS,VCP,WIL,YAT,SKD

@drdhaval2785
Contributor Author

@funderburkjim and @gasyoun Once we are OK with understanding this point 1, we move to next.

The coding implications for options 1 and 2 -
Since the preferred mode is to have the fifth letter of each varga (option 2), let's modify the display of AP90 to suit option 2, i.e. if a user enters caYcala in AP90, he should land on the caMcalaM data page.

The coding implications for option 3 and 4 -
The most preferred mode is to have 'm' at the end of a headword and not 'M'.
Let's change the frontend of the dictionaries in options 3 and 4 so that a person entering 'm' at the end lands on the page with 'M' at the end. e.g. aMSukam should land on the aMSukaM data page and anukAmam should land on the anukAmaM data page.
The second implication is that users usually don't enter inflected forms. They usually enter the base only, e.g. a user will most probably enter aMSuka and never aMSukam. We will have to see which words in the dictionaries of option 3 are neuter. We can then safely code their uninflected counterparts to land on the inflected headwords. e.g. If a user enters aMSuka in SKD, he should land on the aMSukaM page.

Coding implications of options 5 and 6 -
This is a difficult choice to make, but I will go for option 6 as the default, because it is virtually impossible to determine mechanically the difference between options 1/2 and 5/6, i.e. whether the M is the end of the first member of a compound or not. So we will have to stick to 6, which is the counterpart of 2.
Set dictionaries having convention 5 so that a user lands on them even with the fifth letter as input. e.g. If a person enters saNgIta in MW, he should land on the saMgIta data page in MW.

Hope I am clear in what I want in this section.
If some dictionaries are missing out, flag it up. We will try to locate the conventions in them.
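
A minimal sketch of the lookup idea above, assuming SLP1 input and assuming that both the stored headword and the user's input are reduced to one search key before comparison. The names `homorganic` and `search_key` are hypothetical and are not taken from any existing Cologne code.

```python
import re

# SLP1 stops of the five vargas, each mapped to the nasal of its own varga.
VARGA_NASAL = {}
for stops, nasal in (("kKgG", "N"), ("cCjJ", "Y"), ("wWqQ", "R"),
                     ("tTdD", "n"), ("pPbB", "m")):
    for stop in stops:
        VARGA_NASAL[stop] = nasal


def homorganic(word):
    """Replace M standing before a varga stop with that varga's nasal."""
    return re.sub(r"M([kKgGcCjJwWqQtTdDpPbB])",
                  lambda m: VARGA_NASAL[m.group(1)] + m.group(1), word)


def search_key(word):
    """Hypothetical key: options 2 and 6 inside the word, 'm' at the end."""
    key = homorganic(word)
    if key.endswith("M"):
        key = key[:-1] + "m"
    return key


print(search_key("saMgIta"), search_key("saNgIta"))
print(search_key("aMSukaM"), search_key("aMSukam"))
```

With such a key, saMgIta and saNgIta meet on one entry, and so do aMSukaM and aMSukam. Matching a bare stem such as aMSuka against aMSukaM is a separate step (point 4) and is deliberately left out here.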

@gasyoun
Member

gasyoun commented Dec 5, 2014

@drdhaval2785 the problemwise approach and classification is more than enough at this stage and is scientific. I understand and agree with №1. The thing is that there are mixed cases, like exceptions where AP90 is in MW mode and vice versa.
See https://groups.google.com/forum/#!topic/bvparishat/pnYp3GxBRvs and https://groups.google.com/forum/#!topic/bvparishat/Eyz0lSNDk-s and https://groups.google.com/forum/#!topic/bvparishat/IaCEYDmLmbI

I guess https://groups.google.com/forum/#!topic/bvparishat/tMSkglQYmfU is out of scope. So it's not only by rules, but exclusion and inclusion lists are wanted as well.

@gasyoun gasyoun changed the title Different conventions of dictionaries - way to handle them Different conventions of Sanskrit dictionaries Dec 5, 2014
@drdhaval2785
Contributor Author

@gasyoun

The thing is that there are mixed cases. Like exclusions when AP90 is in MW mode and vice versa

I agree. But first we have to agree on general conventions. Exceptions can come in when we check our headwords against the rules. To find exceptions we will need files which have fewer headword duplications.

@drdhaval2785
Contributor Author

2 Duplication of letters after 'r'

Option 1 - Duplication done
SKD,VCP,SHS,WIL,YAT,PD

Option 2 - Duplication not done
All the other dictionaries.
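
A small sketch of how the duplication could be undone for lookup, assuming SLP1 headwords. The function name and the sample spellings are illustrative only. It collapses a doubled consonant after 'r' (and after the vowel 'f'); a doubling written as unaspirated plus aspirated letter (such as 'rdD'), if it occurs, is not touched by this sketch.

```python
import re

# SLP1 consonants; vowels, anusvAra and visarga are deliberately excluded.
CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"

# 'r' or 'f', then the same consonant twice in a row.
DOUBLED = re.compile(r"([rf])([" + CONSONANTS + r"])\2")


def undouble_after_r(word):
    """Collapse 'rxx' to 'rx' and 'fxx' to 'fx'."""
    return DOUBLED.sub(r"\1\2", word)


for hw in ("karmma", "Darmma", "kartf"):
    print(hw, "->", undouble_after_r(hw))
```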

@drdhaval2785
Contributor Author

3 Convention of writing words which have 't' at the end but get converted to 'n' in declension.

Option 1 - Keep verb form as 'at'
SHS,WIL,GST,MW,MW72,PD,SCH,MD

Option 2 - Keep verb form as 'ant'
BEN,CAE,CCS,PW,PWG,STC,SCH,PD,MD,BHS

Some dictionaries seem to follow two conventions. Need a closer look.

Option 3 - Keep 'vat' / 'mat'
AP,AP90,BOP,BUR,GRA,GST,MD,MW,PD,SHS,VCP,WIL,YAT

Option 4 - Keep 'vant' / 'mant'
PW,PWG,SCH,STC,CAE,CCS,BEN,BHS
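
As a rough sketch, both spellings can be brought to one key by rewriting a final 'ant' as 'at', which also covers 'vant' / 'mant'. This assumes SLP1 headwords and assumes that no unrelated headword ends in 'ant'; the function name and the sample words are illustrative only.

```python
def at_key(word):
    """Map a headword ending in 'ant' to the 'at' spelling."""
    if word.endswith("ant"):
        return word[:-3] + "at"
    return word


for hw in ("Bagavant", "Bagavat", "DImant", "gacCant"):
    print(hw, "->", at_key(hw))
```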

@drdhaval2785
Contributor Author

4 Whether the dictionary shows the uninflected form or the inflected form of a word.

Option 1 Inflected form
AP,AP90,SKD,VCP

Option 2 Uninflected form
All the other dictionaries.

@drdhaval2785
Contributor Author

5 Convention regarding anusvAra of verb

Option 1 - As in dhAtupATha (stanBa)
SKD,VCP,PD

Option 2 - With removal of anubandhas and with conversion to fifth letter. (stamB)
AP,BEN,BOP,BUR,CAE,CCS,MD,MW,MW72,PW,PWG,SHS,STC

Option 3 - With removal of anubandha but without conversion to fifth letter - with anusvAra (staMB)
AP90

@drdhaval2785
Contributor Author

This sums up my bird's-eye view classification.
Now @gasyoun and @funderburkjim may like to add their comments and exceptions to this broad classification, or some other additions / alterations.

@drdhaval2785
Contributor Author

6 Convention regarding handling 'f' at the end of a word

This originates from #49 (comment), where the word prabandDar drew my attention to the tendency in PWG to write 'ar' instead of 'f' at the end of a word that is supposed to end in 'f'.
Same holds true for PW.

Option 1 - Uses 'ar' instead of 'f' at the end. (e.g. kartar )
CCS,PW,PWG,SCH

Option 2 - Uses 'f' at the end. (e.g. kartf)
AP,AP90,BEN,BOP,BUR,CAE,GRA,MD,MW,MW72,STC
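
A hedged sketch of a per-dictionary rule for this point. The dictionary set comes from option 1 above; the exclusion set is my own illustrative assumption (indeclinables that really end in 'ar'), and a real list would have to be collected from the data.

```python
# Dictionaries listed under option 1 ('ar' instead of 'f').
AR_DICTS = {"CCS", "PW", "PWG", "SCH"}

# Illustrative exclusions only: words whose final 'ar' is not an 'f' stem.
NOT_F_STEMS = {"punar", "antar", "prAtar"}


def f_key(word, dictionary):
    """Rewrite a final 'ar' as 'f' for the dictionaries of option 1."""
    if (dictionary in AR_DICTS and word.endswith("ar")
            and word not in NOT_F_STEMS):
        return word[:-2] + "f"
    return word


print(f_key("kartar", "PW"))
print(f_key("prabandDar", "PWG"))
print(f_key("punar", "PW"))
```

The exclusion list is the weak spot of this approach: the rule alone cannot tell an 'f' stem from an indeclinable, so it needs the inclusion and exclusion lists discussed earlier in the thread.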

@drdhaval2785
Contributor Author

@funderburkjim
High time we concentrate on this substantial issue of normalizing different dictionaries, rather than correcting individual headwords.

Reasons -
1 Once we normalize this, we will have around 60000 fewer headwords than at present.
2 Also the dictionaries can communicate with each other better. e.g. aMSu would be able to communicate with its aMSuH counterpart in another dictionary.
3 Till now we did not have proper documentation of the different conventions. Now we have some propositions, which I noted above. Based on these, corrections / additions / exception lists etc. can be worked out by computer.

Let's get started in this precarious but useful area.
What's your vote @gasyoun ?

I give it the highest priority.

@gasyoun
Member

gasyoun commented Dec 16, 2014

I guess out of top 10 tasks

  • normalizing different dictionaries
  • correcting individual headwords

both are the most important ones. Because normalizing is a rather big task, I would vote for finishing the first round of dictionary cleanup first.

@drdhaval2785
Contributor Author

Considering Marcis's feedback - I guess we will have to do both side by side.

@funderburkjim
Contributor

How would you handle this case (data from sanhw1.txt):

aMS:AP,AP90,BEN,BOP,BUR,GST,MW,MW72,PD
aMSa:BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MD,MW,MW72,PD,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT
aMSaH:AP,AP90,SKD

In WIL, there are two entries for aMSa, one a verb and one a noun:

[L=5] [p= 001]  .aMSa¦ r. 10th cl. (aMSayati) To separate or divide. See aMsa.
[L=6]   .aMSa¦ m. (-SaH)
1 A share or portion.
2 A part.
3 A shoulder, the shoulder blade.
4 (In arithmetic) a fraction.
5 The numerator of a fraction.
6 A degree of latitude or longitude, &c. See aMsa.
E. aMSa to divide, ac affix.

L=5 aMSa (WIL) <--> aMS in MW
L=6 aMSa (WIL) <--> aMSa in MW & aMSaH in AP.

Maybe as a first approximation, we could say all these are 'equal'.
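
To make the case concrete, here is a sketch that groups the three sanhw1.txt lines quoted above under a normalized key. The line format (`headword:DICT,DICT,...`) is taken from the sample; `strip_visarga` is only a stand-in normalizer and `group_headwords` is a hypothetical name.

```python
from collections import defaultdict

SAMPLE = """\
aMS:AP,AP90,BEN,BOP,BUR,GST,MW,MW72,PD
aMSa:BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MD,MW,MW72,PD,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT
aMSaH:AP,AP90,SKD
"""


def strip_visarga(headword):
    """Stand-in normalizer: final 'aH' becomes 'a'."""
    return headword[:-1] if headword.endswith("aH") else headword


def group_headwords(lines, normalize):
    """Map each normalized key to {original headword: [dictionary codes]}."""
    groups = defaultdict(dict)
    for line in lines:
        headword, codes = line.strip().split(":", 1)
        groups[normalize(headword)][headword] = codes.split(",")
    return groups


groups = group_headwords(SAMPLE.splitlines(), strip_visarga)
for key, members in groups.items():
    print(key, sorted(members))
```

Under this stand-in rule aMSa and aMSaH fall together while aMS stays apart, so treating all three as 'equal' would need an extra rule for roots.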

@gasyoun
Member

gasyoun commented Dec 30, 2014

If we would have a special markup for roots - including WIL, that would be a good starting point. Near "duplicate" entries are not as bad as they seem - not more than 10k in each dictionary and half of them taggable semi-automatically, I guess. I've had a few battles with them.

@funderburkjim
Contributor

Here is a simple application that may help us progress in thinking about headword normalization.

Have I left out any normalizations ?

@drdhaval2785
Contributor Author

How would you handle this case (data from sanhw1.txt):
(aMS / aMSa case)

My approach would be to cull out some information from the description and ascertain whether aMSa is a verb or a noun.
If it is a verb - aMSa should be normalised with aMS
If it is a noun - keep it as it is.
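
A rough sketch of that check against the two WIL lines quoted earlier. It assumes, from that sample only, that the headword sits between '.' and '¦' and that a root entry begins with 'r.'; whether this holds across the whole of WIL would need to be verified. The function name is hypothetical.

```python
import re

# Headword between '.' and the broken bar, then the start of the entry body.
ENTRY = re.compile(r"\.(?P<hw>[^¦]+)¦\s*(?P<body>.*)")


def classify(entry_line):
    """Return (headword, kind, normalized) or None if the line has no headword."""
    match = ENTRY.search(entry_line)
    if match is None:
        return None
    headword, body = match.group("hw"), match.group("body")
    if body.startswith("r."):
        # Root entry: drop the final 'a' so that WIL aMSa meets aMS.
        normalized = headword[:-1] if headword.endswith("a") else headword
        return headword, "verb", normalized
    return headword, "other", headword


print(classify("[L=5] [p= 001]  .aMSa¦ r. 10th cl. (aMSayati) To separate or divide. See aMsa."))
print(classify("[L=6]   .aMSa¦ m. (-SaH)"))
```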

@drdhaval2785
Contributor Author

@funderburkjim
normalization rules seem to be oversimplified.
Can we go by the conventions enumerated in this thread one by one and test for positive and negative lists?

1 Treatment of anusvAra
2 Duplication of letters after 'r'
3 Convention of writing words which have 't' at the end but get converted to 'n' in declension.
4 Whether the dictionary shows the uninflected form or the inflected form of a word.
5 Convention regarding anusvAra of verb
6 Convention regarding handling 'f' at the end of a word

@gasyoun
Member

gasyoun commented Jan 3, 2015

Which rules seem too simple?

These rules are independent of the dictionary.

Use homorganic nasal rather than anusvara
normalize so that 'rxx' is 'rx' (similarly, fxx is fx)
ending 'aM' is 'a'
ending 'aH' is 'a'
ending 'uH' is 'u'
ending 'iH' is 'i'
'ttr' is 'tr' (pattra v. patra)
ending 'ant' is 'at'
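
For testing against positive and negative lists, the rules listed above can be written as an ordered table of regular expressions. This is only a sketch in SLP1 of the rules as quoted here, not the code of the application itself; the rule order and the names are my assumptions.

```python
import re

CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"

# (pattern, replacement) pairs, applied in this order.
RULES = [
    (r"M(?=[kKgG])", "N"),      # homorganic nasal instead of anusvara
    (r"M(?=[cCjJ])", "Y"),
    (r"M(?=[wWqQ])", "R"),
    (r"M(?=[tTdD])", "n"),
    (r"M(?=[pPbB])", "m"),
    (r"([rf])([" + CONSONANTS + r"])\2", r"\1\2"),  # rxx -> rx, fxx -> fx
    (r"a[MH]$", "a"),           # ending aM / aH -> a
    (r"uH$", "u"),              # ending uH -> u
    (r"iH$", "i"),              # ending iH -> i
    (r"ttr", "tr"),             # pattra v. patra
    (r"ant$", "at"),            # ending ant -> at
]
COMPILED = [(re.compile(pattern), repl) for pattern, repl in RULES]


def normalize(headword):
    """Apply every rule in turn, independent of the dictionary."""
    for pattern, repl in COMPILED:
        headword = pattern.sub(repl, headword)
    return headword


for hw in ("caMcalaM", "aMSaH", "pattra", "Bagavant", "saMgIta"):
    print(hw, "->", normalize(hw))
```

Keeping the rules in a table makes it easy to switch single rules off per dictionary, or to run each one against a list of known exceptions.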

@gasyoun
Member

gasyoun commented Oct 25, 2015

@drdhaval2785 what about
pitR - MW
pitA - SKD
What would you call it? Would it not deserve to be 8th?
An SKD user (a lover of the printed book and the scan) said he will not use the OCR version because he can't find pitR in SKD, and asked whether there is another way to write "father". I opened the scan and, although this might not be a very common issue, it still is one. The orthographic conventions are very important and in most cases concern word endings.

@funderburkjim
Contributor

pitA is nominative singular of pitR (HK transliteration)

@gasyoun
Member

gasyoun commented Nov 18, 2015

@drdhaval2785 I do not insist. If problemwise seems a better solution, Dhaval, let's go for it.
But anyway

ending 'aM' is 'a'
ending 'aH' is 'a'
ending 'uH' is 'u'
ending 'iH' is 'i'

Should be 7th (Option 7), the most important step. In it I would include the killing of Anusvara at the end ['aM' is 'a']. 'rxx' is 'rx' will bring far less. Killing of H and M at end - a treasure chest it's time to open.

@funderburkjim
Contributor

In updating the CORRECTIONS repository today, I noticed the file '61267-Sanskrit-Catalan-Words-List.txt',
dated 9/6/2016.

Wondered who uploaded this, and what is the source?

@drdhaval2785
Contributor Author

drdhaval2785 commented Dec 9, 2016 via email

@gasyoun
Member

gasyoun commented Dec 9, 2016

Wondered who uploaded this, and what is the source?

http://www.llibres.cat/llibres-de-filologia-i-linguistica/231453-sanskrit-catalan-dictionary.html

@gasyoun
Member

gasyoun commented Mar 17, 2017

@drdhaval2785 does sanskrit-lexicon/hwnorm1#9 (comment) deserve to become 7, or is it too small?

@gasyoun
Member

gasyoun commented Mar 22, 2018

@drdhaval2785 what is the link to the final paper in .pdf and text format?

@drdhaval2785
Contributor Author

@gasyoun
Member

gasyoun commented Jan 27, 2021

At least LAN can be added to alldicts in https://github.com/sanskrit-lexicon/hwnorm1/blob/master/hwnorm1.py, right @drdhaval2785 ?
