Different conventions of Sanskrit dictionaries #43
This discussion is related to one at sanskrit-lexicon/COLOGNE#41. I suggest that before time is spent devising a reconciliation of the user interfaces of the dictionaries, When we have made explicit enough of these specifics, we could develop a simple web interface to sanhw1 to help refine the ideas. |
@gasyoun said: In the collection of the list there I propose four big groups.
If a dictionary was printed in India in 1850, it shares a set of similar conventions. If it is 2004 and we look up the Sanskrit-Catalan dictionary, it uses the orthography and the way of writing dhatus that are popular in Europe nowadays. This might be too general, but so be it. I am not sure the classification has enough practical meaning. |
I have seen some words in these dictionaries. e.g.
We should group our dictionaries on the basis of these parameters. |
N.B. - I am using sanhw1.txt as the base. I am not sure about dictionaries that are not in it.
Option 1 - Treat them as M when occurring in the middle of a word (other than the cases where the first member of a compound ends with m). e.g.
Option 2 - Treat them as the fifth letter of each varga when in the middle of a word. e.g.
Option 3 - Use M at the end of a word to denote neuter gender. e.g.
Option 4 - Use M at the end of a word (not to denote neuter gender, but mostly to denote avyayas) where 'm' is supposed to be. e.g.
Option 5 - Treat m as M in the cases where the first member of a compound ends with m. e.g.
Option 6 - Treat m as the fifth letter of a varga in the cases where the first member of a compound ends with m. e.g. |
@funderburkjim and @gasyoun Once we are OK with understanding this point 1, we can move to the next. The coding implications for options 1 and 2 - The coding implications for options 3 and 4 - The coding implications for options 5 and 6 - I hope I am clear about what I want in this section. |
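As one reading of the coding implications for Options 1 and 2, here is a minimal Python sketch that converts between the two spellings in SLP1. The function names `anusvara_to_nasal` and `nasal_to_anusvara` are made up for illustration and are not taken from any repository in this project. The sketch assumes SLP1 input and ignores compound boundaries (Options 5 and 6), which would need extra information about where the first member ends.

```python
# Map each SLP1 stop to the fifth letter (class nasal) of its varga.
VARGA_NASAL = {}
for stops, nasal in (("kKgG", "N"), ("cCjJ", "Y"), ("wWqQ", "R"),
                     ("tTdD", "n"), ("pPbB", "m")):
    for stop in stops:
        VARGA_NASAL[stop] = nasal


def anusvara_to_nasal(word):
    """Option 2 spelling: replace M before a stop with that stop's class nasal."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "M" and nxt in VARGA_NASAL:
            out.append(VARGA_NASAL[nxt])
        else:
            out.append(ch)
    return "".join(out)


def nasal_to_anusvara(word):
    """Option 1 spelling: replace a class nasal before a stop of its own varga with M."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt in VARGA_NASAL and ch == VARGA_NASAL[nxt]:
            out.append("M")
        else:
            out.append(ch)
    return "".join(out)
```

With these definitions, `anusvara_to_nasal("staMB")` gives `stamB` and `nasal_to_anusvara("stamB")` gives `staMB`. A word such as `aMSa` is left alone in both directions, because the letter after M is not a stop. Either function could serve as a lookup key; which direction to prefer is a design choice this sketch does not settle.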
@drdhaval2785 the problem-wise approach and classification is more than enough at this stage and is scientific. I understand and agree with №1. The thing is that there are mixed cases, like exceptions when AP90 is in MW mode and vice versa. I guess https://groups.google.com/forum/#!topic/bvparishat/tMSkglQYmfU is out of scope. So it's not only rules; exclusion and inclusion lists are wanted as well. |
I agree. But first we have to agree on general conventions. Exceptions can come in when we correct our headwords against the rules. To find exceptions we will need files that have fewer headword duplications. |
Option 1 - Duplication done Option 2 - Duplication not done |
Option 1 - Keep verb form as 'at'
Option 2 - Keep verb form as 'ant'
Some dictionaries seem to follow two conventions. Need a closer look.
Option 4 - Keep 'vat' / 'mat'
Option - Keep 'vant' / 'mant' |
Option 1 Inflected form Option 2 Uninflected form |
Option 1 - As in dhAtupATha (stanBa)
Option 2 - With removal of anubandhas and with conversion to the fifth letter (stamB)
Option 3 - With removal of anubandhas but without conversion to the fifth letter, i.e. with anusvAra (staMB) |
This sums up my bird's-eye view classification. |
This originates from #49 (comment), where the word prabandDar drew my attention to the tendency in PWG to write 'ar' instead of 'f' at the end of a word that would otherwise end in 'f'.
Option 1 - Uses 'ar' instead of 'f' at the end (e.g. kartar)
Option 2 - Uses 'f' at the end (e.g. kartf) |
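A small Python sketch of how the 'ar' / 'f' difference could be folded into one spelling. The function name `ar_to_f` and the `KEEP_AR` list are assumptions made for illustration, not part of any existing code here. The exclusion list reflects the point raised earlier in this thread that rules alone are not enough: indeclinables such as antar, punar and prAtar really do end in 'ar' and should not be rewritten. The three entries are only a seed, and a real list would have to be collected from the data.

```python
# Hypothetical exclusion list: headwords that genuinely end in 'ar' (SLP1).
KEEP_AR = frozenset({"antar", "punar", "prAtar"})


def ar_to_f(headword, keep=KEEP_AR):
    """Rewrite a final 'ar' (Option 1) as 'f' (Option 2), unless excluded."""
    if headword.endswith("ar") and headword not in keep:
        return headword[:-2] + "f"
    return headword
```

Under these assumptions `ar_to_f("kartar")` returns `kartf`, `ar_to_f("prabandDar")` returns `prabandDf`, and `ar_to_f("punar")` is returned unchanged.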
@funderburkjim Reasons - Let's get started in this precarious but useful area. I give it the highest priority. |
I guess out of top 10 tasks
|
Considering Marcis's feedback - I guess we will have to do both side by side. |
How would you handle this case (data from sanhw1.txt):
In WIL, there are two entries for aMSa, one a verb and one a noun:
L=5 aMSa (WIL) <--> aMS in MW Maybe as a first approximation, we could say all these are 'equal'. |
If we would have a special markup for roots - including WIL, that would be a good starting point. Near "duplicate" entries are not as bad as they seem - not more than 10k in each dictionary and half of them taggable semi-automatically, I guess. I've had a few battles with them. |
Here is a simple application that may help us progress in thinking about headword normalization. Have I left out any normalizations? |
My approach would be to cull out some information from the description and ascertain whether aMSa is a verb or a noun. |
@funderburkjim
|
Which rules seem too simple?
|
@drdhaval2785 what about |
pitA is nominative singular of pitR (HK transliteration) |
@drdhaval2785 I do not insist. If
Should be 7th (Option 7), the most important step. In it I would include the killing of Anusvara at the end ['aM' is 'a']. 'rxx' is 'rx' will bring far less. Killing of H and M at end - a treasure chest it's time to open. |
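The two rules mentioned here can be sketched in Python as follows. This is one possible reading of "'aM' is 'a'" and "'rxx' is 'rx'", assuming SLP1 headwords. The names `drop_final_hm` and `undouble_after_r` are hypothetical, and the consonant set is my own listing rather than something taken from hwnorm1.

```python
import re

# SLP1 consonants that may appear doubled after 'r' ('r' itself is left out).
_CONSONANTS = "kKgGNcCjJYwWqQRtTdDnpPbBmylvSzsh"
_DOUBLED_AFTER_R = re.compile(r"r([" + _CONSONANTS + r"])\1")


def drop_final_hm(word):
    """Remove a final visarga (H) or anusvara (M), e.g. 'aM' becomes 'a'."""
    if len(word) > 1 and word[-1] in "HM":
        return word[:-1]
    return word


def undouble_after_r(word):
    """Reduce a doubled consonant after 'r' to a single one ('rxx' becomes 'rx')."""
    return _DOUBLED_AFTER_R.sub(r"r\1", word)
```

For example, `drop_final_hm("PalaM")` gives `Pala` and `undouble_after_r("Darmma")` gives `Darma`. Dropping a final M would also affect avyayas written with M, so this rule probably needs an exception list as well.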
In updating the CORRECTIONS repository today, I noticed the file '61267-Sanskrit-Catalan-Words-List.txt'. I wondered who uploaded this, and what the source is. |
Gasyoun uploaded it.
|
http://www.llibres.cat/llibres-de-filologia-i-linguistica/231453-sanskrit-catalan-dictionary.html |
@drdhaval2785 does sanskrit-lexicon/hwnorm1#9 (comment) deserve to become 7, or is it too small? |
@drdhaval2785 what is the link to the final paper in .pdf and text format? |
to |
This originates from discussion at #42 (comment) and @gasyoun's remark at point 4 of #42 (comment). Kept here, so that it is not lost in general discussion there.
@funderburkjim
We need to devise a reconciliation by which users of both conventions land on the same data.
'nt' or 't' should both end up on the same page.
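One way to picture "landing on the same page" is an index keyed by a normalized spelling, so that a query in either convention reaches the same list of entries. The Python sketch below is only an illustration: `nt_key`, `build_index` and `lookup` are made-up names, the dictionary codes `DICT_A` and `DICT_B` are placeholders, and the sample rows are invented, not taken from sanhw1.txt.

```python
from collections import defaultdict


def nt_key(headword):
    """Hypothetical key: fold a final 'nt' into 't'."""
    if headword.endswith("nt"):
        return headword[:-2] + "t"
    return headword


def build_index(entries, key=nt_key):
    """Group (dictionary, headword) pairs under their normalized key."""
    index = defaultdict(list)
    for dict_code, headword in entries:
        index[key(headword)].append((dict_code, headword))
    return dict(index)


def lookup(index, query, key=nt_key):
    """Normalize the query the same way, then return every matching entry."""
    return index.get(key(query), [])


# Invented sample rows for illustration only.
entries = [("DICT_A", "Bagavat"), ("DICT_B", "Bagavant"), ("DICT_A", "deva")]
index = build_index(entries)
```

Here `lookup(index, "Bagavant")` and `lookup(index, "Bagavat")` return the same two entries, each still carrying its original spelling. Keeping the original spelling beside the key means the display can stay faithful to each dictionary while the search is convention-neutral.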