AS to IAST / SLP1 for all dicts #110

Closed
drdhaval2785 opened this issue Apr 4, 2017 · 22 comments

@drdhaval2785
Contributor

I am for SLP1.
But if conversion or identification is difficult, IAST will also do.

For all dict.xml files, please. This will bring uniformity to all the XMLs and reduce the need for separate display tools for different dictionaries.

@funderburkjim
Contributor

funderburkjim commented Apr 5, 2017

The letter-number system (AS) is used only as a way to represent printed text that is composed of Latin alphabet with diacritics. It is language agnostic -- can represent Sanskrit words, French words, German words, etc.

When representing Sanskrit words, texts differ in how they use Latin letters with diacritics.

It is reasonable to have the primary form of the dictionary use modern IAST conventions as the way to represent Sanskrit words written in the Latin alphabet with diacritics. That's the principle I'm using with these dictionaries: IAST is better than AS, and modern IAST is better than historical IAST.

In the Monier-Williams dictionary, I took an approach closer to what you are suggesting, I think; namely, I coded all the Sanskrit words appearing in Latin alphabet in two forms: one retaining the original Latin-alphabet representation and one converting that to SLP1.

  • NOTE1: The Latin-alphabet form is actually not used in the displays, even though it is present in the mw.xml file. Thus, all the Latin-alphabet words display as Devanagari (when that is the output chosen). I think this is what you are thinking might be preferable for other dictionaries.
  • NOTE2: The Latin-alphabet form is still shown in AS coding in MW. Alas. This should also be changed to IAST sometime.

Thus it would be possible to produce a version of other dictionaries (say MD, ACC) where the Sanskrit words are represented in SLP1 in addition to (or instead of) IAST. However, this is also a big task. The hard parts are:

  • identifying Sanskrit words, in particular those that have no diacritics in the Latin letters: yoga, guru, etc.
  • dealing with words that have partially come into the base language, such as the plural 'Yogas'.

I don't plan to undertake this IAST->SLP1 conversion of Sanskrit words any time soon. Conversion to IAST is a big enough task for me now.

But if you want to tackle this for a particular dictionary, I'll work with you if you like.
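
Once a word is known to be Sanskrit, the IAST->SLP1 step itself is mechanical. The following is a minimal sketch, not code from this project: the function name `iast_to_slp1` is hypothetical, and the table is the standard IAST/SLP1 correspondence. It lowercases its input (SLP1 uses letter case to distinguish sounds) and takes the longest match first, so it will misread a genuine vowel hiatus such as a + i as the diphthong ai.

```python
import unicodedata

# Standard IAST -> SLP1 correspondences for letters that differ.
# Letters absent from this table (a, i, u, e, o, k, g, ...) are the same in both.
_RAW_TABLE = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U",
    "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ṃ": "M", "ḥ": "H",
    "ṅ": "N", "ñ": "Y", "ṭ": "w", "ḍ": "q", "ṇ": "R",
    "ś": "S", "ṣ": "z",
}

# Normalize keys so that precomposed and combining forms compare equal.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _RAW_TABLE.items()}


def iast_to_slp1(word: str) -> str:
    """Convert one IAST word to SLP1 (hypothetical helper, longest match first)."""
    text = unicodedata.normalize("NFC", word.lower())
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in IAST_TO_SLP1:
            # Two-letter unit: aspirate or diphthong.
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:
            # Single letter: mapped if it differs, otherwise copied unchanged.
            ch = text[i]
            out.append(IAST_TO_SLP1.get(ch, ch))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for w in ("saṃskṛta", "Kṛṣṇa", "dharma", "yoga"):
        print(w, "->", iast_to_slp1(w))
```

The conversion is the easy part; deciding which words to feed into it is the classification problem described above.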

@gasyoun
Member

gasyoun commented Apr 5, 2017

Conversion to IAST is a big enough task for me now.

Too big and no actual need. Time will come.

@funderburkjim
Contributor

funderburkjim commented Apr 6, 2017

09/02/2017: This table is retired in favor of the one at #177.

Here is the status of AS/IAST conversion for the various dictionaries:

acc   DONE approx. 05/20/2017    Also meta-line conversion
ae    done?    Not much to do. ae-meta not updated.
ap90  done, with meta-line conversion. 06/30/2017
ap    IAST conversion. Done in April.  meta-line conversion 07/11/2017
ben   done  04/10/2017
bhs   todo
bop   todo
bor   todo
bur   IAST  April 2017. IAST corrections, and conversion to meta-line form 07/28/2017
cae   todo
ccs   todo
gra   todo
gst   todo
ieg   todo
inm   todo
krm   todo  not much IAST
mci   todo
md    done  03/27/2017 .  meta-line conversion 07/07/2017
mw72  DONE- Converted, still some non-standard IAST
mwe   todo  not much IAST
mw    todo
pd    todo
pe    todo 
pgn   todo
pui   todo
pwg   todo also line-length adjustment
pw    todo  also line-length adjustment
sch   DONE  04/27/2017 meta-line Done 06/22/2017
shs   DONE  08/06/2017  meta-line Done 08/08/2017
skd   no IAST conversion required.  meta-line Done 08/24/2017
snp   todo 
stc   todo 
vcp   DONE  no IAST conversion required, all Devanagari;  meta-line-conversion done 08/16/2017
vei   todo 
wil   DONE 06/20/2017. Also meta-line conversion
yat   DONE  05/31/2017. Also meta-line conversion.

@funderburkjim
Contributor

A daunting amount of work to do, but it seems worthwhile to convert all the AS coding:

  • for Sanskrit words, to modern IAST Unicode
  • for non-Sanskrit words, to Unicode.

I'll fill in the table above as progress is made.
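
As a rough illustration of what converting AS letter-number coding to Unicode involves, here is a minimal sketch. The digit-to-diacritic assignments in `HYPOTHETICAL_AS_DIGITS` are an assumption made for this example and are not taken from the project's actual AS table; the function name `as_to_unicode` is likewise hypothetical. The sketch attaches a combining mark to the preceding letter and lets NFC normalization produce the precomposed character, which keeps it language agnostic.

```python
import re
import unicodedata

# ASSUMPTION: illustrative digit codes only, NOT the project's real AS table.
# Substitute the actual table before using this on dictionary data.
HYPOTHETICAL_AS_DIGITS = {
    "1": "\u0304",  # combining macron
    "2": "\u0323",  # combining dot below
    "3": "\u0307",  # combining dot above
    "4": "\u0301",  # combining acute accent
    "5": "\u0303",  # combining tilde
}

# A Latin letter followed by exactly one digit (not by a longer number).
_AS_PAIR = re.compile(r"([A-Za-z])([0-9])(?![0-9])")


def as_to_unicode(text, digits=HYPOTHETICAL_AS_DIGITS):
    """Replace letter+digit codes with the letter plus a diacritic."""

    def replace(match):
        letter, digit = match.group(1), match.group(2)
        mark = digits.get(digit)
        if mark is None:
            # Unknown code: leave the text exactly as it was.
            return match.group(0)
        return letter + mark

    # NFC turns letter + combining mark into the precomposed character.
    return unicodedata.normalize("NFC", _AS_PAIR.sub(replace, text))


if __name__ == "__main__":
    print(as_to_unicode("Kr2s2n2a"))
    print(as_to_unicode("gi1ta1"))
```

A pattern this broad will also touch ordinary letter-digit sequences in running text, so real use would need to restrict it to the spans known to be AS coded.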

@gasyoun
Member

gasyoun commented Apr 6, 2017

A daunting amount of work to do

Yeah, it's a hunt.

Most wanted:

sch   todo
gra   todo
mw    todo
pd    todo
pwg   todo also line-length adjustment
pw    todo 

@funderburkjim
Contributor

It is a dubious honor to be the only assignee.

@gasyoun
Member

gasyoun commented Apr 19, 2017

Yeah, let's give @drdhaval2785 or @vvasuki a try 👍

@vvasuki
Collaborator

vvasuki commented Apr 19, 2017

A separate question: Are all Sanskrit words clearly identified in the XMLs?

Yeah, let's give @drdhaval2785 or @vvasuki a try 👍

(I must decline, @gasyoun - far too occupied by separate Sanskrit projects.)

@gasyoun
Member

gasyoun commented Apr 19, 2017

A separate question: Are all Sanskrit words clearly identified in the XMLs?

No, and they will not be unless you devise a regex and do manual cleanup afterwards.

@vvasuki
Collaborator

vvasuki commented Apr 19, 2017

A separate question: Are all Sanskrit words clearly identified in the XMLs?

No, and they will not be unless you devise a regex and do manual cleanup afterwards.

@funderburkjim At the very least, the words you do convert to IAST or SLP1 should be marked up (identifiable as Sanskrit words). Reason: downstream users would want to see those words in the script of their choice for easy reading and lookup.

@funderburkjim
Contributor

at the very least.

Easier said than done. It would be good to have all the IAST sanskrit words identified as such, which is your suggestion. I'll put this in my todo list.

Words which appear in Devanagari in a printed text are generally identifiable (with an <s> tag, coded as SLP1).

Sanskrit words which appear in the text as IAST are the problem. The problem is distinguishing such words as Sanskrit, rather than words in some other language. For instance, consider the words gam, guru, etc. These have no diacritics, so how do we know they are Sanskrit and not English, French, German, Latin, etc.? What about Sanskrit words adopted into English, which may appear in plurals such as 'yogas', 'karmas', etc.?

This IAST word classification has been done only for MW.
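
The easy half of the problem, words that do carry IAST diacritics, can be approached with a simple scan. This is a sketch under assumptions, not the method used for MW: the function name `iast_candidates` and the choice of marker letters are hypothetical, and several of these letters also occur in other languages (ñ in Spanish, macron vowels in Latin), so the output is a candidate list for manual review rather than a classification.

```python
import re
import unicodedata

# Letters with diacritics that are typical of IAST (assumed marker set).
IAST_MARKED = frozenset(unicodedata.normalize("NFC", "āīūṛṝḷḹṃḥṅñṭḍṇśṣ"))

# A run of Unicode letters (no digits, no underscore).
_WORD = re.compile(r"[^\W\d_]+")


def iast_candidates(text):
    """Return the words in `text` that contain at least one IAST-marked letter."""
    # NFC first, so letter + combining mark becomes a single letter.
    text = unicodedata.normalize("NFC", text)
    return [
        word
        for word in _WORD.findall(text)
        if any(ch in IAST_MARKED for ch in word.lower())
    ]


if __name__ == "__main__":
    sample = "The guru taught Kṛṣṇa the śāstra of yoga."
    print(iast_candidates(sample))
```

Note that 'guru' and 'yoga' in the sample are not reported, which is exactly the gap described above: diacritic-free Sanskrit words need some other signal.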

@gasyoun
Member

gasyoun commented Apr 19, 2017

good to have all the IAST sanskrit words identified as such, which is your suggestion. I'll put this in my todo list.

Oh, right, that's feasible.

@drdhaval2785
Contributor Author

Yeah, let's give @drdhaval2785 or @vvasuki a try 👍

I will try it. Will post here if I do something substantial in this direction.

@vvasuki
Collaborator

vvasuki commented Apr 20, 2017

For instance, consider the words gam, guru, etc.
These have no diacritics, so how do we know they are Sanskrit and not English or French or German, Latin, etc?

And there we have the best sanskrit dictionaries to make the distinction :-) .

Even if they're not covered (now or in the near future), it's still OK to mark off just the ones we do end up converting to IAST / SLP1 as Indic - it will help the final user downstream to that extent.

What about Sanskrit words adopted into English, which may appear in plurals such as 'yogas', 'karmas', etc.?

I'd say check the stem (the word minus the terminal s) and mark it as an Indic word. But that's not so important, as these are few.
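
The two suggestions in this comment (look the word up among dictionary headwords, and retry without a terminal s) can be sketched as follows. The function name `looks_sanskrit` and the tiny headword set are hypothetical; a real run would load the headwords from one of the dictionaries. Words that happen to be spelled the same in English and Sanskrit will produce false positives, so the result still needs manual review.

```python
def looks_sanskrit(token, headwords):
    """Guess whether a diacritic-free token is a Sanskrit word.

    `headwords` is a set of lowercase headwords in Latin transliteration
    (hypothetical input; not an existing project data structure).
    """
    word = token.lower()
    if word in headwords:
        return True
    # Anglicized plural: 'yogas' -> 'yoga', 'karmas' -> 'karma'.
    if len(word) > 2 and word.endswith("s") and word[:-1] in headwords:
        return True
    return False


if __name__ == "__main__":
    # Tiny illustrative sample, not real dictionary data.
    sample_headwords = {"yoga", "guru", "karma", "gam", "manas"}
    for token in ("Yogas", "karmas", "guru", "manas", "house"):
        print(token, looks_sanskrit(token, sample_headwords))
```

The exact lookup is tried before the plural rule so that stems genuinely ending in s, such as 'manas', are matched as they stand.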

@gasyoun
Member

gasyoun commented Apr 20, 2017

But that's not so important as these are few.

Hundreds, and regexing alone will not help, since more variants occur. And as it's not a priority for Jim now, let's leave it. There should be an Indian interested in weeding out the words. All we have tried to do for the last 3 years, @vvasuki, was at least to clean the headwords. In 2-3 years we will be there. The amount of work done is about 1/10 of what has to be done inside the dictionaries. But as I greatly admire what Jim does, I would not want him to do what others can do, only what he is best at. That's my take and I'll stand by it. Headwords first. Additional markup: let India wake up and say when she is ready for some Sanskrit NLP. It's just about time. Otherwise the best research on Sanskrit for the last 200 years is done outside India.

@vvasuki
Collaborator

vvasuki commented Apr 20, 2017

Otherwise best research on Sanskrit for last 200 years is done outside India.

You certainly know how to push the right buttons! But I'll let that pass considering the source, time and place :-)

And as it's of no priority for Jim now, let's leave it.

Of course, it's up for Jim to decide (and I'm not insisting) - you've made your opinion clear.

@gasyoun
Member

gasyoun commented Apr 20, 2017

you've made your opinion clear.

In 2006 there was nothing. Not even PWG was online. In 10 years a lot has changed. But it's sad to see that the role of people from India (other than Dhaval) is so small. What I see is that a single person can do as much as an academic institute. It's a pity to see such conditions.

@vvasuki
Collaborator

vvasuki commented Apr 20, 2017

But it's sad to see that the role of people from India (other than Dhaval) is so small.

There are proper times, places and forms to express such sadness and examine the causes. This certainly isn't one of them. What's the relevance of these notes to the task at hand? You are not going to "guilt" or irritate Indians into changing their priorities and jumping in by noting such things here (of all places). In any case, do note that all these dictionaries were manually typed in the first place by Indians paid by Europeans.

image

What I see is that a single person can do as much as an academic institute. It's a pity to see such conditions.

Again, this is neither relevant to the current issue nor helpful. Which academic institute will change because of your comment here? Or can we look for some engineering insight hidden in this unlikely source? I am all for praising Dhaval, but I think he would take Manu quite seriously: "सम्मानाद्ब्राह्मणो नित्य- मुद्विजेत विषादिव । अमृतस्येव चाकाङ्क्षे- दवमानस्य सर्वदा ।।"

@vvasuki
Collaborator

vvasuki commented Apr 20, 2017

Seriously, if you want to increase Indian participation, write an email to sanskrit-programmers linking to the various issues in these projects where they can contribute, and invite contributions from Python programmers (without absurdly insulting Indian scholarship along the way). That's far more likely to be productive.

@gasyoun
Member

gasyoun commented Apr 20, 2017

email to sanskrit-programmers

I've not seen anything big enough, some small projects and that's it. People code as a hobby and only what they like. These tasks are bigger than just hobby. Can you document what you see, can I ask you for a favor? You know the coders, I do not.
If not for Thomas, there would be nobody who typed. And I know the picture as well as you do.

@vvasuki
Collaborator

vvasuki commented Apr 20, 2017

I've not seen anything big enough, some small projects and that's it. People code as a hobby and only what they like.

And people such as those here do not? And people there would not like contributing here? People should indeed do what they truly think is important and enjoyable for themselves. Culture need not be advanced by the miserable. It is quite arrogant to think that others ought to share your priorities (it smacks of an extension of the classic "white man's burden").

These tasks are bigger than just hobby.

There are tasks that are bigger than just hobby, but it is false that hobbyists cannot make significant contributions here.

Can you document what you see, can I ask you for a favor? You know the coders, I do not.

No - I don't know the active coders - same as you. Just shrIvatsa, who is likely to pass. If you're too busy to write the email, I can, of course.

If not for Thomas, there would be nobody who typed.

If it were not for such Indians, there would be nobody who typed as well.

And I know the picture as well as you do.

I know that you know, and you know that I know that you know. Just bringing the picture into "the picture" and clearing selective amnesia.

@funderburkjim
Contributor

Will use table at #177 and retire above table.


4 participants