Coding of Sanskrit words in MW - #55
Comments
The comment prompting this issue
|
Please, Jim. There is a book on MW we plan to print in Russian in upcoming 6 months with @SergeA so it might be printed as an addition to https://archive.org/details/ApracticalSI |
Review of printed form of Sanskrit words in MW
Sanskrit words appear in MW in various guises.
Devanagari
Devanagari script is reserved for headwords which the author classifies as in the 'first line'. The large Devanagari is reserved for so-called 'genuine' roots.
Bold Latin alphabet with diacritics
This representation appears in headwords of the second and third line.
Italic Latin alphabet with diacritics
This typeface is used both for Sanskrit words and non-Sanskrit (e.g., Latin, Gothic, etc.) words. This typeface is also used for fourth line headwords.
Normal typeface, with diacritics |
Coding of Sanskrit words in Cologne digitization - SLP1
In the current version of the Cologne digitization, here is how the various forms illustrated above are coded.
Devanagari in SLP1
The headword form appears in the
Bold Latin text with diacritics in SLP1
These headwords also appear in key2 element with SLP1 spelling
Italic Latin text with diacritics in SLP1
For 4th line headwords, these also appear in key2 element with SLP1 spelling. It should be possible to retrieve the original printed form of the headwords by combining the
For other usages, the text appears within the |
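To make the SLP1 coding above concrete, here is a minimal sketch that turns an SLP1 string into Latin letters with diacritics. It assumes IAST as the target, which is not the same as MW's printed transliteration, so it does not reproduce the printed form. The names `SLP1_TO_IAST` and `slp1_to_iast` are hypothetical and are not part of the Cologne tooling.

```python
# Illustrative only: SLP1 uses one ASCII character per sound, so a
# character-by-character lookup is enough. The target here is IAST
# (an assumption); MW's own printed scheme differs in places.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au", "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ",
    "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh",
    "P": "ph", "B": "bh",
    "S": "ś", "z": "ṣ",
}


def slp1_to_iast(text: str) -> str:
    """Map each SLP1 character to IAST; unknown characters pass through."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


# Headwords mentioned in this thread:
print(slp1_to_iast("araGawwa"))
print(slp1_to_iast("tama-praBA"))
```

Characters without an entry (such as `a`, `k`, or the hyphen) are left unchanged, which is why the table lists only the letters whose IAST form differs from the SLP1 letter.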
Coding of Sanskrit words in Cologne digitization - AS
Sanskrit words appearing in normal typeface, with diacritics, are coded with the AS (letter-number) system. In
|
Coding of non-Sanskrit, non-English
There are words in numerous other languages appearing within the MW dictionary. The next few comments briefly describe the coding conventions used in the current digitization in these cases. |
Greek
Greek text is coded in a roundabout way. An example will give the gist of the method. mw.xml = This is within record with Cologne id L = 4. Within a record (distinct L-number), instances of Greek text are coded with the An external table, with one line per L-code, is used to correlate this coding (mwaux/mwgreek/mwgreek_input.txt). The line for our L=4 example is
The displays of MW:
Suggested enhancement
The digitization should use a simpler system. I think the one we have used to implement the coding of Greek in other dictionaries by @jmigliori should be employed. Thus, our example would be coded This would drop the links to Perseus, which I doubt are very useful. |
Arabic text
This is coded in a way consistent with the method developed by @jsonreeder in the [ArabicInSanskrit](https://github.com/sanskrit-lexicon/ArabicInSanskrit/) repository. For instance, under headword 'araGawwa',
|
No need to have AS coding.
The displays hide the AS coding, by converting on the fly to Unicode, based on a particular AS-ROMAN transcoding, or based on using the An improvement to the MW digitization would replace all the AS with Unicode. This will impinge upon the details of literary source lookup, and probably some other things. And special attention will be needed to transform the |
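As a rough illustration of what replacing AS with Unicode could involve, the sketch below rewrites letter-number pairs as a letter plus a combining diacritic and then normalizes to precomposed characters. The digit meanings (1 = macron, 2 = dot below, 4 = acute) are an assumption for illustration, and the names `AS_DIGIT_TO_COMBINING` and `as_to_unicode` are hypothetical. The actual AS-ROMAN transcoding table used by the displays should be treated as the authority.

```python
import re
import unicodedata

# Assumed digit meanings, for illustration only:
#   1 = macron, 2 = dot below, 4 = acute accent.
AS_DIGIT_TO_COMBINING = {
    "1": "\u0304",  # combining macron
    "2": "\u0323",  # combining dot below
    "4": "\u0301",  # combining acute accent
}

# A single ASCII letter immediately followed by one of the assumed digits.
AS_PAIR = re.compile(r"([A-Za-z])([124])")


def as_to_unicode(text: str) -> str:
    """Replace AS letter-number pairs with precomposed Unicode letters."""
    def repl(match: re.Match) -> str:
        return match.group(1) + AS_DIGIT_TO_COMBINING[match.group(2)]

    # NFC folds letter + combining mark into a single code point where one exists.
    return unicodedata.normalize("NFC", AS_PAIR.sub(repl, text))


print(as_to_unicode("Kr2s2n2a"))
```

A blanket substitution like this would also rewrite any ordinary letter that happens to be followed by a digit, so a real conversion would need to be restricted to the spans that are known to hold AS text.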
Jim, thanks for the series of posts, filling the gaps myself.
Indeed interesting, because the book-wise digital text is what some, including me, lack.
So is capitalization lost or reversible? And do not feel guilty - MW is a hard nut.
Yeah, it's hard. Opening the XML does not give me what's inside the original book, only web display gives.
links to Perseus are a proof of concept and should be left as is. Why not move to a tag?
At least in a tag = meta data, as we do not touch the original text, sure.
As MW remains the most popular one - can't disagree. |
MW Russian words
And this will make the example unreadable for those who don't know Russian letters (as perhaps MW himself)? You know there are also Armenian examples there, and who knows this Armenian alphabet, except Armenians themselves?? I think we should keep the original MW spelling, but give a tip for correct national spelling. These are Russian examples from MW.
№3 & №10 need corrected Latin unicode text. Also |
@funderburkjim FYI I notice that the Arabic text in your comment is not displaying exactly as I had expected. Note the dots on the left-most character. In GitHub's monospace font (looks like "SFMono-Regular"), the dots appear incorrectly below the letter. In GitHub's standard font, the dots appear correctly above the letter. The expected behavior is for them to appear above the letter (as in the original dictionary). Let's make sure that the font used on the website displays the letter properly. |
I guess
So nothing of this monospace font bug on the Cologne site itself? |
Me too. In this case we should mark it as print error and correct it this way |
How is this known? |
tama-praBA Here is how this looks in Edge Browser: This looks as expected. So we can say Edge does it 'right'. and in Firefox, which looks wrong in the same way that Chrome looks right, like in Word image above: So, from the Edge example, it's not a bug in the web-font itself. But rather some difference in the Once, a couple of years ago, Susan Moore discovered a problem with the way that Safari browser on Macintosh computer rendered a certain word in Devanagari (I think it was Izky -- where the Mac omitted the virAma in Devanagari). |
Here is how araGawwa looks with Chrome browser. This uses Times New Roman, and appears right And here is how the above comment looks in On my system this uses Segoe UI font. One example conclusion: The Arabic text is being rendered properly in the Cologne displays on Windows OS computers. We could develop a UI that would let you review the rendering of all the Arabic text examples in Cologne displays of MW. @jsonreeder Should we do this to see if there are any anomalies? |
By comparison of the examples. I do not believe in accidental coincidences. All 11 Russ. examples are found also in Bopp, and Bopp often also provides Cyrillic renderings and gives more precise spellings. E.g. Bopp gives Russ. |
@funderburkjim can you make a list with Russ./Slav./Old.Slav. words for MW + Bopp? (The same way as for PWG.) I'd like to compare more words MW vs. Bopp, but manual searching takes too much time. I've noticed Bopp's words marked |
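A list like the one requested could be drafted with a simple pattern search. The sketch below is only an outline under stated assumptions: it takes records as hypothetical `(L-number, text)` pairs, assumes the language abbreviation is immediately followed by the cited word, and uses placeholder words rather than real dictionary data. The names `LANG_WORD` and `find_slavic_citations` are invented for this example.

```python
import re

# Language abbreviation (Russ., Slav., Old.Slav. / Old Slav.) followed by one word.
# The assumption that the cited word comes right after the abbreviation
# may not hold for every entry.
LANG_WORD = re.compile(r"\b(Old\.?\s?Slav|Slav|Russ)\.\s*([^\s,;.()]+)")


def find_slavic_citations(records):
    """Return (L-number, abbreviation, word) for each match in the records."""
    hits = []
    for lnum, text in records:
        for lang, word in LANG_WORD.findall(text):
            hits.append((lnum, lang, word))
    return hits


# Placeholder input, not real dictionary content.
sample = [("1", "cf. Russ. AAA; Old.Slav. BBB, Slav. CCC")]
for hit in find_slavic_citations(sample):
    print(hit)
```

Running the same function over the MW text and over the Bopp text, then joining the two result lists on the cited word, would give a starting point for the comparison. Spelling differences between the two sources would still need manual review.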
Exactly.
Mihas could try to fix it but will not, he does not update it.
Now that is interesting. I wonder how many of the 900 MW etymologies are found in Bopp? |
Confirmed. It looks right.
I don't think that this is worth much extra effort. This is a rare character, and it appears fine in your screenshot. That said, if you'd like to be extra sure, I'll happily do some cross-browser testing. Just send me links to entries with Arabic characters and I'll compare them using Browser Stack. |
I'm currently limited in what I can do, based on recent cataract surgery. So, I won't get to your suggestion for a while. |
@funderburkjim Wishing you a good recovery for your surgery. Browser Stack is a useful tool for cross-browser testing. They allow you to open your site on quite a few different combinations of OS and Browser. So it's a quick way of checking what the page will look like in various environments. |
Have been using it for years. I think it's an overkill. Let's leave it, guys. Jim, so you have your 3rd eye open now legally? |
After meta-line conversion done and Greek and Russian being tracked elsewhere, this should be closable now. |
Yes -- closable --- But that 'Browserstack' reference could be useful sometime. |
Bopp's work came out in 1847 and contains Greek, Latin, German, Lithuanian, Slavic, Celtic. MW's 1st edition appeared in 1872, and it is but natural for one to use the previously published works, as much as possible. And MW himself has acknowledged (in MW72) which resources he has used- Though MW 2nd edition (MW99) has the Introduction pages from the 1st edition added in it, this particular portion is 'skipped' there!! |
Yes, all etymologies are Bopp based, nothing new. |
A comment elsewhere deserves a full documentation of the relation between the printed text of MW and the various ways this text is represented in the digitization.
I'll attempt to do that documentation here when time permits.