Coding of Sanskrit words in MW - #55

Closed
funderburkjim opened this issue Nov 18, 2017 · 31 comments

@funderburkjim
Contributor

A comment elsewhere deserves a full documentation of the relation between the printed text of MW and the various ways this text is represented in the digitization.

I'll attempt to do that documentation here when time permits.

@funderburkjim
Contributor Author

The comment prompting this issue

[In MW] transliterated terms as "Vedas" etc. while selected Devanagari output are represented as "वेदs". That's not good at all. Transliterated terms and names must be treated separately from real Sanskrit words (stems and quotations). The term "Vedas" should be rendered in Latin letters, no matter which output is selected. A separate markup for these words in digitalization allows also to change the outdated transliteration scheme etc.

@gasyoun
Member

gasyoun commented Nov 20, 2017

I'll attempt to do that documentation here when time permits.

Please, Jim. There is a book on MW we plan to print in Russian in upcoming 6 months with @SergeA so it might be printed as an addition to https://archive.org/details/ApracticalSI

@funderburkjim
Contributor Author

Review of printed form of Sanskrit words in MW

Sanskrit words appear in MW in various guises.
The images come from page 1.

Devanagari

Devanagari script is reserved for headwords which the author classifies as in the 'first line'.
image

The large Devanagari is reserved for so-called 'genuine' roots.
A list of these is in mw_genuine_roots.

Bold Latin alphabet with diacritics

This representation appears in headwords of the second and third line.

image

Italic Latin alphabet with diacritics

This type face is used both for Sanskrit words and non-Sanskrit (e.g., Latin, Gothic, etc.) words.

This type face is also used for fourth line headwords.
image

normal type face, with Diacritics

This is used in literary source reference abbreviations:
image

In proper names:
image

Other:
image

image

@funderburkjim
Contributor Author

funderburkjim commented Jan 2, 2018

Coding of Sanskrit words in Cologne digitization - SLP1

In the current version of the Cologne digitization, here is how the various forms illustrated above are coded.

Devanagari in SLP1

The headword form appears in the <key2> tag in SLP1 coding. Accents and hyphens of the printed text are retained. In the printed text, the Devanagari headword is invariably followed by a representation in italic Latin letters with diacritics; this italic text does NOT have a separate representation in the digitization, although it could be deduced from the key2 spelling.

Bold Latin text with diacritics in SLP1

These headwords also appear in the key2 element with SLP1 spelling.
For words in the 3rd line of headwords, the printed text normally presents only the suffix, but the digitization key2 field presents a spelling of the entire word implied by the assumed prefix and suffix. It seems likely that the original suffix part that appears in the printed text could be retrieved from the hyphen characters within the key2 spelling, together with the ability to deduce the 'parent' line 2 headword; but a complete recreation of the original text form at this point would likely falter in a few cases.

Italic Latin text with diacritics in SLP1

For 4th line headwords, these also appear in the key2 element with SLP1 spelling.
As with line 3 headwords, the key2 element contains the full implied spelling of the headword, rather than just the suffix shown in the printed text.

It should be possible to retrieve the original printed form of the headwords by combining the
information in the 'key2' element with the headword line (1-4) information present in the H tag;
e.g., line 1 elements appear within <H1> tag, and similarly for lines 2-4 of headwords.

For other usages, the text appears within the <s> tag in SLP1 transliteration.

@funderburkjim
Contributor Author

Coding of Sanskrit words in Cologne digitization - AS

Sanskrit words appearing in normal type face with diacritics are coded with the AS (letter-number) system.

In <ls> tags

The diacritics appearing in ls tag abbreviations appear without further markup using the AS system.
So the example shown in the illustration above (headword aMSin) is coded <ls>Ya1jn5.</ls>, where
'a1' indicates a-macron and 'n5' indicates n-tilde.

The displays do a run-time conversion of this AS coding to Unicode IAST. However,
the resulting IAST is not fully modern. For example (see headword 'akramam'), <ls>Naish.</ls> is
rendered just as 'Naish.', whereas a fully modern IAST would be 'Naiṣ.' for Naiṣadha-carita.
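
A minimal sketch of such a run-time conversion, for illustration only. It is not the code used by the Cologne displays. The digit table is an assumption built from the few pairs visible in this thread (a1 and n5 as stated above, R2 inferred from R2ishi, z8 from the z8ora / žora example further down); the full AS table has more codes, and the names `AS_DIACRITICS` and `as_to_unicode` are hypothetical.

```python
import re
import unicodedata

# Assumed digit-to-diacritic table: only pairs attested or inferable in this thread.
AS_DIACRITICS = {
    "1": "\u0304",  # combining macron:    a1 -> a-macron
    "2": "\u0323",  # combining dot below: R2 -> R-dot-below (inferred from R2ishi)
    "5": "\u0303",  # combining tilde:     n5 -> n-tilde
    "8": "\u030C",  # combining caron:     z8 -> z-caron
}

# A letter immediately followed by one of the known AS digits.
AS_PAIR = re.compile(r"([A-Za-z])([1258])")


def as_to_unicode(text):
    """Replace letter+digit AS pairs with precomposed Unicode letters."""
    combined = AS_PAIR.sub(
        lambda m: m.group(1) + AS_DIACRITICS[m.group(2)], text
    )
    # NFC turns letter + combining mark into a single precomposed character.
    return unicodedata.normalize("NFC", combined)


print(as_to_unicode("Ya1jn5."))
print(as_to_unicode("Ra1kshasa"))
```

Note that this only swaps the AS codes for Unicode letters. It does not modernize the transliteration itself, so 'Naish.' stays 'Naish.' as described above. A real conversion would also need to guard against letter-digit sequences that are not AS codes.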

Elsewhere (not in <ls> tags).

This text has a more complex coding, using the <as0> and <as1> tags. The example under
akampana in above illustration is coded as
<as0>Ra1kshasa</as0><as1><s>rAkzasa</s></as1>

The as0 tag contains the original word, coded with AS. The as1 tag contains a representation
in which the SLP1 equivalent is coded within the <s> tag used elsewhere for SLP1 coding of Sanskrit.

In the displays as currently written, only the contents of the as1 tag are rendered into HTML. As with
other text appearing in the <s> tag, the renderer transforms the contents according to the user choice
of output format.

Reason for the as0, as1 system

I have to plead guilty here. In my early use of the dictionary, I found it quite difficult, when encountering words like rākṣasa (or rākshasa), to derive the headword spelling (slp1 rAkzasa). The reasonable solution seemed to be to provide an slp1 spelling. This slp1 spelling also facilitates Advanced Searches for Sanskrit words appearing within the body of entries, as opposed to headwords.
So, it seemed good to have the SLP1 spelling. But, on the other hand, some of the information (notably capitalization) is lost if only <s>rAkzasa</s> is in the text. Thus, the original AS representation was retained, but within the as0 tag.

The resulting system is inconsistent with an important tenet of text markup, as I now understand it. Namely, the original text should remain after markup is removed.
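
To make the inconsistency concrete, here is a small sketch using the akampana example from above. The helper names are hypothetical and the regular expressions assume the simple, un-nested form `<as0>X</as0><as1><s>Y</s></as1>`; this is not how the displays are actually written.

```python
import re

SAMPLE = "<as0>Ra1kshasa</as0><as1><s>rAkzasa</s></as1>"


def strip_all_tags(xml):
    """Naive markup removal: delete every tag, keep all text."""
    return re.sub(r"<[^>]+>", "", xml)


def original_text(xml):
    """Drop each <as1> element first, so only the AS-coded original remains."""
    without_as1 = re.sub(r"<as1>.*?</as1>", "", xml)
    return strip_all_tags(without_as1)


def display_source(xml):
    """Drop each <as0> element, mirroring displays that render only as1."""
    without_as0 = re.sub(r"<as0>.*?</as0>", "", xml)
    return strip_all_tags(without_as0)


print(strip_all_tags(SAMPLE))  # the word appears twice, in two codings
print(original_text(SAMPLE))
print(display_source(SAMPLE))
```

Plain tag stripping yields the word twice (once in AS, once in SLP1), which is why recovering the original text currently needs knowledge of the as0/as1 convention.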

@funderburkjim
Contributor Author

Coding of non-Sanskrit, non-English

There are words in numerous other languages appearing within the MW dictionary. The next few comments briefly describe the coding conventions used in the current digitization in these cases.

@funderburkjim
Contributor Author

Greek

Greek text is coded in a roundabout way. An example will give the gist of the method.
image

mw.xml = a prefix corresponding to <ab>Gk.</ab> <gk>1</gk> , <gk>2</gk> , <ab>Lat.</ab>

This is within record with cologne id L = 4.

Within a record (distinct L-number), instances of Greek text are coded with the <gk> tag; the content of the tag is a sequence number of Greek text within the record.

An external table, with one line per L-code, is used to correlate this coding (mwaux/mwgreek/mwgreek_input.txt). The line for our L=4 example is
4 A)<e>A<e>ἀ<gk>A)N<e>A%29N<e>ἀν
Within such a record,

  • 4 is the L-number
  • <gk> element separates the record into two (in this case) Greek words
  • Each element is further parsed into 3 parts, based on the <e> element. For instance , the second
    instance A)N<e>A%29N<e>ἀν is parsed as:
    • A)N is the Greek Beta representation. This was, as I recall, the form in which a student Wendy Teo of either Malcolm Hyman or Peter Scharf originally coded the Greek. Here is an old document providing
      some further information.
    • A%29N is a url-encoded representation of the Beta representation
    • ἀν is Unicode representation of the word.

The displays of MW:

  • use the table to replace <gk>2</gk> for our L=4 with unicode text ἀν,
  • and furthermore use the URL encoding to generate a link to Perseus online Greek dictionary.
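
The parsing and replacement steps above can be sketched as follows. This is an illustrative reading of the record format described in this comment, not the display code itself; the function names are hypothetical, and no Perseus URL is built because its exact form is not given here.

```python
import re
from urllib.parse import quote


def parse_greek_line(line):
    """Split one table line into its L-number and a list of Greek entries."""
    lnum, _, rest = line.partition(" ")
    entries = []
    for item in rest.split("<gk>"):
        # Each item has three parts: Beta code, URL-encoded Beta code, Unicode.
        beta, url, uni = item.split("<e>")
        entries.append({"beta": beta, "url": url, "unicode": uni})
    return lnum, entries


def resolve_greek(body, entries):
    """Replace <gk>n</gk> with the n-th (1-based) Unicode Greek word."""
    def repl(match):
        return entries[int(match.group(1)) - 1]["unicode"]
    return re.sub(r"<gk>(\d+)</gk>", repl, body)


line = "4 A)<e>A<e>\u1f00<gk>A)N<e>A%29N<e>\u1f00\u03bd"
lnum, entries = parse_greek_line(line)
print(lnum, entries)
print(resolve_greek("<ab>Gk.</ab> <gk>1</gk> , <gk>2</gk> , <ab>Lat.</ab>", entries))

# The middle field of the second entry matches ordinary percent-encoding
# of the Beta code.
print(quote("A)N"))
```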

Suggested enhancement

The digitization should use a simpler system. I think the one we have used to implement the coding of Greek in other dictionaries by @jmigliori should be employed. Thus, our example would be coded
<ab>Gk.</ab> <lang n="greek">ἀ</lang> , <lang n="greek">ἀν</lang>

This would drop the links to Perseus, which I doubt are very useful.

@funderburkjim
Contributor Author

Arabic text

This is coded in a way consistent with the method developed by @jsonreeder in the ArabicInSanskrit (https://github.com/sanskrit-lexicon/ArabicInSanskrit/) repository. For instance, under headword 'araGawwa',

image

<lang script="Arabic" n="Hindustani">ارهٿ</lang>

@funderburkjim
Contributor Author

Russian and other languages

Here is an example of Russian under headword aNgAra:
image

This is coded as <ab>Russ.</ab> <etym>u1golj</etym> .
Note that MW does not use the Cyrillic alphabet, but uses Latin letters with diacritics; these appear
within the <etym> tag, with the diacritics represented in the AS system.

Based on the <ab>Russ.</ab> abbreviation, there are only 11 instances. I notice that in at least one instance (under headword ahi), a '$' sign is used instead of the Latin letters with diacritics.

There are many other 'etymology' words similarly coded -- i.e. with Latin letters and diacritics represented with AS system.

Suggestions

Our Russian speakers should probably examine all these 11 cases. My hunch is that the 'u1golj' representation should be replaced by Cyrillic letters.

The other <etym> tags should also be examined, and at the very least the AS should be replaced with Unicode. Also the <etym> tag should probably be replaced with the <lang n="language">X</lang> markup. Finally, the 'etymology section' which, when present, normally appears at the end of an entry,
should probably be separated out into a separate record with a different L-number and H-code:

image

CURRENT:
<H1A><h><hc3>100</hc3><key1>ahi</key1><hc1>1</hc1><key2>a/hi</key2></h><body> 
<lex type="inh">m.</lex> <ab>N.</ab> of a <as0>R2ishi</as0><as1><s>fzi</s></as1> ( with the 
patron. <s>OSanasa</s>) and of another ( with the patron. <s>pEdva</s>) . ( [ <ab>Zd.</ab> 
<etym>az8i</etym> ; <ab>Lat.</ab> <etym>angui-s</etym> ; <ab>Gk.</ab> <gk>1</gk> , 
<gk>2</gk> , <gk>3</gk> , and <gk>4</gk> ; <ab>Lith.</ab> <etym>ungury-s</etym> ; 
<ab>Russ.</ab> $; <ab>Armen.</ab> <etym>o7z</etym> ; <ab>Germ.</ab> <etym>unc</etym>.] 
) </body><tail><MW>015869</MW> <pc>125,1</pc> <L>21811</L></tail></H1A>
PERHAPS BETTER:
<H1A><h><hc3>100</hc3><key1>ahi</key1><hc1>1</hc1><key2>a/hi</key2></h><body> 
<lex type="inh">m.</lex> <ab>N.</ab> of a <as0>R2ishi</as0><as1><s>fzi</s></as1> ( with the 
patron. <s>OSanasa</s>) and of another ( with the patron. <s>pEdva</s>) . </body><tail><MW>015869</MW> <pc>125,1</pc> <L>21811</L></tail></H1A>

SEPARATE H1E RECORD. SHOULD USE <lang> tag for separate language parts.
<H1E><h><hc3>100</hc3><key1>ahi</key1><hc1>1</hc1><key2>a/hi</key2></h><body> 
( [ <ab>Zd.</ab> 
<etym>az8i</etym> ; <ab>Lat.</ab> <etym>angui-s</etym> ; <ab>Gk.</ab> <gk>1</gk> , 
<gk>2</gk> , <gk>3</gk> , and <gk>4</gk> ; <ab>Lith.</ab> <etym>ungury-s</etym> ; 
<ab>Russ.</ab> $; <ab>Armen.</ab> <etym>o7z</etym> ; <ab>Germ.</ab> <etym>unc</etym>.] 
) </body><tail><MW>015869</MW> <pc>125,1</pc> <L>21811.1</L></tail></H1E>

Using a search on <etym> tag, there are only about 900 records with such an 'etymology' section.
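
As a rough sketch of how that separation might be automated: the function below splits a record body into the main part and a trailing `( [ ... ] )` section, but only when that section contains an <etym> tag. The pattern and the name `split_etymology` are assumptions based on the single ahi example above; entries whose etymology section is bracketed differently, or contains only <gk> references, would need additional handling.

```python
import re

# A trailing "( [ ... ] )" block at the end of the body (assumed shape).
ETYM_SECTION = re.compile(r"\(\s*\[.*\]\s*\)\s*$", re.DOTALL)


def split_etymology(body):
    """Return (main_body, etymology_section or None)."""
    match = ETYM_SECTION.search(body)
    if match is None or "<etym>" not in match.group(0):
        return body, None
    return body[:match.start()].rstrip(), match.group(0).strip()


body = (
    '<lex type="inh">m.</lex> ( with the patron. <s>pEdva</s>) . '
    "( [ <ab>Zd.</ab> <etym>az8i</etym> ; <ab>Germ.</ab> <etym>unc</etym>.] ) "
)
main, etym = split_etymology(body)
print(main)
print(etym)
```

The ordinary parenthesis "( with the patron. ...)" is left alone because the pattern requires an opening bracket directly after the opening parenthesis.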

@funderburkjim
Contributor Author

No need to have AS coding.

The displays hide the AS coding, by converting on the fly to Unicode, based on a particular AS-ROMAN transcoding, or based on using the <as1> tag, as needed.

An improvement to the MW digitization would replace all the AS with Unicode. This will impinge upon the details of literary source lookup, and probably some other things. And special attention will be needed to transform the <as0>X</as0><as1><s>Y</s></as1> instances. So the solution requires some care. But the result should be a better digitization, so it is worth doing.

@gasyoun
Member

gasyoun commented Jan 2, 2018

Jim, thanks for the series of posts, filling the gaps myself.

full implied spelling of the headword, rather than just the suffix shown in the printed text.

Indeed interesting, because the book-wise digital text is what some, including me, lack.

inconsistent with an important tenet of text markup, as I now understand it. Namely, the original text should remain after markup is removed.

So is capitalization lost or reversible? And do not feel guilty - MW is a hard nut.

The digitization should use a simpler system. I think the one we have used to implement the coding of Greek in other dictionaries by @jmigliori should be employed. Thus, our example would be coded
Gk. ἀ , ἀν

Yeah, it's hard. Opening the XML does not give me what's inside the original book, only web display gives.

This would drop the links to Perseus, which I doubt are very useful.

Links to Perseus are a proof of concept and should be left as is. Why not move them to a tag?
Similar to how you converted the line breaks inside a word in CCS.

Our Russian speakers should probably examine all these 11 cases. My hunch is that the 'u1golj' representation should be replaced by Cyrillic letters. In this case уголь.

At least in a tag = meta data, as we do not touch the original text, sure.

So the solution requires some care. But the result should be a better digitization, so it is worth doing.

As MW remains the most popular one - can't disagree.

@SergeA

SergeA commented Jan 7, 2018

MW Russian words

Our Russian speakers should probably examine all these 11 cases. My hunch is that the 'u1golj' representation should be replaced by Cyrillic letters.

And this will make the example unreadable for those who don't know Russian letters (as perhaps MW himself)? You know there are also Armenian examples there, and who knows this Armenian alphabet, except Armenians themselves?? I think we should keep the original MW spelling, but give a tip for correct national spelling.

These are Russian examples from MW.

  1. akSi = akṣi Russ. oko = око
  2. aGgAra = aṅgāra Russ. ūgolj = уголь
  3. ahi = ahi Russ. $ // ûgorj = угорь
  4. iS = iṣ Russ. iskate = искате (искать? искати?)
  5. UrNA = ūrṇā Russ. vōlna = волна
  6. kRmi = kṛmi Russ. červj = червь
  7. kRS = kṛṣ Russ. česhu = чешу
  8. kRSNa = kṛṣṇa Russ. černyi = черный
  9. kravis = kravis Russ. krovj = кровь
  10. gRR = gṝ 2 Russ. z8ora // žora = жора
  11. grIva = grīva Russ. glava & golova = глава & голова

№3 & №10 need corrected Latin unicode text.
№4 is a print error. There is no such Russian word as iskate (искате) with "e" in the end. As MW gives the word as Russ. it should be written as iskatj (?) (искать). Or if it's about Old Russian or OCS then as iskati (искати). Actually MW takes the spellings from Bopp's dic. But in this case Bopp gives not Russ., but Slav. with correct spelling isk-a-ti.

The Old. Slav. and Slav. words in MW should also be revised and compared with Bopp.

@jsonreeder

@funderburkjim FYI I notice that the Arabic text in your comment is not displaying exactly as I had expected. Note the dots on the left-most character.

In GitHub's monospace font (looks like "SFMono-Regular"), the dots appear incorrectly below the letter
ارهٿ

In GitHub's standard font, the dots appear correctly above the letter
ارهٿ

The expected behavior is for them to appear above the letter (as in the original dictionary).

Let's make sure that the font used on the website displays the letter properly.

@gasyoun
Member

gasyoun commented Jan 8, 2018

OCS then as iskati

I guess i is meant at the end.

Let's make sure that the font used on the website displays the letter properly.

So nothing of this monospace font bug on the Cologne site itself?

@SergeA

SergeA commented Jan 8, 2018

OCS then as iskati
I guess i is meant at the end.

Me too. In this case we should mark it as print error and correct it this way
Russ. iskate >> Slav. iskati
(Following Bopp.)

@SergeA

SergeA commented Jan 8, 2018

I was wondering where the font bug in Siddhanta on the site came from, while the same Siddhanta looks OK in Word. Now I know: it was not the same font, and the bug comes from the web-font.

Web-font bug vs. Word ok.
siddhanta_webfont_bug_pr
siddhabta_word_ok_pr

@funderburkjim
Contributor Author

Actually MW takes the spellings from Bopp's dic.

How is this known?

@funderburkjim
Contributor Author

tama-praBA

Here is how this looks in Edge Browser: This looks as expected. So we can say Edge does it 'right'.

image

and in Firefox, which looks wrong in the same way that Chrome looks right, like in Word image above:

image

So, from the Edge example, it's not a bug in the web-font itself, but rather some difference in the
way the particular sequence is being rendered. It might be that the difference is somehow due to the
presence of the 'circle' ° which is rendered in Times New Roman. I doubt that this detail of font rendering is something that we can control.

Once, a couple of years ago, Susan Moore discovered a problem with the way that Safari browser on Macintosh computer rendered a certain word in Devanagari (I think it was Izky -- where the Mac omitted the virAma in Devanagari).

@funderburkjim
Contributor Author

funderburkjim commented Jan 9, 2018

@jsonreeder

Here is how araGawwa looks with Chrome browser. This uses Times New Roman, and appears right
according to your description (please confirm).

image

And here is how the above comment looks in Preview Write Mode of GitHub:
image

On my system this uses Segoe UI font.

Conclusion from this one example: the Arabic text is being rendered properly in the Cologne displays on Windows computers.

We could develop a UI that would let you review the rendering of all the Arabic text examples in Cologne displays of MW. @jsonreeder Should we do this to see if there are any anomalies?

@funderburkjim
Contributor Author

@SergeA

Will make corrections to Russian words as you suggest -- putting on todo list.

Also think the 'native language tooltip' idea is interesting. Not sure how to implement yet -- will depend
on the existing markup.
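
One possible shape for such a tooltip, as a sketch only. It assumes the display emits HTML and that a lookup table from MW's spelling to the native spelling is available; the table below holds a few pairs taken from the list earlier in this thread, and the `<span title="...">` output and the function name are hypothetical choices, not existing markup.

```python
from html import escape

# A few pairs from the Russian list above (MW spelling -> Cyrillic).
CYRILLIC = {
    "oko": "око",
    "ūgolj": "уголь",
    "vōlna": "волна",
    "krovj": "кровь",
}


def with_tooltip(mw_spelling):
    """Keep MW's printed spelling as the text; put the native spelling in a tooltip."""
    native = CYRILLIC.get(mw_spelling)
    if native is None:
        # No native form known: show the original spelling unchanged.
        return escape(mw_spelling)
    return '<span title="{}">{}</span>'.format(
        escape(native, quote=True), escape(mw_spelling)
    )


print(with_tooltip("oko"))
print(with_tooltip("unc"))
```

This keeps the original MW spelling visible, as suggested, and adds the national spelling only as a hint.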

@SergeA

SergeA commented Jan 9, 2018

Actually MW takes the spellings from Bopp's dic.

How is this known?

By comparison of the examples. I do not believe in accidental coincidences. All 11 Russ. examples are also found in Bopp, and Bopp often provides Cyrillic renderings as well and gives more precise spellings. E.g. Bopp gives Russ. ćernyĭ with short ĭ, while MW gives černyi, losing the breve diacritic. In another place, for caturtha, Bopp gives ćetvertyĭ while MW gives cetvertyi, losing both diacritics.

@SergeA

SergeA commented Jan 9, 2018

@funderburkjim can you make a list with Russ./Slav./Old.Slav. words for MW + Bopp? (The same way as for PWG.) I'd like to compare more words MW vs. Bopp, but manual searching takes too much time. I've noticed Bopp's words marked Slav. are mostly spelled identically or close to modern Russian. I do not know how to treat them.

@gasyoun
Member

gasyoun commented Jan 9, 2018

So, from the Edge example, it's not a bug in the web-font itself. But rather some difference in the
way the particular sequence is being rendered.

Exactly.

Doubt if this detail of font rendering is something that we can control.

Mihas could try to fix it but will not, he does not update it.

caturtha Bopp gives ćetvertyĭ while MW gives cetvertyi, losing both diacritics

Now that is interesting. I wonder how many of the 900 MW etymologies are found in Bopp?

@jsonreeder

@funderburkjim

Here is how araGawwa looks with Chrome browser. This uses Times New Roman, and appears right according to your description (please confirm).

Confirmed. It looks right.

We could develop a UI that would let you review the rendering of all the Arabic text examples in Cologne displays of MW. @jsonreeder Should we do this to see if there are any anomalies?

I don't think that this is worth much extra effort. This is a rare character, and it appears fine in your screenshot. That said, if you'd like to be extra sure, I'll happily do some cross-browser testing. Just send me links to entries with Arabic characters and I'll compare them using Browser Stack.

@funderburkjim
Contributor Author

@jsonreeder

I'm currently limited in what I can do, based on recent cataract surgery. So, I won't get to your suggestion for a while.
However, your mention of Browser Stack (is it https://www.browserstack.com/ ?) is new to me. How does that work?

@jsonreeder

@funderburkjim Wishing you a good recovery from your surgery.

Browser Stack is a useful tool for cross-browser testing. They allow you to open your site on quite a few different combinations of OS and Browser. So it's a quick way of checking what the page will look like in various environments.

@gasyoun
Member

gasyoun commented Jan 22, 2018

Browser Stack is a useful tool for cross-browser testing.

Have been using it for years. I think it's an overkill. Let's leave it, guys. Jim, so you have your 3rd eye open now legally?

@drdhaval2785
Contributor

With the meta-line conversion done and Greek and Russian being tracked elsewhere, this should be closable now.

@funderburkjim
Contributor Author

Yes -- closable --- But that 'Browserstack' reference could be useful sometime.

@Andhrabharati
Contributor

Andhrabharati commented Jan 5, 2021

   Actually MW takes the spellings from Bopp's dic.

How is this known?

Bopp's work came out in 1847 and contains Greek, Latin, German, Lithuanian, Slavic, Celtic.

MW's 1st edition appeared in 1872, and it is but natural for one to use the previously published works, as much as possible.

And MW himself has acknowledged (in MW72) which resources he used:

image

Though MW 2nd edition (MW99) has the Introduction pages from the 1st edition added in it, this particular portion is 'skipped' there!!

@gasyoun
Copy link
Member

gasyoun commented Jan 5, 2021

Bopp's work came out in 1847 and contains Greek, Latin, German, Lithuanian, Slavic, Celtic.

Yes, all etymologies are Bopp based, nothing new.
