MW missing headword puzzle #221

funderburkjim · 2018-04-10T20:34:56Z

Background

One of the goals of the MW meta/iast conversion is to improve the MW markup.

In the course of this, an odd aspect of the coding of the root ni- √mṝ came to the fore.
Both the context leading to this oddity, and the somewhat unrelated nature of the oddity illuminate
interesting aspects of the construction of the MW digitization, past and current.

Context

Yesterday, I was working with the <vlex> tag. This tag was introduced in 2011. The objective
was to identify with markup the parts of root records, so as to (a) know which headwords are roots and (b) distinguish certain other aspects of root records (class-pada usage information, important for inflected forms; whether the root is a prefixed root; whether the root is classified as a Denominative).

So, we wanted to mine MW for information about roots, and used the addition of a <vlex> tag as
a means to narrow the problem. The <root> tag also played a part here.

Later work then parsed this information and simplified it; for instance a listing of roots with class-pada information was developed.

Meta information

It now seems to me that it is best to think of this markup (using vlex and root tags) as an intermediate
step in generating meta-information about the MW dictionary. For instance, it seems better to add the summarized class-pada information as explicit meta information, rather than to leave it in the dictionary
in implicit form involving the vlex and root tags.


funderburkjim · 2018-04-10T20:46:21Z

A specific meta information example, root aṁś

The codings use the meta-iast forms under development.

Coding with <vlex> and <root> tags

<L>9<pc>1,1<k1>aMS<k2>aMS<e>1
<s>aMS</s> ¦ <vlex type="root"></vlex> <vlex>cl.10 P.</vlex> <s>aMSayati</s> , <to/>to divide, 
distribute, <ls>L.</ls> ; also occasionally <vlex>Ā.</vlex> <s>aMSayate</s>, <ls>L.</ls> ; also 
<s>aMSApayati</s>, <ls>L.</ls><info mwgenuineroot="yes"/>
<LEND>

Coding with <info> tag

<L>9<pc>1,1<k1>aMS<k2>aMS<e>1
<s>aMS</s> ¦  <ab>cl.</ab> 10 <ab>P.</ab> <s>aMSayati</s> , <to/>to divide, distribute, 
<ls>L.</ls> ; also occasionally <ab>Ā.</ab> <s>aMSayate</s>, <ls>L.</ls> ; also 
<s>aMSApayati</s>, <ls>L.</ls>
<info mwgenuineroot="yes"/><info verbcp="10P,10Ā"/>
<LEND>

Note:

<ab>cl.</ab> 10 <ab>P.</ab> reflects the printed text, since 'cl.' and 'P.' are first of all abbreviations.
Likewise, <ab>Ā.</ab>, appearing later in the entry, is first of all an abbreviation.
Now the interpretation of those abbreviations appears in the <info verbcp="10P,10Ā"/>.
- the <info> tag is reserved for meta information about the entry.
- the particular attribute verbcp has been chosen so as to remind a reader of the nature of the
  meta information (verb class pada information about a root)
- the particular attribute value 10P,10Ā summarizes that, according to this entry, aṁś is a class 10
  root, which may be conjugated in the present system using either parasmaipada or atmanepada
  paradigms.
The information in this tag was imported from the verb_cp.txt file mentioned above.
There is no loss of information by removing the vlex and root tags: all the information is present in
the replacement info tag.
The information of the replacement info tag will be much easier to access, since it is more explicit.
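
To illustrate how directly the explicit form can be read, here is a minimal Python sketch that pulls the verbcp value out of a record and splits it into (class, pada) pairs. The helper name parse_verbcp is hypothetical and not part of any existing tooling, and the sketch assumes every item has the shape of a class number followed by P or Ā, as in the aṁś example.

```python
import re

# Hypothetical helper for illustration only.
# Assumption: each verbcp item is a class number followed by P or Ā, e.g. "10P".
VERBCP_RE = re.compile(r'<info verbcp="([^"]*)"/>')
ITEM_RE = re.compile(r'^(\d+)(P|Ā)$')

def parse_verbcp(record):
    """Return the (class, pada) pairs found in the record's verbcp info tag."""
    m = VERBCP_RE.search(record)
    if m is None:
        return []
    pairs = []
    for item in m.group(1).split(','):
        im = ITEM_RE.match(item.strip())
        if im:  # items that do not fit the assumed shape are skipped
            pairs.append((int(im.group(1)), im.group(2)))
    return pairs

# Last line of the aṁś record shown above.
record = '<info mwgenuineroot="yes"/><info verbcp="10P,10Ā"/>'
print(parse_verbcp(record))
```

With the vlex coding, the same answer would require combining 'cl.10 P.' with the later 'Ā.' and carrying the class number forward, which is the implicit step the info tag avoids.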

Note:
The <info mwgenuineroot="yes"/> element is another kind of meta information now incorporated into
<info> tags in the current digitization revision. It means that this root entry appears in large Devanagari type in the printed text, which MW in the preface mentioned as a typographical means of
identifying the most important roots, which he called 'genuine' roots. The information for this tag was
previously hidden in an ancillary genuineroots file.

funderburkjim · 2018-04-10T21:58:07Z

The nimṛ / nimṝ mystery

In the course of introducing <info preverb="X"/> meta-information, one of the cases caused a problem. This preverb information was derived from work done in 2009. From that work, we had
an instance '109046 nimf ni+mf', meaning L-number 109046, etc. But in the current MW,
there is no L=109046.

Further, there is an L=109047, which gives us page 551, and the scan clearly shows root 'nimf'!

So how did we lose this root?
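
A check of this kind can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, the preverb list is assumed to be whitespace-separated with the L-number first (as in the instance quoted above), and record headers are assumed to start a line with <L>.

```python
import re

# Assumption: every record starts with a header line of the form <L>...<pc>...
L_RE = re.compile(r'^<L>([0-9.]+)<pc>')

def current_lnums(mw_lines):
    """Collect the L-numbers of all record header lines."""
    found = set()
    for line in mw_lines:
        m = L_RE.match(line)
        if m:
            found.add(m.group(1))
    return found

def missing_preverb_entries(preverb_lines, mw_lines):
    """Return the preverb entries whose L-number has no record in mw_lines."""
    have = current_lnums(mw_lines)
    missing = []
    for line in preverb_lines:
        parts = line.split()
        if parts and parts[0] not in have:
            missing.append(line)
    return missing

# Two headers as they appear in the current digitization.
mw_lines = [
    '<L>9<pc>1,1<k1>aMS<k2>aMS<e>1',
    '<L>109052.5<pc>551,2<k1>nimF<k2>ni-mF<e>1',
]
preverb_lines = ['109046 nimf ni+mf']
print(missing_preverb_entries(preverb_lines, mw_lines))
```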

Well, it turns out that there is an entry in the Supplement to MW for nimṝ (long vowel ṝ), on page 1329.

This says that the original nimṛ was a spelling error, and should be changed to nimṝ.

Back in 2012, we undertook to embed the supplement into the body of the digitization.
And on October 19, 2012, we embedded this change in the following manner:
a) the 109046 nimṛ record was deleted, and
b) the new nimṝ record was inserted in proper alphabetical order as L=109052.5,
between nimṛd and nime.
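
The fractional L-number is what lets the relocated record sort into place. A minimal Python sketch of that ordering, using Decimal so that '109052.5' compares exactly; the neighbouring values 109052 and 109053 are placeholders chosen for illustration and are not taken from the actual file.

```python
from bisect import insort
from decimal import Decimal

def insert_lnum(lnums, new_lnum):
    """Insert an L-number (given as a string) into a sorted list of Decimals."""
    insort(lnums, Decimal(new_lnum))
    return lnums

# 109052 and 109053 are placeholder neighbours, not values read from MW.
lnums = [Decimal("109047"), Decimal("109052"), Decimal("109053")]
insert_lnum(lnums, "109052.5")
print([str(n) for n in lnums])
```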

funderburkjim · 2018-04-10T22:14:17Z

Problems with above solution

The current coding of nimF is:

<L>109052.5<pc>551,2<k1>nimF<k2>ni-mF<e>1
<s>ni-<root>mF</root></s> ¦ ( 2. <ab>sg.</ab> <ab>Impv.</ab> <s>-mfRIhi</s>) , 
to crush, <ls>AV. x, 1 , 17.</ls> <pb n="1329,3"/> <info n="rev"/>
<LEND>

The two fields <pb n="1329,3"/> <info n="rev"/> are useful, in that they indicate that
a) this record has been revised based on the supplement (<info n="rev"/>), and
b) the revision page is 1329, column 3 (<pb n="1329,3"/>).
However:
- The reordering of the record (putting it in its proper alphabetical location) is confusing
  when comparing the digitization with the scanned image.
- There is no indication that the former spelling was nimf.
- There is no help in the displays regarding this change.
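
For reference, the markers that are present in the record can be read off mechanically. The following Python sketch is illustrative only: the function name revision_summary is hypothetical, and it assumes the <pc> and <pb> values always have the form page,column.

```python
import re

# Assumption: <pc> and <pb n="..."> both hold "page,column" in digits.
HEADER_RE = re.compile(r'<L>([^<]+)<pc>(\d+),(\d+)<k1>([^<]+)<k2>([^<]+)')
PB_RE = re.compile(r'<pb n="(\d+),(\d+)"/>')

def revision_summary(record):
    """Summarize where a record sits in the body and where its revision comes from."""
    h = HEADER_RE.search(record)
    if h is None:
        return None
    pb = PB_RE.search(record)
    return {
        "L": h.group(1),
        "k1": h.group(4),
        "body_page": int(h.group(2)),
        "supplement_page": int(pb.group(1)) if pb else None,
        "revised": '<info n="rev"/>' in record,
    }

nimF = (
    '<L>109052.5<pc>551,2<k1>nimF<k2>ni-mF<e>1\n'
    '<s>ni-<root>mF</root></s> ¦ ( 2. <ab>sg.</ab> <ab>Impv.</ab> <s>-mfRIhi</s>) , \n'
    'to crush, <ls>AV. x, 1 , 17.</ls> <pb n="1329,3"/> <info n="rev"/>\n'
    '<LEND>'
)
print(revision_summary(nimF))
```

Note that nothing in this summary recovers the former spelling nimf, which is exactly the gap described above.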

I view this as a problem, but I don't have a clear conception of a solution.

I don't know how many cases are 'like' this. We have all the update logs from back in 2012 on the
Cologne server. Somehow using them would allow an identification of similar cases, once we formulate
programmable notions of what similarity means here.
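
One possible first approximation of such a notion, sketched in Python, does not use the update logs at all. It simply scans the digitization for records that carry both a fractional L-number and the <info n="rev"/> marker. This is a swapped-in heuristic, offered as an assumption about what 'similar' might mean; the function name is hypothetical, and the result would still need checking against the 2012 logs.

```python
import re

L_RE = re.compile(r'<L>([0-9.]+)<pc>')

def relocated_candidates(records):
    """Return L-numbers of records that look like relocated supplement revisions.

    Heuristic (an assumption, not an established rule): the L-number is
    fractional and the record is flagged with <info n="rev"/>.
    """
    out = []
    for rec in records:
        m = L_RE.match(rec)
        if m and '.' in m.group(1) and '<info n="rev"/>' in rec:
            out.append(m.group(1))
    return out

records = [
    '<L>9<pc>1,1<k1>aMS<k2>aMS<e>1\n<s>aMS</s> ...<info mwgenuineroot="yes"/>\n<LEND>',
    '<L>109052.5<pc>551,2<k1>nimF<k2>ni-mF<e>1\n... <pb n="1329,3"/> <info n="rev"/>\n<LEND>',
]
print(relocated_candidates(records))
```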

Assuming we have identified the similar cases, should we resuscitate the original headwords and make
them part of the current digitization, with some standard boilerplate comments in both the original
wrong spelling and the current revised spelling?

I'll flag this comment as a 'bug'. Maybe sometime we'll get more data on the scope of the problem and
some clearer idea of a good solution.

gasyoun · 2018-04-11T06:52:55Z

Jim, you know I'm thrilled with verbs. I was not aware of a genuine dhatu list, so you opened my eyes. I will want to reuse it in my dhatu research as well.

"make them part of the current digitization"

Sure we want to. That is the only big excuse for using the digital version and not the original paper book: it can fix what the Motilal reprinters never will.

drdhaval2785 · 2020-12-17T07:40:23Z

I think this should be treated as an edge case, and does not require further analysis. Closing.

funderburkjim added the bug (Not working as expected) label Apr 10, 2018

drdhaval2785 closed this as completed Dec 17, 2020

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) #325
