MW missing headword puzzle #221

Closed
funderburkjim opened this issue Apr 10, 2018 · 5 comments
Labels
bug Not working as expected

Comments

@funderburkjim
Contributor

Background

One of the goals of the MW meta/iast conversion is to improve the MW markup.

In the course of this, an odd aspect of the coding of the root ni- √mṝ came to the fore.
Both the context leading to this oddity and the somewhat unrelated nature of the oddity itself illuminate
interesting aspects of the construction of the MW digitization, past and current.

Context

Yesterday, I was working with the <vlex> tag. This tag was introduced in 2011. The objective
was to identify with markup the parts of root records, so as to (a) know which headwords are roots and (b) distinguish certain other aspects of root records (class-pada usage information, important for inflected forms; whether the root is a prefixed root; whether the root is classified as a Denominative).

So, we wanted to mine MW for information about roots, and used the addition of a <vlex> tag as
a means to narrow the problem. The <root> tag also played a part here.

Later work then parsed this information and simplified it; for instance a listing of roots with class-pada information was developed.

Meta information

It now seems to me that it is best to think of this markup (using vlex and root tags) as an intermediate
step in generating meta-information about the MW dictionary. For instance, it seems better to add the summarized class-pada information explicitly as meta information, rather than to leave it in the dictionary
in implicit form involving the vlex and root tags.

@funderburkjim
Contributor Author

A specific meta information example, root aṁś

The codings use the meta-iast forms under development.

Coding with <vlex> and <root> tags

<L>9<pc>1,1<k1>aMS<k2>aMS<e>1
<s>aMS</s> ¦ <vlex type="root"></vlex> <vlex>cl.10 P.</vlex> <s>aMSayati</s> , <to/>to divide, 
distribute, <ls>L.</ls> ; also occasionally <vlex>Ā.</vlex> <s>aMSayate</s>, <ls>L.</ls> ; also 
<s>aMSApayati</s>, <ls>L.</ls><info mwgenuineroot="yes"/>
<LEND>

Coding with <info> tag

<L>9<pc>1,1<k1>aMS<k2>aMS<e>1
<s>aMS</s> ¦  <ab>cl.</ab> 10 <ab>P.</ab> <s>aMSayati</s> , <to/>to divide, distribute, 
<ls>L.</ls> ; also occasionally <ab>Ā.</ab> <s>aMSayate</s>, <ls>L.</ls> ; also 
<s>aMSApayati</s>, <ls>L.</ls>
<info mwgenuineroot="yes"/><info verbcp="10P,10Ā"/>
<LEND>

Note:

  • <ab>cl.</ab> 10 <ab>P.</ab> reflects the printed text, since 'cl.' and 'P.' are first of all abbreviations.
  • Likewise, <ab>Ā.</ab> appearing later in the entry is first of all an abbreviation.
  • Now the interpretation of those abbreviations appears in the <info verbcp="10P,10Ā"/>.
    • the <info> tag is reserved for meta information about the entry.
    • the particular attribute verbcp has been chosen so as to remind a reader of the nature of the
      meta information (verb class pada information about a root)
    • the particular attribute value 10P,10Ā summarizes that, according to this entry, aṁś is a class 10
      root, which may be conjugated in the present system using either parasmaipada or atmanepada
      paradigms.
  • The information in this tag was imported from the verb_cp.txt file mentioned above.
  • There is no loss of information by removing the vlex and root tags: all the information is present in
    the replacement info tag.
  • The information of the replacement info tag will be much easier to access, since it is more explicit.
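
To illustrate how directly the explicit form can be read, here is a minimal Python sketch that extracts the class-pada pairs from an <info verbcp="..."/> tag. The function name parse_verbcp is hypothetical, and the sketch assumes each comma-separated item is a class number followed by P or Ā, as in the aṁś example; other value shapes may exist in the real data.

```python
import re

def parse_verbcp(body):
    """Return (class, pada) pairs from an <info verbcp="..."/> tag, or []."""
    m = re.search(r'<info verbcp="([^"]*)"/>', body)
    if m is None:
        return []
    pairs = []
    for item in m.group(1).split(","):
        # Assumption: each item is a class number followed by a pada letter (P or Ā).
        cm = re.fullmatch(r"(\d+)(P|Ā)", item.strip())
        if cm:
            pairs.append((int(cm.group(1)), cm.group(2)))
    return pairs

# Tail of the aṁś record in the <info> coding shown above.
entry = '<s>aMSApayati</s>, <ls>L.</ls>\n<info mwgenuineroot="yes"/><info verbcp="10P,10Ā"/>'
print(parse_verbcp(entry))
```

Recovering the same pairs from the vlex coding would require interpreting several separate tags spread through the entry text.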

Note:
The <info mwgenuineroot="yes"/> element is another kind of meta information now incorporated into
<info> tags in the current digitization revision. It means that this root entry appears in large Devanagari type in the printed text, which MW in the preface mentioned as a typographical means of
identifying the most important roots, which he called 'genuine' roots. The information for this tag was
previously hidden in an ancillary genuineroots file.

@funderburkjim
Contributor Author

The nimṛ / nimṝ mystery

In the course of introducing <info preverb="X"/> meta-information, one of the cases caused a problem. This preverb information was derived from work done in 2009. From that work, we had
an instance `109046 nimf ni+mf', meaning L-number of 109046, etc. But in the current MW,
there is no L=109046.

Further, there is an L=109047, which gives us page 551, and the scan clearly shows the root 'nimf'!

So how did we lose this root?

Well, it turns out that there is an entry in the Supplement to MW for nimṝ (long vowel ṝ), on page 1329:

This says that the original nimṛ was a spelling error, and should be changed to nimṝ.

Back in 2012, we undertook to embed the supplement into the body of the digitization.
And on October 19, 2012, we embedded this change in the following manner:
a) the 109046 nimṛ record was deleted, and
b) the new nimṝ record was inserted in proper alphabetical order as L=109052.5,
between nimṛd and nime.
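
Dangling references of this kind can be detected mechanically. The following Python sketch is only an illustration: it assumes the 2009 preverb data consists of whitespace-separated lines like the one quoted above, and the function names and the short list of current L-numbers are hypothetical. L-numbers are compared as Decimal values so that inserted records such as 109052.5 are handled exactly.

```python
from decimal import Decimal

def parse_preverb_line(line):
    """Split a line such as '109046 nimf ni+mf' into (L-number, headword, analysis)."""
    lnum, key, analysis = line.split()
    return Decimal(lnum), key, analysis

def stale_references(preverb_lines, current_lnums):
    """Return preverb records whose L-number is absent from the current digitization."""
    current = {Decimal(x) for x in current_lnums}
    records = [parse_preverb_line(line) for line in preverb_lines]
    return [rec for rec in records if rec[0] not in current]

preverb_lines = ["109046 nimf ni+mf"]    # from the 2009 preverb work
current_lnums = ["109047", "109052.5"]   # only the L-numbers mentioned in this issue
for lnum, key, analysis in stale_references(preverb_lines, current_lnums):
    print(f"L={lnum} ({key}, {analysis}) is not in the current digitization")
```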


@funderburkjim
Contributor Author

Problems with above solution

The current coding of nimF is:

<L>109052.5<pc>551,2<k1>nimF<k2>ni-mF<e>1
<s>ni-<root>mF</root></s> ¦ ( 2. <ab>sg.</ab> <ab>Impv.</ab> <s>-mfRIhi</s>) , 
to crush, <ls>AV. x, 1 , 17.</ls> <pb n="1329,3"/> <info n="rev"/>
<LEND>
  • The two fields <pb n="1329,3"/> <info n="rev"/> are useful, in that they indicate that
    a) this record has been revised based on the supplement <info n="rev"/>
    b) The revision page is 1329, column 3.
  • The reordering of the record (putting it in its proper alphabetical location) is confusing
    when comparing the digitization with the scanned image.
  • There is no indication that the former spelling was nimf.
  • There is no help in the displays regarding this change.
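
As a rough aid for finding records coded this way, the sketch below reads the header fields of a record and reports whether it carries the <info n="rev"/> marker and a <pb> revision page. It is a minimal illustration in Python, not the project's own tooling: the function name revision_info and the regular expressions are assumptions based only on the nimF record shown above.

```python
import re

# Header fields as they appear in the record above: <L>...<pc>...<k1>...<k2>...
HEADER = re.compile(
    r"<L>(?P<L>[\d.]+)<pc>(?P<pc>[^<]+)<k1>(?P<k1>[^<]+)<k2>(?P<k2>[^<]+)"
)

def revision_info(record):
    """Return header fields plus supplement-revision details, or None if no header is found."""
    m = HEADER.search(record)
    if m is None:
        return None
    info = m.groupdict()
    info["revised"] = '<info n="rev"/>' in record
    pb = re.search(r'<pb n="([^"]+)"/>', record)
    info["revision_page"] = pb.group(1) if pb else None
    return info

record = """<L>109052.5<pc>551,2<k1>nimF<k2>ni-mF<e>1
<s>ni-<root>mF</root></s> ¦ ( 2. <ab>sg.</ab> <ab>Impv.</ab> <s>-mfRIhi</s>) ,
to crush, <ls>AV. x, 1 , 17.</ls> <pb n="1329,3"/> <info n="rev"/>
<LEND>"""
info = revision_info(record)
print(info)
```

Note that nothing in the extracted fields records the former spelling nimf, which is the gap described in the list above.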

I view this as a problem, but I don't have a clear conception of a solution.

I don't know how many cases are 'like' this. We have all the update logs from back in 2012 on the
Cologne server. Somehow using them would allow an identification of similar cases, once we formulate
programmable notions of what similarity means here.

Assuming we have identified the similar cases, should we resuscitate the original headwords and make
them part of the current digitization, with some standard boilerplate comments in both the original
wrong spelling and the current revised spelling?

I'll flag this comment as a 'bug' . Maybe sometime we'll get more data on the scope of the problem and
some clearer idea of a good solution.

@funderburkjim funderburkjim added the bug Not working as expected label Apr 10, 2018
@gasyoun
Member

gasyoun commented Apr 11, 2018

Jim, you know I'm thrilled with verbs. I was not aware of a genuine dhatu list, so you opened my eyes. I will want to reuse it in my dhatu research as well.

make them part of the current digitization

Sure we want to. That is the only big excuse for using the digital version and not the original paper book - that it can fix what the Motilal reprinters will never do.

@drdhaval2785
Contributor

I think this should be treated as an edge case, and does not require further analysis. Closing.
