Uniform headword scheme for xxx.txt #130

funderburkjim · 2017-05-06T00:00:07Z

We need to devise a uniform scheme for headword coding in the xxx.txt digitizations.

When I started with these digitizations several years ago, it was difficult even to know how to identify
headwords in the digitizations - each digitization was different. This factor led to a system where the
headword identification was completely dynamic. In particular, the sequence number (what we refer to as the L number) was determined dynamically. This means that if some correction removed or added a headword, then all the subsequent L-numbers for that dictionary would change.

Now that the headword identification for the various dictionaries is much better known, it is time to
impose some uniformity on the headword coding of the xxx.txt digitizations, so that, at least in this fundamental aspect, the dictionaries are as similar as reasonably possible.

The details of this uniform headword coding are the subject of this issue.

funderburkjim · 2017-05-06T00:08:13Z

Recently, a step in the direction of this goal was made with the Schmidt dictionary documentation.

Current sch.txt headword scheme

Some examples.

.{#a#}1{#a°#}¦ {!2!}  , {%asvaptum%} Tāṇḍya-Br. 10 , 4 , 4. {part=,seq=1,type=,n=1}
.{#a#}1.1{#a#}¦ {!4!}  m. º= {%sarvajño 'rhan%} , S I , 53 , 3. — Viṣṇu , H 31 , 9; Vās. 113 , 1. {part=,seq=2,type=*,n=2}
.{#aMSa#}2{#aṃśa#}¦  1. {%kenāṃśena%} so v.a. in welchem Stücke? Daśak. 51 , 7. — 8. Nenner eines Bruches. {part=,seq=3,type=,n=3}
.{#aMSaka#}3{#aṃśaka#}¦  m. ºSohn , H 43 , 41. {part=,seq=4,type=,n=1}

The form is
.{#key1#}L-number{#key2#}¦ 
or, if there is a homonym number:
.{#key1#}L-number{#key2#}^HOM¦ 

key1 is in SLP1
key2 is in IAST
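
A minimal sketch of how this form could be parsed, for illustration only. The regular expression is my reading of the examples above, not the one in headword.py; the names `SCH_HEAD` and `parse_sch_head` are hypothetical, and I assume that neither key contains `#` and that the L-number is an integer with an optional decimal part (as in `1.1`).

```python
import re

# Assumed pattern for the form  .{#key1#}L-number{#key2#}^HOM¦
SCH_HEAD = re.compile(
    r"^\.\{#(?P<key1>[^#]*)#\}"   # key1 (SLP1)
    r"(?P<L>\d+(?:\.\d+)?)"       # L-number, possibly decimal (e.g. 1.1)
    r"\{#(?P<key2>[^#]*)#\}"      # key2 (IAST)
    r"(?:\^(?P<hom>[^¦]+))?¦"     # optional homonym number, then the broken bar
)

def parse_sch_head(line):
    """Return a dict with key1, L, key2, hom, or None if the line is not a headword line."""
    m = SCH_HEAD.match(line)
    return m.groupdict() if m else None
```

When the homonym part is absent, `hom` comes back as `None`.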

funderburkjim · 2017-05-06T00:08:39Z

example of adaptation of sch model to acc dictionary

Adapting this to ACC would involve changing all the lines to a similar form:

old:
<HI>{#aMSadaSA#}¦ jy. Rice 28.
new:
.{#aMSadaSA#}1{#aMSadaSA#}aMSadaSA
<HI>{#aMSuDara#}¦ poet Skm.

old 
<HI>{#akzoByatIrTa,#}¦ formerly Govindaśāstrin, successor of Mā-
new: NOTE REMOVAL OF comma in key1
.{#akzoByatIrTa#}L-num{#akzoByatIrTa,#}¦ formerly Govindaśāstrin, successor of Mā-
<HI>{#akzoByatIrTa,#}¦ formerly Govindaśāstrin, successor of Mā-

old:
<HI>{#aKaRqAnanda muni,#}¦ disciple of Akhaṇḍānubhūti:
new: NOTE REMOVAL OF space and comma in key1
.{#aKaRqAnandamuni#}L-num{#aKaRqAnanda muni,#}¦ disciple of Akhaṇḍānubhūti:
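
The key1 derivation in these ACC examples could be sketched as below. This covers only the two cases shown (dropping commas and spaces); the function name `acc_key1_from_key2` is hypothetical, and other punctuation in acc.txt may need its own rule.

```python
import re

def acc_key1_from_key2(key2):
    """Drop spaces and commas from an ACC key2 to get a key1 candidate (assumed rule)."""
    return re.sub(r"[ ,]", "", key2)
```

For instance, `acc_key1_from_key2("aKaRqAnanda muni,")` gives `aKaRqAnandamuni`, matching the example above.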

funderburkjim · 2017-05-06T00:22:04Z

Headword coding in other dictionaries

We want to devise a scheme that is applicable to all the dictionaries. It will be helpful to see the
current headword forms in the xxx.txt digitizations of those other dictionaries.
None of these includes the L-number.

pw.txt and pwg.txt

I think the scheme is the same for these two dictionaries.

<H1>000{a}1{a}^1¦
<H1>000{a}1{a°}^2¦
<H1>100{nijaGni}1{nijaGni/}¦

FORM:  <H1>XXX{KEY1}1{KEY2}^Hom¦
 XXX is a word category number; details unknown
 KEY1 is in slp1
 KEY2 is in slp1, but with accents and other markings
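
For illustration, a sketch of a parser for this form. The pattern is inferred from the three examples only: I assume XXX is always three digits, that the `1` between the keys is a literal, and that the keys contain no `}`. The names `PW_HEAD` and `parse_pw_head` are hypothetical.

```python
import re

# Assumed pattern for the form  <H1>XXX{KEY1}1{KEY2}^Hom¦
PW_HEAD = re.compile(
    r"^<H1>(?P<cat>\d{3})"        # XXX: word category number (assumed three digits)
    r"\{(?P<key1>[^}]*)\}1"       # KEY1, then the literal 1 seen in the examples
    r"\{(?P<key2>[^}]*)\}"        # KEY2 (SLP1 with accents and other markings)
    r"(?:\^(?P<hom>[^¦]+))?¦"     # optional homonym number, then the broken bar
)

def parse_pw_head(line):
    """Return a dict with cat, key1, key2, hom, or None if the line does not match."""
    m = PW_HEAD.match(line)
    return m.groupdict() if m else None
```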

funderburkjim · 2017-05-06T00:36:10Z

acc.txt

Examples above. The form is

<HI>{#KEY2#}¦
KEY2 is in SLP1

ap.txt

.{#anvavasargaH#}¦
.{#udan#}^1¦ 
.{#eka#}a¦

.{#KEY2#}HOM¦
 KEY2 is in SLP1

ap90.txt

<P>.{#a#}
<P>.{#aN(a)#}¦
<P>.{#udan#}^1¦

<P>.{#KEY2#}HOM¦
KEY2 is in SLP1

pd.txt

<P>.{#a#}^1¦

<P>.{#KEY2#}HOM¦
KEY2 is in SLP1
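
The acc, ap, ap90 and pd forms above differ only in an optional `<HI>` or `<P>` prefix and an optional leading dot, so one pattern might cover all four. This is a sketch under that assumption; `KEY2_HEAD` and `parse_key2_head` are hypothetical names, and the homonym is taken to be whatever sits between `#}` and `¦`, with a leading `^` stripped.

```python
import re

# Assumed common pattern for  <HI>{#KEY2#}¦ ,  .{#KEY2#}HOM¦  and  <P>.{#KEY2#}HOM¦
KEY2_HEAD = re.compile(
    r"^(?:<HI>|<P>)?\.?"          # optional <HI> or <P> prefix, optional dot
    r"\{#(?P<key2>[^#]*)#\}"      # KEY2 (SLP1)
    r"(?P<hom>[^¦]*)¦"            # homonym marker (possibly empty), then the broken bar
)

def parse_key2_head(line):
    """Return (key2, hom) or None; hom is '' when no homonym marker is present."""
    m = KEY2_HEAD.match(line)
    if m is None:
        return None
    return m.group("key2"), m.group("hom").lstrip("^")
```

A line without the broken bar, such as `<P>.{#a#}`, is not recognised by this pattern.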

drdhaval2785 · 2017-05-06T01:41:11Z

One observation regarding ACC. ACC has three volumes. It usually does not have homonyms, but the same headword often appears in different volumes. Should we treat these as separate headwords or as homonyms of the same headword? For example, see 'anuBUtiprakASa'.

gasyoun · 2017-05-06T21:11:54Z

It will be helpful to see the
current headword forms in the xxx.txt digitizations of those other dictionaries.

Agree.

Should we treat them as
separate headwords or homonyms of the same headword?

We usually treat such entries as separate headwords. If we treat them as pseudo-homonyms, what do we do with real homonyms in other dictionaries, @drdhaval2785? After all, we want to achieve uniformity.

funderburkjim · 2017-05-06T21:20:05Z

A summary of the headword forms for all the dictionaries is in
headwordforms.md

This summary is drawn from an examination of

- xxx.txt: the digitization for a dictionary whose Cologne abbreviation is xxx
- headword.py for xxx: the program module that contains a regular expression to parse the headwords from xxx.txt
- occasionally, xxxhw0.txt and xxxhw2.txt, currently produced as extractions of
  - key2 (xxxhw0.txt)
  - key1 (xxxhw2.txt)

xxxhw0 and xxxhw2 also contain page numbers and line-ranges corresponding to entries.

funderburkjim · 2017-05-06T21:47:54Z

Here is one idea for how to modify the xxx.txt files to achieve a uniformity of coding that also preserves other aspects of the dictionaries; this idea is intended to be a conservative enhancement of xxx.txt formats.

For each entry in current xxx.txt, generate two lines in the revised xxx.txt:

old first line of entry:
1. headpart ¦ rest-of-line

new first two lines of entry:
1a. meta-data drawn from headpart and also including page-col and L
1b. headpart-derived-text AND rest-of-line

The meta-data line for entry to include:

L (as in <L> element of xxx.xml)
page (the page-col reference as in <pc> element of xxx.xml)
key1 (the slp1 form of the headword, as in <key1> element of xxx.xml)
key2 (SLP1 or IAST form of raw headword, as in <key2> element of xxx.xml)
homonym identifier where present (as in <hom> element of xxx.xml)

The exact format of the meta-data line should be easy to parse. Maybe a pseudo xml form:
<L>L<pc>page-col<k1>key1<k2>key2<h>hom
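
To show how easy such a line would be to write and read back, here is a minimal sketch. The function names and the sample page-col value are hypothetical; I assume no field value contains `<`, and that `<h>` is simply left out when there is no homonym.

```python
import re

def format_meta(L, pc, key1, key2, hom=None):
    """Build a meta-line of the form <L>L<pc>page-col<k1>key1<k2>key2<h>hom."""
    line = f"<L>{L}<pc>{pc}<k1>{key1}<k2>{key2}"
    if hom:
        line += f"<h>{hom}"   # assumption: omit <h> when there is no homonym
    return line

def parse_meta(line):
    """Return a dict of the tagged fields found in a meta-line."""
    return dict(re.findall(r"<(L|pc|k1|k2|h)>([^<]*)", line))
```

For example, `format_meta("3", "001-a", "aMSa", "aṃśa")` returns `<L>3<pc>001-a<k1>aMSa<k2>aṃśa`, and `parse_meta` of that string returns the four fields as a dict.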

This transformation of the xxx.txt file would be a one-time change.
After that, the xxxhw0.txt and xxxhw2.txt files would be derived just by parsing the meta-lines;
in particular, the values of L, key1, key2, and page would not be computed dynamically, but rather
would just be read from the meta line.

The line-range field of xxxhw0 and xxxhw2 provides the information that a particular entry
is found say in lines 1234 through 1239 of xxx.txt. It still would be computed dynamically.
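
A sketch of that derivation: L, page, key1 and key2 are read straight from each meta-line, and only the line range is computed. I assume here that an entry runs from its meta-line up to the line before the next meta-line; the record layout is hypothetical and is not the actual xxxhw0/xxxhw2 format.

```python
import re

# Assumed meta-line prefix; any <h> or later fields are ignored here.
META = re.compile(
    r"^<L>(?P<L>[^<]*)<pc>(?P<pc>[^<]*)<k1>(?P<k1>[^<]*)<k2>(?P<k2>[^<]*)"
)

def headword_records(lines):
    """Return a list of (L, pc, k1, k2, first_line, last_line), line numbers 1-based."""
    records = []
    current = None
    for n, line in enumerate(lines, start=1):
        m = META.match(line)
        if m:
            if current is not None:
                records.append(current + (n - 1,))   # close the previous entry
            current = (m.group("L"), m.group("pc"), m.group("k1"), m.group("k2"), n)
    if current is not None:
        records.append(current + (len(lines),))       # last entry runs to end of file
    return records
```

Lines before the first meta-line are ignored in this sketch.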

The construction of xxx.xml could also be done directly from xxx.txt - it would not have to reinterpret
the original headword forms, but would simply read these forms off of the meta-lines of xxx.txt.

Of course, the meta-lines of xxx.txt could be updated by the usual updateByLine process.
For instance, if it were found that a headword was wrongly spelled, the appropriate lines 1a and 1b
would be corrected.

If it were found that a record was wrongly marked as a headword, then that 1a meta line would
be replaced by an empty line.

If it were found that a headword was incorrectly not identified as a headword, then lines 1a and 1b WITH A DECIMAL L-NUMBER could be introduced. This would be potentially awkward (no program has yet been written for such line insertions in xxx.txt), but conceptually simple.
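
The two structural corrections described above (blanking a wrong meta-line, inserting a missed entry) might look like this in outline. This is not the updateByLine program, just a sketch; the function names are hypothetical, and using `Decimal` as the sort key is my own assumption about how decimal L-numbers would be ordered.

```python
from decimal import Decimal

def blank_meta(lines, index):
    """Return a copy of lines with the meta-line at 0-based index replaced by an empty line."""
    out = list(lines)
    out[index] = ""
    return out

def insert_entry(lines, index, meta, body):
    """Return a copy of lines with a new meta-line and body line inserted before 0-based index."""
    return lines[:index] + [meta, body] + lines[index:]

def l_sort_key(l_number):
    """Order L-numbers numerically, so that 1 < 1.1 < 2."""
    return Decimal(l_number)
```

Because the new entry gets a decimal L such as 1.1, the L-numbers of all later entries stay as they are. One caveat of this sort key: it treats 1.10 as equal to 1.1, so a real scheme would need a rule for more than nine insertions after the same entry.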

funderburkjim · 2017-05-06T21:57:13Z

meta-line and headword alternates

The <L>L<pc>page-col<k1>key1<k2>key2<h>hom meta-line form could be enhanced to
represent the parsed alternate headwords.

For example if there were three headwords (key1,X,Y) implied by key2, the meta-line could be coded
as
<L>L<pc>page-col<k1>key1<k2>key2<h>hom<k1a>X,LX<k1a>Y,LY , where

X and Y would be slp1 alternates of key1
LX and LY would be decimal point-variants of L.
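
A possible reader for this enhanced form, as a sketch only. It assumes repeated `<k1a>` tags, each holding `key,L` with a single comma, and that no value contains `<`; `parse_meta_with_alternates` is a hypothetical name.

```python
import re

def parse_meta_with_alternates(line):
    """Parse a meta-line; repeated <k1a>X,LX fields are collected as (key, L) pairs."""
    fields = {"k1a": []}
    for tag, value in re.findall(r"<(L|pc|k1a|k1|k2|h)>([^<]*)", line):
        if tag == "k1a":
            key, _, l_num = value.partition(",")   # split "X,LX" at the comma
            fields["k1a"].append((key, l_num))
        else:
            fields[tag] = value
    return fields
```

With a made-up line such as `<L>5<pc>001-a<k1>key<k2>key2<h>1<k1a>X,5.1<k1a>Y,5.2`, the `k1a` field comes back as the pairs `("X", "5.1")` and `("Y", "5.2")`.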

funderburkjim · 2017-05-08T03:20:26Z

Since I'll be working on all the dictionaries for AS-IAST conversion, it would be convenient to make the suggested changes to the xxx.txt digitizations at the same time.

It's a big change, in that several programs are involved, and subsequent work would need to take the
new form into account.

I hope others will give thought to this proposal now. Your comments will help me decide whether now is the time to proceed with this. Obviously, I am leaning towards going ahead with it.

drdhaval2785 · 2017-05-08T03:53:24Z

AS-IAST conversion and uniform headwords are both of utmost importance. The third most important item is a uniform DTD. Once these three are achieved, downstream programs can be more or less uniform rather than dictionary-specific. So it makes sense to complete these three tasks ASAP.

drdhaval2785 · 2017-05-09T01:46:26Z

1b. headpart-derived-text AND rest-of-line

headpart-derived-text

This part also needs to be pinned down. I am stuck on what to keep in headpart-derived-text. Currently I am keeping the whole head part as it is, but I feel this is not what we want. We want it to be uniform across all dictionaries, so deciding what we keep here, and with what tagging, is important for uniformity.

drdhaval2785 · 2017-05-09T01:47:52Z

@funderburkjim, I am trying to create a script (by modifying your make_xml.py) to make acc.txt uniform on the dev server. I will post the results and scripts once I am through. I am also unable to push to the newly installed acc. Maybe you need to run that command for acc too.

funderburkjim · 2017-05-09T03:18:00Z

Maybe you need to run ...

Drat! I forgot to do that there as well when I remade the slimmed-down acc.

P.S. I have now done this, but you could also have done it yourself: just log in to dev, cd to acc, and issue that git config statement.

funderburkjim · 2017-05-09T03:27:39Z

headpart-derived-text

Right, this is a tricky part. The way I am thinking of it is that the 'meta' line is what will be uniformly
formatted among dictionaries. The 'headword-derived-text' part will vary among dictionaries. The
goal will be to consult (a) the old headword-part of the digitization and (b) the printed form of the beginning of entries for a dictionary, and try to use (a) to approximate (b).

Your approach sounds reasonable (based on make_xml.py). Will be interested to see what you come up with.

drdhaval2785 · 2017-05-09T16:42:01Z

Your approach sounds reasonable (based on make_xml.py). Will be interested to see what you come up with.

https://gist.github.com/drdhaval2785/b2e02e4b72bdbc718bc7d7621bd35ec8 is the code generated from your make_xml.py. It is written for ACC. But with minor modifications, it should be OK to work with other dictionaries too.

'headword-derived-text'

I will survey the existing patterns you listed in headwordforms.md
and try to devise a generic tag.

drdhaval2785 · 2017-05-15T02:44:08Z

Headpartderived

{#key2#}hom[extraInfoField]¦

The corresponding change in the Head (meta-line) is

<L>L<pc>page-col<k1>key1<k2>key2<h>hom<k1a>X,LX:Y,LY<e>extraInfo.

Here <h>, <k1a>, <e> are optional.

Any suggestion @funderburkjim?

<e> proved useful when converting back the different regexes used to identify headwords in a specific dictionary. E.g., verbs have different markup in MD. It was stored as a separate entity, extraInfo, and used in the reverse journey.

Also note that I have suggested a single <k1a> tag for alternate headwords, with colon-separated entries. This should be fine, and it will marginally improve parsing: no ambiguous tags.
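
A sketch of how the `{#key2#}hom[extraInfoField]¦` head part and the colon-separated `<k1a>` value could be read, for illustration. This is not the code in the gist; `HEADPART` and `split_k1a` are hypothetical names, and I assume the alternates and their L-numbers contain no colon or comma and that extraInfo contains no `]`.

```python
import re

# Assumed pattern for the head part  {#key2#}hom[extraInfoField]¦
HEADPART = re.compile(
    r"^\{#(?P<key2>[^#]*)#\}"         # key2
    r"(?P<hom>[^\[¦]*)"               # homonym marker, possibly empty
    r"(?:\[(?P<extra>[^\]]*)\])?¦"    # optional [extraInfo], then the broken bar
)

def split_k1a(value):
    """Split 'X,LX:Y,LY' into [('X', 'LX'), ('Y', 'LY')]."""
    pairs = []
    for item in value.split(":"):
        if item:
            key, _, l_num = item.partition(",")
            pairs.append((key, l_num))
    return pairs
```

With a made-up head part `{#gam#}1[verb]¦`, the match gives key2 `gam`, hom `1` and extra `verb`; when the bracketed part is absent, `extra` is `None`.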

gasyoun · 2017-05-15T05:07:33Z

verbs have different markup in MD. It was stored as a separate entity, extraInfo, and used in the reverse journey.

Well done.

drdhaval2785 · 2020-12-17T09:52:43Z

The meta-IAST conversion is now done. Closing.

funderburkjim mentioned this issue May 6, 2017

acc.xml issues #115

Closed

drdhaval2785 mentioned this issue May 10, 2017

Specific issues for converting acc.txt to have identical headword line #133

Closed

drdhaval2785 closed this as completed Dec 17, 2020

gasyoun added the clean-code Code cleanup and modernisation label Dec 17, 2020
