Uniform headword scheme for xxx.txt #130

Closed
funderburkjim opened this issue May 6, 2017 · 19 comments
Labels
clean-code Code cleanup and modernisation


@funderburkjim
Contributor

We need to devise a uniform scheme for headword coding in the xxx.txt digitizations.

When I started with these digitizations several years ago, it was difficult even to know how to identify headwords, because each digitization was different. This led to a system in which headword identification was completely dynamic. In particular, the sequence number (what we refer to as the L-number) was determined dynamically, so if a correction removed or added a headword, all subsequent L-numbers for that dictionary would change.

Now that headword identification for the various dictionaries is much better understood, it is time to impose some uniformity on the headword coding of the xxx.txt digitizations, so that at least in this fundamental aspect the dictionaries are as similar as reasonably possible.

The details of this uniform headword coding are the subject of this issue.

@funderburkjim
Contributor Author

funderburkjim commented May 6, 2017

Recently, a step in the direction of this goal was made with the Schmidt dictionary documentation.

Current sch.txt headword scheme

Some examples.

.{#a#}1{#a°#}¦ {!2!}  , {%asvaptum%} Tāṇḍya-Br. 10 , 4 , 4. {part=,seq=1,type=,n=1}
.{#a#}1.1{#a#}¦ {!4!}  m. º= {%sarvajño 'rhan%} , S I , 53 , 3. — Viṣṇu , H 31 , 9; Vās. 113 , 1. {part=,seq=2,type=*,n=2}
.{#aMSa#}2{#aṃśa#}¦  1. {%kenāṃśena%} so v.a. in welchem Stücke? Daśak. 51 , 7. — 8. Nenner eines Bruches. {part=,seq=3,type=,n=3}
.{#aMSaka#}3{#aṃśaka#}¦  m. ºSohn , H 43 , 41. {part=,seq=4,type=,n=1}

The form is
.{#key1#}L-number{#key2#}¦ 
or, if there is a homonym number:
.{#key1#}L-number{#key2#}^HOM¦ 

key1 is in SLP1
key2 is in IAST
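
As a rough illustration of the form above, here is a minimal Python sketch that splits a sch.txt headword line into its parts. The regular expression and the name `parse_sch_head` are assumptions made for this example, not the pattern used in the existing headword.py; the decimal L-number case (`1.1`) is taken from the second example line.

```python
import re

# Hypothetical pattern for the sch.txt head described above:
#   .{#key1#}L-number{#key2#}¦    or    .{#key1#}L-number{#key2#}^HOM¦
SCH_HEAD = re.compile(
    r"^\.\{#(?P<key1>[^#]*)#\}"   # key1 (SLP1)
    r"(?P<L>\d+(?:\.\d+)?)"       # L-number, possibly decimal such as 1.1
    r"\{#(?P<key2>[^#]*)#\}"      # key2 (IAST)
    r"(?:\^(?P<hom>[^¦]+))?"      # optional homonym number
    r"¦(?P<rest>.*)$"             # remainder of the entry's first line
)

def parse_sch_head(line):
    """Return a dict of head fields, or None if the line is not a headword line."""
    m = SCH_HEAD.match(line)
    return m.groupdict() if m else None
```

Lines that do not start an entry simply return `None`, so the same function can be used to scan a whole file.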

@funderburkjim
Contributor Author

example of adaptation of sch model to acc dictionary

Adapting this to ACC would involve changing all the lines to a similar form:

old:
<HI>{#aMSadaSA#}¦ jy. Rice 28.
new:
.{#aMSadaSA#}1{#aMSadaSA#}¦ jy. Rice 28.
<HI>{#aMSuDara#}¦ poet Skm.

old:
<HI>{#akzoByatIrTa,#}¦ formerly Govindaśāstrin, successor of Mā-
new: NOTE REMOVAL OF comma in key1
.{#akzoByatIrTa#}L-num{#akzoByatIrTa,#}¦ formerly Govindaśāstrin, successor of Mā-
<HI>{#akzoByatIrTa,#}¦ formerly Govindaśāstrin, successor of Mā-

old:
<HI>{#aKaRqAnanda muni,#}¦ disciple of Akhaṇḍānubhūti:
new: NOTE REMOVAL OF space and comma in key1
.{#aKaRqAnandamuni#}L-num{#aKaRqAnanda muni,#}¦ disciple of Akhaṇḍānubhūti:
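
The ACC examples above could be produced mechanically along the following lines. This is only a sketch: the function name `acc_to_sch_form` is hypothetical, and the rule "key1 is key2 with spaces and commas removed" is inferred from the two examples, so other punctuation in ACC headwords may need extra handling.

```python
import re

# Old ACC head: <HI>{#key2#}¦ rest-of-line
ACC_OLD = re.compile(r"^<HI>\{#(?P<key2>[^#]*)#\}¦(?P<rest>.*)$")

def acc_to_sch_form(line, lnum):
    """Rewrite an old ACC headword line in the sch-style form.

    lnum is the L-number to embed; lines that are not headword
    lines are returned unchanged.
    """
    m = ACC_OLD.match(line)
    if m is None:
        return line
    key2 = m.group("key2")
    # Assumption drawn from the examples: key1 drops spaces and commas.
    key1 = re.sub(r"[ ,]", "", key2)
    return ".{#%s#}%s{#%s#}¦%s" % (key1, lnum, key2, m.group("rest"))
```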

@funderburkjim
Contributor Author

Headword coding in other dictionaries

We want to devise a scheme that is applicable to all the dictionaries. It will be helpful to see the
current headword forms in the xxx.txt digitizations of those other dictionaries.
None of these includes the L-number.

pw.txt and pwg.txt

I think the scheme is the same for these two dictionaries.

<H1>000{a}1{a}^1¦
<H1>000{a}1{a°}^2¦
<H1>100{nijaGni}1{nijaGni/}¦

FORM:  <H1>XXX{KEY1}1{KEY2}^Hom¦
 XXX is a word category number; details unknown
 KEY1 is in slp1
 KEY2 is in slp1, but with accents and other markings
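
For comparison, the pw/pwg form can be picked apart in the same way. The pattern below is an assumption based only on the three example lines; the digit between the two keys appears as `1` in every example, so it is captured generically without attaching a meaning to it.

```python
import re

# Hypothetical pattern for: <H1>XXX{KEY1}1{KEY2}^Hom¦
PW_HEAD = re.compile(
    r"^<H1>(?P<cat>\d{3})"        # XXX: word category number (details unknown)
    r"\{(?P<key1>[^}]*)\}"        # KEY1 (SLP1)
    r"(?P<num>\d+)"               # digit between the keys ("1" in the examples)
    r"\{(?P<key2>[^}]*)\}"        # KEY2 (SLP1 with accents and other markings)
    r"(?:\^(?P<hom>[^¦]+))?¦"     # optional homonym number
)

def parse_pw_head(line):
    """Return the head fields of a pw/pwg headword line, or None."""
    m = PW_HEAD.match(line)
    return m.groupdict() if m else None
```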

@funderburkjim
Contributor Author

funderburkjim commented May 6, 2017

acc.txt

Examples above. The form is

<HI>{#KEY2#}¦
KEY2 SLP1

ap.txt

.{#anvavasargaH#}¦
.{#udan#}^1¦ 
.{#eka#}a¦

.{#KEY2#}HOM¦
 KEY2 is in SLP1

ap90.txt

<P>.{#a#}
<P>.{#aN(a)#}¦
<P>.{#udan#}^1¦

<P>.{#KEY2#}HOM¦
KEY2 SLP1

pd.txt

<P>.{#a#}^1¦

<P>.{#KEY2#}HOM¦
KEY2 SLP1

@drdhaval2785
Contributor

drdhaval2785 commented May 6, 2017 via email

@gasyoun
Member

gasyoun commented May 6, 2017

It will be helpful to see the
current headword forms in the xxx.txt digitizations of those other dictionaries.

Agree.

Should we treat them as
separate headwords or homonyms of the same headword?

We usually treat such entries as separate headwords. If we treat them as pseudo-homonyms, what do we do with real homonyms in other dictionaries, @drdhaval2785? We want to achieve uniformity.

@funderburkjim
Contributor Author

A summary of the headword forms for all the dictionaries is in
headwordforms.md

This summary is drawn from an examination of

  • xxx.txt the digitization for a dictionary whose Cologne abbreviation is xxx
  • headword.py for xxx. The program module that contains a regular expression to parse the headwords from xxx.txt
  • occasionally, xxxhw0.txt and xxxhw2.txt - currently produced as extractions of
    • key2 (xxxhw0.txt)
    • key1 (xxxhw2.txt)

xxxhw0 and xxxhw2 also contain page numbers and line-ranges corresponding to entries.

@funderburkjim
Contributor Author

Here is one idea for how to modify the xxx.txt files to achieve a uniformity of coding that also preserves other aspects of the dictionaries; this idea is intended to be a conservative enhancement of xxx.txt formats.

For each entry in current xxx.txt, generate two lines in the revised xxx.txt:

old first line of entry:
1. headpart ¦ rest-of-line

new first two lines of entry:
1a. meta-data drawn from headpart and also including page-col and L
1b. headpart-derived-text AND rest-of-line

The meta-data line for entry to include:

  • L (as in <L> element of xxx.xml)
  • page (the page-col reference as in <pc> element of xxx.xml)
  • key1 (the slp1 form of the headword, as in <key1> element of xxx.xml)
  • key2 (SLP1 or IAST form of raw headword, as in <key2> element of xxx.xml)
  • homonym identifier where present (as in <hom> element of xxx.xml)

The exact format of the meta-data line should be easy to parse. Maybe a pseudo xml form:
<L>L<pc>page-col<k1>key1<k2>key2<h>hom
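
To show how easy such a pseudo-xml line would be to handle, here is a minimal sketch that writes and reads it. The function names are hypothetical, the page-col value `1-001` used in the usage line is made up for illustration, and the sketch assumes that no field value contains a `<` character.

```python
import re

def make_meta(L, pc, key1, key2, hom=None):
    """Build a meta-line; the <h> field is written only when a homonym is present."""
    line = "<L>%s<pc>%s<k1>%s<k2>%s" % (L, pc, key1, key2)
    if hom:
        line += "<h>%s" % hom
    return line

# Each field is a tag followed by everything up to the next "<".
META_FIELD = re.compile(r"<(L|pc|k1|k2|h)>([^<]*)")

def parse_meta(line):
    """Return the fields of a meta-line as a dict keyed by tag name."""
    return dict(META_FIELD.findall(line))
```

For example, `make_meta("3", "1-001", "aMSaka", "aṃśaka")` gives `<L>3<pc>1-001<k1>aMSaka<k2>aṃśaka`, and `parse_meta` on that string returns the four fields again.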

This transformation of the xxx.txt file would be a one-time change.
After that, the xxxhw0.txt and xxxhw2.txt files would be derived just by parsing the meta-lines;
in particular, the values of L, key1, key2, and page would not be computed dynamically, but rather
would just be read from the meta line.

The line-range field of xxxhw0 and xxxhw2 records that a particular entry is found, say, in lines 1234 through 1239 of xxx.txt. It would still be computed dynamically.
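
A sketch of how the line ranges might still be computed dynamically once meta-lines exist. It assumes that every entry starts with a line beginning with `<L>` and runs up to the line before the next meta-line (or to the end of the file); whether trailing blank lines should count as part of an entry is left open here.

```python
def entries_with_line_ranges(lines):
    """Return (meta_line, first_line, last_line) for each entry.

    Line numbers are 1-based, matching the way line ranges are
    described for xxxhw0.txt and xxxhw2.txt.
    """
    starts = [i for i, ln in enumerate(lines, start=1) if ln.startswith("<L>")]
    result = []
    for idx, start in enumerate(starts):
        # The entry ends just before the next meta-line, or at end of file.
        end = starts[idx + 1] - 1 if idx + 1 < len(starts) else len(lines)
        result.append((lines[start - 1], start, end))
    return result
```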

The construction of xxx.xml could also be done directly from xxx.txt - it would not have to reinterpret
the original headword forms, but would simply read these forms off of the meta-lines of xxx.txt.

Of course, the meta-lines of xxx.txt could be updated by the usual updateByLine process.
For instance, if it were found that a headword was wrongly spelled, the appropriate lines 1a and 1b
would be corrected.

If it were found that a record was wrongly marked as a headword, then that 1a meta line would
be replaced by an empty line.

If it were found that a headword was incorrectly not identified as a headword, then lines 1a and 1b WITH A DECIMAL L-NUMBER could be introduced. This would be potentially awkward (a program has not been written for such line insertions in xxx.txt), but conceptually simple.

@funderburkjim
Contributor Author

meta-line and headword alternates

The <L>L<pc>page-col<k1>key1<k2>key2<h>hom meta-line form could be enhanced to
represent the parsed alternate headwords.

For example if there were three headwords (key1,X,Y) implied by key2, the meta-line could be coded
as
<L>L<pc>page-col<k1>key1<k2>key2<h>hom<k1a>X,LX<k1a>Y,LY , where

  • X and Y would be slp1 alternates of key1
  • LX and LY would be decimal point-variants of L.
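
A small sketch of reading the alternates back out of such a meta-line. The sample values (`12.1`, `12.2`, page-col `1-002`) are invented for illustration, and the sketch assumes that an alternate headword never contains a comma or a `<`.

```python
import re

# Each alternate is written as <k1a>X,LX
K1A = re.compile(r"<k1a>([^<,]*),([^<]*)")

def parse_alternates(meta_line):
    """Return a list of (alternate_key1, decimal_L) pairs from a meta-line."""
    return [(key.strip(), lnum.strip()) for key, lnum in K1A.findall(meta_line)]
```

With the meta-line `<L>12<pc>1-002<k1>key1<k2>key2<h>1<k1a>X,12.1<k1a>Y,12.2`, this returns the pairs `("X", "12.1")` and `("Y", "12.2")`.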

@funderburkjim
Contributor Author

Since I'll be working on all the dictionaries for AS-IAST conversion, it would be convenient to make the suggested changes to the xxx.txt digitizations at the same time.

It's a big change, in that several programs are involved, and subsequent work would need to take the
new form into account.

I hope others will give thought to this proposal now. Your comments will help me decide whether now is the time to proceed with this. Obviously, I am leaning towards going ahead with it.

@drdhaval2785
Contributor

drdhaval2785 commented May 8, 2017 via email

@drdhaval2785
Contributor

1b. headpart-derived-text AND rest-of-line

headpart-derived-text

This phrase also needs to be pinned down. I am stuck on what to keep in headpart-derived-text. Currently I am keeping the whole head part as it is, but I feel this is not what we want. We want it to be uniform across all dictionaries, so deciding what we keep here, and with what tagging, is important for uniformity.

@drdhaval2785
Contributor

drdhaval2785 commented May 9, 2017 via email

@funderburkjim
Contributor Author

funderburkjim commented May 9, 2017

Maybe you need to run ...

Drat! I forgot to do that there as well when I remade the slimmed-down acc.

P.S. I did this, but you could also have done it yourself! Just log in to dev, cd to acc, and issue that git config statement.

@funderburkjim
Contributor Author

headpart-derived-text

Right, this is a tricky part. The way I am thinking of it is that the 'meta' line is what will be uniformly
formatted among dictionaries. The 'headword-derived-text' part will vary among dictionaries. The
goal will be to consult (a) the old headword-part of the digitization and (b) the printed form of the beginning of entries for a dictionary, and try to use (a) to approximate (b).

Your approach sounds reasonable (based on make_xml.py). Will be interested to see what you come up with.

@drdhaval2785
Contributor

Your approach sounds reasonable (based on make_xml.py). Will be interested to see what you come up with.

https://gist.github.com/drdhaval2785/b2e02e4b72bdbc718bc7d7621bd35ec8 is the code generated from your make_xml.py. It is written for ACC. But with minor modifications, it should be OK to work with other dictionaries too.

'headword-derived-text'

I will survey the existing patterns you list in headwordforms.md and try to devise a generic tag.

@drdhaval2785
Contributor

Headpartderived

{#key2#}hom[extraInfoField]¦

The corresponding change in the head (meta) line is

<L>L<pc>page-col<k1>key1<k2>key2<h>hom<k1a>X,LX:Y,LY<e>extraInfo.

Here <h>, <k1a>, <e> are optional.

Any suggestion @funderburkjim?

<e> proved useful when converting back the different regexes used to identify headwords in a specific dictionary. E.g., verbs have different markup in MD. It was stored as a separate entity, extraInfo, and used in the reverse journey.

Also note that I have suggested a single <k1a> tag for alternate headwords, with colon-separated entries. This should be fine, and it will marginally improve parsing: no ambiguous tags.
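
A rough sketch of parsing this variant, with its single colon-separated <k1a> field and optional <h>, <k1a> and <e>. The function name and sample values are hypothetical, and the sketch assumes that field values contain no `<` and that alternates contain neither `:` nor `,`.

```python
import re

# Tags of the proposed meta-line; <h>, <k1a> and <e> are optional.
FIELD = re.compile(r"<(L|pc|k1|k2|h|k1a|e)>([^<]*)")

def parse_meta_v2(line):
    """Parse the meta-line variant with one colon-separated <k1a> field.

    The result always has a "k1a" key holding a list of
    (alternate_key1, decimal_L) pairs, empty when there are none.
    """
    rec = dict(FIELD.findall(line))
    alts = rec.get("k1a")
    rec["k1a"] = [tuple(item.split(",", 1)) for item in alts.split(":")] if alts else []
    return rec
```

Since each tag occurs at most once, the whole line maps directly onto a dict, which is the parsing simplification mentioned above.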

@gasyoun
Member

gasyoun commented May 15, 2017

verbs have different markup in MD. It was stored as a separate entity, extraInfo, and used in the reverse journey.

Well done.

@drdhaval2785
Contributor

Now meta-IAST conversion is already done. Closing.

@gasyoun gasyoun added the clean-code Code cleanup and modernisation label Dec 17, 2020