alternate headwords for acc #30

Open
funderburkjim opened this issue May 17, 2017 · 14 comments

@funderburkjim
Owner

We have previously added alternate headwords for dictionaries AP90 and SKD.

We now want to add alternate headwords for ACC and other dictionaries.

acc.txt has been enhanced to contain a meta line with L-codes. This will make the process
different from that used for AP90 and SKD.

@funderburkjim
Owner Author

funderburkjim commented May 17, 2017

parseheadline.py

This is a nice routine Dhaval developed to parse strings of the form <key1>val1...<keyn>valn.

It returns a dictionary d, with d['key1']=val1, ..., d['keyn']=valn.

The routine is quite general.

It is used to parse the meta line of acc.txt.

I think it can also be used below to parse acc_hwextra.txt.
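
As an illustration of the kind of parsing described, here is a minimal sketch. It is not Dhaval's parseheadline.py; the function name `parse_headline` and the regex approach are assumptions made for this example, and it assumes that values never contain `<` or `>`.

```python
import re

def parse_headline(line):
    """Parse '<key1>val1...<keyn>valn' into a dict {key: value}.

    Hypothetical stand-in for parseheadline.py, for illustration only.
    """
    # re.split with a capture group keeps the tag names in the result:
    # ['', 'L', '1', 'pc', '1-001,1', ...]
    parts = re.split(r'<([^<>]+)>', line.strip())
    # parts[0] is whatever precedes the first tag (normally empty);
    # after that, tag names and values alternate.
    return dict(zip(parts[1::2], parts[2::2]))

meta = parse_headline('<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA')
print(meta['L'], meta['pc'], meta['k1'], meta['k2'])
```

The same function would accept a line with other tag names, such as `<LP>`, `<key1P>` or `<type>`, since it does not hard-code the keys.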

@funderburkjim
Owner Author

L-numbers identify entries

Our primary understanding of dictionary structure is that a dictionary is composed of a
sequence of entries, and each entry has as its primary identity a headword.

The headword in general does not completely specify an entry, because there may be multiple entries
with the same headword. This fact is the reason that it is necessary to have a separate entry
identifier (which we have called the L number). The L-number uniquely specifies an entry. The
L-numbers for a dictionary should have further properties:

  • The L-number should be a decimal number, though normally the fractional part will be 0 and need
    not be stated explicitly.
  • If L1 is the L-number for entry1 and L2 is the L-number for some other entry2, then
    the inequality L1 < L2 should occur precisely when entry1 occurs before entry2 within the digitization
    of the dictionary.
  • The L-numbers should be unchanging with respect to corrections in the dictionary.
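
The ordering property can be checked mechanically. The sketch below assumes the L-numbers have already been extracted, in file order, as strings; the helper name `check_l_numbers` is hypothetical. It uses `Decimal` rather than string or float comparison, so that '9' sorts before '10' and values like 8094.01 compare exactly.

```python
from decimal import Decimal

def check_l_numbers(l_numbers):
    """Return the (previous, current) pairs that break strict increasing order.

    l_numbers: L-numbers as strings, in the order the entries occur.
    An empty result means the ordering property holds.
    """
    values = [Decimal(s) for s in l_numbers]
    return [(str(a), str(b)) for a, b in zip(values, values[1:]) if not a < b]

print(check_l_numbers(['8094', '8094.01', '8095']))
```

This also flags duplicate L-numbers, which would violate the uniqueness requirement.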

@funderburkjim
Owner Author

alternate headwords per entry

In many dictionaries, a given entry will have two or more associated headwords.
For example, in the SKD dictionary we see kube(ve)raH, whose printed version starts out as:
(image of the printed entry)

Our interpretation is that there are two alternate spellings for the headword of this entry:
kuberaH and kuveraH.

In the digitization skd.txt there is only one entry.

We think of the 'kuberaH' spelling as the primary headword spelling for this entry.

In order to represent the fact that this entry should be accessible with the other spelling, our approach
is to represent this alternate within the skdhw2.txt file associated with the digitization:

2-144:kuberaH:71062,71072:8094
2-144:kuveraH:71062,71072:8094.01:alt

In this case, the L-number associated with the primary headword spelling is 8094.

The 'synthetic' L-number associated with the alternate headword spelling is 8094.01, and we have
qualified this with the :alt property to indicate that this 'extra' headword is of the type 'alternate'.

The pair of numbers 71062,71072 is the range of lines in the digitization skd.txt that
contains the entry. Notice:

  • this range of lines is the same for both records of skdhw2.txt --- i.e., the entry is the same for both
  • this range of lines might change over the course of life of the digitization. For instance,
    when we introduce 'meta' lines in skd.txt, this entry will occur on some other range of lines
    of skd.txt.
  • However, the L-numbers of these should not change as we revise skd.txt.

@funderburkjim
Owner Author

subheadwords: extra headwords which are not alternates of the primary headword

In many dictionaries, some entries contain sub-headwords. Think of PWG, which has the
prefixed-verb forms nestled within the entry for a non-prefixed root. Or think of STC or other dictionaries
where there are compounds of the primary headword (e.g., look up 'deva' in STC, where compounds
like 'deva-karman', 'devarzi', etc. appear as subheadwords).

We have not yet attempted to represent these subheadwords. For instance, you can't directly access
prefixed root avagam in PWG, since it appears as a subheadword (-- ava) under gam.

The scheme used above for representing alternate headwords in the xxxhw2.txt file is general enough to represent subheadwords as well, as alternate pointers to the entry under which the subheadword appears. For example, pwghw2.txt could contain:

2-0666:gam:46226,46359:21814
2-0666:avagam:46226,46359:21814.15:sub    <<< .15 is just a guess as to which subheadword this is.

This representation would be weak in that it does not specify where in the 'gam' entry the prefixed form avagam occurs; but it would at least represent that avagam is mentioned somewhere in the gam entry.

@funderburkjim
Owner Author

funderburkjim commented May 17, 2017

data sources for acchw2.txt

To implement alternate headwords for a dictionary in the manner suggested above, the xxxhw2.txt file needs to include the alternates, intermingled with the non-alternates.

For acc, the non-alternates are derived directly from the meta line of the acc.txt digitization.

acchw2.txt is to be constructed from two data sources:

  • acc.txt
    • pagecol, L, key1, key2 - read from the meta line for an entry
    • linenum1-linenum2 - determined by the line number of the meta line within the sequence of
      lines of acc.txt
  • acc_hwextra.txt -- this is a separately constructed data source. The fields for an extra headword:
    • parent field(s) -- for correctly positioning the extra headword within the final acchw2.txt L-numbers
      • L-parent = L-number of primary entry
      • key1-parent = key1 for the primary entry (this is redundant, as implied by L-parent)
    • L = the L-number of this extra hw
    • key1 = the key1 of this extra headword.
    • key2 = the key2 of this extra headword.
    • type code: the type of this extra headword
      • alt = alternate headword
      • sub (?) = sub headword

@funderburkjim
Owner Author

fields of acchw2.txt

The fields are listed below. The last one, hwtype, is an optional additional field reserved for the 'type' code of an extra headword.

  • pagecol
  • key1
  • line1,line2 the line numbers within acc.txt of the first and last lines of the entry
  • L
  • hwtype : default is 'pri' (primary). Other values 'alt' (alternate), 'sub' (subheadword)
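
A record with these fields could be read as in the sketch below. The names `HW2` and `parse_hw2` are hypothetical, and the sketch assumes that pagecol and key1 never contain a colon, so that splitting on ':' is safe. L is kept as a string so that a value like 12.1 is not altered by float conversion.

```python
from collections import namedtuple

# Hypothetical record type for one line of acchw2.txt.
HW2 = namedtuple('HW2', 'pagecol key1 line1 line2 L hwtype')

def parse_hw2(line):
    """Parse 'pagecol:key1:line1,line2:L[:hwtype]' into an HW2 record."""
    parts = line.rstrip('\n').split(':')
    if len(parts) not in (4, 5):
        raise ValueError('unexpected number of fields: %r' % line)
    pagecol, key1, linerange, L = parts[:4]
    # The type field is optional; a missing field means a primary headword.
    hwtype = parts[4] if len(parts) == 5 else 'pri'
    line1, line2 = (int(x) for x in linerange.split(','))
    return HW2(pagecol, key1, line1, line2, L, hwtype)

print(parse_hw2('1-001,1:akzacaraRa:39,42:12.1:alt'))
```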

@funderburkjim
Owner Author

example of acchw2.txt for primary entry

First few lines of acc.txt. Note: the acc.txt line numbers are shown for reference; they are not part of acc.txt.

000001 [Page1-001-a+ 36]
000002 <H>CATALOGUS CATALOGORUM. 
000003 <L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA
000004 {#aMSadaSA#}¦ jy. Rice 28.
000005 <LEND>

First primary record of acchw2.txt as derived from lines 3-5 of acc.txt:

1-001,1:aMSadaSA:3,5:1

pagecol = 1-001,1   (copied from `<pc>` field of meta line)
key1 = aMSadaSA  (copied from `<k1>` field of meta line)
line1,line2 = 3,5  (from line numbers of entry within acc.txt)
> Note:  this includes the opening meta line (line3) and the closing meta line (line5)
L = 1   (copied from `<L>` field of meta line)
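
The derivation of primary records can be sketched as follows. This is only an illustration under stated assumptions: every entry starts with a line beginning `<L>` and ends with a line beginning `<LEND>`, and the function name `primary_records` is hypothetical rather than part of any existing script.

```python
import re

def primary_records(lines):
    """Yield primary acchw2 records from the lines of acc.txt.

    lines: iterable of lines of acc.txt (line numbers are 1-based).
    """
    meta, start = None, None
    for num, line in enumerate(lines, start=1):
        if line.startswith('<L>'):
            # Opening meta line: parse '<key>val' pairs and remember the line number.
            parts = re.split(r'<([^<>]+)>', line.strip())
            meta = dict(zip(parts[1::2], parts[2::2]))
            start = num
        elif line.startswith('<LEND>') and meta is not None:
            # Closing meta line: the range includes both meta lines.
            yield '%s:%s:%d,%d:%s' % (meta['pc'], meta['k1'], start, num, meta['L'])
            meta = None

sample = [
    '[Page1-001-a+ 36]',
    '<H>CATALOGUS CATALOGORUM. ',
    '<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA',
    '{#aMSadaSA#}¦ jy. Rice 28.',
    '<LEND>',
]
print(list(primary_records(sample)))  # ['1-001,1:aMSadaSA:3,5:1']
```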

@funderburkjim
Owner Author

example of acchw2.txt for an alternate headword

Although we have not yet developed lists of alternate headwords for acc, an examination of the first
few lines of acc.txt provides an example of a likely alternate headword:

000039 <L>12<pc>1-001,1<k1>akzapAda<k2>akzapAda
000040 {#akzapAda#}¦ or {#akzacaraRa,#} a name of Gautama, the philo-
000041 <>sopher, Hall p. 20.
000042 <LEND>

First, the acchw2.txt entry for the primary headword will be:

1-001,1:akzapAda:39,42:12

The alternate headword is akzacaraRa.

To make the acchw2.txt entry for the alternate headword, we assume that the file
acc_hwextra.txt has this line (which we format similarly to the meta lines):

<LP>12<key1P>akzapAda<L>12.1<key1>akzacaraRa<key2>akzacaraRa<type>alt

The LP field value (12) is matched against the primary L-numbers, so we know the primary acchw2 record is 1-001,1:akzapAda:39,42:12 as shown above.

Thus the acchw2 record for the alternate will have fields:

  • pagecol = 1-001,1 (from parent primary record)
  • key1 = akzacaraRa (from acc_hwextra record)
  • line1,line2 = 39,42 (from parent primary record)
  • L = 12.1 (from acc_hwextra record)
  • hwtype = alt (from acc_hwextra record)

Putting these fields together gives the acchw2.txt record:

1-001,1:akzacaraRa:39,42:12.1:alt

As for the position of this alternate record within the sequence of records of acchw2.txt, it will
go in numerical order of L-number:

1-001,1:akzapAda:39,42:12
1-001,1:akzacaraRa:39,42:12.1:alt
1-001,1:akzamAlApratizWA:43,45:13
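
The steps of this example can be put together in a short sketch. The function names `parse_tags`, `extra_record` and `merge_records` are hypothetical, acc_hwextra.txt does not exist yet, and its line format is the one assumed above. Sorting uses `Decimal` so that 12.1 falls between 12 and 13 without float rounding.

```python
import re
from decimal import Decimal

def parse_tags(line):
    """Parse '<key>val<key>val...' into a dict."""
    parts = re.split(r'<([^<>]+)>', line.strip())
    return dict(zip(parts[1::2], parts[2::2]))

def extra_record(extra_line, primaries):
    """Build the acchw2 record for one acc_hwextra line.

    primaries: dict mapping L-number (string) to its primary acchw2 record.
    """
    d = parse_tags(extra_line)
    pagecol, key1_parent, linerange, _ = primaries[d['LP']].split(':')[:4]
    # key1P is redundant with LP, so use it as a consistency check.
    if key1_parent != d['key1P']:
        raise ValueError('key1P mismatch for LP=%s' % d['LP'])
    return ':'.join([pagecol, d['key1'], linerange, d['L'], d['type']])

def merge_records(primary_lines, extra_lines):
    """Return primary and extra records together, in numerical L order."""
    primaries = {rec.split(':')[3]: rec for rec in primary_lines}
    records = list(primary_lines)
    records += [extra_record(x, primaries) for x in extra_lines]
    return sorted(records, key=lambda rec: Decimal(rec.split(':')[3]))

primary = ['1-001,1:akzapAda:39,42:12', '1-001,1:akzamAlApratizWA:43,45:13']
extra = ['<LP>12<key1P>akzapAda<L>12.1<key1>akzacaraRa<key2>akzacaraRa<type>alt']
for rec in merge_records(primary, extra):
    print(rec)
```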

@drdhaval2785

1-001,1:aMSadaSA:3,5:1

Would it not make sense to exclude meta lines from the startline and endline? The data is on the 4th line only; lines 3 and 5 are metadata. If the record is kept as below, downstream programs will need only minimal changes.

1-001,1:aMSadaSA:4,4:1

@drdhaval2785

@funderburkjim, everything else is as apt as possible.

@gasyoun

gasyoun commented May 18, 2017

previously added alternate headwords for dictionaries AP90 and SKD

Hmm, none for VCP?

The L-numbers should be unchanging with respect to corrections in the dictionary.

Latest update, was not always so.

This representation would be weak in that it does not specify where in the 'gam' entry the prefixed form avagam occurs

What exactly do you mean? Just as we order by L-numbers, we can order here, and the inherent order will always remain. Or do you mean not the order, but the grammar data or a subentry for a subheadword?

Where and how do we note that there is a correction entry? For example, in PWG, 118042 and 118043 are additions / corrections to 872. I would add markup for that as well at this stage, so we do not have to return to it again. See aja. So I would not limit to

type code: the type of this extra headword
pri = primary
alt = alternate headword
sub (?) = sub headword

but add

cor = correction / addition

@funderburkjim
Owner Author

Would it not make sense to exclude meta lines from the startline and endline ?

I think it makes more sense to INCLUDE the meta lines in startline and endline.

The main programming reason is that some downstream programs will need to use the <L> meta line.
Case in point is make_xml.py. It reads each line from acchw2.txt, and
extracts lines from startline to endline (inclusive) of acc.txt. Then from these extracted lines it
creates an xml record for acc.xml. In this construction, make_xml.py definitely needs the <L> meta
line, as it uses all the fields of that line to construct various parts of the <head> and <tail> of the
xml record. Then, it uses the non-meta lines from the acc.txt entry to construct the <body> of the
xml record.

A second, lesser reason is one of conceptual simplicity: the startline-endline should point to the entire
scope of lines in acc.txt that pertain to the given 'L' number of the acchw2.txt record.

There might, as the question suggests, be some downstream programs that have no interest in the
meta line. A parsing routine might serve as an intermediary to make life easy for all downstream programs, whether or not they are interested in the meta line.

Maybe we name the function 'parsedig' (for parsedigitization)

  • Inputs:
    • hw2 object (resulting from a parse of a line of acchw2.txt)
    • acc records - a list of all lines in acc
  • Return value: an object with
    • fields derived from meta-line
    • list of lines between the starting and ending meta lines
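
A minimal sketch of what such a 'parsedig' function might look like follows. Nothing here exists yet: the `HW2` and `Entry` types are hypothetical, and the sketch assumes that line1 and line2 are 1-based, inclusive, and point at the opening `<L>` line and the closing `<LEND>` line.

```python
import re
from collections import namedtuple

# Hypothetical types: the part of an hw2 record needed here, and the result.
HW2 = namedtuple('HW2', 'line1 line2')
Entry = namedtuple('Entry', 'meta body')

def parsedig(hw2, acc_lines):
    """Return the meta fields and the body lines for one entry.

    hw2: object with line1 and line2 attributes (1-based, inclusive).
    acc_lines: list of all lines of acc.txt.
    """
    chunk = acc_lines[hw2.line1 - 1:hw2.line2]
    # First line of the chunk is the <L> meta line.
    parts = re.split(r'<([^<>]+)>', chunk[0].strip())
    meta = dict(zip(parts[1::2], parts[2::2]))
    # Body excludes the opening meta line and the closing <LEND> line.
    return Entry(meta, chunk[1:-1])

acc_lines = [
    '[Page1-001-a+ 36]',
    '<H>CATALOGUS CATALOGORUM. ',
    '<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA',
    '{#aMSadaSA#}¦ jy. Rice 28.',
    '<LEND>',
]
entry = parsedig(HW2(3, 5), acc_lines)
print(entry.meta['k1'], entry.body)
```

A program like make_xml.py could then read `entry.meta` for the head and tail, while a program with no interest in the meta line would use only `entry.body`.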

@funderburkjim
Owner Author

none for VCP?

We've done some preparatory work on identifying alternate headwords, but none of this has been installed thus far.

@gasyoun BTW: Did you ever show the VCP alternate headword UI to Radha? ref

@funderburkjim
Owner Author

also have 'cor' ?

details of the example

Current pwghw2.txt records for aja (revised to show L-num)

1-0066:aja:1833,1836:872    << mentioned in comment above
1-0066:aja:1837,1838:873
5-0956:aja:134966,134967:62799
5-0956:aja:134968,134969:62800
7-1689:aja:254891,254892:118042  <<
7-1689:aja:254893,254894:118043  <<

The suggested revision to pwghw2.txt:

1-0066:aja:1833,1836:872    << no change
7-1689:aja:254891,254892:118042:cor
7-1689:aja:254893,254894:118043:cor

The suggestion is to add the metadata 'cor' to those last two records.
To be useful, we would need a reference to the record being corrected, maybe via L-number.

1-0066:aja:1833,1836:872  
7-1689:aja:254891,254892:118042:cor,872
7-1689:aja:254893,254894:118043:cor,872
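
If the type field carried a reference in this 'cor,872' form, a reader of pwghw2.txt could split it as in the sketch below. The function name `parse_hwtype` is hypothetical, and the 'type,L-number' layout is only the tentative suggestion made above.

```python
def parse_hwtype(field):
    """Split a type field like 'cor,872' into (type, referenced L-number).

    The reference is None when the field has no comma, e.g. 'alt'.
    """
    hwtype, _, ref = field.partition(',')
    return hwtype, (ref or None)

record = '7-1689:aja:254891,254892:118042:cor,872'
print(parse_hwtype(record.split(':')[4]))
```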

Another place to put such meta information would be within the pwg.txt entries for those two records.
Current form:

254891 <H1>000{aja}1{aja}^1¦ ³1) ²d) ¯{¤SU10RJAS. 2, 45. 13, 11.¤} -- ²f) ¯{¤SA10MAVIDH. BR. 1, 1, 17; 
              vgl. R2V. ANUKR.¤}
254892 
254893 <H1>000{aja}1{aja}^2¦ ³3) ²b) = {#avidyA#} (Comm.) ¯{¤BHA10G. P. 3, 7, 5.¤}
254894 

Possible form (using pseudo xml)

254891 <H1>000{aja}1{aja}^1¦ ³1) ²d) ¯{¤SU10RJAS. 2, 45. 13, 11.¤} -- ²f) ¯{¤SA10MAVIDH. BR. 1, 1, 17; 
              vgl. R2V. ANUKR.¤} <cor n=872>
254892 
254893 <H1>000{aja}1{aja}^2¦ ³3) ²b) = {#avidyA#} (Comm.) ¯{¤BHA10G. P. 3, 7, 5.¤} <cor n=872>
254894 

Such meta information would have potential utility. Another type of meta information might be the 'fehlerhafter' type.

My intuition is that 'cor'-type meta data might be better embedded in pwg.txt than in pwghw2.txt.
