alternate headwords for acc #30
parseheadline.py
This is a nice routine Dhaval developed to parse strings of the form
It returns a dictionary d, d['key']=val1, etc. The routine is quite general. It is used to parse the meta line of acc.txt. I think it can also be used below to parse acc_hwextra.txt.
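A minimal sketch of this kind of parsing, assuming the strings are `<key>value` pairs run together. The exact form is not shown here, so the input format, the function name `parse_keyed_string`, and the sample string are illustrative assumptions, not Dhaval's actual routine.

```python
import re


def parse_keyed_string(line):
    """Parse a string of the ASSUMED form '<key1>val1<key2>val2' into a dict.

    Hypothetical stand-in for parseheadline; the real input format may differ.
    """
    # Each pair is a key in angle brackets followed by text up to the next '<'.
    pairs = re.findall(r'<([^<>]+)>([^<]*)', line)
    return {key: val for key, val in pairs}


d = parse_keyed_string('<key1>val1<key2>val2')
print(d['key1'])  # val1
```

Returning a plain dictionary keeps the routine general: the same function could in principle be pointed at a meta line or at a line of an extra-headwords file, provided both follow the assumed key/value layout.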
L-numbers identify entries
Our primary understanding of dictionary structure is that a dictionary is composed of a
The headword in general does not completely specify an entry, because there may be multiple entries
|
subheadwords: extra headwords which are not alternates of the primary headword
In many dictionaries, for some entries there are sub-headwords. Think of PWG which has the
We have not yet attempted to represent these subheadwords. For instance, you can't directly access
The above scheme used for representing alternate headwords in the xxxhw2.txt file is sufficiently general to be able to represent subheadwords as alternate pointers to the entry under which the subheadword appears. i.e., pwghw2.txt could represent:
This representation would be weak in that it does not specify where in the 'gam' entry the prefixed form |
data sources for acchw2.txt
To implement alternate headwords for a dictionary in the manner suggested above, we need the xxxhw2.txt file to include the alternates, intermingled with the non-alternates. For acc, the non-alternates are derived directly from the meta-line of the acc.txt digitization.
acchw2.txt is to be constructed from two data sources:
|
fields of acchw2.txt
There is an optional additional field, reserved for the 'type' code of an extra headword.
|
example of acchw2.txt for primary entry
First few lines of acc.txt:
Note: we show the acc.txt line numbers; they are not part of acc.txt.
First primary record of acchw2.txt as derived from lines 3-5 of acc.txt:
|
example of acchw2.txt for an alternate headword
Although we have not yet developed lists of alternate headwords for acc, an examination of the first
First, the acchw2.txt entry for the primary headword will be:
The alternate headword is
To make the acchw2.txt entry for the alternate hw, we assume that file
The LP field value (12) is matched against the primary L-numbers, so we know the primary acchw2 record is
Thus the acchw2 record for the alternate will have fields:
Putting these fields together gives the acchw2.txt record:
In terms of positioning of this alternate record within the sequence of records of acchw2, it will
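The LP-matching step in this example can be sketched as follows. This is only an illustration under stated assumptions: the records are modelled as dictionaries, the key names other than `L` and `LP` are hypothetical, the sample values are invented, and the idea that the alternate starts from a copy of its parent's fields is an assumption rather than the confirmed layout of acchw2.txt.

```python
def make_alternate_record(primaries, extra):
    """Build a record for an alternate headword from its parent record.

    primaries: list of dicts for primary records, each with an 'L' key.
    extra: dict for one alternate headword; 'LP' names the parent L-number.
    All key names other than 'L' and 'LP' are hypothetical.
    """
    # Index the primary records by L-number so LP can be matched directly.
    by_l = {rec['L']: rec for rec in primaries}
    parent = by_l.get(extra['LP'])
    if parent is None:
        raise KeyError('no primary record with L-number %s' % extra['LP'])
    alt = dict(parent)   # ASSUMPTION: the alternate starts from the parent's fields
    alt.update(extra)    # the alternate's own fields override; LP is added
    return alt


# Invented sample values, for illustration only.
primaries = [
    {'L': '11', 'headword': 'hw-a', 'startline': 30, 'endline': 32},
    {'L': '12', 'headword': 'hw-b', 'startline': 33, 'endline': 35},
]
extra = {'LP': '12', 'headword': 'hw-b-alt'}
print(make_alternate_record(primaries, extra))
```

Raising an error when no primary record matches LP makes a mistyped parent L-number visible at build time instead of silently dropping the alternate.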
|
Would it not make sense to exclude the meta lines from the startline and endline? The data is in the 4th line only; lines 3 and 5 are metadata. If this is kept as below, downstream programs will need only minimal changes.
|
@funderburkjim, all the rest is as apt as possible.
|
Hmm, none for VCP?
Latest update, was not always so.
What exactly do you mean? Just as we order by L-numbers, we can order here, and the inherent order will always remain. Or do you mean not the order, but the grammar data or a subentry for a subheadword?
Where and how do we note that there is a correction entry? For example, in PWG, 118042 and 118043 are additions / corrections to 872. I would add markup for that as well at this stage, so we do not have to return to it again. See the type code (the type of this extra headword), but add cor = correction / addition.
|
I think it makes more sense to INCLUDE the meta lines in startline and endline.
The main programming reason is that some downstream programs will need to use the
The second, lesser reason is one of conceptual simplicity: the startline-endline should point to the entire
There might, as the question suggests, be some downstream programs that have no interest in the
Maybe we name the function 'parsedig' (for parsedigitization).
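One way to serve both kinds of downstream program is sketched below. It assumes, as in the acc.txt example discussed earlier (lines 3-5, with line 4 holding the data), that the first and last lines of an entry are meta lines and that startline and endline are 1-based and inclusive. The helper names `entry_lines` and `data_lines` are hypothetical and are not the proposed 'parsedig' function itself.

```python
def entry_lines(lines, startline, endline):
    """Return the whole entry, meta lines included (1-based, inclusive)."""
    return lines[startline - 1:endline]


def data_lines(lines, startline, endline):
    """Return only the lines between the opening and closing meta lines.

    ASSUMPTION: an entry has exactly one meta line at each end.
    """
    return entry_lines(lines, startline, endline)[1:-1]


# Placeholder file contents, for illustration only.
lines = ['line 1', 'line 2', 'meta open', 'data', 'meta close', 'line 6']
print(entry_lines(lines, 3, 5))  # ['meta open', 'data', 'meta close']
print(data_lines(lines, 3, 5))   # ['data']
```

With the range covering the entire entry, a program that has no interest in the meta lines only needs to drop the first and last line, while programs that do need them require no extra lookup.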
|
details of the example
Current pwghw2.txt records for aja (revised to show L-num):
The suggested revision to pwghw2.txt:
The suggestion is to add the metadata 'cor' to those last two records.
Another place to put such meta information would be within the pwg.txt entries for those two records.
Possible form (using pseudo xml)
Such meta information would have potential utility. Another type of meta information might be the 'fehlerhafter' type. My intuition is that 'cor'-type meta data might be better embedded in pwg.txt than in pwghw2.txt.
|
We have previously added alternate headwords for dictionaries AP90 and SKD.
We are now wanting to add alternate headwords for ACC, and other dictionaries.
ACC.txt has been enhanced to contain a meta line with L-codes. This will make the process
different from that used for AP90 and SKD.