Uniform headword scheme for xxx.txt #130
Recently, a step in the direction of this goal was made with the Schmidt dictionary documentation.
Current sch.txt headword scheme
Some examples.
|
Example of adaptation of sch model to acc dictionary
Adapting this to ACC would involve changing all the lines to a similar form:
|
Headword coding in other dictionaries
We want to devise a scheme that is applicable to all the dictionaries. It will be helpful to see the
pw.txt and pwg.txt
I think the scheme is the same for these two dictionaries.
|
acc.txt
Examples above. The form is
ap.txt
ap90.txt
pd.txt
|
One observation regarding ACC.
ACC has three volumes. It usually does not have homonyms. But the same
headword does appear in different volumes often. Should we treat them as
separate headwords or homonyms of the same headword?
For example, see 'anuBUtiprakASa'.
|
Agree.
We usually treat such entries as separate headwords. If we treat them as pseudo-homonyms, what should we do with real homonyms in other dictionaries, @drdhaval2785? We want to achieve uniformity. |
A summary of the headword forms for all the dictionaries is in This summary is drawn from an examination of
xxxhw0 and xxxhw2 also contain page numbers and line-ranges corresponding to entries. |
Here is one idea for how to modify the xxx.txt files to achieve a uniformity of coding that also preserves other aspects of the dictionaries; this idea is intended to be a conservative enhancement of xxx.txt formats. For each entry in current xxx.txt, generate two lines in the revised xxx.txt:
The meta-data line for entry to include:
The exact format of the meta-data line should be easy to parse. Maybe a pseudo xml form:
This transformation of the xxx.txt file would be a one-time change. The line-range field of xxxhw0 and xxxhw2 provides the information that a particular entry
The construction of xxx.xml could also be done directly from xxx.txt - it would not have to reinterpret
Of course, the meta-lines of xxx.txt could be updated by the usual updateByLine process. If it were found that a record was wrongly marked as a headword, then that 1a meta line would
If it were found that a headword was incorrectly not identified as a headword, then a line 1a, 1b WITH A DECIMAL L-NUMBER could be introduced; this would be potentially awkward (a program has not been written for such line insertions in xxx.txt), but conceptually simple. |
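To make the "easy to parse" point concrete, here is a minimal sketch of reading and writing a pseudo-XML meta-line. The layout is an assumption for illustration only: the field names `L`, `pc`, `k1`, `k2` and the sample values are hypothetical, and the thread does not fix an exact format.

```python
import re

# Hypothetical meta-line layout: a run of <name>value fields on one line.
# The field names (L, pc, k1, k2) are assumptions, not a settled format.
META_FIELD = re.compile(r"<(\w+)>([^<]*)")


def parse_meta(line):
    """Return a dict mapping field name -> value for one meta-line."""
    return {name: value for name, value in META_FIELD.findall(line)}


def format_meta(fields):
    """Inverse of parse_meta: serialize the fields in insertion order."""
    return "".join(f"<{name}>{value}" for name, value in fields.items())


# Invented sample line; the page reference and the decimal L-number are made up.
sample = "<L>12.1<pc>1,003<k1>anuBUtiprakASa<k2>anuBUtiprakASa"
meta = parse_meta(sample)
print(meta["L"], meta["k1"])
```

Because every field is self-describing, a program such as make_xml.py would not need dictionary-specific logic to find the headword; it would only need to recognize which lines are meta-lines.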
meta-line and headword alternates
The
For example, if there were three headwords (key1, X, Y) implied by key2, the meta-line could be coded
|
Since I'll be working on all the dictionaries for AS-IAST conversion, it would be convenient to make the suggested changes to the xxx.txt digitizations at the same time. It's a big change, in that several programs are involved, and subsequent work would need to take the
I hope others will give thought to this proposal now. Your comments will help me decide whether now is the time to proceed with this. Obviously, I am leaning towards going ahead with it. |
AS-IAST conversion and uniform headwords are both of utmost importance. The third
most important item is a uniform DTD. Once these three are achieved,
downstream programs can be more or less uniform rather than dictionary
specific. So it makes sense to complete these three tasks ASAP.
|
This phrase also needs to be fixed. I am stuck on what to keep in headword-derived-text. Currently I am keeping the whole head part as it is, but I feel this is not what we want. We want it to be uniform for all dictionaries, so deciding what to keep here, and with what tagging, is important for uniformity. |
@funderburkjim,
I am trying to create one script (by modifying your make_xml.py) to make
acc.txt uniform on dev server.
I will post results and scripts once I am through.
I am also unable to push to the newly installed acc. Maybe you need to run that
command for acc too.
|
Drat! I forgot to do that there too when I remade the slimmed-down acc. P.S. I did this, but you could also have done it! Just log in to dev, cd to acc, and issue that git config statement. |
Right, this is a tricky part. The way I am thinking of it is that the 'meta' line is what will be uniformly
Your approach sounds reasonable (based on make_xml.py). Will be interested to see what you come up with. |
https://gist.github.com/drdhaval2785/b2e02e4b72bdbc718bc7d7621bd35ec8 is the code derived from your make_xml.py. It is written for ACC, but with minor modifications it should work with other dictionaries too.
I will survey the existing patterns you mention in headwordforms.md. |
Corresponding change is Head is
Here Any suggestion @funderburkjim?
Also note that I have suggested single |
Well done. |
Now meta-IAST conversion is already done. Closing. |
We need to devise a uniform scheme for headword coding in the xxx.txt digitizations.
When I started with these digitizations several years ago, it was difficult even to know how to identify
headwords in the digitizations - each digitization was different. This factor led to a system where the
headword identification was completely dynamic. In particular, the sequence number (what we refer to as the L number) was determined dynamically. This means that if some correction removed or added a headword, then all the subsequent L-numbers for that dictionary would change.
Now that the headword identification for the various dictionaries is much better known, it is time to
impose some uniformity in the headword coding of the xxx.txt digitizations; so at least in this fundamental aspect, the dictionaries are as similar as reasonably possible.
The details of this uniform headword coding are the subject of this issue.
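The renumbering problem described above can be illustrated with a small sketch. It contrasts position-based (dynamic) L-numbers with stored L-numbers, using the decimal L-number idea raised in the comments for a late insertion. The headwords `a`, `b`, `b2`, `c`, `d` are placeholders, and the function name is hypothetical; this is not the code used by the actual programs.

```python
from decimal import Decimal


def dynamic_numbers(headwords):
    """Assign L-numbers by position, as a fully dynamic scheme would."""
    return {hw: i for i, hw in enumerate(headwords, start=1)}


before = dynamic_numbers(["a", "b", "c", "d"])
# A correction discovers a missed headword "b2" between "b" and "c".
after = dynamic_numbers(["a", "b", "b2", "c", "d"])
# Every headword after the insertion point gets a different L-number.
shifted = [hw for hw in before if before[hw] != after[hw]]

# With L-numbers stored in the file, existing numbers stay fixed and the
# new entry receives a decimal L-number that sorts between its neighbors.
stored = {"a": Decimal(1), "b": Decimal(2), "c": Decimal(3), "d": Decimal(4)}
stored["b2"] = Decimal("2.1")
ordered = sorted(stored, key=stored.get)

print(shifted)
print(ordered)
```

Using `Decimal` rather than `float` keeps a value such as 2.1 exact, so comparing and re-serializing L-numbers does not introduce rounding differences.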