Specific issues for converting acc.txt to have identical headword line #133

Closed
drdhaval2785 opened this issue May 10, 2017 · 56 comments

Comments

@drdhaval2785
Contributor

#130 mandates that we create one uniform headword line.

I started with ACC conversion.
Jim gave some practical tips privately on mail.

I guess there are some items in those tips which need reproduction verbatim here for public consumption.

@drdhaval2785
Contributor Author

drdhaval2785 commented May 10, 2017

Hi, Dhaval -
There are a couple of problems:

  1. example:
old:
<L>1<pc>1-001,1-1<k1>aMSadaSA<k2>aMSadaSA<h>
new:
<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA<h>

The ',1' is indicating the first column of page 1-001 (first page of first volume).
There is no need for the '-1'

  2. Your accwithmeta.txt skips lines at the beginning and end of acc.txt;
    These are lines before the first headword and after the last headword.
    We want to have ALL the lines of acc.txt PLUS the meta-lines.
    So # of lines of accwithmeta.txt should = # of lines of acc.txt + NHW
    where NHW is # of headwords, which could be computed as # of lines in acchw0.txt or acchw2.txt

Couple of choices (These are not exactly problems - but especially note the comment on invertibility).

  1. If there is no homonym, the <h> element could be omitted from the meta-line.
    Reasons in favor of omitting if no homonym:
    • absence of <h> implies no homonym
    • remove superfluous markup
    Reason in favor of keeping even if no homonym:
    • parsing of meta line would be marginally simpler.
  2. The lines could be simplified to be more representative of the printed text. For ACC, this could be
    old:
    <HI>{#aMSadaSA#}¦ jy. Rice 28.
    new:
    {#aMSadaSA#} jy. Rice 28.
    Reasons for removing <HI>:
    • The <HI> is not needed any more to recognize entry headwords, since this information is in the meta line
    • The new make_xml will not need to have logic to represent this
    Reason for removing ¦ (or replacing it with a space):
    • Similarly, the broken bar ¦ is not needed to delineate the part of the entry from which the key2+homonym derive.
    • I am less certain regarding ¦ . If we ever wanted to reparse this first line of an entry to rederive the meta line, then
    this demarcation would be useful.
    Let me expand on this argument regarding keeping ¦.
    One principle I always have in mind when making a big change, such as this meta-line change, to a digitization is a
    principle of invertibility. In fact, I often try to write a program which reconstructs the original from the modified version.
    If such a program is written, then we know for sure that we have lost no information by our modifications - our inverting program
    proves this.

Applying to this case, this would mean having a separate program invert_meta.py which would read accwithmeta.txt and
construct acc_invert_meta.txt, with the objective of the program to be that acc_invert_meta.txt should be absolutely identical
to the original acc.txt: e.g. diff acc_invert_meta.txt acc.txt should show no difference.
It may or may not be essential to keep the ¦ for invertibility. If it is needed, we should keep ¦; if it is not needed, we should discard ¦.
BTW the 2nd problem above (dropping lines) would have been noticed by the invertibility discipline.
Jim
P.S. I haven't examined your code yet. Will do that after the final form is established
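
The two checks described in this message (the line-count rule and the invertibility test) can be sketched in Python. This is only an illustration under stated assumptions, not the project's invert_meta.py: the function names are hypothetical, a meta line is assumed to be any line starting with `<L>`, and the toy inverse only drops meta lines, so it ignores any other change a real conversion makes to the entry text.

```python
def is_meta_line(line):
    """Assumption: a meta line is any line that starts with '<L>'."""
    return line.startswith("<L>")


def line_count_ok(orig_lines, with_meta_lines):
    """Check: lines(with meta) == lines(original) + number of headwords."""
    nhw = sum(1 for line in with_meta_lines if is_meta_line(line))
    return len(with_meta_lines) == len(orig_lines) + nhw


def invert(with_meta_lines):
    """Toy inverse: drop the meta lines, keep every other line untouched."""
    return [line for line in with_meta_lines if not is_meta_line(line)]


# Tiny made-up sample: one line before the entry, one entry, one line after.
original = [
    "line before first headword",
    "{#aMSadaSA#}¦ jy. Rice 28.",
    "line after last headword",
]
with_meta = [
    "line before first headword",
    "<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA",
    "{#aMSadaSA#}¦ jy. Rice 28.",
    "line after last headword",
]

print(line_count_ok(original, with_meta))   # True
print(invert(with_meta) == original)        # True
```

If the conversion had dropped the leading line, both checks would fail, which is the point of the invertibility discipline.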

@drdhaval2785
Contributor Author

Principle of invertibility

Generation code - here
Reversal code - here

Happy to report that there is no difference after reverse journey.

@drdhaval2785
Contributor Author

<L>1<pc>1-001,1-1<k1>aMSadaSA. Remove -1.

Done.

@drdhaval2785
Contributor Author

Your accwithmeta.txt skips lines at the beginning and end of acc.txt

Done. Corrected code to keep these items. Now there is no diff.

@drdhaval2785
Contributor Author

If there is no homonym, the <h> element could be omitted from the meta-line.

I agree. Will reduce size and remove superfluous items.
Coded accordingly.

Reason in favor of keeping even if no homonym: parsing of meta line would be marginally simpler.

Careful drafting of code may parse the meta line properly. Not much hassle.

@drdhaval2785
Contributor Author

drdhaval2785 commented May 10, 2017

broken bar ¦

I feel we should keep this broken bar or some other unusual separator for sure.

I propose that we keep this broken bar, or some other unique character, in its place. We should have one uniform separator for all 36 dictionaries.

@gasyoun
Member

gasyoun commented May 10, 2017

Reason for removing ¦ (or replacing it with a space)

Makes sense.

I propose that we keep this broken bar, or some other unique character, in its place. We should have one uniform separator for all 36 dictionaries.

This ¦ makes me want to scream when I see it in the code. If all will have it I'll die. Even @ will be better.

In fact, I often try to write a program which reconstructs the original from the modified version.

This is the lesson I learned when I converted non-Unicode devanagari in Adobe InDesign. I missed this step and the price was high.

@funderburkjim
Contributor

¦ makes me want to scream

I thought I heard an unusual sound recently -- glad I was too far away to get the full force of the scream :)

@funderburkjim
Contributor

@drdhaval2785 I noticed that you left the <HI>, so the first form of the entry (after the meta line) is

<HI>complex headword ¦ rest of first line

The conventions used by other dictionaries (like .<P>hw ¦ ) might be changed to this model.

@drdhaval2785
Contributor Author

drdhaval2785 commented May 11, 2017 via email

@funderburkjim
Contributor

Please revisit headwordforms.md

? I thought we were talking about acc?

Your {#key2#}hom¦ is fine. Are you going to implement this for acc?

Let's defer discussion of the [extraInfoField] situation when it arises.

@drdhaval2785
Contributor Author

drdhaval2785 commented May 11, 2017 via email

@funderburkjim
Contributor

One detail re {#key2#}hom¦ form is that sometimes key2 will be presented in IAST (unlike acc, where key2 is presented in Devanagari). The {#...#} is appropriate for acc, since this is the universal coding in dictionaries for Devanagari transcoded into SLP1. For IAST key2, this {#...#} would not be quite right.

@gasyoun
Member

gasyoun commented May 11, 2017

I agree. It is just an add on if at all needed later.

Dhaval's planning for future disasters is welcome.

{#...#} is appropriate for acc, since this is the universal coding in dictionaries for Devanagari transcoded into SLP1

Oh, never knew before. Is there a full list of what is what in coding?

@funderburkjim
Contributor

special end of entry code needed in xxx.txt

Our proposed structure for acc.txt now looks approximately like

[LINES BEFORE FIRST ENTRY]
<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA<h>    [META LINE FOR FIRST ENTRY]
{#aMSadaSA#}¦ REST OF LINE
[OTHER LINES FOR FIRST ENTRY, IF ANY]
<L>2<pc>....   [META LINE FOR 2ND ENTRY]
{#KEY2#}¦ REST OF LINE
[OTHER LINES FOR 2ND ENTRY, IF ANY]
...
...   REPETITION OF THIS PATTERN FOR ALL THE ENTRIES
...
<L>99999<pc>....   [META LINE FOR **LAST** ENTRY]
{#KEY2#}¦ REST OF LINE
[OTHER LINES FOR LAST ENTRY, IF ANY]
[LINES AFTER LAST  ENTRY]      <<<<<

The point of making this structure summary is to emphasize a minor deficiency:

There is no way to tell where the LAST entry ends.

For any entry except the last, we infer that the digitization lines for that entry include all lines
up to and including the line before the next <L> meta line.
But for the last entry, there is no next <L> meta line. And the lines for the last entry typically
do not include all lines following the last meta line.

Thus, we need an extra line with a marker indicating the end of the last entry.

I'm not quite sure what to call this marker, maybe <LEND> or maybe <L>END .

Although not applicable to acc, there is another circumstance where this entry-ending marker will
be required. In a few dictionaries (VCP is one, there are probably a couple of others) there are
some sections of the digitization which are between two entries but not part of an entry.
Identifying the
end of the entry preceding such a section also will require some special marker; probably we could use
the same marker as that for the last entry.

@gasyoun
Member

gasyoun commented May 11, 2017

END

If a simple empty line at the end is not enough, let it be so.

@funderburkjim
Contributor

Is there a full list of what is what in coding?

This markup information for xxx.txt is in the xxx-meta.txt file.

As big changes occur (such as AS->IAST) for a dictionary,
the markup information for new xxx.txt is being put in a separate xxx-meta2.txt file.

@gasyoun
Member

gasyoun commented May 12, 2017

As big changes occur (such as AS->IAST) for a dictionary,
the markup information for new xxx.txt is being put in a separate xxx-meta2.txt file.

I guess you are the only one who knows where to look for and in what cases. It's sooo complicated.

@drdhaval2785
Contributor Author

Hi Jim,
After seeing your comment regarding VCP etc,
I feel like keeping <LEND> after each entry. It is currently kept like that. If you feel it is superfluous, I can remove it and keep it only at the end of the last entry.

Similarly, there are other circumstances also which favour such choice e.g. ACC has three books. There are addenda corrigenda part after each book.

@gasyoun
Member

gasyoun commented May 15, 2017

ACC has three books. There are addenda corrigenda part after each book.

How will it help in such cases? PWG, PWK has 7 volumes and we know after which L's the corrigenda start. But how does the <LEND> help?

@drdhaval2785
Contributor Author

Hi @funderburkjim ,

Now the acc.txt is having the headword metadata line in it.
Typical lines

<L>36<pc>1-001,2<k1>agastyagItA<k2>agastyagItA
{#agastyagItA#}¦ from Paśupālopākhyāna of Varāhapurāṇa.
<>Burnell 193^b.<LEND>
<L>37<pc>1-001,2<k1>agastyaniGaRwu<k2>agastyaniGaRwu
{#agastyaniGaRwu#}¦ vocabulary. Oppert 7795.<LEND>

The subsequent programs for generating XML from it have been suitably modified, mainly hw0.py and make_xml.py.

I guess there are not many errors, because I have retained the old code as much as I could and Jim's python didn't spit out errors.

@funderburkjim the git pull may be relatively heavy this time, because the headword start lines and end lines have changed, and acchw0.txt / acchw1.txt / acchw2.txt are tracked in git.

But subsequent changes may not be that heavy. Let us accept this as necessary evil.

@gasyoun
Member

gasyoun commented May 15, 2017

Let us accept this as necessary evil.

Indeed.

@funderburkjim
Contributor

@drdhaval2785 I have a small comment regarding <LEND>, but want to be sure you agree with the changes I made today (commit 2db107fb7...) before making that comment.

@gasyoun The <LEND> provides an explicit end to an entry. That's why it is useful.

@drdhaval2785
Contributor Author

drdhaval2785 commented May 16, 2017 via email

@funderburkjim
Contributor

funderburkjim commented May 16, 2017

git and line endings

For work on Unix systems, such as Cologne server and dev server, text files have lines ending in '\n'
(line feed).

Text files created by native Windows OS apps have lines ending in '\r\n'.

Our initial dev server repository was created from files copied from Cologne server to Dev server.
All these files have Unix line endings.

When we clone this repository to our local windows machine, the files still have unix line endings.

However, when we work with files in this local repository, changes to the line endings can occur, thanks
in part to git. For instance, I noticed that acc2.txt somehow appeared to have the windows line endings; and this caused a plain 'diff ../../acc2.txt acc_invert_meta.txt' to give a huge diff file.

An example shows how the git system can be very confusing with regard to line endings.
I recreated acc2.txt (as described elsewhere) with Unix line endings. At this point
'git status' showed that orig/acc2.txt had changed. When I then did 'git add acc2.txt', an odd thing
happened. A 'git status' showed that there was nothing to commit! Weird and confusing.

I think we are both following the general recommendation of git documentation by setting this config:
ref

git config --global core.autocrlf true

The documentation describes this as:

Git can handle this by auto-converting CRLF line endings into LF when you add a file to the index, and 
vice versa when it checks out code onto your filesystem. 

But in practice, it is confusing to know when Git 'checks out code onto your filesystem'.
All the discussion in stackoverflow and the git documentation shows that, despite Git's best intentions,
the situation with line-endings is still confusing when working with Git projects in both
unix and windows.

a possible solution for us

First, we should each set our local git config to the git recommendation:

git config --global core.autocrlf true

Second, we can replace the diff utility with a Python program.

python diff.py <file1> <file2>

Why diff.py is better than 'diff -w'

The unix diff utility with the '-w' option compares two files but ignores all white space.
So if file1 and file2 were the same except for a possible difference in line endings, then 'diff -w' would
show no difference in the two files. Also diff.py would show no difference. In this circumstance,
the diff.py and 'diff -w' give the same answer, as desired.

However, 'diff -w' would also show no difference between these two one-line files:

FILE1
This is a line with spaces.

FILE2
Thisisalinewithspaces.

For our purposes, these two files should be counted as different; and diff.py does count them as
different.
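
A comparison of this kind might be sketched as follows. This is not the diff.py in the pywork directory, only a minimal illustration; the function names are hypothetical, and the files are assumed to be UTF-8. Line terminators are normalised before comparing, while all other white space is left alone.

```python
def lines_without_eol(text):
    """Split text into lines, discarding only the line terminators."""
    # Normalise CRLF and lone CR to LF first; spaces inside lines are kept.
    return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")


def same_except_line_endings(text1, text2):
    """True if the two texts differ at most in their line endings."""
    return lines_without_eol(text1) == lines_without_eol(text2)


def files_match(path1, path2, encoding="utf-8"):
    """Compare two files on disk, ignoring only line-ending differences."""
    # Read as bytes so that Python does not translate newlines behind our back.
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        return same_except_line_endings(
            f1.read().decode(encoding), f2.read().decode(encoding)
        )


# Same content, Windows versus Unix line endings: counted as equal.
print(same_except_line_endings("a b\r\nc\r\n", "a b\nc\n"))    # True
# The 'diff -w' counterexample from above: counted as different.
print(same_except_line_endings("This is a line with spaces.\n",
                               "Thisisalinewithspaces.\n"))    # False
```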

unixify.py

It might be that from time to time we need to assure that a particular file has unix line endings.

python unixify.py <file>

This program reads the file into an array of lines, with the line endings stripped.
Then it writes these lines back onto the file, with Unix line endings.
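
For illustration, the same effect can be had with a byte-level replacement. This is a swapped-in technique, not the line-array approach just described and not the actual unixify.py; the function names are hypothetical.

```python
def unixify_bytes(data):
    """Return data with CRLF and lone CR line endings converted to LF."""
    # Order matters: handle CRLF first so it does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def unixify_file(path):
    """Rewrite the file at path in place with Unix line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(unixify_bytes(data))


print(unixify_bytes(b"line one\r\nline two\r\n"))   # b'line one\nline two\n'
```

Working on bytes avoids any dependence on the file's text encoding and leaves a missing final newline as it was.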

Both unixify.py and diff.py are in the pywork directory of acc repository.

redo.sh has been changed to use diff.py instead of 'diff -w'.

readme.txt has been revised to be consistent with redo.sh.

These changes have been pushed to dev server (commit 8908d043...).

@drdhaval2785 agree with this solution for line-ending problem?

@funderburkjim
Contributor

small comment on <LEND>

Currently, the entry-ending tag <LEND> is placed at the end of the last line of the entry:

<L>36<pc>1-001,2<k1>agastyagItA<k2>agastyagItA
{#agastyagItA#}¦ from Paśupālopākhyāna of Varāhapurāṇa.
<>Burnell 193^b.<LEND>

Since that tag is also a 'meta' tag (i.e., not part of the original text of the entry), would it be better for
it to be on a separate line?

<L>36<pc>1-001,2<k1>agastyagItA<k2>agastyagItA
{#agastyagItA#}¦ from Paśupālopākhyāna of Varāhapurāṇa.
<>Burnell 193^b.
<LEND>

@drdhaval2785 What do you think?

If you agree, why don't you make the necessary changes to meta_hw.py and invert_meta.py, and
rerun redo.sh. Don't bother with changing hw2.py or hw0.py or make_xml.py -- I need to do some
revisions of these (to make provision for alternate headwords), and will make the minor changes to
handling new location of LEND at that time.
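
With <LEND> on its own line, splitting the file into entries becomes a simple scan. The sketch below is only an illustration, not code from meta_hw.py or hw0.py; `iter_entries` is a hypothetical name, and it assumes every entry starts with a line beginning with `<L>` and ends with a line that is exactly `<LEND>`.

```python
def iter_entries(lines):
    """Yield (meta_line, body_lines) for each entry.

    Lines outside any entry (before the first entry, after the last one,
    or between two entries) are skipped.
    """
    meta = None
    body = []
    for line in lines:
        if meta is None:
            if line.startswith("<L>"):
                meta = line          # a new entry begins
                body = []
        elif line == "<LEND>":
            yield meta, body         # explicit end of the entry
            meta = None
        else:
            body.append(line)
    if meta is not None:
        raise ValueError("entry without <LEND>: " + meta)


sample = [
    "line before first entry",
    "<L>36<pc>1-001,2<k1>agastyagItA<k2>agastyagItA",
    "{#agastyagItA#}¦ from Paśupālopākhyāna of Varāhapurāṇa.",
    "<>Burnell 193^b.",
    "<LEND>",
    "<L>37<pc>1-001,2<k1>agastyaniGaRwu<k2>agastyaniGaRwu",
    "{#agastyaniGaRwu#}¦ vocabulary. Oppert 7795.",
    "<LEND>",
    "line after last entry",
]

entries = list(iter_entries(sample))
print(len(entries))          # 2
print(len(entries[0][1]))    # 2
```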

@gasyoun
Member

gasyoun commented May 16, 2017

would it be better for
it to be on a separate line?

I guess so.

@drdhaval2785
Contributor Author

drdhaval2785 commented May 17, 2017 via email

@drdhaval2785
Contributor Author

drdhaval2785 commented May 17, 2017 via email

@drdhaval2785
Contributor Author

@funderburkjim
Done the changes in meta_hw.py, invert_meta.py, reran redo.sh.
Just made a single line change in hw0.py (decrement of 1 from line number of <LEND> to demarcate entry ending).

The field is now open for you to make L-numbers immutable and prepare some methodology in algorithm by which alternate headwords / missed headwords etc can be added without altering L-numbers. I will wait for you to finish this before we move ahead.

@funderburkjim
Contributor

@drdhaval2785

After git pull origin master, I reran redo.sh and looked at accwithmeta.txt and it seems fine.

The git pull statement showed that also 'orig/acc.txt' was updated, but not orig/acc3.txt. Further examination also showed a modification in .gitignore to not track orig/acc3.txt.

acc3.txt needs to be tracked.

It is true that it can be recreated by copying accwithmeta.txt. However, unlike accwithmeta.txt, acc3.txt
plays a role in subsequent updates, as seen in update.sh:

python updateByLine.py ../orig/acc3.txt manualByLine01.txt ../orig/acc.txt 

I modified .gitignore accordingly.

In fact ALL the files in orig directory should be tracked. Theoretically, only the first form of the
digitization (orig/acc_orig.txt) is required, and the others can be recreated by pywork/update.sh.
However, it is safer to keep all the major intermediate forms present in orig directory.

recreation of acc3.txt and acc.txt from accwithmeta.txt

There are two steps, as can be seen by examination of pywork/update.sh. These steps should be
done manually, by copy-pasting from update.sh to terminal session.
These assume current directory of acc/pywork.

  • cp correctionwork/issue-cologne-130/accwithmeta.txt ../orig/acc3.txt
  • python updateByLine.py ../orig/acc3.txt manualByLine01.txt ../orig/acc.txt

@funderburkjim
Contributor

revise opinion on core.autocrlf setting: use 'input', not 'true'

When I added acc3.txt with git (git add acc3.txt), I got this confusing warning message:

warning: LF will be replaced by CRLF in orig/acc3.txt.
The file will have its original line endings in your working directory.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)


Now, I don't want CRLF to be the line ending in acc3.txt, but rather the unix LF.

So, I did "git reset head acc3.txt" to unstage acc3.

This stackexchange article had an interesting comment:

If you’re on a Linux or Mac system that uses LF line endings, then you don’t want Git to automatically convert them when you check out files; however, if a file with CRLF endings accidentally gets introduced, then you may want Git to fix it. You can tell Git to convert CRLF to LF on commit but not the other way around by setting core.autocrlf to input:

$ git config --global core.autocrlf input

This setup should leave you with CRLF endings in Windows checkouts, but LF endings 
on Mac and Linux systems and in the repository.

That sounds like exactly what we want. So, I changed the core.autocrlf to 'input' as mentioned:

git config --global core.autocrlf input

Then, 'git add acc3.txt' made no complaints or warnings, as expected since we know from its
construction as a copy of accwithmeta.txt that it has LF ('\n') for line endings.

So let's adopt input as our standard global configuration in Git Bash.

@funderburkjim
Contributor

more git growing pains

Pushing the above changes to dev server failed:

On local machine:

$ git push origin master
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 815 bytes | 0 bytes/s, done.
Total 6 (delta 3), reused 0 (delta 0)
error: Untracked working tree file 'orig/acc3.txt' would be overwritten by merge.
To [IPADDRESS]:/var/www/html/cologne/acc/
 ! [remote rejected] master -> master (Could not update working tree to new HEAD)
error: failed to push some refs to 'dpjf@[IPADDRESS]/var/www/html/cologne/acc

stackexchange to the rescue: clean dev server

Investigation led to a stackexchange discussion. In our case, we need to
remove from dev server the untracked file acc3.txt.

First, a dry run to show untracked files of the repository. This is via an ssh connection to dev server:

dpjf> git clean -n -X
Would remove acc3.txt
Would remove temp_acc_xxd.txt

This looks right -- we want to remove these from dev server repository. '-f' does the removal.

dpjf> git clean -f -X
Removing acc3.txt
Removing temp_acc_xxd.txt

No commit is needed here on dev server. This git status confirms:

dpjf> git status
On branch master
nothing to commit, working directory clean

Now, back on local machine -- the push works now:

$ git push origin master
Counting objects: 6, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 815 bytes | 0 bytes/s, done.
Total 6 (delta 3), reused 0 (delta 0)
To IPADDRESS:/var/www/html/cologne/acc/
   d238289..e70fc29  master -> master

Whew!

@gasyoun
Member

gasyoun commented May 17, 2017

Whew!

A hard day.

@drdhaval2785
Contributor Author

drdhaval2785 commented May 18, 2017 via email

@funderburkjim
Contributor

funderburkjim commented May 18, 2017

Here are some notes written in a temporary issue
regarding construction of the important acchw2.txt file.

Comments, suggestions solicited.

If these ideas still seem right tomorrow, I'll generate hw2.py code to implement the ideas.

@funderburkjim
Contributor

significance of <e> in meta line

For 7 entries, the construction of meta line has an extra parameter <e>2.
First case:

<L>144<pc>1-004,1<k1>aGoraSivaAcArya<k2>aGoraSiva AcArya<e>2
{#aGoraSiva AcArya#}¦ Quoted in Śaivadarśana of Sa-
<>rvadarśanasaṃgraha. Oxf. 246^a.
<HI1>Kriyākramoddyota. Burnell 207^a.
<HI1>Tattvatrayanirṇayavyākhyā. Mysore 4.
<HI1>Tattvaprakāśikāvṛtti. Burnell 111^a. Śivatattva-
<>prakāśikāvṛtti. Burnell 111^a. Mysore 4.
<HI1>Tattvasaṃgrahalaghuṭikā. Burnell 111^a.
<HI1>Nādakārikāvṛtti. L. 1434. Burnell 111^a. 
[Page1-004-b+ 45]
<HI1>Paddhati. Poona 337.
<HI1>Sarvajñānottaravṛtti. Burnell 111^a.
<LEND>

@drdhaval2785 What is this about?

@drdhaval2785
Contributor Author

drdhaval2785 commented May 19, 2017

@funderburkjim
re <e>2
They are print errors where period is placed after the headword. No other headword has period after it.
So to enable return journey without information loss, this extraInfo was added.

image

@gasyoun
Member

gasyoun commented May 19, 2017

So to enable return journey without information loss, this extraInfo was added.

To make print errors legal?

@drdhaval2785
Contributor Author

drdhaval2785 commented May 19, 2017 via email

@gasyoun
Member

gasyoun commented May 19, 2017

Till we remove them, yes.

That's what I thought, ok.

@funderburkjim
Contributor

OK. This kind of irregularity will probably occur in many dictionaries as we convert to xxxwithmeta.txt form,
and try for invertibility. For this particular acc dictionary, the irregularity was simple enough to handle as you did. Another solution method would be to hard code quirks in the invert.py program for a particular dictionary.

@funderburkjim
Contributor

Revisions to redo_hw and make_xml

During implementation of the ideas described in the temporary issue, several changes were made. These changes make the system both conceptually simpler and slightly more general.

acc.txt + acc_hwextra.txt --> acchw.txt

The hw.py program combines the meta-lines of acc.txt and the lines of acc_hwextra.txt (for the alternate (and sub) headwords) into the acchw.txt file. This file was not part of the prior system.
For the sake of this description, let's call either one of these two kinds of lines a general meta line.

To each general meta line are added the two linenum1/2 fields (indicating the line range of acc.txt corresponding to the entry designated by the general meta line), and the result becomes a line of
acchw.txt.

Example of acchw records:

From first acc meta line
<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA<ln1>3<ln2>5
From an alternate headword line from acc_hwextra.txt:
<L>12.1<pc>1001,1<k1>akzacaraRa<k2>akzacaraRa<type>alt<LP>12<k1P>akzapAda<ln1>39<ln2>42

fields of meta lines of acc.txt

required fields: L, pc, k1, k2
optional field: hom

fields of acc_hwextra.txt

required fields: L, pc, k1, k2, type, LP, k1P
optional field ?: hom -- not sure if this is ever required for an alternate or sub headword

fields of an acchw.txt record :

  • required fields
    • L Cologne record identifier
    • pc page-col reference to scanned image
    • k1 key1. The headword spelling, in slp1 coding for Sanskrit headwords
    • k2 key2. The original headword spelling, either in slp1 or IAST
    • ln1 linenum1 = line number of first line of acc.txt for the associated entry [a 'metaline']
    • ln2 linenum2 = line number of last line of acc.txt for the associated entry []
  • optional field for homonym
    • hom The homonym number (usually a digit). Not present in acc dictionary
  • required fields for an alternate headword
    • type : alt (currently). Anticipate also 'sub' for subheadword. Maybe other types not yet developed.
    • LP : L-code of 'parent' headword.
    • k1P : key1 of 'parent' headword

<key>val sequence for format of acchw.txt records

This is a more convenient format than a colon-delimited value sequence such as used by acchw2.txt and
acchw0.txt. Optional fields may just be omitted. The presence of <key> reminds us of what the
field represents.
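
As a rough illustration of how easily this format parses, a regular expression is enough. This is a sketch, not the code in hwparse.py; `parse_meta` is a hypothetical name, and it assumes that field values never contain a '<' character.

```python
import re


def parse_meta(line):
    """Parse '<key>val<key>val...' into a dict mapping key to value."""
    # Each match is a (key, value) pair; the value runs up to the next '<'.
    return dict(re.findall(r"<([^<>]+)>([^<]*)", line))


rec = parse_meta(
    "<L>12.1<pc>1001,1<k1>akzacaraRa<k2>akzacaraRa"
    "<type>alt<LP>12<k1P>akzapAda<ln1>39<ln2>42"
)
print(rec["L"], rec["type"], rec["k1P"])   # 12.1 alt akzapAda
print(rec.get("hom"))                      # None
```

An omitted optional field simply has no key in the dictionary, so `rec.get("hom")` returns None.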

acchw.txt -> acchw2.txt and acchw0.txt

Since acchw2.txt is used elsewhere (notably in construction of sanhw1 - headwords for all dictionaries),
it is convenient to maintain this file in its current format. Each line of acchw2.txt is easily constructed from acchw.txt as a colon-delimited sequence of certain required fields:

pc:k1:ln1,ln2:L

acchw0 is constructed similarly, the difference being that 'k2' (key2) is used instead of 'k1' (key1).
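
The conversion from an acchw.txt record to the colon-delimited form might look like this. It is a sketch under the same assumptions as above (hypothetical function names, no '<' inside field values), not the actual implementation.

```python
import re


def parse_meta(line):
    """Parse '<key>val<key>val...' into a dict (hypothetical helper)."""
    return dict(re.findall(r"<([^<>]+)>([^<]*)", line))


def colon_line(rec, key="k1"):
    """Format a record as pc:key:ln1,ln2:L.

    key='k1' gives the acchw2.txt form; key='k2' gives the acchw0.txt form.
    """
    return "{pc}:{hw}:{ln1},{ln2}:{L}".format(
        pc=rec["pc"], hw=rec[key], ln1=rec["ln1"], ln2=rec["ln2"], L=rec["L"]
    )


rec = parse_meta("<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA<ln1>3<ln2>5")
print(colon_line(rec))             # 1-001,1:aMSadaSA:3,5:1
print(colon_line(rec, key="k2"))   # 1-001,1:aMSadaSA:3,5:1
```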

hwparse.py

This program contains a class HW for records of acchw.txt and a function which reads acchw.txt and
parses the lines into a sequence of HW objects. Each HW instance object allows access to the
fields of the acchw line as instance attributes (e.g. obj.k1 for the 'k1' (key1) field). Alternatively, a dictionary form of access is possible (e.g. obj.d['k1']). Missing optional fields are given the Python None value.

HW class also contains some class variables, that may be useful in subsequent programs.
E.g., some of these are used in the make_xml.py program.

  • Ldict - a dictionary associating each L-code with the corresponding acchw.txt record
  • Sanskrit - A boolean flag: Are the headwords Sanskrit?
  • dictcode - e.g. 'acc' - the Cologne dictionary identifier
  • hwrec_keys = the possible keys of the acchw.txt records
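
A class along these lines might be sketched as follows. This is an illustration only and not the contents of hwparse.py: the order of hwrec_keys, the reader function name `parse_hw_lines`, and the regular expression are assumptions.

```python
import re


class HW(object):
    """One record of acchw.txt (illustrative sketch)."""
    # Class variables shared by all records.
    dictcode = "acc"
    Sanskrit = True
    hwrec_keys = ["L", "pc", "k1", "k2", "hom", "type", "LP", "k1P", "ln1", "ln2"]
    Ldict = {}

    def __init__(self, line):
        found = dict(re.findall(r"<([^<>]+)>([^<]*)", line))
        # Dictionary access; missing optional fields become None.
        self.d = {key: found.get(key) for key in HW.hwrec_keys}
        # Attribute access, e.g. obj.k1.
        for key, val in self.d.items():
            setattr(self, key, val)
        HW.Ldict[self.L] = self


def parse_hw_lines(lines):
    """Turn an iterable of acchw.txt lines into a list of HW objects."""
    return [HW(line.rstrip("\r\n")) for line in lines]


recs = parse_hw_lines([
    "<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA<ln1>3<ln2>5\n",
    "<L>12.1<pc>1001,1<k1>akzacaraRa<k2>akzacaraRa<type>alt<LP>12<k1P>akzapAda<ln1>39<ln2>42\n",
])
print(recs[0].k1, recs[0].hom)     # aMSadaSA None
print(recs[1].d["LP"])             # 12
```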

make_xml.py

This program is conceptually much simpler than before. It uses both acc.txt and acchw.txt as inputs.
It constructs an xml structure representing all the entries of the dictionary, one entry for each
acchw.txt record. It makes provision for the 'hom' element. The next comment discusses the current choice in handling of alternate headwords in xxx.xml.

@gasyoun
Member

gasyoun commented May 20, 2017

optional field ?: hom -- not sure if this ever required for an alternate or sub headword

Guess never.

k2 key2. The original headword spelling, either in slp1 or IAST

And there will be no meta code to know if it's SLP1 or IAST?

Sanskrit - A boolean flag: Are the headwords Sanskrit?

Mark some as Prakrit in future?

@funderburkjim
Contributor

alternate headwords in acc.xml

Currently, there is just one alternate headword implemented for acc; to get more we have to generate
additional records of hwextra/acc_hwextra.txt. That's a separate consideration.

Here are the xml records for the primary and alternate headwords:

PRIMARY (akzapAda)
<H1><h><key1>akzapAda</key1><key2>akzapAda</key2></h>
 <body>
   <s>akzapAda</s>  or <s>akzacaraRa,</s> a name of Gautama, the philo- <br/>sopher,  Hall p. 20.
  </body>
  <tail><L>12</L><pc>1-001,1</pc></tail></H1>
ALTERNATE (akzacaraRa)
<H1><h><key1>akzacaraRa</key1><key2>akzacaraRa</key2></h>
<body>
   THIS LINE IS ADDITIONAL FOR ALTERNATE
  <alt><s>akzacaraRa</s> is an alternate spelling of <s>akzapAda</s></alt> 
  <s>akzapAda</s>  or <s>akzacaraRa,</s> a name of Gautama, the philo- <br/>sopher, Hall p. 20.</body>
  <tail><L>12.1</L><pc>1-001,1</pc>
    THIS LINE IS ADDITIONAL FOR ALTERNATE
    <hwtype n="alt" ref="12"/>
   </tail></H1>

Two elements are special for alternate (or sub) headwords:

  • <alt> in body element. This is a simple description of the fact that we have an alternate spelling.
    • If this were a sub-headword, the wording would be
      <alt><s>akzacaraRa</s> is a sub-headword of <s>akzapAda</s></alt>
  • <hwtype n="TYPE" ref="LP"/>
    This element in the tail indicates the type of the alternate ('alt' or 'sub' ) and the Cologne record
    identifier of the parent primary entry.
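
For illustration, the two extra elements can be generated from the fields of an alternate-headword record. This sketch is not taken from make_xml.py; the function name and the PHRASES table are hypothetical, and the record is assumed to be a dict with the type, k1, k1P and LP fields.

```python
from xml.sax.saxutils import escape

# Wording for each alternate type, as given in the description above.
PHRASES = {
    "alt": "is an alternate spelling of",
    "sub": "is a sub-headword of",
}


def alt_elements(rec):
    """Return the (<alt>, <hwtype>) element strings for an alternate record."""
    alt = "<alt><s>{0}</s> {1} <s>{2}</s></alt>".format(
        escape(rec["k1"]), PHRASES[rec["type"]], escape(rec["k1P"])
    )
    hwtype = '<hwtype n="{0}" ref="{1}"/>'.format(
        escape(rec["type"]), escape(rec["LP"])
    )
    return alt, hwtype


rec = {"type": "alt", "k1": "akzacaraRa", "k1P": "akzapAda", "LP": "12"}
alt, hwtype = alt_elements(rec)
print(alt)      # <alt><s>akzacaraRa</s> is an alternate spelling of <s>akzapAda</s></alt>
print(hwtype)   # <hwtype n="alt" ref="12"/>
```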

the xml form used for skd alternates

Here is the current skd.xml record for the alternate headword 'kuveraH':

<H1><h n="alt"><key1>kuveraH</key1><key2>kube(ve)raH</key2></h>
<body ref="8094"></body>
 <tail><L>8094.01</L><pc>2-144</pc></tail></H1>
  • an attribute of <h> indicates the type of alternate
  • an attribute of <body> indicates the cologne record id of the parent record.
  • the <body> element has no text.

Downstream programs using skd.xml must contain logic to interpret these special attributes. In particular, this somewhat complex interpretation logic is present in the disp.php program which generates the basic display of records for skd.

A downstream user of acc.xml will still need logic to deal with the <alt> element, but this logic should
be simpler (e.g., no need to make an extra search to know that the ref value of '8094' corresponds to the 'kuberaH' spelling).

Time will tell which approach is better. Currently, I prefer the acc.xml approach, which was originally suggested by Dhaval.

@funderburkjim
Contributor

current status

The changes described in the prior two comments have been pushed to dev server in commit a19c3335.

Next step will be to modify disp.php to handle the <alt><hwtype> elements; and also the <div n="3"> case that Dhaval introduced.

@drdhaval2785 Do you want to give these changes to web/webtc/disp.php a try?

@funderburkjim
Contributor

And there will be no meta code to know if it's SLP1 or IAST?

Good observation - as currently key2 is only in IAST for some dictionaries. Maybe the right place to put this dictionary-meta piece of information is as an HW class variable within hwparse.py.

Prakrit

Have not thought about this. Are there examples? Currently the headwords of all dictionaries are (I think) either Sanskrit words or English words. The use thus far of this flag is to know how to render key1. If Sanskrit flag is True, then we use the fact that key1 field is always SLP1, regardless of whether the dictionary shows Devanagari or IAST; thus we can use transcoding to render key1 in Devanagari, IAST or whatever the user display chose. If Sanskrit flag is False, then render key1 'as is'. For instance, if we had a Russian-Sanskrit dictionary, then we would set Sanskrit flag to false, since the headwords would be in Russian.

@gasyoun
Member

gasyoun commented May 21, 2017

currently key2 is only in IAST for some dictionaries

What is the need for it? Or why some are SLP1? Does not make sense to me - the diversity.

Are there examples?

There were, but lost again.

@funderburkjim
Contributor

Does not make sense to me - the diversity.

Good point. I'm not sure how to resolve. When we are next working on such a dictionary (one with IAST headwords in print), we should be alert. Maybe a solution will present itself when we see the exact details in such a dictionary.

@funderburkjim
Contributor

@drdhaval2785 I got an email of a comment 'acc.xml doesn't have akzacaraRa entry in it.'

but don't see it in the comments now -- presume you solved this problem (by update_sync.sh)?

@drdhaval2785
Contributor Author

drdhaval2785 commented May 22, 2017 via email

@drdhaval2785
Contributor Author

drdhaval2785 commented May 22, 2017 via email

@gasyoun
Member

gasyoun commented May 22, 2017

I did redo_hw.sh but forgot to do redo_xml.sh. It was false alarm.

For such cases I write code that runs all steps for me 💃

@funderburkjim
Contributor

web/webtc/disp.php modified on dev server

Change committed.

There were numerous other files whose rw status had changed for some reason; and these changes
also appear in the commit.

You can see changes by comparing Basic-dev and
Basic-Cologne for these
SLP1 headwords:

  • alt headword: akzacaraRa
  • div with n=3: agastyasaMhitA
  • different handling of break: akzapAda, akzamAlikopanizad

Also, these are same on both systems, and are the other types of div in acc

  • div with n=P (paRqitasvAmin)
  • div with n=2 (akulAgamatantra)

@funderburkjim
Contributor

These issues now resolved on dev server and moved to Cologne server.

@funderburkjim
Contributor

acc-meta2.txt now revised to mention meta lines.
