Specific issues for converting acc.txt to have identical headword line #133
Hi, Dhaval -
The ',1' is indicating the first column of page 1-001 (first page of first volume).
Couple of choices (These are not exactly problems - but especially note the comment on invertibility).
Applying to this case, this would mean having a separate program invert_meta.py which would read accwithmeta.txt and |
Done. |
Done. Corrected code to keep these items. Now there is no diff. |
I agree. Will reduce size and remove superfluous items.
Careful drafting of code may parse the meta line properly. Not much hassle. |
I feel we should keep this broken bar, or some other unusual separator, for sure. Whatever we choose, it should be unique and uniform across all 36 dictionaries. |
Makes sense.
This ¦ makes me want to scream when I see it in the code. If all will have it I'll die. Even @ will be better.
This is the lesson I learned when I converted non-Unicode devanagari in Adobe InDesign. I missed this step and the price was high. |
I thought I heard an unusual sound recently -- glad I was too far away to get the full force of the scream :) |
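@drdhaval2785 I noticed that you left the <HI>, so the first form of the entry (after the meta line) is <HI>complex headword ¦ rest of first line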
The conventions used by other dictionaries (like |
@drdhaval2785 <https://github.com/drdhaval2785> I noticed that you left the
<HI>, so the first form of the entry (after the meta line) is
<HI>complex headword ¦ rest of first line
I left it as it is because we had not decided on headword-derived-part
structure yet. I feel there are mainly two items here. Key2, hom.
I propose we keep opening part as
{#key2#}hom¦
for all dictionaries.
If we want to add some information like 100 in PW or some other dictionary,
{#key2#}hom[extraInfoField]¦
This should work as generic solution. Please revisit headwordforms.md and
change the suggested form. I will make necessary changes to code.
|
? I thought we were talking about acc? Your {#key2#}hom¦ is fine. Are you going to implement this for acc? Let's defer discussion of the [extraInfoField] situation when it arises. |
? I thought we were talking about acc?
Yes, but decisions we take should be generic so that future implementations
in other dictionaries don't pose much of a problem.
Your {#key2#}hom¦ is fine. Are you going to implement this for acc?
Yes.
Let's defer discussion of the [extraInfoField] situation when it arises.
I agree. It is just an add on if at all needed later.
|
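The opening part agreed on above, {#key2#}hom¦ with an optional [extraInfoField], can be sketched as a parser. This is only an illustration, not the code used in the acc repository: the function name `parse_opening`, the assumption that hom is a run of digits, and the returned dictionary keys are all hypothetical.

```python
import re

# Assumed shape of the first line of an entry: {#key2#}hom[extraInfoField]¦rest
# hom and [extraInfoField] are treated as optional in this sketch.
OPENING = re.compile(
    r"^\{#(?P<key2>.+?)#\}"        # headword as printed, between {# and #}
    r"(?P<hom>\d*)"                 # optional homonym number (assumed numeric)
    r"(?:\[(?P<extra>[^\]]*)\])?"   # optional [extraInfoField]
    r"¦(?P<rest>.*)$"               # broken bar, then the rest of the first line
)

def parse_opening(line):
    """Split the first line of an entry into key2, hom, extra and rest.

    Returns None when the line does not have the assumed shape.
    """
    m = OPENING.match(line)
    if m is None:
        return None
    return {
        "key2": m.group("key2"),
        "hom": m.group("hom") or None,   # empty string becomes None
        "extra": m.group("extra"),       # None when there is no [..] part
        "rest": m.group("rest"),
    }
```

Because the broken bar is required by the pattern, a line that lacks it is reported as unparsable instead of being silently accepted, which is one argument for keeping a unique separator.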
One detail re |
Dhaval's planning for future disasters is welcome.
Oh, never knew before. Is there a full list of what is what in coding? |
special end of entry code needed in xxx.txt
Our proposed structure for acc.txt now looks approximately like
The point of making this structure summary is to emphasize a minor deficiency: There is no way to tell where the LAST entry ends. For any entry except the last, we infer that the digitization lines for that entry include all lines
Thus, we need an extra line with a marker indicating the end of the last entry. I'm not quite sure what to call this marker, maybe Although not applicable to acc, there is another circumstance where this entry-ending marker will |
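A minimal Python sketch of how an explicit end marker makes every entry, including the last, self-delimiting. The marker name <LEND> is the one that comes up later in this thread; the assumption that a meta line starts with `<L>`, and the function name `split_entries`, are hypothetical and only for illustration.

```python
def split_entries(lines, meta_prefix="<L>", end_marker="<LEND>"):
    """Group lines into entries delimited by a meta line and an end marker.

    Each entry is a dict with the meta line and the body lines in between.
    meta_prefix and end_marker are assumptions about the file format.
    """
    entries = []
    current = None
    for line in lines:
        if line.startswith(meta_prefix):
            if current is not None:
                raise ValueError("meta line before end marker: " + line)
            current = {"meta": line, "body": []}
        elif line.strip() == end_marker:
            if current is None:
                raise ValueError("end marker without a meta line")
            entries.append(current)
            current = None
        elif current is not None:
            current["body"].append(line)
        # lines outside any entry (for example blank separators) are skipped
    if current is not None:
        raise ValueError("last entry has no end marker")
    return entries
```

With the marker in place, the end of the last entry is detected the same way as every other entry, and a missing marker becomes an error that can be reported.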
If a simple empty line at the end is not enough, let it be so. |
This markup information for xxx.txt is in the xxx-meta.txt file. As big changes occur (such as AS->IAST) for a dictionary, |
I guess you are the only one who knows where to look for and in what cases. It's sooo complicated. |
Hi Jim, Similarly, there are other circumstances which favour such a choice, e.g. ACC has three books, and there is an addenda/corrigenda part after each book. |
How will it help in such cases? PWG and PWK have 7 volumes, and we know after which L's the corrigenda start. But how does the |
Hi @funderburkjim , Now the acc.txt is having the headword metadata line in it.
The subsequent programs for generating XML from it have been suitably modified, mainly hw0.py and make_xml.py. I guess there are not many errors, because I have retained the old code as much as I could and Jim's Python didn't spit out errors. @funderburkjim the git pull may be relatively heavy this time, because the headword start lines and end lines have changed, and acchw0.txt / acchw1.txt / acchw2.txt were part of the git repository. But subsequent changes may not be that heavy. Let us accept this as a necessary evil. |
Indeed. |
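@drdhaval2785 I have a small comment regarding <LEND>, but want to be sure you agree with the changes I made today (commit 2db107fb7...) before making that comment. @gasyoun The |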
@drdhaval2785 <https://github.com/drdhaval2785> I have a small comment regarding <LEND>, but want to be sure you agree with the changes I made today (commit 2db107fb7...) before making that comment.
I agree with that comment except the line break additions. This gives a
large diff file. Earlier version kept line breaks as near to each other as
possible. So diff files were actually diff files.
|
git and line endings
For work on Unix systems, such as the Cologne server and dev server, text files have lines ending in '\n'. Text files created by native Windows OS apps have lines ending in '\r\n'.
Our initial dev server repository was created from files copied from the Cologne server to the dev server. When we clone this repository to our local Windows machine, the files still have Unix line endings. However, when we work with files in this local repository, changes to the line endings can occur, thanks
An example shows how the git system can be very confusing with regard to line endings.
I think we are both following the general recommendation of the git documentation by setting this config:
The documentation describes this as:
But in practice, it is confusing to know when Git 'checks out code onto your filesystem'.
a possible solution for us
First, we should each set our local git config to the git recommendation:
Second, we can replace the diff utility with a Python program.
Why diff.py is better than 'diff -w'
The unix diff utility with the '-w' option compares two files but
However, it would also show no difference between these two one-line files:
For our purposes, these two files should be counted as different; and diff.py does count them as
unixify.py
It might be that from time to time we need to ensure that a particular file has unix line endings.
This program reads the file into an array of lines, with the line endings stripped.
Both unixify.py and diff.py are in the pywork directory of the acc repository. redo.sh has been changed to use diff.py instead of 'diff -w'. readme.txt has been revised to be consistent with redo.sh. These changes have been pushed to dev server (commit 8908d043...).
@drdhaval2785 agree with this solution for line-ending problem? |
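For readers without access to the pywork directory, here is a rough sketch of the two ideas: a comparison that ignores only the line endings (so that other whitespace differences still count), and a rewrite of a file with unix line endings. It is not the actual diff.py or unixify.py; the function names and the choice to work on bytes are assumptions made for this illustration.

```python
def read_lines(path):
    # Read a file as bytes and return its lines with the CRLF or LF
    # line ending stripped. Nothing else is normalized.
    with open(path, "rb") as f:
        data = f.read()
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]

def same_except_line_endings(path1, path2):
    # True when the two files differ at most in their line endings.
    # Unlike 'diff -w', differences in spaces or tabs are still reported.
    return read_lines(path1) == read_lines(path2)

def unixify(path):
    # Rewrite the file in place so that every line ends in LF.
    lines = read_lines(path)
    with open(path, "wb") as f:
        for line in lines:
            f.write(line + b"\n")
```

Working on bytes avoids any newline translation by Python itself, so the result is the same on Windows and on the servers.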
small comment on
|
I guess so. |
@drdhaval2785 <https://github.com/drdhaval2785> agree with this solution for line-ending problem?
I agree wholeheartedly.
|
Since that tag is also a 'meta' tag (i.e., not part of the original text of the entry), would it be better for it to be on a separate line?
Yes
@drdhaval2785 <https://github.com/drdhaval2785> What do you think? If you agree, why don't you make the necessary changes to meta_hw.py and invert_meta.py, and rerun redo.sh.
I will do so.
|
@funderburkjim The field is now open for you to make L-numbers immutable and to work out a method by which alternate headwords / missed headwords etc. can be added without altering L-numbers. I will wait for you to finish this before we move ahead. |
After git pull origin master, I reran redo.sh and looked at accwithmeta.txt and it seems fine. The git pull statement showed that also 'orig/acc.txt' was updated, but not orig/acc3.txt. Further examination also showed a modification in .gitignore to not track orig/acc3.txt.
acc3.txt needs to be tracked.
It is true that it can be recreated by copying accwithmeta.txt. However, unlike accwithmeta.txt, acc3.txt
I modified .gitignore accordingly. In fact ALL the files in orig directory should be tracked. Theoretically, only the first form of the
recreation of acc3.txt and acc.txt from accwithmeta.txt
There are two steps, as can be seen by examination of pywork/update.sh. These steps should be
|
revise opinion on core.autocrlf setting: use 'input', not 'true'
When I added acc3.txt with git (git add acc3.txt), I got this confusing warning message:
Now, I don't want CRLF to be the line ending in acc3.txt, but rather the unix LF. So, I did "git reset head acc3.txt" to unstage acc3. This stackexchange article had an interesting comment:
That sounds like exactly what we want. So, I changed the core.autocrlf to 'input' as mentioned:
Then, 'git add acc3.txt' made no complaints or warnings, as expected since we know from its So let's adopt |
more git growing pains
Pushing the above changes to dev server failed:
On local machine:
stackexchange to the rescue: clean dev server
Investigation led to a stackexchange discussion. In our case, we need to
First, a dry run to show untracked files of the repository. This is via an ssh connection to dev server:
This looks right -- we want to remove these from dev server repository. '-f' does the removal.
No commit is needed here on dev server. This git status confirms:
Now, back on local machine -- the push works now:
Whew! |
A hard day. |
Learning deeper git on the way. I also had to do some debugging when there
was a file modified by both Jim and me. Had to manually resolve conflicts
and commit. But with git, nothing is permanently lost. Some hassles, yes.
|
Here are some notes written in a temporary issue Comments, suggestions solicited. If these ideas still seem right tomorrow, I'll generate hw2.py code to implement the ideas. |
significance of
|
@funderburkjim |
To make print errors legal? |
To make print errors legal?
Till we remove them, yes. Once the headwordwithmeta.txt is stable, we can
process them as regular print error corrections and close the issue. A new
CORRECTIONS issue would be in order.
|
That's what I thought, ok. |
OK. This kind of irregularity will probably occur in many dictionaries as we convert to xxxwithmeta.txt form, |
Revisions to redo_hw and make_xml
During implementation of the ideas described in the temporary issue, several changes were made. These changes make the system both conceptually simpler and slightly more general.
acc.txt + acc_hwextra.txt --> acchw.txt
The hw.py program combines the meta-lines of acc.txt and the lines of acc_hwextra.txt (for the alternate (and sub) headwords) into the acchw.txt file. This file was not part of the prior system. To each general meta line is added the two linenum1/2 fields (indicating the line range of acc.txt corresponding to the entry designated by the general meta line), and the result becomes a line of
Example of acchw records:
fields of meta lines of acc.txt
required fields: L, pc, k1, k2
fields of acc_hwextra.txt
required fields: L, pc, k1, k2, type, LP, k1P
fields of an acchw.txt record :
|
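The step of adding the linenum1/2 fields to each meta line could look roughly like the following. This is not the real hw.py: the `<name>value` syntax of the meta line, the <LEND> end marker, the choice to count the meta line and the end marker as part of the range, and the function names are all assumptions made for the sake of a small example.

```python
import re

# Assumed meta-line syntax: <L>1<pc>1-001,1<k1>a<k2>a
FIELD = re.compile(r"<(\w+)>([^<]*)")

def parse_meta(line):
    """Return the fields of a meta line as a dict, e.g. L, pc, k1, k2."""
    return dict(FIELD.findall(line))

def headword_records(lines, end_marker="<LEND>"):
    """Attach the 1-based line range of each entry to its meta fields.

    linenum1 is the meta line and linenum2 is the end-marker line
    (an assumption; the real range convention may differ).
    """
    records = []
    start = None
    for num, line in enumerate(lines, start=1):
        if line.startswith("<L>"):
            start = (num, parse_meta(line))
        elif line.strip() == end_marker and start is not None:
            linenum1, record = start
            record["linenum1"] = linenum1
            record["linenum2"] = num
            records.append(record)
            start = None
    return records
```

Records for alternate headwords from acc_hwextra.txt would then be merged in by their L value; that merge is left out here because its details are not spelled out in this thread.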
Guess never.
And there will be no meta code to know if it's SLP1 or IAST?
Mark some as Prakrit in future? |
alternate headwords in acc.xml
Currently, there is just one alternate headword implemented for acc; to get more we have to generate
Here are the xml records for the primary and alternate headwords:
Two elements are special for alternate (or sub) headwords:
the xml form used for skd alternates
Here is the current skd.xml record for the alternate headword 'kuveraH':
A downstream program using skd.xml must contain logic to interpret these special attributes. In particular, this somewhat complex interpretation logic is present in the disp.php program, which generates the basic display of records for skd. A downstream user of acc.xml will still need logic to deal with the
Time will tell which approach is better. Currently, I prefer the acc.xml approach, which was originally suggested by Dhaval. |
current status
The changes described in the prior two comments have been pushed to dev server in commit a19c3335. Next step will be to modify disp.php to handle the
@drdhaval2785 Do you want to give these changes to web/webtc/disp.php a try? |
Good observation - as currently key2 is only in IAST for some dictionaries. Maybe the right place to put this dictionary-meta piece of information is as an HW class variable within hwparse.py.
Have not thought about this. Are there examples? Currently the headwords of all dictionaries are (I think) either Sanskrit words or English words. The use thus far of this flag is to know how to render key1. If the Sanskrit flag is True, then we use the fact that the key1 field is always SLP1, regardless of whether the dictionary shows Devanagari or IAST; thus we can use transcoding to render key1 in Devanagari, IAST or whatever display the user chose. If the Sanskrit flag is False, then render key1 'as is'. For instance, if we had a Russian-Sanskrit dictionary, then we would set the Sanskrit flag to False, since the headwords would be in Russian. |
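The flag logic described above can be sketched in a few lines. This is an illustration only, not the transcoder used on the Cologne site: the function name `render_key1` and the `sanskrit` parameter are hypothetical, only IAST output is shown, and the table covers the basic SLP1 letters without accents or other special signs.

```python
# SLP1 letters whose IAST form differs; all other letters are unchanged.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū", "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au", "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ", "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh", "P": "ph", "B": "bh",
    "S": "ś", "z": "ṣ",
}

def render_key1(key1, sanskrit=True):
    """Render key1 for display.

    When the sanskrit flag is True, key1 is assumed to be SLP1 and is
    transcoded letter by letter to IAST. Otherwise it is returned as is.
    """
    if not sanskrit:
        return key1
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in key1)
```

A letter-by-letter table is enough here because SLP1 uses exactly one character per sound; a Devanagari renderer would need more logic for vowel signs and the virama.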
What is the need for it? Or why some are SLP1? Does not make sense to me - the diversity.
There were, but lost again. |
Good point. I'm not sure how to resolve. When we are next working on such a dictionary (one with IAST headwords in print), we should be alert. Maybe a solution will present itself when we see the exact details in such a dictionary. |
@drdhaval2785 I got an email of a comment 'acc.xml doesn't have akzacaraRa entry in it.' but don't see it in the comments now -- presume you solved this problem (by update_sync.sh)? |
@drdhaval2785 <https://github.com/drdhaval2785> I got an email of a comment
'acc.xml doesn't have akzacaraRa entry in it.'
but don't see it in the comments now -- presume you solved this problem (by
update_sync.sh)?
I did redo_hw.sh but forgot to do redo_xml.sh. It was false alarm.
|
@drdhaval2785 <https://github.com/drdhaval2785> Do you want to give these
changes to web/webtc/disp.php a try?
I misunderstood this statement to test the disp.php rather than to try
modifying it myself.
|
For such cases I write code that runs all steps for me 💃 |
web/webtc/disp.php modified on dev server
Change committed.
You can see changes by comparing Basic-dev and
Also, these are same on both systems, and are the other types of div in acc
|
These issues now resolved on dev server and moved to Cologne server. |
acc-meta2.txt now revised to mention meta lines. |
#130 mandates that we create one uniform headword line.
I started with ACC conversion.
Jim gave some practical tips privately on mail.
I guess there are some items in those tips which need reproduction verbatim here for public consumption.