meta-line, IAST conversion tracker #177

Closed
drdhaval2785 opened this issue Sep 2, 2017 · 13 comments
Labels
Documentation How TXT , XML work

Comments

@drdhaval2785
Contributor

drdhaval2785 commented Sep 2, 2017

This issue tracks the progress of the meta-line conversion and the IAST conversion for the different dictionaries.
@funderburkjim gave some status in #110 (comment). That was in an unrelated issue, so I am linking it here so that it does not get lost.

Dict IAST meta-line
ACC 20 May 2017 approx. 20 May 2017 approx.
AE 03/13/2018 #217 03/13/2018
AP90 #159 #158 06/30/2017
AP April 2017 #162 07/11/2017
BEN #112 04/10/2017
BHS 01/30/2018 #201 01/30/2018
BOP 02/02/2018 #202 02/02/2018
BOR 02/23/2018 #213 02/23/2018
BUR #105 April 2017 #166 07/28/2017
CAE #191 Oct 2017 #191 10/30/2017
CCS #198 12/28/2017 #198 12/28/2017
GRA #199 01/24/2018 01/24/2018
GST 02/06/2018 #204 02/06/2018
IEG 02/08/2018 #205 02/08/2018
INM 02/11/2018 #206 02/11/2018
KRM 01/28/2018 #200 01/28/2018
MCI 02/11/2018 #207 02/11/2018
MD #103 03/27/2017 #161 07/07/2017
MW72 03/09/2018 #215 03/09/2018
MW 06/19/2018 #216
MWE 02/26/2018 #214 02/26/2018
PD 02/04/2018 #203 02/04/2018
PE 02/13/2018 #208 02/13/2018
PGN 02/15/2018 #209 02/15/2018
PUI 02/17/2018 #210 02/17/2018
PWG #190 12/16/2017 12/16/2017
PW #183 10/17/2017 10/17/2017
SCH 27 Apr 2017 22 June 2017
SHS #170 08/06/2017 #172 08/08/2017
SKD No conversion needed. #176 08/24/2017
SNP 02/19/2018 #212 02/19/2018
STC 09/16/2017 #182 09/16/2017
VCP No conversion needed. #173 08/16/2017
VEI 02/18/2018 #211 02/18/2018
WIL #154 20 June 2017 #154 20 June 2017
YAT #154 31 May 2017 #154 31 May 2017
@funderburkjim
Contributor

Added some missing dates to table above based on #110.

Will use table here for further progress.

@gasyoun
Member

gasyoun commented Sep 2, 2017

pwg, pw, gra, cae, ccs should come next in order of importance, IMHO.

@funderburkjim
Contributor

funderburkjim commented Sep 2, 2017

I want to do stc next, so we'll have both the Sanskrit-French dictionaries done.

Although pwg and pw are more important dictionaries than cae and ccs, pwg and pw are also going to be much harder to do (I think), so I am inclined to do cae and ccs next, and then pwg and pw.

I anticipate that gra is going to be very hard from the IAST conversion point of view (lots of accents), so I want to put it off.

Meta-line conversion for AE will be done AFTER Sampada is finished. Reason: after the meta-line conversion, generating the files for the UI she is using would require reprogramming.

For MW72, I would like Jonathan's Greek entry to be done first, but I may need to do the conversion before he has time to complete his work.

@funderburkjim
Contributor

With the PW/PWG dictionaries now done, my intention is to do the meta-line/IAST conversion on the other dictionaries, but in a more cursory manner. I think it is important to have all the dictionaries in this form, so that the display and maintenance (correction) code can then be made more uniform.

I'm still not sure how to fit MW into this mold, but it should not be forever different, even though it
will retain its place as the original instance of this Sanskrit Dictionary digitization project.

@gasyoun
Member

gasyoun commented Dec 17, 2017

> I'm still not sure how to fit MW into this mold, but it should not be forever different

Exactly, and the fact remains that it is the most popular as well.

Another question. Right now there is no single dictionary file where all the corrections are incorporated, right? It's always generated on the fly for the web, based on several files, right? Can we have a cleaned .xml as well, Jim?

@funderburkjim
Contributor

> there is no single dictionary file where all the corrections are incorporated

If I understand the question, then this statement is not right.

The corrections ARE incorporated into X.txt and X.xml.

For example, consider PWG dictionary. There are numerous versions of the digitization,
starting with pwg. These versions are kept in the 'orig' (for 'original') directory.

pwg.txt      Always the latest version
pwg_orig.txt   The version as presented by Thomas (the first version)
pwg_orig_utf8.txt     converted to utf8 encoding
pwg_orig_utf8_slp1.txt   Devanagari encoding changed to slp1
Other intermediate versions
pwg0.txt
pwg1.txt
pwg2.txt
pwg3.txt
pwg4.txt
pwg5.txt
pwg6.txt
pwg7.txt
pwg8.txt
pwg9.txt
pwgdel.txt
pwgheader.txt
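
Since the list above mentions the slp1 encoding and this issue tracks IAST conversion, here is a minimal sketch of a table-driven SLP1-to-IAST transliteration. It is only an illustration: it is not the project's transcode.py, the function name `slp1_to_iast` is hypothetical, and it ignores accents and any dictionary-specific coding.

```python
# SLP1 uses exactly one ASCII character per Sanskrit sound,
# so a character-by-character lookup is enough for unaccented text.
SLP1_TO_IAST = {
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "e": "e", "E": "ai", "o": "o", "O": "au",
    "M": "ṃ", "H": "ḥ",
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}


def slp1_to_iast(text):
    """Transliterate unaccented SLP1 text to IAST (hypothetical helper).

    Characters that are not in the table (digits, spaces, punctuation)
    are passed through unchanged.
    """
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


print(slp1_to_iast("saMskfta"))  # saṃskṛta
print(slp1_to_iast("kfzRa"))     # kṛṣṇa
```

A dictionary with accented text (such as gra) would need additional handling that this sketch does not attempt.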

The script 'update.sh' aims to describe the exact steps by which each of the intermediate versions
is constructed. Here is that file for pwg:

echo "BEGIN update.sh"
#  construction of pwg_orig_utf8_slp1.txt
# cd convertwork
# python transcode.py 1 ../../orig/pwg_orig_utf8.txt ../../orig/pwg_orig_utf8_slp1.txt
#  construction of pwg0.txt
# # back to pywork
# cd ../
# python pwgall.py ../orig/pwg_orig_utf8_slp1.txt ../orig/pwgheader.txt ../orig/pwg0.txt ../orig/pwgdel.txt
echo "Apply changes in manualByLine01_slp1 to pwg0, getting pwg1"
python updateByLine.py ../orig/pwg0.txt manualByLine01_slp1.txt ../orig/pwg1.txt 
echo "Apply changes in manualByLine02_slp1 to pwg1, getting pwg2"
python updateByLine.py ../orig/pwg1.txt manualByLine02_slp1.txt ../orig/pwg2.txt 
echo "apply manualByLine03_slp1 changes to pwg2 to get pwg3..."
python updateByLine.py ../orig/pwg2.txt manualByLine03_slp1.txt ../orig/pwg3.txt 
#skip old manualByLine04, since accent changes already made in pwg0
#echo "apply manualByLine04_slp1 changes to pwg3 to get pwg4..."
#python updateByLine.py ../orig/pwg3.txt manualByLine04_slp1.txt ../orig/pwg4.txt 
echo "apply manualByLine04_slp1 changes to pwg3 to get pwg4..."
python updateByLine.py ../orig/pwg3.txt manualByLine04_slp1.txt ../orig/pwg4.txt 
#echo "construct missingByLine"
#cat missing/updprephk/missing*.txt > missingByLine.txt
#cat missing/updprep/missing*.txt > missingByLine_slp1.txt
echo "apply missingByLine_slp1 changes to pwg4 to get pwg5..."
python updateByLine.py ../orig/pwg4.txt missingByLine_slp1.txt ../orig/pwg5.txt 
# manualByLine05_slp1 is a copy of 
#  correctionwork/arabic/arabic_prep4_upd_completed.txt
echo "apply manualByLine05_slp1 changes to pwg5 to get pwg6... (Arabic)"
python updateByLine.py ../orig/pwg5.txt manualByLine05_slp1.txt ../orig/pwg6.txt 
echo "apply manualByLine06_slp1 changes to pwg6 to get pwg7..."
python updateByLine.py ../orig/pwg6.txt manualByLine06_slp1.txt ../orig/pwg7.txt 

echo "manualByLine07_slp1 is from correctionwork/greek/greek_prep5_upd_edit.txt"
echo "apply manualByLine07_slp1 changes to pwg7 to get pwg8... "
python updateByLine.py ../orig/pwg7.txt manualByLine07_slp1.txt ../orig/pwg8.txt 

#echo "apply manualByLine08_slp1 changes to pwg8 to get pwg... "
#python updateByLine.py ../orig/pwg8.txt manualByLine08_slp1.txt ../orig/pwg.txt
# 12-14-2017 Meta-line conversion, with IAST
cd correctionwork/cologne-issue-190
sh redo.sh  # generates temp_pwgwithmeta2.txt
cd ../../
cp correctionwork/cologne-issue-190/temp_pwgwithmeta2.txt ../orig/pwg9.txt
python updateByLine.py ../orig/pwg9.txt manualByLine09_slp1.txt ../orig/pwg.txt

echo "END update.sh"
echo "NEXT redo_hw.sh"

The headword list and the xml version are constructed from pwg.txt. So, the xml version also
incorporates all corrections to date.
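
The chain in update.sh can be pictured as repeated application of line-keyed change sets to the previous version. Here is a minimal sketch under stated assumptions: the change record (a 1-based line number plus replacement text) and the names `apply_changes` and `build_latest` are hypothetical, since the actual format read by updateByLine.py is not shown in this thread.

```python
from typing import Iterable, List, Tuple

# Hypothetical change record: (1-based line number, replacement text).
# The real format read by updateByLine.py is not shown in this thread.
Change = Tuple[int, str]


def apply_changes(lines: List[str], changes: Iterable[Change]) -> List[str]:
    """Return a new list of lines with each change applied; the input is not modified."""
    updated = list(lines)
    for lnum, new_text in changes:
        if not 1 <= lnum <= len(updated):
            raise ValueError(f"line number {lnum} is out of range")
        updated[lnum - 1] = new_text
    return updated


def build_latest(base: List[str], change_sets: Iterable[Iterable[Change]]) -> List[str]:
    """Apply the change sets in order, in the spirit of pwg0 -> pwg1 -> ... -> pwg."""
    current = base
    for changes in change_sets:
        current = apply_changes(current, changes)
    return current


# Toy data standing in for pwg0.txt and two manualByLine files.
pwg0 = ["line one", "line twoo", "line three"]
manual01 = [(2, "line two")]
manual02 = [(3, "line 3")]

latest = build_latest(pwg0, [manual01, manual02])
print(latest)  # ['line one', 'line two', 'line 3']
```

Keeping every intermediate version, as the 'orig' directory does, means any single step can be re-run or inspected without redoing the whole chain.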

@gasyoun
Member

gasyoun commented Dec 20, 2017

> xml version also incorporates all corrections to date

Thanks for the clarification, was not sure.

@drdhaval2785
Contributor Author

I have been a silent spectator for quite some time. Remarkable achievement and speed by @funderburkjim. Once the conversion is done for all (or nearly all) dictionaries, many of the scripts can be made generic.

Some work may be needed to make content markup uniform then.

@gasyoun
Member

gasyoun commented Feb 16, 2018

Only 7 left, though MW is among them. We are actually getting close to Unicode and to understandable, standardised code, thanks to Jim.

@funderburkjim
Contributor

> make content markup uniform

Right, there will be. Will adapt some of the code conversion programs to do a survey of the markup of
the various dictionaries, with the aim of converging markup where possible.

MW will be a bear; saving it until last. I'm sure it will put up a valiant struggle when I try to make its
form more similar to that of other dictionaries.

@funderburkjim
Contributor

All the little boxes seem to be filled in now. Hurray!

Next steps will probably be to review the work, and to
a) make tags uniform across dictionaries (#87, #116)
b) document the iast conversions (#216, #227)

  1. differences between text iast and modern iast
  2. differences between coded iast and modern iast (e.g., I think mw72 has a couple of variances here)

I think we can close this issue now.

@gasyoun
Member

gasyoun commented Jun 20, 2018

> All the little boxes seem to be filled in now.

In less than a year. In a Jimless condition we would not have made it in 20.

> I think we can close this issue now.

You sure deserve it and wanted to do it long ago, so yes, yes, yes.

@funderburkjim
Contributor

@gasyoun Thanks for the encouragement! It helps.
