PWG meta-line/IAST conversion #190
PWG and PW were my most wanted. |
Adjust page breaks within
|
So in 1751 cases the source was broken and therefore unrecognized. I guess we can recognize the sources, add markup and restore them; or will the line breaks be left everywhere except within literary sources? |
All the line breaks are present; they are just offset slightly so they don't occur in the middle of a literary source. |
Oh, understood. |
Seems worthwhile. |
You know it is. Based on it one day we can add hyperlinks.
If it's weeks - yes. If months - no. I would divide the trivial and leave the non-trivial if a solution can't be found in a week. |
Does not seem so. @SergeA any clue? |
All these examples show transliterated words, either terms (gaṇa, avj.) or names (Viṣṇu, Śaṁkar.), given in full or abbreviated form, which are neither German words, nor Sanskrit headwords, nor Sanskrit quotations. So they are printed in a distinctive way to separate them easily from other text. In this connection I want to mention the output I've seen in MW. There, transliterated terms such as "Vedas" are rendered as "वेदs" when Devanagari output is selected. That's not good at all. Transliterated terms and names must be treated separately from real Sanskrit words (stems and quotations). The term "Vedas" should be rendered in Latin letters, no matter which output is selected. Separate markup for these words in the digitization also makes it possible to change the outdated transliteration scheme, etc. |
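A minimal sketch of the rendering rule proposed above, assuming Sanskrit text is marked as {#X#} (the markup used in pwg.txt) and that everything outside those spans is left in Latin letters. The `render` function and the `fake_devanagari` stand-in are hypothetical names used only for illustration; a real display would plug in an actual transliterator.

```python
import re
from typing import Callable

# {#X#} marks real Sanskrit text; anything outside such spans is left as-is.
SANSKRIT_SPAN = re.compile(r"\{#(.*?)#\}")


def render(text: str, transliterate: Callable[[str], str]) -> str:
    """Apply `transliterate` only inside {#...#} spans."""
    return SANSKRIT_SPAN.sub(lambda m: transliterate(m.group(1)), text)


def fake_devanagari(s: str) -> str:
    # Hypothetical stand-in for a real transliterator, so the sketch runs.
    return f"[deva:{s}]"


sample = "The Vedas mention {#agni#}."
print(render(sample, fake_devanagari))  # "Vedas" is untouched
```

With this separation, a term like "Vedas" never reaches the transliterator, whichever output is selected.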
You make an interesting observation. I've opened an MWS issue as a placeholder for responding; I don't want to divert right now from thinking about PWG.
avj.
Many of the words seem to be Sanskrit words. But what kind of abbreviation is avj.? If not Sanskrit, maybe this is miscoded.
Reference of all instances
For reference, this file has a listing of the current instances, with frequency. In this file, there are many letter-number codings. As a later step in this conversion, I'll transcode these I started using the |
I suppose avj. may stand for avjaja (avyaya) = indeclinable. |
Nax. question
Nax. is coded as 'wide' text 250 or so times. A random sample of these indicates that they are all part of some literary source reference related to WEBER. It will have a chance to get transcoded to IAST by virtue of being part of a literary source. I think this should not be coded as 'wide'. Any objections, @SergeA? |
More wide text with literary sources
Nax. and avj., mentioned above, have the common feature that they are coded as 'wide' text occurring within the scope of a literary reference. So they are recognizable as having form
How to read Agni?
Here's an instance with Agni. How should this literary source be read? |
By comparison with the corresponding RV text, the reading is: And also: As I noticed, the numbers differ visually according to the level of the text division: the number for the largest section is the tallest and also black, while the number for the verse is the smallest. This representation makes the references more readable. |
Nax. is explained in the sources |
Indian users would disagree.
Yeah, in the past Jim was able to represent the levels with font sizes at a REGEX level.
Small caps is reserved for sources only, right? |
Just a note to let others know that progress is being made in the |
Good to know. |
This round of work on pwg is primarily over. The current status is that the Basic, List, Adv. Search, and mobile1 displays are all based on the new form of the data. The data used in the list-0.2, list-0.2s displays is based on the prior form of the data. My next task will be to document what has been done, and some things that remain to be done some other time. I'll also work to get caught up with the comments others have made while I've been on this pwg excursion. I'll pull list-0.2(s) to use the current data when a bit of time has passed and we don't need to look at the old form for comparison. |
So be it. Was missing you on this trip around (or rather inside) the world. |
Summary of changes to pwg
The main changes were to the base digitization, pwg.txt. These changes flowed through to similar changes in pwg.xml. In addition, a few differences in the base display were introduced.
pwg-meta2
Most of the changes to pwg.txt are quite technical in nature. One way to understand these changes
Sample comparison
An intuitive understanding of the changes is given by a close reading of comparable entries in the previous and current versions of the digitization. Since it is short, the first entry is a good place to start.
PREVIOUS (pwg8.txt) [This is just one line in pwg8.txt; I've introduced line breaks so this comment will be easier to read]
CURRENT (pwg.txt)
|
Guided tour of the comparison
Header
The old header is
Sanskrit text and italic text
These are identified in the same way in both forms: {#X#} and {%X%}. [No italic text in this example].
Literary source
OLD : The first difference is simply a change of notation: from
iast text
Words appearing in the original digitization with coding In the old form, X is coded in the AS (letter-number) system (e.g. gan2a). Examination of the instances throughout the text led me to believe that X is always a Sanskrit word appearing in Roman alphabet with diacritics. The text author uses his own system of diacritics. With this assumption, X in the new coding is transformed to modern IAST. So, gan2a becomes gaṇa, and The distinct occurrences of these are relatively rare. I'll discuss this more fully in a separate issue. Incidentally, note closely the position of the period in the second example: |
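The AS-to-IAST transcoding described above can be sketched roughly as follows. Only the pair n2 → ṇ is confirmed by the example in this thread (gan2a → gaṇa); the other table entries, and the names `AS_TO_IAST` and `as_to_iast`, are assumptions for illustration and would need to be checked against the actual AS table and the author's own diacritic system.

```python
import re

# Letter-number (AS) codes mapped to IAST characters.
AS_TO_IAST = {
    "n2": "ṇ",  # confirmed by the thread: gan2a -> gaṇa
    "a1": "ā",  # assumption, not confirmed in this thread
    "s2": "ṣ",  # assumption, not confirmed in this thread
}

# An AS code is taken here to be one letter followed by one digit.
AS_CODE = re.compile(r"[A-Za-z][0-9]")


def as_to_iast(word: str) -> str:
    """Replace known AS codes; leave unknown codes untouched for review."""
    return AS_CODE.sub(lambda m: AS_TO_IAST.get(m.group(0), m.group(0)), word)


print(as_to_iast("gan2a"))
```

Leaving unknown codes unchanged, rather than guessing, makes it easy to list the remaining letter-number codings afterwards.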
Divisions
The other major difference in the two forms regards coding of subdivisions within an entry.
letter divisions
The second entry shows letter divisions:
NEW
Compare In comparing
It seems to be a feature of the print that the first division of a sequence has no em-dash.
number divisions
Number divisions are similar. Compare
Greek alphabet divisions
Compare
Number, Letter, Greek hierarchy
I think the prevailing hierarchy principle is: Numbers > Letters > Greek.
Prefix verb forms
For verb entries, the author uses a generally consistent system of presenting prefix forms.
NEW
So the general conversion is
Note that the prefix in question generally appears just after the division markup (space + {#X#}), so
Vgl. divisions
It seemed to me that there is a common pattern which should be marked as a division, although the
Vgl. is an abbreviation:
|
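As an illustration of the Numbers > Letters > Greek hierarchy mentioned above, here is a minimal sketch that nests a flat sequence of divisions into a tree. The `(kind, label)` input format, the labels, and the function name `build_tree` are assumptions for the example, not the actual pwg.txt division markup.

```python
# Rank of each division kind: a lower rank is higher in the hierarchy.
RANK = {"number": 0, "letter": 1, "greek": 2}


def build_tree(divisions):
    """Nest a flat list of (kind, label) pairs by Numbers > Letters > Greek."""
    root = {"kind": None, "label": None, "children": []}
    stack = [(-1, root)]  # (rank, node); the root is never popped
    for kind, label in divisions:
        rank = RANK[kind]
        # Close every open division of the same or a lower level.
        while stack[-1][0] >= rank:
            stack.pop()
        node = {"kind": kind, "label": label, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((rank, node))
    return root["children"]


# Hypothetical entry: 1) a) α) b) 2)
flat = [("number", "1"), ("letter", "a"), ("greek", "α"),
        ("letter", "b"), ("number", "2")]
tree = build_tree(flat)
```

Here division 1 receives the children a and b, α hangs under a, and division 2 starts a new top-level branch.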
Other conversion details
lex tag
lang tag
Various language tags are all changed to the
More special cases
|
xml markup
With the new form of the pwg.txt digitization, the xml form pwg.xml is only modestly different from pwg.txt. Here are the main differences
Meta-line
The meta line elements get converted to xml elements as follows:
body
The non-meta lines are put into the
n attribute of
|
Display features
Many of the changes to pwg.txt do not show up as differences in the html displays for pwg, since
However, there are a couple of display-visible differences:
|
This ends the comments that come to mind on the general features of the conversion. Additional issues will provide more details regarding |
Makes sense. What a tremendous work!
"I think the prevailing hierarchy principle is: Numbers > Letters > Greek." Agree.
That is the only big feature I still badly miss myself.
Agree on Vgl. = compare.
Agree.
Agree, not worth the trouble in 2018.
I disagree. It did help a lot not getting lost. I would want to see it as it is still possible, Jim. It's brilliant even as it was.
Indeed, but I would go for a CSS class and not just hard coding. But let it be, it's just the puritan in me. Hard coding was old school even in 1999, the year I launched my first website. And by the way, I'm in St. Petersburg right now, just a few hundred metres from where the Dictionary was printed. |
@gasyoun Thanks for the feedback, Marcis. I'll take a look at mimicking
Is there some memorial in St. Pet. that identifies the spot where PWG was printed? Right now, it's more convenient to embed styles in disp.php, since there are different CSS files for the different displays (Basic, List, etc.). By putting them in disp.php, all displays get the benefit; otherwise I'd have to change multiple CSS files. Not bragging about this arrangement for sure, but that's the way it is now. Curious about your opinion on the use of tooltips for LS references, rather than a link to a popup. |
The list-0.2s display now also is based on the new form of data for pwg. |
Time to make it public? |
Added a link to list-0.2s on home page. Needs documentation. Hint Hint! |
User comment re display details:
User Odile Caujolle made a comment regarding the popup LS references in MW, and I asked her to
What do others think? |
Can only agree. |
Blue is blue, but the reference sizes are gone and we sure want to see them back, as bad as they are - they give a visual hint. |
Tool-tips are good for abbreviations, and are not so good for the sources. For me the great benefit of the pop-up window for sources is the possibility of easily copying the text. With tool-tips this copying becomes impossible. Is it possible to give the sources tool-tips but also keep them clickable for the pop-up window? Then the copying functionality will not be lost. The pop-up works fine in my Firefox, but in Chrome the line position is sometimes slightly misplaced. |
copying tooltip text
The pwg display example uses the default tooltip, so behavior is governed by the browser's internal (and not modifiable) behavior. The jQueryUI Tooltip widget provides for customization. After half an hour of research, I found no immediate customization that permits copy-pasting from the tooltip text, but I suspect that this could be done. Also, Bootstrap has tooltip functionality that might be customizable in this way.
An example from wikipedia
Look at this example (from Vedic Sanskrit article). If you hover over one of the superscript numbers, you get a nicely formatted little 'tooltip' and you can move the mouse into it and copy/paste from it. This looks like a very nice solution to me. What do you think? |
Yes, this works. |
Indeed. A copy-pastable tooltip would be a solution. |
OK. I'll put this on the todo list. It remains to figure out how to extract this particular piece of web technology, @gasyoun |
Jim, let me explore. |
@artforlife any idea? |
After sending the only remaining item to #321, this issue is safe to close. |
This is a placeholder for questions which arise in the course of this conversion, which will begin in a few weeks.
I'm starting this issue now to have a place for this link to a related question.