ap.xml issues #113

Closed
drdhaval2785 opened this issue Apr 7, 2017 · 36 comments

@drdhaval2785
Contributor

drdhaval2785 commented Apr 7, 2017

  1. The daRqa (danda) falls outside the <s> tag. See </s><lb/>.<s>

<s>divasasyAzwame BAge SAkaM pacati yo naraH </s><lb/>.<s> afRI cApravAsI ca sa vAricara modate ..</s>

As a result, the period also shows up in the web display and the stardict conversion.

Verified: the web display shows the period.

@drdhaval2785
Contributor Author

This seems to be related to the recent discussion (I can't locate it) in which text crossing a line break was given a separate <s> tag on each line.

@funderburkjim
Contributor

What is the ap headword where this occurs?

@drdhaval2785
Contributor Author

afRin

@drdhaval2785
Contributor Author

One potential enhancement which is quite useful.

AP has markings for layers of information.
These can be indented to give a proper display.

Look at superscript 2 and superscript 3.
tmp_3740-screenshot_20170408-0902341047081210

@funderburkjim
Contributor

afRin problem

Problem solved as follows.

  • The problem is an artifact of the line-break and Sanskrit-closure-per-line logic, as Dhaval suspected.
  • Here's the ap.txt data
    .{#afRin#}¦ {%a.%} (epic) ({#f#} being here regarded as a consonant) Not
     a debtor, free from debt; {#divasasyAzwame BAge SAkaM pacati yo naraH #}
    .{# afRI cApravAsI ca sa vAricara modate ..#} Mb. The normal form {#anfRin#}
     also occurs in this sense.
    
  • The problem is identified by a line beginning with '.{#' whose preceding line ends with '#}'.
    • Note: headwords are excepted. The corrected form of the above is then:
    .{#afRin#}¦ {%a.%} (epic) ({#f#} being here regarded as a consonant) Not
     a debtor, free from debt; {#divasasyAzwame BAge SAkaM pacati yo naraH .#}
    {# afRI cApravAsI ca sa vAricara modate ..#} Mb. The normal form {#anfRin#}
    also occurs in this sense.
    
  • 66 cases found like this. The full list of change transactions is here.
    • Maybe Dhaval should do a spot check on a few others to be sure they are proper.
  • Here's the new display:
    image
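
For anyone wanting to spot-check the change transactions, the fix described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was actually run: the function name `move_leading_period` is hypothetical, and treating any line that contains the broken bar `¦` as a headword is a simplification.

```python
def move_leading_period(lines):
    """Move a period that starts a '.{#' continuation line to the end of the
    Sanskrit text on the preceding line (just before its closing '#}').

    Returns (new_lines, number_of_changes). Headword lines are left alone;
    here a headword is assumed to be any line containing the broken bar.
    """
    out = list(lines)
    changed = 0
    for i in range(1, len(out)):
        cur = out[i]
        prev = out[i - 1].rstrip()
        if cur.startswith('.{#') and '¦' not in cur and prev.endswith('#}'):
            # 'naraH #}' becomes 'naraH .#}' and '.{# afRI' becomes '{# afRI'
            out[i - 1] = prev[:-2] + '.#}'
            out[i] = cur[1:]
            changed += 1
    return out, changed


if __name__ == '__main__':
    sample = [
        " a debtor, free from debt; {#divasasyAzwame BAge SAkaM pacati yo naraH #}",
        ".{# afRI cApravAsI ca sa vAricara modate ..#} Mb. The normal form {#anfRin#}",
    ]
    fixed, n = move_leading_period(sample)
    print(n)
    print("\n".join(fixed))
```

Running it on the afRin lines reproduces the corrected form shown above, apart from leading whitespace on later lines, which this sketch does not touch.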

@funderburkjim
Contributor

Change display of subsection.

It will be easier to change ap.xml, since this is done by a Python program which can handle the
Unicode characters (superscripts 1,2,3) more reliably.

Probably will add markup so .²1 -> <div n="2">1</div>, and similarly for others (Something like this is done with PW, as I recall).

The question then will be how to render the new markup in the displays.

Suggestions?

@gasyoun
Member

gasyoun commented Apr 11, 2017

Probably will add markup so .²1 -> <div n="2">1</div>, and similarly for others (Something like this is done with PW, as I recall). The question then will be how to render the new markup in the displays.

Make it visually close to the book? Bolded numbers.

bolded

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 11, 2017 via email

@gasyoun
Member

gasyoun commented Apr 11, 2017

No need to retrogress.

Bolded numbers are no retro. They catch the eye.

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 11, 2017 via email

@funderburkjim
Contributor

improved version ready for review

A version of ap.xml, with related changes to the display, is now ready for your viewing enjoyment.

The changes are evident in the basic, list, etc. displays [Not giving links because of the restricted status
of the dictionary -- I'm assuming the interested parties know how to navigate to these displays.]

However, for the sake of allowing comparisons to the previous version, I haven't yet installed the
changes in list-02.html display.

Take a look, and give me feedback.

When there is general agreement on the changes, I'll finish installation steps, and describe some details of the process.

@gasyoun
Member

gasyoun commented Apr 13, 2017

Jim, have these "glued" words always existed, or is this something new? In dA:

toexchange
one'slife
sometimes

Otherwise it's much better.

give

Still I have a long pending question.

lp

I do not like what I see on the left, the way the numbers are presented. Nobody I asked understood what L or p means. I would suggest marking them in different colours and removing the L and p tags. And keep p on one line with no break, as in http://stackoverflow.com/questions/7219007/html-no-line-break-at-hyphens

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 13, 2017

When an entry spans more than one line, the next line starts at the same indent as the numbers. See
tmp_6948-screenshot_20170413-1319541676508781

@drdhaval2785
Contributor Author

I would like something like this.
See how the 1, 2, etc. stand out from the crowd.

tmp_6948-screenshot_20170413-1321331574285965

@gasyoun
Member

gasyoun commented Apr 13, 2017

I agree, Google's spacing is well thought out.

@funderburkjim
Contributor

funderburkjim commented Apr 13, 2017

The 'indent' question is one I struggled with.

I first tried the css text-indent property. But ended up using a 'position:relative; left:2em;' style.

I think that the 'hanging' part of text-indent might give the feature you suggest, but it is not currently implemented in browsers, according to MDN and to my own experiments with it.

If you can show me how to implement the indentation style your image shows, I'll be glad to
use it.

@funderburkjim
Contributor

funderburkjim commented Apr 13, 2017

Regarding the 'L=', etc. comments.

I also think the current format is awkward.

Currently, the whole part of the basic display is a table with 2 columns; the 'key1, L=,p=' part is in
the first column, and the main entry is in second column.

What about making it just one column, and changing the labeling? [Idea implemented experimentally -- take a look.]

@funderburkjim
Contributor

toexchange

This is a bug in the revised make_xml.py program. Bug now corrected. Good catch! 👍

@drdhaval2785
Contributor Author

drdhaval2785 commented Apr 16, 2017

@funderburkjim

First few lines in ap.txt

.{#a#}¦ The first letter of the alphabet; {#akzarARAmakAro'smi#}  Bg. 10. 33.
.{#{@-aH@}#} [{#avati, atati sAtatyena tizWatIti vA; av--at vA, qa#} Tv.]

Whereas it is rendered in ap.xml as

<H1><h><key1>a</key1><key2>a</key2></h><body><s>a</s>   The first letter of the alphabet; <s>akzarARAmakAro'smi</s>  Bg. 10. 33.<lb/>.<b><s>-aH</s></b> [<s>avati, atati sAtatyena tizWatIti vA; av--at vA, qa</s> Tv.]<lb/>

Have a look at <lb/>.<b>.
A superfluous period here.
.{# signifies starting of a chunk in AP. No need to keep the period there.
@funderburkjim what is your take?

@drdhaval2785
Contributor Author

The euro character has not been removed. It identifies verbs, but we should identify the verb numbers and tag them in the XML rather than keep the euro character.

<H1><h><key1>aMh</key1><key2>aMh</key2></h><body><s>aMh</s>   €1A <s>aMhate, aMhituM</s> To go; approach; set out; Bk. 3. 25,<lb/>46, <s>AnaMhe cAntikaM pituH</s> 14. 51, 4. 4. &amp;c. <i>-Caus.</i><lb/>.²1 To send; <s>tamAYjihanmETilayajYaBUmiM</s> Bk. 2. 40, 15. 75.<lb/>.²2 To shine.<lb/>.²3 To speak.</body><tail><L>20</L><pc>0002-2</pc></tail></H1>

See €1A

@funderburkjim
Contributor

.{# and new xml structure

Here is the first part of headword 'a' in the revised xml:

<H1><h><key1>a</key1><key2>a</key2></h><body><s>a</s>   The first letter of the alphabet; 
<s>akzarARAmakAro'smi</s>  Bg. 10. 33. 

<div n="?"><b><s>-aH</s></b> [<s>avati, atati sAtatyena tizWatIti vA; av--at vA, qa</s> Tv.]</div> 

<div n="2" name="1">1 N. of Viṣṇu, the first of the three sounds constituting the sacred syllable <s>om; akAro vizRuruddizwa ukArastu maheSvaraH . makArastu smfto</s> <s>brahmA praRavastu trayAtmakaH ..</s> For more explanations of the three syllables <s>a, u, m</s> see <s>om</s>.</div> 

I think this 'div' structure takes care of your concerns there. @drdhaval2785 Agree?

@funderburkjim
Contributor

€ and roots

There are 3068 lines matching €, and these do appear to be roots.

As usual, there are multiple forms that need to be identified, and some likely errors also.

We could employ an xml markup similar to that of MW; here's a sample of MW under hw aMS

<vlex type="root"></vlex> <vlex>cl.10 P.</vlex> 

And here is the full record for the root aMh in MW:
<H1><h><hc3>503</hc3><key1>aMh</key1><hc1>1</hc1><key2>aMh</key2><hom>1</hom></h>
<body> <vlex type="root"></vlex> <p><cf/>~<root/>~<s>aNG</s></p> <vlex>cl.1 A1.</vlex> 
<s>aMhate</s> , <c><to/>to_go_,_set_out_,_commence</c> <ls>L.</ls> <msc/> <c>
<to/>to_approach</c> <ls>L.</ls> <msc/> <vlex>cl.10 P.</vlex> <s>aMhayati</s> , <c>
<to/>to_send</c> <ls>Bhat2t2.</ls> <msc/> <c><to/>to_speak</c> <ls>Bhat2t2.</ls> <msc/> <c>
<to/>to_shine</c> <ls>L.</ls> </body><tail><mul/> <MW>000068</MW> <pc>1,2</pc> <L>107</L><mscverb/></tail></H1>

In AP case, we could render €1A as <vlex>1A</vlex> in ap.xml.

While most cases are simple like this, it will take some work to completely handle all relevant cases.
Here are some other forms that catch my eye:

.{#Acakz#}¦ €2Ā.   Notice the Macron on the A
.{#AGf#}¦ €10P. or {%Caus.%} To pour down upon, sprinkle.  should the 'Caus' be in scope of <vlex>?
.{#ArAD#}¦ €5, 10 P.
.{#akz#}¦ €15P.      Probably should be at least 1,5P  (with added comma)
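
A first pass over the simple cases could look something like the sketch below. It is only a rough illustration under stated assumptions: the function name `tag_vlex` and the regular expression are hypothetical, only the letters P, A and Ā seen in the samples above are recognized, and odd forms such as €15P. are tagged as they stand rather than corrected. Anything still containing € afterwards would need manual review.

```python
import re

# One or more class numbers (optionally comma-separated), then P, A or Ā,
# then an optional period, e.g. '1A', '2Ā.', '10P.', '5, 10 P.'
VLEX_RE = re.compile(r'€\s*([0-9]+(?:\s*,\s*[0-9]+)*\s*[PAĀ]\.?)')


def tag_vlex(text):
    """Replace '€' + class marker with <vlex>class marker</vlex>."""
    return VLEX_RE.sub(r'<vlex>\1</vlex>', text)


def needs_review(text):
    """True if a euro sign survived tagging, i.e. the form was not recognized."""
    return '€' in tag_vlex(text)


if __name__ == '__main__':
    for sample in ['<s>aMh</s>   €1A <s>aMhate, aMhituM</s>',
                   '€2Ā.', '€10P. or <i>Caus.</i>', '€5, 10 P.']:
        print(tag_vlex(sample))
```

Note that in this sketch 'Caus.' stays outside the <vlex> element; whether it belongs inside is the open question raised above.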

I have so many things on my todo list that I am reluctant to volunteer to do the needed work to
add this markup to the xml form of ap.


@funderburkjim
Contributor

Since there have been no additional comments regarding the revised ap displays (in particular regarding the adjustment to the handling of [p=123][L=345]), I'll assume that it is safe to go ahead and complete the full installation of the current revised xml and the corresponding revisions to the displays.

@funderburkjim
Contributor

documentation of adding <div> markup to ap.xml

The ap.txt digitization has a form of markup for sections. This ap.txt markup is identified by lines
that start with a period. Before markup can be added to ap.xml, it is necessary to classify the various
types of lines that start with '.'; and correct mistakes that impede the classification.

There are four categories of lines in ap.txt beginning with periods:

  • Headwords: example .{#a#}¦ The broken bar is part of the identification of this class. These do
    not require additional markup.
  • superscript 2 cases: .²1 N. of Viṣṇu ... these are of the form .² + digit-sequence.
  • superscript 3 cases: .³({%a%}) N. of Viṣṇu ... these are of the form .³ + ({%X%}), i.e., an italicized letter in parentheses.
  • others: For instance .{@{#-aH#}@} (not a headword, since there is no broken bar).

To decide the superscript cases, the ap.txt was filtered using re.search(u'(.)?([²³][^ ]*)',line),
and the categories printed out. See filter_test1_cases_orig.txt here.

As you see, there are some garbagey looking cases. The most labor-intensive part of the exercise is to identify and correct these. The corrections are in the 'superscript_changes.txt' file of the gist just mentioned.

After correction, the set of superscript cases is quite regular, as described above; this list is in the 'filter_test1_cases.txt' file of the gist. We now are ready to change these ap.txt markups to ap.xml markups.
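
To make the four categories concrete, here is a minimal classification sketch. It is not the filter that produced the gist files; the function name `classify`, the category labels, and the extra 'irregular' bucket (for superscript lines that do not fit the regular forms, i.e. candidates for correction) are assumptions made for illustration.

```python
import re
from collections import Counter

HEADWORD_RE = re.compile(r'^\.\{#[^#]*#\}¦')    # .{#a#}¦
SUP2_RE = re.compile(r'^\.²[0-9]+')             # .²1
SUP3_RE = re.compile(r'^\.³\(\{%[^%]+%\}\)')    # .³({%a%})


def classify(line):
    """Return the category of a line starting with '.', or None otherwise."""
    if not line.startswith('.'):
        return None
    if HEADWORD_RE.match(line):
        return 'headword'
    if SUP2_RE.match(line):
        return 'superscript2'
    if SUP3_RE.match(line):
        return 'superscript3'
    if line.startswith('.²') or line.startswith('.³'):
        return 'irregular'      # superscript marker in an unexpected form
    return 'other'


def tally(lines):
    """Count how many lines fall into each category."""
    return Counter(c for c in map(classify, lines) if c is not None)


if __name__ == '__main__':
    print(tally(['.{#a#}¦ The first letter', '.²1 N. of Viṣṇu',
                 '.³({%a%}) N. of Viṣṇu', '.{@{#-aH#}@} [', ' plain text']))
```

A tally like this, run before and after the corrections, would show the 'irregular' count dropping to zero once the superscript cases are regular.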

Converting to markup in ap.xml

This is done by the 'make_xml.py' program, and the process involved two phases:

  • First Add an opening div tag: Accomplished in the adjust_xml function.
    • .²5 -> <div n="2" name="5">5
    • .³({%a%}) -> <div n="3" name="a">({%a%})
    • .{@{#-aH#}@} -> <div n="?">{@{#-aH#}@}
      • Maybe at some other time we will want to further classify these cases.
  • Second Add a closing </div> tag. This is done by the close_divs function.
    We want this to be at the end of the scope of the opening
    tag. The easiest way is to say that the scope of an opening <div> goes all the way up to the next
    opening <div>.
    • There probably is more hierarchical structure than this markup choice identifies. We just have
      implemented a simple sequence of div elements,
      and no div element is a 'child' of another div element.
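
The two phases can be sketched as below. This is a simplified illustration, not the code in make_xml.py: the names `open_div` and `close_divs` are stand-ins for the adjust_xml and close_divs functions mentioned above, the real functions work on the full record (including <lb/> handling and other conversions), and the headword test by broken bar is an assumption.

```python
import re


def open_div(line):
    """Phase 1: turn a leading section marker into an opening <div> tag."""
    m = re.match(r'^\.²([0-9]+)', line)
    if m:
        # '.²5 ...' -> '<div n="2" name="5">5 ...'
        return '<div n="2" name="%s">%s' % (m.group(1), line[2:])
    m = re.match(r'^\.³\(\{%(.+?)%\}\)', line)
    if m:
        # '.³({%a%}) ...' -> '<div n="3" name="a">({%a%}) ...'
        return '<div n="3" name="%s">%s' % (m.group(1), line[2:])
    if line.startswith('.') and '¦' not in line:
        # any other non-headword line starting with a period
        return '<div n="?">' + line[1:]
    return line


def close_divs(body):
    """Phase 2: close each <div> just before the next one opens (or at the end).

    The result is a flat sequence of div elements; none is a child of another.
    """
    first = body.find('<div ')
    if first == -1:
        return body
    head, rest = body[:first], body[first:]
    pieces = rest.split('<div ')[1:]    # drop the empty piece before the first div
    return head + ''.join('<div ' + p + '</div>' for p in pieces)


if __name__ == '__main__':
    lines = ['To go; approach;', '.²1 To send;', '.²2 To shine.', '.²3 To speak.']
    print(close_divs(''.join(open_div(l) for l in lines)))
```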

@drdhaval2785
Contributor Author

I think this 'div' structure takes care of your concerns there. @drdhaval2785 Agree?

Yes

@drdhaval2785
Contributor Author

There probably is more hierarchical structure than this markup choice identifies. We just have
implemented a simple sequence of div elements,
and no div element is a 'child' of another div element.

There may be some hierarchy uncaptured. But at least we made a start. Slowly we will inch there too.

@drdhaval2785
Contributor Author

simple sequence of div elements

Is sufficient to properly indent the display. Good leap in readability and user friendliness.

@funderburkjim
Contributor

documentation of the html rendering of the divs

The html rendering of the <div> elements is done in the disp.php program of web/webtc .

This program is used directly by the 'basic' display. Since the other displays (list display, advanced search, mobile-friendly, apidev/sample/list-02.html, etc.) piggy-back on disp.php, the change is reflected in all the displays.

step 1. make the first 'word' of the div bold

This is done by rewriting the xml at the start of each div; here the $line variable contains the entire xml line: <H1>...<body>...<div..>W ...</div>...</body>...</H1> and we replace W by <b>W</b>.

 $line = preg_replace('|(<div[^>]*>)(\(<i>.</i>\))|','\\1<b>\\2</b>',$line);
 $line = preg_replace('|(<div[^>]*>)([0-9]+)|','\\1<b>\\2</b>',$line);

Note that this applies to the two superscript cases of ap.txt. For the other div case, we assume the
element is already bold.
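
For readers more at home in Python than PHP, the same two substitutions can be written with re.sub. This is just a transliteration for illustration (the function name `bold_div_labels` is hypothetical); the display itself uses the PHP shown above.

```python
import re


def bold_div_labels(line):
    """Wrap the leading label of each <div> in <b>...</b>."""
    # superscript 3 case: an italicized letter in parentheses, e.g. (<i>a</i>)
    line = re.sub(r'(<div[^>]*>)(\(<i>.</i>\))', r'\1<b>\2</b>', line)
    # superscript 2 case: a digit sequence, e.g. 12
    line = re.sub(r'(<div[^>]*>)([0-9]+)', r'\1<b>\2</b>', line)
    return line


if __name__ == '__main__':
    print(bold_div_labels('<div n="2" name="1">1 N. of Viṣṇu</div>'))
    print(bold_div_labels('<div n="3" name="a">(<i>a</i>) N. of Viṣṇu</div>'))
```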

step 2. indentation

For the 'n=2' superscript, we indent by 1em and for the 'n=3' superscript, by 2em. The other kind of div is not indented.

  } else if ($el == "div") {
   $n=$attribs['n'];
   if ($n == '3') {
    $style="position:relative; left:2em;";
    $row .= "<br/><span style='$style'>";
   }else if ($n == '2') {
    $style="position:relative; left:1em;";
    $row .= "<br/><span style='$style'>";
   }else {
    $style="";
    //$row .= "<p style='$style'>";
    $row .= "<br/><span style='$style'>";
   }

This occurs in the context of a SAX xml parser, using the xml_parser_create and related functions of PHP; this is a php version of the expat parser for xml. It is likely that there is an expat parser that could be used in a browser's Javascript code to do all this rendering in the browser.

Anyway, the php parser essentially does a tree-walk of the xml structure, and when it encounters a <div> element, it examines the n attribute value, and based on this value (2,3 or ?) constructs a style element that introduces an extra 'left' indentation to the span element that contains the subsequent text of the <div>.

Then when the end of the <div> is encountered later in the tree walk, the closing </span> element is inserted into the html stream under construction.

In an earlier comment above, I suggested using an empty div element for the xml markup. However, I changed this to enclose the entire division <div...>text of the div</div>, and a major
reason was so that the rendering method above would know where to put that closing </span> element.
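
The start/end handling described above can be illustrated with Python's own expat binding. This is a minimal sketch, not a port of disp.php: the function name `render` and the decision to pass through only <b> tags are assumptions, and character data is emitted without re-escaping, which a real display would have to handle.

```python
import xml.parsers.expat

# indentation styles keyed by the n attribute of <div>
INDENT = {
    '2': 'position:relative; left:1em;',
    '3': 'position:relative; left:2em;',
}


def render(xml_line):
    """Walk the xml with expat and build an html string for the div markup."""
    out = []

    def start(name, attrs):
        if name == 'div':
            style = INDENT.get(attrs.get('n'), '')
            out.append("<br/><span style='%s'>" % style)
        elif name == 'b':
            out.append('<b>')

    def end(name):
        if name == 'div':
            # the enclosing <div>...</div> tells us where the span closes
            out.append('</span>')
        elif name == 'b':
            out.append('</b>')

    def chars(data):
        out.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_line, True)
    return ''.join(out)


if __name__ == '__main__':
    print(render('<body>To go;<div n="2" name="1"><b>1</b> To send;</div>'
                 '<div n="2" name="2"><b>2</b> To shine.</div></body>'))
```

With an empty (unclosed) div marker there would be no end-element event at the right place, which is why the enclosing form makes the closing </span> straightforward.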

That's it. Not too hard, once all the context is understood.

@gasyoun
Member

gasyoun commented Apr 19, 2017

In AP case, we could render €1A as <vlex>1A</vlex> in ap.xml.

Agree.

.{#AGf#}¦ €10P. or {%Caus.%} To pour down upon, sprinkle. should the 'Caus' be in scope of <vlex>?

It should not. Would even some general verbal tag not do for causative forms? It is certainly better than nothing, but since the abbreviation Caus. occurs in many dictionaries, it could be used to regex them out and give them the markup they deserve.

.{#akz#}¦ €15P. Probably should be at least 1,5P (with added comma)

Yeah, indeed, plenty of issues.

@funderburkjim
Contributor

adjustment to css of list-0.2.html display

There was an annoying side-effect of the indentation, when viewed in the list-0.2.html display.
Namely, part of the lines were hidden under the scrollbar.

This is no doubt due to the relative positioning technique used for indentation.

To improve this, a 'padding-right:15px' css rule was added to the list-0.2 display

tech note: in file apidev/css/basic.css, at #CologneBasic table.display rule.

This improves the situation for ap, and simply adds a little space at the right for other displays.

This was referenced Apr 19, 2017
@funderburkjim
Contributor

Everything now installed.

@gasyoun
Member

gasyoun commented Apr 19, 2017

(in particular regarding the adjustment to the handling of [p=123][L=345] )

p and L mean nothing to anybody. I know them, but none of my students could grasp them without my telling them what they are.

There may be some hierarchy uncaptured. But at least we made a start. Slowly we will inch there too.

Yeah.

For the 'n=2' superscript, we indent by 1em and for the 'n=3' superscript, by 2em. The other kind of div is not indented.

I would add a that is not n=2. I would add a CSS. And in the CSS file I could change and play around and see if 2 is an appropriate choice.

To improve this, a 'padding-right:15px' css rule was added to the list-0.2 display

Makes sense, indeed. Let me write my proposals in a new thread.

@funderburkjim
Contributor

p and L ...

Did you notice the change in AP display? Isn't this better?
image

@gasyoun
Member

gasyoun commented Apr 19, 2017

Isn't this better?

Do not think so.

[record id=394] [scan-page 0030-2]

  1. These are not obvious either. Is the record ID in the book, and where does one find it? What is the -2 in the number?
  2. I would make it like 0030-2: move the mouse over it and see the title attribute I've used. We can have a longer explanation there. The only thing is that the ID should be made a link as well. At least a link to # can be made, a fake one, no big issue. And I would add a CSS class with a brighter colour, not black, for these numbers.

In my view, if explained in a FAQ and in the tooltip for each link, it's OK to have:
[ID 394] [P 0030-2]

@funderburkjim
Contributor

Do not think so.

Sorry to hear that. Will add your suggestion to todo list (currently 7 deep).

@gasyoun
Member

gasyoun commented Apr 19, 2017

Sorry to hear that.

It's not about me. I understand, sure. But it's based on how I saw @Shalu411 use it initially, and on people for whom English is not their mother tongue. The abbreviations are not obvious and need explanation. Even if the labels are longer, a commentary and an intro are wanted.
