misrendered PDF (unreadable trailing "d") #5507

dkg · 2023-04-18T21:21:23Z

Describe the issue

Section 4.1.2.2 of the PDF form of draft-ietf-lamps-e2e-mail-guidance-06 makes it look like there is a missing "d".

The text in that section says:

└┬╴multipart/encrypted
 ├─╴application/pgp-encrypted
 └─╴application/octet-stream

But it renders without the trailing "d" on application/pgp-encrypted.

I can't replicate this misrendering with my own toolchain (this pdf was generated by the datatracker directly), so i don't know what specifically is causing the problem.

kesara · 2023-04-19T02:12:20Z

This looks like an issue with the datatracker's HTMLized PDF generation, since xml2rfc generates all formats correctly.

larseggert · 2023-04-19T10:19:18Z

CC @martinthomson

larseggert · 2023-04-19T13:13:16Z

There is also an issue with the asterisks in the list following the figure.

The short answer here is that weasyprint's CSS support is real bad. We need to find a hack that makes it work.

cabo · 2023-04-19T15:07:59Z

Here are reviewer comments on https://datatracker.ietf.org/doc/pdf/draft-ietf-jsonpath-base-13


The bullets on the bottom of page 24 (why are there no page numbers?)
and top of page 25 have no spaces after them:

 - equal objects with no duplicate names…

   oboth objects …
   ofor each of those names, …

And on closer inspection that seems to be the case for fifth or sixth
level bullets generally.

The other thing is that the heading “Parameters:” for each of the
extension functions is mangled such that it reads “ParameteXs” where X
is an overstrike of “r” and “1”. It looks like the parameters are a
numbered list and the spacing is off. By the time you get down to 2.4.7,
you can see that the second item in the list begins “2.” at the same
horizontal location as the “1” that mangles the “r” in parameters.

Honestly, I have no idea why we are torturing people with this form of pdfization if it is so easy to do the real thing.

larseggert · 2023-04-19T15:19:42Z

You mean PDFize the HTML? Happy to switch to that, but users will need to be OK with that (past feedback wanted the text version PDFs.)

cabo · 2023-04-19T15:22:32Z

> You mean PDFize the HTML? Happy to switch to that, but users will need to be OK with that (past feedback wanted the text version PDFs.)

I'm not sure I understand the terminology, but I don't understand why we have to have a different (and vastly inferior) rendering from the (mostly) debugged one that is provided by xml2rfc.

cabo · 2023-04-19T15:24:07Z

> (past feedback wanted the text version PDFs.)

(Trying to interpret the terminology:
Note that the .txt files have page numbers; the datatracker PDFs don't.
What are they?)

larseggert · 2023-04-19T15:27:27Z

They are based on the plaintextified HTML using @martinthomson's CSS. (Same as is shown for HTMLized.)

cabo · 2023-04-19T15:33:02Z

> They are based on the plaintextified HTML using @martinthomson's CSS. (Same as is shown for HTMLized.)

Yes. Again: I don't see the point to do this when we can do the real thing.

Of course, using typewriter style conceals the fact that we cannot do standard typography correctly yet.
But the disadvantages outweigh the lack of pleasance of that inconvenient revelation.

larseggert · 2023-04-19T16:09:00Z

As I said, we can "do the real thing". It will require the community to agree that this is what they want. Past feedback indicated they wanted PDFs of text versions.

cabo · 2023-04-19T16:10:28Z

Note that this isn't an alternative. We could produce both real and fake PDFs, if the latter are really needed, just like we have html and "htmlized" (which no longer is).

rjsparks · 2023-04-19T22:07:33Z

If we can remove codepaths and alternate representations in favor of a smaller, easier to explain set, then we should do that. But I don't think we can.

Right now we do not require submission in xml. If someone provides only plain text, we do htmlize and then pdfize that. We cannot use xml2rfc to produce html or pdf.

We can't really stop providing the -ized formats when we do have v3 xml, because that would force people (and systems like wikipedia) to have to learn to point to different types of things depending on the underlying arcana of the submission, which is really a non-starter.

So, I don't think we can make less, and I cringe at the proposal to make more because of the confusion it causes.

cabo · 2023-04-19T22:35:59Z

On 2023-04-20, at 00:07, Robert Sparks wrote:

> If we can remove codepaths and alternate representations in favor of a smaller, easier to explain set, then we should do that. But I don't think we can.
>
> Right now we do not require submission in xml. If someone provides only plain text, we do htmlize and then pdfize that.

That is fine for me — punishes the plain text submitters as they deserve :-)

> We cannot use xml2rfc to produce html or pdf.
>
> We can't really stop providing the -ized formats when we do have v3 xml, because that would force people (and systems like wikipedia) to have to learn to point to different types of things depending on the underlying arcana of the submission, which is really a non-starter.

I’m having a hard time believing that would be a problem for PDF. I don’t even think this would be a problem for HTML. The RFC editor can provide proper HTML and proper PDF for those RFCs that have that (8650+), and get by with -ized surrogates for the others. What is so special about datatracker that it can’t do that?

> So, I don't think we can make less, and I cringe at the proposal to make more because of the confusion it causes.

I don’t know why we need the confusion between htmlized plaintext, typewriterized html, and real html. Having plaintext’s page numbers might be a reason, but we have botched that, too.

Grüße, Carsten

cabo · 2023-04-19T22:40:59Z

And remember that the reason this ticket exists appears to be that the CSS that comes with typewriterized html blows the little mind of weasyprint. Getting rid of typewriterized HTML would solve this problem right there.

dkg · 2023-04-20T01:42:29Z

i don't know how to solve this problem, but i appreciate that y'all are looking into it. Maybe it's worth looping in the WeasyPrint developers as well? @grewn0uille, perhaps you have some hints about how the datatracker can align its CSS with what WeasyPrint supports? or maybe WeasyPrint can use the current datatracker CSS as a source of feature requests/plans for improvement?

larseggert · 2023-04-20T08:07:53Z

We could of course PDFize the text versions, but that would mean no links in PDF and ASCII figures.

grewn0uille · 2023-04-20T09:26:05Z

Hi @dkg,
We can have a look at this to find what’s wrong with the current CSS!

dkg · 2023-04-20T12:12:56Z

@martinthomson, can you provide a minimized reproducer to @grewn0uille, or at least point him to the inputs (html + CSS) passed to weasyprint for draft-ietf-lamps-e2e-mail-guidance-06 by the datatracker?

martinthomson · 2023-04-24T05:42:28Z

Two issues appear to be going on here:

1. The figure.
2. The list.

The figure (1) is drawn using flexbox. There are three items there. The first is a 3ch gutter, generated with a ::before rule. The second is the figure itself; this is where I suspect the error occurs (more on that below). The third is a pilcrow: a single-character item. The rules are sometimes a little tricky, but my theory is that the trickiness (the gutter is squishy, the pilcrow can sometimes overflow) doesn't apply here.

My theory is that the offending line ( ├─╴application/pgp-encrypted) is being rendered into a fixed width box. The box size is calculated based on the monospace character width, which ordinarily works fine. In this case, either something about the font metrics of these specific characters or the specific width of the box (35ch), is hitting a bug (a floating point rounding error perhaps) that results in WeasyPrint thinking that the last character doesn't fit the box.

I might lean more toward the font metric hypothesis, because it looks like one of these line drawing characters is being substituted from a different font as there is a small discontinuity that doesn't show in a browser.

However, there are other examples further down the document that lose 2 or more characters, which either suggests that maybe that theory doesn't hold or the substitute font has very different metrics.

As for a reproducer, try this:

<style>div, pre { margin: 0; border: 0; padding: 0; font-family: monospace; }</style>
<div style="width: 72ch">
<div style="flex-wrap: nowrap; align-items: end; display: flex;">
<pre style="flex: 0 0 content; max-width: 72ch;">
Cryptographic Protections: none
H └┬╴multipart/mixed
J  ├─╴[protected part, may be arbitrary MIME subtree]
L  └─╴[footer, typically text/plain]
</pre>
<div style="flex: 0 0 1ch;">x</div>
</div>
</div>

This isn't perfect, because it doesn't cut off the text, but it at least shows that the x is rendered in the wrong place.
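For anyone who wants to run the reproducer above through WeasyPrint directly, here is a minimal sketch. It assumes the third-party `weasyprint` package is installed; `render_reproducer` and the default output filename are hypothetical names for illustration, not anything the datatracker uses.

```python
# Sketch only: feeds the reproducer from this comment to WeasyPrint.
# Assumes the third-party "weasyprint" package is installed.

REPRODUCER = """\
<style>div, pre { margin: 0; border: 0; padding: 0; font-family: monospace; }</style>
<div style="width: 72ch">
<div style="flex-wrap: nowrap; align-items: end; display: flex;">
<pre style="flex: 0 0 content; max-width: 72ch;">
Cryptographic Protections: none
H └┬╴multipart/mixed
J  ├─╴[protected part, may be arbitrary MIME subtree]
L  └─╴[footer, typically text/plain]
</pre>
<div style="flex: 0 0 1ch;">x</div>
</div>
</div>
"""


def render_reproducer(target: str = "reproducer.pdf") -> None:
    """Render the reproducer HTML to a PDF at `target` (hypothetical filename)."""
    # Imported lazily so this module still loads where weasyprint is missing.
    from weasyprint import HTML

    HTML(string=REPRODUCER).write_pdf(target)
```

Calling `render_reproducer()` and opening the resulting PDF next to a browser rendering of the same HTML should make the misplaced `x` visible.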

For the list (2), we're just using a marker. The list is pretty simply styled with padding-left: 2ch (margin and border are zero) and list-style-type: "*". WeasyPrint is rendering the marker immediately next to the text instead of placing it outside the main box, as browsers do. That seems simple enough.

dkg · 2023-04-24T20:10:47Z

Another issue here might be related to the fonts used on the datatracker. I note that the pdf of draft -06 embeds six fonts:

$ pdffonts draft-ietf-lamps-e2e-mail-guidance-06 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ZGBHTJ+Liberation-Mono               CID TrueType      Identity-H       yes yes yes   1220  0
KYXPWB+Liberation-Mono-Bold          CID TrueType      Identity-H       yes yes yes   1224  0
AALTDG+Liberation-Mono-Italic        CID TrueType      Identity-H       yes yes yes   1228  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes   1232  0
NLDACI+Source-Code-Pro               CID Type 0C (OT)  Identity-H       yes yes yes   1236  0
BWIYDA+Liberation-Mono-Bold-Italic   CID TrueType      Identity-H       yes yes yes   1240  0
$

I don't know my pdf details well enough to know which embedded fonts were used in each section, or for each calculation, but it might be worth looking into whether the presence of specific fonts causes (or minimizes) the problem.

larseggert · 2023-05-25T13:01:10Z

It's not a font issue. I have a PR in #5688 that uses the normal fonts when generating the PDF, and the issue is still there. (But the document line-wraps now as it should at least.)

larseggert · 2023-05-26T07:22:28Z

This is what I now see:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
IVBPZU+Noto-Sans-Mono                CID TrueType      Identity-H       yes yes yes   1217  0
JUCHNC+Noto-Sans-Mono-Bold           CID TrueType      Identity-H       yes yes yes   1221  0
LVFZNZ+Noto-Sans-Mono-Oblique        CID TrueType      Identity-H       yes yes yes   1225  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes   1229  0
BEKZJC+Noto-Sans-Mono-Bold-Oblique   CID TrueType      Identity-H       yes yes yes   1233  0

I don't actually know where DejaVu-Sans-Mono comes from; the Datatracker isn't injecting this AFAIK. Is there a way to tell if WeasyPrint is using it for the line art, and maybe that is causing the issue that @martinthomson suspects?
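The two `pdffonts` listings in this thread can also be compared mechanically. A minimal sketch, assuming the output layout shown above (a header row, a dashed separator, then one font per row with the name in the first column); `font_names` is a hypothetical helper, and the six-letter prefix before `+` is treated as a subset tag and dropped.

```python
import re

# Subset-embedded fonts carry a six-letter tag such as "TNCRMY+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

BEFORE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ZGBHTJ+Liberation-Mono               CID TrueType      Identity-H       yes yes yes   1220  0
KYXPWB+Liberation-Mono-Bold          CID TrueType      Identity-H       yes yes yes   1224  0
AALTDG+Liberation-Mono-Italic        CID TrueType      Identity-H       yes yes yes   1228  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes   1232  0
NLDACI+Source-Code-Pro               CID Type 0C (OT)  Identity-H       yes yes yes   1236  0
BWIYDA+Liberation-Mono-Bold-Italic   CID TrueType      Identity-H       yes yes yes   1240  0
"""

AFTER = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
IVBPZU+Noto-Sans-Mono                CID TrueType      Identity-H       yes yes yes   1217  0
JUCHNC+Noto-Sans-Mono-Bold           CID TrueType      Identity-H       yes yes yes   1221  0
LVFZNZ+Noto-Sans-Mono-Oblique        CID TrueType      Identity-H       yes yes yes   1225  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes   1229  0
BEKZJC+Noto-Sans-Mono-Bold-Oblique   CID TrueType      Identity-H       yes yes yes   1233  0
"""


def font_names(pdffonts_output: str) -> set[str]:
    """Collect font names (without subset tags) from pdffonts-style output."""
    names = set()
    seen_separator = False
    for line in pdffonts_output.splitlines():
        if line.startswith("---"):
            seen_separator = True  # rows after this line describe fonts
            continue
        if not seen_separator or not line.strip():
            continue
        names.add(SUBSET_TAG.sub("", line.split()[0]))
    return names


# Fonts embedded both before and after the font change.
print(font_names(BEFORE) & font_names(AFTER))
```

The only name common to both listings is DejaVu-Sans-Mono, which fits the idea that it is pulled in independently of the fonts the stylesheet asks for.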

larseggert · 2023-05-26T07:30:31Z

Also, the d is actually present in the PDF. If you copy and paste the whole line

it comes out as

application/pgp-encrypted


liZe · 2023-05-29T13:21:10Z

> But it renders without the trailing "d" on application/pgp-encrypted.

This problem is fixed by 0a03e3d.

For the record: it only happens on the longest line of preformatted text, only when the previous line includes non-ASCII characters, and only when preformatted block’s width is set to maximum content width (in this case, it’s a flex item).
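The fixing commit is titled "Use UTF8 indices instead of unicode indices for line split". The sketch below is an illustration of that kind of index mismatch only, not WeasyPrint's code: it shows that the offending figure line has different lengths in code points and in UTF-8 bytes, so an offset computed in one unit and applied in the other lands in the wrong place. The helper `prefix_by_bytes` is a hypothetical name, and the exact arithmetic inside WeasyPrint is not shown in this thread.

```python
# Illustration only: this is not WeasyPrint's code.
# The figure line from the draft, including its box-drawing characters.
line = " ├─╴application/pgp-encrypted"

code_points = len(line)                 # length in Unicode code points
utf8_bytes = len(line.encode("utf-8"))  # length in UTF-8 bytes


def prefix_by_bytes(text: str, byte_count: int) -> str:
    """Return the prefix of `text` that occupies `byte_count` UTF-8 bytes."""
    return text.encode("utf-8")[:byte_count].decode("utf-8")


# Each box-drawing character takes three bytes in UTF-8,
# so the two lengths differ by 6 for this line.
print(code_points, utf8_bytes)  # 29 35

# Using a code-point count where a byte count is expected cuts the line short.
print(repr(prefix_by_bytes(line, code_points)))  # ' ├─╴application/pgp-enc'
```

A pure-ASCII line has identical lengths in both units, which is consistent with the bug needing non-ASCII characters to show up.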

> This isn't perfect, because it doesn't cut off the text, but it at least shows that the x is rendered in the wrong place.

I’ll check that, but it probably comes from WeasyPrint’s limited support of the flexbox layout.

> I don't actually know where DejaVu-Sans-Mono comes from

It’s used as a fallback font for characters that are not included in Noto Mono (for example ↧ or ⇩).

liZe · 2023-05-29T13:32:38Z

> For the list (2), we're just using a marker. The list is pretty simply styled with padding-left: 2ch (margin and border are zero) and list-style-type: "*". WeasyPrint is rendering the marker immediately next to the text instead of placing it outside the main box, as browsers do. That seems simple enough.

This issue has already been reported: Kozea/WeasyPrint#1557.

liZe · 2023-05-29T13:45:54Z

> I’ll check that, but it probably comes from WeasyPrint’s limited support of the flexbox layout.

WeasyPrint supports flex-end (as defined by the CSS Flexbox specification, only for flex layout), but not end (as defined by the CSS Box Alignment specification, the general case). Using align-items: flex-end works as expected.
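If a stylesheet has to render in a WeasyPrint version without `end` support, the value could be rewritten before rendering. A minimal sketch under that assumption; `use_flex_end` is a hypothetical helper, it only touches `align-items`, and it does a plain textual substitution rather than parsing the CSS.

```python
import re

# Matches "align-items: end" but not "align-items: flex-end" or longer keywords.
ALIGN_ITEMS_END = re.compile(r"(align-items\s*:\s*)end(?![\w-])")


def use_flex_end(css: str) -> str:
    """Rewrite `align-items: end` to `align-items: flex-end` in CSS text."""
    return ALIGN_ITEMS_END.sub(r"\1flex-end", css)


print(use_flex_end("flex-wrap: nowrap; align-items: end; display: flex;"))
```

Inside a flex container the two values should place items the same way, so browsers are not expected to render the rewritten rule differently.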

martinthomson · 2023-05-29T19:21:14Z

Thanks for looking into this @liZe.

dkg · 2023-05-30T15:41:22Z

On Mon 2023-05-29 06:21:21 -0700, Guillaume Ayoub wrote:

> For the record: it only happens on the longest line of preformatted text, only when the previous line includes non-ASCII characters, and only when preformatted block’s width is set to maximum content width (in this case, it’s a flex item).

Wow, this isn't just a corner case. It's a corner case of a corner case of a corner case. Thank you for tracking this down and fixing it, Guillaume.

larseggert · 2023-10-30T08:41:42Z

This is now fixed:

dkg added the bug label Apr 18, 2023

kesara transferred this issue from ietf-tools/xml2rfc Apr 19, 2023

larseggert self-assigned this Apr 19, 2023

rjsparks added major accepted component: doc/ labels May 1, 2023

liZe added a commit to Kozea/WeasyPrint that referenced this issue May 29, 2023

Use UTF8 indices instead of unicode indices for line split

0a03e3d

Related to ietf-tools/datatracker#5507

larseggert closed this as completed Oct 30, 2023

github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2023
