Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misrendered PDF (unreadable trailing "d") #5507

Closed
1 task done
dkg opened this issue Apr 18, 2023 · 29 comments
Closed
1 task done

misrendered PDF (unreadable trailing "d") #5507

dkg opened this issue Apr 18, 2023 · 29 comments
Assignees
Labels

Comments

@dkg
Copy link

dkg commented Apr 18, 2023

Describe the issue

Section 4.1.2.2 of the PDF form of draft-ietf-lamps-e2e-mail-guidance-06 makes it look like there is a missing "d".

2023-04-18_17-19-44_window_610x370

The text in that section says:

└┬╴multipart/encrypted
 ├─╴application/pgp-encrypted
 └─╴application/octet-stream

But it renders without the trailing "d" on application/pgp-encrypted.

I can't replicate this misrendering with my own toolchain (this pdf was generated by the datatracker directly), so i don't know what specifically is causing the problem.

Code of Conduct

@dkg dkg added the bug Something isn't working label Apr 18, 2023
@kesara
Copy link
Member

kesara commented Apr 19, 2023

This seems like a datatracker's HTMLized PDF generation issue since xml2rfc generates all formats correctly.

@kesara kesara transferred this issue from ietf-tools/xml2rfc Apr 19, 2023
@larseggert
Copy link
Collaborator

CC @martinthomson

@larseggert
Copy link
Collaborator

There is also an issue with the asterisks in the list following the figure.

The short answer here is that weasyprint's CSS support is real bad. We need to find a hack that makes it work.

@larseggert larseggert self-assigned this Apr 19, 2023
@cabo
Copy link
Collaborator

cabo commented Apr 19, 2023

Here are reviewer comments on https://datatracker.ietf.org/doc/pdf/draft-ietf-jsonpath-base-13


The bullets on the bottom of page 24 (why are there no page numbers?)
and top of page 25 have no spaces after them:

 - equal objects with no duplicate names…

   oboth objects …
   ofor each of those names, …

And on closer inspection that seems to be the case for fifth or sixth
level bullets generally.

The other thing is that the heading “Parameters:” for each of the
extension functions is mangled such that it reads “ParameteXs” where X
is an overstrike of “r” and “1”. It looks like the parameters are a
numbered list and the spacing is off. By the time you get down to 2.4.7,
you can see that the second item in the list begins “2.” at the same
horizontal location as the “1” that mangles the “r” in parameters.

Honestly, I have no idea why we are torturing people with this form of pdfization if it is so easy to do the real thing.

@larseggert
Copy link
Collaborator

You mean PDFize the HTML? Happy to switch to that, but users will need to be OK with that (past feedback wanted the text version PDFs.)

@cabo
Copy link
Collaborator

cabo commented Apr 19, 2023

You mean PDFize the HTML? Happy to switch to that, but users will need to be OK with that (past feedback wanted the text version PDFs.)

I'm not sure I understand the terminology, but I don't understand why we have to have a different (and vastly inferior) rendering from the (mostly) debugged one that is provided by xml2rfc.

@cabo
Copy link
Collaborator

cabo commented Apr 19, 2023

(past feedback wanted the text version PDFs.)

(Trying to interpret the terminology:
Note that the .txt files have page numbers; the datatracker PDFs don't.
What are they?)

@larseggert
Copy link
Collaborator

They are based on the plaintextified HTML using @martinthomson's CSS. (Same as is shown for HTMLized.)

@cabo
Copy link
Collaborator

cabo commented Apr 19, 2023

They are based on the plaintextified HTML using @martinthomson's CSS. (Same as is shown for HTMLized.)

Yes. Again: I don't see the point to do this when we can do the real thing.

Of course, using typewriter style conceals the fact that we cannot do standard typography correctly yet.
But the disadvantages outweigh the lack of pleasance of that inconvenient revelation.

@larseggert
Copy link
Collaborator

As I said, we can "do the real thing". It will require the community to agree that this is what they want. Past feedback indicated they wanted PDFs of text versions.

@cabo
Copy link
Collaborator

cabo commented Apr 19, 2023

Note that this isn't an alternative. We could produce both real and fake PDFs, if the latter are really needed, just like we have html and "htmlized" (which no longer is).

@rjsparks
Copy link
Member

If we can remove codepaths and alternate representations in favor of a smaller, easier to explain set, then we should do that. But I don't think we can.

Right now we do not require submission in xml. If someone provides only plain text, we do htmlize and then pdfize that. We cannot use xml2rfc to produce html or pdf.

We can't really stop providing the -ized formats when we do have v3 xml, because that would force people (and systems like wikipedia) to have to learn to point to different types of things depending on the underlying arcana of the submission, which is really a non-starter.

So, I don't think we can make less, and I cringe at the proposal to make more because of the confusion it causes.

@cabo
Copy link
Collaborator

cabo commented Apr 19, 2023 via email

@cabo
Copy link
Collaborator

cabo commented Apr 19, 2023

And remember that the reason this ticket exists appears to be that the CSS that comes with typewriterized html blows the little mind of weasyprint. Getting rid of typewriterized HTML would solve this problem right there.

@dkg
Copy link
Author

dkg commented Apr 20, 2023

i don't know how to solve this problem, but i appreciate that y'all are looking into it. Maybe it's worth looping in the WeasyPrint developers as well? @grewn0uille, perhaps you have some hints about how the datatracker can align its CSS with what WeasyPrint supports? or maybe WeasyPrint can use the current datatracker CSS as a source of feature requests/plans for improvement?

@larseggert
Copy link
Collaborator

We could of course PDFize the text versions, but that would mean no links in PDF and ASCII figures.

@grewn0uille
Copy link

Hi @dkg,
We can have a look at this to find what’s wrong with the current CSS!

@dkg
Copy link
Author

dkg commented Apr 20, 2023

@martinthomson, can you provide a minimized reproducer to @grewn0uille, or at least point him to the inputs (html + CSS) passed to weasyprint for draft-ietf-lamps-e2e-mail-guidance-06 by the datatracker?

@martinthomson
Copy link
Contributor

martinthomson commented Apr 24, 2023

Two issues appear to be going on here:

  1. The figure.
  2. The list.

The figure (1) is drawn using flexbox. There are three items there. The first is a 3ch gutter, generated with a ::before rule. The second is the figure itself, this is where I suspect the error occurs, more to follow. The third is a pilcrow: a single-character item. The rules are sometimes a little tricky, but my theory is that the trickiness (the gutter is squishy, the pilcrow can sometimes overflow), doesn't apply here.

My theory is that the offending line ( ├─╴application/pgp-encrypted) is being rendered into a fixed width box. The box size is calculated based on the monospace character width, which ordinarily works fine. In this case, either something about the font metrics of these specific characters or the specific width of the box (35ch), is hitting a bug (a floating point rounding error perhaps) that results in WeasyPrint thinking that the last character doesn't fit the box.

I might lean more toward the font metric hypothesis, because it looks like one of these line drawing characters is being substituted from a different font as there is a small discontinuity that doesn't show in a browser.

However, there are other examples further down the document that lose 2 or more characters, which either suggests that maybe that theory doesn't hold or the substitute font has very different metrics.

As for a reproducer, try this:

<style>div, pre { margin: 0; border: 0; padding: 0; font-family: monospace; }</style>
<div style="width: 72ch">
<div style="flex-wrap: nowrap; align-items: end; display: flex;">
<pre style="flex: 0 0 content; max-width: 72ch;">
Cryptographic Protections: none
H └┬╴multipart/mixed
J  ├─╴[protected part, may be arbitrary MIME subtree]
L  └─╴[footer, typically text/plain]
</pre>
<div style="flex: 0 0 1ch;">x</div>
</div>
</div>

This isn't perfect, because it doesn't cut off the text, but it at least shows that the x is rendered in the wrong place.

For the list (2), we're just using a marker. The list is pretty simply styled with padding-left: 2ch (margin and border are zero) and list-style-type: "*". WeasyPrint is rendering the marker immediately next to the text instead of placing it outside the main box, as browsers do. That seems simple enough.

@dkg
Copy link
Author

dkg commented Apr 24, 2023

Another issue here might be related to the fonts used on the datatracker. I note that the pdf of draft -06 embeds six fonts:

$ pdffonts draft-ietf-lamps-e2e-mail-guidance-06 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ZGBHTJ+Liberation-Mono               CID TrueType      Identity-H       yes yes yes   1220  0
KYXPWB+Liberation-Mono-Bold          CID TrueType      Identity-H       yes yes yes   1224  0
AALTDG+Liberation-Mono-Italic        CID TrueType      Identity-H       yes yes yes   1228  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes   1232  0
NLDACI+Source-Code-Pro               CID Type 0C (OT)  Identity-H       yes yes yes   1236  0
BWIYDA+Liberation-Mono-Bold-Italic   CID TrueType      Identity-H       yes yes yes   1240  0
$ 

I don't know my pdf details well enough to know which embedded fonts were used in each section, or for each calculation, but it might be worth looking into whether the presence of specific fonts causes (or minimizes) the problem.

@larseggert
Copy link
Collaborator

It's not a font issue. I have a PR in #5688 that uses the normal fonts when generating the PDF, and the issue is still there. (But the document line-wraps now as it should at least.)

@larseggert
Copy link
Collaborator

This is what I now see:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
IVBPZU+Noto-Sans-Mono                CID TrueType      Identity-H       yes yes yes   1217  0
JUCHNC+Noto-Sans-Mono-Bold           CID TrueType      Identity-H       yes yes yes   1221  0
LVFZNZ+Noto-Sans-Mono-Oblique        CID TrueType      Identity-H       yes yes yes   1225  0
TNCRMY+DejaVu-Sans-Mono              CID TrueType      Identity-H       yes yes yes   1229  0
BEKZJC+Noto-Sans-Mono-Bold-Oblique   CID TrueType      Identity-H       yes yes yes   1233  0

I don't actually know where DejaVu-Sans-Mono comes from; the Datatracker isn't injecting this AFAIK. Is there way to tell is Weasyprint is using it for the line art, and maybe that is causing the issue that @martinthomson suspects?

@larseggert
Copy link
Collaborator

larseggert commented May 26, 2023

Also, the d is actually present in the PDF. If you copy and paste the whole line
Screenshot 2023-05-26 at 09 29 37
it comes out as

application/pgp-encrypted

liZe added a commit to Kozea/WeasyPrint that referenced this issue May 29, 2023
@liZe
Copy link
Contributor

liZe commented May 29, 2023

But it renders without the trailing "d" on application/pgp-encrypted.

This problem is fixed by 0a03e3d.

For the record: it only happens on the longest line of preformatted text, only when the previous line includes non-ASCII characters, and only when preformatted block’s width is set to maximum content width (in this case, it’s a flex item).

This isn't perfect, because it doesn't cut off the text, but it at least shows that the x is rendered in the wrong place.

I’ll check that, but it probably comes from WeasyPrint’s limited support of the flexbox layout.

I don't actually know where DejaVu-Sans-Mono comes from

It’s used as a fallback font for characters that are not included in Noto Mono (for example ↧ or ⇩).

@liZe
Copy link
Contributor

liZe commented May 29, 2023

For the list (2), we're just using a marker. The list is pretty simply styled with padding-left: 2ch (margin and border are zero) and list-style-type: "*". WeasyPrint is rendering the marker immediately next to the text instead of placing it outside the main box, as browsers do. That seems simple enough.

This issue has already been reported: Kozea/WeasyPrint#1557.

@liZe
Copy link
Contributor

liZe commented May 29, 2023

I’ll check that, but it probably comes from WeasyPrint’s limited support of the flexbox layout.

WeasyPrint supports flex-end (as defined by the CSS Flexbox specification, only for flex layout), but not end (as defined by the CSS Box Alignment specification, the general case). Using align-items: flex-end works as expected.

@martinthomson
Copy link
Contributor

Thanks for looking into this @liZe.

@dkg
Copy link
Author

dkg commented May 30, 2023 via email

@larseggert
Copy link
Collaborator

This is now fixed:
Screenshot 2023-10-30 at 10 41 27

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

8 participants