RFC - Roadmap for version 1 #170

Open
14 of 23 tasks
danfickle opened this issue Jan 16, 2018 · 20 comments

@danfickle
Owner

danfickle commented Jan 16, 2018

These are my thoughts on the issues that need to be addressed before version 1 is released, in no particular order:

  • MathML Support #161 - MathML support - COMPLETE
  • NO-ISSUE-YET, COMPLETE - Entity support, such as nbsp for HTML, InvisibleTimes for MathML, and SVG entities. This is tricky as there is no way to programmatically inject entities using the Java XML parser. I proposed adding a doctype dynamically to the start of the XML input, with the desired entities. However, that would mean reading the XML input into a string, rather than just passing a file or input stream to the XML reader, with the builder used to specify which entities to load. In the end, custom doctypes and an external entity resolver were used instead. DOCUMENT.
  • Implement CSS3 transform #38 - Transforms. A few issues remaining to implement:
    • Link placing doesn't take account of transforms.
    • Translate is not implemented.
    • Some work for transformed boxes in page margins.
    • Testing. Do transforms of MathML, SVG and custom objects?
  • NO-ISSUE-YET - Logging / error handling overhaul - Currently error handling is ad hoc. For example, should we continue on a load failure or fatally throw? I propose to make this configurable by allowing the user to hook logging on a per-run basis and halt on any log message (which will be changed to enum constants) by throwing a poison exception.
  • CSS3 Multi-column Layout #60 - CSS3 Columns - Currently implemented for text only. Need to debug to allow other box types in columns.
  • Can we generate a horizontally scrollable pdf using this library.What would be the steps? #126 - Overflowing pages - Currently content that goes past the right margin is cut off silently. This is mostly a problem with tables. I propose a CSS property that allows cut off content to be printed on the next page. DOCUMENT.
  • Cache + URL resolver doesn't work as expected #204 - Multi run cache - Currently there is a multi-run cache hook method, but the objects stored may not be thread safe. This means it is unsuitable for many use-cases. Propose to remove all caches except font metrics cache.
  • NO-ISSUE-YET - Per run cache - Need to make sure nothing is being placed into a PDF document more than once. For example, is an img from the img tag and a background image from the same url embedded twice?
  • UNICODE font justification support #83 - Unicode font justification fix - There is a fix in September improvements by @backslash47 #143, but we are waiting for PDFBox 2.0.9 to implement it.
  • dir="auto" does not affect the table tag #123 - RTL table layout - Altering table layout to correct RTL scares me, but there have been a couple of requests, so we should try.
  • NO-ISSUE-YET - Remove remnants of configuration class and move all config to builders. There are still some config values that are coming from various file locations.
  • Padding/ Margin values with relative units not working #145 - Padding with percentages not working - It appears that it is resolving padding percentage values with a zero base value.
  • NO-ISSUE-YET - Make sure all dependencies are up to date. Do this after test system introduced.
  • Automatic visual testing #208 - Semi-automatic testing. I propose some sort of semi-automatic testing with image diff. This would allow you to run the tests before and after changes to make sure nothing has been broken. Unfortunately, we can't have one true source of reference results as, reportedly, font handling, etc. can change slightly between JREs.
  • NO-ISSUE-YET - Java2D cleanup. Make sure all Java2D functionality is in the Java2D module and delete broken code samples and tools. Also make sure Java2D RTL works.
  • NO-ISSUE-YET - Documentation. Review and complete the template author's guide and the integration guide, and create a comparison with other solutions such as Flying Saucer, headless browsers, etc.
  • Large HTML File conversion to PDF hangs. #180 - Performance and memory improvements - IN PROGRESS.
  • September improvements by @backslash47 #143 - Other improvements from this pull-request.
  • NO-ISSUE-YET - Floating elements escape elements with overflow:hidden set.
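
The entity item above mentions custom doctypes plus an external entity resolver. A minimal sketch of that general technique with the JDK's built-in parser follows. The class name, `PUBLIC_ID` and the two-entity `DTD` string are hypothetical and not taken from the project; only the standard `DocumentBuilder.setEntityResolver` hook is assumed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative only: names, public ID and DTD text are hypothetical.
class EntityResolverSketch {
    // Hypothetical public identifier that the input document declares in its doctype.
    static final String PUBLIC_ID = "-//EXAMPLE//DTD XHTML Character Entities//EN";

    // A tiny in-memory DTD declaring the entities we want to make available.
    static final String DTD = "<!ENTITY nbsp \"&#160;\">\n<!ENTITY copy \"&#169;\">";

    // Parses the XML and returns the text content of the root element.
    static String parseText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            if (PUBLIC_ID.equals(publicId)) {
                // Serve our own DTD instead of fetching anything from disk or network.
                return new InputSource(new StringReader(DTD));
            }
            // Any other doctype resolves to an empty DTD.
            return new InputSource(new StringReader(""));
        });
        // The input can still be streamed; it does not need to be rewritten as a string.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }
}
```

A document that starts with `<!DOCTYPE html PUBLIC "-//EXAMPLE//DTD XHTML Character Entities//EN" "entities.dtd">` can then use `&nbsp;` without the parser rejecting it as an undeclared entity.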

Hopefully, most of the other open issues can wait for subsequent releases. NOTE: There will be several more release candidate versions before version 1.

I'd appreciate feedback from anybody, especially @rototor. Any other issues that need to be addressed before version 1?

@danfickle danfickle added this to the v1.0.0 milestone Jan 16, 2018
@danfickle danfickle self-assigned this Jan 16, 2018
@dilworks

The very first thing I would suggest: add instructions for non-Maven users to the Integration Guide. I myself use Ant (from NetBeans). This includes making available a full list of dependencies (PDFBox and whatever other .jar files are currently required, for example graphics2d, which took me by surprise during my initial tests).

Obviously we need to get v1 out first properly so we can have downloadable releases and the like.

Also I second the logging overhaul.
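
For reference, the logging proposal in the roadmap (a per-run hook, enum message constants, and a poison exception to halt) could look roughly like the sketch below. Every name here (`LogMessageId`, `HaltRunException`, `run`, `strictHook`) is hypothetical and only illustrates the idea; none of it is the library's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: all names are hypothetical, not the library's API.
class PoisonLoggingSketch {
    // Hypothetical message identifiers standing in for the proposed enum constants.
    enum LogMessageId { RESOURCE_LOAD_FAILED, CSS_PARSE_WARNING }

    // Unchecked "poison" exception that unwinds the whole run when a hook throws it.
    static final class HaltRunException extends RuntimeException {
        final LogMessageId id;
        HaltRunException(LogMessageId id) {
            super("Run halted on " + id);
            this.id = id;
        }
    }

    // Stand-in for one conversion run: every diagnostic is routed through the per-run hook.
    static List<String> run(List<LogMessageId> events, Consumer<LogMessageId> hook) {
        List<String> processed = new ArrayList<>();
        for (LogMessageId id : events) {
            hook.accept(id);           // may throw HaltRunException
            processed.add(id.name());  // reached only if the hook let the run continue
        }
        return processed;
    }

    // Example policy: tolerate warnings, halt on load failures.
    static void strictHook(LogMessageId id) {
        if (id == LogMessageId.RESOURCE_LOAD_FAILED) {
            throw new HaltRunException(id);
        }
    }
}
```

A lenient caller would pass a hook that only records messages, while a strict caller would pass `PoisonLoggingSketch::strictHook` and catch `HaltRunException` around the run.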

@vipcxj
Contributor

vipcxj commented Feb 1, 2018

It seems that there are some issues with transparent embedded SVG: the transparent background is displayed in black. The same issue occurred when I used Batik to convert SVG to BMP myself. However, there is no problem when converting to PNG. Batik does not provide a converter for transforming SVG to BMP, so I wrote a custom one based on the one for transforming SVG to PNG. In the end, I decided to link an external PNG as a workaround.
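
One plausible reason for the BMP result specifically, offered as an assumption rather than a diagnosis of the PDF rendering problem: an opaque RGB raster has no alpha channel, and a fresh RGB `BufferedImage` starts out black, so transparent pixels drawn onto it stay black unless a background is painted first. A minimal sketch with plain Java2D (the class and method names are made up for illustration):

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Illustrative only: shows why flattening alpha without a background yields black.
class FlattenAlphaSketch {
    // Copies an image with alpha onto an opaque RGB image.
    // If background is null, transparent pixels keep the RGB buffer's initial value (black).
    static BufferedImage flatten(BufferedImage src, Color background) {
        BufferedImage dst = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            if (background != null) {
                g.setColor(background);
                g.fillRect(0, 0, dst.getWidth(), dst.getHeight());
            }
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Calling `flatten(image, Color.WHITE)` before writing a format without alpha gives a white background instead of a black one.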

@rototor
Contributor

rototor commented Feb 1, 2018

@vipcxj Can you share the SVG which does not work for you with me? I would like to fix this bug (which is in https://github.com/rototor/pdfbox-graphics2d, as there the whole Graphics2d->PDF mapping is happening).

@vipcxj
Contributor

vipcxj commented Feb 1, 2018

I will give you the SVG when I go to work tomorrow. It's a watermark.
Update: this is the content of the SVG.

<?xml version="1.0" encoding="UTF-8" ?>
<svg width="512" height="512" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  <style type="text/css">text { fill: gray; font-family: Avenir, Arial, Helvetica, sans-serif; }</style>
  <defs>
    <pattern id="twitterhandle" patternUnits="userSpaceOnUse" width="400" height="200">
      <text y="30" font-size="40" id="name">TEST WATERMARK</text>
    </pattern>
    <pattern xlink:href="#twitterhandle">
      <text y="120" x="200" font-size="30" id="occupation">test watermark</text>
    </pattern>
    <pattern id="combo" xlink:href="#twitterhandle" patternTransform="rotate(-45)">
      <use xlink:href="#name" />
      <use xlink:href="#occupation" />
    </pattern>
  </defs>
  <rect width="100%" height="100%" fill="url(#combo)" />
</svg>

@rototor
Contributor

rototor commented Feb 2, 2018

@vipcxj I've just released pdfbox-graphics2d version 0.11 which fixes this problem. PdfBoxGraphics2D did not handle the PatternPaint of Batik SVG. You can manually depend on this version or wait till it is integrated here.

@Bigdatha

Bigdatha commented Feb 3, 2018

I would be delighted to see some improvement to my issue #119 ... the proposed workaround using floating containers works nine times out of ten, but it is not as perfect as everything else in this amazing project (at least for me).

Regards
Bigdatha

@achuinard
Contributor

@dilworks haven't heard of anyone using either Ant or NetBeans in years...

@dilworks

dilworks commented Mar 6, 2018

@achuinard uh... I do. And it's still quite popular here in Latin America.

Not everybody likes Maven or Eclipse, and there is nothing wrong with that.

danfickle added a commit that referenced this issue Mar 12, 2018
Created two DTDs, one with just character entities for XHTML and one
combined character entities for XHTML and MathML. Also:
+ Minor refactoring of entity resolver and catalog.
+ Documented the two new doctypes in author’s guide.
+ Made sure that other doctypes resolve to the empty string.
@dilworks

dilworks commented Mar 20, 2018

Just wondering: has anyone done performance benchmarks? As there are quite a lot of us looking at this project as a long-term replacement for good ol' FS+iText, matching the performance of that should be a goal.

I've only done some quick testing with simple reports (basically tables, nothing fancy), and I've found openhtmltopdf to be as much as 50% slower than FS+iText. I have no clue where the bottlenecks could be (here? in PDFBox?).

@rototor
Contributor

rototor commented Mar 21, 2018

@dilworks This is likely caused by PDFBox or its dependency FontBox. Are you using many custom fonts? FontBox is a little bit slow when parsing fonts...

@dilworks

Well, my reports are very simple - I'm using the PDF defaults (Times, Helvetica), not even external ones!

danfickle added a commit that referenced this issue Mar 27, 2018
+ Allow user to create their own PDDocument with memory settings of
their choice.
+ Fix silly bug in bidi splitter that was taking more than half the
time in my sample document (according to VisualVM).
@danfickle
Owner Author

Thanks @dilworks

You inspired me to create a large document and run VisualVM while it was processing. It immediately highlighted a silly bug in the BIDI splitter which is now fixed (above). This was taking well over half the run-time. The next culprit to look at is createInlineBox. Any ideas on why that is so slow?

Before:
[image: screen shot 2018-03-27 at 7 21 17 pm]

After:
[image: screen shot 2018-03-27 at 7 20 19 pm]

Embarrassingly, the BIDI splitter should not even run when not configured, which I'll fix in a future commit.

@Rob46

Rob46 commented Mar 29, 2018

Is there any reason why this can't run on the modern Google App Engine Standard env Java8? It removes a ton of restrictions from the older java7 environment (no more whitelist of jars, most APIs should work).

I'd be happy to test if nobody has.

@achuinard
Contributor

achuinard commented Mar 29, 2018 via email

@dilworks

dilworks commented Apr 2, 2018

@danfickle Good starting point. I did a few test runs with an 18-page test document (I will try to clean out any proprietary/private info so I can provide a public sample of the reports I generate) with nothing but default fonts and rather simple tables. I did 25 runs for each converter, measuring times but not resource usage (we rarely generate reports with hundreds of pages, so this represents one of my most frequent use cases).

Here are the test results:
benchmark_pdfgen.xlsx

So far, I've found the performance gap between FS and OH to be around 30%.

(LOL at GitHub that doesn't support OpenDocument documents!)

Now I'll try with the really heavy hundred-of-pages CPU-draining workloads :)
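
For anyone who wants to repeat this kind of comparison, a minimal timing harness along these lines could be used. It is a sketch under assumptions: the class and method names are hypothetical, the warm-up runs are an addition not described above, and each converter call would have to be wrapped in a `Runnable` by the caller.

```java
import java.util.Arrays;

// Illustrative only: a generic wall-clock harness, not the one used for the spreadsheet above.
class BenchmarkSketch {
    // Runs the task `warmup` times untimed, then `runs` times timed.
    // Returns the elapsed wall-clock time of each timed run in milliseconds.
    static double[] time(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return millis;
    }

    // Median is less sensitive than the mean to a single slow run (GC pause, disk hiccup).
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Relative gap of candidate versus baseline, e.g. 0.30 means 30% slower.
    static double relativeGap(double baselineMs, double candidateMs) {
        return (candidateMs - baselineMs) / baselineMs;
    }
}
```

With 25 timed runs per converter, `relativeGap(median(fsTimes), median(ohTimes))` gives a single figure comparable to the gap quoted above.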

@dilworks

dilworks commented Apr 2, 2018

Testcase:
testcase_fs_oh.tar.gz

Forgot to mention my setup: this is my dev laptop (a quite ancient Core 2 Duo P8600 with 6GB DDR2 RAM and 500GB of good ol' spinning rust storage) running both generators inside a J2EE container (WildFly 11.0).

danfickle added a commit that referenced this issue Apr 3, 2018
@danfickle
Owner Author

Thanks @dilworks

That is helpful. When I work on the collapse whitespace function tomorrow, I'll run your test case before and after and see if we can get a good improvement. Could we continue the discussion of performance improvements in #180?

@dilworks

dilworks commented Apr 3, 2018

All right then!

And once again, thanks for improving the library!

@Fancellu2

More docs please. More examples.

Great project though, works nicely for me. I would just like to know all the things I can do and, more importantly, can't do.

@backslash47
Contributor

Implementing flexbox layout (#69) will be a huge improvement
