Can WeasyPrint produce a large and heavy documents? #1824

Closed
survtur opened this issue Mar 1, 2023 · 11 comments · Fixed by #1829
Labels
performance Too slow renderings
Milestone

Comments

@survtur
Contributor

survtur commented Mar 1, 2023

I want to use WeasyPrint to produce a document with many pages of high-resolution raster images, all different. Such documents weigh 2 GB or more. Will WeasyPrint write the document incrementally, or will it build the whole thing in memory and only then save it to disk?

I looked at the code. It seems there are no generators that would build the document incrementally. Am I right that the whole document has to fit into memory?

@survtur survtur changed the title from "Can WeasyPrint produce a large and heavy document?" to "Can WeasyPrint produce a large and heavy documents?" on Mar 1, 2023
@liZe
Member

liZe commented Mar 1, 2023

Hi!

> I looked at the code. It seems there are no generators that would build the document incrementally. Am I right that the whole document has to fit into memory?

Unfortunately, you are, and there are many reasons for that. The last steps of PDF generation (including internal links) require the whole document to be available in memory. There are also many cases where the last page has to be rendered once before the first one is re-rendered, for example when each page shows the total number of pages.

It’s not impossible to find workarounds, but that would probably require a lot of complex code.

The "real" fix would be to use less memory. WeasyPrint shouldn’t require the very large amount of memory it currently uses. Hunting for useless memory use takes a lot of time; it can be very rewarding when we find a way to save a lot of memory, but most of the time it leads to nothing interesting :/. To be honest, the major memory improvements in WeasyPrint in recent years mainly come from the amazing improvements of the Python interpreter.

If most of your memory is eaten by images, it could be possible to find a way to not keep these images in memory, but use temporary files instead. I’ve never deeply checked what’s done with images, it could be useful to look carefully at images.py. If you know a little bit of Python, you can even try to find a dirty workaround to see if it helps.

Also note that, depending on your document, it may be possible to split the HTML file, generate separate PDF files, and then concatenate them using another PDF editing library.
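The split-and-concatenate workaround could look roughly like the sketch below. It is only an illustration: the helper names (`chunked`, `build_html`, `render_in_parts`, `merge_parts`), the one-image-per-page markup and the choice of `pypdf` as the "other PDF library" are assumptions, not something proposed in this thread.

```python
from pathlib import Path


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_html(image_paths):
    """Build a small HTML document with one image per page (illustrative markup)."""
    pages = "\n".join(
        f'<div style="break-after: page"><img src="{Path(p).resolve().as_uri()}"></div>'
        for p in image_paths
    )
    return f"<!DOCTYPE html><html><body>{pages}</body></html>"


def render_in_parts(image_paths, out_dir, images_per_part=200):
    """Render each chunk to its own PDF so that only one chunk is in memory at a time."""
    from weasyprint import HTML  # third-party, imported lazily

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for index, chunk in enumerate(chunked(list(image_paths), images_per_part)):
        part = out_dir / f"part-{index:04d}.pdf"
        HTML(string=build_html(chunk)).write_pdf(str(part))
        parts.append(part)
    return parts


def merge_parts(parts, target):
    """Concatenate the partial PDFs with pypdf (assumed here as the merging library)."""
    from pypdf import PdfWriter  # third-party, imported lazily

    writer = PdfWriter()
    for part in parts:
        writer.append(str(part))
    with open(target, "wb") as output:
        writer.write(output)
```

Anything that depends on the whole document, such as internal links across parts or a total page count, would not survive this kind of split.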

> I want to use WeasyPrint to produce a document with many pages of high-resolution raster images, all different. Such documents weigh 2 GB or more.

Then you’ll probably have to either buy a lot of RAM, split your HTML files, or find a way to make WeasyPrint use dramatically less RAM 😄.

@survtur
Contributor Author

survtur commented Mar 1, 2023

I agree that the whole document has to be laid out in order to get counters and so on. But it does not have to keep whole images in memory. It only needs their dimensions.

I could free some memory by no longer keeping pillow_images in memory. I'll just keep the filename so I can read the content at the proper moment. For the HTML.render() function, only the dimensions are required.

I ran some tests on a document with 13,700 images totalling 13.7 GB:

The original WeasyPrint ate about 10.7 GB of my RAM and died with "9: SIGKILL". It happened somewhere around page 1378.
After I stopped keeping the images in memory, it took 1.1 GB of RAM for the whole rendered document, which was laid out on 1828 pages.
That is just rendered, not written out.

Now I see that creating the PDF with the pydyf module eats a lot of memory. I'll see if I can postpone loading the image into memory until the final writing-to-file moment...
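The idea of keeping only a filename and the dimensions can be sketched with the standard library alone. The class below is a hypothetical illustration, not WeasyPrint's code: it handles PNG files only, reads the width and height from the first 24 bytes, and loads the full content only when asked. Pillow's `Image.open()` is similarly lazy and exposes `size` without decoding the pixel data.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class LazyPngImage:
    """Hypothetical image handle: stores the path and dimensions, not the pixels."""

    def __init__(self, path):
        self.path = Path(path)
        with self.path.open("rb") as stream:
            # 8 bytes of signature, 8 bytes of IHDR length and type,
            # then 4 bytes of width and 4 bytes of height.
            header = stream.read(24)
        if (
            len(header) < 24
            or header[:8] != PNG_SIGNATURE
            or header[12:16] != b"IHDR"
        ):
            raise ValueError(f"{self.path} does not look like a PNG file")
        self.width, self.height = struct.unpack(">II", header[16:24])

    def read_bytes(self):
        """Load the whole file from disk, only when the caller needs it."""
        return self.path.read_bytes()
```

With such a handle, layout code can work from `width` and `height`, and `read_bytes()` is called only when the image is finally embedded.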

@liZe
Member

liZe commented Mar 1, 2023

Thanks a lot for your comment.

> I could free some memory by no longer keeping pillow_images in memory. I'll just keep the filename so I can read the content at the proper moment. For the HTML.render() function, only the dimensions are required.

👍

> The original WeasyPrint ate about 10.7 GB of my RAM and died with "9: SIGKILL". It happened somewhere around page 1378.
> After I stopped keeping the images in memory, it took 1.1 GB of RAM for the whole rendered document, which was laid out on 1828 pages.
> That is just rendered, not written out.

That’s really interesting. If we find a way to avoid the problem of the PDF generation (and it’s probably possible, even if we have to change pydyf’s API), that would be a great way to save memory.

> Now I see that creating the PDF with the pydyf module eats a lot of memory. I'll see if I can postpone loading the image into memory until the final writing-to-file moment...

Good luck, don’t hesitate to ping us if you are stuck!

@survtur
Contributor Author

survtur commented Mar 2, 2023

I've solved my problem. My whole process consumes at most 1.3 GB of RAM and produces a 3 GB .pdf as output. The original images were 13 GB, but JPEG recompression makes them smaller (I think I'll add a switch to disable that compression).

What has been done:

  1. weasyprint.images.RasterImage no longer contains the full pillow_image. It now contains a smaller object that can load the image from disk on demand.
  2. pydyf._to_bytes() asks that object for its bytes when it is time to write the data to the final PDF.
  3. weasyprint.document.Document.write_pdf() now writes to a file on disk instead of an in-memory io.BytesIO.
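The three steps above can be sketched in a simplified form. The code below does not use pydyf or produce a valid PDF; `DelayedBlob` and `write_objects` are hypothetical names that only illustrate the pattern: each object exposes its payload through a `data` property that reads from disk, and the writer streams objects to the target file one at a time instead of assembling everything in memory.

```python
from pathlib import Path


class DelayedBlob:
    """Hypothetical stand-in for an object whose payload stays on disk."""

    def __init__(self, path):
        self.path = Path(path)

    @property
    def data(self):
        # The bytes are read only when the writer asks for them.
        return self.path.read_bytes()


def write_objects(objects, target):
    """Write each object's data to `target` in turn and return the byte offsets."""
    offsets = []
    with open(target, "wb") as output:
        for obj in objects:
            offsets.append(output.tell())
            # Only one payload is held in memory at any moment.
            output.write(obj.data)
    return offsets
```

The returned offsets hint at why a real PDF writer can still work this way: it only needs to remember where each object started in order to emit the cross-reference table at the end.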

I made very tiny changes to pydyf and some changes to WeasyPrint. What do you think I should do now? Should I publish them somewhere?

@survtur
Contributor Author

survtur commented Mar 2, 2023

I made forks of WeasyPrint and pydyf, and I made commits with my changes.

I would be happy if some of my code turns out to be useful and makes it into WeasyPrint.

@liZe
Member

liZe commented Mar 3, 2023

> I made forks of WeasyPrint and pydyf, and I made commits with my changes.

That’s great, thanks a lot! 💜

I think that we may be able to provide this feature with a clean API, and even try to do so without changing pydyf. Are we allowed to include some of your code into WeasyPrint?

@survtur
Contributor Author

survtur commented Mar 3, 2023

> Are we allowed to include some of your code into WeasyPrint?

Yes, of course. Having my code in such a popular project is a reason to be proud.

@survtur
Contributor Author

survtur commented Mar 3, 2023

> try to do so without changing pydyf.

I see that my DelayedPillowImage can be based on the pydyf.Object class with a data property. This data property should work the same way as my first get_bytes attempt, so no pydyf changes are required.

I've pushed a new commit to my WeasyPrint fork.

@Kozea Kozea deleted a comment from Khnsimeon29 Mar 4, 2023
@liZe
Member

liZe commented Mar 5, 2023

Thanks a lot!

I’ve pushed a new save-memory branch that you can try. The code is neither tested nor documented yet, but it may already work well for you.

Without changing anything in the way you use WeasyPrint, it should be able to save a lot of memory: Pillow images are now created, transformed into ready-to-store images and then forgotten, so that we avoid keeping them too long. That’s by far the main improvement according to my tests.

There’s also a new --cache-folder <folder> option that keeps the images on disk instead of memory. It saves a few more megabytes for my use cases, but may save much more for your documents.

Don’t hesitate to give some feedback! I’m really interested in knowing if you hit a speed penalty (I don’t, and I’m a bit surprised about that), and if the --cache-folder option is worth it for you (or if the first improvement is already enough).
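One way to gather the kind of feedback asked for here is a small timing and memory harness. The `measure` helper below is a hypothetical sketch built on the standard library. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used by native libraries is not counted; the peak it reports is a lower bound, and an external tool is still needed for the process's real footprint.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run `func` and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Collect the figures even if `func` raises, then stop tracing.
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak
```

Calling it once on a rendering function with the cache folder and once without, on the same document, would give comparable numbers for the speed question raised above.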

@survtur
Contributor Author

survtur commented Mar 6, 2023

Good. Now it can produce documents of unlimited size.
Just note that draw_first_line() calls the RasterImage constructor, which requires a cache parameter. Something should be passed there. It seems to be an emoji-related thing.

P.S.: Unfortunately, I still can't use it without my own upgrades. I think I'll explain that in a separate issue.

@liZe
Member

liZe commented Mar 6, 2023

> Just note that draw_first_line() calls the RasterImage constructor, which requires a cache parameter. Something should be passed there. It seems to be an emoji-related thing.

Thanks.

> P.S.: Unfortunately, I still can't use it without my own upgrades. I think I'll explain that in a separate issue.

👍🏽

@liZe liZe mentioned this issue Mar 6, 2023
@liZe liZe added the performance Too slow renderings label Mar 6, 2023
@liZe liZe added this to the 59.0 milestone Mar 6, 2023
netbsd-srcmastr referenced this issue in NetBSD/pkgsrc Jun 30, 2023
Version 59.0
------------

Released on 2023-05-11.

This version also includes the changes from unstable b1 version listed
below.

Bug fixes:

* `#1864 <https://github.com/Kozea/WeasyPrint/issues/1864>`_:
  Handle overflow for svg and symbol tags in SVG images
* `#1867 <https://github.com/Kozea/WeasyPrint/pull/1867>`_:
  Remove duplicate compression of attachments
* `d0ad5c1 <https://github.com/Kozea/WeasyPrint/commit/d0ad5c1>`_:
  Override use tag children instead of drawing their references
* `93df1a5 <https://github.com/Kozea/WeasyPrint/commit/93df1a5>`_:
  Don’t resize the same image twice when the --dpi option is set
* `#1874 <https://github.com/Kozea/WeasyPrint/pull/1874>`_:
  Draw underline and overline behind text


Version 59.0b1
--------------

Released on 2023-04-14.

**This version is experimental, don't use it in production. If you find bugs,
please report them!**

Command-line API:

* The ``--optimize-size`` option and its short equivalent ``-O`` have been
  deprecated. To activate or deactivate different size optimizations, you can
  now use:

  * ``--uncompressed-pdf``,
  * ``--optimize-images``,
  * ``--full-fonts``,
  * ``--hinting``,
  * ``--dpi <resolution>``, and
  * ``--jpeg-quality <quality>``.

* A new ``--cache-folder <folder>`` option has been added to store temporary
  data in the given folder on the disk instead of keeping them in memory.

Python API:

* Global rendering options are now given in ``**options`` instead of dedicated
  parameters, with slightly different names. It means that the signatures of
  ``HTML.render()``, ``HTML.write_pdf()`` and ``Document.write_pdf()`` have
  changed. Here are the steps to port your Python code to v59.0:

  1. Use named parameters for these functions, not positional parameters.
  2. Rename some of the parameters:

     * ``image_cache`` becomes ``cache`` (see below),
     * ``identifier`` becomes ``pdf_identifier``,
     * ``variant`` becomes ``pdf_variant``,
     * ``version`` becomes ``pdf_version``,
     * ``forms`` becomes ``pdf_forms``.

* The ``optimize_size`` parameter of ``HTML.render()``, ``HTML.write_pdf()``
  and ``Document()`` has been removed and will be ignored. You can now use the
  ``uncompressed_pdf``, ``full_fonts``, ``hinting``, ``dpi`` and
  ``jpeg_quality`` parameters that are included in ``**options``.

* The ``cache`` parameter can be included in ``**options`` to replace
  ``image_cache``. If it is a dictionary, this dictionary will be used to store
  temporary data in memory, and can be even shared between multiple documents.
  If it’s a folder Path or string, WeasyPrint stores temporary data in the
  given temporary folder on disk instead of keeping them in memory.
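
The renaming steps and the ``cache`` option described above could be applied with a small helper like the one below. This is only a sketch: ``port_options`` and ``write_with_disk_cache`` are hypothetical names, and the mapping covers just the renames listed in this changelog.

```python
# Old keyword names (before 59.0) mapped to their 59.0 names, as listed above.
RENAMED = {
    "image_cache": "cache",
    "identifier": "pdf_identifier",
    "variant": "pdf_variant",
    "version": "pdf_version",
    "forms": "pdf_forms",
}


def port_options(old_kwargs):
    """Translate pre-59.0 keyword arguments to their 59.0 names."""
    options = {}
    for name, value in old_kwargs.items():
        if name == "optimize_size":
            # Removed in 59.0: use uncompressed_pdf, full_fonts, hinting,
            # dpi or jpeg_quality instead.
            continue
        options[RENAMED.get(name, name)] = value
    return options


def write_with_disk_cache(html_path, pdf_path, cache_folder, **old_kwargs):
    """Render a file to PDF while storing temporary data in `cache_folder`."""
    from weasyprint import HTML  # third-party, imported lazily

    options = port_options(old_kwargs)
    options["cache"] = str(cache_folder)  # a folder path selects on-disk storage
    HTML(filename=str(html_path)).write_pdf(str(pdf_path), **options)
```

Passing a dictionary as ``cache`` instead of a folder keeps the temporary data in memory and lets several documents share it, as the entry above explains.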

New features:

* `#1853 <https://github.com/Kozea/WeasyPrint/pull/1853>`_,
  `#1854 <https://github.com/Kozea/WeasyPrint/issues/1854>`_:
  Reduce PDF size, with financial support from Code & Co.
* `#1824 <https://github.com/Kozea/WeasyPrint/issues/1824>`_,
  `#1829 <https://github.com/Kozea/WeasyPrint/pull/1829>`_:
  Reduce memory use for images
* `#1858 <https://github.com/Kozea/WeasyPrint/issues/1858>`_:
  Add an option to keep hinting information in embedded fonts

Bug fixes:

* `#1855 <https://github.com/Kozea/WeasyPrint/issues/1855>`_:
  Fix position of emojis in justified text
* `#1852 <https://github.com/Kozea/WeasyPrint/issues/1852>`_:
  Don’t crash when line can be split before trailing spaces
* `#1843 <https://github.com/Kozea/WeasyPrint/issues/1843>`_:
  Fix syntax of dates in metadata
* `#1827 <https://github.com/Kozea/WeasyPrint/issues/1827>`_,
  `#1832 <https://github.com/Kozea/WeasyPrint/pull/1832>`_:
  Fix word-spacing problems with nested tags

Documentation:

* `#1841 <https://github.com/Kozea/WeasyPrint/issues/1841>`_:
  Add a paragraph about unsupported calc() function


Version 58.1
------------

Released on 2023-03-07.

Bug fixes:

* `#1815 <https://github.com/Kozea/WeasyPrint/issues/1815>`_:
  Fix bookmarks coordinates
* `#1822 <https://github.com/Kozea/WeasyPrint/issues/1822>`_,
  `#1823 <https://github.com/Kozea/WeasyPrint/pull/1823>`_:
  Fix vertical positioning for absolute replaced elements

Documentation:

* `#1814 <https://github.com/Kozea/WeasyPrint/pull/1814>`_:
  Fix broken link pointing to samples


Version 58.0
------------

Released on 2023-02-17.

This version also includes the changes from unstable b1 version listed
below.

Bug fixes:

* `#1807 <https://github.com/Kozea/WeasyPrint/issues/1807>`_:
  Don’t crash when out-of-flow box is split in out-of-flow parent
* `#1806 <https://github.com/Kozea/WeasyPrint/issues/1806>`_:
  Don’t crash when fixed elements aren’t displayed yet in aborted line
* `#1809 <https://github.com/Kozea/WeasyPrint/issues/1809>`_:
  Fix background drawing for out-of-the-page transformed boxes


Version 58.0b1
--------------

Released on 2023-02-03.

**This version is experimental, don't use it in production. If you find bugs,
please report them!**

New features:

* `#61 <https://github.com/Kozea/WeasyPrint/issues/61>`_,
  `#1796 <https://github.com/Kozea/WeasyPrint/pull/1796>`_:
  Support PDF forms, with financial support from Personalkollen
* `#1173 <https://github.com/Kozea/WeasyPrint/issues/1173>`_:
  Add style for form fields

Bug fixes:

* `#1777 <https://github.com/Kozea/WeasyPrint/issues/1777>`_:
  Detect JPEG/MPO images as normal JPEG files
* `#1771 <https://github.com/Kozea/WeasyPrint/pull/1771>`_:
  Improve SVG gradients


Version 57.2
------------

Released on 2022-12-23.

Bug fixes:

* `0f2e377 <https://github.com/Kozea/WeasyPrint/commit/0f2e377>`_:
  Print annotations with PDF/A
* `0e9426f <https://github.com/Kozea/WeasyPrint/commit/0e9426f>`_:
  Hide annotations with PDF/UA
* `#1764 <https://github.com/Kozea/WeasyPrint/issues/1764>`_:
  Use reference instead of stream for annotation appearance stream
* `#1783 <https://github.com/Kozea/WeasyPrint/pull/1783>`_:
  Fix multiple font weights for @font-face declarations


Version 57.1
------------

Released on 2022-11-04.

Dependencies:

* `#1754 <https://github.com/Kozea/WeasyPrint/pull/1754>`_:
  Pillow 9.1.0 is now needed

Bug fixes:

* `#1756 <https://github.com/Kozea/WeasyPrint/pull/1756>`_:
  Fix rem font size for SVG images
* `#1755 <https://github.com/Kozea/WeasyPrint/issues/1755>`_:
  Keep format when transposing images
* `#1753 <https://github.com/Kozea/WeasyPrint/issues/1753>`_:
  Don’t use deprecated ``read_text`` function when ``files`` is available
* `#1741 <https://github.com/Kozea/WeasyPrint/issues/1741>`_:
  Generate better manpage
* `#1747 <https://github.com/Kozea/WeasyPrint/issues/1747>`_:
  Correctly set target counters in pages’ absolute elements
* `#1748 <https://github.com/Kozea/WeasyPrint/issues/1748>`_:
  Always set font size when font is changed in line
* `2b05137 <https://github.com/Kozea/WeasyPrint/commit/2b05137>`_:
  Fix stability of font identifiers

Documentation:

* `#1750 <https://github.com/Kozea/WeasyPrint/pull/1750>`_:
  Fix documentation spelling


Version 57.0
------------

Released on 2022-10-18.

This version also includes the changes from unstable b1 version listed
below.

New features:

* `a4fc7a1 <https://github.com/Kozea/WeasyPrint/commit/a4fc7a1>`_:
  Support image-orientation

Bug fixes:

* `#1739 <https://github.com/Kozea/WeasyPrint/issues/1739>`_:
  Set baseline on all flex containers
* `#1740 <https://github.com/Kozea/WeasyPrint/issues/1740>`_:
  Don’t crash when currentColor is set on root svg tag
* `#1718 <https://github.com/Kozea/WeasyPrint/issues/1718>`_:
  Don’t crash with empty bitmap glyphs
* `#1736 <https://github.com/Kozea/WeasyPrint/issues/1736>`_:
  Always use the font’s vector variant when possible
* `eef8b4d <https://github.com/Kozea/WeasyPrint/commit/eef8b4d>`_:
  Always set color and state before drawing
* `#1662 <https://github.com/Kozea/WeasyPrint/issues/1662>`_:
  Use a stable key to store stream fonts
* `#1733 <https://github.com/Kozea/WeasyPrint/issues/1733>`_:
  Don’t remove attachments when adding internal anchors
* `3c4fa50 <https://github.com/Kozea/WeasyPrint/commit/3c4fa50>`_,
  `c215697 <https://github.com/Kozea/WeasyPrint/commit/c215697>`_,
  `d275dac <https://github.com/Kozea/WeasyPrint/commit/d275dac>`_,
  `b04bfff <https://github.com/Kozea/WeasyPrint/commit/b04bfff>`_:
  Fix many bugs related to PDF/UA structure

Performance:

* `dfccf1b <https://github.com/Kozea/WeasyPrint/commit/dfccf1b>`_:
  Use faces as fonts dictionary keys
* `0dc12b6 <https://github.com/Kozea/WeasyPrint/commit/0dc12b6>`_:
  Cache add_font to avoid calling get_face too often
* `75e17bf <https://github.com/Kozea/WeasyPrint/commit/75e17bf>`_:
  Don’t call process_whitespace twice on many children
* `498d3e1 <https://github.com/Kozea/WeasyPrint/commit/498d3e1>`_:
  Optimize __missing__ functions

Documentation:

* `863b3d6 <https://github.com/Kozea/WeasyPrint/commit/863b3d6>`_:
  Update documentation of installation on macOS with Homebrew


Version 57.0b1
--------------

Released on 2022-09-22.

**This version is experimental, don't use it in production. If you find bugs,
please report them!**

New features:

* `#1704 <https://github.com/Kozea/WeasyPrint/pull/1704>`_:
  Support PDF/UA, with financial support from Novareto
* `#1454 <https://github.com/Kozea/WeasyPrint/issues/1454>`_:
  Support variable fonts

Bug fixes:

* `#1058 <https://github.com/Kozea/WeasyPrint/issues/1058>`_:
  Fix bullet position after page break, with financial support from OpenZeppelin
* `#1707 <https://github.com/Kozea/WeasyPrint/issues/1707>`_:
  Fix footnote positioning in multicolumn layout, with financial support from Code & Co.
* `#1722 <https://github.com/Kozea/WeasyPrint/issues/1722>`_:
  Handle skew transformation with only one parameter
* `#1715 <https://github.com/Kozea/WeasyPrint/issues/1715>`_:
  Don’t crash when images are truncated
* `#1697 <https://github.com/Kozea/WeasyPrint/issues/1697>`_:
  Don’t crash when attr() is used in text-decoration-color
* `#1695 <https://github.com/Kozea/WeasyPrint/pull/1695>`_:
  Include language information in PDF metadata
* `#1612 <https://github.com/Kozea/WeasyPrint/issues/1612>`_:
  Don’t lowercase letters when capitalizing text
* `#1700 <https://github.com/Kozea/WeasyPrint/issues/1700>`_:
  Fix crash when rendering footnote with repagination
* `#1667 <https://github.com/Kozea/WeasyPrint/issues/1667>`_:
  Follow EXIF metadata for image rotation
* `#1669 <https://github.com/Kozea/WeasyPrint/issues/1669>`_:
  Take care of floats when removing placeholders
* `#1638 <https://github.com/Kozea/WeasyPrint/issues/1638>`_:
  Use the original box when breaking waiting children