Support gzip in range request / Explicitly set accept-encoding: identity #11027

joepio · 2019-07-31T12:02:47Z

I'm using react-pdf, which in turn uses pdf.js (awesome library, guys!), for a web application that shows open governmental meeting documents (demo).
I noticed that for large PDF files, loading takes a while, and it seems like pdf.js downloads the entire file before rendering the first page.

The docs mention that PDF.js fetches only the necessary data, if the server supports Content-Range header / Range requests.

The server (Google Cloud Storage) seems to support ranges (example file).

GET https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 HTTP/1.1
Range: bytes=0-1999
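
For readers who want to check a response like this by hand, here is a minimal sketch of building the Range request header and interpreting the reply. The helper names are hypothetical and not part of PDF.js, and the total size in the usage comment is made up. A server that honours the request is expected to answer with status 206 and a Content-Range header, while a plain 200 means the whole file is being sent.

```javascript
// Hypothetical helpers for illustration only; not part of PDF.js.

// Build the value of a Range request header for an inclusive byte range.
function buildRangeHeader(start, end) {
  return `bytes=${start}-${end}`;
}

// Parse a Content-Range response header such as "bytes 0-1999/5242880".
// Returns null when the header is missing or not in the expected form.
function parseContentRange(value) {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value || "");
  if (!match) {
    return null;
  }
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    // "*" means the server did not report the total size.
    total: match[3] === "*" ? null : Number(match[3]),
  };
}

// The range was honoured if the status is 206 and the Content-Range
// header describes exactly the bytes that were asked for.
function isPartialResponse(status, contentRange, start, end) {
  const parsed = parseContentRange(contentRange);
  return (
    status === 206 &&
    parsed !== null &&
    parsed.start === start &&
    parsed.end === end
  );
}
```

For example, `isPartialResponse(206, "bytes 0-1999/5242880", 0, 1999)` is true, whereas a 200 response without a Content-Range header yields false.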

The maintainer of react-pdf told me it should work if we simply pass a URL to pdfjs.getDocument, but it does not.

Is it true that getDocument should use the Range header if only a URL is passed?

Am I missing something? Should I manually create a PDFDataRangeTransport? Thanks in advance for your time!

Snuffleupagus · 2019-07-31T12:40:03Z

for a web application that shows open governmental meeting documents (demo).

This appears to link to localhost, so the demo isn't really working :-)

The maintainer of react-pdf told me it should work if we simply pass a URL to pdfjs.getDocument, but it does not.

Does it work correctly if you use the PDF.js library directly, to rule out a bug in the "react-pdf" library?
Please understand that it's very difficult for anyone here to provide support for other libraries, which embed PDF.js, since we know nothing about them.

Is it true that getDocument should use the Range header if only a URL is passed?

Yes, assuming that the server both supports Range Requests and is correctly configured.[1]

Should I manually create a PDFDataRangeTransport?

Most likely not, since this isn't really relevant to using Range Requests and would also require you to write/implement all the code to actually provide the necessary PDF data to the PDF.js API.[2]

[1] With the exception of files smaller than 2 * 65536 bytes, which probably isn't relevant here.

[2] PDFDataRangeTransport was originally implemented for use with the built-in Firefox PDF viewer.

joepio · 2019-07-31T13:28:09Z

Thanks for your quick reply! I've updated the URL not to link to localhost 🙈.

Here's a codepen that uses pdfjs.getDocument directly: https://codepen.io/joepio/pen/ymbrGg?editors=0010

It, too, waits for the entire PDF to be downloaded before it renders anything. I don't see range headers in the requests by pdf.js.

Snuffleupagus · 2019-08-01T10:55:56Z

Here's a codepen that uses pdfjs.getDocument directly: https://codepen.io/joepio/pen/ymbrGg?editors=0010

When putting a break-point at

pdf.js/src/display/network_utils.js

Line 47 in d909b86

if (getResponseHeader('Accept-Ranges') !== 'bytes') {

(or the corresponding line in the built pdf.js file), the result is null which would thus suggest that the server isn't configured properly. Hence this indicates a problem with the server, which we cannot really support, rather than the PDF.js library itself.

Also, in your example you're using an older and non-official version of the PDF.js library and it's recommended to always use the latest available release.

timvandermeij · 2019-08-01T21:30:40Z

Closing as answered since most likely PDF.js is not at fault here. If you can show it is, then please provide more details and we'll reopen.

joepio · 2019-08-02T14:53:21Z

Thanks for the quick replies and the help!

I've tested the server, but it seems to support the headers:

HEAD https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 HTTP/1.1

HTTP/1.1 200 OK
....
Accept-Ranges: bytes
....

I've updated the codepen example to use the latest version of PDF.js (2.2.228). It seems to have the same behavior as before.

When I use the network inspector, I don't see a HEAD request at all; the only request to the PDF file is a simple GET, without any Range headers.

I'd love to see a working example that successfully shows the first page before the document is loaded.

Snuffleupagus · 2019-08-02T16:28:51Z

It actually appears that CodePen somehow adds to the problem here, since it appears to strip some of the relevant response headers!?

When opening https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 directly in Firefox with the built-in PDF Viewer, which is based on PDF.js, Range Requests aren't supported either. However, here it's at least possible to debug properly and this is what I've found:

There's actually an Accept-Ranges response header, set to bytes as expected.
The Content-Encoding response header is however set to gzip, which explains the problems you're facing since it needs to be identity for Range Requests to work in the PDF.js library.
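
The conditions mentioned so far in this thread (an Accept-Ranges: bytes response header, a Content-Encoding of identity, and a file that is not smaller than 2 * 65536 bytes) can be combined into one check. The following is only a sketch under those assumptions and not the actual PDF.js source: the function name, the lower-cased header object, the reliance on Content-Length to learn the file size, and the exact handling of the size boundary are all illustrative choices.

```javascript
// Hypothetical sketch; names are illustrative and this is not the PDF.js source.
const RANGE_CHUNK_SIZE = 65536; // taken from the "2 * 65536" footnote above

// `headers` is a plain object keyed by lower-cased response header name.
function canUseRangeRequests(headers, rangeChunkSize = RANGE_CHUNK_SIZE) {
  // Assumption: the file size is read from the Content-Length header.
  const length = parseInt(headers["content-length"], 10);
  if (!Number.isInteger(length)) {
    return false;
  }
  // Small files are fetched in one go (boundary handling is an assumption).
  if (length < 2 * rangeChunkSize) {
    return false;
  }
  // The server must advertise byte ranges.
  if (headers["accept-ranges"] !== "bytes") {
    return false;
  }
  // A missing Content-Encoding is treated as "identity"; gzip fails here.
  const encoding = headers["content-encoding"] || "identity";
  return encoding === "identity";
}

// Headers resembling the situation described above (the size is made up).
const gzipped = {
  "content-length": "5000000",
  "accept-ranges": "bytes",
  "content-encoding": "gzip",
};
console.log(canUseRangeRequests(gzipped)); // false, because of gzip
```

With the same headers but no Content-Encoding entry, the check passes, which matches the observation that the gzip encoding is what disables Range Requests here.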

joepio · 2019-08-02T16:33:31Z

Nice find, thanks!

Perhaps pdf.js should set an Accept-Encoding: identity header? Or support gzip in range requests? I don't know how hard this is to develop, and how important it should be.

Unfortunately, Google's Storage API does not handle the Accept-Encoding header correctly: it ignores the Range header when Accept-Encoding is set to identity. But that is not an issue with pdf.js, of course. EDIT: I've submitted this to the Google Cloud issue tracker. To clarify: this bug does not cause the issue of this topic, but it might cause another one that could appear after this issue is solved.

GET https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 HTTP/1.1
Range: bytes=0-1999
Accept-Encoding: identity

joepio · 2019-08-05T09:51:53Z

The Content-Encoding response header is however set to gzip, which explains the problems you're facing since it needs to be identity for Range Requests to work in the PDF.js library.

It looks like the Accept-Encoding header that pdf.js sends in the request specifically allows for gzip. The server replies with content-encoding: gzip, but PDF.js needs it to be identity.

So I think PDF.js either should support gzip, or it should set Accept-Encoding: identity.

@timvandermeij Now it does seem like pdf.js might be at fault here. Could you perhaps re-open the issue?

Snuffleupagus · 2019-08-05T10:05:36Z

It looks like the Accept-Encoding header that pdf.js sends in the request specifically allows for gzip.

Searching through the PDF.js library for the string "Accept-Encoding" yields zero results, so I honestly don't understand how the PDF.js library could be at fault here!?

joepio · 2019-08-05T10:11:34Z

It appears that the Accept-Encoding of the request is set by Firefox as a default in range requests.

Searching through the PDF.js library for the string "Accept-Encoding" yields zero results, so I honestly don't understand how the PDF.js library could be at fault here!?

The lack of an explicit Accept-Encoding, along with no support for gzip, might be the issue here. If PDF.js cannot deal with gzip in range requests and requires identity, it should explicitly set that in the header.

Snuffleupagus · 2019-08-05T10:30:01Z

The amount of work, and the added complexity, required to actually support gzip may not be worth it in general, and I believe that not doing so may have been a conscious decision based on e.g. this IRC response.

Whether it's desirable to modify the PDF.js library to explicitly set the Accept-Encoding: identity header, someone else will have to weigh-in on.

However, as you're hopefully already aware, it's already possible to provide custom httpHeaders when calling getDocument; see

pdf.js/src/display/api.js

Line 133 in be70ee2

* @property {Object} httpHeaders - Basic authentication headers.
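
As a rough illustration of that option, here is a minimal sketch of passing custom headers to getDocument. It assumes the library is available as a global named `pdfjsLib` (as in the official builds); the helper `buildDocumentParams` is hypothetical, and the call is guarded so the snippet does nothing outside a page that has loaded PDF.js. Whether a browser actually sends an Accept-Encoding header supplied this way is a separate question, since browsers restrict which request headers scripts may set.

```javascript
// Hypothetical helper: builds the parameter object passed to getDocument.
function buildDocumentParams(url, httpHeaders) {
  // Copy the headers so later changes by the caller do not leak in.
  return { url, httpHeaders: { ...httpHeaders } };
}

const params = buildDocumentParams(
  "https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301",
  { "Accept-Encoding": "identity" }
);

// Only runs where the PDF.js library is loaded as the global `pdfjsLib`.
if (typeof pdfjsLib !== "undefined") {
  pdfjsLib.getDocument(params).promise.then((pdf) => {
    console.log(`Loaded a document with ${pdf.numPages} page(s)`);
  });
}
```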

joepio · 2019-08-05T11:39:57Z

If that IRC response is true:

jamesrobinson: if this server would disable gzip encoding for PDFs, it would appear lots faster
In general PDFs are already compressed and don't need further compression

Then there is even more reason to explicitly set the Accept-Encoding: identity header.

timvandermeij · 2019-08-05T20:29:19Z

Let's reopen this. @yurydelendik @brendandahl Do you have any comments on #11027 (comment)?

Snuffleupagus · 2020-02-02T16:28:13Z

According to the specifications, see e.g. the XMLHttpRequest specification which links to the relevant part of the Fetch specification, the Accept-Encoding header cannot be set (other than by the browser itself); see also https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_header_name
Hence it won't be possible to add Accept-Encoding: identity to the network requests made by the PDF.js library.
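
To make the restriction concrete, here is a small sketch that separates custom headers a script may set from those the browser reserves for itself. The list is deliberately partial (the Fetch specification defines the full set), and both helper names are hypothetical; this only mimics the browser's rule for illustration and is not how PDF.js or any browser implements it.

```javascript
// Partial list of forbidden request-header names, lower-cased.
// Illustrative only; see the Fetch specification for the complete set.
const FORBIDDEN_HEADER_NAMES = new Set([
  "accept-charset",
  "accept-encoding",
  "connection",
  "content-length",
  "cookie",
  "date",
  "host",
  "origin",
  "referer",
  "transfer-encoding",
]);

// Header names are case-insensitive; "Proxy-" and "Sec-" prefixes are
// reserved as well.
function isForbiddenHeaderName(name) {
  const lower = name.toLowerCase();
  return (
    FORBIDDEN_HEADER_NAMES.has(lower) ||
    lower.startsWith("proxy-") ||
    lower.startsWith("sec-")
  );
}

// Hypothetical helper: split custom headers into those a script may set
// and those that would not be sent as given.
function partitionHeaders(httpHeaders) {
  const allowed = {};
  const dropped = [];
  for (const [name, value] of Object.entries(httpHeaders)) {
    if (isForbiddenHeaderName(name)) {
      dropped.push(name);
    } else {
      allowed[name] = value;
    }
  }
  return { allowed, dropped };
}
```

For instance, `partitionHeaders({ "Accept-Encoding": "identity", Range: "bytes=0-1999" })` reports Accept-Encoding as dropped and keeps Range.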

With regards to adding support for gzip in the various stream implementations, that would risk adding a lot of additional complexity to the relevant code (not to mention the time/effort required to implement/review/test something like that).
Furthermore, the problem in this issue seems to be limited to a particular server rather than being a widespread issue.

All-in-all, it seems that WONTFIX probably is the appropriate resolution for this issue.

joepio mentioned this issue Jul 31, 2019

PDF Data Range not working correctly ontola/openbesluitvorming#46

Open

timvandermeij added the other label Jul 31, 2019

timvandermeij closed this as completed Aug 1, 2019

joepio changed the title ~~Using Range requests with pdfjs.getDocument~~ Support gzip in range request / Explicitly set accept-encoding: identity Aug 5, 2019

joepio mentioned this issue Aug 5, 2019

View first page before entire document is loaded - support range header wojtekmaj/react-pdf#419

Closed

5 tasks

timvandermeij reopened this Aug 5, 2019

timvandermeij added core and removed other labels Aug 5, 2019

timvandermeij closed this as completed Feb 3, 2020

timvandermeij removed the core label Feb 3, 2020
