Support gzip in range request / Explicitly set accept-encoding: identity #11027

Closed
joepio opened this issue Jul 31, 2019 · 14 comments

@joepio

joepio commented Jul 31, 2019

I'm using react-pdf, which in turn uses pdf.js (awesome library, guys!), for a web application that shows open governmental meeting documents (demo).
I noticed that for large PDF files, loading takes a while, and it seems like pdf.js downloads the entire file before rendering the first page.

The docs mention that PDF.js fetches only the necessary data, if the server supports Content-Range header / Range requests.

The server (Google Cloud Storage) seems to support ranges (example file).

GET https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 HTTP/1.1
Range: bytes=0-1999
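
For readers who want to reproduce this probe in code, here is a minimal sketch of how a byte range maps to a Range request header and how a Content-Range response header can be read back. The helper names buildRangeHeader and parseContentRange are made up for illustration and are not part of PDF.js.

```javascript
// Hypothetical helpers for illustration; not part of the PDF.js API.

// Build the value of a Range header for `length` bytes starting at `start`.
// HTTP byte ranges are inclusive on both ends, so 2000 bytes from 0 is "0-1999".
function buildRangeHeader(start, length) {
  if (!Number.isInteger(start) || !Number.isInteger(length) || start < 0 || length <= 0) {
    throw new RangeError("start must be >= 0 and length must be > 0");
  }
  return `bytes=${start}-${start + length - 1}`;
}

// Parse a Content-Range value such as "bytes 0-1999/123456".
// Returns null when the value does not match that shape.
function parseContentRange(value) {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value || "");
  if (!match) {
    return null;
  }
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    // "*" means the server did not report the total size.
    total: match[3] === "*" ? null : Number(match[3]),
  };
}
```

Calling buildRangeHeader(0, 2000) yields the same header value as in the request above.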

The maintainer of react-pdf told me it should work if we simply pass a URL to pdfjs.getDocument, but it does not.

Is it true that getDocument should use the Range header if only a URL is passed?

Am I missing something? Should I manually create a PDFDataRangeTransport? Thanks in advance for your time!

@Snuffleupagus
Collaborator

for a web application that shows open governmental meeting documents (demo).

This appears to link to localhost, so the demo isn't really working :-)

The maintainer of react-pdf told me it should work if we simply pass a URL to pdfjs.getDocument, but it does not.

Does it work correctly if you use the PDF.js library directly, to rule out a bug in the "react-pdf" library?
Please understand that it's very difficult for anyone here to provide support for other libraries, which embed PDF.js, since we know nothing about them.

Is it true that getDocument should use the Range header if only a URL is passed?

Yes, assuming that the server both supports Range Requests and is correctly configured.[1]

Should I manually create a PDFDataRangeTransport?

Most likely not, since this isn't really relevant to using Range Requests and would also require you to write/implement all the code to actually provide the necessary PDF data to the PDF.js API.[2]


[1] With the exception of files smaller than 2 * 65536 bytes, which probably isn't relevant here.

[2] PDFDataRangeTransport was originally implemented for use with the built-in Firefox PDF viewer.

@joepio
Author

joepio commented Jul 31, 2019

Thanks for your quick reply! I've updated the URL not to link to localhost 🙈.

Here's a codepen that uses pdfjs.getDocument directly: https://codepen.io/joepio/pen/ymbrGg?editors=0010

It, too, waits for the entire PDF to be downloaded before it renders anything. I don't see range headers in the requests by pdf.js.

@Snuffleupagus
Collaborator

Snuffleupagus commented Aug 1, 2019

Here's a codepen that uses pdfjs.getDocument directly: https://codepen.io/joepio/pen/ymbrGg?editors=0010

When putting a break-point at

if (getResponseHeader('Accept-Ranges') !== 'bytes') {
(or the corresponding line in the built pdf.js file), the result is null which would thus suggest that the server isn't configured properly. Hence this indicates a problem with the server, which we cannot really support, rather than the PDF.js library itself.

Also, in your example you're using an older and non-official version of the PDF.js library and it's recommended to always use the latest available release.

@timvandermeij
Contributor

Closing as answered since most likely PDF.js is not at fault here. If you can show it is, then please provide more details and we'll reopen.

@joepio
Author

joepio commented Aug 2, 2019

Thanks for the quick replies and the help!

I've tested the server, but it seems to support the headers:

HEAD https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 HTTP/1.1
HTTP/1.1 200 OK
....
Accept-Ranges: bytes
....

I've updated the codepen example to use the latest version of PDF.js (2.2.228). It seems to have the same behavior as before.

When I use the network inspector, I don't see a HEAD request at all; the only request to the PDF file is a simple GET, without any Range headers.

I'd love to see a working example that successfully shows the first page before the document is loaded.

@Snuffleupagus
Collaborator

It actually appears that CodePen somehow adds to the problem here, since it appears to strip some of the relevant response headers!?

When opening https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 directly in Firefox with the built-in PDF Viewer, which is based on PDF.js, Range Requests aren't supported either. However, here it's at least possible to debug properly and this is what I've found:

  • There's actually an Accept-Ranges response header, set to bytes as expected.
  • The Content-Encoding response header is however set to gzip, which explains the problems you're facing since it needs to be identity for Range Requests to work in the PDF.js library.
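
The conditions mentioned so far in this thread (the Accept-Ranges check, the Content-Encoding requirement, and the minimum file size of 2 * 65536 bytes) can be combined into one small decision function. This is a simplified sketch and not the actual PDF.js source: the function name, the lowercase-keyed headers object, the requirement of a numeric Content-Length, and the handling of a file of exactly 2 * 65536 bytes are all assumptions.

```javascript
// Simplified illustration of the checks discussed in this thread.
// It is NOT the actual PDF.js source; names and structure are assumptions.
const DEFAULT_RANGE_CHUNK_SIZE = 65536;

// `headers` is a plain object of response headers with lowercase keys.
function canUseRangeRequests(headers, rangeChunkSize = DEFAULT_RANGE_CHUNK_SIZE) {
  const length = parseInt(headers["content-length"], 10);
  if (!Number.isInteger(length)) {
    return { ok: false, reason: "unknown Content-Length" };
  }
  // The thread says files "smaller than" 2 * 65536 bytes are excluded;
  // how the exact boundary is treated here is an assumption.
  if (length < 2 * rangeChunkSize) {
    return { ok: false, reason: "file too small" };
  }
  if (headers["accept-ranges"] !== "bytes") {
    return { ok: false, reason: "no Accept-Ranges: bytes" };
  }
  // A missing Content-Encoding header is treated as identity.
  const encoding = headers["content-encoding"] || "identity";
  if (encoding !== "identity") {
    return { ok: false, reason: `Content-Encoding is ${encoding}` };
  }
  return { ok: true, reason: "range requests usable" };
}
```

With the headers reported for the file in this issue (Accept-Ranges: bytes together with Content-Encoding: gzip), the sketch returns ok: false and names the encoding as the reason.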

@joepio
Author

joepio commented Aug 2, 2019

Nice find, thanks!

Perhaps pdf.js should set an Accept-Encoding: identity header? Or support gzip in range requests? I don't know how hard this is to develop, or how important it should be.

Unfortunately, Google's Storage API does not handle the Accept-Encoding header correctly: it ignores the Range header when Accept-Encoding is set to identity. But that is not an issue with pdf.js, of course. EDIT: I've submitted this to the Google Cloud issue tracker. To clarify: this bug does not cause the issue in this topic, but it might cause another one that could appear after this issue is solved.

GET https://storage.googleapis.com/ori-static/api.notubiz.nl/document/6131301 HTTP/1.1
Range: bytes=0-1999
Accept-Encoding: identity
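
As a rough way to tell whether a server honoured a request like this one, the status code and the Content-Range header can be inspected: a 206 Partial Content response carrying Content-Range means the range was served, while a plain 200 means the whole body was sent. The helper below is a hypothetical sketch (its name and return values are made up), and it expects response headers as a plain object with lowercase keys.

```javascript
// Hypothetical helper: classify a response to a request that carried a Range header.
function classifyRangeResponse(status, headers) {
  if (status === 206 && typeof headers["content-range"] === "string") {
    return "partial"; // the server honoured the Range header
  }
  if (status === 200) {
    return "full"; // the Range header was ignored and the full body was sent
  }
  if (status === 416) {
    return "unsatisfiable"; // the requested range lies outside the resource
  }
  return "other";
}
```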

@joepio
Author

joepio commented Aug 5, 2019

The Content-Encoding response header is however set to gzip, which explains the problems you're facing since it needs to be identity for Range Requests to work in the PDF.js library.

It looks like the Accept-Encoding header that pdf.js sends in the request specifically allows for gzip. The server replies with content-encoding: gzip, but PDF.js needs it to be identity.

So I think PDF.js should either support gzip, or set Accept-Encoding: identity.

@timvandermeij Now it does seem like pdf.js might be at fault here. Could you perhaps re-open the issue?

@Snuffleupagus
Collaborator

It looks like the Accept-Encoding header that pdf.js sends in the request specifically allows for gzip.

Searching through the PDF.js library for the string "Accept-Encoding" yields zero results, so I honestly don't understand how the PDF.js library could be at fault here!?

@joepio
Author

joepio commented Aug 5, 2019

It appears that the Accept-Encoding of the request is set by Firefox as a default in range requests.

Searching through the PDF.js library for the string "Accept-Encoding" yields zero results, so I honestly don't understand how the PDF.js library could be at fault here!?

The lack of an explicit Accept-Encoding, along with no support for gzip, might be the issue here. If PDF.js cannot deal with gzip in range requests and requires identity, it should explicitly set that in the header.

@joepio changed the title from "Using Range requests with pdfjs.getDocument" to "Support gzip in range request / Explicitly set accept-encoding: identity" on Aug 5, 2019
@Snuffleupagus
Collaborator

Snuffleupagus commented Aug 5, 2019

The amount of work, and the added complexity, required to actually support gzip may not be worth it in general, and I believe that not doing so may have been a conscious decision based on e.g. this IRC response.

Whether it's desirable to modify the PDF.js library to explicitly set the Accept-Encoding: identity header is something someone else will have to weigh in on.


However, as you're hopefully already aware, it's already possible to provide custom httpHeaders when calling getDocument; see

* @property {Object} httpHeaders - Basic authentication headers.
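
As a minimal sketch of what passing httpHeaders could look like, the helper below builds a parameter object for getDocument. The helper name buildDocumentParams, the example.com URL, and the Authorization value are placeholders invented for illustration; url, httpHeaders, rangeChunkSize, and disableRange are getDocument parameters.

```javascript
// Sketch only: builds the parameter object one could pass to getDocument.
// buildDocumentParams is a hypothetical helper, not part of PDF.js.
function buildDocumentParams(url, httpHeaders = {}) {
  return {
    url,
    // Copy the headers so later changes by the caller do not leak in.
    httpHeaders: { ...httpHeaders },
    // 65536 matches the chunk size mentioned earlier in this thread.
    rangeChunkSize: 65536,
    disableRange: false,
  };
}

// Placeholder URL and token, for illustration only.
const params = buildDocumentParams("https://example.com/file.pdf", {
  Authorization: "Bearer <token>",
});

// With PDF.js loaded as `pdfjsLib`, the object would be used like this:
// pdfjsLib.getDocument(params).promise.then((pdf) => console.log(pdf.numPages));
```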

@joepio
Author

joepio commented Aug 5, 2019

If that IRC response is true:

jamesrobinson: if this server would disable gzip encoding for PDFs, it would appear lots faster
In general PDFs are already compressed and don't need further compression

Then there is even more reason to explicitly set the Accept-Encoding: identity header.

@timvandermeij reopened this on Aug 5, 2019
@timvandermeij added the core label and removed the other label on Aug 5, 2019
@timvandermeij
Contributor

Let's reopen this. @yurydelendik @brendandahl Do you have any comments on #11027 (comment)?

@Snuffleupagus
Collaborator

According to the specifications, see e.g. the XMLHttpRequest specification which links to the relevant part of the Fetch specification, the Accept-Encoding header cannot be set (other than by the browser itself); see also https://developer.mozilla.org/en-US/docs/Glossary/Forbidden_header_name
Hence it won't be possible to add Accept-Encoding: identity to the network requests made by the PDF.js library.
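
To illustrate the restriction, here is a small sketch that checks a header name against a partial list of forbidden header names. The list is deliberately not exhaustive (the Fetch specification is the authoritative source), and the function names are made up for this example.

```javascript
// Partial, non-exhaustive list of forbidden request header names (lowercase).
// See the Fetch specification for the authoritative list.
const FORBIDDEN_HEADER_NAMES = new Set([
  "accept-charset",
  "accept-encoding",
  "connection",
  "content-length",
  "cookie",
  "date",
  "host",
  "origin",
  "referer",
  "te",
  "transfer-encoding",
  "upgrade",
  "via",
]);

// Hypothetical helper: true when script code may not set this request header.
function isForbiddenHeaderName(name) {
  const lower = name.toLowerCase();
  return (
    FORBIDDEN_HEADER_NAMES.has(lower) ||
    lower.startsWith("proxy-") ||
    lower.startsWith("sec-")
  );
}

// Split custom headers into those a browser would accept and those it would drop.
function splitHeaders(headers) {
  const allowed = {};
  const dropped = [];
  for (const [name, value] of Object.entries(headers)) {
    if (isForbiddenHeaderName(name)) {
      dropped.push(name);
    } else {
      allowed[name] = value;
    }
  }
  return { allowed, dropped };
}
```

Under this sketch, an Accept-Encoding entry supplied through custom headers ends up in the dropped list, while a header such as Authorization passes through.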

With regards to adding support for gzip in the various stream implementations, that would risk adding a lot of additional complexity to the relevant code (not to mention the time/effort required to implement/review/test something like that).
Furthermore, the problem in this issue seems to be limited to a particular server rather than being a widespread issue.

All-in-all, it seems that WONTFIX probably is the appropriate resolution for this issue.
