Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Add support for concatenated gzip files. #6442

Closed
wants to merge 7 commits into from

Conversation

eendeego
Copy link

No description provided.

@Nodejs-Jenkins
Copy link

Thank you for contributing this pull request! Here are a few pointers to make sure your submission will be considered for inclusion.

Commit eendeego/node@d69b312 has the following error(s):

  • First line of commit message must be no longer than 50 characters
  • Commit message must indicate the subsystem this commit changes

Commit eendeego/node@7934cc7 has the following error(s):

  • First line of commit message must be no longer than 50 characters
  • Commit message must indicate the subsystem this commit changes

The following commiters were not found in the CLA:

  • Luis Reis

You can fix all these things without opening another issue.

Please see CONTRIBUTING.md for more information

@eendeego
Copy link
Author

Already Signed the CLA as "Luis Miguel Fonseca dos Reis"

@indutny
Copy link
Member

indutny commented Oct 31, 2013

Thanks for contributing this!

Perhaps, it'd be better if zlib would be able to do "partial" decompression and return the offset it has stopped at? /cc @isaacs

@eendeego
Copy link
Author

That can be done, but there must be a way to know the Z_STREAM_END status and issue the necessary inflateReset (which I think is already visible in JS space.)

@rlidwka
Copy link

rlidwka commented Oct 31, 2013

What the behaviour is when there is a garbage at the end, something like gzip(a)+gzip(b)+plaintext? There is no test for it.

gzip for example decompresses it properly, but prints out a warning to stdout:

gzip: 1: decompression OK, trailing garbage ignored

@eendeego
Copy link
Author

eendeego commented Nov 4, 2013

Hi @indutny, I've been looking at some form to emit data events for each [zlib] stream termination. I have these issues:

  1. The code for the inflateReset/inflate should actually be a loop.
    If the loop was implemented in the Process function (like the if/inflate is now), that thread would block until all the last buffer received by Write was consumed. I think this is the opposite behavior from what I understand from your comment. This would also make the After function awkward because it would have to buffer multiple chunks and eventually an error (as referred by @rlidwka).

  2. I've tried to move the inflateReset/inflate code from Process to After, just after the call to JS land (see below), but I'm having trouble because the 'end' event is being emitted/received before processing the last [zlib] stream on the test.

Z_STREAM_END handling on node_zlib.cc / After function:

// [...]
MakeCallback(env, handle, env->callback_string(), ARRAY_SIZE(args), args);

if (ctx->err_ == Z_STREAM_END && ctx->strm_.avail_in > 0) {
  ctx->err_ = inflateReset(&ctx->strm_);
  uv_work_t* work_req = &(ctx->work_req_);
  uv_queue_work(ctx->env()->event_loop(),
                work_req,
                ZCtx::Process,
                ZCtx::After);
}

ctx->Unref();
// [...]

Here's the stack trace for the failed test:

assert.js:98
  throw new assert.AssertionError({
        ^
AssertionError: "123\n456\n" == "123\n456\n789\n"
    at Gunzip.<anonymous> (/Users/lmfr/work/node-build/node/test/simple/test-zlib-from-multiple-gzip.js:70:16)
    at Gunzip.EventEmitter.emit (events.js:125:20)
    at _stream_readable.js:896:16
    at process._tickCallback (node.js:316:11)

Note that the test fails exactly with the same error even if Zlib.prototype.close's code is replaced by a return statement.

  1. It seems that even if this worked well, I would be spawning too many threads just to be able to send chunks ending with Z_STREAM_END return codes.

  2. It seems that JS can only be called from the After function.

What am I doing/assuming wrong ?

@indutny
Copy link
Member

indutny commented Nov 4, 2013

Wow, thanks for trying it out!

I think that it should be ok if it would just return offset that it stopped at, after that zlib.js can invoke inflate again with buf.slice(offset) and continue doing it's filthy stuff.

So, please don't call it in a loop and let js handle it. This should be much simpler than writing all this logic in C++.

Please, let me know if you need any other hints and/or help!

@eendeego
Copy link
Author

eendeego commented Nov 7, 2013

This last commit moves stream end handling to JS land - I'm not sure the test on the _transform() callback is the proper one. It seems that being able to check for a Z_STREAM_END would be the best choice, but ctx->err_ is not exposed to JS code.

To be able to handle the garbage example as given by @rlidwka, I had to create a new send_error var on the worker After callback due to race conditions.

Waiting for feedback...

@indutny
Copy link
Member

indutny commented Dec 5, 2013

Sorry, I know it took awhile for me to get there. But I'm still not ready to give you any feedback. I don't really enjoy that it does it automatically, but I'm not sure how it should look API-wise. I'll figure it out a bit later.

self._buffer,
self._offset,
self._chunkSize);
newReq.callback = callback; // this same function

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will lint complain about not having two spaces between the code and the comment, or is that just for cpp?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just the same as every comment in the same file - actually that line it's copied from some lines above.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. wasn't worried, but also couldn't remember which lint would check for that.

@trevnorris
Copy link

This'll need to be rebased.

@eendeego
Copy link
Author

eendeego commented Dec 6, 2013

@trevnorris Apart from the // this same function comment, the spacing incidents were derived from my sloppiness. Those are fixed in the last commit

Sorry.

@indutny
Copy link
Member

indutny commented Dec 6, 2013

@trevnorris I don't really enjoy how it automatically continues decompressing it without letting user know that it does it. I think it should be explicit.

In other words, ideally, I think that there should be a stream.completed set to true or something like this. And if user will call stream.read() again - it'll continue reading from the place where it stopped doing it.

The main problem is that this mode is not really supported by streams. @isaacs if you have time - may I ask your advice on this topic?

@eendeego
Copy link
Author

eendeego commented Dec 6, 2013

@indutny Implementations on other languages also do this transparently. How about some (ignorable) event sent to the client code ?

@indutny
Copy link
Member

indutny commented Jan 8, 2014

@luismreis: I had a conversation with @tjfontaine, I think we are going to land it. @tjfontaine was going to review it for one more time. Assigning issue to him

@ghost ghost assigned tjfontaine Jan 8, 2014
@indutny
Copy link
Member

indutny commented Jan 23, 2014

@tjfontaine reminder

@@ -510,6 +510,25 @@ Zlib.prototype._transform = function(chunk, encoding, cb) {
return;
}

if (availInAfter > 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a sane way to refactor this such that these two clauses can be combined instead of duplicated?

@tjfontaine
Copy link

I'd like to land this, but am hoping you're able to refactor to reduce some of the duplication

@indutny
Copy link
Member

indutny commented Feb 18, 2014

@luismreis heya, any news?

@seishun
Copy link

seishun commented Mar 22, 2014

@indutny if it doesn't continue to decompress automatically, then how should it work with the synchronous zlib functions? Or have you agreed with the automatic approach?

@indutny
Copy link
Member

indutny commented Mar 22, 2014

Yes, I agreed with it after all.

@seishun
Copy link

seishun commented Mar 22, 2014

Alright, I'll see if I can rebase and refactor this and make a new PR.

@eendeego
Copy link
Author

Hi @indutny, I haven't got much time recently. I'd like to pick it up, but I'm short on time.

@seishun
Copy link

seishun commented Mar 23, 2014

It seems this PR makes it an error to have trailing garbage. According to gzip.org, it's not unusual for tar.gz files to be padded with zeroes. Also, I don't see a way to retrieve the decompressed data when using the sync functions if it throws on garbage.

On the other hand, Python's implementation raises an exception when there's trailing garbage in gzip (but not zlib) and of course doesn't return anything. But then again, Python's API is terrible anyway in my opinion.

How do you think we should approach this? How do other implementations do it?

@eendeego
Copy link
Author

Yes, it throws an error, I know that I was reproducing behavior from something else, but can't recall now.

// OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
// USE OR OTHER DEALINGS IN THE SOFTWARE.

// test unzipping a file that was created by contactenating multiple gzip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: s/contactenating/concatenating/

@bnoordhuis
Copy link
Member

Some nits but otherwise LGTM.

@eendeego
Copy link
Author

Hey, @bnoordhuis, @trevnorris, thank you very much for the comments. I've made all the suggested changes except for adding the documentation.

Why ? Because per the RFC this isn't a extra feature, it's something that should have always been there. However, it's amazing to google for this kind of issue and find out that every language's zlib implementation had the same problem at some point.

So... if you still think this documentation is necessary, I suggest adding the following paragraph to doc/api/zlib.markdown just before the Examples title:

The deflate feature can transparently decompress concatenated gzip streams
as specified in section 2.2 of the GZip RFC.

@eendeego
Copy link
Author

@trevnorris, @tjfontaine, @indutny ... ping...

@eugeneware
Copy link

I agree with @luismreis . This is not really a new feature. It's a bug fix. Concatenated Gzips data sets are a valid gzip file and stream.

I got caught with this bug when trying to uncompress the web data from http://commoncrawl.org which uses this aspect of the gzip file format to enable indexing into a gzip file.

The python code that common-crawl supply supports this form of gzip streaming, I was a little surprised to find that node.js didn't.

But a big thanks to @luismreis for this PR!

@trevnorris
Copy link

@indutny What's your take? I'm alright with the change.

@eendeego
Copy link
Author

eendeego commented Nov 7, 2014

Sorry for the spam, but, ... @trevnorris, @tjfontaine, @indutny ... ping...

@eugeneware
Copy link

Yes please! It's now been over 1 year since this original PR was submitted.

@bnoordhuis
Copy link
Member

It LGTM last time I looked at it but I can't land it here. I can land it in node-forward but that repo is private for one or two more days.

@indutny
Copy link
Member

indutny commented Nov 7, 2014

LGTM too.

@trevnorris
Copy link

Overall LGTM.

Will this land cleanly on v0.12?

@eendeego
Copy link
Author

eendeego commented Nov 8, 2014

It should, it was recently rebased.

@indutny
Copy link
Member

indutny commented Nov 9, 2014

Landed in node-forward/node@1183ba4, thank you!

@eendeego
Copy link
Author

eendeego commented Nov 9, 2014

NO! Thank you!

@eugeneware
Copy link

Thank you! Will this bug fix be applied to v0.10 as well?

@misterdjules misterdjules modified the milestones: 0.11.15, v0.12 Nov 21, 2014
chrisdickinson pushed a commit that referenced this pull request Dec 16, 2014
Reviewed-By: Fedor Indutny <fedor@indutny.com>
PR-URL: #6442
@chrisdickinson
Copy link

Landed in 6f6a979 for v0.12. Thanks!

@eugeneware
Copy link

Thanks @chrisdickinson. Will this bug fix be rebased on to the 0.10 branch too?

@misterdjules
Copy link

@luismreis @chrisdickinson Should we re-open this PR so that we can fix the regression it caused (see #8962) and land it again in a later milestone?

In the meantime, I'm removing it from the 0.11.15 milestone.

@misterdjules misterdjules removed this from the 0.11.15 milestone Jan 7, 2015
@eendeego
Copy link
Author

eendeego commented Jan 8, 2015

@misterdjules I don't really know what to say... According to the (gzip) spec we should be restarting the gunzip process each time there's more data, but, apparently, npm depends on the behavior of terminating on the first terminating gzip stream.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.