Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZLib support for decoding Zip files #26332

Closed
kingces95 opened this issue Feb 27, 2019 · 7 comments
Closed

ZLib support for decoding Zip files #26332

kingces95 opened this issue Feb 27, 2019 · 7 comments
Labels
question Issues that look for answers. zlib Issues and PRs related to the zlib subsystem.

Comments

@kingces95
Copy link

kingces95 commented Feb 27, 2019

I'd like to write a zip file decompressor in node. To do that, I'd need to be able to inflate a deflated stream without knowing it's length in advance (some zip files do not include the deflated stream's length until after the stream).

I'm not entirely sure, but I don't think node currently allows this. The popular npm unzip modules all search for a magic signature to find the end of the deflate stream. This is incorrect (per Mark Adler) as the magic signature might appear as compressed data in the deflate stream itself. I've linked to some of those examples in the SO discussion with Mark Adler on SO here.

I think node strives to expose the deflate logic as a stream transform. A stream transform expects to to consume each of the data chunks passed to it. So there's no way, via that abstraction, to have the stream self report it's own termination mid chunk.

I believe this is the reference implementation: https://github.com/madler/zlib/blob/master/examples/gzappend.c

@bnoordhuis bnoordhuis added question Issues that look for answers. zlib Issues and PRs related to the zlib subsystem. labels Feb 27, 2019
@bnoordhuis
Copy link
Member

I don't see why that wouldn't work. Node.js checks for Z_STREAM_END and when it sees that status code, it resets the stream and starts inflating the remaining bytes in the current chunk.

If you're saying that's not working for you, can you post a minimal standalone test case? Or do you mean you want Node.js to stop after the Z_STREAM_END without consuming the remaining bytes?

@addaleax
Copy link
Member

Does checking stream.bytesWritten after a chunk has been written help? That value provides you with information about how much data the zlib library actually consumed, and will stop increasing once the end of the stream has been reached.

It doesn’t quite work this way for gunzip, because those streams try to continue reading after EOF has been reached, in case another member of the gzip file follows. But I don’t think that’s what you’re looking for anyway?

@kingces95
Copy link
Author

kingces95 commented Feb 27, 2019

I was looking for a way to know when the stream terminated without being otherwise told by one of the zip headers and I couldn't figure it out. :( So maybe stream.bytesWritten is the answer. I'll look more deeply at that. It's documentation seems to indicate it's a measure of bytes pushed and not back pressure from zlib, no?

The zlib.bytesWritten property specifies the number of bytes written to the engine, before the bytes are processed (compressed or decompressed, as appropriate for the derived class).

Is there an example that determines when the stream terminates using Z_STREAM_END or stream.bytesWritten? I couldn't find one in the docs. And the unzip NPM libs I scanned either searched for the magic sig (here, here), claimed a zip couldn't be scanned (here) which I'm pretty sure is not true (Mark Adler verified this back on SO), or claimed to stream but actually read the file backwards to get the directory (here). I think the directory is for random access whereas I want to truly stream; I want to inflate the file while its coming down over the wire.

Also, could you point me to where Node.js checks for Z_STREAM_END? I see it defined here but not used anywhere else in any JS code. Is the implementation done in C?

Thanks in advance for any pointers!

@kingces95
Copy link
Author

kingces95 commented Feb 27, 2019

Ah! Here is where it node.js checks for Z_STREAM_END!

      while (strm_.avail_in > 0 &&
             mode_ == GUNZIP &&
             err_ == Z_STREAM_END &&
             strm_.next_in[0] != 0x00) {
        // Bytes remain in input buffer. Perhaps this is another compressed
        // member in the same archive, or just trailing garbage.
        // Trailing zero bytes are okay, though, since they are frequently
        // used for padding.

        ResetStream();
        err_ = inflate(&strm_, flush_);
      }

And it looks like it's only applied in GUNZIP mode. gzip is a format for single compressed files (per Mark Adler here).

Eh. But here's UNZIP

Decompress either a Gzip- or Deflate-compressed stream by auto-detecting the header.

That's looking good... except, like the docs say, that's for GZIP format not ZIP format. And the code scans for GZIP magic headers not ZIP magic headers.

#define GZIP_HEADER_ID1 0x1f
#define GZIP_HEADER_ID2 0x8b

But let's see how this stream.bytesWritten is computed...


Mr. Adler points us to this as the reference implementation of inflate and here he supplies a heavily annotated version of this file.

@kingces95
Copy link
Author

var zlib = require('zlib');
var assert = require('assert');

const input = Buffer.from('01234567890123456789');

zlib.deflate(input, (err, deflatedBuffer) => {
  assert(!err);

  deflatedBuffer = Buffer.concat([deflatedBuffer, Buffer.from([1,2,3,4,5,6])]);

  var numberRead = 0;
  var buffers = [];

  var stream = zlib.createInflate()
    .on('data', function(chunk) {
      buffers.push(chunk);
      numberRead += chunk.length;
    })
    .on('end', function() {
      var result = Buffer.concat(buffers, numberRead);
      this.close();
      console.log(result.toString());
      console.log(stream.bytesWritten);
      })
    .end(deflatedBuffer);
});

Returns

01234567890123456789
20

So bytesWritten is 20 and not 26 so I know those extra 6 bytes I added were not part of the deflate stream.

@addaleax
Copy link
Member

@kingces95 Glad you got something that works for you!

That being said, I think it might be a bug that the zlib stream doesn’t emit the 'end' event if you change .end() to .write(). It essentially forces you to see whether you’ve reached the end of the data stream while still writing data to it, I think – I’ll try to look into that.

(That’s something you’d have to work around when supporting all existing Node.js versions, though.)

addaleax added a commit to addaleax/node that referenced this issue Feb 28, 2019
Report end-of-stream when decompressing when we detect it,
and do not wait until the writable side of a zlib stream
is closed as well.

Refs: nodejs#26332
@addaleax
Copy link
Member

(See #26363 for a patch that should help here.)

BridgeAR pushed a commit to BridgeAR/node that referenced this issue Mar 3, 2019
Report end-of-stream when decompressing when we detect it,
and do not wait until the writable side of a zlib stream
is closed as well.

Refs: nodejs#26332

PR-URL: nodejs#26363
Refs: nodejs#26332
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: James M Snell <jasnell@gmail.com>
BridgeAR pushed a commit that referenced this issue Mar 4, 2019
Report end-of-stream when decompressing when we detect it,
and do not wait until the writable side of a zlib stream
is closed as well.

Refs: #26332

PR-URL: #26363
Refs: #26332
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: James M Snell <jasnell@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Issues that look for answers. zlib Issues and PRs related to the zlib subsystem.
Projects
None yet
Development

No branches or pull requests

3 participants