Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

async / stream support #195

Merged
merged 33 commits into from
Sep 13, 2015
Merged

async / stream support #195

merged 33 commits into from
Sep 13, 2015

Conversation

dduponchel
Copy link
Collaborator

This pull request replaces #141.
The current API is synchronous : if JSZip takes too much time to finish its task, the page crashes
(it freezes during the task anyway). This commit does a the following :

  • rewrite the code into workers which are asynchronous
  • replace the public methods
  • add nodejs stream support
  • break the compatibility with existing code

Note : I created a branch jszip_v3 which I target : a change this big can't go
to the current v2.

Workers

A worker is like a nodejs stream but with some differences. On the good side :

  • it works on IE 6-9 without any issue / polyfill
  • it weights less than the full dependencies bundled with browserify
  • it forwards errors (no need to declare an error handler EVERYWHERE)

On the bad side :
It adds new code to maintain, fix, optimize.

A chunk is an object with 2 attributes : meta and data. The former is an
object containing anything (percent for example), see each worker for more
details. The latter is the real data (String, Uint8Array, etc).

Public API

I've updated my gh-pages with the latest API documentation.

Getters are replaced by nodeStream(type) and async(type), generate() is replaced by generateNodeStream and generateAsync. The "async" methods return a Promise, the "stream" methods return a Stream3 (from nodejs) if available.
They have an update callback to get access to metadata like the
name of the current file being compressed or the operation progress (with percent).

With this pull request, the API will be :

// generate
zip.generateAsync(...) // return a Promise of the content
zip.generateNodeStream(...) // return the node stream
zip.generateInternalStream(...) // return the internal stream to let the user write glue code

// getters
zip.async("arraybuffer") // return a Promise
zip.nodeStream("arraybuffer") // return a nodejs stream
zip.internalStream(...) // return the internal stream to let the user write glue code

The previous sync methods now throw exceptions.

Nodejs stream support

file(name, data) now accepts a nodejs stream as data. The "nodeStream" methods generates streams.

Breaking changes

The undocumented JSZip.compressions object changed : the object now returns
workers to do the job, the previous methods are not used anymore.

A lot of sync methods has been removed.

Optimize the call to file(name) : instead of using filter (in O(n)),
fetch the entry directly (in O(1)).
The index went too far, causing the optimizer to drop the compiled code.
Because of the try/catch, the JIT compiler can't do its magic.
This patch moves some of the hot code into separate functions : those
functions can be compiled, the try/catch remains in the main loop.
When loading a zip file with unicode path/comment, we were combining two costly
operations : read data as (binary) string and then convert the binary string
to a real string. This patch skip the first string conversion.
The previous code used a closure to keep a reference to the compressed
content and pointers to the beginning/end of the file in this content.
Then we had some complicated code to extract the content and handle the
corner cases.
This new version adds two new ("private" with a _) methods to the
ZipObject instances : _compress and _decompress. A ZipObject instance
will have two states : compressed (with a CompressedObject) and
decompressed (with a string, uint8array, etc).
With this patch, no more closure : for each file we create a
sub{string,array} for the data and then create a CompressedObject.
The transformation uint8array -> string (to check against the signature)
can be costly if repeated for a lot of entries. This commit uses the
reader to check the signature, speeding things with a Uint8ArrayReader.
The separation between the generation of the CompressedObjects and the
actual use of them will help for the next commit, adding generateAsync().
The "content is empty" check can be done at the insertion of the content
and not when compressing it.
This commit addresses the timeout issue. The current API is synchronous : if
JSZip takes too much time to finish its task, the page crashes
(it freezes during the task anyway). This commit does a the following :

- rewrite the code into workers which can be asynchronous
- add the needed public methods
- add nodejs stream support
- break the compatibility with existing code

Workers
-------

A worker is like a nodejs stream but with some differences. On the good side :

- it works on IE 6-9 without any issue / polyfill
- it weights less than the full dependencies bundled with browserify
- it forwards errors (no need to declare an error handler EVERYWHERE)

On the bad side :
To get sync AND async methods on the public API without duplicating a lot of
code, this class has `isSync` attribute and some if/then to choose between
doing stuff now, or using an async callback. It is dangerously close to
releasing Zalgo (see http://blog.izs.me/post/59142742143/designing-apis-for-asynchrony for more).

A chunk is an object with 2 attributes : `meta` and `data`. The former is an
object containing anything (`percent` for example), see each worker for more
details. The latter is the real data (String, Uint8Array, etc).

Public API
----------

Each method generating data (generate, asText, etc) gain a stream sibling :
generateStream, asTextStream, etc.
This will need a solid discussion because I'm not really satified with this.

Nodejs stream support
---------------------

With this commit, `file(name, data)` accepts a nodejs stream as data. It also
adds a `asNodejsStream()` on the StreamHelper.

Breaking changes
----------------

The undocumented JSZip.compressions object changed : the object now returns
workers to do the job, the previous methods are not used anymore.

Not broken yet, but the the `checkCRC32` (when loading a zip file, it
synchronously check the crc32 of every files) will need to be replaced.
This commit updates the saucelabs dependencies to the latest and update
the configuration :

- wait up to 10minutes, IE 6 is slow
- launch max 3 parallel jobs
- add IE6, IE11, safari 7, safari 8
Grunt-browserify adde in v3.2.0 a `banner` option, use it. Also ignore
the lib/nodejs/* files : they only make sense with streams.
This was referenced Feb 17, 2015
When paused, a nodejs stream CAN and WILL emit "end" events.
@dduponchel
Copy link
Collaborator Author

Keeping the sync methods will be too much troubles, I'll remove them.

That would give

zip.file("content.txt").async("uint8array", function onComplete(){...});
zip.file("content.txt").async({
    type: "uint8array",
    onUpdate: function () {...} // allow progress update notifications
}, function onComplete(){...});

zip.file("content.txt").stream("uint8array").on("data", ...); // our simple stream
zip.file("content.txt").nodeStream("uint8array").on("data", ...); // nodejs stream

zip.generateAsync({...}, function onComplete(){...});
zip.generateStream({...}).on("data", ...);
zip.generateNodeStream({...}).on("data", ...);

The old methods will throw an exception like "This method has been removed in JSZip 3.0, please check the upgrade guide.".

I'll remove the synchronous .load() method too : a sync checkCRC32 option isn't possible anymore and an async method allows more things (loading from a Blob, from a nodejs file descriptor, etc).

For the async operations, ES6 Promises look like the right choice (instead of manually adding a callback) but that means adding more bytes in the build.

@Stuk Does this sound reasonable ?

@andrewvarga
Copy link

Is this going to be part of the stable release some time? (Or any other async (webworker) support?

@dduponchel
Copy link
Collaborator Author

I'm still working on it... When I have enough time. I may have some time this weekend to continue working on the async load.

@andrewvarga
Copy link

Let me know if I can help, I could actually make it work in a webworker really easily by just adding some wrapper methods on top of jszip.min.js like:

self.onmessage = function(message) {
    var messageData = message.data;
    var command = messageData.command;
    var args = messageData.args;
    var requestId = messageData.requestId;

    switch (command) {
        case "load":
                   jsZip = new JSZip();
            jsZip.load(args[0]);

            self.postMessage({
                requestId: requestId
            });
                       break;
        case "parse":
           // ... etc

This way I didn't need to touch the actual JSZIP code at all, it just allows to call its functions in a webworker.

This commit removes most of the sync methods (everything but the load
method).
The new getters are :

- zip.async(type) : es6 promises
- zip.stream(type) : nodejs Stream3
- zip.internalStream(type) : our internal stream (useful to create wrappers).

The new generate methods are :
- zip.generateAsync() : es6 promises
- zip.generateStream() : nodejs Stream3
- zip.generateInternalStream() : our internal stream

With the "stream" keyword, everyone will think of the nodejs stream :
it's safer to use "internalStream" for our own.
The old sync versions will throw an exception.

Also fix the stream version we use : we use Stream3, not the stream
offered by the current node runtime.

Also changed in generate : prefer "binarystring" over "string".
That way, JSZip#generate will be consistent with ZipObject#async.
"string" is still supported because a zip as a text don't make sense.
This commit remove the sync "load()" method. It has been replaced by
"loadAsync()" which returns a Promise.
@dduponchel
Copy link
Collaborator Author

I've updated the pull request with async methods and I've updated the description. I've updated my gh-pages with the latest API documentation.
The next step will be to merge this with master.

@andrewvarga doesn't importScripts solve this issue ? Making JSZip Worker aware may give a strange API.

@balaclark
Copy link

Looks great, async is very necessary for large files.

setTimeout has 4ms minimum delay while on a recent browser a chunk takes
~0.2ms of processing. removing these 4ms make a huge difference.

Main downside: the polyfill globally declares setImmediate.
@dduponchel
Copy link
Collaborator Author

Up-to-date with v2.5.0 (I moved the branch jszip_v3 to this tag). @Stuk are you ok with this pull request ?

Browsers are also adding streams, see [1], [2], [3]. While the spirit is
the same, the API is different from nodejs streams. To avoid future
confusion, I prefer renaming `generateStream` to `generateNodeStream`
and `stream` to `nodeStream` (we already have `nodebuffer` as type).

  [0]: https://streams.spec.whatwg.org/
  [1]: https://blog.wanderview.com/blog/2015/06/19/intent-to-implement-streams-in-firefox/
  [2]: https://www.chromestatus.com/features/5804334163951616
When I migrated the tests from `asText()` to `async("string")`, I typed
a lot of `async("text")`. I strongly suspect that other users will have
the issue too while migrating.
@dduponchel
Copy link
Collaborator Author

I've prefixed stream methods with "node": browsers are also adding streams, see 1, 2, 3. While the spirit is the same, the API is different from nodejs streams. To avoid future confusion, I prefer renaming generateStream to generateNodeStream and stream to nodeStream (we already have nodebuffer as type).

return zipFileWorker;
};

/**
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ZipFileWorker be put in its own file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Split the generate.js file into generate/index.js and
generate/ZipFileWorker.js.
The error event wasn't sent upward, the parent workers were left paused
(or running).
Also make ".pause()" and ".resume()" return boolean, extending the
default behavior is easier with it.
The module `asap` doesn't update the global scope. The only downside: we
can't cancel the async call anymore. Instead, we now let the async call
happen and return early if the stream is paused or finished.
@Stuk
Copy link
Owner

Stuk commented Sep 12, 2015

👍 is this ready to me merged then?

@dduponchel
Copy link
Collaborator Author

@Stuk this pull request is ready, I'll continue on #224.

Stuk added a commit that referenced this pull request Sep 13, 2015
@Stuk Stuk merged commit c1260c8 into Stuk:jszip_v3 Sep 13, 2015
@0cv
Copy link

0cv commented Nov 13, 2015

damned, this is amazing, seriously. v2.5 just failed and crashed after 1 minute or so on few hundreds files weighting between 10KB to 500KB each and this new version just did the whole job in half a second! Unless there are some bugs (which I have not found on my few basic tests), it would make sense it becomes the new official version.

Small info: grunt (which I had to use to produce the dist v3.0) is somehow outdated with its uglify dependency and throws an error. Updating uglify to the latest version simply fixed the problem.

@dduponchel dduponchel mentioned this pull request Feb 10, 2016
@dduponchel dduponchel deleted the async branch March 24, 2016 22:13
@dduponchel dduponchel mentioned this pull request Apr 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants