-
-
Notifications
You must be signed in to change notification settings - Fork 792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failing for BGZIP'd streaming files #139
Comments
This question is a bit out of project scope, but i recommend split all to smaller steps:
Also, if you are on server size - just use node's zlib instead, pako is good for browsers only. PS. You may wish to consider use JSzip, it's better suited for end users. This library is a bit low level thing. |
Hi @puzrin, thanks for the response. I'm sorry, I don't think I've communicated the issue properly. I believe this is a bug in pako. I believe that pako does not stream decompress block compressed data properly. I'll try and address each of your concerns below:
I use the server size From what I understand, Here is a simpler server side program purely for illustrative purposes. I understand that this is server-side but I'm trying to expose a bug in pako so I'm providing a simple example program exposing the bug in pako. As I've said above I want this to be in browser, not server side so I can't use node's
Running the above code on a file that has been block compressed (pako-block-decompress-failure.txt.gz) produces the following output:
I understand that this library is most likely a volunteer effort on your part and so I would completely understand if this fell in a "don't fix" category purely for lack of interest or time but this is within scope of I am unfamiliar with |
Got it (with simple example from last post). I think, problem is here Line 273 in 893381a
Let me explain. Pako consists of 2 parts:
When we implemented wrappers, we could not find what to do if stream consists of multiple parts (probably returns multiple Z_STREAM_END). That's not widely used mode. /cc @Kirill89 could you take a look? |
That's a minimal sample to reproduce: const pako = require('pako');
let input = require('fs').readFileSync('./pako-block-decompress-failure.txt.gz');
let output = pako.inflate(input);
console.log(`size = ${output.length}`); // => size = 65280 !!! |
After quick look - seems your data really generates Z_STREAM_END status before end. Sure, wrapper can be fixed for this case, but i don't know how. |
The relevant file (afaict) in the most current version (as of this writing) of node is node_zlib.cc (line 302):
|
Yeah, i've seen it. Could not make quick hack to work. Seems it's better to wait for weekend, when Kirill can take a look. |
I checked the same file on original zlib code and found the same behavior ( Also I found very interesting implementation of wrapper for This is possible fix: c60b97e After that fix one test becomes broken, but I don't understand why (need your help to solve). Code to reproduce same behavior: // READ FILE
size_t file_size;
Byte *file_buf = NULL;
uLong buf_size;
FILE *fp = fopen("/home/Kirill/Downloads/pako-fail-test-data.txt.gz", "rb");
fseek(fp, 0, SEEK_END);
file_size = ftell(fp);
rewind(fp);
buf_size = file_size * sizeof(*file_buf);
file_buf = malloc(buf_size);
fread(file_buf, file_size, 1, fp);
// INIT ZLIB
z_stream d_stream;
d_stream.zalloc = Z_NULL;
d_stream.zfree = Z_NULL;
d_stream.opaque = (voidpf)0;
d_stream.next_in = file_buf;
d_stream.avail_in = (uInt)buf_size;
int err = inflateInit2(&d_stream, 47);
CHECK_ERR(err, "inflateInit");
// Inflate
uLong chunk_szie = 5000;
Byte* chunk = malloc(chunk_szie * sizeof(Byte));
do {
memset(chunk, 0, chunk_szie);
d_stream.next_out = chunk;
d_stream.avail_out = (uInt)chunk_szie;
err = inflate(&d_stream, Z_NO_FLUSH);
printf("inflate(): %s\n", (char *)chunk);
if (err == Z_STREAM_END) {
// inflateReset(&d_stream);
break;
}
} while (d_stream.avail_in);
err = inflateEnd(&d_stream);
CHECK_ERR(err, "inflateEnd"); |
Any update on this issue? I'm running into this problem as well. |
c60b97e that needs additionas conditions update after cycle but one more test fails after that. |
The failing tests in that PR reproduce the "too far back" error @abetusk was seeing: |
@rbuels look at this lines: Lines 279 to 281 in c60b97e
It seems, this should be removed, because Z_STREAM_END is processed inside loop and should not finalize deflate. But after removing that line, one test fails, and that's the main reason why @Kirill89 's commit was postponed. |
After discussion with @puzrin, I was able in #146 to write a couple of tests that decompress the bgzip files with pako as-is. The code for doing it can be seen at https://github.com/nodeca/pako/pull/146/files#diff-04f4959c7d84f7da8f54fbf6b0f50553R23 |
Thanks for all the work on this. I just ran into this bug, and would appreciate a new release when the pull request is merged. |
We're seeing this issue quite a bit (I wonder if bioinformaticians just really like gzipping in blocks!) Making the I put these changes up at https://github.com/onecodex/pako and I'm happy to restructure or make a PR if that's helpful (thanks for Kirill89's |
@bovee I'll be happy to accept correct PR. As far as i remember, #145 was rejected because touched original I have absolutely no idea (forgot everything) what this test does, but if it exists, it can not be "just skipped". As soon as anyone can resolve this tail, PR will be accepted. |
@bovee, By splitting the Here is what I believe to be the original paper by Heng Li on Tabix (Tabix has now been subsumed into |
For the bioinformaticians in the thread, just going to say that I ended up
coding around this issue and eventually releasing
https://github.com/GMOD/bgzf-filehandle and https://github.com/GMOD/tabix-js
for accessing BGZIP and tabix files, respectively.
…On Fri, Mar 1, 2019 at 12:37 PM Abe ***@***.***> wrote:
@bovee <https://github.com/bovee>, bgzip allows for random access to
large gzip files. In bioinformatics, there's often a need to access large
files efficiently and at random (from 100Mb to 5Gb or more, compressed,
representing a whole genome in some format, for example). Vanilla gzip
requires to decompress all previous elements before getting at some
position.
By splitting the gzip file into blocks, you can create an index which can
then be used to allow for efficient random access. The resulting bgzipd
files are a bit bigger compressing without block (i.e. just vanilla gzip)
but most of the benefits of compression are still retained while still
allowing for efficient random access to the file. There's the added benefit
that a bgzipd file should look like a regular gzip file so all the
"standard" tools should still work to decompress it.
Here is what I believe to be the original paper by Heng Li on Tabix
<https://academic.oup.com/bioinformatics/article/27/5/718/262743> (Tabix
has now been subsumed into htslib if I'm not mistaken).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#139 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAEgFS3Op1zm7Xq4vK8RQ6z9zMnEb5k4ks5vSY-JgaJpZM4URinR>
.
|
@rbuels I'm working on reading local fastq.gz files in the browser and stumbled upon this issue. Haven't been able to get pako to work so far. Is there currently a working solution for streaming bgzf files in the browser? EDIT: I need streaming because the files are large. I don't (and can't) need to store the entire file in memory, just need to stream through all the lines to gather some statistics. |
We implemented `@gmod/bgzf-filehandle`, which wraps pako. We use in
JBrowse. https://www.npmjs.com/package/@gmod/bgzf-filehandle
…On Fri, Jan 31, 2020 at 3:11 PM Anders Pitman ***@***.***> wrote:
@rbuels <https://github.com/rbuels> I'm working on reading local fastq.gz
files in the browser and stumbled upon this issue. Haven't been able to get
pako to work so far. Is there currently a working solution for streaming
bgzf files in the browser?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#139?email_source=notifications&email_token=AAASAFMPEOUBVJB36WTOYNDRASV2XA5CNFSM4FCGFHI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKQJYIQ#issuecomment-580951074>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAASAFIESY53TSQZCLRHJBLRASV2XANCNFSM4FCGFHIQ>
.
|
pako seems to be working on my files. Not sure what I'm doing differently to not trigger this bug. |
I've rewritten wrappers. hose works with our old multistream fixture, and with data generated with Z_SYNC_FLUSH. But still fails with provided bgzip file, now with UPD. Hmm... works if i vary |
In case it helps, I have a Streams API wrapper, which I modified to support the multiple streams in a file. This supports both regular gzipped files, and bgzip ones. Basically before pushing more data into the inflator, we check if it hit the end of stream, and rebuffer the remaining input. class PakoInflateTransformer {
constructor() {
this.controller = null;
this.decoder = new pako.Inflate();
let self = this;
this.decoder.onData = (chunk) => {
self.propagateChunk(chunk);
}
}
propagateChunk(chunk) {
if (!this.controller) {
throw "cannot propagate output chunks with no controller";
}
//console.log('Inflated chunk with %d bytes.', chunk.byteLength);
this.controller.enqueue(chunk);
}
start(controller) {
this.controller = controller;
}
transform(chunk, controller) {
//console.log('Pako received chunk with %d bytes.', chunk.byteLength);
this.resetIfAtEndOfStream();
this.decoder.push(chunk);
}
flush() {
//console.log('Pako flushing,');
this.resetIfAtEndOfStream();
this.decoder.push([], true);
}
resetIfAtEndOfStream() {
// The default behaviour doesn't handle multiple streams
// such as those produced by bgzip. If the decoder thinks
// it has ended, but there's available input, save the
// unused input, reset the decoder, and re-push the unused input.
//
while (this.decoder.ended && this.decoder.strm.avail_in > 0) {
let strm = this.decoder.strm;
let unused = strm.input.slice(strm.next_in);
//console.log(`renewing the decoder with ${unused.length} bytes.`);
this.decoder = new pako.Inflate();
let self = this;
this.decoder.onData = (chunk) => {
self.propagateChunk(chunk);
}
this.decoder.push(unused, Z_SYNC_FLUSH);
}
}
} |
@drtconway, could you provide a working self contained example? @puzrin, any progress on this? |
Here's a stand-alone HTML document. Select a file - it uses suffix matching to guess if it's compressed or uncompressed, and it reads the chunks. <html>
<head>
<title>Uncompress Streams</title>
<script src="https://cdn.jsdelivr.net/pako/1.0.3/pako.min.js"></script>
<script src="https://unpkg.com/web-streams-polyfill/dist/polyfill.js"></script>
<script>
// Define Z_SYNC_FLUSH since it's not exported from pako.
//
const Z_SYNC_FLUSH = 2;
class PakoInflateTransformer {
constructor() {
this.controller = null;
this.decoder = new pako.Inflate();
let self = this;
this.decoder.onData = (chunk) => {
self.propagateChunk(chunk);
}
}
propagateChunk(chunk) {
if (!this.controller) {
throw "cannot propagate output chunks with no controller";
}
//console.log('Inflated chunk with %d bytes.', chunk.byteLength);
this.controller.enqueue(chunk);
}
start(controller) {
this.controller = controller;
}
transform(chunk, controller) {
//console.log('Pako received chunk with %d bytes.', chunk.byteLength);
this.resetIfAtEndOfStream();
this.decoder.push(chunk);
}
flush() {
//console.log('Pako flushing,');
this.resetIfAtEndOfStream();
this.decoder.push([], true);
}
resetIfAtEndOfStream() {
// The default behaviour doesn't handle multiple streams
// such as those produced by bgzip. If the decoder thinks
// it has ended, but there's available input, save the
// unused input, reset the decoder, and re-push the unused input.
//
while (this.decoder.ended && this.decoder.strm.avail_in > 0) {
let strm = this.decoder.strm;
let unused = strm.input.slice(strm.next_in);
//console.log(`renewing the decoder with ${unused.length} bytes.`);
this.decoder = new pako.Inflate();
let self = this;
this.decoder.onData = (chunk) => {
self.propagateChunk(chunk);
}
this.decoder.push(unused, Z_SYNC_FLUSH);
}
}
}
function blobToReadableStream(blob) {
let reader = blob.stream().getReader();
return new ReadableStream({
start(controller) {
function push() {
reader.read().then(({done, value}) => {
if (done) {
controller.close();
return;
}
controller.enqueue(value);
push();
})
}
push();
}
});
}
function getReader(source) {
var fileStream = blobToReadableStream(source);
if (source.name.endsWith(".gz") || source.name.endsWith(".bgz")) {
fileStream = fileStream.pipeThrough(new TransformStream(new PakoInflateTransformer));
}
return fileStream.getReader();
}
var readTheFile = async function(event) {
var inp = event.target;
let reader = getReader(inp.files[0]);
let n = 0;
let s = 0;
while (true) {
const { done, value } = await reader.read();
if (done) {
break;
}
let l = value.byteLength;
n += 1;
s += l;
}
let resElem = document.getElementById('results');
let add = function(txt) {
let para = document.createElement('p');
resElem.appendChild(para);
para.appendChild(document.createTextNode(txt));
}
let m = s / n;
add(`number of chunks: ${n}`);
add(`mean size: ${m}`);
}
</script>
</head>
<body>
<div>
<input type='file' onchange='readTheFile(event)' />
</div>
<div id="results">
</div>
</body>
</html> |
Hmm. I changed the version of pako that I was using from 1.0.3 to 2.0.4 and it fails now. Is there an obvious thing that changed that would break the code? From my initial investigation it looks like it hits the condition to reset at the end of the first stream, but the recovery doesn't work correctly any more. |
@drtconway wrapper changed significantly but multistream test exist Lines 60 to 77 in 0398fad
|
Ok. I'm looking in the debugger, and I see that it's hitting the error condition "invalid distance too far back" in I've attached the gzipped file. Note because some of these files can be 100s MB to GBs in size, we really want to use the streaming API rather than compress in one chunk. SRR1301936.trio.genotype.soi.vep.post_filter.vcfanno.rare.cann.vcf.gz |
i modified the unit test above to use this data, and it fails. |
From the
Is this the problem? |
Not sure. See this (disabled) test Lines 69 to 77 in 0398fad
|
this is likely a different manifestation of the same problem but saw an issue where pako did work on a file with 1.0 but stopped working on 2.0, reproducible repo here https://github.com/cmdcolin/pako_error the origin file is not bgzip but may have some multipart stuff perhaps, moved from #250 |
Hi all, thanks for the wonderful library!
Unfortunately I think I've found a bug. Files compressed with
bgzip
(block gzip) are failing when trying to usepako
to do streaming decompression.The file pako-fail-test-data.txt.gz is an example file that is able to trigger what I believe to be an error. The file itself is 65,569 bytes big, which is just larger than what I assume to be a block size relevant to bgzip (somewhere around 65280?). Here is a small shell session that has some relevant information:
Here is some sample code that should decompress the whole file, but doesn't. My apologies for it not being elegant, I'm still learning and I kind of threw a bunch of things together to get something that I believe triggers the error:
I did not indicate an end block (that is I did not do
inflator.push(data.false)
anywhere) and there are maybe other problems with this in how the data blocks are read fromfs
but I hope you'll forgive this sloppiness in the interest of simplicity to illuminate the relevant issue.Running this does successfully decompress a portion of the file but then stops at what I believe to the first block. Here are some shell commands that might be enlightening:
Running another simple example using
browser-zlib
triggers an error outright:And when run via
node stream-example-2.js
, the error produces is:I assume this is a
pako
error asbrowserify-zlib
usespako
underneath so my apologies if this isbrowserify-zlib
error and has nothing to do withpako
.As a "control", the following code works without issue:
bgzip
is used to allow for random access to gzipped files. The resulting block compressed file is a bit bigger than using straightgzip
compression but the small compressed file size inflation is often worth it for the ability to efficiently access arbitrary positions in the uncompressed data.My specific use case is that I want to process a large text file (compressed
~115Mb
,~650Mb
uncompressed with other files being even larger). Loading the complete file, either compressed or uncompressed, is not an option either because of memory exhaustion or straight up memory restrictions in JavaScript. I only need to process the data in a streaming manner (that is, I only need to look at the data once and then am able to mostly discard it) so this is why I was looking into this option. The bioinformatics community uses this method quite a bit (bgzip
is itself part oftabix
which is part of a bioinformatics library calledhtslib
) so it would be nice ifpako
supported this use case.If there is another library I should be using to allow for stream processing of compressed data in the browser, I would welcome any suggestions.
The text was updated successfully, but these errors were encountered: