Feature request: better compression of 16,24,32,64 bit data types #1492

neurolabusc · 2019-01-06T01:35:51Z

zstd is extremely impressive, both in its speed and compression ratio. However, it is surprising that it does not attempt to detect whether simple byte-swizzling can improve compression. Tools like blosc exploit this. In the example below I have data that stores 32-bit uints and 32-bit floats in a 3:1 ratio. Using zstd-1.3.8, I compare compression of the raw data with compression of the same data after its bytes are swizzled, so that the order goes from 12341234... to 1111...2222...3333...4444... The same data gets dramatically better compression if the bytes are re-ordered.

$ zstd RAW.mz3
RAW.mz3         : 56.80%   (2578036 => 1464255 bytes, RAW.mz3.zst)   
$ zstd Swizzle.mz3
Swizzle.mz3     : 38.53%   (2578036 => 993357 bytes, Swizzle.mz3.zst)
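
For readers unfamiliar with the reordering, here is a minimal sketch of the swizzle described above and its inverse. The function names `shuffle` and `unshuffle` are illustrative only; they are not part of zstd or of the tool used to produce `Swizzle.mz3`, and the sketch assumes the buffer length is a multiple of the element width.

```python
def shuffle(data: bytes, width: int) -> bytes:
    """Reorder 1234 1234 ... into 11.. 22.. 33.. 44.. for width-byte elements."""
    if width < 1 or len(data) % width:
        raise ValueError("length must be a multiple of the element width")
    # data[i::width] collects byte i of every element into one contiguous plane.
    return b"".join(data[i::width] for i in range(width))


def unshuffle(data: bytes, width: int) -> bytes:
    """Inverse of shuffle: interleave the byte planes back into elements."""
    if width < 1 or len(data) % width:
        raise ValueError("length must be a multiple of the element width")
    count = len(data) // width  # number of elements, and length of each plane
    out = bytearray(len(data))
    for i in range(width):
        out[i::width] = data[i * count:(i + 1) * count]
    return bytes(out)
```

The swizzled buffer has the same length as the input, so any gain comes entirely from the compressor seeing longer runs of similar bytes.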

I realize there are other tools like Blosc that do this, but having an option for zstandard to detect whether each block would benefit from swizzling would be great for data scientists, as zstd is quickly becoming the new standard compression algorithm. Naively, it seems easy to detect if swizzling will improve a block: as the compressor compresses the block, it can calculate the variance using Welford's online algorithm. This would compare the accuracy of a 1-back predictor (default compression) with 2-back (uint16), 3-back (RGB), 4-back (uint32, single), and 8-back (double) predictors. If one of these has a dramatically lower variance, it suggests that the data should be swizzled and then compressed. This would carry a penalty (e.g. half the speed for compression), and the swizzling would need to be recorded in the header. However, as an option it would be great to have this built into the standard. Most scientists (and I think most developers) want to save data in the simplest method possible, so they do not consider byte-swizzling data. Having an option for zstd to automatically take advantage of these patterns would be tremendous. In many cases it might even accelerate decompression/transmission/reading, as the size of the compressed data stream is reduced.
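
The detection idea above could be sketched roughly as follows. This is one reading of the proposal, not an existing zstd feature: it assumes the "k-back predictor" predicts each byte from the byte k positions earlier, and it uses Welford's online algorithm to accumulate the variance of the prediction error in a single pass. The names `lag_variance` and `best_lag` are hypothetical, and a real detector would also need a threshold for "dramatically lower", which is left out here.

```python
def lag_variance(data: bytes, lag: int) -> float:
    """Variance of data[i] - data[i - lag], accumulated with Welford's method."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for i in range(lag, len(data)):
        residual = data[i] - data[i - lag]
        count += 1
        delta = residual - mean
        mean += delta / count
        m2 += delta * (residual - mean)
    return m2 / count if count else 0.0


def best_lag(data: bytes, lags=(1, 2, 3, 4, 8)) -> int:
    """Return the candidate lag whose predictor has the lowest error variance."""
    return min(lags, key=lambda lag: lag_variance(data, lag))
```

A block whose best lag is greater than 1 would be a candidate for swizzling with that element width. Running every candidate lag over every block is where the speed penalty mentioned above comes from.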

terrelln · 2019-01-08T22:42:34Z

We are interested in exploring this topic.

We can't bake support into the zstd compressed format, since it is already fixed, but we could provide the tools inside of the library/CLI to work with a "wrapper" format that applies transformations before sending data to zstd, or something along those lines.
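
To make the "wrapper" idea concrete, here is a minimal sketch of what such a container might look like. Everything about the layout is an assumption for illustration: the `SWZ0` magic, the 13-byte header, and the function names `wrap` and `unwrap` are hypothetical and do not correspond to any existing or planned zstd format. Python's standard-library `zlib` stands in for zstd so the example runs without extra dependencies.

```python
import struct
import zlib

MAGIC = b"SWZ0"  # hypothetical magic bytes, not a real format
HEADER = struct.Struct("<4sBQ")  # magic, element width, original length


def _shuffle(data: bytes, width: int) -> bytes:
    return b"".join(data[i::width] for i in range(width))


def _unshuffle(data: bytes, width: int) -> bytes:
    count = len(data) // width
    out = bytearray(len(data))
    for i in range(width):
        out[i::width] = data[i * count:(i + 1) * count]
    return bytes(out)


def wrap(data: bytes, width: int) -> bytes:
    """Swizzle by element width (1 means no transform), compress, add header."""
    if not 1 <= width <= 255 or len(data) % width:
        raise ValueError("width must be 1..255 and divide the data length")
    body = _shuffle(data, width) if width > 1 else data
    return HEADER.pack(MAGIC, width, len(data)) + zlib.compress(body)


def unwrap(blob: bytes) -> bytes:
    """Read the header, decompress, and undo the transform it records."""
    magic, width, size = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a wrapped stream")
    body = zlib.decompress(blob[HEADER.size:])
    if len(body) != size:
        raise ValueError("length mismatch")
    return _unshuffle(body, width) if width > 1 else body
```

Because the transform is recorded outside the compressed payload, the payload itself stays an ordinary compressed stream and the underlying format does not need to change.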

There are a lot of transformations we could apply to data, including bit/byte-swizzling. Multiple transformations could also be applied in succession, for instance delta-coding followed by byte-swizzling. This is a pretty large search space. We want to have a good answer to how we select which transformations we try at runtime, since every transformation we evaluate costs compression speed, which we care dearly about.
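
As a rough illustration of the search-space concern, the sketch below chains delta-coding with byte-swizzling for 32-bit unsigned values and picks the smallest result by brute force. It assumes little-endian uint32 input, uses `zlib` from the Python standard library as a stand-in for zstd, and the names `delta_encode`, `delta_decode`, `candidates`, and `pick_best` are hypothetical. Brute force is exactly the approach that costs compression speed: each extra candidate means another full compression pass.

```python
import struct
import zlib

MASK32 = 0xFFFFFFFF


def delta_encode(values):
    """Replace each value by its difference from the previous one (mod 2**32)."""
    out, prev = [], 0
    for v in values:
        out.append((v - prev) & MASK32)
        prev = v
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: running sum modulo 2**32."""
    out, acc = [], 0
    for d in deltas:
        acc = (acc + d) & MASK32
        out.append(acc)
    return out


def shuffle(data: bytes, width: int) -> bytes:
    return b"".join(data[i::width] for i in range(width))


def candidates(values):
    """Build each transformed buffer; every entry here costs a compression pass."""
    pack = struct.Struct(f"<{len(values)}I").pack
    raw = pack(*values)
    delta = pack(*delta_encode(values))
    return {
        "raw": raw,
        "shuffle": shuffle(raw, 4),
        "delta": delta,
        "delta+shuffle": shuffle(delta, 4),
    }


def pick_best(values, level=6):
    """Compress every candidate and return (best name, sizes by name)."""
    sizes = {name: len(zlib.compress(buf, level))
             for name, buf in candidates(values).items()}
    return min(sizes, key=sizes.get), sizes
```

With only two transformations there are already four candidates; adding bit-swizzling or several element widths multiplies that, which is why a cheap selection heuristic matters.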

ealgase · 2019-01-10T07:38:41Z

I doubt .bzstd is taken.

neurolabusc · 2019-01-10T13:04:25Z

Great. I would encourage the developers of these filters to include DICOM medical images in their test datasets. While the DICOM image format does include image compression transfer syntaxes, valid DICOM tools are not required to support the compressed formats. Therefore, the vast majority of DICOM images are stored as raw data. For CT scans this is almost always 16-bit integers with the range -1024..1024, and with MRI the data is typically stored as 16-bit integers with a range of 0..4096 (though 16-bit ADCs are starting to creep in). In both cases, neighboring pixels tend to be more highly correlated in their most significant bits. Sample DICOM images are widely available. As one example, this page includes samples from all the major vendors. A popular lossless file format that handles these well would dramatically improve the transmission of these images. Just as a basic example, the medical universities I work with often transfer these images using Box, which downloads them in the popular deflate-encoded zip format. A filtered zstd would be faster to compress, faster to transmit and faster to decompress.

terrelln · 2019-01-11T03:16:58Z

Thanks for the pointer to a sample set @neurolabusc! We'll definitely consider it when thinking about this problem.

Cyan4973 · 2019-10-23T17:42:34Z

Providing filters to deal with numerical data types, especially fixed ones, is a good idea.
But it's also an external topic, adding one layer of logic on top of zstd.
This could be dealt with, but in a separate project, which would use zstd as a dependency, and provide additional control for numerical types. More importantly, it would have to support its own format.

Closing.

terrelln added the feature request label Jan 8, 2019

Cyan4973 closed this as completed Oct 23, 2019

neurolabusc mentioned this issue Feb 11, 2022

Add support to NeuroJSON (.jnii and .bnii) files (to development branch) rordenlab/dcm2niix#579

Closed

This was referenced Mar 30, 2022

add lzma compressed bmsh file and benchmark neurolabusc/MeshFormatsJS#3

Merged

Reference datasets tee-ar-ex/trx-python#20

Closed
