Feature request: better compression of 16,24,32,64 bit data types #1492

Closed
neurolabusc opened this issue Jan 6, 2019 · 5 comments

Comments

@neurolabusc

zstd is extremely impressive, both in its speed and compression ratio. However, it is surprising that it does not attempt to detect whether simple byte-swizzling can improve compression. Tools like blosc exploit this. In the example below, my sample data stores 32-bit uints and 32-bit floats in a 3:1 ratio. Using zstd-1.3.8, I compare the raw data with the same data after its bytes are swizzled so that the order goes from 12341234... to 1111...2222...3333...4444.... The same data gets dramatically better compression if the bytes are re-ordered.

$ zstd RAW.mz3
RAW.mz3         : 56.80%   (2578036 => 1464255 bytes, RAW.mz3.zst)   
$ zstd Swizzle.mz3
Swizzle.mz3     : 38.53%   (2578036 => 993357 bytes, Swizzle.mz3.zst) 
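
For readers unfamiliar with the transform, here is a minimal Python sketch of the byte-swizzle (often called "shuffle") described above, together with its inverse. The function names `shuffle` and `unshuffle` are illustrative only and are not part of zstd or blosc; leaving any trailing bytes that do not fill a whole element untouched is an assumption made to keep the round trip lossless.

```python
def shuffle(data: bytes, width: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on.

    `width` is the element size in bytes (4 for uint32/float32).
    Trailing bytes that do not fill a whole element are appended unchanged
    (an assumption of this sketch, not a rule from any real format).
    """
    n = len(data) // width
    body, tail = data[: n * width], data[n * width:]
    return b"".join(body[i::width] for i in range(width)) + tail


def unshuffle(data: bytes, width: int) -> bytes:
    """Inverse of shuffle(): interleave the byte planes back into elements."""
    n = len(data) // width
    out = bytearray(n * width)
    for i in range(width):
        # Plane i holds byte i of each of the n elements.
        out[i::width] = data[i * n:(i + 1) * n]
    return bytes(out) + data[n * width:]


if __name__ == "__main__":
    swizzled = shuffle(b"12341234", 4)
    print(swizzled)                # the 1234 1234 pattern regrouped by byte position
    print(unshuffle(swizzled, 4))  # original order restored
```

The output of `shuffle` could then be written to a file and passed to the `zstd` command line tool as in the comparison above; the reader of that file has to know the element width in order to call `unshuffle`.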

I realize there are other tools like Blosc that do this, but an option to have zstandard detect whether each block would benefit from swizzling would be great for data scientists, as zstd is quickly becoming the new standard compression algorithm. Naively, it seems easy to detect if swizzling will improve a block: as the compressor compresses the block, it can calculate the variance of the prediction residuals using Welford's online algorithm. This would compare the accuracy of a 1-back predictor (default compression) with 2-back (uint16), 3-back (RGB), 4-back (uint32, single), and 8-back (double) predictors. If one of these has a dramatically lower variance, it suggests that the data should be swizzled before being compressed. This would carry a penalty (e.g. half the speed for compression), and the swizzling would need to be recorded in the header. However, as an option, it would be great to have this built into the standard. Most scientists (and I think most developers) want to save data in the simplest way possible, so they do not consider byte-swizzling data. Having an option for zstd to automatically take advantage of these patterns would be tremendous. In many cases it might even accelerate decompression/transmission/reading, as the size of the compressed data stream is reduced.
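
To make the heuristic above concrete, here is a rough Python sketch, not anything zstd actually does. It treats "k-back prediction" as predicting each byte from the byte k positions earlier (an assumption about what the comment intends), accumulates the variance of the residuals with Welford's online algorithm, and reports the stride with the lowest variance. The names `welford_variance`, `stride_variances`, and `best_stride` are hypothetical.

```python
def welford_variance(samples) -> float:
    """Population variance in a single pass (Welford's online algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in samples:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / count if count else 0.0


def stride_variances(data: bytes, strides=(1, 2, 3, 4, 8)) -> dict:
    """Variance of the residual data[i] - data[i - k] for each stride k."""
    result = {}
    for k in strides:
        if len(data) > k:  # skip strides with no residuals to measure
            residuals = (data[i] - data[i - k] for i in range(k, len(data)))
            result[k] = welford_variance(residuals)
    return result


def best_stride(data: bytes, strides=(1, 2, 3, 4, 8)) -> int:
    """Stride with the lowest residual variance; 1 means 'do not swizzle'."""
    variances = stride_variances(data, strides)
    return min(variances, key=variances.get) if variances else 1


if __name__ == "__main__":
    # Synthetic little-endian uint32 values that change slowly.
    block = b"".join((70000 + (i % 50)).to_bytes(4, "little") for i in range(1000))
    print(stride_variances(block))
    print("suggested element width:", best_stride(block))
```

A low variance at some stride is only a hint that swizzling may help; a real implementation would still need a threshold for "dramatically lower" and would pay for the extra pass over each block, which is the speed penalty mentioned above.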

@terrelln
Contributor

terrelln commented Jan 8, 2019

We are interested in exploring this topic.

We can't bake support into the zstd compressed format, since it is already fixed, but we could provide the tools inside of the library/CLI to work with a "wrapper" format that applies transformations before sending data to zstd, or something along those lines.

There are a lot of transformations we could apply to data, including bit/byte-swizzling. Multiple transformations could also be applied in succession, for instance delta-coding followed by byte-swizzling. This is a pretty large search space. We want to have a good answer to how we select which transformations we try at runtime, since every transformation we evaluate costs compression speed, which we care dearly about.

@ealgase

ealgase commented Jan 10, 2019

I doubt .bzstd is taken.

@neurolabusc
Author

Great. I would encourage the developers of these filters to include DICOM medical images in their test datasets. While the DICOM image format does include image compression transfer syntaxes, valid DICOM tools are not required to support the compressed formats. Therefore, the vast majority of DICOM images are stored as raw data. For CT scans this is almost always 16-bit integers with the range -1024..1024, and with MRI the data is typically stored as 16-bit integers with a range of 0..4096 (though 16-bit ADC is starting to creep in). In both cases, neighboring pixels tend to be more highly correlated in the most significant bits. Sample DICOM images are widely available. As one example, this page includes samples from all the major vendors. A popular lossless file format that handles these well would dramatically improve the transmission of these images. Just as a basic example, the medical universities I work with often transfer these images using Box, which downloads them in the popular deflate-encoded zip format. A filtered zstd would be faster to compress, faster to transmit and faster to decompress.

@terrelln
Contributor

Thanks for the pointer to a sample set @neurolabusc! We'll definitely consider it when thinking about this problem.

@Cyan4973
Contributor

Providing filters to deal with numerical data types, especially fixed ones, is a good idea.
But it's also an external topic, adding one layer of logic on top of zstd.
This could be dealt with, but in a separate project, which would use zstd as a dependency, and provide additional control for numerical types. More importantly, it would have to support its own format.

Closing.
