
Support limiting the frame size via zstd command line tool #2121

Open
pizzard opened this issue May 8, 2020 · 3 comments

pizzard commented May 8, 2020

Is your feature request related to a problem? Please describe.
We use zstd to compress log files that are later read and replayed. Because the files are pretty large (possibly larger than the replay system's available memory when replaying several of them), our writer currently flushes (forces the end of a frame) after a fixed amount of uncompressed input, using the zstd API. This partitions our files into frames of 16 MB uncompressed size. The loader can then read the file frame by frame, dropping the data from the previous frame when it is no longer needed. It can also seek to a given point in time in our ordered input by running a binary search over the frames' uncompressed sizes: find the next location to jump to, locate that frame, decompress it, and so on. At the cost of only a small loss in compression ratio, the memory consumption of random-access jumps is reduced to a fixed amount and the speed increases drastically.
There is one problem, though. Sometimes data is written in uncompressed form and compressed afterwards. This is conveniently done with the zstd command-line tool, which unfortunately compresses all my data into one huge frame, rendering my binary search useless.

Describe the solution you'd like
An option for the command-line tool, --max-frame-size=X, which limits the output frame size to X. I don't care whether the limit applies to the compressed or the uncompressed frame size, as I can adjust X accordingly.

Describe alternatives you've considered
We could write our own zstd command-line tool that does this, which we would rather avoid. I tried various tricks with the streaming API, but none of them worked.

Additional context
The zstd API allows me to force the end of a frame, which my application uses to write frames of a certain size: it counts uncompressed input and forces a frame end whenever the threshold is reached.

felixhandte (Contributor) commented

@pizzard, if the zstd CLI supported this splitting of frames, how would you find the frame boundaries when doing your binary search? Would you want each frame to be a separate file? Or would you maintain a seek table?

You may want to look at the seekable format (spec, code), although that is not a command-line tool.

Alternatively, could you use the split tool to partition your logs and then compress each chunk with zstd?
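The split-based alternative could look like this (file names and the 16 MB chunk size are illustrative; a sequence of zstd frames concatenated together is itself a valid .zst stream):

```shell
# Demo input: 40 MB of zeros, standing in for a real uncompressed log.
head -c 40000000 /dev/zero > app.log

# Partition into 16 MB pieces, compress each piece into its own frame,
# then concatenate the frames back into a single .zst file.
split -b 16M app.log app.log.part.
zstd -q --rm app.log.part.*
cat app.log.part.*.zst > app.log.zst
```

The resulting `app.log.zst` decompresses normally with `zstd -d`, but internally consists of one frame per 16 MB chunk.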


pizzard commented Aug 8, 2020

@felixhandte very valid question; let me describe the layout of our algorithm and how we do things a bit better.
When we create log files, we make sure that all frames stay below a certain size limit (16 MB in our case) and use the appropriate API to ensure the actual compressed and uncompressed frame sizes are stored in the frame headers.

When reading the file, I just mmap it into memory. Then I build a header index table by skipping through the file once and decoding all the headers with the zstd header-reading function. Since the compressed data size is present, I can simply skip from header to header and read them all in. As no decompression happens, this is very fast.

Then I decompress the file frame by frame as needed. When a jump to a random position is requested, I use the lookup table to find the right frame (via the running uncompressed size counts), decompress it, and continue from there.
So I only need memory for two decompressed frames (one active, one prefetched for extra speed).

What currently happens is that when someone uses the command-line tool, it creates a single frame containing everything. That breaks the loading scheme: the loader ends up reading the whole file at once.

My intuition was actually that the command-line tool does this because it is more space-efficient. But at least on our log files, limiting the frame size to 16 MB counter-intuitively compressed the files better than producing one big frame. This persisted across different binary file layouts and different compression levels. The difference was only a few percent, but I'll take it.

devinrsmith commented

I think adding seekable format support via the CLI would be great. Essentially, the option to set the seekable "Maximum Frame Size" parameter.
