
Support limiting the frame size via zstd command line tool #2121

Open
pizzard opened this issue May 8, 2020 · 3 comments

pizzard commented May 8, 2020

Is your feature request related to a problem? Please describe.
We use zstd to compress log files that are later read and replayed. Because the files are pretty large (possibly larger than the replay system's available memory when replaying several of them), our writer currently flushes (forces the end of a frame) after a fixed amount of uncompressed input, using the zstd API. This partitions our files into frames of 16 MB uncompressed size. The loader can then read the file frame by frame, dropping the data from the previous frame when it is no longer needed. It can also seek to a given point in time in our ordered input by running a binary search over the frames' uncompressed sizes: find the next location to jump to, locate that frame, decompress it, and so on. At the cost of only a small loss in compression ratio, the memory consumption of random-access jumps is reduced to a fixed amount and the speed increases drastically.
There is one problem, though. Sometimes data is written in uncompressed form and compressed afterwards. This is conveniently done with the zstd command-line tool, which unfortunately compresses all my data into one huge frame, rendering my binary search useless.

Describe the solution you'd like
An option for the command-line tool, --max-frame-size=X, which limits the output frame size to X. I don't care whether the limit applies to the compressed or the uncompressed frame size, as I can adjust X accordingly.

Describe alternatives you've considered
We could write our own zstd command-line tool that does this, which we would rather avoid. I tried various tricks with the streaming API, but none of them worked.

Additional context
The zstd API allows me to force the end of a frame, which my application uses to write frames of a certain size: it counts uncompressed input and forces a frame end whenever the threshold is reached.

felixhandte (Contributor) commented

@pizzard, if the zstd CLI supported this splitting of frames, how would you find the frame boundaries when doing your binary search? Would you want each frame to be a separate file? Or would you maintain a seek table?

You may want to look at the seekable format (spec, code), although that is not a command-line tool.

Alternatively, could you use the split tool to partition your logs and then compress each chunk with zstd?
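The split-based alternative could look like this (file names and the 16 MB chunk size are illustrative; a sequence of zstd frames concatenated together is itself a valid .zst stream):

```shell
# Demo input: 40 MB of zeros, standing in for a real uncompressed log.
head -c 40000000 /dev/zero > app.log

# Partition into 16 MB pieces, compress each piece into its own frame,
# then concatenate the frames back into a single .zst file.
split -b 16M app.log app.log.part.
zstd -q --rm app.log.part.*
cat app.log.part.*.zst > app.log.zst
```

The resulting `app.log.zst` decompresses normally with `zstd -d`, but internally consists of one frame per 16 MB chunk.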


pizzard commented Aug 8, 2020

@felixhandte very valid question; let me describe the layout of our algorithm and how we do things a bit better.
When we create log files, we make sure that all frames stay below a certain size limit (16 MB in our case) and use the appropriate API to ensure the actual compressed and uncompressed frame sizes are stored in the frame headers.

When reading the file, I just mmap it into memory. Then I build a header index table by skipping through the file once and decoding all the headers with the zstd header-reading function. Since the compressed data size is present, I can simply skip from header to header and read them all in. As no decompression happens, this is very fast.

Then I decompress the file frame by frame as needed. When a jump to a random position is requested, I use the lookup table to find the right frame (via the running uncompressed size counts), decompress it, and continue from there.
So I only need memory for two decompressed frames (one active, one prefetched for extra speed).

What currently happens is that when someone uses the command-line tool, it creates a single frame containing everything. That breaks the loading scheme: the loader ends up reading the whole file at once.

My intuition was actually that the command-line tool does this because it is more space-efficient. But at least on our log files, limiting the frame size to 16 MB counter-intuitively compressed the files better than producing one big frame. This persisted across different binary file layouts and different compression levels. The difference was only a few percent, but I'll take it.

devinrsmith commented

I think adding seekable format support via the CLI would be great. Essentially, the option to set the seekable "Maximum Frame Size" parameter.
