Don't shrink window log when streaming with a dictionary #2451

terrelln · 2021-01-04T21:52:40Z

1. When creating a dictionary keep the same behavior as before.
   Assume the source size is 513 bytes when adjusting parameters.
2. When calling ZSTD_getCParams() or ZSTD_adjustCParams() use
   the same logic as case 4 (not attaching a dictionary).
3. When attaching a dictionary keep the same behavior of ignoring
   the dictionary size. When streaming this will select the
   largest parameters and not adjust them down. But, the CDict
   will use the correctly sized parameters, which seems like the
   right tradeoff.
4. When not attaching a dictionary (either forced not to, or
   using a prefix dictionary) we select parameters based on the
   dictionary size + source size, and assume the source size is
   small, which is the same behavior as before. But, now we don't
   adjust the window log (and hash and chain log) down when the
   source size is unknown.

When the source size is unknown all cdicts should attach, except
when the user disables attaching, or forceWindow is used. This
means that when streaming with a CDict we end up in the good case
where we get small CDict parameters, and large source parameters.
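
The four cases can be sketched as a small window-log adjustment function. This is a simplified model written for illustration, not zstd's actual code: the enum, the function names, `SRC_SIZE_UNKNOWN`, and the lower bound `MIN_WINDOW_LOG` are all hypothetical stand-ins, and only the window log is modelled (not the hash or chain log, and not the initial parameter selection).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; these names are not zstd's. */
#define SRC_SIZE_UNKNOWN UINT64_MAX /* plays the role of ZSTD_CONTENTSIZE_UNKNOWN */
#define ASSUMED_SRC_SIZE 513u       /* source size assumed when creating a dictionary */
#define MIN_WINDOW_LOG   10u        /* illustrative lower bound on the window log */

typedef enum {
    MODE_CREATE_CDICT,   /* case 1: creating a dictionary */
    MODE_GET_CPARAMS,    /* case 2: ZSTD_getCParams() / ZSTD_adjustCParams() */
    MODE_ATTACH_DICT,    /* case 3: attaching a dictionary */
    MODE_NO_ATTACH_DICT  /* case 4: not attaching (copied or prefix dictionary) */
} param_mode;

/* Smallest n such that (1 << n) >= size. */
static unsigned ceil_log2(uint64_t size)
{
    unsigned n = 0;
    while (n < 63 && ((uint64_t)1 << n) < size) n++;
    return n;
}

/* Returns the window log to use after accounting for source and dictionary size. */
static unsigned adjust_window_log(unsigned window_log, uint64_t src_size,
                                  uint64_t dict_size, param_mode mode)
{
    unsigned needed;
    assert(window_log <= 31);

    switch (mode) {
    case MODE_CREATE_CDICT:
        /* Case 1: with a dictionary and no known source size, assume a tiny source. */
        if (dict_size != 0 && src_size == SRC_SIZE_UNKNOWN)
            src_size = ASSUMED_SRC_SIZE;
        break;
    case MODE_ATTACH_DICT:
        /* Case 3: the dictionary lives in the CDict, so ignore its size here. */
        dict_size = 0;
        break;
    case MODE_GET_CPARAMS:   /* Case 2 follows the same logic as case 4. */
    case MODE_NO_ATTACH_DICT:
        break;
    }

    /* Unknown source size: never shrink the window log. */
    if (src_size == SRC_SIZE_UNKNOWN)
        return window_log;

    /* Known (or assumed) size: the window need not exceed source + dictionary. */
    needed = ceil_log2(src_size + dict_size);
    if (needed < MIN_WINDOW_LOG) needed = MIN_WINDOW_LOG;
    return window_log < needed ? window_log : needed;
}
```

In this model, `adjust_window_log(19, SRC_SIZE_UNKNOWN, 1100, MODE_NO_ATTACH_DICT)` leaves the window log at 19, while the same call with `MODE_CREATE_CDICT` yields 11, because the source is assumed to be 513 bytes.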

I've added a test case that catches this bug. It compresses using
a dictionary, without setting the pledged src size, and with a large
source. See the changes to results.csv in the
"Don't shrink window log when streaming with a dictionary"
commit.

I've also added a test to fuzzer.c to check that ZSTD_getCParams()
and ZSTD_adjustCParams() don't shrink the window log down.

terrelln · 2021-01-04T23:19:22Z

@Cyan4973 when a user calls ZSTD_getCParams(1, ZSTD_CONTENTSIZE_UNKNOWN, 1100) do you think that we should keep the legacy behavior of shrinking the windowLog down to 2KB, or change the behavior?

So far I've left it alone. But I'm thinking we should change ZSTD_getCParams() (case 2) to match ZSTD_cpm_noAttachDict (case 4).
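
As a worked check of the 2KB figure, assuming the legacy path substitutes the 513-byte source size from the PR description and rounds the combined size up to the next power of two (the rounding rule is an assumption here, not quoted from the thread):

$$
\text{windowLog} = \lceil \log_2(513 + 1100) \rceil = \lceil \log_2 1613 \rceil = 11, \qquad 2^{11} = 2048 \text{ bytes} = 2\,\text{KB}
$$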

Cyan4973 · 2021-01-04T23:43:02Z

I agree with the consistency argument.
So it means that 2) should behave like the new 4).

terrelln · 2021-01-04T23:50:31Z

I've made that change and added a test for ZSTD_getCParams() and ZSTD_adjustCParams().

* Add a test that runs without a pledgedSrcSize and with a dictionary.
* Add github.tar data which uses the github dictionary while compressing github.tar, instead of each file individually.

Fixes facebook#2442.

1. When creating a dictionary keep the same behavior as before. Assume the source size is 513 bytes when adjusting parameters.
2. When calling ZSTD_getCParams() or ZSTD_adjustCParams() keep the same behavior as before.
3. When attaching a dictionary keep the same behavior of ignoring the dictionary size. When streaming this will select the largest parameters and not adjust them down. But, the CDict will use the correctly sized parameters, which seems like the right tradeoff.
4. When not attaching a dictionary (either forced not to, or using a prefix dictionary) we select parameters based on the dictionary size + source size, and assume the source size is small, which is the same behavior as before. But, now we don't adjust the window log (and hash and chain log) down when the source size is unknown.

When the source size is unknown all cdicts should attach, except when the user disables attaching, or `forceWindow` is used. This means that when streaming with a CDict we end up in the good case where we get small CDict parameters, and large source parameters.

TODO: Add a streaming + dictionary regression test case.

Treat ZSTD_getCParams() and ZSTD_adjustCParams() in the same way we treat streaming compression. Choose parameters based on the dictionary size + source size, and assume the source size is small if unknown. But, don't shrink the window log down in ZSTD_adjustCParams_internal().

phiresky · 2021-01-24T15:56:35Z

Thanks, with these changes it seems to be mostly ok:

input 159219 bytes, without dict 29518 bytes, streaming api with dict 24177 bytes, simple api with dict 23866 bytes.

For some reason it's still slightly larger when using the streaming api with ref_cdict as opposed to using ZSTD_compress_usingCDict with a buffer. Is that expected?

terrelln · 2021-01-24T19:10:08Z

> For some reason it's still slightly larger when using the streaming api with ref_cdict as opposed to using ZSTD_compress_usingCDict with a buffer. Is that expected?

You should expect small differences in compressed size between single-buffer and streaming compression. When zstd knows the source size (as in single-buffer mode), it optimizes the compression parameters for that size. When streaming without setting the pledged source size, zstd uses the "generic" parameters, so it will likely compress slightly worse.

facebook-github-bot added the CLA Signed label Jan 4, 2021

Cyan4973 approved these changes Jan 4, 2021

terrelln mentioned this pull request Jan 4, 2021

Compression ratio regression in dictionary + streaming API mode (src size unknown) #2442

Closed

terrelln force-pushed the adjust-dict-2 branch from d13d1cd to 6ce1fee Compare January 4, 2021 23:13

terrelln force-pushed the adjust-dict-2 branch from 58bd626 to 58476bc Compare January 4, 2021 23:50

terrelln added 3 commits January 4, 2021 15:54

[test][regression] Add no source size with dictionary test

a98a6e2

Cyan4973 approved these changes Jan 4, 2021

terrelln merged commit a077a6a into facebook:dev Jan 5, 2021

felixhandte mentioned this pull request Mar 2, 2021

Release ZStandard v1.4.9 #2515

Merged
