Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't shrink window log when streaming with a dictionary #2451

Merged
merged 3 commits into from
Jan 5, 2021

Conversation

terrelln
Copy link
Contributor

@terrelln terrelln commented Jan 4, 2021

Fixes #2442.

  1. When creating a dictionary keep the same behavior as before.
    Assume the source size is 513 bytes when adjusting parameters.
  2. When calling ZSTD_getCParams() or ZSTD_adjustCParams() use
    the same logic as case 4 (not attaching a dictionary).
  3. When attaching a dictionary keep the same behavior of ignoring
    the dictionary size. When streaming this will select the
    largest parameters and not adjust them down. But, the CDict
    will use the correctly sized parameters, which seems like the
    right tradeoff.
  4. When not attaching a dictionary (either forced not to, or
    using a prefix dictionary) we select parameters based on the
    dictionary size + source size, and assume the source size is
    small, which is the same behavior as before. But, now we don't
    adjust the window log (and hash and chain log) down when the
    source size is unknown.

When the source size is unknown all cdicts should attach, except
when the user disables attaching, or forceWindow is used. This
means that when streaming with a CDict we end up in the good case
where we get small CDict parameters, and large source parameters.

I've added a test case that catches this bug. It compresses using
a dictionary, without setting the pledged src size, and with a large
source. See the changes to results.csv in the
"Don't shrink window log when streaming with a dictionary"
commit.

I've also added a test to fuzzer.c to check that ZSTD_getCParams()
and ZSTD_adjustCParams() don't shrink the window log down.

@terrelln
Copy link
Contributor Author

terrelln commented Jan 4, 2021

@Cyan4973 when a user calls ZSTD_getCParams(1, ZSTD_CONTENTSIZE_UNKNOWN, 1100) do you think that we should keep the legacy behavior of shrinking the windowLog down to 2KB, or change the behavior?

So far I've left it alone. But I'm thinking we should change ZSTD_getCParams() (case 2) to match the ZSTD_cpm_noAttachDict (case 4).

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jan 4, 2021

I agree with the consistency argument.
So it means that 2) should behave like the new 4).

@terrelln
Copy link
Contributor Author

terrelln commented Jan 4, 2021

I've made that change and added a test for ZSTD_getCParams() and ZSTD_adjustCParams().

* Add a test that runs without a pledgedSrcSize and with a dictionary.
* Add github.tar data with uses the github dictionary while compressing
  github.tar, instead of each file individually.
Fixes facebook#2442.

1. When creating a dictionary keep the same behavior as before.
   Assume the source size is 513 bytes when adjusting parameters.
2. When calling ZSTD_getCParams() or ZSTD_adjustCParams() keep
   the same behavior as before.
3. When attaching a dictionary keep the same behavior of ignoring
   the dictionary size. When streaming this will select the
   largest parameters and not adjust them down. But, the CDict
   will use the correctly sized parameters, which seems like the
   right tradeoff.
4. When not attaching a dictionary (either forced not to, or
   using a prefix dictionary) we select parameters based on the
   dictionary size + source size, and assume the source size is
   small, which is the same behavior as before. But, now we don't
   adjust the window log (and hash and chain log) down when the
   source size is unknown.

When the source size is unknown all cdicts should attach, except
when the user disables attaching, or `forceWindow` is used. This
means that when streaming with a CDict we end up in the good case
where we get small CDict parameters, and large source parameters.

TODO: Add a streaming + dictionary regression test case.
Treat ZSTD_getCParams() and ZSTD_adjustCParams() in the same way
we treat streaming compression. Choose parameters based on the
dictionary size + source size, and assume the source size is small
if unkown. But, don't shrink the window log down in
ZSTD_adjustCParams_internal().
@terrelln terrelln merged commit a077a6a into facebook:dev Jan 5, 2021
@phiresky
Copy link

Thanks, with these changes it seems to be mostly ok:

input 159219 bytes, without dict 29518 bytes, streaming api with dict 24177 bytes, simple api with dict 23866 bytes.

For some reason it's still slightly larger when using the streaming api with ref_cdict as opposed to using ZSTD_compress_usingCDict with a buffer. Is that expected?

@terrelln
Copy link
Contributor Author

For some reason it's still slightly larger when using the streaming api with ref_cdict as opposed to using ZSTD_compress_usingCDict with a buffer. Is that expected?

You should expect small differences in compressed size between the single-buffer and streaming compression. When zstd knows the source size (like in single-buffer mode), it will optimize the compression parameters for that size. When using streaming, without setting the pledged source size, zstd uses the "generic" parameters, so it will likely compress slightly worse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compression ratio regression in dictionary + streaming API mode (src size unknown)
4 participants