
Add memory monotonicity test over srcSize #2538

Merged

senhuang42 merged 2 commits into facebook:dev from monotonicity_test on Mar 22, 2021

Conversation

@senhuang42 (Contributor) commented Mar 16, 2021

@terrelln and I were discussing the monotonicity of memory usage as source size increases, and as compression level increases. This PR adds a test for the former case: memory usage should monotonically increase with srcSize.
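A minimal sketch of such a check (not the PR's actual test), using zstd's static-linking-only estimate API:

#define ZSTD_STATIC_LINKING_ONLY /* ZSTD_getCParams(), ZSTD_estimateCCtxSize_usingCParams() */
#include <zstd.h>
#include <assert.h>

/* For a fixed level, the estimated workspace should never shrink as srcSize grows. */
static void checkMemMonotonicityOverSrcSize(int level)
{
    size_t prevEst = 0;
    unsigned long long srcSize;
    for (srcSize = 1024; srcSize <= (1ULL << 19); srcSize += 1024) {
        size_t const est =
            ZSTD_estimateCCtxSize_usingCParams(ZSTD_getCParams(level, srcSize, 0));
        assert(est >= prevEst);
        prevEst = est;
    }
}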

It appears that we don't currently. Between the <= 16 KB and <= 128 KB parameter sets for level 1, a source file that is > 16 KB and <= 128 KB actually uses less memory than a source file <= 16 KB. The parameter tables in zstd_compress.c confirm it: wLog=14, cLog=14, hLog=15 -> wLog=17, cLog=12, hLog=13 is a decrease in memory usage.

Is this something we would want to fix? Do we also want to maintain memory usage monotonicity as compression level increases?

@terrelln (Contributor)

Is that the only instance of non-monotonicity for source size?

@terrelln (Contributor) commented Mar 16, 2021

I think we need monotonicity in source size, at the very least when reporting ZSTD_estimateCCtxSize() et al. And for simplicity, it seems like we should guarantee it in the parameters as well.

Right now, if you allocate a workspace using ZSTD_estimateCCtxSize(1), you could fail to compress a source of <= 16 KB with ZSTD_compressCCtx() (unless I'm missing something).

The only reason Btrfs (which pre-allocates workspaces) isn't bitten by this is that it uses streaming: when the smaller parameters are selected, the window is also shrunk enough to make up for the larger tables.

@senhuang42 (Contributor, Author) commented Mar 16, 2021

Memory usage also decreases for level 4 as we cross the 128 KB parameter boundary: 1559264 bytes -> 1297120 bytes, as reported by ZSTD_estimateCCtxSize_usingCParams(). These appear to be the only two instances.

@senhuang42 (Contributor, Author) commented Mar 16, 2021

Right now if you use ZSTD_estimateCCtxSize(1), you could not compress a source of <= 16 KB with ZSTD_compressCCtx() (unless I'm missing something).

I think this actually might be fine? ZSTD_estimateCCtxSize() assumes ZSTD_CONTENTSIZE_UNKNOWN and therefore should select the > 256 KB parameter set. But for the specific case tested in the unit test (generating cparams for level 1 with a > 16 KB, <= 128 KB srcSize, creating a static CCtx from a size estimate for those, then trying to compress <= 16 KB), things could definitely go badly.

@terrelln (Contributor)

I think this actually might be fine? ZSTD_estimateCCtxSize() assumes ZSTD_CONTENTSIZE_UNKNOWN and therefore should select the > 256K parameter set. But definitely for the specific case tested in the unit test, generating CParams, then creating a static CCtx based on a size estimate for that, could go badly.

{ /* "default" - for any srcSize > 256 KB */
/* W, C, H, S, L, TL, strat */
{ 19, 12, 13, 1, 6, 1, ZSTD_fast }, /* base for negative levels */
{ 19, 13, 14, 1, 7, 0, ZSTD_fast }, /* level 1 */

{ /* for srcSize <= 16 KB */
/* W, C, H, S, L, T, strat */
{ 14, 12, 13, 1, 5, 1, ZSTD_fast }, /* base for negative levels */
{ 14, 14, 15, 1, 5, 0, ZSTD_fast }, /* level 1 */

When using ZSTD_estimateCCtxSize(1) the window size isn't counted (it is with ZSTD_estimateCStreamSize()). So the <= 16 KB params use more memory than the > 256 KB params for single-pass compression.
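For reference, counting just the hash tables implied by the excerpts above (ZSTD_fast allocates no chain table, and entries are 4-byte U32s):

/* <= 16 KB params: hashLog=15 -> (1 << 15) * 4 = 128 KB
 * default params:  hashLog=14 -> (1 << 14) * 4 =  64 KB */

As discussed further down the thread, the smaller windowLog also shrinks tokenSpace, which changes the overall comparison.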

@terrelln (Contributor) commented Mar 17, 2021

I think monotonicity in the table wrt source size makes sense. Here is the scenario I'm imagining:

ZSTD_compressionParameters const cparams = ZSTD_getCParams(1, 0, 0); /* srcSize unknown */
size_t const size = ZSTD_estimateCCtxSize_usingCParams(cparams);
void* const wksp = malloc(size);

...

ZSTD_parameters const params = ZSTD_getParams(1, srcSize, 0);
ZSTD_CCtx* const cctx = ZSTD_initStaticCCtx(wksp, size);
ZSTD_compress_advanced(cctx, dst, dstCapacity, src, srcSize, NULL, 0, params);

This shows up in the Linux kernel in crypto/zstd.c. It can't "tune" the parameters for the source size because of this limitation.

https://github.com/torvalds/linux/blob/1df27313f50a57497c1faeb6a6ae4ca939c85a7d/crypto/zstd.c#L35-L36
https://github.com/torvalds/linux/blob/1df27313f50a57497c1faeb6a6ae4ca939c85a7d/crypto/zstd.c#L155

You might say: just use a compression level instead. Currently the kernel is limited to using cparams because of its API (though we could try to update that after the version update). And I would intuitively expect that, for all level and srcSize, memory(ZSTD_getCParams(level, 0, 0)) >= memory(ZSTD_getCParams(level, srcSize, 0)).
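That expectation could be written down as a check along these lines (a sketch, not part of this PR; the level and size bounds are arbitrary):

#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <assert.h>

/* The unknown-srcSize estimate should upper-bound the estimate for any
 * concrete srcSize at the same level. */
static void checkUnknownSizeUpperBounds(void)
{
    int level;
    for (level = 1; level <= ZSTD_maxCLevel(); ++level) {
        size_t const unknownEst =
            ZSTD_estimateCCtxSize_usingCParams(ZSTD_getCParams(level, 0, 0));
        unsigned long long srcSize;
        for (srcSize = 1; srcSize <= (1ULL << 20); srcSize *= 2) {
            size_t const knownEst =
                ZSTD_estimateCCtxSize_usingCParams(ZSTD_getCParams(level, srcSize, 0));
            assert(unknownEst >= knownEst);
        }
    }
}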

@terrelln (Contributor)

So the reason level 1 with a 16 KB source does better with a larger hash table seems to be branch misses. With 32K entries in the hash table, collisions are rare, so a lookup generally either yields a match or an index < lowLimit.

With 8K entries compression has about 5.5% missed branches. With 32K entries it has about 4.0% missed branches. With 64K entries it has 3.6% missed branches and is slightly faster than hashLog=15.

@terrelln (Contributor)

@senhuang42 do you have the patch for estimateCCtxSize() ready?

@senhuang42 (Contributor, Author) commented Mar 19, 2021

@senhuang42 do you have the patch for estimateCCtxSize() ready?

Thanks for the reminder. I was looking at this again and wasn't able to repro any issue with the existing estimateCCtxSize(1). Looking at the code, doesn't it take windowLog into account here, or am I missing something?

size_t const windowSize = MAX(1, (size_t)MIN(((U64)1 << cParams->windowLog), pledgedSrcSize));
size_t const blockSize = MIN(ZSTD_BLOCKSIZE_MAX, windowSize);
U32 const divider = (cParams->minMatch==3) ? 3 : 4;
size_t const maxNbSeq = blockSize / divider;
size_t const tokenSpace = ZSTD_cwksp_alloc_size(WILDCOPY_OVERLENGTH + blockSize)
+ ZSTD_cwksp_alloc_size(maxNbSeq * sizeof(seqDef))
+ 3 * ZSTD_cwksp_alloc_size(maxNbSeq * sizeof(BYTE));

If windowLog < 17, we effectively reduce tokenSpace, so in this case the 16 KB params would use less memory than the > 256 KB params (though the matchState would still be larger), since the increase in matchState size from the larger hashLog and chainLog is offset by the smaller tokenSpace that a small enough windowLog produces.

Currently, adjusting ZSTD_estimateCCtxSize_internal() so it assumes a 16 KB srcSize, I get 211680 bytes, versus 576224 bytes for ZSTD_CONTENTSIZE_UNKNOWN. So it seems my patch doesn't actually change anything currently, but it makes the size estimation a bit safer in case the default params ever get adjusted to a point where this becomes an issue.
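Rough arithmetic from the snippet above is consistent with those numbers (assuming WILDCOPY_OVERLENGTH = 32, sizeof(seqDef) = 8, divider = 4, and no chain table for ZSTD_fast; alignment padding ignored):

/* <= 16 KB params (wLog=14, hLog=15), 16 KB pledged:
 *   blockSize  = 16 KB
 *   tokenSpace = (32 + 16384) + 4096*8 + 3*4096    =  61472
 *   hash table = (1 << 15) * 4                     = 131072
 *                                            total = 192544
 * default params (wLog=19, hLog=14), srcSize unknown:
 *   blockSize  = 128 KB
 *   tokenSpace = (32 + 131072) + 32768*8 + 3*32768 = 491552
 *   hash table = (1 << 14) * 4                     =  65536
 *                                            total = 557088
 * Both reported figures (211680 and 576224) exceed these sums by the same
 * ~19 KB of fixed overhead (entropy workspace, etc.). */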

senhuang42 force-pushed the monotonicity_test branch 3 times, most recently from 5c2de40 to dff4a0e on March 21, 2021
@terrelln (Contributor)

Didn't 128KB level 4 use more memory than 256KB level 4? In which case the looping is necessary because tokenSpace would be the same.

@senhuang42 (Contributor, Author)

Didn't 128KB level 4 use more memory than 256KB level 4? In which case the looping is necessary because tokenSpace would be the same.

128K L4 used more than 256K L4, but not more than >256K L4, so estimateCCtxSize(4) would still be fine in that case, I think.

But I think the looping would still be nice to have (row hash might introduce oddities?)

@senhuang42 merged commit c48889f into facebook:dev on Mar 22, 2021