-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding shrinking flag for cover and fastcover #1656
Conversation
I haven't reviewed yet, but can you please add a test case in |
Also, lets add a test case in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looking good, just some nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nits, but 2 changes:
- I think there is a bug when no smaller dictionary is found.
- Please switch to
--train-cover=shrink=1
instead of--shrink-dict
. See howsplit
is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update programs/zstd.1.md
, which is the man page for zstd, to say what shrink
does. Search for --train-cover
and --train-fastcover
to see where to update.
lib/dictBuilder/cover.c
Outdated
|
||
BYTE *const largestDictbuffer = (BYTE * const)malloc(dictContentSize); | ||
BYTE *const candidateDictBuffer = (BYTE * const)malloc(dictContentSize); | ||
double regressionTolerance = params.shrinkDict + 1.00; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked taking the parameter as an unsigned
, but we should distinguish:
- Auto shrinking enabled vs disabled. If you need fast dictionary training, you don't want auto shrinking.
It can be disabled by default. - The regression we allow. A regression of 0 could make sense if you want to take a smaller dictionary only if it is better.
I think we could take two flags shrinkDict
and shrinkDictMaxRegression
, both unsigned
. It was a mistake to take split
as a double, so I don't want to repeat it here.
On the library side, we can default to disabled, and the regression defaults to 0.
On the CLI, we can default to disabled, but we could say if someone sets --train-fastcover=shrink
we enable with default to a regression of 1, but --train-fastcover=shrink=0
or --train-fastcover=shrink=2
should both still work.
lib/dictBuilder/cover.c
Outdated
dict, dictBufferCapacity); | ||
dictBufferCapacity = selection.dictSize; | ||
totalCompressedSize = selection.totalCompressedSize; | ||
memcpy(dict, selection.dictContent, dictBufferCapacity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need dict
. If we move the declaration of selection
to the top scope, we can refactor COVER_best_finish()
to take a COVER_dictSelection
instead of the three parameters (dict, totalCompressedSize, dictBufferCapacity).
COVER_dictSelection_t selection = COVER_dictSelectionError(ERROR(GENERIC));
/* ... */
{
const size_t tail = COVER_buildDictionary(/* ... */);
selection = COVER_selectDict(/* ... */);
}
_cleanup:
COVER_best_finish(data->best, parameters, selection); /* Copies the dict out of selection */
COVER_dictSelectionFree(selection);
``
programs/zstdcli.c
Outdated
@@ -311,10 +311,15 @@ static unsigned parseCoverParameters(const char* stringPtr, ZDICT_cover_params_t | |||
params->splitPoint = (double)splitPercentage / 100.0; | |||
if (stringPtr[0]==',') { stringPtr++; continue; } else break; | |||
} | |||
if (longCommandWArg(&stringPtr, "shrink=")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets allow shrink
to set a default regression amount of 1
.
if (longCommandWArg(&stringPtr, "shrink")) {
unsigned regression = kDefaultRegression;
if (stringPtr[0] == '=') {
stringPtr++;
regression = readU32FromChar(&stringPtr);
}
/* ... */
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the CI is broken.
lib/dictBuilder/zdict.h
Outdated
@@ -94,7 +94,8 @@ typedef struct { | |||
unsigned steps; /* Number of steps : Only used for optimization : 0 means default (40) : Higher means more parameters checked */ | |||
unsigned nbThreads; /* Number of threads : constraint: 0 < nbThreads : 1 means single-threaded : Only used for optimization : Ignored if ZSTD_MULTITHREAD is not defined */ | |||
double splitPoint; /* Percentage of samples used for training: Only used for optimization : the first nbSamples * splitPoint samples will be used to training, the last nbSamples * (1 - splitPoint) samples will be used for testing, 0 means default (1.0), 1.0 when all samples are used for both training and testing */ | |||
double shrinkDict; /* Shrink dictionaries to select the smallest dictionary within the % specified by the auto-shrink-dict flag: 0 disables shrinking */ | |||
unsigned shrinkDict; /* Train dictionaries to shrink in size starting from the minimum size and selects the smallest dictionary within the max regression. 0 means no shrinking and 1 means shrinking */ | |||
unsigned shrinkDictMaxRegression; /* Sets the value that determines how much worse a dictionary can be than the dictionary of max dictionary size */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe what this value means? E.g. A smaller dict is allowed to be shrinkDictMaxRegression
worse than the full sized dictionary.
programs/zstdcli.c
Outdated
@@ -179,8 +179,8 @@ static int usage_advanced(const char* programName) | |||
DISPLAY( "\n"); | |||
DISPLAY( "Dictionary builder : \n"); | |||
DISPLAY( "--train ## : create a dictionary from a training set of files \n"); | |||
DISPLAY( "--train-cover[=k=#,d=#,steps=#,split=#,shrink=#] : use the cover algorithm with optional args\n"); | |||
DISPLAY( "--train-fastcover[=k=#,d=#,f=#,steps=#,split=#,accel=#,shrink=#] : use the fast cover algorithm with optional args\n"); | |||
DISPLAY( "--train-cover[=k=#,d=#,steps=#,split=#,shrink] : use the cover algorithm with optional args\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shrink[=#]
brackets means optional arguments
programs/zstdcli.c
Outdated
@@ -377,7 +394,8 @@ static ZDICT_cover_params_t defaultCoverParams(void) | |||
params.d = 8; | |||
params.steps = 4; | |||
params.splitPoint = 1.0; | |||
params.shrinkDict = 0.0; | |||
params.shrinkDict = 0; | |||
params.shrinkDictMaxRegression = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kDefaultRegression
programs/zstdcli.c
Outdated
@@ -390,7 +408,8 @@ static ZDICT_fastCover_params_t defaultFastCoverParams(void) | |||
params.steps = 4; | |||
params.splitPoint = 0.75; /* different from default splitPoint of cover */ | |||
params.accel = DEFAULT_ACCEL; | |||
params.shrinkDict = 0.0; | |||
params.shrinkDict = 0; | |||
params.shrinkDictMaxRegression = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kDefaultRegression
free(data); | ||
COVER_map_destroy(&activeDmers); | ||
if (dict) { | ||
free(dict); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you're leaking dict
.
free(data); | ||
free(segmentFreqs); | ||
free(dict); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you're leaking dict
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A one line fix and this looks good to go!
lib/dictBuilder/cover.c
Outdated
@@ -929,6 +934,111 @@ void COVER_best_finish(COVER_best_t *best, size_t compressedSize, | |||
} | |||
} | |||
|
|||
COVER_dictSelection_t COVER_dictSelectionError(size_t error) { | |||
COVER_dictSelection_t selection = { NULL, error, error }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to set dictSize == 0
here.
In COVER_best_finish()
when totalCompressedSize
is an error, but still < -1
we will allocate space for the dictionary. We will try to allocate ~2^64 bytes, which will probably fail, but still we don't want to do that.
The dictionary size should always be the right number here, which is 0
.
Created a flag that allows the dictionary builder to select the smallest dictionary that is within a 1.01 regression of the largest dictionary.