Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the check on dictionary capacity #52

Closed
wants to merge 1 commit into from
Closed

Remove the check on dictionary capacity #52

wants to merge 1 commit into from

Conversation

bodgit
Copy link

@bodgit bodgit commented Dec 11, 2022

I'm using your xz package in my 7-zip reader package to decompress LZMA and LZMA2 streams. I've had some reports of some archives being unreadable with a lzma: dictionary capacity too small error.

I thought I could override the default ReaderConfig and set a bigger dictionary capacity however it seems the check that triggers this error fires before the capacity is overridden by the ReaderConfig.DictCap value. The default DictCap value is much bigger than MinDictCap anyway, (8 MB vs 4 KB IIRC).

Removing this check allows the ReaderConfig.DictCap value to override whatever was in the stream header and it's guaranteed to be at least MinDictCap thanks to ReaderConfig.Verify().

The unreadable LZMA streams were then readable.

The `ReaderConfig.DictCap` value is already range checked and guaranteed
to be at least `MinDictCap` so the archive will always have a dictionary
capacity of `MinDictCap` if a smaller value was in the header.
@ulikunitz
Copy link
Owner

Thank you for bringing that to my attention. Returning an error is indeed wrong, because the LZMA specification clearly states that a value smaller than 4096 byte should be set to 4096 byte. Since I want to do what the spec says I will not apply your pull request, but make sure that your approach of overriding the dictionary size works. I will also add a test case. You can expect a v0.5.11 in the next week. (I will also change the logic in my newer rewrite branch.)

I have one question though, would the 4096 dictionary size value work with your LZMA streams or do you require higher dictionary sizes? I would like to understand the real life situation.

@bodgit
Copy link
Author

bodgit commented Dec 12, 2022

I have one question though, would the 4096 dictionary size value work with your LZMA streams or do you require higher dictionary sizes? I would like to understand the real life situation.

I'll try and find out what the value is actually set to. It's either going to be somewhere between 0 and 4096 or maybe its just 0. The archives in question are apparently quite old so it might just be an old/buggy 7-zip version that didn't encode the stream header correctly. I've got a test case made with a current version of 7-zip that uses LZMA which contains two streams, one for the archive metadata and another for the file contents and the dictionary is set to 4096 and 49152 respectively.

I'm not really familiar with the LZMA algorithm so I don't know what the behaviour is if the size is set too small. If the streams are setting a value that is non-zero, but less than 4096, then I suspect rounding up to 4096 would work fine. If it's unset, i.e. 0 because it wasn't encoded properly, then the library default of 8 MB seems to work fine but I'm not sure how to find an optimal value automatically.

@ulikunitz
Copy link
Owner

The LZ method supports copies of segments of the file preceding the current position of the file to be decompressed. The dictionary size describes the size of the sliding window from which the segments can be copied. If the window is larger than the dictionary size required, no issue will occur, if it is too small, the file will not be successful decoded. The window doesn't need to be larger than the stream to be decompressed. The archive metadata may be shorter than 4096 byte and so the data size might be used as window size, which would then be smaller than 4096 byte.

I suggest that you simply use the new implementation. If it's works without overriding "dictCap" than there is no issue, if you still need to override the value, your example stream is non-compliant with the LZMA specification.

ulikunitz added a commit that referenced this pull request Dec 12, 2022
As Matt Dainty (@bodgit) reported there is an issue if the header of the
LZMA stream is less than the minimum dictionary size of 4096 byte. The
specification of the LZMA format says that in that case a dictionary
size of 4096 byte should be used, our code returns an error.

This commit changes the behavior and adds a simple test case to test for
the right behavior.

Fixes [#52](#52)
@ulikunitz
Copy link
Owner

I released v0.5.11. The version should fix your issue.

@ulikunitz
Copy link
Owner

Hi, I would be interested to know whether the release works for you and whether it works without setting DictCap in the config parameter.

@ulikunitz
Copy link
Owner

Since it has been confirmed that 0.5.11 solves the issue, I close this pull request.

@ulikunitz ulikunitz closed this Dec 18, 2022
@bodgit
Copy link
Author

bodgit commented Dec 18, 2022

It seems the streams with a dictionary capacity less than 4096 had the capacity set to 1024 or 2048, so ensuring it's at least 4096 seems to work fine.

Thanks for the bug fix!

@bodgit bodgit deleted the dictcap branch December 18, 2022 23:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants