Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace libsnappy with cramjam #130

Merged
merged 2 commits into from
Feb 7, 2024
Merged

Replace libsnappy with cramjam #130

merged 2 commits into from
Feb 7, 2024

Conversation

gliptak
Copy link
Collaborator

@gliptak gliptak commented Feb 5, 2024

the semantics need to be lined up

#129 #124

@martindurant
Copy link
Member

cc @milesgranger , a long time coming!



class UncompressError(Exception):
pass


py3k = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

py2 stuff is for pypy only - maybe that should be dropped too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking a followup PR for current Python versions (and dropping Python2). I actually started with #129

@martindurant
Copy link
Member

Looks like the pypy tests need an explicit cast from cramjam buffer to bytes (i.e., bytes(...)).

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 5, 2024

@martindurant I had that earlier and it broke in a different way. let me restore

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 5, 2024

https://github.com/andrix/python-snappy/actions/runs/7787155361/job/21233634917?pr=130

FAILED test_snappy.py::SnappyCompressionTest::test_uncompress_error - cramjam.DecompressionError: failed to fill whole buffer
FAILED test_snappy.py::SnappyValidBufferTest::test_invalid_compressed_buffer - cramjam.DecompressionError: snappy: corrupt input (expected stream header but got unexpected chunk type byte 110)
FAILED test_snappy.py::SnappyStreaming::test_compression - AssertionError: 24 != 6
FAILED test_snappy.py::SnappyStreaming::test_decompression - cramjam.DecompressionError: failed to fill whole buffer

https://github.com/andrix/python-snappy/actions/runs/7787155357/job/21233634312?pr=130

packages/snappy/snappy.py:47: in <module>
    import cramjam
E   ModuleNotFoundError: No module named 'cramjam'

@martindurant
Copy link
Member

For the sdist (and actually any real install), will need to actually add cramjam to required deps in https://github.com/andrix/python-snappy/blob/master/setup.py#L66

@martindurant
Copy link
Member

For pypy, we might need @milesgranger 's help, perhaps the errors ring a bell. Is it possible we are confusing framed Vs raw?

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 5, 2024

https://github.com/andrix/python-snappy/actions/runs/7787502576/job/21234796426?pr=130

    def test_compress_memoryview(self):
        data = b"hello world!"
        expected = snappy.compress(data)
        actual = snappy.compress(memoryview(data))
>       self.assertEqual(actual, expected)
E       AssertionError: cramjam.Buffer<len=30> != cramjam.Buffer<len=30>

@milesgranger
Copy link

Ah, that's a bit unfortunate w/ the buffer comparisons not working. Can just cast by bytes(...) == bytes(...) if you want in the meantime.

I'll try to look more deeply at the other errors. But the De/Compressor classes work on framed snappy format FWIW.

@milesgranger
Copy link

_________________ SnappyCompressionTest.test_uncompress_error __________________

self = <test_snappy.SnappyCompressionTest testMethod=test_uncompress_error>

    def test_uncompress_error(self):
>       self.assertRaises(snappy.UncompressError, snappy.uncompress,
                          "hoa".encode('utf-8'))

test_snappy.py:70: 

This is just a matter of catching cramjam's error and re-raising as python-snappy's error, no?

    def test_compression(self):
        # test that we can add compressed chunks
        compressor = snappy.StreamCompressor()
        data = b"\0" * 50
        compressed_data = snappy.compress(data)
        crc = struct.pack("<L", snappy.snappy._masked_crc32c(data))
        self.assertEqual(crc, b"\x8f)H\xbd")
>       self.assertEqual(len(compressed_data), 6)
E       AssertionError: 24 != 6

This is compressing with framed and comparing against what snappy raw would output (checked, this will be 6 w/ raw format)

And ya, I think the other two are raw vs framed comparison errors.

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 5, 2024

thank you @milesgranger I added bytes cast and exception wrapping

would you have guidance on correcting "framed vs. raw" errors?

@milesgranger
Copy link

would you have guidance on correcting "framed vs. raw" errors?

You bet! python-snappy's compress function is the snappy raw format. In the specification / Rust snappy it's preferred to use the framed format, therefore cramjam's snappy.compress is the framed format.

That ends up looking like this:

In [1]: import cramjam

In [2]: import snappy

In [3]: data = b"\0" * 50

In [4]: snappy.compress(data)
Out[4]: b'2\x00\x00\xc2\x01\x00'

In [5]: bytes(cramjam.snappy.compress(data))
Out[5]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [6]: bytes(cramjam.snappy.compress_raw(data))
Out[6]: b'2\x00\x00\xc2\x01\x00'

So it appears, if python-snappy is to keep it's current API, then behind the covers, snappy.compress ought to be calling cramjam.snappy.compress_raw.

For the De/Compressor classes, I'm less sure how to maintain the current API. I think I took inspiration from lz4 or something, can't really remember. But effectively cramjam's all have a .compress method which adds chunks to the inner compressed stream. Then .finish() to add any header/tail information and return the completed compressed stream. This seems different for python-snappy's impl. Which Compressor.compress returns the head, then subsequent calls add leave out the header:

In [1]: import snappy

In [2]: import cramjam

In [3]: data = b"\0" * 50

In [4]: compressor = snappy.StreamCompressor()

In [5]: compressor.compress(data)
Out[5]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [6]: compressor.compress(data)
Out[6]: b'\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [7]: c = cramjam.snappy.Compressor()

In [8]: c.compress(data)
Out[8]: 50

In [9]: c.compress(data)
Out[9]: 50

In [10]: bytes(c.finish())
Out[10]: b'\xff\x06\x00\x00sNaPpY\x00\r\x00\x00o\xfan\xe2d\x00\x00\xfe\x01\x00\x8a\x01\x00'

In [11]: python_snappy = b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00' + b'\x00\n\x00
    ...: \x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [12]: cramjam.snappy.decompress(python_snappy)
Out[12]: cramjam.Buffer<len=100>

You can see cramjam's is different, but still able to decompress python-snappy's so it seems okay just adding more padding it seems around each chunk and then it's more on the user to concatenate the total stream. Whereas with cramjam it's a bit more "managed". Also I note that add_chunk and compress returns the same data after the first call to compress fyi.

So, that's my understanding. If you wanted to maintain the current API that might be as easy (hacky?) as having the first call to Compressor.compress return exactly what cramjam.snappy.compress gives, then all subsequent calls chop off the header (10 bytes), that seems to produce exactly the same as python-snappy's compressor:

In [1]: import cramjam

In [2]: import snappy

In [3]: data = b"\0" * 50

In [4]: compressor = snappy.StreamCompressor()

In [5]: compressor.compress(data)
Out[5]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [6]: compressor.compress(data)
Out[6]: b'\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [7]: bytes(cramjam.snappy.compress(data))
Out[7]: b'\xff\x06\x00\x00sNaPpY\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

In [8]: bytes(cramjam.snappy.compress(data))[10:]
Out[8]: b'\x00\n\x00\x00\x8f)H\xbd2\x00\x00\xc2\x01\x00'

As a separate note, the first suggestion to cast cramjam.Buffer using bytes works for the test, but maybe it's worthwhile to basically do that everywhere python-snappy returns data? That way current users upgrading won't suddenly see "cramjam.Buffer" showing up when python-snappy always use to return bytes.

Signed-off-by: Gábor Lipták <gliptak@gmail.com>
@gliptak
Copy link
Collaborator Author

gliptak commented Feb 6, 2024

seems to be new python2.7 related error https://github.com/andrix/python-snappy/actions/runs/7802511369/job/21280133020?pr=130

  DEPRECATION: pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
  ERROR: Could not find a version that satisfies the requirement maturin<0.14,>=0.13 (from versions: 0.7.1, 0.7.2, 0.7.6, 0.11.0, 0.11.2, 0.11.4)
  ERROR: No matching distribution found for maturin<0.14,>=0.13

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 6, 2024

thank you @milesgranger I updated with *_raw and bytes() and build is green with no tests updated

@martindurant do you see current tests providing strong coverage for Stream*?

PS also dropped pypy2.7 for above unrelated/environment error

@gliptak gliptak requested a review from martindurant February 6, 2024 16:22
@martindurant
Copy link
Member

do you see current tests providing strong coverage for Stream*?

well apparently not! I am not surprised, and I am prepared to provide only "best effort" here; it would be best if docstrings are updated to say so, however. Were you thinking to drop them altogether?

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 6, 2024

I wasn't thinking of removing functionality. https://github.com/andrix/python-snappy/blob/master/test_snappy.py#L119-L259 seems to be exercising stream functionality to good extent with asserts. one difference I see that tests use add_chunk (which seems to deal with CRC and headers?)

please let me know how you would like to proceed @martindurant (and provide "best effort" docstrings if appropriate)

@martindurant
Copy link
Member

StreamCompressor.compress is an alias for .add_chunk , so that's fine.

@martindurant
Copy link
Member

I believe StreamDe/Compressor could be replaced by cramjam.snappy.De/Compressor. The current code is doing the same framing as before, but calling cramjam's _raw codecs internally, so the exact locations of the chunk boundaries might change, but it produces valid output.

In other words, since there are after all some tests, I think all is well!

Any remaining reservations preventing merge?

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 6, 2024

thank you @martindurant no reservations from my side

@gliptak
Copy link
Collaborator Author

gliptak commented Feb 6, 2024

#111

@martindurant martindurant merged commit 73c7f55 into intake:master Feb 7, 2024
9 checks passed
@gliptak gliptak deleted the cram1 branch February 7, 2024 15:10
@gliptak
Copy link
Collaborator Author

gliptak commented Feb 20, 2024

#123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants