adding support for input encoding and output decoding #50

Merged: 36 commits, Nov 13, 2017

Conversation

mattsb42-aws (Member):

adding support for input encoding and output decoding #29

@mattsb42-aws mattsb42-aws requested a review from a team October 14, 2017 03:39

.. warning::

Because up to two bytes of data must be buffered to ensure correct base64 encoding
Contributor:

This seems a bit restrictive. Relying on close being called isn't enough?

Member Author:

The problem is that if close is not called, some data (up to two bytes of raw data) might not be written through to the underlying object. Unlike most other effects of close methods, this is not something we can ever reasonably rely on something else to clean up, so I wanted some way to make it as hard as possible for someone to shoot themselves in the foot. Because we control exactly what happens on enter and exit when this is used as a context manager, limiting it to that use lets us guarantee that close is always called.
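The hazard, and the context-manager guarantee, can be illustrated with a minimal sketch (hypothetical class and attribute names, not the PR's actual implementation):

```python
import base64
import io

class EncoderSketch:
    """Hypothetical minimal base64-encoding writer."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.buffer = b''

    def write(self, b):
        data = self.buffer + b
        whole = len(data) - len(data) % 3  # only whole 3-byte groups encode cleanly
        self.buffer = data[whole:]         # up to two bytes stay buffered
        self.wrapped.write(base64.b64encode(data[:whole]))

    def close(self):
        # Without this, the buffered bytes would be silently lost.
        if self.buffer:
            self.wrapped.write(base64.b64encode(self.buffer))
            self.buffer = b''

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()  # context-manager use guarantees the flush

sink = io.BytesIO()
with EncoderSketch(sink) as encoder:
    encoder.write(b'abcd')  # three bytes written through, one buffered
assert base64.b64decode(sink.getvalue()) == b'abcd'
```

If `close` is never called on this sketch, the final byte never reaches the wrapped stream, which is the data-loss scenario being discussed.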

Contributor:

This limits what contexts this can be used in, however. Normally I'd expect bad things to happen if I forget to close a stream, in general.

Member Author:

After offline discussion with other Python devs, I'm inclined to agree. I'm replicating the warning from the write docstring into the class docstring to try and make it more visible to the user.


.. note::

This does **not** close the wrapped stream.
Contributor:

Is this normal in python?

Member Author:

Normal operation is usually to not close things other than yourself unless told to. That being said, I can see cases where it would be useful to be able to request that the wrapped stream be closed when the wrapping stream closes. Adding this as an optional instantiation parameter.

Contributor:

Note that I'm not saying that it should, I'm just not sure what is pythonic :)

raise ValueError('I/O operation on closed file.')

# Load any stashed bytes and clear the buffer
_b = self.__write_buffer + b
Contributor:

We use _b to mean "buffer" here...

_b = None
if b is not None:
# Calculate number of encoded bytes that must be read to get b raw bytes.
_b = int((b - len(self.__read_buffer)) * 4 / 3)
Contributor:

but here _b it means "number of bytes". Use a more descriptive variable name, or at least be consistent with how you use single-letter variables.

Contributor:

Also, can this go negative if __read_buffer has sufficient data to cover the requested size?

Member Author:

Fair point. b on input is used for both because of convention, but we can use more descriptive names internally.

The initial value of _b can be negative, yes. However, the adjustment that rounds the read up to a multiple of 4 brings it back up to at least 0, because __read_buffer will never hold more than 4 bytes.
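The calculation discussed here can be sketched as a standalone helper (hypothetical name, mirroring the quoted lines):

```python
def encoded_bytes_to_read(requested, buffered):
    """Number of base64-encoded bytes to read to yield `requested` raw bytes.

    `buffered` is the count of already-decoded bytes waiting in __read_buffer.
    """
    # Every 4 encoded bytes decode to 3 raw bytes.
    target = int((requested - buffered) * 4 / 3)
    # Round up to a whole number of 4-byte base64 quanta; this also pulls a
    # negative initial value (buffer already covers the request) up to 0.
    target += 4 - target % 4
    return target

print(encoded_bytes_to_read(10, 0))  # 16
print(encoded_bytes_to_read(1, 3))   # 0
```

Note that, as in the quoted code, a value that is already a multiple of 4 is bumped up by a full extra quantum, which only over-reads slightly.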

_b = int((b - len(self.__read_buffer)) * 4 / 3)
_b += (4 - _b % 4)

_LOGGER.debug('%s bytes requested: adjusted to %s bytes', b, _b)
Contributor:

These log messages are excessively verbose and will clutter debug logs.

Member Author:

Fair enough; reducing to one ("x requested, reading y").

Contributor:

Generally speaking, I would not recommend leaving in debug logs at the level of individual read calls in an IO layer. The overhead could be quite high since we're calling them in a loop and don't necessarily know what the buffer size being transferred might be.

Member Author:

Fair point. I'm not overly concerned with keeping it in there. Pulling it out.

"""
if should_base64:
return Base64IO(stream)
return ObjectProxy(stream)
Contributor:

Why not return stream?

Member Author:

It was for consistency, so that it would always return a thing of the same type. However, after switching to basing Base64IO on io.IOBase this will be the case anyway, so I'm switching to just returning stream when not using base64.

# Because we can actually know the size for files and Base64IO does not support seeking,
# set the source length manually for files. This enables data key caching when
# Base64-decoding a source file.
_stream_args['source_length'] = os.path.getsize(source)
Contributor:

When the input is to be decoded from b64, we should be using the decoded size (i.e. size * (64/256)), not the encoded size. This provides a tighter bound on the file size.

Member Author:

Wouldn't that be size * (3 / 4)?

Contributor:

Er, yes. :)

Member Author:

Added.
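The corrected calculation (the decoded length is 3/4 of the encoded file size) can be sketched as follows; the temp file and variable names here are illustrative only:

```python
import base64
import os
import tempfile

encoded = base64.b64encode(b'x' * 300)  # 300 raw bytes -> 400 encoded bytes
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(encoded)
    source = handle.name

# Decoded length is 3/4 of the encoded file size.
source_length = os.path.getsize(source) * 3 // 4
os.remove(source)
print(source_length)  # 300
```

Because of base64 padding, this is an upper bound rather than an exact length when the raw data is not a multiple of 3 bytes, which is still a usable bound for the caching purpose discussed above.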


with open(str(b64_ciphertext), 'rb') as b64_ct, open(str(ciphertext), 'wb') as ct:
raw_ct = base64.b64decode(b64_ct.read())
print('raw_ct bytes:', len(raw_ct))
Contributor:

Remove debugging prints before submitting.

Member Author:

Whoops. This is exactly why #22



@pytest.mark.skipif(not _should_run_tests(), reason='Integration tests disabled. See test/integration/README.rst')
def test_file_to_file_base64_decode_only(tmpdir):
Contributor:

I'd suggest making these some kind of parameterized test - covering all three interesting cases (encode, decode, encode/decode) for both encrypt and decrypt.

Member Author:

Good idea. Condensing and parameterizing.

plaintext_wrapped.write(plaintext_source)

assert plaintext_stream.getvalue() == plaintext_b64

Contributor:

  • Test also reading/writing 1, 2, 3, and 4 bytes at a time.
  • Test encoding and decoding binary strings with lengths 0,1,2,3 modulo 4.
  • Test behavior when the input has embedded whitespace (whatever that behavior should be).

Member Author:

Additional test cases added. For whitespace, I think the route that GNU base64 takes is preferable, so I added the ability to optionally ignore whitespace in the wrapped stream.
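The GNU-style behavior (silently ignoring whitespace) can be approximated on the read path with something like the sketch below; the helper name is assumed, and a real implementation must also re-read to make up for the stripped bytes:

```python
import base64
import io
import string

_WHITESPACE = string.whitespace.encode('ascii')

def read_ignoring_whitespace(wrapped, n):
    # Drop ASCII whitespace (e.g. the line breaks in formatted base64)
    # from the chunk before decoding it.
    data = wrapped.read(n).translate(None, _WHITESPACE)
    return base64.b64decode(data)

source = io.BytesIO(b'aGVs\nbG8=\n')
print(read_ignoring_whitespace(source, 100))  # b'hello'
```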

@mattsb42-aws (Member Author):

Updated with associated changes as discussed.

@mattsb42-aws mattsb42-aws requested a review from a team October 24, 2017 18:00
@mattsb42-aws (Member Author):

update with close() simplification discussed offline and rebase

"""
try:
return getattr(self.__wrapped, method_name)()
except AttributeError:
Contributor:

This will suppress an AttributeError thrown from the called method itself.

Member Author:

Good point. Moving call outside of try block.

:type limit: int
:rtype: bytes
"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)
Contributor:

Won't this limit the line length to io.DEFAULT_BUFFER_SIZE when the limit is unspecified? Shouldn't it load an arbitrarily long line instead?

Also, couldn't this load more than one line in a single call?

Member Author (mattsb42-aws, Nov 1, 2017):

The canonical behavior is to read until a newline character is found. However, as with the encryption SDK, (1) we have no reason to believe a newline will necessarily be present in the source, and (2) a "line" is not necessarily useful on its own.

So yes, this readline does not strictly speaking return an actual "line". However, the readline method is commonly used to iterate over the contents of file-like objects and is the expected iteration step when using the file-like as an iterator.

To avoid users inadvertently reading the entire source file into memory, we use io.DEFAULT_BUFFER_SIZE as an arbitrary length for the "line" that we return.

Contributor:

Why -1 and not None?

Member Author:

Because that is the canonical default value for readline.

https://docs.python.org/2/library/io.html#io.IOBase.readline

self.__write_buffer = b''

# If an even base64 chunk or finalizing the stream, write through.
if len(_bytes_to_write) % 3 == 0 or self.__finalize:
Contributor:

It's simpler to remove the __finalize flag and just dump __write_buffer out in close

# type: (STREAM_KWARGS, SOURCE, IO) -> None
def _encoder(stream, should_base64):
# type: (IO, bool) -> Union[IO, Base64IO]
"""Wraps a stream in either a Base64IO transformer or a transparent proxy.
Contributor:

No transparent proxy is involved

Member Author:

Whoops. Missed updating that when I moved away from ProxyObject. Fixed.



@pytest.mark.parametrize('source_bytes, read_bytes', TEST_CASES)
def test_base64io_decode(source_bytes, read_bytes):
Contributor:

These tests only check a single operation - they don't test that performing a repeated read or write at a strange buffer size works properly.

Member Author:

Expanding to cover more cases.

with Base64IO(io.BytesIO(b64_plaintext_with_whitespace)) as decoder:
decoder.read(read_bytes)

excinfo.match(r'Whitespace found in base64-encoded data. Whitespace must be ignored to read this stream.')
Contributor:

Is this ever desired behavior?

Member Author:

This is a fair point. There's not really anything the caller could do except start over with the flag set, and if you're wrapping a stream, the rest of the structure assumes that the entire contents are encoded, so this really just complicates the flow.

plaintext_source, b64_plaintext_with_whitespace = build_b64_with_whitespace(source_bytes, 1)

with Base64IO(io.BytesIO(b64_plaintext_with_whitespace), ignore_whitespace=True) as decoder:
test = decoder.read(read_bytes)
Contributor:

Tests are needed for repeated reads with whitespace enabled - particularly the case where our initial read from the stream doesn't have enough bytes to cover the requested data due to the embedded whitespace, as well as the case where the initial read contains only whitespace, or mostly whitespace with the actual data being insufficient to decode any bytes.

Member Author:

Adjusting generated line length to 3 to catch both "read spans lines" and "read is less than a line". Also adding two more cases with majority and entirety of initial read as whitespace.

(1, io.DEFAULT_BUFFER_SIZE),
(io.DEFAULT_BUFFER_SIZE + 99, io.DEFAULT_BUFFER_SIZE * 2)
))
def test_base64io_decode_readlines(hint_bytes, expected_bytes_read):
Contributor:

Tests needed around lines longer than the hint (or the default hint)

Member Author:

@pytest.mark.parametrize('hint_bytes, expected_bytes_read', (
(-1, 102400),
(0, 102400),
(1, io.DEFAULT_BUFFER_SIZE),
(io.DEFAULT_BUFFER_SIZE + 99, io.DEFAULT_BUFFER_SIZE * 2)
))

(-1, 102400) < default hint
(1, io.DEFAULT_BUFFER_SIZE) < "line" longer than hint ("line" length is always io.DEFAULT_BUFFER_SIZE)

a=sentinel.a,
b=sentinel.b
)
expected_length = int(os.path.getsize(str(source)) * expected_multiplier)
Contributor:

Is this used anywhere?

Member Author:

Yes. See line 331. ;)

@mattsb42-aws (Member Author):

updates to address comments with rebase on master

bdonlan (Contributor) commented Nov 7, 2017:

I'd suggest avoiding rebasing as long as GitHub indicates the branch can be automatically merged - now I have to re-review all the changes because I can't see which ones are new :/


Because up to two bytes of data must be buffered to ensure correct base64 encoding
of all data written, this object **must** be closed after you are done writing to
avoid data loss. If used as a context manager, we take care of that for you.
Contributor:

Minor: Up to you, but I might clarify that closing this object does not imply/require closing the underlying stream unless close_wrapped_on_close is set.

I might also note that this is intentionally not reusable and why.

Member Author:

Added.

:raises ValueError: if called on Base64IO object outside of a context manager
"""
if self.closed:
raise ValueError('I/O operation on closed file.')
Contributor:

Minor: Maybe include a handle or identifier of the object?

Also minor: Isn't this carefully described as a generic stream most other places?

Member Author:

That's the standard error type/message for any operation on a closed file-like.

https://docs.python.org/2/library/io.html#io.IOBase.close

>>> a = io.BytesIO()
>>> a.read()
b''
>>> a.close()
>>> a.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
>>> 
>>> b = open('tox.ini')
>>> b.tell()
0
>>> b.close()
>>> b.tell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.

raise ValueError('I/O operation on closed file.')

if not self.writable():
raise IOError('Stream is not writable')
Contributor:

Minor identifier thing

avoid data loss. If used as a context manager, we take care of that for you.

:param bytes b: Bytes to write to wrapped stream
:raises ValueError: if called on closed Base64IO object
Contributor:

Is the IOError intentionally omitted from the raises clause?

Member Author:

Nope; looks like I forgot to include it. Added. Also removed the old ValueError reference left over from when we were enforcing use inside a context manager.



class Base64IO(io.IOBase):
"""Wraps a stream, base64-decoding read results before returning them.
Contributor:

It also base64-encodes bytes before writing them?

Member Author:

Yup. Added. I must have written that really early on.


# Read encoded bytes from wrapped stream.
data = self.__wrapped.read(_bytes_to_read)
if any([six.b(char) in data for char in string.whitespace]):
Contributor:

Minor: maybe note this is clearing whitespace out of base64 streams that have been formatted e.g. with line breaks?

Member Author:

Added

:type limit: int
:rtype: bytes
"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)
Contributor:

Why -1 and not None?

def readline(self, limit=-1):
# type: (int) -> bytes
"""Read and return one line from the stream.
If limit is specified, at most limit bytes will be read.
Contributor:

+"Otherwise, io.DEFAULT_BUFFER_SIZE"? Or is that a well known convention, no need to redundantly document?

Member Author:

No, that's a good point. The normal expectation is to read until EOL, but as with the encryption SDK streaming classes, we don't have any expectation that the source actually contains an EOL character. Adding a note to that docstring.

"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)

def readlines(self, hint=-1):
Contributor:

Same question about -1 vs. None?

Member Author:

Similar answer: it's the canonical default value:

https://docs.python.org/2/library/io.html#io.IOBase.readlines

Contributor:

Gotcha.

@@ -0,0 +1,339 @@
# Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Contributor:

Thoughts on testing reuse of the base64 encoder context manager?

Member Author:

Good idea. Adding.

@mattsb42-aws (Member Author):

Updated with changes to address comments. Because of how GitHub handles rebasing in PRs, I'm going to hold off on a rebase until this PR is approved to ease load on reviewers.

…ngth to encrypt operation - this is required for caching to work when reading from Base64IO
@mattsb42-aws mattsb42-aws merged commit a448bfc into aws:master Nov 13, 2017
@mattsb42-aws mattsb42-aws deleted the dev-29 branch November 13, 2017 18:51