adding support for input encoding and output decoding #50

Merged: 36 commits, Nov 13, 2017

Conversation

mattsb42-aws (Member):

adding support for input encoding and output decoding #29

@mattsb42-aws mattsb42-aws requested a review from a team October 14, 2017 03:39

.. warning::

Because up to two bytes of data must be buffered to ensure correct base64 encoding
Contributor:

This seems a bit restrictive. Relying on close being called isn't enough?

Member Author:

The problem is that if close is not called, some data (up to two bytes of raw data) might not be written through to the underlying object. Unlike most other effects of close methods, this is not something we can ever reasonably rely on something else to clean up, so I wanted some way to make it as hard as possible for someone to shoot themselves in the foot. Because we control exactly what happens on enter and exit when this is used as a context manager, limiting it to that use lets us guarantee that close is always called.
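The hazard, and the context-manager guarantee, can be illustrated with a minimal sketch (hypothetical class and attribute names, not the PR's actual implementation):

```python
import base64
import io

class EncoderSketch:
    """Hypothetical minimal base64-encoding writer."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.buffer = b''

    def write(self, b):
        data = self.buffer + b
        whole = len(data) - len(data) % 3  # only whole 3-byte groups encode cleanly
        self.buffer = data[whole:]         # up to two bytes stay buffered
        self.wrapped.write(base64.b64encode(data[:whole]))

    def close(self):
        # Without this, the buffered bytes would be silently lost.
        if self.buffer:
            self.wrapped.write(base64.b64encode(self.buffer))
            self.buffer = b''

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()  # context-manager use guarantees the flush

sink = io.BytesIO()
with EncoderSketch(sink) as encoder:
    encoder.write(b'abcd')  # three bytes written through, one buffered
assert base64.b64decode(sink.getvalue()) == b'abcd'
```

If `close` is never called on this sketch, the final byte never reaches the wrapped stream, which is the data-loss scenario being discussed.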

Contributor:

This limits what contexts this can be used in, however. Normally I'd expect bad things to happen if I forget to close a stream, in general.

Member Author:

After offline discussion with other Python devs, I'm inclined to agree. I'm replicating the warning from the write docstring into the class docstring to try and make it more visible to the user.


.. note::

This does **not** close the wrapped stream.
Contributor:

Is this normal in python?

Member Author:

Normal operation is usually to not close things other than yourself unless told to. That being said, I can see cases where it would be useful to be able to request that the wrapped stream be closed when the wrapping stream closes. Adding this as an optional instantiation parameter.

Contributor:

Note that I'm not saying that it should, I'm just not sure what is pythonic :)

raise ValueError('I/O operation on closed file.')

# Load any stashed bytes and clear the buffer
_b = self.__write_buffer + b
Contributor:

We use _b to mean "buffer" here...

_b = None
if b is not None:
# Calculate number of encoded bytes that must be read to get b raw bytes.
_b = int((b - len(self.__read_buffer)) * 4 / 3)
Contributor:

but here _b it means "number of bytes". Use a more descriptive variable name, or at least be consistent with how you use single-letter variables.

Contributor:

Also, can this go negative if __read_buffer has sufficient data to cover the requested size?

Member Author:

Fair point. b on input is used for both because of convention, but we can use more descriptive names internally.

The initial value of _b can be negative, yes. However, the adjustment that rounds the read up to a multiple of 4 brings it back up to at least 0, because __read_buffer will never hold more than 4 bytes.
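The calculation discussed here can be sketched as a standalone helper (hypothetical name, mirroring the quoted lines):

```python
def encoded_bytes_to_read(requested, buffered):
    """Number of base64-encoded bytes to read to yield `requested` raw bytes.

    `buffered` is the count of already-decoded bytes waiting in __read_buffer.
    """
    # Every 4 encoded bytes decode to 3 raw bytes.
    target = int((requested - buffered) * 4 / 3)
    # Round up to a whole number of 4-byte base64 quanta; this also pulls a
    # negative initial value (buffer already covers the request) up to 0.
    target += 4 - target % 4
    return target

print(encoded_bytes_to_read(10, 0))  # 16
print(encoded_bytes_to_read(1, 3))   # 0
```

Note that, as in the quoted code, a value that is already a multiple of 4 is bumped up by a full extra quantum, which only over-reads slightly.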

_b = int((b - len(self.__read_buffer)) * 4 / 3)
_b += (4 - _b % 4)

_LOGGER.debug('%s bytes requested: adjusted to %s bytes', b, _b)
Contributor:

These log messages are excessively verbose and will clutter debug logs.

Member Author:

Fair enough; reducing to one ("x requested, reading y").

Contributor:

Generally speaking, I would not recommend leaving in debug logs at the level of individual read calls in an IO layer. The overhead could be quite high since we're calling them in a loop and don't necessarily know what the buffer size being transferred might be.

Member Author:

Fair point. I'm not overly concerned with keeping it in there. Pulling it out.

"""
if should_base64:
return Base64IO(stream)
return ObjectProxy(stream)
Contributor:

Why not return stream?

Member Author:

It was for consistency, so that it would always return a thing of the same type. However, after switching to basing Base64IO on io.IOBase this will be the case anyway, so I'm switching to just returning stream when not using base64.

# Because we can actually know the size for files and Base64IO does not support seeking,
# set the source length manually for files. This enables data key caching when
# Base64-decoding a source file.
_stream_args['source_length'] = os.path.getsize(source)
Contributor:

When the input is to be decoded from b64, we should be using the decoded size (i.e. size * (64/256)), not the encoded size. This provides a tighter bound on the file size.

Member Author:

Wouldn't that be size * (3 / 4)?

Contributor:

Er, yes. :)

Member Author:

Added.
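The corrected calculation (the decoded length is 3/4 of the encoded file size) can be sketched as follows; the temp file and variable names here are illustrative only:

```python
import base64
import os
import tempfile

encoded = base64.b64encode(b'x' * 300)  # 300 raw bytes -> 400 encoded bytes
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(encoded)
    source = handle.name

# Decoded length is 3/4 of the encoded file size.
source_length = os.path.getsize(source) * 3 // 4
os.remove(source)
print(source_length)  # 300
```

Because of base64 padding, this is an upper bound rather than an exact length when the raw data is not a multiple of 3 bytes, which is still a usable bound for the caching purpose discussed above.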


with open(str(b64_ciphertext), 'rb') as b64_ct, open(str(ciphertext), 'wb') as ct:
raw_ct = base64.b64decode(b64_ct.read())
print('raw_ct bytes:', len(raw_ct))
Contributor:

Remove debugging prints before submitting.

Member Author:

Whoops. This is exactly why #22



@pytest.mark.skipif(not _should_run_tests(), reason='Integration tests disabled. See test/integration/README.rst')
def test_file_to_file_base64_decode_only(tmpdir):
Contributor:

I'd suggest making these some kind of parameterized test - covering all three interesting cases (encode, decode, encode/decode) for both encrypt and decrypt.

Member Author:

Good idea. Condensing and parameterizing.

plaintext_wrapped.write(plaintext_source)

assert plaintext_stream.getvalue() == plaintext_b64

Contributor:

  • Test also reading/writing 1, 2, 3, and 4 bytes at a time.
  • Test encoding and decoding binary strings with lengths 0,1,2,3 modulo 4.
  • Test behavior when the input has embedded whitespace (whatever that behavior should be).

Member Author:

Additional test cases added. For whitespace, I think the route that GNU base64 takes is preferable, so I added the ability to optionally ignore whitespace in the wrapped stream.
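The GNU-style behavior (silently ignoring whitespace) can be approximated on the read path with something like the sketch below; the helper name is assumed, and a real implementation must also re-read to make up for the stripped bytes:

```python
import base64
import io
import string

_WHITESPACE = string.whitespace.encode('ascii')

def read_ignoring_whitespace(wrapped, n):
    # Drop ASCII whitespace (e.g. the line breaks in formatted base64)
    # from the chunk before decoding it.
    data = wrapped.read(n).translate(None, _WHITESPACE)
    return base64.b64decode(data)

source = io.BytesIO(b'aGVs\nbG8=\n')
print(read_ignoring_whitespace(source, 100))  # b'hello'
```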

@mattsb42-aws (Member Author):

Updated with associated changes as discussed.

@mattsb42-aws mattsb42-aws requested a review from a team October 24, 2017 18:00
@mattsb42-aws (Member Author):

update with close() simplification discussed offline and rebase

"""
try:
return getattr(self.__wrapped, method_name)()
except AttributeError:
Contributor:

This will suppress an AttributeError thrown from the called method itself.

Member Author:

Good point. Moving call outside of try block.

:type limit: int
:rtype: bytes
"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)
Contributor:

Won't this limit the line length to io.DEFAULT_BUFFER_SIZE when the limit is unspecified? Shouldn't it load an arbitrarily long line instead?

Also, couldn't this load more than one line in a single call?

Member Author (mattsb42-aws, Nov 1, 2017):

The canonical behavior is to read until a newline character is found. However, as with the encryption SDK, (1) we have no reason to believe a newline will necessarily be present in the source, and (2) a "line" is not necessarily useful on its own.

So yes, this readline does not strictly speaking return an actual "line". However, the readline method is commonly used to iterate over the contents of file-like objects and is the expected iteration step when using the file-like as an iterator.

To avoid users inadvertently reading the entire source file into memory, we use io.DEFAULT_BUFFER_SIZE as an arbitrary length for the "line" that we return.

Contributor:

Why -1 and not None?

Member Author:

Because that is the canonical default value for readline.

https://docs.python.org/2/library/io.html#io.IOBase.readline

self.__write_buffer = b''

# If an even base64 chunk or finalizing the stream, write through.
if len(_bytes_to_write) % 3 == 0 or self.__finalize:
Contributor:

It's simpler to remove the __finalize flag and just dump __write_buffer out in close

# type: (STREAM_KWARGS, SOURCE, IO) -> None
def _encoder(stream, should_base64):
# type: (IO, bool) -> Union[IO, Base64IO]
"""Wraps a stream in either a Base64IO transformer or a transparent proxy.
Contributor:

No transparent proxy is involved

Member Author:

Whoops. Missed updating that when I moved away from ProxyObject. Fixed.



@pytest.mark.parametrize('source_bytes, read_bytes', TEST_CASES)
def test_base64io_decode(source_bytes, read_bytes):
Contributor:

These tests only check a single operation - they don't test that performing a repeated read or write at a strange buffer size works properly.

Member Author:

Expanding to cover more cases.

with Base64IO(io.BytesIO(b64_plaintext_with_whitespace)) as decoder:
decoder.read(read_bytes)

excinfo.match(r'Whitespace found in base64-encoded data. Whitespace must be ignored to read this stream.')
Contributor:

Is this ever desired behavior?

Member Author:

This is a fair point. There's not really anything the caller could do except start over with the flag set, and if you're wrapping a stream, the rest of the structure assumes that the entire contents are encoded, so this really just complicates the flow.

plaintext_source, b64_plaintext_with_whitespace = build_b64_with_whitespace(source_bytes, 1)

with Base64IO(io.BytesIO(b64_plaintext_with_whitespace), ignore_whitespace=True) as decoder:
test = decoder.read(read_bytes)
Contributor:

Tests are needed for repeated reads with whitespace enabled - particularly the case where our initial read from the stream doesn't have enough bytes to cover the requested data due to the embedded whitespace, as well as the case where the initial read contains only whitespace, or mostly whitespace with the actual data being insufficient to decode any bytes.

Member Author:

Adjusting generated line length to 3 to catch both "read spans lines" and "read is less than a line". Also adding two more cases with majority and entirety of initial read as whitespace.

(1, io.DEFAULT_BUFFER_SIZE),
(io.DEFAULT_BUFFER_SIZE + 99, io.DEFAULT_BUFFER_SIZE * 2)
))
def test_base64io_decode_readlines(hint_bytes, expected_bytes_read):
Contributor:

Tests needed around lines longer than the hint (or the default hint)

Member Author:

@pytest.mark.parametrize('hint_bytes, expected_bytes_read', (
(-1, 102400),
(0, 102400),
(1, io.DEFAULT_BUFFER_SIZE),
(io.DEFAULT_BUFFER_SIZE + 99, io.DEFAULT_BUFFER_SIZE * 2)
))

(-1, 102400) < default hint
(1, io.DEFAULT_BUFFER_SIZE) < "line" longer than hint ("line" length is always io.DEFAULT_BUFFER_SIZE)

a=sentinel.a,
b=sentinel.b
)
expected_length = int(os.path.getsize(str(source)) * expected_multiplier)
Contributor:

Is this used anywhere?

Member Author:

Yes. See line 331. ;)

@mattsb42-aws (Member Author):

updates to address comments with rebase on master

bdonlan (Contributor) commented Nov 7, 2017:

I'd suggest avoiding rebasing as long as GitHub indicates the branch can be automatically merged - now I have to re-review all the changes because I can't see which ones are new :/


Because up to two bytes of data must be buffered to ensure correct base64 encoding
of all data written, this object **must** be closed after you are done writing to
avoid data loss. If used as a context manager, we take care of that for you.
Contributor:

Minor: Up to you, but I might clarify that closing this object does not imply/require closing the underlying stream unless close_wrapped_on_close is set.

I might also note that this is intentionally not reusable and why.

Member Author:

Added.

:raises ValueError: if called on Base64IO object outside of a context manager
"""
if self.closed:
raise ValueError('I/O operation on closed file.')
Contributor:

Minor: Maybe include a handle or identifier of the object?

Also minor: Isn't this carefully described as a generic stream most other places?

Member Author:

That's the standard error type/message for any operation on a closed file-like.

https://docs.python.org/2/library/io.html#io.IOBase.close

>>> a = io.BytesIO()
>>> a.read()
b''
>>> a.close()
>>> a.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
>>> 
>>> b = open('tox.ini')
>>> b.tell()
0
>>> b.close()
>>> b.tell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.

raise ValueError('I/O operation on closed file.')

if not self.writable():
raise IOError('Stream is not writable')
Contributor:

Minor identifier thing

avoid data loss. If used as a context manager, we take care of that for you.

:param bytes b: Bytes to write to wrapped stream
:raises ValueError: if called on closed Base64IO object
Contributor:

Is the IOError intentionally omitted from the raises clause?

Member Author:

Nope; looks like I forgot to include it. Added. Also removed the old ValueError reference left over from when we were enforcing use inside a context manager.



class Base64IO(io.IOBase):
"""Wraps a stream, base64-decoding read results before returning them.
Contributor:

It also base64-encodes bytes before writing them?

Member Author:

Yup. Added. I must have written that really early on.


# Read encoded bytes from wrapped stream.
data = self.__wrapped.read(_bytes_to_read)
if any([six.b(char) in data for char in string.whitespace]):
Contributor:

Minor: maybe note this is clearing whitespace out of base64 streams that have been formatted e.g. with line breaks?

Member Author:

Added

:type limit: int
:rtype: bytes
"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)
Contributor:

Why -1 and not None?

def readline(self, limit=-1):
# type: (int) -> bytes
"""Read and return one line from the stream.
If limit is specified, at most limit bytes will be read.
Contributor:

+"Otherwise, io.DEFAULT_BUFFER_SIZE"? Or is that a well known convention, no need to redundantly document?

Member Author:

No, that's a good point. The normal expectation is to read until EOL, but as with the encryption SDK streaming classes, we don't have any expectation that the source actually contains an EOL character. Adding a note to that docstring.

"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)

def readlines(self, hint=-1):
Contributor:

Same question about -1 vs. None?

Member Author:

Similar answer: it's the canonical default value:

https://docs.python.org/2/library/io.html#io.IOBase.readlines

Contributor:

Gotcha.

@@ -0,0 +1,339 @@
# Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Contributor:

Thoughts on testing reuse of the base64 encoder context manager?

Member Author:

Good idea. Adding.

@mattsb42-aws (Member Author):

Updated with changes to address comments. Because of how GitHub handles rebasing in PRs, I'm going to hold off on a rebase until this PR is approved to ease load on reviewers.

…ngth to encrypt operation - this is required for caching to work when reading from Base64IO
@mattsb42-aws mattsb42-aws merged commit a448bfc into aws:master Nov 13, 2017
@mattsb42-aws mattsb42-aws deleted the dev-29 branch November 13, 2017 18:51