adding support for input encoding and output decoding #50
Conversation
.. warning::

    Because up to two bytes of data must be buffered to ensure correct base64 encoding
This seems a bit restrictive. Relying on `close` being called isn't enough?
The problem is that if `close` is not called, then some data (up to two bytes of raw data) might not be written through to the underlying object. Unlike most other effects of `close` methods, this is not something we can ever reasonably rely on anything else to clean up, so I wanted some way to make it as hard as possible for someone to shoot themselves in the foot with this. Because we control exactly what happens on enter and exit when this is used as a context manager, if we limit it to that use then we can guarantee that `close` is always called.
This limits what contexts this can be used in, however. Normally I'd expect bad things to happen if I forget to close a stream, in general.
After offline discussion with other Python devs, I'm inclined to agree. I'm replicating the warning from the `write` docstring into the class docstring to try and make it more visible to the user.
.. note::

    This does **not** close the wrapped stream.
Is this normal in python?
Normal behavior is usually not to close objects you don't own unless told to. That being said, I can see cases where it would be useful to be able to request that the wrapped stream be closed when the wrapping stream closes. Adding this as an optional instantiation parameter.
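A sketch of how such an opt-in parameter might look. The class name is hypothetical; the parameter name follows the `close_wrapped_on_close` wording used later in this review:

```python
import io


class WrapperSketch(io.IOBase):
    """Sketch: only close the wrapped stream when the caller opted in."""

    def __init__(self, wrapped, close_wrapped_on_close=False):
        super(WrapperSketch, self).__init__()
        self._wrapped = wrapped
        self._close_wrapped = close_wrapped_on_close

    def close(self):
        # By default the wrapped stream is left open for its owner.
        if self._close_wrapped:
            self._wrapped.close()
        super(WrapperSketch, self).close()
```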
Note that I'm not saying that it should; I'm just not sure what is pythonic :)
raise ValueError('I/O operation on closed file.')

# Load any stashed bytes and clear the buffer
_b = self.__write_buffer + b
We use `_b` to mean "buffer" here...
_b = None
if b is not None:
    # Calculate number of encoded bytes that must be read to get b raw bytes.
    _b = int((b - len(self.__read_buffer)) * 4 / 3)
...but here `_b` means "number of bytes". Use a more descriptive variable name, or at least be consistent with how you use single-letter variables.
Also, can this go negative if `__read_buffer` has sufficient data to cover the requested size?
Fair point. `b` on input is used for both because of convention, but we can use more descriptive names internally.
The initial value of `_b` can be negative, yes. However, when the modifier is applied to round reads up to multiples of 4, it brings it back up to 0, because `__read_buffer` will never be longer than 4.
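The adjustment being discussed can be checked with a small sketch. The function name is hypothetical; the arithmetic mirrors the diff above:

```python
def encoded_bytes_to_read(requested, read_buffer_len):
    """Sketch of the diff's sizing logic: how many encoded bytes to read
    to yield ``requested`` decoded bytes, given bytes already buffered."""
    target = int((requested - read_buffer_len) * 4 / 3)
    # Round up to the next multiple of 4 so whole base64 quanta are read;
    # this also pulls a negative intermediate value back up to 0.
    target += 4 - (target % 4)
    return target
```

For example, requesting 1 byte with 4 bytes already buffered gives an intermediate value of -4, which the rounding step brings back to 0, so nothing extra is read from the wrapped stream.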
_b = int((b - len(self.__read_buffer)) * 4 / 3)
_b += (4 - _b % 4)

_LOGGER.debug('%s bytes requested: adjusted to %s bytes', b, _b)
These log messages are excessively verbose and will clutter debug logs.
Fair enough; reducing to one ("x requested, reading y").
Generally speaking, I would not recommend leaving in debug logs at the level of individual read calls in an IO layer. The overhead could be quite high since we're calling them in a loop and don't necessarily know what the buffer size being transferred might be.
Fair point. I'm not overly concerned with keeping it in there. Pulling it out.
"""
if should_base64:
    return Base64IO(stream)
return ObjectProxy(stream)
Why not return `stream`?
It was for consistency, so that it would always return a thing of the same type. However, after switching to basing `Base64IO` on `IOBase`, this will be the case anyway, so switching to just returning the stream if not using base64.
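The simplified factory might look like this sketch. The `Base64IO` stub here is a stand-in for the PR's class, included only to show the type relationship that makes the proxy unnecessary:

```python
import io


class Base64IO(io.IOBase):
    """Hypothetical stand-in for the PR's wrapper class."""

    def __init__(self, wrapped):
        super(Base64IO, self).__init__()
        self.wrapped = wrapped


def maybe_wrap_base64(stream, should_base64):
    # Since Base64IO subclasses io.IOBase, both branches now return an
    # io.IOBase instance, so no transparent proxy is needed.
    if should_base64:
        return Base64IO(stream)
    return stream
```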
# Because we can actually know size for files and Base64IO does not support seeking,
# set the source length manually for files. This enables data key caching when
# Base64-decoding a source file.
_stream_args['source_length'] = os.path.getsize(source)
When the input is to be decoded from b64, we should be using the decoded size (i.e. `size * (64/256)`), not the encoded size. This provides a tighter bound on the file size.
Wouldn't that be `size * (3 / 4)`?
Er, yes. :)
Added.
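Since every 4 encoded characters decode to at most 3 raw bytes, the tighter bound discussed above can be sketched as (function name hypothetical):

```python
import base64


def decoded_size_upper_bound(encoded_size):
    """Every 4 base64 characters decode to at most 3 raw bytes, so the
    decoded payload is bounded by encoded_size * 3 / 4."""
    return int(encoded_size * 3 / 4)
```

For a 100-byte payload, the encoded form is 136 bytes, and `136 * 3 / 4 = 102` is a valid (and tight) upper bound on the decoded size.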
with open(str(b64_ciphertext), 'rb') as b64_ct, open(str(ciphertext), 'wb') as ct:
    raw_ct = base64.b64decode(b64_ct.read())
    print('raw_ct bytes:', len(raw_ct))
Remove debugging prints before submitting.
Whoops. This is exactly why #22
@pytest.mark.skipif(not _should_run_tests(), reason='Integration tests disabled. See test/integration/README.rst')
def test_file_to_file_base64_decode_only(tmpdir):
I'd suggest making these some kind of parameterized test - covering all three interesting cases (encode, decode, encode/decode) for both encrypt and decrypt.
Good idea. Condensing and parameterizing.
plaintext_wrapped.write(plaintext_source)

assert plaintext_stream.getvalue() == plaintext_b64
- Test also reading/writing 1, 2, 3, and 4 bytes at a time.
- Test encoding and decoding binary strings with lengths 0, 1, 2, 3 modulo 4.
- Test behavior when the input has embedded whitespace (whatever that behavior should be).
Additional test cases added. For whitespace, I think the route that GNU `base64` takes is preferable, so I added the ability to optionally ignore whitespace in the wrapped stream.
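The ignore-whitespace behavior might be sketched like this; the function name is hypothetical and the PR's actual implementation may differ, but the key idea is to keep reading until enough non-whitespace characters have been collected:

```python
import base64
import io
import string


def read_ignoring_whitespace(stream, n_encoded):
    """Sketch of GNU-base64-style reading: collect n_encoded non-whitespace
    characters from the stream, discarding embedded newlines and spaces."""
    whitespace = string.whitespace.encode('utf-8')
    collected = b''
    while len(collected) < n_encoded:
        chunk = stream.read(n_encoded - len(collected))
        if not chunk:
            break  # EOF before enough data was found
        collected += bytes(c for c in chunk if c not in whitespace)
    return collected
```

This handles the case where an initial read comes up short because some of the bytes it returned were whitespace: the loop simply reads again for the remainder.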
Updated with associated changes as discussed.
update with
"""
try:
    return getattr(self.__wrapped, method_name)()
except AttributeError:
This will suppress an AttributeError thrown from the called method itself.
Good point. Moving call outside of try block.
:type limit: int
:rtype: bytes
"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)
Won't this limit the line length to `DEFAULT_BUFFER_SIZE` when the limit is unspecified? Shouldn't it load an arbitrarily long line instead?
Also, couldn't this load more than one line in a single call?
The canonical behavior is to read until a newline character is found. However, as with the encryption SDK, 1) we have no reason to believe a newline will necessarily be present in the source, and 2) a "line" is not necessarily useful on its own.
So yes, this `readline` does not, strictly speaking, return an actual "line". However, the `readline` method is commonly used to iterate over the contents of file-like objects and is the expected iteration step when using the file-like as an iterator.
To avoid users inadvertently reading the entire source file into memory, we use `io.DEFAULT_BUFFER_SIZE` as an arbitrary length for the "line" that we return.
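That reasoning boils down to a `readline` along these lines; this sketch takes any reader with a `read` method rather than being a method on the PR's class:

```python
import io


def readline_sketch(reader, limit=-1):
    """Sketch of the PR's readline: since the wrapped data need not contain
    newlines, return a bounded chunk instead of scanning for a newline."""
    return reader.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)
```

An unspecified or non-positive limit yields at most `io.DEFAULT_BUFFER_SIZE` bytes, which keeps iteration over a huge newline-free source from loading it all into memory at once.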
Why `-1` and not `None`?
Because that is the canonical default value for readline:
https://docs.python.org/2/library/io.html#io.IOBase.readline
self.__write_buffer = b''

# If an even base64 chunk or finalizing the stream, write through.
if len(_bytes_to_write) % 3 == 0 or self.__finalize:
It's simpler to remove the `__finalize` flag and just dump `__write_buffer` out in `close`.
# type: (STREAM_KWARGS, SOURCE, IO) -> None
def _encoder(stream, should_base64):
    # type: (IO, bool) -> Union[IO, Base64IO]
    """Wraps a stream in either a Base64IO transformer or a transparent proxy.
No transparent proxy is involved
Whoops. Missed updating that when I moved away from `ProxyObject`. Fixed.
test/unit/test_encoding.py

@pytest.mark.parametrize('source_bytes, read_bytes', TEST_CASES)
def test_base64io_decode(source_bytes, read_bytes):
These tests only check a single operation - they don't test that performing a repeated read or write at a strange buffer size works properly.
Expanding to cover more cases.
test/unit/test_encoding.py

with Base64IO(io.BytesIO(b64_plaintext_with_whitespace)) as decoder:
    decoder.read(read_bytes)

excinfo.match(r'Whitespace found in base64-encoded data. Whitespace must be ignored to read this stream.')
Is this ever desired behavior?
This is a fair point. There's not really anything that the caller could do except start over with the flag, and if you're wrapping a stream, the assumption of the rest of the structure is that the entire contents are encoded, so this is really just complicating the flow.
test/unit/test_encoding.py

plaintext_source, b64_plaintext_with_whitespace = build_b64_with_whitespace(source_bytes, 1)

with Base64IO(io.BytesIO(b64_plaintext_with_whitespace), ignore_whitespace=True) as decoder:
    test = decoder.read(read_bytes)
Tests are needed for repeated reads with whitespace enabled - particularly the case where our initial read from the stream doesn't have enough bytes to cover the requested data due to the embedded whitespace, as well as the case where the initial read contains only whitespace, or mostly whitespace with the actual data being insufficient to decode any bytes.
Adjusting generated line length to 3 to catch both "read spans lines" and "read is less than a line". Also adding two more cases with majority and entirety of initial read as whitespace.
(1, io.DEFAULT_BUFFER_SIZE),
(io.DEFAULT_BUFFER_SIZE + 99, io.DEFAULT_BUFFER_SIZE * 2)
))
def test_base64io_decode_readlines(hint_bytes, expected_bytes_read):
Tests needed around lines longer than the hint (or the default hint)
@pytest.mark.parametrize('hint_bytes, expected_bytes_read', (
    (-1, 102400),
    (0, 102400),
    (1, io.DEFAULT_BUFFER_SIZE),
    (io.DEFAULT_BUFFER_SIZE + 99, io.DEFAULT_BUFFER_SIZE * 2)
))

`(-1, 102400)` < default hint
`(1, io.DEFAULT_BUFFER_SIZE)` < "line" longer than hint ("line" length is always `io.DEFAULT_BUFFER_SIZE`)
a=sentinel.a,
b=sentinel.b
)
expected_length = int(os.path.getsize(str(source)) * expected_multiplier)
Is this used anywhere?
Yes. See line 331. ;)
updates to address comments with rebase on master
I'd suggest avoiding rebasing as long as GitHub indicates it can be automatically merged - now I have to re-review all the changes, as I can't see which ones are new :/
Because up to two bytes of data must be buffered to ensure correct base64 encoding
of all data written, this object **must** be closed after you are done writing to
avoid data loss. If used as a context manager, we take care of that for you.
Minor: Up to you, but I might clarify that closing this object does not imply/require closing the underlying stream unless `close_wrapped_on_close` is set.
I might also note that this is intentionally not reusable and why.
Added.
:raises ValueError: if called on Base64IO object outside of a context manager
"""
if self.closed:
    raise ValueError('I/O operation on closed file.')
Minor: Maybe include a handle or identifier of the object?
Also minor: Isn't this carefully described as a generic stream in most other places?
That's the standard error type/message for any operation on a closed file-like.
https://docs.python.org/2/library/io.html#io.IOBase.close
>>> a = io.BytesIO()
>>> a.read()
b''
>>> a.close()
>>> a.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
>>>
>>> b = open('tox.ini')
>>> b.tell()
0
>>> b.close()
>>> b.tell()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: I/O operation on closed file.
raise ValueError('I/O operation on closed file.')

if not self.writable():
    raise IOError('Stream is not writable')
Minor identifier thing
avoid data loss. If used as a context manager, we take care of that for you.

:param bytes b: Bytes to write to wrapped stream
:raises ValueError: if called on closed Base64IO object
Is the `IOError` intentionally omitted from the `raises` clause?
Nope; looks like I forgot to include it. Added. Also removed the old `ValueError` reference left over from when we were enforcing use inside a context manager.
class Base64IO(io.IOBase):
    """Wraps a stream, base64-decoding read results before returning them.
It also base64-encodes bytes before writing them?
Yup. Added. I must have written that really early on.
# Read encoded bytes from wrapped stream.
data = self.__wrapped.read(_bytes_to_read)
if any([six.b(char) in data for char in string.whitespace]):
Minor: maybe note this is clearing whitespace out of base64 streams that have been formatted e.g. with line breaks?
Added
:type limit: int
:rtype: bytes
"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)
Why `-1` and not `None`?
def readline(self, limit=-1):
    # type: (int) -> bytes
    """Read and return one line from the stream.

    If limit is specified, at most limit bytes will be read.
+"Otherwise, `io.DEFAULT_BUFFER_SIZE`"? Or is that a well-known convention, with no need to redundantly document?
No, that's a good point. The normal expectation is to read until EOL, but as with the encryption SDK streaming classes, we don't have any expectation that the source actually contains an EOL character. Adding a note to that docstring.
"""
return self.read(limit if limit > 0 else io.DEFAULT_BUFFER_SIZE)

def readlines(self, hint=-1):
Same question about `-1` vs. `None`?
Similar answer: it's the canonical default value:
https://docs.python.org/2/library/io.html#io.IOBase.readlines
Gotcha.
@@ -0,0 +1,339 @@
# Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Thoughts on testing reuse of the base64 encoder context manager?
Good idea. Adding.
Updated with changes to address comments. Because of how GitHub handles rebasing in PRs, I'm going to hold off on a rebase until this PR is approved to ease load on reviewers. |
…ngth to encrypt operation - this is required for caching to work when reading from Base64IO
…d removing _b whitelist from pylintrc
…nly reading from buffer
…o close wrapped stream on close
…e and fleshing it out to cover the full IOBase API https://docs.python.org/2/library/io.html#io.IOBase
…ct behavior if ignoring whitespace or not
…input and NOT encoding output
…lose rather than using __finalize
…d catching errors we want to pass through
…ontext manager exit
adding support for input encoding and output decoding #29