
shouldDrainReadBuffer followed by readDisable(true) and readDisable(false) strange behaviour #12304

Closed
ppadevski opened this issue Jul 27, 2020 · 11 comments · Fixed by #13772

@ppadevski

The behaviour I'm about to describe requires quite a lot of setup to reproduce.

To begin with, I'm using v1.14.4.

My listeners use the following configuration.

listener.mutable_per_connection_buffer_limit_bytes()->set_value(16384);

The default value of 1MB just seemed way too big and opened up lots of opportunities for OOM, so I ended up using 16k.
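For readers unfamiliar with where that setter sits: per_connection_buffer_limit_bytes is a field on the Listener proto, mutated before the listener is handed to Envoy. A minimal sketch, assuming the v3 API namespace and include path (the original presumably targets the older API given the v1.14.4 vintage):

```cpp
#include "envoy/config/listener/v3/listener.pb.h" // include path is an assumption

// Cap per-connection read/write buffers at 16 KiB instead of the 1 MiB
// default, bounding worst-case memory held per connection.
void capConnectionBuffers(envoy::config::listener::v3::Listener& listener) {
  listener.mutable_per_connection_buffer_limit_bytes()->set_value(16384);
}
```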

Then I have a local SSL client in which data is sent in chunks, e.g. 400 bytes followed by 16500 bytes (e.g. HTTP headers and body).

What can be seen in the trace logs is the following:

2020-07-23T14:56:23.084Z trace [source/common/network/connection_impl.cc:534] [C187] socket event: 3
2020-07-23T14:56:23.084Z trace [source/common/network/connection_impl.cc:622] [C187] write ready
2020-07-23T14:56:23.084Z trace [source/common/network/connection_impl.cc:572] [C187] read ready
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 393
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 15991
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:167] [C187] ssl read 16384 bytes
2020-07-23T14:56:23.084Z trace [source/common/http/http1/codec_impl.cc:475] [C187] parsing 16384 bytes
...
2020-07-23T14:56:23.085Z debug [source/common/http/conn_manager_impl.cc:2344] [C187][S7433621899365225547] Read-disabling downstream stream due to filter callbacks.
2020-07-23T14:56:23.085Z trace [source/common/network/connection_impl.cc:355] [C187] readDisable: enabled=true disable=true state=0
...
2020-07-23T14:56:23.086Z debug [source/common/http/conn_manager_impl.cc:2368] [C187][S7433621899365225547] Read-enabling downstream stream due to filter callbacks.
2020-07-23T14:56:23.086Z trace [source/common/network/connection_impl.cc:355] [C187] readDisable: enabled=false disable=false state=0
2020-07-23T14:56:23.086Z trace [source/common/network/connection_impl.cc:534] [C187] socket event: 2
2020-07-23T14:56:23.086Z trace [source/common/network/connection_impl.cc:622] [C187] write ready

At that point Envoy stops reading from the downstream and either the client or the server times out. According to the client, the entire request has been sent; according to the server, the entire request was not received.

To me it seems like the problem is that whenever we have this situation:

2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 393
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:87] [C187] ssl read returns: 15991
2020-07-23T14:56:23.084Z trace [source/extensions/transport_sockets/tls/ssl_socket.cc:167] [C187] ssl read 16384 bytes

only then do we execute the following code:

Network::IoResult SslSocket::doRead(Buffer::Instance& read_buffer) {
  ...
  if (slices_to_commit > 0) {
    read_buffer.commit(slices, slices_to_commit);
    if (callbacks_->shouldDrainReadBuffer()) {
      callbacks_->setReadBufferReady();
      keep_reading = false;
    }
  }

and in my case I have shouldDrainReadBuffer() == true because

bool shouldDrainReadBuffer() override {
  return read_buffer_limit_ > 0 && read_buffer_.length() >= read_buffer_limit_; // 16k > 0 && 16k == 16k
}

and then setReadBufferReady does

void setReadBufferReady() override { file_event_->activate(Event::FileReadyType::Read); }

which seems to immediately get eaten by the first readDisable(true)

readDisable: enabled=true disable=true state=0

and when the second readDisable(false) happens

readDisable: enabled=false disable=false state=0

the read event is not activated.
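To make the lost wakeup concrete, here is a paraphrased sketch of the readDisable path in connection_impl.cc around that version; this is a reconstruction for illustration (disable counters and asserts omitted), not the verbatim source:

```cpp
// Paraphrased sketch of ConnectionImpl::readDisable in v1.14.x, not verbatim.
void ConnectionImpl::readDisable(bool disable) {
  if (disable) {
    // Re-arming the fd mask without Read also drops the artificial Read
    // activation queued by setReadBufferReady() above.
    file_event_->setEnabled(Event::FileReadyType::Write);
  } else {
    file_event_->setEnabled(Event::FileReadyType::Read | Event::FileReadyType::Write);
    // Resumption only replays a Read event when bytes are still sitting in
    // the connection's own read buffer. In the trace above the codec already
    // consumed those 16384 bytes and the leftover plaintext lives inside
    // OpenSSL, so this is false and no Read event ever fires again.
    if (read_buffer_.length() > 0) {
      file_event_->activate(Event::FileReadyType::Read);
    }
  }
}
```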

If readDisable(true/false) isn't triggered or the setReadBufferReady() code path is not executed, all is well.
I tried plain TCP and wasn't able to reproduce the issue (it has a similar shouldDrainReadBuffer check).
I wasn't able to reproduce the issue with an external connection either (maybe my network is slow).
The code snippets that I shared may be a few lines off from v1.14.4, as I have added a few ERR_clear_error() calls because I'm using OpenSSL, but looking at the behaviour I would not blame them or OpenSSL.
Unfortunately, I can't share either the client or the server code.

I can reliably reproduce the issue about once every 5 minutes, so if anyone has a patch in mind I'll be willing to try it out. In the meantime, I'll see if I can come up with something.

@mattklein123 (Member)

cc @antoniovicente @alyssawilk

@mattklein123 mattklein123 added the investigate Potential bug that needs verification label Jul 27, 2020
@antoniovicente (Contributor)

I think I see the bug in ConnectionImpl::readDisable(false):

the "&& read_buffer_.length() > 0" should be removed from the resumption condition.

@antoniovicente (Contributor)

But this should be an older bug. The issue is that with SSL we can't assume that resetting the fd mask by calling file_event_->setEnabled(Event::FileReadyType::Read | Event::FileReadyType::Write) will result in correct resumption in all cases: there may still be bytes to read in SSL's internal buffers.

@mattklein123 mattklein123 added area/connection bug help wanted Needs help! and removed investigate Potential bug that needs verification labels Jul 27, 2020
@antoniovicente (Contributor)

Commit 77cca6b is not in v1.14.4, so this should be the older bug.

https://github.com/envoyproxy/envoy/blob/v1.14.4/source/common/network/connection_impl.cc#L364

@mattklein123 mattklein123 added this to the 1.16.0 milestone Jul 27, 2020
@antoniovicente (Contributor)

/assign @antoniovicente

@antoniovicente (Contributor)

/backport

@repokitteh-read-only repokitteh-read-only bot added the backport/review Request to backport to stable releases label Jul 27, 2020
@antoniovicente (Contributor)

Thinking about this one some more, I think we could consider this a bug in SslSocket::doRead instead of ConnectionImpl::readDisable.

SslSocket::doRead does not check for additional bytes in SSL_read's internal buffers by calling SSL_pending and doing additional SSL_read calls after callbacks_->shouldDrainReadBuffer() returns true. The consequence is that the SSL connection is left in a state where it can generate additional bytes during a future SslSocket::doRead call even if the underlying fd's read buffer is fully drained. The other two publicly available implementations of TransportSocket (i.e. raw sockets and ALTS) have the property that doRead can't make progress without additional bytes read from the socket.

Fixing SslSocket::doRead should be straightforward. It is less clear whether readDisable(false) should activate(Read) in cases where the read buffer is empty, as a way to simplify the transport socket contract; I don't know what non-OSS, private transport socket implementations exist.
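A sketch of what the doRead-side fix described here could look like, dropped in at the commit point quoted earlier; this illustrates the SSL_pending idea, not the change that eventually merged, and rawSsl() stands in for however the socket exposes its SSL* handle (an assumption):

```cpp
// Sketch only: don't stop reading while OpenSSL still holds decrypted bytes,
// so resumption based purely on fd re-registration cannot strand them.
if (slices_to_commit > 0) {
  read_buffer.commit(slices, slices_to_commit);
  if (callbacks_->shouldDrainReadBuffer()) {
    callbacks_->setReadBufferReady();
    // SSL_pending() reports plaintext already decrypted and buffered inside
    // OpenSSL. If any remains, another SSL_read iteration is needed now; a
    // future fd Read event will not cover bytes already off the socket.
    if (SSL_pending(rawSsl()) == 0) { // rawSsl() is a placeholder accessor
      keep_reading = false;
    }
  }
}
```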

@atanchev

I cherry-picked commit 77cca6b into our 1.14.4 repo and it failed to fix the issue. But after removing this condition:

> I think I see the bug in ConnectionImpl::readDisable(false)
>
> the "&& read_buffer_.length() > 0" should be removed from the resumption condition.

as suggested by @antoniovicente, it seems to be working for now. I'll test this a bit more. Just wanted to add some details from our testing so far.

@antoniovicente (Contributor)

> I cherry-picked commit 77cca6b into our 1.14.4 repo and it failed to fix the issue. But after removing this condition:
>
> > I think I see the bug in ConnectionImpl::readDisable(false)
> > the "&& read_buffer_.length() > 0" should be removed from the resumption condition.
>
> as suggested by @antoniovicente, it seems to be working for now. I'll test this a bit more. Just wanted to add some details from our testing so far.

Thanks for confirming that more eagerly calling setReadBufferReady() helps.

I'm going to try to continue working on a fix, with a focus on making the SslSocket drain its internal buffers on read so that resumption based on fd re-registration works correctly.

@antoniovicente (Contributor)

Sent out a PR with tests that repro the timeout while processing the last SSL record, and changes to SslSocket to ensure that bytes in internal buffers are drained before returning from doRead.

Sorry for the delays, too many other things to pay attention to. Hopefully this fix will be merged in time for 1.16, and backported to earlier releases if appropriate.

@mattklein123 mattklein123 modified the milestones: 1.16.0, 1.17.0 Oct 7, 2020
lizan pushed a commit that referenced this issue Nov 16, 2020
…on requests and replay them when re-enabling read. (#13772) (#14017)

Fixes SslSocket read resumption after readDisable when processing the SSL record that contains the last bytes of the HTTP message

Risk Level: low
Testing: new unit and integration tests
Docs Changes: n/a
Release Notes: added
Platform Specific Features: n/a
Fixes #12304

Signed-off-by: Antonio Vicente <avd@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>
@cpakulski (Contributor)

Backported #13772 to releases 1.16, 1.15, 1.14 and 1.13. Removing the backport/review label.

@cpakulski cpakulski removed the backport/review Request to backport to stable releases label Dec 16, 2020