Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Stop fake_upstream methods from accidentally succeeding #4232

Merged

Conversation

mkbehr
Copy link
Contributor

@mkbehr mkbehr commented Aug 22, 2018

Description: FakeConnectionBase::waitForDisconnect and FakeHttpConnection::waitForNewStream were returning assertion successes when they timed out, because an AssertionResult constructed with a (non-empty) string counts as a success. Fix that.
Risk Level: Low (test only)
Testing: bazel test //test/...
Docs Changes: n/a
Release Notes: n/a

Constructing an AssertionResult out of a (non-empty) string gives a successful
assertion, which wasn't intended.

Signed-off-by: Michael Behr <mkbehr@google.com>
@mkbehr
Copy link
Contributor Author

mkbehr commented Aug 23, 2018

So, a few of the mac test cases do actually hit the 10-second timeout in that CI run. They're all RouterDownstreamDisconnectBeforeResponseComplete, and they all time out on waitForDisconnect. That test succeeds on Http2UpstreamIntegrationTest and Http2IntegrationTest, but it fails on SslIntegrationTest, UdsUpstreamIntegrationTest, and IntegrationTest.

In the tests where this succeeds, the methods take <100ms, so I wouldn't be surprised if this is spinning forever and the test framework bug stopped us from noticing.

How should we handle this?

@zuercher
Copy link
Member

Hopefully these are the reason why the mac tests seem more flakey than the linux ones. (We currently re-run mac tests one when they fail to paper over this.)

I think we should try to fix these tests. If you don't have a mac dev environment handy, I can take a look on Monday.

@zuercher
Copy link
Member

zuercher commented Aug 28, 2018

Empirically, this patch makes all three of those tests pass:

diff --git a/test/integration/http_integration.cc b/test/integration/http_integration.cc
index 1761bfd00..0d828abe6 100644
--- a/test/integration/http_integration.cc
+++ b/test/integration/http_integration.cc
@@ -592,6 +592,7 @@ void HttpIntegrationTest::testRouterDownstreamDisconnectBeforeResponseComplete(
   codec_client_->close();
 
   if (upstreamProtocol() == FakeHttpConnection::Type::HTTP1) {
+    ASSERT_TRUE(fake_upstream_connection_->close());
     ASSERT_TRUE(fake_upstream_connection_->waitForDisconnect());
   } else {
     ASSERT_TRUE(upstream_request_->waitForReset());

It seems to me that nothing in the HTTP1 stack closes the upstream connection when the downstream codec_client_ is closed in HttpIntegrationTest::testRouterDownstreamDisconnectBeforeResponseComplete. Or at least I can't see where that's supposed to happen. However, something must be closing it or the same error would occur under Linux.

@alyssawilk do you have any ideas?

@alyssawilk
Copy link
Contributor

Ok, so in test we're sending half a request, making sure we establish a connection upstream, and killing the client connection, yeah?

Logically that should result in killing the upstream connection too - if it's halfway through streaming a HTTP/1.1 request it can't be used for a subsequent request. If the router filter doesn't handle that by closing the connection that could be an actual Envoy bug.

It's a bit tricky trying to visually find what happens on unexpected downstream disconnect. I assume we get an router onDestroy with an existing upstream_request_ which should result in resetStream which for HTTP/1.1 should kill the connection, which is what this test appears to be trying to test.

Can either of you pastebin the -l trace logs?

@zuercher
Copy link
Member

Mac run, fails: https://pastebin.com/raw/yuyvrju1
Linux run, passes: https://pastebin.com/raw/UFXEiByJ

@alyssawilk
Copy link
Contributor

I'll dig through traces but I have a theory, which is it's one of those "don't do upstream things without letting the client dispatcher progress" bugs. I bet we're just hanging on the upstream "waitfordisconnect" without the FIN getting shipped somehow.

linux side I do see the router resetting the pool request
[2018-08-28 16:28:04.381][23][debug][router] source/common/router/router.cc:916] [C1][S3959132024349198890] resetting pool request
right after it detects the early close
[2018-08-28 16:28:04.381][23][debug][connection] source/common/network/connection_impl.cc:410] [C1] remote early close

Mac side not so much. It looks like the client socket is getting closed
[2018-08-28 16:24:21.598][3734902][debug][connection] source/common/network/connection_impl.cc:133] [C0] closing socket: 1
but if it's a flushclose maybe the fin is behind some data? I'd try experimentally switching the waitForDisconnect to be like waitForNewStream or waitForHeadersComplete (allow client dispatcher action) and see if that helps

@alyssawilk
Copy link
Contributor

alternately you could add a std::cerr on the actual socket close to see if the client is actually closing it before we go and block on the upstream waitForDisconnect. That'd be a fast way to sanity check my theory.

@zuercher
Copy link
Member

I guess the difference is that the linux version sees "[C1] remote early close" which seems to trigger closing the upstream connection where the Mac version doesn't.

@alyssawilk
Copy link
Contributor

Yeah, but looking more closely at the test I'm a bit more confused why we don't get that disconnect. We wrote the headers but verified they came through so it's not like the FIN might be behind them. I could maybe see the SslIntegrationTest failing if there's some ssl close time delay but that doesn't explain the IntegrationTest failures. But basically it looks like either the FIN isn't being written immediately in the stack of close() or Envoy isn't reading it for some reason, and given I've seen a bunch of things fail with the former failure pattern I'm still betting on it :-P

@zuercher
Copy link
Member

So adding waitForNewStream made the test pass: after ten seconds of waiting for a new stream and failing, it called waitForDisconnect and after 5 more seconds, we got a timeout on the upstream connection that closed it. Later C1 gets closed.

I also hacked waitForDisconnect to take a Dispatcher and that doesn't fix the test either, still no close on C1.

I grabbed this during the test. 57956 is the listener port.

Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)    
tcp4       0      0  127.0.0.1.57956        127.0.0.1.57957        CLOSE_WAIT 
tcp4       0      0  127.0.0.1.57957        127.0.0.1.57956        FIN_WAIT_2 
tcp4       0      0  127.0.0.1.57956        *.*                    LISTEN     

@zuercher
Copy link
Member

(I should add that I removed the waitForNewStream when I added the hacked waitForDisconnect and the netstat dump is from the latter test.)

@zuercher
Copy link
Member

So, I've hacked up ConnectionImpl and FileEventImpl. Seems that the listener side of the downstream connection reads the 62 byte request, and then readDisabled is invoked. The upstream handles the request and writes its partial body and the client side of the downstream is closed. Because I've moved filtering of events into the libevent callback I can see that when that close happens a read event is sent on the listener side of the connection where on Linux we get a close event.

@zuercher
Copy link
Member

Confirmed that if that read event is allowed through, we attempt to read, get 0 bytes and detect the connection is closed. I suppose this is some strange interaction between OSX and libevent?

@zuercher
Copy link
Member

So, after a lot of digging, I've come to the conclusion that libevent doesn't support EV_FEATURE_EARLY_CLOSE on OSX. If I modify dispatcher_impl.cc to require the feature it fails to initialize libevent (I've verified that it succeeds under Linux). From reading the definition of EV_CLOSED and the detect_early_close_-related comments in ConnectionImpl, it seems that's exactly what we're depending on in this test and it was just never working correctly and not detected before this change.

It's possible there's some way to gain support for that under OSX, but I'll have to some digging.

@zuercher
Copy link
Member

The EV_FEATURE_EARLY_CLOSE is epoll-only.

@alyssawilk
Copy link
Contributor

Ok, if EV_FEATURE_EARLY_CLOSE isn't supported on OSX we should just disable the HTTP/1 test for OSX noting that the downstream close will not be detected when envoy has paused reading during request handling, yeah?
#ifdef APPLE
if (downstream_protocol_ == Http::CodecClient::Type::HTTP1) { return; }
#endif
with explanatory comments?

@zuercher
Copy link
Member

zuercher commented Aug 29, 2018

@alyssawilk yes, that's the conclusion I came too as well. I'm going to write this up in an issue that we can refer to where it's disabled.

@zuercher
Copy link
Member

@mkbehr Let's disable HttpIntegrationTest::testRouterDownstreamDisconnectBeforeResponseComplete as described in #4232 (comment) and note #4294 in the comment. Sorry for the delay.

Signed-off-by: Michael Behr <mkbehr@google.com>
@zuercher
Copy link
Member

And have to add the disable block in SslIntegrationTest's RouterDownstreamDisconnectBeforeResponseComplete test case.

Signed-off-by: Michael Behr <mkbehr@google.com>
@@ -150,6 +150,13 @@ TEST_P(SslIntegrationTest, RouterUpstreamDisconnectBeforeResponseComplete) {
}

TEST_P(SslIntegrationTest, RouterDownstreamDisconnectBeforeRequestComplete) {
#ifdef __APPLE__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the request version of the test, which works. Conditional needs to be on the next one "RouterDownstreamDisconnectBeforeResponseComplete"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, right.

zuercher added a commit to turbinelabs/envoy that referenced this pull request Aug 30, 2018
Implements early close detection on OS X by inspecting the TCP
state using getsockopt.

*Risk Level*: low (conditional compilation)
*Testing*: integration tests pass (tested against envoyproxy#4232)
*Docs Changes*: n/a
*Release Notes*: n/a
*Fixes*: envoyproxy#4294

Signed-off-by: Stephan Zuercher <stephan@turbinelabs.io>
instead of before /request/ complete

Signed-off-by: Michael Behr <mkbehr@google.com>
Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@zuercher zuercher merged commit aa06142 into envoyproxy:master Sep 5, 2018
rshriram pushed a commit to rshriram/envoy that referenced this pull request Oct 30, 2018
Pulling the following changes from github.com/envoyproxy/envoy:

f936fc6 ssl: serialize accesses to SSL socket factory contexts (envoyproxy#4345)
e34dcd6 Fix crash in tcp_proxy (envoyproxy#4323)
ae6a252 router: fix matching when all domains have wildcards (envoyproxy#4326)
aa06142 test: Stop fake_upstream methods from accidentally succeeding (envoyproxy#4232)
5d73187 rbac: update the authenticated.user to a StringMatcher. (envoyproxy#4250)
c6bfc7d time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers (envoyproxy#4257)
752483e Fixing the fix (envoyproxy#4333)
83487f6 tls: update BoringSSL to ab36a84b (3497). (envoyproxy#4338)
7bc210e test: fixing interactions between waitFor and ignore_spurious_events (envoyproxy#4309)
69474b3 admin: order stats in clusters json admin (envoyproxy#4306)
2d155f9 ppc64le build (envoyproxy#4183)
07efc6d fix static initialization fiasco problem (envoyproxy#4314)
0b7e3b5 test: Remove declared but undefined class methods (envoyproxy#4297)
1485a13 lua: make sure resetting dynamic metadata wrapper when request info is marked dead
d243cd6 test: set to zero when start_time exceeds limit (envoyproxy#4328)
0a1e92a test: fix heap use-after-free in ~IntegrationTestServer. (envoyproxy#4319)
cddc732 CONTRIBUTING: Document 'kick-ci' trick. (envoyproxy#4335)
f13ef24 docs: remove reference to deprecated value field (envoyproxy#4322)
e947a27 router: minor doc fixes in stream idle timeout (envoyproxy#4329)
0c2e998 tcp-proxy: fixing a TCP proxy bug where we attempted to readDisable a closed connection (envoyproxy#4296)
00ffe44 utility: fix strftime overflow handling. (envoyproxy#4321)
af1183c Re-enable TcpProxySslIntegrationTest and make the tests pass again. (envoyproxy#4318)
3553461 fuzz: fix H2 codec fuzzer post envoyproxy#4262. (envoyproxy#4311)
42f6048 Proto string issue fix (envoyproxy#4320)
9c492a0 Support Envoy to fetch secrets using SDS service. (envoyproxy#4256)
a857219 ratelimit: revert `revert rate limit failure mode config` and add tests (envoyproxy#4303)
1d34172 dns: fix exception unsafe behavior in c-ares callbacks. (envoyproxy#4307)
1212423 alts: add gRPC TSI socket (envoyproxy#4153)
f0363ae fuzz: detect client-side resets in H2 codec fuzzer. (envoyproxy#4300)
01aa3f8 test: hopefully deflaking echo integration test (envoyproxy#4304)
1fc0f4b ratelimit: link legacy proto when message is being used (envoyproxy#4308)
aa4481e fix rare List::remove(&target) segfault (envoyproxy#4244)
89e0f23 headers: fixing fast fail of size-validation (envoyproxy#4269)
97eba59 build: bump googletest version. (envoyproxy#4293)
0057e22 fuzz: avoid false positives in HCM fuzzer. (envoyproxy#4262)
9d094e5 Revert ac0bd74 (envoyproxy#4295)
ddb28a4 Add validation context provider (envoyproxy#4264)
3b47cba added histogram latency information to Hystrix dashboard stream (envoyproxy#3986)
cf87d50 docs: update SNI FAQ. (envoyproxy#4285)
f952033 config: fix update empty stat for eds (envoyproxy#4276)
329e591 router: Add ability of custom headers to rely on per-request data (envoyproxy#4219)
68d20b4  thrift: refactor build files and imports (envoyproxy#4271)
5fa8192 access_log: log requested_server_name in tcp proxy (envoyproxy#4144)
fa45bb4 fuzz: libc++ clocks don't like nanos. (envoyproxy#4282)
53f8944 stats: add symbol table for future stat name encoding (envoyproxy#3927)
c987b42 test infra: Remove timeSource() from the ClusterManager api (envoyproxy#4247)
cd171d9 websocket: tunneling websockets (and upgrades in general) over H2 (envoyproxy#4188)
b9dc5d9 router: disallow :path/host rewriting in request_headers_to_add. (envoyproxy#4220)
0c91011 network: skip socket options and source address for UDS client connections (envoyproxy#4252)
da1857d build: fixing a downstream compile error by noting explicit fallthrough (envoyproxy#4265)
9857cfe fuzz: cleanup per-test environment after each fuzz case. (envoyproxy#4253)
52beb06 test: Wrap proto string in std::string before comparison (envoyproxy#4238)
f5e219e extensions/thrift_proxy: Add header matching to thrift router (envoyproxy#4239)
c9ce5d2 fuzz: track read_disable_count bidirectionally in codec_impl_fuzz_test. (envoyproxy#4260)
35103b3 fuzz: use nanoseconds for SystemTime in RequestInfo. (envoyproxy#4255)
ba6ba98 fuzz: make runtime root hermetic in server_fuzz_test. (envoyproxy#4258)
b0a9014 time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. (envoyproxy#4248)
8567460 access_log: support beginning of epoch in START_TIME. (envoyproxy#4254)
28d5f41 proto: unify envoy_proto_library/api_proto_library. (envoyproxy#4233)
f7d3cb6 http: fix allocation bug introduced in envoyproxy#4211. (envoyproxy#4245)

Fixes istio/istio#8310 (once pulled into istio/istio).

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants