Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overload: tcp connection refusal overload action #13311

Merged
merged 12 commits into from
Oct 13, 2020

Conversation

akonradi
Copy link
Contributor

@akonradi akonradi commented Sep 29, 2020

Commit Message: Add an overload action to reject (accept and then close) TCP connections
Additional Description:
Add an action "envoy.overload_actions.reject_incoming_connections" that will reject
incoming connections when Envoy is overloaded. It will randomly choose whether to
reject connections based on the overload factor, with expected behavior at the ends
of the range (reject no connections at 0, reject all at saturation).
Risk Level: low
Testing: ran unit tests
Docs Changes: documented new overload action
Release Notes: documented new overload action

This will allow creating an overload action that rejects (accepts the
connection and then closes it) some fraction of connections in response
to overload conditions.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Contributor Author

I still need to add documentation, but I wanted to get the code parts in shape first.

@akonradi akonradi marked this pull request as draft September 29, 2020 15:54
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall - thanks for picking this up!

@@ -148,6 +148,14 @@ void ConnectionHandlerImpl::enableListeners() {
}
}

void ConnectionHandlerImpl::setListenerRejectFraction(float reject_fraction) {
disable_listeners_ = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems strange to change this boolean without resuming listening.
Also if new listeners are added won't they be added listening but without a rejection fraction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this was a bad copy-paste on my part. Fixed in the latest commit, and added tests.

return true;
}

return random_generator.random() <
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe comment why this is overflow-safe?

* of the level of saturation.
*/
bool isRandomizedActive(Random::RandomGenerator& random_generator) const {
if (isInactive()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just for performance, or do the edges not work well with the casts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was stolen from #13120, but I assume it's for performance - don't grab a random number if it's not necessary. That's why this PR is still a draft.

Signed-off-by: Alex Konradi <akonradi@google.com>
Address feedback around adding new listeners when the reject fraction is
already set, and add tests.

Signed-off-by: Alex Konradi <akonradi@google.com>
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the structure looks good overall, so want to ping when you have docs and relnotes and such? It wouldn't hurt to have an integration test too. Even if we can only usefully test the case where we're rejecting 100%, we could test accepting, config reload, and rejecting.

@@ -59,7 +59,7 @@ void TcpListenerImpl::onSocketEvent(short flags) {
break;
}

if (rejectCxOverGlobalLimit()) {
if (rejectCxOverGlobalLimit() || reject_fraction_.isRandomizedActive(random_)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once #13305 lands (which I think will before this) I thin it'd be useful to tag why the connection was closed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can tag anything here since this happens before an Envoy connection object is created, so there's nothing to annotate.

@@ -59,7 +59,7 @@ void TcpListenerImpl::onSocketEvent(short flags) {
break;
}

if (rejectCxOverGlobalLimit()) {
if (rejectCxOverGlobalLimit() || reject_fraction_.isRandomizedActive(random_)) {
// The global connection limit has been reached.
io_handle->close();
cb_.onReject();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this increments stats_.downstream_global_cx_overflow_.inc()
Think we should have a separate limit for overflow?
If not maybe at least update the docs to note the stat covers both overflow cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a separate stat.

…-refusal

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi akonradi marked this pull request as ready for review October 9, 2020 21:12
@akonradi
Copy link
Contributor Author

akonradi commented Oct 9, 2020

@alyssawilk I've updated the release notes and documentation, PTAL

Signed-off-by: Alex Konradi <akonradi@google.com>
alyssawilk
alyssawilk previously approved these changes Oct 12, 2020
@alyssawilk
Copy link
Contributor

Looks great thanks. Throwing over to Matt for !googler review but in the meantime if you're up for updating the TODOs in PR description that'd be helpful

@akonradi
Copy link
Contributor Author

Thanks, updated TODOs (that I had completely forgotten about)

@mattklein123
Copy link
Member

Windows CI failure looks real. Can you take a look?

/wait

Relax the ordering constraints for events that occur asynchronously on
different ends of the connection while maintaining requirements for
relative ordering on each end.

Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Contributor Author

Relaxed some of the ordering requirements in the test for things that shouldn't necessarily be synchronized. Looks like the tests are passing now

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! LGTM with small comments. Thank you.

/wait

@@ -118,5 +124,9 @@ void TcpListenerImpl::enable() { file_event_->setEnabled(Event::FileReadyType::R

void TcpListenerImpl::disable() { file_event_->setEnabled(0); }

void TcpListenerImpl::setRejectFraction(const float reject_fraction) {
reject_fraction_ = reject_fraction;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you ASSERT 0 <= reject_fraction <= 1? I think the Bernoulli function correctly handles out of bounds but might as well try to check for obvious config/code bugs. (Bonus points would be to introduce a struct wrapper for the float which would check this at point of assignment, etc. but up to you.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

stats_.downstream_global_cx_overflow_.inc();
break;
case RejectCause::OverloadAction:
stats_.downstream_cx_overload_reject_.inc();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify this stat increment in one of your tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a unit test.

Signed-off-by: Alex Konradi <akonradi@google.com>
…-refusal

Signed-off-by: Alex Konradi <akonradi@google.com>
@mattklein123
Copy link
Member

Needs main merge.

/wait

…-refusal

Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Contributor Author

Yep, I managed to add new tests in the same place both in this PR and in 13515 🤦

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 5fdf629 into envoyproxy:master Oct 13, 2020
@akonradi akonradi deleted the scaled-connection-refusal branch October 13, 2020 19:57
mpuncel added a commit to mpuncel/envoy that referenced this pull request Oct 14, 2020
* master: (22 commits)
  http: using CONNECT_ERROR for HTTP/2 (envoyproxy#13519)
  listener: respect address.pipe.mode (it didn't work) (envoyproxy#13493)
  examples: Fix more deprecations/warnings in configs (envoyproxy#13529)
  overload: tcp connection refusal overload action (envoyproxy#13311)
  tcp: towards pluggable upstreams (envoyproxy#13331)
  conn_pool: fixing comments (envoyproxy#13520)
  Prevent SEGFAULT when disabling listener (envoyproxy#13515)
  Convert overload manager config literals to YAML (envoyproxy#13518)
  Fix runtime feature variable name (envoyproxy#13533)
  dependencies: refactor repository location schema utils, cleanups. (envoyproxy#13452)
  router:  fix an invalid ASSERT when encoding metadata frames in the router. (envoyproxy#13511)
  http2: Proactively disconnect connections flooded when resetting stream (envoyproxy#13482)
  ci use azp to sync filter example (envoyproxy#13501)
  mongo_proxy: support configurable command list for metrics (envoyproxy#13494)
  http local rate limit: note token bucket is shared (envoyproxy#13525)
  wasm/extensions: Wasm extension policy. (envoyproxy#13526)
  http: removing envoy.reloadable_features.http1_flood_protection (envoyproxy#13508)
  build: update ppc64le CI build status shield (envoyproxy#13521)
  dependencies: enforce dependency shepherd sign-off via RepoKitteh. (envoyproxy#13522)
  Add no_traffic_healthy_interval (envoyproxy#13336)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants