This repository has been archived by the owner on Jun 19, 2022. It is now read-only.

Broker retry retries sending failed events immediately #1100

Closed
liu-cong opened this issue May 20, 2020 · 9 comments · Fixed by #1218
Assignees
Labels
area/broker kind/feature-request New feature or request priority/2 Nice to have feature but doesn't block current release defined by release/* release/1 storypoint/5
Milestone

Comments

@liu-cong
Contributor

liu-cong commented May 20, 2020

Problem
When event delivery to a subscriber fails, the broker retry component nacks the Pub/Sub message and the failed event is redelivered immediately, with no backoff between attempts.

Persona:
Which persona is this feature for?

Exit Criteria
A failed message delivery is retried only after some static timeout.

Time Estimate (optional):
How many developer-days do you think this may take to resolve?

Additional context (optional)
Add any other context about the feature request here.

@liu-cong liu-cong added release/1 priority/1 Blocks current release defined by release/* label or blocks current milestone kind/feature-request New feature or request area/broker labels May 20, 2020
@grantr grantr added priority/2 Nice to have feature but doesn't block current release defined by release/* and removed priority/1 Blocks current release defined by release/* label or blocks current milestone labels May 20, 2020
@grantr grantr added this to the Backlog milestone May 26, 2020
@yolocs
Member

yolocs commented Jun 3, 2020

The Go pubsub client we're using doesn't provide a way to directly write the AckDeadline; instead it manages the AckDeadline for us. We could call the RPC directly to set the AckDeadline (e.g. the way it's done inside the client), but it's hard to tell how well that would work with the client we're using (which also sets the AckDeadline dynamically).

If we stick with the existing client, the only way to alter the AckDeadline seems to be 1) explicitly holding on to the messages and calling nack() after a period, or 2) implicitly holding on to the messages by not calling either ack() or nack(), so that messages eventually expire after MaxExtension. Both workarounds will create "congestion" for a pull client since we also set MaxOutstandingBytes and MaxOutstandingMessages. They could work well in the case where a subscriber consistently rejects events (e.g. the service is not available), but they could cause unfairness when the subscriber accepts certain events and not others: a batch of unacceptable events could block future acceptable events from being retried promptly.

For release 1, I think it's OK to proceed with option 1, as it gives us knobs for how fast we nack().
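
A minimal sketch of what option 1 could look like with the high-level client, assuming a hypothetical nackDelay knob and a placeholder deliver helper (both names are hypothetical):

```go
package retry

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

// Option 1 sketch: instead of nacking a failed delivery immediately, hold the
// message and nack it after a configurable delay. Note the delayed message
// still counts against MaxOutstandingMessages/MaxOutstandingBytes until the
// nack fires.
func receiveWithDelayedNack(ctx context.Context, sub *pubsub.Subscription, nackDelay time.Duration) error {
	return sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		if err := deliver(ctx, msg); err != nil {
			log.Printf("delivery failed, nacking in %v: %v", nackDelay, err)
			time.AfterFunc(nackDelay, msg.Nack)
			return
		}
		msg.Ack()
	})
}

// deliver stands in for the actual event delivery to the subscriber.
func deliver(ctx context.Context, msg *pubsub.Message) error { return nil }
```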

@grantr
Contributor

grantr commented Jun 3, 2020

Unfortunately calling nack() in the high-level client always sets the deadline to zero: https://github.com/googleapis/google-cloud-go/blob/df3d655c5e3fc7d1e62fabf96358c3a52a874f4c/pubsub/iterator.go#L343-L347. It would be a useful feature for the higher-level client to be able to nack with a timeout, or to enable the same sort of backoff as described in https://godoc.org/cloud.google.com/go/pubsub#hdr-Deadlines.

As we get more sophisticated in our needs and use of pubsub, it could make sense to use the lower-level client more. The docs point to an example of using the base client for slow message processing, but looking at the complexity of the pubsub client, I think our usage would end up more complex than the example.

I'm fine with option 1 for release 1. We don't have to hold on to the entire message, just the ids. The high-level client uses a map of ids and a ticker to periodically ack and nack messages, and we could do the same relatively cheaply. We could use a hierarchical timing wheel if we wanted to get sophisticated.
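
Roughly what that map-plus-ticker approach could look like; with the high-level client we'd retain the message's Nack callback alongside its ID, and every name below is hypothetical:

```go
package retry

import (
	"sync"
	"time"
)

// nackScheduler sketches the map-plus-ticker idea: failed messages are
// registered with a due time and a single ticker goroutine nacks whatever
// has come due.
type nackScheduler struct {
	mu      sync.Mutex
	pending map[string]pendingNack // keyed by Pub/Sub message ID
}

type pendingNack struct {
	due  time.Time
	nack func() // the message's Nack method
}

func newNackScheduler(tick time.Duration) *nackScheduler {
	s := &nackScheduler{pending: make(map[string]pendingNack)}
	go func() {
		t := time.NewTicker(tick)
		defer t.Stop()
		for now := range t.C {
			s.flush(now)
		}
	}()
	return s
}

// Schedule registers a failed message to be nacked after delay.
func (s *nackScheduler) Schedule(id string, nack func(), delay time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[id] = pendingNack{due: time.Now().Add(delay), nack: nack}
}

// flush nacks every pending message whose due time has passed.
func (s *nackScheduler) flush(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, p := range s.pending {
		if now.After(p.due) {
			p.nack()
			delete(s.pending, id)
		}
	}
}
```

In the Receive callback this would be sched.Schedule(msg.ID, msg.Nack, delay) in place of calling msg.Nack() directly.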

@nachocano
Member

I recall one of my first issues when I joined the project was doing some exponential backoff in the old pubsub channel... Some pieces might be useful... knative/eventing#775

@yolocs
Member

yolocs commented Jun 3, 2020

@grantr I found googleapis/google-cloud-go#1942, from which I realized Pub/Sub is adding RetryPolicy support when creating a subscription:
https://github.com/googleapis/googleapis/blob/master/google/pubsub/v1/pubsub.proto#L726-L748. It's not available in the Go client yet, so maybe soon we won't have to do anything explicit other than configure this.
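
Assuming the Go client eventually exposes this as a RetryPolicy field on SubscriptionConfig (an assumption based on the proto; it isn't available in the client yet), configuring it might look roughly like:

```go
package retry

import (
	"context"
	"time"

	"cloud.google.com/go/pubsub"
)

// Sketch only: RetryPolicy is not exposed by the Go client yet, so the field
// names here are assumed from the pubsub.proto definition linked above.
func createRetrySubscription(ctx context.Context, client *pubsub.Client, id string, topic *pubsub.Topic) (*pubsub.Subscription, error) {
	return client.CreateSubscription(ctx, id, pubsub.SubscriptionConfig{
		Topic: topic,
		// Pub/Sub applies exponential backoff between these bounds before
		// redelivering a nacked message.
		RetryPolicy: &pubsub.RetryPolicy{
			MinimumBackoff: 10 * time.Second,
			MaximumBackoff: 600 * time.Second,
		},
	})
}
```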

@nachocano thanks for sharing! I think I can "steal" this code and do option 1 for the short term.

@ian-mi
Contributor

ian-mi commented Jun 5, 2020

While slow acking would be an improvement, it seems less than ideal for retry state to be tracked by the client, for the following reasons:

  1. The state will not be shared when horizontally scaling the retry service. New instances will initially not apply any backoff to queued messages and backoff will effectively be reduced by the number of instances.
  2. Retry state will be reset when pods are recreated potentially causing a burst of retries.
  3. A potentially large number of messages will need to be retained in the slow-ack phase causing high memory usage.
  4. Ack timeout is limited to 10 minutes which may not be sufficiently long.
  5. Messages held in the slow ack phase will contribute towards the max concurrent unacked messages allowed by the pubsub client.

Long term we may want to consider a solution which moves retry state out of the client, such as:

  1. Using Cloud Tasks to deliver retries instead of pubsub. I haven't explored this idea in detail but I am aware that the Cloud Tasks API supports better controls over the scheduling of task execution.
  2. Using a Pub/Sub dead-letter topic and slow acking based on the message's current delivery_attempt count rather than internal state (see the sketch below).
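
A rough sketch of idea 2, assuming the Go client surfaces Pub/Sub's delivery attempt count on the message (a DeliveryAttempt field populated for dead-letter-enabled subscriptions); the constants and helper are illustrative:

```go
package retry

import (
	"time"

	"cloud.google.com/go/pubsub"
)

// Illustrative backoff bounds.
const (
	baseBackoff = 5 * time.Second
	maxBackoff  = 10 * time.Minute
)

// backoffFor derives how long to hold a failed message before nacking it,
// doubling the delay on every delivery attempt up to maxBackoff. The attempt
// count comes from Pub/Sub, so no retry state lives in the pod.
func backoffFor(msg *pubsub.Message) time.Duration {
	attempt := 1
	if msg.DeliveryAttempt != nil && *msg.DeliveryAttempt > 0 {
		attempt = *msg.DeliveryAttempt
	}
	d := baseBackoff << uint(attempt-1)
	if d <= 0 || d > maxBackoff { // guard against overflow for large attempt counts
		return maxBackoff
	}
	return d
}
```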

@yolocs
Member

yolocs commented Jun 8, 2020

@ian-mi (points 1,2) Not sharing state among pods is a major drawback. Point 3 is kinda "mitigated" by point 5 as we do set that limit. But we still have the problem from point 5. Point 4 seems irrelevant to this topic?

In the near future, hopefully the Go pubsub client will provide an option to set the retry policy (which the Pub/Sub API already provides; see my comment above) so that we don't have to manage it ourselves.

@yolocs yolocs self-assigned this Jun 8, 2020
@ian-mi
Contributor

ian-mi commented Jun 9, 2020

A pubsub client retry policy should not affect any of these issues as the state is still managed by the client.

I wouldn't say that point 5 mitigates point 3. Even with a limit on the number of unacked messages per topic, we could easily end up with thousands of retained messages per retry topic, which would leak memory and slow the entire retry pod. Additionally, the limit on unacked messages could cause a topic to be starved completely if it gets flooded with invalid messages at some point.

@yolocs
Member

yolocs commented Jun 9, 2020

@ian-mi I don't think it's a client retry policy. It appears to be part of the subscription API.

If we limit outstanding bytes and messages, in theory we can limit how much memory can be consumed (with a fixed number of subscribers). That kinda mitigates point 3. The issues you raise now are different problems, which we could run into anyway if we don't dead-letter messages.
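
For reference, the limits in question are the pull client's flow-control settings; a rough sketch with illustrative values:

```go
package retry

import (
	"time"

	"cloud.google.com/go/pubsub"
)

// configureFlowControl shows the knobs referred to above: the pull client's
// flow-control limits bound how many unacked (and therefore slow-acked)
// messages a pod holds at once. Values are illustrative.
func configureFlowControl(sub *pubsub.Subscription) {
	sub.ReceiveSettings.MaxOutstandingMessages = 1000           // at most 1000 unacked messages in flight
	sub.ReceiveSettings.MaxOutstandingBytes = 100 * 1024 * 1024 // ~100 MiB of unacked payload held in memory
	sub.ReceiveSettings.MaxExtension = 10 * time.Minute         // stop extending the ack deadline after 10 minutes
}
```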

@ian-mi
Contributor

ian-mi commented Jun 9, 2020

Ah, that's great! I took another look and you're right, it is part of the subscription API, although not officially documented yet. If it works the way I would expect, this should address most or all of the issues I've pointed out. This support was merged yesterday, so hopefully we can migrate to this API in the near future.
