This repository has been archived by the owner on Jun 19, 2022. It is now read-only.

Broker retry retries sending failed events immediately #1100

Closed
liu-cong opened this issue May 20, 2020 · 9 comments · Fixed by #1218
Assignees
Labels
area/broker kind/feature-request New feature or request priority/2 Nice to have feature but doesn't block current release defined by release/* release/1 storypoint/5
Milestone

Comments

@liu-cong
Contributor

liu-cong commented May 20, 2020

Problem
When event delivery to a subscriber fails, the broker retry component nacks the Pub/Sub message and the failed event is redelivered immediately, with no backoff between attempts.

Persona:
Which persona is this feature for?

Exit Criteria
A failed message delivery is retried only after some static timeout.

Time Estimate (optional):
How many developer-days do you think this may take to resolve?

Additional context (optional)
Add any other context about the feature request here.

@liu-cong liu-cong added release/1 priority/1 Blocks current release defined by release/* label or blocks current milestone kind/feature-request New feature or request area/broker labels May 20, 2020
@grantr grantr added priority/2 Nice to have feature but doesn't block current release defined by release/* and removed priority/1 Blocks current release defined by release/* label or blocks current milestone labels May 20, 2020
@grantr grantr added this to the Backlog milestone May 26, 2020
@yolocs
Member

yolocs commented Jun 3, 2020

The Go pubsub client we're using doesn't provide a way to directly write the AckDeadline; instead it manages the AckDeadline for us. We could call the RPC directly to set the AckDeadline (e.g. the way it's done inside the client), but it's hard to tell how well that would work with the client we're using (which also sets the AckDeadline dynamically).

If we stick with the existing client, the only way to alter the AckDeadline seems to be 1) explicitly holding on to the messages and calling nack() after a period, or 2) implicitly holding on to the messages by not calling either ack() or nack(), so that messages eventually expire after MaxExtension. Both workarounds will create "congestion" for a pull client since we also set MaxOutstandingBytes and MaxOutstandingMessages. They could work well in the case where a subscriber consistently rejects events (e.g. the service is not available), but they could cause unfairness when the subscriber accepts certain events and not others: a batch of unacceptable events could block future acceptable events from being retried promptly.

For release 1, I think it's OK to proceed with option 1, as it gives us knobs for how fast we nack().
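
A minimal sketch of what option 1 could look like with the high-level client, assuming a hypothetical nackDelay knob and a placeholder deliver helper (both names are hypothetical):

```go
package retry

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

// Option 1 sketch: instead of nacking a failed delivery immediately, hold the
// message and nack it after a configurable delay. Note the delayed message
// still counts against MaxOutstandingMessages/MaxOutstandingBytes until the
// nack fires.
func receiveWithDelayedNack(ctx context.Context, sub *pubsub.Subscription, nackDelay time.Duration) error {
	return sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		if err := deliver(ctx, msg); err != nil {
			log.Printf("delivery failed, nacking in %v: %v", nackDelay, err)
			time.AfterFunc(nackDelay, msg.Nack)
			return
		}
		msg.Ack()
	})
}

// deliver stands in for the actual event delivery to the subscriber.
func deliver(ctx context.Context, msg *pubsub.Message) error { return nil }
```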

@grantr
Contributor

grantr commented Jun 3, 2020

Unfortunately calling nack() in the high-level client always sets the deadline to zero: https://github.com/googleapis/google-cloud-go/blob/df3d655c5e3fc7d1e62fabf96358c3a52a874f4c/pubsub/iterator.go#L343-L347. It would be a useful feature for the higher-level client to be able to nack with a timeout, or to enable the same sort of backoff as described in https://godoc.org/cloud.google.com/go/pubsub#hdr-Deadlines.

As we get more sophisticated in our needs and use of pubsub, it could make sense to use the lower-level client more. The docs point to an example of using the base client for slow message processing, but looking at the complexity of the pubsub client, I think our usage would end up more complex than the example.

I'm fine with option 1 for release 1. We don't have to hold on to the entire message, just the ids. The high-level client uses a map of ids and a ticker to periodically ack and nack messages, and we could do the same relatively cheaply. We could use a hierarchical timing wheel if we wanted to get sophisticated.
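
Roughly what that map-plus-ticker approach could look like; with the high-level client we'd retain the message's Nack callback alongside its ID, and every name below is hypothetical:

```go
package retry

import (
	"sync"
	"time"
)

// nackScheduler sketches the map-plus-ticker idea: failed messages are
// registered with a due time and a single ticker goroutine nacks whatever
// has come due.
type nackScheduler struct {
	mu      sync.Mutex
	pending map[string]pendingNack // keyed by Pub/Sub message ID
}

type pendingNack struct {
	due  time.Time
	nack func() // the message's Nack method
}

func newNackScheduler(tick time.Duration) *nackScheduler {
	s := &nackScheduler{pending: make(map[string]pendingNack)}
	go func() {
		t := time.NewTicker(tick)
		defer t.Stop()
		for now := range t.C {
			s.flush(now)
		}
	}()
	return s
}

// Schedule registers a failed message to be nacked after delay.
func (s *nackScheduler) Schedule(id string, nack func(), delay time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[id] = pendingNack{due: time.Now().Add(delay), nack: nack}
}

// flush nacks every pending message whose due time has passed.
func (s *nackScheduler) flush(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, p := range s.pending {
		if now.After(p.due) {
			p.nack()
			delete(s.pending, id)
		}
	}
}
```

In the Receive callback this would be sched.Schedule(msg.ID, msg.Nack, delay) in place of calling msg.Nack() directly.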

@nachocano
Member

I recall one of my first issues when I joined the project was doing some exponential backoff in the old pubsub channel... Some pieces might be useful... knative/eventing#775

@yolocs
Member

yolocs commented Jun 3, 2020

@grantr I found googleapis/google-cloud-go#1942, from which I realized Pub/Sub is adding RetryPolicy support when creating a subscription:
https://github.com/googleapis/googleapis/blob/master/google/pubsub/v1/pubsub.proto#L726-L748. It's not available in the Go client yet, so maybe soon we won't have to do anything explicit other than configure this.
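
Assuming the Go client eventually exposes this as a RetryPolicy field on SubscriptionConfig (an assumption based on the proto; it isn't available in the client yet), configuring it might look roughly like:

```go
package retry

import (
	"context"
	"time"

	"cloud.google.com/go/pubsub"
)

// Sketch only: RetryPolicy is not exposed by the Go client yet, so the field
// names here are assumed from the pubsub.proto definition linked above.
func createRetrySubscription(ctx context.Context, client *pubsub.Client, id string, topic *pubsub.Topic) (*pubsub.Subscription, error) {
	return client.CreateSubscription(ctx, id, pubsub.SubscriptionConfig{
		Topic: topic,
		// Pub/Sub applies exponential backoff between these bounds before
		// redelivering a nacked message.
		RetryPolicy: &pubsub.RetryPolicy{
			MinimumBackoff: 10 * time.Second,
			MaximumBackoff: 600 * time.Second,
		},
	})
}
```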

@nachocano thanks for sharing! I think I can "steal" this code and do option 1 for the short term.

@ian-mi
Contributor

ian-mi commented Jun 5, 2020

While slow acking would be an improvement, it seems less than ideal for retry state to be tracked by the client, for the following reasons:

  1. The state will not be shared when horizontally scaling the retry service. New instances will initially not apply any backoff to queued messages and backoff will effectively be reduced by the number of instances.
  2. Retry state will be reset when pods are recreated potentially causing a burst of retries.
  3. A potentially large number of messages will need to be retained in the slow-ack phase causing high memory usage.
  4. Ack timeout is limited to 10 minutes which may not be sufficiently long.
  5. Messages held in the slow ack phase will contribute towards the max concurrent unacked messages allowed by the pubsub client.

Long term we may want to consider a solution which moves retry state out of the client, such as:

  1. Using Cloud Tasks to deliver retries instead of pubsub. I haven't explored this idea in detail but I am aware that the Cloud Tasks API supports better controls over the scheduling of task execution.
  2. Using a Pub/Sub dead-letter topic and slow acking based on the message's current delivery_attempt count rather than internal state (see the sketch below).
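
A rough sketch of idea 2, assuming the Go client surfaces Pub/Sub's delivery attempt count on the message (a DeliveryAttempt field populated for dead-letter-enabled subscriptions); the constants and helper are illustrative:

```go
package retry

import (
	"time"

	"cloud.google.com/go/pubsub"
)

// Illustrative backoff bounds.
const (
	baseBackoff = 5 * time.Second
	maxBackoff  = 10 * time.Minute
)

// backoffFor derives how long to hold a failed message before nacking it,
// doubling the delay on every delivery attempt up to maxBackoff. The attempt
// count comes from Pub/Sub, so no retry state lives in the pod.
func backoffFor(msg *pubsub.Message) time.Duration {
	attempt := 1
	if msg.DeliveryAttempt != nil && *msg.DeliveryAttempt > 0 {
		attempt = *msg.DeliveryAttempt
	}
	d := baseBackoff << uint(attempt-1)
	if d <= 0 || d > maxBackoff { // guard against overflow for large attempt counts
		return maxBackoff
	}
	return d
}
```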

@yolocs
Member

yolocs commented Jun 8, 2020

@ian-mi (points 1,2) Not sharing state among pods is a major drawback. Point 3 is kinda "mitigated" by point 5 as we do set that limit. But we still have the problem from point 5. Point 4 seems irrelevant to this topic?

In the near future, hopefully the Go pubsub client will provide an option to set the retry policy (which the Pub/Sub API already provides; see my comment above) so that we don't have to manage it ourselves.

@yolocs yolocs self-assigned this Jun 8, 2020
@ian-mi
Contributor

ian-mi commented Jun 9, 2020

A pubsub client retry policy should not affect any of these issues as the state is still managed by the client.

I wouldn't say that point 5 mitigates point 3. Even with a limit on the number of unacked messages per topic, we could easily end up with thousands of retained messages per retry topic, which would leak memory and slow the entire retry pod. Additionally, the limit on unacked messages could cause a topic to be starved completely if it gets flooded with invalid messages at some point.

@yolocs
Member

yolocs commented Jun 9, 2020

@ian-mi I don't think it's a client retry policy. It appears to be part of the subscription API.

If we limit outstanding bytes and messages, in theory we can limit how much memory can be consumed (with a fixed number of subscribers). That kinda mitigates point 3. The issues you raise now are different problems, which we could run into anyway if we don't dead-letter messages.
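
For reference, the limits in question are the pull client's flow-control settings; a rough sketch with illustrative values:

```go
package retry

import (
	"time"

	"cloud.google.com/go/pubsub"
)

// configureFlowControl shows the knobs referred to above: the pull client's
// flow-control limits bound how many unacked (and therefore slow-acked)
// messages a pod holds at once. Values are illustrative.
func configureFlowControl(sub *pubsub.Subscription) {
	sub.ReceiveSettings.MaxOutstandingMessages = 1000           // at most 1000 unacked messages in flight
	sub.ReceiveSettings.MaxOutstandingBytes = 100 * 1024 * 1024 // ~100 MiB of unacked payload held in memory
	sub.ReceiveSettings.MaxExtension = 10 * time.Minute         // stop extending the ack deadline after 10 minutes
}
```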

@ian-mi
Contributor

ian-mi commented Jun 9, 2020

Ah, that's great! I took another look and you're right, it is part of the subscription API, although not officially documented yet. If it works the way I would expect, this should address most or all of the issues I've pointed out. This support was merged yesterday, so hopefully we can migrate to this API in the near future.
