Skip to content

Commit

Permalink
Fix Celery task retry delays
Browse files Browse the repository at this point in the history
Fixes #5508.

Several of our Celery tasks have `retry_backoff=3600` which activates
Celery's exponential backoff feature for retries (see <https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_backoff>).
The intention was that:

1. The first retry of a task would be scheduled for `3600` secs (one
   hour) after the failure of the initial attempt
2. Following the rules of exponential backoff the second retry would be
   scheduled for `2 * 3600` secs (two hours) after the failure of the
   first retry

When accounting for Celery's enabled-by-default `retry_jitter` setting
(<https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_jitter>)
the delays should actually be randomly chosen times between `0` and
`3600` and between `0` and `7200` secs.

These long delays were chosen to prevent a snowballing problem: if tasks
start failing because a service is overloaded you don't want to further
load the service by quickly retrying those tasks. So instead wait an
hour or two before retrying.

Long delays were also chosen to maximize the chances of the retries
succeeding. For example if a task is failing because a service (such as
Mailchimp) is down then retrying the task right away is likely to fail
again but after an hour or two the service is likely to be back up.

Unfortunately [all the retries seem to be being scheduled within ten
minutes, not hours](https://hypothes-is.slack.com/archives/C074BUPEG/p1687344232668929?thread_ts=1687325831.540209&cid=C074BUPEG).
This turns out to be because of Celery's `retry_backoff_max` setting
(<https://docs.celeryq.dev/en/stable/userguide/tasks.html#Task.retry_backoff_max>)
which by default puts a cap of ten minutes on `retry_backoff`'s delays.

Fix this by increasing the `retry_backoff_max` setting to match the
maximum intended delay of `7200` secs.
  • Loading branch information
seanh committed Jun 21, 2023
1 parent 22114e1 commit 528831a
Show file tree
Hide file tree
Showing 2 changed files with 21 additions and 3 deletions.
16 changes: 14 additions & 2 deletions lms/tasks/email_digests.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,13 @@
LOG = logging.getLogger(__name__)


@app.task(acks_late=True, autoretry_for=(Exception,), max_retries=2, retry_backoff=3600)
@app.task(
acks_late=True,
autoretry_for=(Exception,),
max_retries=2,
retry_backoff=3600,
retry_backoff_max=7200,
)
def send_instructor_email_digest_tasks(*, batch_size):
"""
Generate and send instructor email digests.
Expand Down Expand Up @@ -82,7 +88,13 @@ def send_instructor_email_digest_tasks(*, batch_size):
)


@app.task(acks_late=True, autoretry_for=(Exception,), max_retries=2, retry_backoff=3600)
@app.task(
acks_late=True,
autoretry_for=(Exception,),
max_retries=2,
retry_backoff=3600,
retry_backoff_max=7200,
)
def send_instructor_email_digests(
*, h_userids: List[str], updated_after: str, updated_before: str, **kwargs
) -> None:
Expand Down
8 changes: 7 additions & 1 deletion lms/tasks/mailchimp.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,13 @@
from lms.tasks.celery import app


@app.task(acks_late=True, autoretry_for=(Exception,), max_retries=2, retry_backoff=3600)
@app.task(
acks_late=True,
autoretry_for=(Exception,),
max_retries=2,
retry_backoff=3600,
retry_backoff_max=7200,
)
def send_template(*, sender, recipient, **kwargs) -> None:
"""Send an email using Mailchimp's send-template API."""

Expand Down

0 comments on commit 528831a

Please sign in to comment.