Limit number of retries #194

Closed
tremby opened this issue Sep 22, 2016 · 13 comments

Comments

tremby commented Sep 22, 2016

When a task fails (for example, with an unhandled exception) it is put in the failed jobs list but also retried. There doesn't seem to be any limit on the number of retries, which means a job that is doomed to fail (due to my bad programming) keeps failing all day. I get a message sent to Rollbar each time, and that adds up to a lot of messages and eats my quota.

Is there a way to set a retry limit?

@manuganji

Is there a workaround for this?

@DEKHTIARJonathan

Is there any update on this? I have the exact same issue...
@tremby @manuganji did you find a workaround?

tremby (Author) commented Apr 15, 2017

I honestly don't recall, but I don't think so. Sorry!

@manuganji

No, I haven't found one. I just fixed the scenario where my task was failing and made sure I covered all the scenarios I could predict. An ugly workaround is to wrap your whole task code in a try/except block, re-raise only known exceptions and swallow all others (see the sketch below). Of course, this only works if you have a limited number of task types.
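
A minimal sketch of that workaround for a plain django-q task function; RecoverableError and do_work are placeholders for your own exception types and task body:

import logging

logger = logging.getLogger(__name__)

class RecoverableError(Exception):
    """Placeholder for the exceptions you actually want retried."""

def do_work(*args, **kwargs):
    """Placeholder for the real task body."""
    ...

def my_task(*args, **kwargs):
    try:
        return do_work(*args, **kwargs)
    except RecoverableError:
        raise  # known/transient failures: let the broker retry them
    except Exception:
        # Swallow everything else so the task is not redelivered forever;
        # log it so the permanent failure is still visible.
        logger.exception("Task failed permanently, not retrying")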

Eagllus (Collaborator) commented Apr 18, 2017

One way I resolved this issue is with a function called retry.
It is designed to work with iLOs that are unreliable in their responses/connections.

import time

def retry(func, *args, **kwargs):
    """
    A custom retry helper for setting iLO information.
    """
    count = kwargs.get('count', 0)
    max_retries = kwargs.get('max_retries', 3)
    countdown = kwargs.get('countdown', 30)
    exc = kwargs.get('exc', BaseException)

    if count < max_retries:
        # Wait before the next attempt, then call the function again with an
        # incremented attempt counter so it can keep track of the retries.
        time.sleep(countdown)
        count += 1
        return func(*args, count=count, max_retries=max_retries, countdown=countdown)
    else:
        # Out of attempts: re-raise the last exception that was passed in.
        raise exc

I used it like this

def set_host_power(ilo, values, **kwargs):
    try:
        return ilo.call_ilo('set_host_power_saver', values['host_power_saver'])
    except (IloCommunicationError, socket.timeout, socket.error) as exc:
        return retry(set_host_power, ilo, values, exc=exc, **kwargs)

This allows retry to retry the call up to max_retries times (default: 3) before raising the exception.
You could also return a state instead of raising the exception.

@dangerski

If you are using AWS SQS, my workaround is to create a dead-letter queue and then, on your task queue, enable "Use Redrive Policy", which "Sends messages into a dead letter queue after exceeding the Maximum Receives." Then set the maximum receives to the number of retries you want to allow.
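
The same redrive policy can also be set with boto3; a sketch with illustrative queue identifiers and an illustrative maxReceiveCount:

import json
import boto3

sqs = boto3.client("sqs")

# Illustrative identifiers; substitute your own queue URL and DLQ ARN.
task_queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/django-q-tasks"
dead_letter_arn = "arn:aws:sqs:us-east-1:123456789012:django-q-tasks-dlq"

# After 5 unsuccessful receives, SQS moves the message to the
# dead-letter queue instead of redelivering it to the workers.
sqs.set_queue_attributes(
    QueueUrl=task_queue_url,
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dead_letter_arn,
            "maxReceiveCount": "5",
        })
    },
)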

@pilgrim2go

@dangerski: I have the same problem, but even when I configure the AWS queue to use redrive, it doesn't help.
Diving into the code to see why.

Balletie added a commit to Balletie/django-q that referenced this issue Mar 9, 2018
If a task fails with an exception, it is retried until it
succeeds. This is contrary to what is said in the documentation: under
the "Architecture" section, heading "Broker" it says that even when a
task errors, it's still considered a successful delivery. Failed tasks
never get acknowledged however, thereby being retried after the
timeout period. See also issues Koed00#238 and Koed00#194.

This patch adds an option to acknowledge failures, thereby closing
issue Koed00#238. Issue Koed00#194 would require some more work. The default of
this option is set to `False`, thereby maintaining backwards
compatibility.
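
For reference, the option this patch adds (ack_failures, as mentioned in the comments below) is configured in the Q_CLUSTER dict in the Django settings; a minimal sketch, with the other values purely illustrative:

# settings.py
Q_CLUSTER = {
    "name": "myproject",   # illustrative
    "workers": 4,          # illustrative
    "timeout": 60,
    "retry": 120,
    "ack_failures": True,  # acknowledge failed tasks so the broker does not redeliver them
}
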
@mm-matthias

The ack_failure/ack_failures options do not work for tasks that fail due to a timeout. After the worker is killed, the task remains in the queue. Once the worker is reincarnated, it fetches the task, runs into the timeout and the whole thing starts all over again.

Waszker commented Mar 17, 2020

@mm-matthias I have exactly the same problem. Have you found a solution for it? Is there a way to specify the number of retries before termination?

@mm-matthias

@Waszker We have been using celery/redis instead of django-q for more than a year, so I don't have a solution for this problem.

Waszker commented Mar 17, 2020

@Waszker We have been using celery/redis instead of django-q for more than a year, so I don't have a solution for this problem.

That's a pity... Thanks for the answer though...

@timomeara (Contributor)

I have a PR for a retry limit:
#466
Try it out :)
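
Assuming that PR is the one that introduced the max_attempts setting (check the PR itself for the exact name and semantics), it would be another Q_CLUSTER key; a sketch with illustrative values:

# settings.py
Q_CLUSTER = {
    "name": "myproject",  # illustrative
    "workers": 4,         # illustrative
    "retry": 120,
    "max_attempts": 3,    # assumed option name: give up on a task after 3 failed attempts
}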

@abhishek-compro

This can be closed

@tremby closed this as completed Aug 16, 2022