Retry when it could not get the pull request #101

Merged
asmeurer merged 1 commit into sympy:master on Mar 15, 2012

Conversation

asmeurer
Member

This also changes all retries to use a doubling wait scheme: it waits for
1 second, then 2 seconds, then 4 seconds, then 8 seconds, and so on. This is,
for example, what GMail does. It prevents us from DOSing the server.

The easiest way to test this is to disconnect yourself from the internet and
try to review a pull request.

Closes #96.
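
As a rough illustration of the scheme (a sketch only, not the patch itself; the urlopen_with_retry name and the choice of urllib2.URLError are just for the example):

import time
import urllib2

def urlopen_with_retry(url):
    # Illustration only: keep doubling the wait (1, 2, 4, 8, ... seconds)
    # until the request finally goes through.
    timer = 1
    while True:
        try:
            return urllib2.urlopen(url)
        except urllib2.URLError:
            print "Error while accessing %s, retrying in %d seconds..." % (url, timer)
            time.sleep(timer)
            timer *= 2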

@MichaelMayorov
Contributor

What's the point of using doubling?
I think the timeout should be optional and defined by the user (e.g., for a slow internet connection over GPRS).
Maybe choose a default timeout (2-3 seconds) and let the user change it with an option like --timeout.

@MichaelMayorov
Contributor

A count of tries would also be useful. No one will want to wait several minutes between tries.

@asmeurer
Member Author

If you get to the point where it's waiting more than a minute between tries, it's likely that the server is down, or there is some other problem (internet connection). If that happens, you should correct the problem, and restart. This is designed to prevent DOSing the system when run unmanned. If you are manning the system and see that there's a problem, you should step in and fix it, or if it cannot be fixed (such as if the server is down), pull the bot offline until it is fixed.

A timeout of 2-3 seconds has the potential to DOS the server. Imagine there is some bug or typo somewhere that causes it to run the loop endlessly. If the system is being run unmanned, then in one hour, with a three second timeout it will ping the server 60*60/3 = 1200 times. With this scheme, it will ping the server no more than log(60*60, 2) ~= 11 times. That's for one hour. If the system is run for 24 hours unmanned, the first number goes up to 60*60*24/3 = 28800 times, and the second becomes log(60*60*24, 2) ~= 16 times.

You might say that we should just make the timeout longer. But this doesn't really solve the problem. First off, it will still be a ton more pings than the doubling approach. Second, if we make the timeout longer than 5 or 10 seconds, it becomes an inconvenience to the user in the common case, where it just had an intermittent failure and will work the next time or the time after that. This way, if there is an intermittent failure, it will try again in 1 second, and again after that in 2 seconds. Depending on how intermittent the failure is, it will likely work at that point (the scheme adapts to different levels of intermittency, so even longer-term failures, like something that lasts 10 seconds, will be handled reasonably, without DOSing the server, which in some cases can make the failure last even longer).
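
To spell out that arithmetic (a back-of-the-envelope check, not code from the bot):

from math import log

T_HOUR = 60*60
T_DAY = 60*60*24

# Fixed 3-second timeout: one request every 3 seconds.
print T_HOUR / 3   # 1200 requests in an hour
print T_DAY / 3    # 28800 requests in a day

# Doubling scheme: the nth retry comes after 1 + 2 + ... + 2**(n - 1) = 2**n - 1
# seconds of total waiting, so roughly floor(log2(T)) retries fit in T seconds.
print int(log(T_HOUR, 2))   # ~11 retries in an hour
print int(log(T_DAY, 2))    # ~16 retries in a day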

@MichaelMayorov
Contributor

OK, I get your point.

But how will the administrator know that sympy-bot can't get pull requests? By the way, if the server goes down for a long time (half a day or more), then the timeout between tries could be more than 1 hour.

Besides that, it's a cron job. I don't know how GAE responds to running the same update script in 2 or more cron jobs (I mean the case when the first cron job has just started and hasn't finished before the second job starts).
But anyway, the administrator doesn't know about the problem until he sees it himself.

So, I suggest two things. First, use a counter of tries, so that if the server doesn't respond for a long time, it generates a failure message and stops. Second, if that happens, the application should forward this message to the administrator (for example, via email).

P.S. Also, maybe I just didn't find it, but does sympy-bot have a notification system? If not, may I include that as an idea in my GSoC proposal?

@asmeurer
Member Author

> But how will the administrator know that sympy-bot can't get pull requests? By the way, if the server goes down for a long time (half a day or more), then the timeout between tries could be more than 1 hour.

Yeah, that's expected. Remember that if this happens, then it means that the bot itself is running unmanned (or else someone would have noticed that it was just sitting there doing nothing). The best we can do in this case is make it not DOS the server, but occasionally retry in case the server comes back up (again, unmanned). For now, unmanned operation doesn't happen much, but when we implement sympy-bot work (#63) it will.

I don't understand what you are saying about cron jobs. What are you saying will be a cron job?

We don't have any kind of notifications, other than the comments to github. If you think they will be useful, feel free to open an issue for it and include it in your proposal. I suppose it could be useful, e.g., if you have sympy-bot work running on some old machine or server somewhere, to have it send notifications if it errors, or maybe a heartbeat so we can easily see if it goes down. Does the app engine have APIs for this sort of thing? If so, we could just do everything through the reviews site.

Anyway, how the admin knows about it is a different problem from this pull request. For now, we just implement the basic low-tech solution of printing a message to the terminal every time it retries. The main thing here is to prevent high amounts of pinging (DOSing), and, more importantly, to catch the URLError that has been annoying me for the past few weeks.

@MichaelMayorov
Contributor

> I don't understand what you are saying about cron jobs. What are you saying will be a cron job?

I meant http://reviews.sympy.org/update (which will run every 12 hours) and http://reviews.sympy.org/quickupdate (which will run every 2 hours); as I understand it, these scripts perform the update of the pull request list.

-            time.sleep(2)
+            print "Error while accessing pastehtml.com, retrying in %d seconds..." % timer
+            time.sleep(timer)
+            timer *= 2
Member

Yep, this is cool. Except that we don't use pastehtml anymore, right?

Member Author

No, but I figured that since this is still in here, I might as well update it. Maybe it was kept around in case we ever need to fall back to it, or something similar?

@certik
Member

certik commented Mar 14, 2012

I am +1 for this patch. If this works for you, feel free to merge it.

@asmeurer
Member Author

> I meant http://reviews.sympy.org/update (which will run every 12 hours) and http://reviews.sympy.org/quickupdate (which will run every 2 hours); as I understand it, these scripts perform the update of the pull request list.

No, these changes are only for sympy-bot itself, which runs locally. The webapp is only stuff in the web directory, which I did not touch here.

@asmeurer
Member Author

> I am +1 for this patch. If this works for you, feel free to merge it.

Thanks. I'm doing so now, since I can't really use SymPy-Bot except in this branch.

asmeurer added a commit that referenced this pull request Mar 15, 2012
Retry when it could not get the pull request
@asmeurer asmeurer merged commit 54a7485 into sympy:master Mar 15, 2012
Development

Successfully merging this pull request may close these issues.

urllib2.URLError: <urlopen error [Errno 54] Connection reset by peer>
3 participants