
Download count limit is too restrictive for many uses #193

Closed
ValWood opened this issue May 6, 2015 · 32 comments


@ValWood

ValWood commented May 6, 2015

Is it possible increase the download option number (restricted to 10,000). This is quite restrictive if you need all of the annotations for a species. GOA/QucikGO has a restriction, but has the option to over ride this restriction (can download at least 500,000 annotations). Is there any reason why AmiGO cannot allow the same?

@kltm kltm changed the title AmiGO download option is very restrictive (number) Download count limit is too restrictive for many uses May 6, 2015
@kltm
Member

kltm commented May 6, 2015

Similar to #35, but with a higher limit.

@kltm
Member

kltm commented May 6, 2015

I'd prefer to wait until berkeleybop/bbop-js#16 has been crossed off, to take advantage of any performance increases and only do the numbers once.

@kltm kltm added this to the 2.3 milestone May 6, 2015
@kltm
Member

kltm commented May 6, 2015

@ValWood Yes, the limit is a bit low--we get a steady trickle of comments about it on GO Help. The current number is the product of an ad hoc process during which we had a group of users simultaneously download sets at different limits while we watched the test servers and their response times. We wanted to make sure that one user trying to download a large file could not interfere with the responsiveness of the interface (not really an issue for AmiGO 1.x).

Now that we've been in production for a bit, we should be able to get more solid numbers than the earlier guesstimate. @mugitty (whoops, wrong downstream), do you have a feel for how stressed the production servers are 1) during peak times and 2) during the busiest period when one of the servers is out of the balancer?

@kltm
Member

kltm commented May 7, 2015

@kkarra (got the right production site now), now that we've been in production for a bit, we should be able to get more solid numbers than the earlier guesstimate--do you have a feel for how stressed the production servers are 1) during peak times and 2) during the busiest period when one of the servers is out of the balancer?

@kkarra

kkarra commented May 11, 2015

  1. Do you want me to identify peak times, or do we already know them and just want the load levels during those times?
  2. Should I look back in the logs for peak times when only one server was in use?

For increasing the number of rows--was it a memory issue or something else?

@kltm
Member

kltm commented May 11, 2015

@kkarra I'd assume the peak times can be identified pretty easily by looking at the analytics. I'd be interested in machine stress (disk and CPU usage) during two different peak times: the global peak and the time of highest usage when a single machine is in use.

What I'm trying to determine is how much slack we have with our current setup for increasing the max download rows. We can also look at increasing the resources we have to allow downloads up to a certain target number (say 500k), but looking at what we already have is a start.
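As a rough illustration of the kind of stress sampling meant here, a minimal Python sketch (the tooling, interval, and window are assumptions, not what the GO production hosts actually run):

```python
# Minimal sketch: sample CPU and disk activity on a GOlr host during a
# window (e.g. a known peak time) to see how much slack is left before
# raising the download row limit. Assumes the psutil package is available.
import time
import psutil  # third-party: pip install psutil

def sample_stress(window_s=300, interval_s=5):
    """Print CPU% and disk read/write deltas every interval_s seconds."""
    prev = psutil.disk_io_counters()
    end = time.time() + window_s
    while time.time() < end:
        cpu = psutil.cpu_percent(interval=interval_s)  # blocks for interval_s
        cur = psutil.disk_io_counters()
        print("cpu=%5.1f%%  read=%8.1fMB  write=%8.1fMB"
              % (cpu,
                 (cur.read_bytes - prev.read_bytes) / 1e6,
                 (cur.write_bytes - prev.write_bytes) / 1e6))
        prev = cur

if __name__ == "__main__":
    sample_stress()
```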

@kltm kltm modified the milestones: 2.3, 2.4 Aug 26, 2015
@kltm kltm modified the milestones: 2.4, 2.5 Mar 2, 2016
@cmungall
Member

cmungall commented Mar 9, 2016

Is this an accurate status update:

  • we think that we can increase the limit by, say, 10x
  • we need to do more stress testing to ensure that we don't end up choking the server

@kkarra are you still available to help with this?

@kltm
Member

kltm commented Mar 9, 2016

I think we can raise the limits quite a lot, given testing and possibly additional hardware. The main issue is not slowing down the UX on the main parts of AmiGO.
I'd vote for starting the process by introducing a "download server" URL to the configuration, and then build up from there. Worst case, we throw AWS at it.

@cmungall
Member

cmungall commented Mar 9, 2016

I assume the download server option would need some changes in the app (still in 2.4 milestone?) and lots of coordination with production?

@kltm
Member

kltm commented Mar 9, 2016

It will be a new variable that needs to be strung through. Once it's there, we could experiment with it fairly broadly. My druthers would be to start with a separate download server, behind a load balancer, and scale up as needed.

kltm added a commit to berkeleybop/bbop-manager-golr that referenced this issue Mar 18, 2016
@kltm
Member

kltm commented Mar 19, 2016

With the next batch of commits coming down the pipe, this issue is fixed in the code. All that we need now is:

  1. a load-balanced URL to aim it at
  2. updated configs to reflect this (AMIGO_PUBLIC_GOLR_BULK_URL); see the sketch below
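For illustration, here is roughly how that bulk URL could be wired in. AmiGO itself is Perl/JavaScript, so this Python sketch is only conceptual, and the fallback variable name AMIGO_PUBLIC_GOLR_URL, the default endpoint, and the threshold are assumptions; only AMIGO_PUBLIC_GOLR_BULK_URL comes from this issue:

```python
import os

# Regular and bulk GOlr endpoints; the bulk one would sit behind its own
# load balancer so large downloads cannot drag down the main UI.
GOLR_URL = os.environ.get("AMIGO_PUBLIC_GOLR_URL",
                          "http://golr.geneontology.org/solr/")
GOLR_BULK_URL = os.environ.get("AMIGO_PUBLIC_GOLR_BULK_URL", GOLR_URL)

def endpoint_for(rows, bulk_threshold=10000):
    """Send anything over the threshold to the bulk download endpoint."""
    return GOLR_BULK_URL if rows > bulk_threshold else GOLR_URL
```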

kltm added a commit that referenced this issue Apr 6, 2016
kltm added a commit that referenced this issue Apr 27, 2016
… download agents; TODO: could be more randomized; work on #193
@kltm
Member

kltm commented Apr 28, 2016

EDITED

Okay, a little something in the way of "data" for this. I tried this against our machine here, using ~30s windows.

Response-time plots (images not reproduced here) for the four runs:

  • Five UI agents, no download agents
  • Five UI agents, one download agent trying 10000 lines
  • Five UI agents, one download agent trying 100000 lines
  • Five UI agents, one download agent trying 500000 lines

Looking at these, in this limited case, and without truly running the numbers, it doesn't look like the download agent is really dragging up the response times for the UI agents, up to 100000.
At 500000, I don't get a response from the download agent (number 6) within the 30s, and the UI agents are getting hammered.

Of course, this is just with my server settings, etc., etc. But the takeaway here is that we should not just jump to 500000. We either say something between 100000 and 500000 is good enough (going to guess the lower end of the scale here) or we implement the separate download server. Any thoughts?
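For reference, a rough sketch of the sort of harness described above (the endpoint, filter fields, and timings are assumptions; this is not the script that produced the plots):

```python
# Five small "UI" queries repeated in parallel over a ~30s window, alongside
# one large "download" query, each timed against a GOlr/Solr endpoint.
import concurrent.futures
import time

import requests  # third-party: pip install requests

GOLR = "http://golr.geneontology.org/solr/select"  # server under test

def timed_query(rows):
    """Return elapsed seconds for one query, or None on timeout/error."""
    params = {"q": "*:*", "fq": "document_category:annotation",
              "rows": rows, "wt": "json"}
    t0 = time.time()
    try:
        requests.get(GOLR, params=params, timeout=90)
    except requests.RequestException:
        return None
    return time.time() - t0

def run_window(ui_agents=5, download_rows=100000, window_s=30):
    with concurrent.futures.ThreadPoolExecutor(max_workers=ui_agents + 1) as pool:
        download = pool.submit(timed_query, download_rows)
        ui_times = []
        end = time.time() + window_s
        while time.time() < end:
            batch = [pool.submit(timed_query, 10) for _ in range(ui_agents)]
            ui_times += [f.result() for f in batch]
        ok = [t for t in ui_times if t is not None]
        mean = sum(ok) / len(ok) if ok else float("nan")
        dl = download.result()
        print("ui mean=%.2fs (%d ok, %d failed)  download=%s"
              % (mean, len(ok), ui_times.count(None),
                 "no response" if dl is None else "%.2fs" % dl))

if __name__ == "__main__":
    run_window()
```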

EDITED

@kltm
Member

kltm commented Apr 28, 2016

Or maybe the load balancer is smart enough to deal with this kind of fun? @stuartmiyasato @kkarra, would it be alright to retry some of these on the production servers at some point? Just a few 30s windows shouldn't annoy anybody too much, right...?

@cmungall
Member

There are 405k annotations for human. There is no point stopping below this number; if we do, we may as well keep the limit low and have people download via other means.

@ValWood
Author

ValWood commented Apr 28, 2016

Won't be nearly so high when the redundancy is removed ;)

@kltm
Member

kltm commented Apr 28, 2016

Okay, from @cmungall's comment, we'll set the desired limit at 500k.
That leaves us with, in order of difficulty, the following things to eliminate:

  • production servers can take the beating directly
  • production servers cannot take it, but the load balancer keeps things responsive
  • more thought brought to bear against the current production balancing setup
  • switch to a ui vs. download backend setup (a separate server, or set of servers, at a different URL)
  • remove redundant annotations (see Add ability to filter redundant annotations #43); may not be a uniform solution

To start eliminating the first two, I just want to make sure I have a thumbs up to test a little against the production setup, aiming both at individual backends and at the current load balancer (@stuartmiyasato @kkarra).

@stuartmiyasato

I am okay with testing against production.

@cmungall
Member

On 28 Apr 2016, at 9:57, Val Wood wrote:

Won't be nearly so high when the redundancy is removed ;)

Good point. This is #43. If set by default it will reduce the size of
the typical download.

@kltm
Member

kltm commented Apr 28, 2016

Okay, great. I'll probably start poking at it a little later; it will be from the LBL block.
I've added removing redundant annotations as an approach.

@stuartmiyasato

Got Nagios/Uptime Robot alerts saying AmiGO and GOlr are down. Are you testing now? Should I restart the servers or will that interrupt any tests in progress?

@kltm
Member

kltm commented Apr 28, 2016

Yes, please restart.

@kltm
Member

kltm commented Apr 28, 2016

Okay, I'm not going to add more graphs, but I'll give a summary here.
I couldn't remember what the individual GOlr backend URLs were, so I just started with the load balancer. Over the same tests, the following (more or less obvious) things seemed to be confirmed, versus the tomodachi server:

  • The balanced URL was overall slower
  • The balanced URL had a wider range of times (less consistent)
  • The balanced URL was unable to return download results even at 100k (of which tomodachi was able to return a dozen; nobody accomplished 500k in 30s)
  • The balanced URL was more robust--it did not have the pronounced knockout slots that tomodachi showed for the ui agents

Trying to see how long the 500k would actually take, I ran into a 1min timeout from nginx. If that could be upped, I could give it another try and see how long a user would actually have to wait.

Of course there are tons of uncontrolled variables here, including likely critical server settings. Considering that human is something like 10% of total annotations and a non-trivial fraction of total documents, I imagine that disk makes up a large portion of this.

@kltm
Member

kltm commented Apr 28, 2016

BTW, my testing was over before the servers went down, so I'm not sure how much of what I did was involved there.

As well, I forgot the conclusion. If people are willing to wait a minute or so for the download (this will need to be tested), it's possible to stop this at the load balancer level. If not, we'll need to proceed to the independent download server, not for QoS, but to have something fast enough to serve users large files from the index.

@stuartmiyasato

Trying to see how long the 500k would actually take, I ran into a 1min timeout from nginx. If that could be upped, I could give it another try and see how long a user would actually have to wait.

@kltm I'm not sure what configuration needs to be edited to make this happen. Can you give me a URL and/or error message to replicate the timeout?

@kltm
Member

kltm commented Apr 28, 2016

Well, I'm now getting 500k downloads completing in 40s, most of that on the transfer itself (whereas the query before was killing it), so it's hard to test. Possibly whatever happened interfered with the later testing?

The error was a fairly generic 502 from nginx, possibly referencing a timeout. I don't have a variable that we use to raise this, as it's not something we've run into before when using the nginx reverse proxy server.

@kltm
Member

kltm commented Apr 29, 2016

I'm going to go ahead and try the 500k again on the load balancer. The download query I'm using tries to start at a random spot so as to bypass any caching; let's see what happens. I'll run it for 90s.
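A minimal sketch of that random-start trick against a Solr-style endpoint (the query parameters, endpoint, and corpus size here are assumptions, not the exact request used):

```python
import random

import requests  # third-party: pip install requests

GOLR = "http://golr.geneontology.org/solr/select"  # assumed endpoint

def bulk_download(rows=500000, corpus_size=4000000):
    """Request rows annotation docs starting at a random offset so repeated
    runs are less likely to hit a pre-warmed cache."""
    start = random.randint(0, max(0, corpus_size - rows))
    params = {"q": "*:*", "fq": "document_category:annotation",
              "start": start, "rows": rows, "wt": "csv"}
    resp = requests.get(GOLR, params=params, timeout=90, stream=True)
    resp.raise_for_status()
    return resp
```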

@kltm
Member

kltm commented Apr 29, 2016

Okay, the data is just confusing. This time around, the bbop backend demonstrates few disrupted slots, while the production balanced URL experiences significant disruption; neither successfully allows a 500k download over 90s. Something to do with the peppering by the ui agents, maybe? Maybe one of the load-balanced backends was unresponsive? It would probably take a lot to untangle all of this.

I see two paths here, as it's obvious that we won't be able to just make this work:

  • put more effort into the hardware, configuration, and load balancing of what we currently have
  • put that effort instead into having a separate download URL (either real, location TBD, or AWS)

Any input from @stuartmiyasato about current commitments or capacity?

@stuartmiyasato

Mike Cherry has made it pretty clear to our lab that I won't be on the GO project for the next grant cycle, so it seems rather pointless for me to work on any local (Stanford) GO infrastructure projects. I would vote for AWS as a result, mostly due to my familiarity with it. But since I won't be managing it in the long term, my vote probably shouldn't count for much...

@kltm
Member

kltm commented May 20, 2016

Okay, I'm getting bogged down in various things and we need to get this release out, even without the large downloads fully functional. I'm going to try to make this just a configuration issue in the future (server variables and re-install), whatever solution we come up with.

For now, I'm going to thread through a new variable (AMIGO_DOWNLOAD_LIMIT/download_limit) and change the library settings to 100000 (while not 500k, it's still arguably 10x better and can be used to test the new settings). With these, and the download server addition from before, we should be able to switch to a separate download server down the line, once we have it, with just a few config changes.
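Conceptually, the new limit just clamps whatever row count a client asks for; a tiny Python sketch of the idea (only the AMIGO_DOWNLOAD_LIMIT name and the 100000 default come from above, the helper itself is hypothetical):

```python
import os

# Configured ceiling on download rows; default mirrors the new library setting.
DOWNLOAD_LIMIT = int(os.environ.get("AMIGO_DOWNLOAD_LIMIT", "100000"))

def clamp_rows(requested_rows):
    """Never ask the index for more rows than the configured download limit."""
    return min(int(requested_rows), DOWNLOAD_LIMIT)
```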

@kltm
Member

kltm commented May 20, 2016

The code is (well, should be) complete for this fix. I'm going to close this out and open a hotfix for download servers.
