Spike: Revisit the issue of adding rate limiting logic to the application, create a list of actionable issues to start the effort. #23

Closed
kcondon opened this issue Jan 14, 2015 · 38 comments

Comments

@kcondon

kcondon commented Jan 14, 2015

This ticket is a placeholder for general API rate and access limiting logic to better control the load placed on the service and provide options in case of system instability.

Rate limiting was mentioned during Search API testing, and the GitHub Search API uses this concept too:
https://developer.github.com/v3/search/

Limiting access might involve varying degrees of control: a general API access on/off switch, per-API limits, and/or a whitelist/blacklist of IP addresses/users. The last of these might be integrated with groups and permissions.


Update: additional terms for this:

  • API throttling
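To make the per-API / per-user idea above concrete, here is a minimal, purely hypothetical sketch (not Dataverse code) of a servlet filter that combines a global on/off switch, an IP blocklist, and a fixed-window per-IP request counter, assuming a Jakarta EE stack. The class name, thresholds, and addresses are all invented; a real implementation would read the switch, lists, and limits from configuration and would likely key on the authenticated user or API token rather than the raw IP address.

// Hypothetical sketch only -- not Dataverse's implementation.
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ApiRateLimitFilter implements Filter {

    private static final boolean API_ENABLED = true;                     // global on/off switch
    private static final Set<String> BLOCKLIST = Set.of("203.0.113.7");  // example address only
    private static final int MAX_REQUESTS_PER_MINUTE = 120;              // made-up limit

    private final ConcurrentHashMap<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        String ip = req.getRemoteAddr();

        if (!API_ENABLED || BLOCKLIST.contains(ip)) {
            response.sendError(HttpServletResponse.SC_FORBIDDEN, "API access is disabled for this client");
            return;
        }

        // Reset all counters at the start of each one-minute window.
        long now = System.currentTimeMillis();
        if (now - windowStart > 60_000) {
            counters.clear();
            windowStart = now;
        }

        if (counters.computeIfAbsent(ip, k -> new AtomicInteger()).incrementAndGet() > MAX_REQUESTS_PER_MINUTE) {
            response.setStatus(429); // Too Many Requests
            return;
        }
        chain.doFilter(req, res);
    }
}
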
@pdurbin
Member

pdurbin commented Mar 15, 2018

For Harvard Dataverse we have been talking about investigating rate limiting solutions offered by AWS, and I just pushed b1b703a to mention the new "Rate-Based Rules" offering that's part of AWS WAF (Web Application Firewall). This blog post provides a good overview: https://aws.amazon.com/blogs/aws/protect-web-sites-services-using-rate-based-rules-for-aws-waf/
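For reference, a rate-based rule can be created programmatically as well as through the console. The snippet below is only a rough sketch using the AWS SDK for Java 1.x against the WAF (Regional) API; the rule and metric names are invented, and the class and method names are from memory and should be verified against the SDK documentation before being relied on.

// Rough, unverified sketch: creating a WAF rate-based rule with the AWS SDK for Java 1.x.
import com.amazonaws.services.waf.AWSWAFRegional;
import com.amazonaws.services.waf.AWSWAFRegionalClientBuilder;
import com.amazonaws.services.waf.model.CreateRateBasedRuleRequest;
import com.amazonaws.services.waf.model.GetChangeTokenRequest;
import com.amazonaws.services.waf.model.RateKey;

public class CreateRateBasedRuleSketch {
    public static void main(String[] args) {
        AWSWAFRegional waf = AWSWAFRegionalClientBuilder.defaultClient();

        // Every WAF mutation requires a fresh change token.
        String changeToken = waf.getChangeToken(new GetChangeTokenRequest()).getChangeToken();

        // Block any single source IP that exceeds 2000 requests in a 5-minute window
        // (2000 was the minimum rate limit at the time of the blog post above).
        waf.createRateBasedRule(new CreateRateBasedRuleRequest()
                .withName("dataverse-api-rate-limit")      // invented rule name
                .withMetricName("DataverseApiRateLimit")    // invented metric name
                .withRateKey(RateKey.IP)
                .withRateLimit(2000L)
                .withChangeToken(changeToken));
    }
}
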

@pdurbin
Member

pdurbin commented May 18, 2018

At standup this morning I inquired whether there is any specific technical plan or approach, and the feedback was that we are fine with an AWS-specific solution for now. So I went ahead and made pull request IQSS/dataverse#4693, based on the commit I mentioned above, and moved this issue to code review at https://waffle.io/IQSS/dataverse

@pdurbin pdurbin removed their assignment May 18, 2018
@djbrooke djbrooke changed the title API: Add rate limiting logic API: Add rate limiting logic on AWS May 18, 2018
@djbrooke djbrooke changed the title API: Add rate limiting logic on AWS API: Add rate limiting logic for AWS May 18, 2018
@djbrooke

Thanks @pdurbin. I re-titled the issue to reflect that this is AWS-specific. We'll want a general solution at some point, but I think it's good to get this small chunk tested and out in a release.

@djbrooke djbrooke assigned djbrooke and landreev and unassigned djbrooke May 21, 2018
@djbrooke

Talked after standup this morning. The approach is good, but we need some boxes from LTS to test. @landreev will look into this with LTS. @djbrooke will get involved if a credit card or something is needed :)

@landreev

landreev commented May 21, 2018

I sent a note to LTS:

We've been thinking about creating a test AWS setup that would mimic production, for testing new releases before they go into production.

Something like a low-power instance or two. And we specifically want it to sit behind an ELB - in order to be able to test load-balancing and rate-limiting mechanisms. (This is one thing we have no way of testing as of now).

Is this something you could help us set up, or would you recommend that we just set it up on our own?

(by the time I hit send I kinda felt like I was maybe pushing it with them... well, if that's the case they'll tell us to do it ourselves and we will. but I figured I'd ask)

@djbrooke

Meeting with LTS on Wednesday, will discuss.

@djbrooke djbrooke assigned landreev and unassigned landreev Jun 18, 2018
@djbrooke djbrooke removed their assignment Jul 13, 2018
@matthew-a-dunlap matthew-a-dunlap self-assigned this Jul 30, 2018
@landreev

@matthew-a-dunlap
The test cluster is made up of 2 app nodes:
dvn-cloud-dev-1.lts.harvard.edu
dvn-cloud-dev-2.lts.harvard.edu
I created a shell account for you, with the username mdunlap and sudo powers.
I'll slack the password to you.
The ELB for the cluster is
https://dvn-dev.lts.harvard.edu/

Both nodes are using the database on dvn-cloud-dev-1.

@landreev landreev removed their assignment Jul 31, 2018
@matthew-a-dunlap

This story is mostly blocked until we hear back from LTS about access to the web console; they only provided us access to the boxes themselves.

I can do some deeper research into the web application firewall in the meantime.

@djbrooke

The tech team will discuss and bring a well-scoped issue to a future planning meeting.

@djbrooke

@scolapasta - when you pick this up for discussion, one thing that @landreev mentioned is that it may be a good idea to check the number of locks a person has. For example, a person can start a bunch of publishing requests; the individual datasets are locked, but what's to stop them from firing several thousand requests in parallel?
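Something along these lines could work as a guard, sketched here purely for illustration: count the requester's active dataset locks before queuing another publish request and refuse (e.g. with HTTP 429) above a threshold. The class, query, and field names below are assumptions and do not correspond to Dataverse's actual internal API.

// Hypothetical sketch; class, query, and field names are illustrative only.
import jakarta.ejb.Stateless;
import jakarta.persistence.EntityManager;
import jakarta.persistence.PersistenceContext;

@Stateless
public class PublishThrottleCheck {

    private static final long MAX_ACTIVE_LOCKS_PER_USER = 10; // made-up threshold

    @PersistenceContext
    private EntityManager em;

    // True if the user already holds so many active dataset locks (e.g. from
    // in-progress publish requests) that a new one should be rejected up front.
    public boolean tooManyActiveLocks(String userIdentifier) {
        Long activeLocks = em.createQuery(
                "SELECT COUNT(l) FROM DatasetLock l WHERE l.user.userIdentifier = :id", Long.class)
            .setParameter("id", userIdentifier)
            .getSingleResult();
        return activeLocks >= MAX_ACTIVE_LOCKS_PER_USER;
    }
}
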

@PaulBoon

PaulBoon commented Nov 29, 2021

Since last week we have been experiencing a lot of problems caused by a large number of requests to the '/api/access/datafiles/{id}' endpoint.
These download requests are probably not malicious, but of course we don't know for sure.

Besides thinking about using something like mod_evasive, we also looked into our Payara configuration, which might be tuned to give better performance.
This blog post https://blog.payara.fish/fine-tuning-payara-server-5-in-production contains very useful information,
but I was wondering if there are any Dataverse-specific tips available in the guides, or whether maybe they should be added?
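In case it helps, a mod_evasive configuration for this kind of situation might look roughly like the following; the thresholds are guesses and would need tuning (and whitelisting of known-good clients) before being used in front of a Dataverse instance.

# Illustrative mod_evasive settings only; thresholds would need tuning.
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    # Max requests for the same URI within DOSPageInterval seconds:
    DOSPageCount        10
    DOSPageInterval     1
    # Max requests for the whole site within DOSSiteInterval seconds:
    DOSSiteCount        100
    DOSSiteInterval     1
    # How long (in seconds) a blocked client keeps receiving 403s:
    DOSBlockingPeriod   60
    DOSLogDir           "/var/log/mod_evasive"
    DOSWhitelist        127.0.0.1
</IfModule>
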

@scolapasta scolapasta removed their assignment Mar 4, 2022
@mreekie
Collaborator

mreekie commented Jan 10, 2023

Prio meeting with Stefano.

  • Moved from Dataverse Team Backlog to ordered backlog

@mreekie
Collaborator

mreekie commented Jan 11, 2023

Top priority for upcoming sprint

@mreekie
Collaborator

mreekie commented Jan 11, 2023

Sizing:

  • The first step here is a spike, time-limited to a sprint.
  • We don't have a nailed-down approach to this, though there has been some research and discussion.
  • The spike could include a tech hour session.
  • This could be big enough that it becomes a "deliverable".

@landreev

This came up yet again, recently.
The reason it hasn't gone anywhere in 8 years is that it's way too fat, an elephant-sized issue that's too broadly defined. We've gone through this cycle quite a few times - talking about it during tech hours, giving it to somebody to research and investigate, etc. But it's hard to even talk about when we define it like this, as wanting to "throttle everything", the full spectrum of our incoming traffic - it's not even clear where to start.
Our traffic is not uniform. It would be easy if we were only serving cat pictures (of roughly the same size) all day long. But our users' requests vary immensely in their impact on the system, plus we have different classes of users, etc. In general, you need to know a lot about the specifics of our application, and this makes adopting existing third-party solutions difficult, to say the least.

What I'm proposing is that instead of trying to revisit this issue as a whole, we should just start chipping away at the problem by addressing specific cases of limiting excessive load that we can define and know how to address. I've proposed some, like detecting and blocking aggressive crawlers (basically what I do by hand occasionally; blocking crawlers may also be one area where some off-the-shelf solution may/should work), or limiting specific expensive activity at the user level (like a limit on how many files, or how much data, an unprivileged user can upload per hour). Features like this are in fact long overdue. And I'm convinced by now that it would be more productive to just work on them one clearly defined case at a time.
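As an illustration of the second example (per-user upload limits), a self-contained sketch of the bookkeeping involved might look like this. None of these names exist in Dataverse; a real implementation would persist the counters and read the budgets from configuration.

// Hypothetical sketch of a per-user, sliding one-hour upload budget.
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UploadRateLimiter {

    // Illustrative per-hour budgets for an unprivileged user.
    private static final int  MAX_FILES_PER_HOUR = 500;
    private static final long MAX_BYTES_PER_HOUR = 10L * 1024 * 1024 * 1024; // 10 GB

    private record Upload(Instant time, long bytes) {}

    private final Map<String, Deque<Upload>> history = new ConcurrentHashMap<>();

    // Returns true if the upload may proceed (and records it); false means "throttle".
    public synchronized boolean tryRecordUpload(String userId, long bytes) {
        Instant cutoff = Instant.now().minus(Duration.ofHours(1));
        Deque<Upload> uploads = history.computeIfAbsent(userId, k -> new ArrayDeque<>());

        // Drop entries that have aged out of the sliding one-hour window.
        while (!uploads.isEmpty() && uploads.peekFirst().time().isBefore(cutoff)) {
            uploads.pollFirst();
        }

        long bytesInWindow = uploads.stream().mapToLong(Upload::bytes).sum();
        if (uploads.size() + 1 > MAX_FILES_PER_HOUR || bytesInWindow + bytes > MAX_BYTES_PER_HOUR) {
            return false;
        }
        uploads.addLast(new Upload(Instant.now(), bytes));
        return true;
    }
}
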

@mreekie
Collaborator

mreekie commented Jan 27, 2023

Sprint board review

  • I'm pulling this one back off the sprint and onto the backlog, pending an OK from Stefano.
  • The reasons are:
    • This may need additional discussion
    • I added an extra 69 points to this sprint over the team estimate of 400.

(I can't wait until some of this is automated)

@mreekie
Collaborator

mreekie commented Jan 27, 2023

Sprint board review

  • Resized as a time-bounded spike.
  • Size 33.
  • Added back on the sprint.

@landreev landreev self-assigned this Jan 30, 2023
@landreev landreev changed the title Add rate limiting logic Spike: Revisit the issue of adding rate limiting logic to the application, create a list of actionable issues to start the effort. Jan 30, 2023
@landreev

landreev commented Feb 2, 2023

There are a few specific areas that have been identified where we can start working immediately.
The list below is the first set of such issues catalogued as part of this spike, some old and some brand-new.

  1. This new issue has been opened as a followup to the discussion with @siacus and @scolapasta, as a sensible area to focus on:
  2. During recent discussions it was suggested that metering and limiting file uploads should also be handled under this umbrella, since uploads are a very serious part of the overall practical system load, and there seems to be agreement that this needs to be addressed urgently. Another practical consideration is that file uploads are not handled through the command engine, and therefore will not be subject to limiting by the technology described in item 1 above.
    A few issues have been opened for storage quotas and limits over the years. There is some overlap between them.
  3. Add an Apache-level solution for detecting bot/scripted or otherwise automated crawling, before it gets to the application (an illustrative sketch follows this list):
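(Not the contents of the issue referenced in item 3, which isn't linked above; just an illustrative sketch of the kind of Apache-level blocking it describes. Real crawler detection would rely on request rates rather than only the User-Agent header, which aggressive bots often fake.)

# Illustrative only: reject obviously scripted clients before they reach the application.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Requests whose User-Agent matches common bulk-scraping tools...
    RewriteCond %{HTTP_USER_AGENT} (wget|curl|python-requests|scrapy) [NC]
    # ...and that target the expensive application paths, not static assets:
    RewriteCond %{REQUEST_URI} ^/(api|dataset\.xhtml) [NC]
    RewriteRule .* - [F,L]
</IfModule>
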

@scolapasta scolapasta self-assigned this Feb 2, 2023
@landreev

landreev commented Feb 2, 2023

Per feedback from @qqmyers, I'll run some quick practical analysis on the ActionLogRecord data in production, to see if any obvious results can be derived from it immediately, smoking guns/worst offenders, etc.
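A starting point for that analysis might be something like the query below, grouping log records per user, action type, and hour to surface the heaviest callers. The column names are assumptions about the actionlogrecord table and should be checked against the actual schema.

-- Sketch of a "worst offenders" query; column names should be verified.
SELECT useridentifier,
       actiontype,
       date_trunc('hour', starttime) AS hour,
       count(*) AS requests
FROM actionlogrecord
WHERE starttime > now() - interval '7 days'
GROUP BY useridentifier, actiontype, date_trunc('hour', starttime)
ORDER BY requests DESC
LIMIT 50;
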

@landreev

landreev commented Feb 2, 2023

Actually, I'll add any useful stats from the prod. ActionLogRecord to the "command engine" issue (#9356).

@scolapasta

Reviewed the new issues added - I think they look good and represent what we can get done first in order to help with rate limiting. There may well be more to do after those, but let's get them working (I've gone ahead and added them to the Dataverse Dev column in the backlog board) and we can revisit afterwards, as needed.

@mreekie
Collaborator

mreekie commented Feb 28, 2023

Grooming:

  • This is closed but it is the start of a dev effort to revisit rate limiting issues.
  • Added a deliverable label:

@mreekie mreekie reopened this Mar 6, 2023
@mreekie mreekie transferred this issue from IQSS/dataverse Mar 6, 2023
@mreekie
Collaborator

mreekie commented Apr 10, 2023

grooming:

  • This needs to be looked at again.
  • For now I put it in the backlog and added the deliverable label.

@scolapasta

Closing this, now that we have IQSS/dataverse#10211 in progress.

@landreev

@scolapasta Are you sure you wanted to close this one?
Note that this spike was about being able to limit everything across the application, with the idea, I think, that more than one solution may be needed in parallel, for different parts of the application.
IQSS/dataverse#10211 and the corresponding issue are specifically for the Command Engine only.

I can see how an argument can be made that if there is anything potentially expensive that we want to ration and that bypasses the command system, it could potentially be addressed by creating dedicated commands for all such things... But I still think that would need to be discussed, to make sure we're not missing anything.

@scolapasta

@landreev If there are other areas that we do need to ration, outside of the command system, then I'd vote for creating more specific, actionable issues for them. This one was in the dm-project, and I do think we've made plenty of headway on different aspects, which I think accomplished the goal of "creat[ing] a list of actionable issues to start the effort". But if you feel otherwise and think there's something more we can do for this one specifically, that's fine too.
