
Stats page overload mitigation, passwording next steps discussion #9002

Closed · jywarren opened this issue Jan 12, 2021 · 14 comments

Labels: brainstorm (issues that need discussion and requirements need to be elucidated) · discussion · optimization (reducing load times and increasing code efficiency through refactoring)

@jywarren (Member)

Just creating a coordination and discussion space for folks here -- I think there may be an info gap, so I want to describe as much as I can so that folks all know what's happening and we can identify any next steps.

It looks like the route https://publiclab.org/stats has experienced some very heavy loads due to bots, and in order to preserve our site, we've placed a password in front of that page. Thank you @icarito!

The original load (maybe "attack" is too strong a word?) was on Nov 13th, as @icarito detected and reported in the chatroom. After we initially blocked one bot, requests continued from a "masked" origin. @ebarry asked for temporary action to be taken due to an event that day, so @icarito placed the page behind a password.

Interestingly, the bot was smart enough to hit unique date ranges each time, which is one reason the load it generated was so high and disruptive.

Next steps

For community coordination reasons, Liz noted that the page should ideally be open to the public, so we should brainstorm a bit on what we can do to re-open it without requiring a password. I'll list some possible options here and we can discuss; if you have more ideas, please share!

  1. Lock the page to only logged-in users; this should prevent most bots but could potentially still remain open to determined attacks (or unintentional high use by community members; @skilfullycurled and others have occasionally wanted to download large segments of data, but I think it's tough for them to hit this route hard enough to cause too much trouble without really trying 😄 -- see the sketch after this list)
  2. Make caching stricter -- only allow downloads in pre-determined time chunks, like one month at a time, and not for arbitrary timeframes, which would mean an infinite number of possible requests to respond to.
  3. Allow public access to the /stats landing page, but require login to dig deeper?
  4. Somehow have it automatically stop allowing pageloads to /stats pages if the server is having trouble keeping up? (Is this possible? Sounds complicated...)
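
For illustration, a minimal sketch of what option 1 could look like in the stats controller. `require_login` here is an assumed, illustrative auth helper, not necessarily the actual plots2 filter name:

```ruby
# Hypothetical sketch of option 1: every stats route requires login.
# Bots without accounts are redirected to sign in before any heavy
# database query runs.
class StatsController < ApplicationController
  before_action :require_login # assumed auth helper
end
```

Option 3 would be the same idea with an escape hatch, e.g. `before_action :require_login, except: [:index]`, so the landing page stays public.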

Given that downloading what are essentially huge swaths of the database directly can be a heavy task which can impact site uptime and response time, what are some other compromises we could explore?

Thanks, all! And, we don't have to come up with the perfect solution if one of these will do for the time being!

jywarren added the brainstorm, discussion, and optimization labels on Jan 12, 2021
@icarito (Member) commented Jan 13, 2021

Hi,
I find it hard to understand what went on with the stats page.
Is it some random bot(s) caught in a loop following links?
Is it a denial-of-service attack?
Is it some legitimate use?
Hoping it's the first: in that case, restricting the stats page to only logged-in users would be a good choice to start with, as it seems like the least effort!

@skilfullycurled (Contributor)

A few quick 🙄 notes:

  1. While the impetus for this discussion is certainly a new problem, these issues (particularly the first two) may be useful for seeing related problems we've dealt with in the past, causes we ruled out, solutions we tossed around, and the reasoning behind our decisions at that time.

Stats Download And Site Overload: #5524
Raw data from stats page: #4654
Planning for expanded community stats system: #3498
Stats Page Query Bug: #5917
Stats downloading returns "Page does not exist" for dates prior to early 2013: #5490

  2. Regarding the need to download data:

> @skilfullycurled and others have occasionally wanted to download large segments of data, but I think it's tough for them to hit this route hard enough to cause too much trouble without really trying 😄

That is correct. But! I'd have to refresh my memory from the issues above, but there are some caveats which I avoided simply by downloading chunks of data with those issues in mind. And a bot, not having been in on that conversation, probably isn't accounting for them. We solved one of them (#5490), which had to do with corrupted data. The other has to do with the fact that there is a lot more data from the period before the site had any spam countermeasures. I thought what we decided was to exclude data collected from the first iteration of the site entirely.

  3. Access to downloading data:

> Lock the page to only logged-in users; this should prevent most bots but could potentially still remain open to determined attacks (or unintentional high use by community members).

We may have discussed requiring people to fill out a form and request access. This is a strategy I've seen others use. Typically you have to answer a few questions (it's not gatekeeping, just collecting internal data about who uses the data and what they use it for), and then agree to a few terms as well (citation, sharing, allowable usage, etc.).

  4. Size limits

> Given that downloading what are essentially huge swaths of the database directly can be a heavy task which can impact site uptime and response time, what are some other compromises we could explore?

This might be another one of those "our temporary solution was for people in the know to just not do the things that are bad" situations. I don't know if it was implemented, but I think we discussed the following potential strategies:

  • limit the amount you can download at one time to a year, or 6 months, or whatever
  • "pre-package" time spans of data (e.g. a zip file for every year; a rough sketch of this follows below). I think the idea here is that the person downloading the data would be responsible for trimming it to the time period they wanted using Excel or their programming library of choice.

@ebarry (Member) commented Jan 14, 2021

Thank you all for engaging with this! Have we ascertained if the main issue is that some bot/person is entering a lot of date ranges and making the visualizations redraw, or if it's that some bot/person is entering a lot of date ranges AND downloading the data?

@icarito (Member) commented Jan 14, 2021

Hi Liz,
I believe that if a bot hits the stats page, it's unlikely to render the visualization, since it isn't a browser. The load comes instead from the heavy queries on the database; any visualization is rendered client-side in the user's browser.

@icarito (Member) commented Jan 14, 2021

In looking at this one example of the problematic requests, it strikes me as odd that they are targeting a future date:

```
39.71.148.175 - - [13/Nov/2020:19:43:20 +0000] "GET /stats?start=March%2023,%202023%2019:48%20&end=June%2019,%202023%2019:48 HTTP/1.1" 200 18985 "https://publiclab.org/stats?start=March%2023,%202024%2019:48%20&end=June%2019,%202024%2019:48" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
```

Following the link, this section draws my attention:

[screenshot]

Could it be that the bot is following the anchor link <a href...:
[screenshot]

Perhaps we could implement these buttons without an anchor link or add rel="nofollow" to them.
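
For example, a minimal ERB sketch of both ideas -- the route helper, labels, and data attributes here are illustrative, not the actual plots2 view code. (Note that rel="nofollow" is only advisory: well-behaved crawlers respect it, but it won't stop a determined bot.)

```erb
<%# Option A: keep the link but mark it nofollow (hypothetical helper names) %>
<%= link_to "Last 3 months",
            stats_path(start: 3.months.ago, end: Time.zone.now),
            rel: "nofollow" %>

<%# Option B: no <a href> at all, so there is nothing for a crawler to
    follow; client-side JS would read the data attributes instead %>
<button type="button"
        data-start="<%= 3.months.ago %>"
        data-end="<%= Time.zone.now %>">Last 3 months</button>
```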

@ebarry (Member) commented Jan 20, 2021

Dear Sebastian, are you saying we are being disrupted by a time-traveling bot from the Future searching for activity in 2024? Serious Inquiry 😜

To respond to Jeff's initial suggestions in this issue, could we combine 1 and 3 into a single approach?

@jywarren (Member, Author) commented Feb 2, 2021

Hi all, just returning here to note that if we implemented (1) we'd almost certainly avoid all bots, as they can't easily create user accounts. Would keeping just the very initial /stats page public be enough? Because we can cache that page very aggressively, so that's easy enough -- it's when people begin viewing or downloading arbitrary date ranges (ones we can't predict or cache) that we get into trouble.

(just to expand on that: imagine we precached all monthly data, which is about 10 years x 12 months = 120 datasets. Then we'd probably be OK unless someone tried to download them all in quick succession after the caches expired. But if we allow any arbitrary start/finish date, the queries are re-run for those ranges, and the possible variety of ranges is near infinite, so we can't pre-cache them.)
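
To make the precaching math concrete, a minimal sketch assuming a hypothetical `Stats.range_data` method standing in for the heavy queries (illustrative names, not the actual plots2 API):

```ruby
# Hypothetical sketch: fixed monthly chunks mean only ~120 possible
# cache keys (10 years x 12 months), so most requests never touch the
# database; arbitrary start/end dates would defeat this entirely.
def monthly_stats(year, month)
  start_time = Time.new(year, month, 1)
  Rails.cache.fetch("stats/monthly/#{year}-#{month}", expires_in: 1.day) do
    Stats.range_data(start_time, start_time.end_of_month) # assumed helper
  end
end
```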

If all are OK we can go ahead with 1+3 -- does this represent a temporary or permanent solution? Thanks!

@ebarry (Member) commented Feb 2, 2021

1+3 sounds great Jeff, thank you so much for making these paths forward (and thanks Sebastian too!!!!!)

Precaching all monthly data, which is like 10 years x 12 months = 120 datasets, sounds great.
Can we limit custom download ranges to account privilege levels "mod" or "admin"? We can provide the email address to moderators@publiclab.org on the /stats page for researchers who wish to access the data.

also shouting out appreciation to @cesswairimu for the existence of our Stats system 😻

@skilfullycurled (Contributor)

I've been looking at it from two perspectives of interaction:

  1. Folks like me who would like to work with the open data.

I think we can expect that anyone who would like to work with the open data (even people who are just beginning) should have, or as a learning experience should acquire, the ability to download pre-defined amounts of data and slice out the part that they want to work with. Perhaps the zip file has subfolders by year. I'll leave it to you to figure out what/how to implement such a thing (including how often it gets updated with new data), but my main point is that the data doesn't need to be packaged on demand; it can be prepackaged for download.

  2. People who would like to see certain specific aspects of the data as needed. While @ebarry may not be the only person in this category, I'll leave it to her to represent the potential needs of those folks.

Future idea: rely on the front-end visualization to do all of the different aggregations. For example, you download a cached version of a year, and then use the filter and sort functions or a highlight-and-zoom implementation to achieve what you could have achieved by downloading arbitrary dates in the first place. I don't know what the limits are on how much a front-end library can take on, but it's probably more than one might expect.

@ebarry (Member) commented Feb 2, 2021

sounds great Benjamin!

@jywarren (Member, Author) commented Mar 4, 2021

OK, just circling back here: we've heard initial support in favor of 1 and 3, but now also support in favor of 2 -- shall we prioritize?

  1. If we begin by locking /stats completely to only logged-in users (1), that's relatively easy and means we can turn off our password protection.
  2. Then we refine by locking only sub-pages, re-opening just /stats to the public.
  3. Finally we circle back and try reworking the interface around fixed date ranges to expand non-logged-in access.

If this sounds good, we can take the first step soon. Thanks, all!

jywarren added a commit that referenced this issue Apr 20, 2021
await approval in #9002, but this is a pretty simple change!
@jywarren (Member, Author) commented Apr 20, 2021

For step 1, here is the PR - I'll take it we can move forward on that and will then examine 2.

#9536

However, we should check with @icarito whether /stats itself remains a big performance drain once we block all the more specific stats routes (i.e. those with larger or arbitrary time ranges).
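
If it isn't, step 2 could be as small as exempting the landing page from the login filter added in step 1 -- again a hedged sketch with an assumed `require_login`-style helper:

```ruby
# Hypothetical refinement (option 3): /stats itself stays public and
# cacheable, since it takes no arbitrary date parameters, while the
# heavier sub-routes and date-range queries still require login.
class StatsController < ApplicationController
  before_action :require_login, except: [:index]
end
```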

jywarren added a commit that referenced this issue May 10, 2021
* Place /stats routes behind login check, for performance reasons

await approval in #9002, but this is a pretty simple change!

* Update public_pages_test.rb

* Update stats_controller_test.rb

* Update stats_controller_test.rb

* fix failing tests

Co-authored-by: Cess <cessmbuguar@gmail.com>
@jywarren (Member, Author)

We went ahead and implemented the login check for all /stats* routes! This means we should shortly be able to remove the password protection.

#9536

@icarito (Member) commented May 18, 2021

Today I commented out the password protection at the webserver level, and confirmed that this page is only accessible to logged-in users.

icarito closed this as completed May 18, 2021
reginaalyssa pushed a commit to reginaalyssa/plots2 that referenced this issue Oct 16, 2021
billymoroney1 pushed a commit to billymoroney1/plots2 that referenced this issue Dec 28, 2021