
Stats page overload mitigation, passwording next steps discussion #9002

Closed · jywarren opened this issue Jan 12, 2021 · 14 comments

Labels: brainstorm (issues that need discussion and requirements need to be elucidated) · discussion · optimization (reducing load times and increasing code efficiency through refactoring)

@jywarren (Member)

Just creating a coordination and discussion space for folks here -- I think there may be an info gap, so I want to describe as much as I can so that folks all know what's happening and we can identify any next steps.

It looks like the route https://publiclab.org/stats has experienced some very heavy loads due to bots, and in order to preserve our site, we've placed a password in front of that page. Thank you @icarito!

The original load (maybe "attack" is too strong a word?) was on Nov 13th, as @icarito detected and reported in the chatroom. After we initially blocked one bot, requests continued from a "masked" origin. @ebarry asked for temporary action to be taken due to an event that day, so @icarito placed the page behind a password.

Interestingly, the bot was smart enough to hit unique date ranges each time, which is one reason the load it generated was so high and disruptive.

Next steps

For community coordination reasons, Liz noted that the page should ideally be open to the public, so we should brainstorm a bit on what we can do to re-open it without requiring a password. I'll list some possible options here and we can discuss; if you have more ideas, please share!

  1. Lock the page to only logged-in users; this should prevent most bots but could potentially still remain open to determined attacks (or unintentional high use by community members; @skilfullycurled and others have occasionally wanted to download large segments of data, but I think it's tough for them to hit this route hard enough to cause too much trouble without really trying 😄 -- see the sketch after this list)
  2. Make caching stricter -- only allow downloads in pre-determined time chunks, like one month at a time, and not for arbitrary timeframes, which would mean an infinite number of possible requests to respond to.
  3. Allow public access to the /stats landing page, but require login to dig deeper?
  4. Somehow have it automatically stop allowing pageloads to /stats pages if the server is having trouble keeping up? (Is this possible? Sounds complicated...)
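
For illustration, a minimal sketch of what option 1 could look like in the stats controller. `require_login` here is an assumed, illustrative auth helper, not necessarily the actual plots2 filter name:

```ruby
# Hypothetical sketch of option 1: every stats route requires login.
# Bots without accounts are redirected to sign in before any heavy
# database query runs.
class StatsController < ApplicationController
  before_action :require_login # assumed auth helper
end
```

Option 3 would be the same idea with an escape hatch, e.g. `before_action :require_login, except: [:index]`, so the landing page stays public.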

Given that downloading what are essentially huge swaths of the database directly can be a heavy task which can impact site uptime and response time, what are some other compromises we could explore?

Thanks, all! And, we don't have to come up with the perfect solution if one of these will do for the time being!

jywarren added the brainstorm, discussion, and optimization labels on Jan 12, 2021
@icarito (Member) commented Jan 13, 2021

Hi,
I find it hard to understand what went on with the stats page.
Is it some random bot(s) caught in a loop following links?
Is it a denial-of-service attack?
Is it some legitimate use?
Hoping it's the first: in that case, restricting the stats page to only logged-in users would be a good choice to start with, as it seems like the least effort!

@skilfullycurled (Contributor)

A few quick 🙄 notes:

  1. While the impetus for this discussion is certainly a new problem, these issues (particularly the first two) may be useful for seeing related problems we've dealt with in the past, causes we ruled out, solutions we tossed around, and the reasoning behind our decisions at that time.

Stats Download And Site Overload: #5524
Raw data from stats page: #4654
Planning for expanded community stats system: #3498
Stats Page Query Bug: #5917
Stats downloading returns "Page does not exist" for dates prior to early 2013: #5490

  2. Regarding the need to download data:

> @skilfullycurled and others have occasionally wanted to download large segments of data, but I think it's tough for them to hit this route hard enough to cause too much trouble without really trying 😄

That is correct. But! I'd have to refresh my memory from the issues above, but there are some caveats which I avoided simply by downloading chunks of data with those issues in mind. And a bot, not having been in on that conversation, probably isn't accounting for them. We solved one of them (#5490), which had to do with corrupted data. The other has to do with the fact that there is a lot more data from the period before the site had any spam countermeasures. I thought what we decided was to exclude data collected from the first iteration of the site entirely.

  3. Access to downloading data:

> Lock the page to only logged-in users; this should prevent most bots but could potentially still remain open to determined attacks (or unintentional high use by community members).

We may have discussed requiring people to fill out a form and request access. This is a strategy I've seen others use. Typically you have to answer a few questions (it's not gatekeeping, just collecting internal data about who uses the data and what they use it for), and then agree to a few terms as well (citation, sharing, allowable usage, etc.).

  4. Size limits

> Given that downloading what are essentially huge swaths of the database directly can be a heavy task which can impact site uptime and response time, what are some other compromises we could explore?

This might be another one of those "our temporary solution was for people in the know to just not do the things that are bad" situations. I don't know if it was implemented, but I think we discussed the following potential strategies:

  • limit the amount you can download at one time to a year, or 6 months, or whatever
  • "pre-package" time spans of data (e.g. a zip file for every year; a rough sketch of this follows below). I think the idea here is that the person downloading the data would be responsible for trimming it to the time period they wanted using Excel or their programming library of choice.

@ebarry (Member) commented Jan 14, 2021

Thank you all for engaging with this! Have we ascertained if the main issue is that some bot/person is entering a lot of date ranges and making the visualizations redraw, or if it's that some bot/person is entering a lot of date ranges AND downloading the data?

@icarito (Member) commented Jan 14, 2021

Hi Liz,
I believe that if a bot hits the stats page, it's unlikely to render the visualization, since it isn't a browser. The load comes instead from the heavy queries on the database; any visualization is rendered client-side in the user's browser.

@icarito (Member) commented Jan 14, 2021

In looking at this one example of the problematic requests, it strikes me as odd that they are targeting a future date:

```
39.71.148.175 - - [13/Nov/2020:19:43:20 +0000] "GET /stats?start=March%2023,%202023%2019:48%20&end=June%2019,%202023%2019:48 HTTP/1.1" 200 18985 "https://publiclab.org/stats?start=March%2023,%202024%2019:48%20&end=June%2019,%202024%2019:48" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
```

Following the link, this section draws my attention:

[screenshot]

Could it be that the bot is following the anchor link <a href...:
[screenshot]

Perhaps we could implement these buttons without an anchor link or add rel="nofollow" to them.
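
For example, a minimal ERB sketch of both ideas -- the route helper, labels, and data attributes here are illustrative, not the actual plots2 view code. (Note that rel="nofollow" is only advisory: well-behaved crawlers respect it, but it won't stop a determined bot.)

```erb
<%# Option A: keep the link but mark it nofollow (hypothetical helper names) %>
<%= link_to "Last 3 months",
            stats_path(start: 3.months.ago, end: Time.zone.now),
            rel: "nofollow" %>

<%# Option B: no <a href> at all, so there is nothing for a crawler to
    follow; client-side JS would read the data attributes instead %>
<button type="button"
        data-start="<%= 3.months.ago %>"
        data-end="<%= Time.zone.now %>">Last 3 months</button>
```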

@ebarry (Member) commented Jan 20, 2021

Dear Sebastian, are you saying we are being disrupted by a time-traveling bot from the Future searching for activity in 2024? Serious Inquiry 😜

To respond to Jeff's initial suggestions in this issue, could we combine 1 and 3 into a single approach?

@jywarren (Member, Author) commented Feb 2, 2021

Hi all, just returning here to note that if we implemented (1) we'd almost certainly avoid all bots, as they can't easily create user accounts. Would keeping just the very initial /stats page public be enough? Because we can cache that page very aggressively, so that's easy enough -- it's when people begin viewing or downloading arbitrary date ranges (ones we can't predict or cache) that we get into trouble.

(just to expand on that: imagine we precached all monthly data, which is about 10 years x 12 months = 120 datasets. Then we'd probably be OK unless someone tried to download them all in quick succession after the caches expired. But if we allow any arbitrary start/finish date, the queries are re-run for those ranges, and the possible variety of ranges is near infinite, so we can't pre-cache them.)
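
To make the precaching math concrete, a minimal sketch assuming a hypothetical `Stats.range_data` method standing in for the heavy queries (illustrative names, not the actual plots2 API):

```ruby
# Hypothetical sketch: fixed monthly chunks mean only ~120 possible
# cache keys (10 years x 12 months), so most requests never touch the
# database; arbitrary start/end dates would defeat this entirely.
def monthly_stats(year, month)
  start_time = Time.new(year, month, 1)
  Rails.cache.fetch("stats/monthly/#{year}-#{month}", expires_in: 1.day) do
    Stats.range_data(start_time, start_time.end_of_month) # assumed helper
  end
end
```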

If all are OK we can go ahead with 1+3 -- does this represent a temporary or permanent solution? Thanks!

@ebarry (Member) commented Feb 2, 2021

1+3 sounds great Jeff, thank you so much for making these paths forward (and thanks Sebastian too!!!!!)

Precaching all monthly data, which is like 10 years x 12 months = 120 datasets, sounds great.
Can we limit custom download ranges to account privilege levels "mod" or "admin"? We can provide the email address to moderators@publiclab.org on the /stats page for researchers who wish to access the data.

also shouting out appreciation to @cesswairimu for the existence of our Stats system 😻

@skilfullycurled (Contributor)

I've been looking at it from two perspectives of interaction:

  1. Folks like me who would like to work with the open data.

I think we can expect that anyone who would like to work with the open data (even people who are just beginning) should have, or as a learning experience should acquire, the ability to download pre-defined amounts of data and slice out the part that they want to work with. Perhaps the zip file has subfolders by year. I'll leave it to you to figure out what/how to implement such a thing (including how often it gets updated with new data), but my main point is that the data doesn't need to be packaged on demand; it can be prepackaged for download.

  2. People who would like to see certain specific aspects of the data as needed. While @ebarry may not be the only person in this category, I'll leave it to her to represent the potential needs of those folks.

Future idea: rely on the front-end visualization to do all of the different aggregations. For example, you download a cached version of a year, and then use the filter and sort functions or a highlight-and-zoom implementation to achieve what you could have achieved by downloading arbitrary dates in the first place. I don't know what the limits are on how much a front-end library can take on, but it's probably more than one might expect.

@ebarry (Member) commented Feb 2, 2021

sounds great Benjamin!

@jywarren (Member, Author) commented Mar 4, 2021

OK, just circling back here: we've heard initial support in favor of 1 and 3, but now also support in favor of 2 -- shall we prioritize?

  1. If we begin by locking /stats completely to only logged-in users (1), that's relatively easy and means we can turn off our password protection.
  2. Then we refine by locking only sub-pages, re-opening just /stats to the public.
  3. Finally we circle back and try reworking the interface around fixed date ranges to expand non-logged-in access.

If this sounds good, we can take the first step soon. Thanks, all!

jywarren added a commit that referenced this issue Apr 20, 2021
await approval in #9002, but this is a pretty simple change!
@jywarren (Member, Author) commented Apr 20, 2021

For step 1, here is the PR - I'll take it we can move forward on that and will then examine 2.

#9536

However, we should check with @icarito whether /stats itself remains a big performance drain once we block all the more specific stats routes (i.e. those with larger or arbitrary time ranges).
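
If it isn't, step 2 could be as small as exempting the landing page from the login filter added in step 1 -- again a hedged sketch with an assumed `require_login`-style helper:

```ruby
# Hypothetical refinement (option 3): /stats itself stays public and
# cacheable, since it takes no arbitrary date parameters, while the
# heavier sub-routes and date-range queries still require login.
class StatsController < ApplicationController
  before_action :require_login, except: [:index]
end
```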

jywarren added a commit that referenced this issue May 10, 2021
* Place /stats routes behind login check, for performance reasons

await approval in #9002, but this is a pretty simple change!

* Update public_pages_test.rb

* Update stats_controller_test.rb

* Update stats_controller_test.rb

* fix failing tests

Co-authored-by: Cess <cessmbuguar@gmail.com>
@jywarren (Member, Author)

We went ahead and implemented the login check for all /stats* routes! This means we should shortly be able to remove the password protection.

#9536

@icarito (Member) commented May 18, 2021

Today I commented out the password protection at the webserver level, and confirmed that this page is only accessible to logged-in users.

icarito closed this as completed May 18, 2021
reginaalyssa pushed a commit to reginaalyssa/plots2 that referenced this issue Oct 16, 2021
billymoroney1 pushed a commit to billymoroney1/plots2 that referenced this issue Dec 28, 2021