Stats page overload mitigation, passwording next steps discussion #9002
Hi,
A few quick 🙄 notes:
Stats Download And Site Overload: #5524
That is correct. But! I'll have to refresh my memory in the issues above; there are some caveats which I avoided by simply downloading chunks of data with those issues in mind, and a bot that wasn't in on that conversation probably isn't accounting for them. We solved one of them (#5490), which had to do with corrupted data. The other has to do with the fact that there is a lot more data from the period before the site had any spam countermeasures. I thought what we decided to do was to exclude data collected from the first iteration of the site entirely.
We may have discussed requiring people to fill out a form and request access. This is a strategy I've seen other people use. Typically you have to answer a few questions (it's not gatekeeping, just collecting internal data about who uses the data and what they use it for), and then agree to a few terms as well (citations, sharing, allowable usage, etc.).
This might be another one of those "our temporary solution was for people in the know to just not do the things that are bad" situations. I don't know if it was implemented, but I think we discussed a few potential strategies.
Thank you all for engaging with this! Have we ascertained whether the main issue is that some bot/person is entering a lot of date ranges and making the visualizations redraw, or that some bot/person is entering a lot of date ranges AND downloading the data?
Hi Liz,
Dear Sebastian, are you saying we are being disrupted by a time-traveling bot from the Future searching for activity in 2024? Serious inquiry 😜 To respond to Jeff's initial suggestions in this issue, could we combine 1 and 3 into a single approach?
Hi all, just returning here to note that if we implemented (1) we'd almost certainly avoid all bots, as they can't easily create user accounts. To expand on that: imagine we precached all monthly data, which is about 10 years x 12 months = 120 datasets. Then we'd probably be OK unless someone tried to download them all in quick succession after the caches expired. But if we allow any arbitrary start/finish date, the queries are re-run for those ranges, and the possible variety of ranges is near infinite, so we can't pre-cache them. If all are OK we can go ahead with 1+3 -- does this represent a temporary or permanent solution? Thanks!
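The precaching arithmetic above can be sketched in plain Ruby. This is a toy model, not code from the site: `MonthlyStatsCache` and `run_expensive_query` are illustrative names, and a Hash stands in for the real Rails cache. The point is that fixed month buckets give a bounded key space, so the expensive query runs at most ~120 times ever.

```ruby
# Toy model of the monthly-precache idea: fixed month buckets mean a
# bounded key space (~120 keys for 10 years), so repeat requests are
# cache hits. All class and method names here are illustrative.
class MonthlyStatsCache
  attr_reader :query_count

  def initialize
    @store = {}       # stands in for Rails.cache / memcached
    @query_count = 0  # counts hits on the expensive backing query
  end

  # Fetch stats for one calendar month, computing and caching on a miss.
  def fetch(year, month)
    key = format("stats/%04d-%02d", year, month)
    @store[key] ||= begin
      @query_count += 1
      run_expensive_query(year, month)
    end
  end

  private

  # Placeholder for the real database aggregation.
  def run_expensive_query(year, month)
    { year: year, month: month, notes: 0 }
  end
end

cache = MonthlyStatsCache.new
# Warm all 120 buckets (10 years x 12 months)...
(2010..2019).each { |y| (1..12).each { |m| cache.fetch(y, m) } }
# ...after which any repeat request runs no new queries.
cache.fetch(2015, 6)
puts cache.query_count  # 120: one query per bucket, ever
```

An arbitrary start/finish date doesn't fit this scheme, which is exactly why those ranges can't be precached.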
1+3 sounds great, Jeff, thank you so much for laying out these paths forward (and thanks Sebastian too!!!!!). Precaching all monthly data, which is about 10 years x 12 months = 120 datasets, sounds great. Also shouting out appreciation to @cesswairimu for the existence of our Stats system 😻
I've been looking at it from two perspectives of interaction:
I think we can expect that anyone who would like to work with the open data (even people who are just beginning) should have, or as a learning experience should acquire, the ability to download pre-defined amounts of data and slice out the part that they want to work with. Perhaps the zip file has subfolders by year. I'll leave it to you to figure out what/how to implement such a thing (including how often it is updated with new data), but my main point is that the data doesn't need to be packaged on demand; it can be prepackaged for download.
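The prepackaging idea could look something like the sketch below: a periodic job groups records by year and writes one static CSV per year, so users download fixed files instead of triggering live queries. The record fields, file names, and `prepackage_by_year` helper are all assumptions for illustration, not the site's actual schema or code.

```ruby
require "csv"
require "date"
require "tmpdir"

# Sketch of "prepackage, don't build on demand": a scheduled job writes
# one CSV per year into a downloads directory. Field names are illustrative.
def prepackage_by_year(records, out_dir)
  records.group_by { |r| Date.parse(r[:created_at]).year }.map do |year, rows|
    path = File.join(out_dir, "notes-#{year}.csv")
    CSV.open(path, "w") do |csv|
      csv << %w[id title created_at]
      rows.each { |r| csv << [r[:id], r[:title], r[:created_at]] }
    end
    path
  end
end

# Example run with dummy data:
records = [
  { id: 1, title: "First post",  created_at: "2013-05-01" },
  { id: 2, title: "Second post", created_at: "2013-11-12" },
  { id: 3, title: "Third post",  created_at: "2014-02-03" },
]
basenames = nil
Dir.mktmpdir do |dir|
  basenames = prepackage_by_year(records, dir).map { |p| File.basename(p) }.sort
end
puts basenames.inspect  # ["notes-2013.csv", "notes-2014.csv"]
```

How often the job runs (nightly, weekly) sets how fresh the downloads are, which is the "how often it is updated" question from the comment above.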
Future idea: rely on the front-end visualization to do all of the different aggregations. For example, you download a cached version of a year, and then use the filter and sort functions or a highlight-and-zoom implementation to achieve what you could have if you were able to download arbitrary dates in the first place. I don't know what the limits are on how much a front-end library can take on, but it's probably more than one might expect.
Sounds great, Benjamin!
OK, just circling back here, we've heard initial support in favor of 1 and 3, but then also now support in favor of 2 - shall we prioritize?
If this sounds good, we can take the first step soon. Thanks, all!
* Place /stats routes behind login check, for performance reasons (await approval in #9002, but this is a pretty simple change!)
* Update public_pages_test.rb
* Update stats_controller_test.rb
* Update stats_controller_test.rb
* fix failing tests

Co-authored-by: Cess <cessmbuguar@gmail.com>
We went ahead and implemented the login check for all /stats routes.
Today I've commented out the password protection at the webserver level, and confirmed that this page is only accessible to a logged-in user.
Just creating a coordination and discussion space for folks here -- I think there may be an info gap, so I just want to describe as much as I can so folks all know what's happening and we can identify any next steps.
It looks like the route https://publiclab.org/stats has experienced some very heavy loads due to bots, and in order to preserve our site, we've placed a password in front of that page. Thank you @icarito!
The original load (maybe "attack" is too strong a word?) happened on Nov 13th, as @icarito detected and reported in the chatroom. After initially blocking one bot, requests continued from a "masked" origin. @ebarry asked for temporary action to be taken due to an event that day, so @icarito placed the page behind a password to block anyone who didn't have it.
Interestingly, the bot was smart enough to hit unique date ranges each time, which was one reason the load it generated was so high and disruptive.
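Here is a toy illustration (not the site's actual caching code) of why unique ranges are so expensive: when a cache key embeds an arbitrary start/end pair, every distinct range is a guaranteed miss that re-runs the underlying query. The `range_key` helper and `:expensive_result` placeholder are assumptions for the sketch.

```ruby
require "date"

# Toy model: a cache keyed on arbitrary start/end dates. A bot that
# varies the range on every request never hits the cache, so every
# request pays full query cost. All names here are illustrative.
def range_key(start_date, end_date)
  "stats/range/#{start_date.iso8601}/#{end_date.iso8601}"
end

store  = {}
misses = 0
100.times do |i|
  start = Date.new(2020, 1, 1) + i       # shift the window each request
  key   = range_key(start, start + 30)
  store[key] ||= begin
    misses += 1                          # cache miss: "run the real query"
    :expensive_result
  end
end

puts misses      # 100: every unique range bypassed the cache
puts store.size  # 100 distinct keys, and growing with each new range
```

This is the flip side of the fixed-bucket precaching idea discussed above: bounded key spaces cache well, unbounded ones don't.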
Next steps
For community coordination reasons, Liz noted that the page should ideally be open to the public, so we should brainstorm a bit on what we can do to re-open it without requiring a password. I'll list some possible options here and we can discuss; if you have more ideas, please share!
Given that downloading what are essentially huge swaths of the database can be a heavy task that impacts site uptime and response time, what are some other compromises we could explore?
Thanks, all! And, we don't have to come up with the perfect solution if one of these will do for the time being!