Load testing for catalog-next #449
@avdata99 can you elaborate on why you want to look at the indexed URLs in the search engines? I wasn't sure how that's related.
I guess there are thousands of links that point to the catalog. When we move to the new one, we will send 404s for many of them. I'm not sure the list of indexed URLs is the best starting point; maybe the 1,000 or 10,000 most-visited URLs in Analytics would be better. I think it's a good idea to try to minimize this impact. If we get this list and test these URLs in the staging environment, maybe we can discover some clever redirects that might help here, or at least learn how big the impact will be.
Use access logs to determine current load on the production site
Began looking at locust and familiarizing myself with the tool. Note:
@danmayol can you share any code you have, like docker-compose.yml or a locust config? I think we'd want to run this from the jumpbox; however, we don't want to install docker or any containers there. That's not a deal breaker, we could run this from a laptop if we thought locust was the best tool.

Honestly, I'm looking for something simple that would take an apache access log or similar format and just randomize or replay requests based on the distribution of requests in there. If we can write some simple python to do that for locust, that sounds good. We're only testing read requests here, there's no write traffic on catalog, so this should be pretty simple.

I haven't used it before, but sar is fairly straightforward and I think covers everything we want (CPU, memory, disk, network), and it logs data on a configurable interval. I'm not sure how to measure the backing services like solr and RDS. If we have New Relic configured, we'll get some insight there.
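A minimal sketch of the log-replay idea above, assuming a combined-format apache log named access.log (the file name and regex are assumptions, not anything from this thread): count each requested path, then expose the counts as a weighted list so that random choices reproduce the observed distribution.

```python
import re
from collections import Counter

LOG_PATH = "access.log"  # hypothetical file name
# matches the request portion of a common/combined apache log line
request_re = re.compile(r'"GET (\S+) HTTP/')

counts = Counter()
with open(LOG_PATH) as f:
    for line in f:
        m = request_re.search(line)
        if m:
            counts[m.group(1)] += 1

# each path appears as many times as it was requested, so
# random.choice(weighted_paths) replays the log's distribution
weighted_paths = list(counts.elements())
```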
@adborden Absolutely! Much the same thoughts I had (not to over-engineer this). I've also been thinking that measuring CPU, memory, etc. is good for benchmarking resource usage under load and looking for variances, but the real value in load testing would be a raw number: how many concurrent users we can handle before the site becomes slow or unresponsive enough to impact actual use. In other words, even if CPU and memory usage are minimal, finding that we can only support 10 concurrent users before the site slows down is the more meaningful data point; at that point we would look at resource use to see whether it is the underlying cause of the performance breakdown. Having logging from something like sar would definitely be good, but I wonder if that is more a system monitoring goal than a load testing goal? (just brainstorming)

This is what I have been playing with in regard to locust. We should of course not be locked into this; it's simply where I started based on the ticket notes. docker-compose.yml:
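The attached compose file didn't survive the copy. As a rough sketch of what a locust master/worker compose file typically looks like (image, ports, and paths are assumptions, not the original contents):

```yaml
# a sketch, not the original file: one locust master (web UI on 8089)
# plus a worker service that can be scaled out
version: "3"
services:
  master:
    image: locustio/locust
    ports:
      - "8089:8089"
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --master
  worker:
    image: locustio/locust
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --worker --master-host master
```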
locustfile.py (simple test to run through a list of URL paths from a file and GET each URL):
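The original file wasn't captured; here's a minimal sketch of the described first version, using the current locust API (HttpUser/between) in place of the older min_wait/max_wait mentioned later in the thread:

```python
from locust import HttpUser, task, between

# hypothetical: read paths from the URL file named in this thread,
# skipping blank and comment lines
with open("locust-catalognext-urls") as f:
    URLS = [line.strip() for line in f
            if line.strip() and not line.startswith("#")]

class CatalogUser(HttpUser):
    wait_time = between(1, 5)  # seconds between tasks; host passed via --host

    @task
    def walk_urls(self):
        # first-version behavior (the flaw noted later in the thread):
        # a single task walks every URL back to back, so the wait time
        # only applies between complete passes over the file
        for url in URLS:
            self.client.get(url)
```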
locust-catalognext-urls (just some manually grabbed URLs; still need to get a better list or generate one from a log, as you mentioned):
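The attached list wasn't captured either; a few illustrative catalog paths of the kind it would contain (examples only, not the original list):

```
# illustrative paths only, not the original list
/dataset
/dataset?q=climate
/organization
/group
/dataset?tags=health
```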
If run manually through a GUI (local desktop), we define the number of users and spawn rate by hand. Or we can run headless and pass the desired values in (sample command line in the script). Since the container is being passed the command, switching modes is just a matter of changing that command. Let me know what you think and how we should proceed (we can discuss in scrum, zoom, or slack if easier). Thanks!
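The sample command line mentioned above wasn't captured in the copy; a headless invocation would look something like this (host, counts, and duration are placeholders):

```sh
# hedged example: run headless with users and spawn rate on the CLI
locust -f locustfile.py --headless -u 250 -r 10 --run-time 10m \
    --host https://catalog-next.example.com
```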
Aaron and I discussed this a bit on Monday. Some notes from that discussion:
Aside from the conversation: in further analyzing the first version of the locust test script above, I realized there was a flaw in the test logic. One locust task would walk through querying all the URLs in the provided file (as fast as it could), which limited our ability to control the delay periods (min_wait, max_wait) between queries and made it harder to get a true representation of the load effect. As such, I reworked it slightly so that each task calls a single URL, chosen at random from the URLs in the file (see the sketch below).

Last but not least, one item I neglected to mention originally: if desired, we can also scale the worker containers so that the load is not being generated by a single container (although still from a single host, of course). This is accomplished simply by adding the desired number of scaled workers to the docker-compose call, for example to run 5 worker containers:

Tar ball with all relevant files here:
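The exact scale invocation was lost in the copy, but it was presumably along the lines of `docker-compose up --scale worker=5`. Since the tarball link didn't survive either, here's a sketch of what the reworked task would look like (current locust API assumed, as in the earlier sketch):

```python
import random

from locust import HttpUser, task, between

with open("locust-catalognext-urls") as f:
    URLS = [line.strip() for line in f
            if line.strip() and not line.startswith("#")]

class CatalogUser(HttpUser):
    wait_time = between(1, 5)  # now applies between every single request

    @task
    def get_random_url(self):
        # one request per task, to a randomly chosen URL, so the
        # configured wait time actually governs the request rate
        self.client.get(random.choice(URLS))
```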
I pulled a week of logs from catalog production. They are not sorted in chronological order, so beware. https://drive.google.com/file/d/1cYvziM8IwIOeqj2cD6z6ychHIrvjqH6h/view?usp=sharing
@avdata99 here are requests to catalog.data.gov over a two-week period. This can be used to estimate the maximum number of concurrent users we're seeing today.
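One way to turn those logs into a concurrency estimate, as a hedged sketch (the combined-log timestamp format and file name are assumptions): bucket requests per minute and look at the busiest bucket.

```python
import re
from collections import Counter

# matches e.g. "[10/Oct/2020:13:55" from a combined-format log line,
# capturing the timestamp down to the minute
minute_re = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})')

per_minute = Counter()
with open("access.log") as f:  # hypothetical file name
    for line in f:
        m = minute_re.search(line)
        if m:
            per_minute[m.group(1)] += 1

busiest_minute, hits = per_minute.most_common(1)[0]
print(f"peak: {hits} requests during {busiest_minute}")
```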
Next steps:
New test results
Basic tests
Test from apache logs
We disabled
Update: Currently investigating why the organization index is so slow |
The slowest URL is |
Currently there's no CloudFront CDN enabled for catalog-next. There will be as part of the launch. There were recent changes to the cache logic in ckan/ckan#4781 but I think those only landed in CKAN 2.9. |
Same test with 50 users (instead of 250)
Here are some things to try:
Test with 250 users after solr optimization
Test with 75 users and static assets
User Story
As a data.gov operator, I want to load test catalog-next so that I have confidence that there are no critical performance issues that should be resolved prior to catalog-next's launch.
Acceptance Criteria
GIVEN a load comparable to current production traffic
WHEN I apply this load to catalog-next
AND I analyze the performance logs post-test
THEN I don't see any critical performance issues in CPU, Memory, Disk, Network, or Database throughput
Details / tasks
Notes
https://locust.io/ was recommended as a load testing tool in the CKAN Gitter channel