Performance Test - scheduled for 2024-06-07 #162
The specific backend branch being tested is 161-paginate-via-search-facet-requests. I believe it to be a branch of 160-paginate-non-semantic-facet-requests, which is a branch of release1.18. In case this test is a raging success and we have time for a second test, we could re-run #132: include alternative names when resolving name search criteria. I created the 161-pagination-PLUS-100-alternative-names branch, which is the above plus #100's edits. cc: @gigamorph, @kamerynB, @prowns, @jffcamp, @clarkepeterf
Middle tier stats:
There were fewer backend requests than expected -- yet there are no significant differences in the total number of executed Virtual User (VU) flows/transactions, overall error rate, or average response times between this test, its baseline (22 Jun 23), and a more recent test (14 Feb 24). Test duration doesn't explain it either. Comparison:
If the system isn't being pushed as hard as in previous tests, the results of this test could be skewed. Regardless, today's test was so much better than those on 9 May 24 (#132) and 22 May 24 (#151), as there were zero v8 engine crashes! Possible explanations thought of to date (not mutually exclusive):
* Logs were trimmed to 1:55. The QA report states the test ran for 2:15. Found a note that read "Test stopped after full ramp up but did not run for the full 15 minutes afterwards." Regardless, for the purposes of this comparison, 22 Jun 23's test running shorter would skew the results to the benefit of 7 Jun 24's test.
@gigamorph noticed that there were thousands of 504s (gateway timeouts) from the ML load balancer early on. As shown above, that persisted throughout the test, but 504s at the web cache layer never exceeded 100 per minute. The likely explanation is that the data service proxies automatically retried the 504s. Looking at this and previous tests' app server metrics, a contributing factor could be a full request queue for the
ML Monitoring History

Time period: 1700 - 1920 UTC

[Charts omitted: CPU; File IO Detail; Memory; Intra-cluster activity (1 of 2); Intra-cluster activity (2 of 2); both HTTP app servers; each HTTP app server individually, with captions noting a difference between them; data node characteristics.]
ML App Server Metrics

The following reflects that group one's queue was full for the majority of the test. This could explain many of the 504s the MarkLogic load balancer was returning. Please see the comment above for more details and suggested follow-up.

Script: extractAppServerQueueMetrics.js*
Input: 20240607-1700-1920-app-server-stats.json
Output:
* The input includes one instance of a queue size of 37, which exceeds the maximum. I changed the script to use 36 for group one, then added one to that total to account for the single report of 37.
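For context, the script itself isn't included in this ticket. Below is a minimal sketch of the kind of tally such a script might perform; the input shape ({ group, queueSize } samples) is an assumption for illustration, not the actual structure of the exported stats file.

```js
// Hypothetical sketch only; assumes the stats file flattens to an array of
// periodic samples like { group: 'group-one', queueSize: 12 }.
const fs = require('fs');

const MAX_QUEUE_SIZE = 36; // clamp the single reported 37, per the footnote

function tallyQueueSizes(samples) {
  const byGroup = new Map();
  for (const { group, queueSize } of samples) {
    // Treating the one sample of 37 as 36 counts it as a full queue, which
    // matches adding one to the group-one total as described above.
    const size = Math.min(queueSize, MAX_QUEUE_SIZE);
    if (!byGroup.has(group)) byGroup.set(group, { full: 0, total: 0 });
    const tally = byGroup.get(group);
    tally.total += 1;
    if (size === MAX_QUEUE_SIZE) tally.full += 1;
  }
  return byGroup;
}

const samples = JSON.parse(
  fs.readFileSync('20240607-1700-1920-app-server-stats.json', 'utf8')
);
for (const [group, { full, total }] of tallyQueueSizes(samples)) {
  console.log(`${group}: queue full in ${full} of ${total} samples`);
}
```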
OS-level metrics (via sar): These were downloaded from Green's nodes, but the filenames do not reflect nodes 104, 73, and 20.

sar_ip-10-5-156-171.its.yale.edu_2024-06-07T165241.out.gz
Looks like two requests failed due to an attempt to send too much data between nodes:
These line up with these entries:
Encountered five
Nodes and times associated with all instances:
cc: @clarkepeterf
While there is a … the response status code for all was 500, meaning the data service proxies would not have retried them.
From 22 Jun 23:
cc: @clarkepeterf
Out of the remaining non-trace-event entries found in the 8003 and 8004 application error logs, there are three
Request response times:
Mined log output:
Executive Summary
* #162: perf test procedure and issue template updates.
* Removed a runtime-related list check for target scopes other than item and work. The set scope is now a valid target scope, and this trace event has never otherwise fired.
From the middle tier application log:

There are some peculiar errors that may be worth noting. The root cause must be some error from the frontend, but @brent-hartwig, would you say this error "Document is not JSON" was returned by MarkLogic?
There were 1,366 of them. Also interesting is that there were only(?) five JS-BAD errors, and all of them involved the same data:
And there were 22 errors about retries for 504, e.g.,
As previously discussed, this is behavior of the marklogic client library that we cannot change, but 31 attempts over 121 seconds seems like too much; it doesn't quite make sense to me.
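To make the concern concrete, here's roughly the kind of bounded retry budget that would make more sense than 31 attempts over 121 seconds. This is a hypothetical wrapper, not the marklogic library's API; callBackend, err.statusCode, and the limits are all assumptions for illustration.

```js
// Illustrative wrapper only; not part of the marklogic client library.
// Retries 503/504 (transient, e.g., a full app server request queue) with
// exponential backoff, bounded by both attempt count and elapsed time.
async function withRetryBudget(callBackend, { maxAttempts = 5, maxElapsedMs = 15000 } = {}) {
  const start = Date.now();
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callBackend();
    } catch (err) {
      lastError = err;
      // 500s are not retried, consistent with the proxy behavior noted above.
      if (err.statusCode !== 503 && err.statusCode !== 504) throw err;
      if (Date.now() - start >= maxElapsedMs) break;
      // Backoff: 250ms, 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, 250 * 2 ** (attempt - 1)));
    }
  }
  throw lastError; // retry budget exhausted
}
```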
@gigamorph, thanks for pointing out the 'document is not JSON' errors. I encountered them and thought I had chased down the issue, but must have forgotten about it. Here's a complete set of entries per instance:
Based on the grep results/count of the middle line, all instances occurred within searchEstimate.mjs, leading me to believe something in the performance test is compelling the middle tier to send invalid search criteria. Undoubtedly there are other possibilities, but I don't know that the backend logs will help with that determination.
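For anyone who wants to reproduce the count, here is the same idea in code. It is hypothetical, since the actual grep pipeline and the middle tier's log layout aren't captured above; it assumes the module name (e.g., searchEstimate.mjs) appears on the line after the error message.

```js
// Hypothetical log-mining sketch; log path and line layout are assumptions.
const fs = require('fs');

function countDocumentNotJsonByModule(logPath) {
  const lines = fs.readFileSync(logPath, 'utf8').split('\n');
  const byModule = new Map();
  lines.forEach((line, i) => {
    if (!line.includes('Document is not JSON')) return;
    // Check the following ("middle") line for a *.mjs module reference.
    const match = (lines[i + 1] || '').match(/([\w-]+\.mjs)/);
    const module = match ? match[1] : 'unknown';
    byModule.set(module, (byModule.get(module) || 0) + 1);
  });
  return byModule;
}

console.log(countDocumentNotJsonByModule('middle-tier-application.log'));
```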
@gigamorph, the five |
Agreed. Since the backend logs document … That aside, we have also discussed submitting a support ticket / RFE requesting more control over data service retry behavior. I'm not opposed to doing so but wouldn't use this example. Hopefully the middle tier receives 504s (or 503s) when an app server request queue is full. Such requests should be retried. One of 119 successful requests for the above document:
Related to there being fewer than expected backend requests...

Frontend to Backend Request Ratio

During a performance test, how many frontend requests should there be for every backend request? We have never attempted to establish this metric. We would either have to account for requests the frontend makes to other sources (such as for images and analytics) or include them. When included, this test has 105% to 202% more frontend requests compared to the 14 Feb 24 and 22 Jun 23 tests, respectively. Since SI was introduced in Mar 2024, we can't really compare to tests older than that. Perhaps we can exclude those other requests. We asked QA if their test logs can tell us how many middle tier requests there were, ideally by status code. (Middle tier requests have paths beginning with "/api/" or "/data/".)

Middle Tier to Backend Request Ratio

@gigamorph was able to determine there were 363,788 middle tier requests during the test via:
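(The command itself isn't shown here. Purely as an illustration of one way to produce such a count, assuming combined-format access logs; the file name and line format below are assumptions.)

```js
// Hypothetical sketch: count middle tier requests (paths starting with
// "/api/" or "/data/") by response status code from an access log whose
// lines resemble:
//   10.5.156.171 - - [07/Jun/2024:17:00:01 +0000] "GET /api/search?q=x HTTP/1.1" 200 1234
const fs = require('fs');
const readline = require('readline');

async function countByStatus(logPath) {
  const counts = new Map();
  const rl = readline.createInterface({ input: fs.createReadStream(logPath) });
  for await (const line of rl) {
    const match = line.match(/"[A-Z]+ (\/(?:api|data)\/\S*) HTTP\/[^"]*" (\d{3})/);
    if (!match) continue;
    const status = match[2];
    counts.set(status, (counts.get(status) || 0) + 1);
  }
  return counts;
}

countByStatus('access.log').then((counts) => {
  let total = 0;
  for (const [status, n] of [...counts.entries()].sort()) {
    console.log(`${status}: ${n}`);
    total += n;
  }
  console.log(`total: ${total}`);
});
```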
Given that most if not all middle tier requests result in multiple backend requests, neither of us would expect more middle tier requests than backend requests, yet that's what the counts show: there were only 307,511 backend requests. We do not yet know why. Middle tier request counts by status code:
Questions:
Sampling of requests with 503 response status codes:
@brent-hartwig SI was implemented in March. For almost every clickable action, there is an SI event being pushed. I'm not sure if that helps though.
Thanks, @kamerynB. It does. I updated my comment accordingly.
Seong researched and provided the following explanation. I agree we do not need to pursue this. In this test, they only accounted for 3% of the middle tier responses.
Decisions on executive summary items from conversation with @jffcamp and @prowns today:
Closing this ticket as all findings have been "acknowledged, understood, issue submitted, or resolved". In most cases, links were added to #181 comments that continue the associated investigation, or links to new tickets were added.
Primary objective:
Test the system stability and performance impact of #160. Given previous analysis, we believe #160 will:
If available, #161 can be part of this performance test as well. Given that the aforementioned analysis was of a search for works, semantic facets were not involved; as such, #161 is welcome but not required by this test.
Code and Configuration Changes:
Required: implementation of #160 and associated frontend and middle tier implementations.
Optional: implementation of #161; if included, its associated frontend and middle tier implementations are required.
Environment and versions (update as needed):
Scenario AK of the Perf Test Line Up: "See if paginating facet requests (#160 and #161) restores system stability by reducing --if not eliminating-- v8 engine crashes."
Backend application server configuration:
For more information please see the documentation: LUX Performance Testing Procedure
Tasks:
v8 delay timeout
Collect Data (Details from procedure):
Restore and Verify Environment:
Analyze:
Notes: