
split overview to make more parallel queries? #73

Open
rahulbot opened this issue Jun 12, 2024 · 2 comments
@rahulbot (Contributor)

Right now the overview endpoint does a lot of lifting. It generates daily counts, top langs, top domains, top TLDs, and total count all in one query into ES. Would real-world system performance be faster if we split these apart into separate endpoints?

Our web server UI and architecture assume these can all be fetched in parallel. In fact, to avoid duplicative queries right now, when someone clicks "search" in the web UI we wait for the first results to come back and be cached, so that subsequent calls hit the cache (since they all call overview under the hood).

The end result of changing approaches would be more parallel ES queries, each doing less work. We should figure out whether there is some way to test whether this would improve user-facing search performance without first making the requisite changes all up and down the stack.
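To make the proposed split concrete, here is a minimal sketch of fetching each overview facet as its own smaller query run in parallel. All names here (`run_es_query`, the facet list, the aggregation bodies) are hypothetical stand-ins, not the actual API; a real version would issue `es.search(...)` calls against the index.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an Elasticsearch call; the real system would
# send es.search(...) with the matching aggregation body for each facet.
def run_es_query(facet: str, query: str) -> dict:
    bodies = {
        "daily_counts": {"date_histogram": {"field": "publication_date"}},
        "top_langs":    {"terms": {"field": "language"}},
        "top_domains":  {"terms": {"field": "canonical_domain"}},
        "top_tlds":     {"terms": {"field": "tld"}},
        "total_count":  {"value_count": {"field": "_id"}},
    }
    # Stubbed response: echo back which aggregation would have run.
    return {"facet": facet, "aggregation": bodies[facet]}

def overview_parallel(query: str) -> dict:
    """Fetch each overview facet as its own (smaller) ES query, in parallel."""
    facets = ["daily_counts", "top_langs", "top_domains",
              "top_tlds", "total_count"]
    with ThreadPoolExecutor(max_workers=len(facets)) as pool:
        results = pool.map(lambda f: run_es_query(f, query), facets)
    return {r["facet"]: r["aggregation"] for r in results}

print(sorted(overview_parallel("climate change")))
```

The trade-off the rest of this thread debates: each query is cheaper individually, but the server now handles five concurrent requests per search instead of one.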

@rahulbot added the enhancement label Jun 12, 2024
@philbudne (Contributor)

I have a variety of thoughts:

  • If the N-S-A doesn't already have a way to issue individual queries, add one: this will allow further testing, and reduce waste in cases where an N-S-A client doesn't care about some parts of the "overview" results.

  • Once the queries can be issued separately, try serial vs. parallel queries. A "simulated user" script using aiohttp could be a useful tool for testing parallel queries. To assess system impact, I'd suggest the ability to generate random search terms (inserted into "typical" research queries) and the ability to loop (with a command-line option for how many times) in order to get a sampling of times. If an individual call returns very quickly, the samples could be swamped by variance due to system load/conditions, and the average of multiple calls (each with new random values) might be needed.

  • To assess the system impact of parallelization with multiple active users, run multiple instances of the "simulated user".

  • The "ministat" command might be useful in seeing whether any given change has statistical significance.

  • Look at the server "load averages" in the "Server stats" grafana dashboard while running tests!

  • As a server person, I have misgivings about assuming things will always be faster if you throw more requests at the server in parallel. It may work in "good" situations and bring the server to its knees in others: servers have finite resources, and tarbell is easily taxed. Using a "thread per request" (as the Django backend currently does, with 1043 processes and 37938 threads) consumes memory: tarbell often has 80% of memory in use and, ISTR, very little swap space, and we have in the past seen load averages (average runnable processes) of 90 and above when the system has only 32 CPU cores.

Pushing all parallelism out to the JS App in the browser can also mean different user experiences based on the user's browser: different revisions of different browsers may have very different concurrent connection limits.
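The "simulated user" idea above can be sketched as below. Everything here is hypothetical: the stubbed coroutine and its sleep time stand in for real aiohttp requests to per-facet endpoints, and the term list is invented. The loop-with-random-terms structure follows the suggestion of averaging many samples to smooth out variance from system load.

```python
import asyncio
import random
import statistics
import time

# Hypothetical search terms to insert into "typical" research queries.
TERMS = ["climate", "election", "economy", "health", "energy"]
FACETS = ["daily_counts", "top_langs", "top_domains", "top_tlds", "total_count"]

async def fake_query(facet: str, term: str) -> None:
    # Stand-in for an aiohttp GET against a per-facet endpoint;
    # the sleep simulates server-side query work.
    await asyncio.sleep(0.02)

async def one_search_serial(term: str) -> float:
    start = time.perf_counter()
    for f in FACETS:
        await fake_query(f, term)
    return time.perf_counter() - start

async def one_search_parallel(term: str) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_query(f, term) for f in FACETS))
    return time.perf_counter() - start

async def simulate(n: int) -> tuple[float, float]:
    # Loop n times with a fresh random term each iteration, so the
    # averages smooth out variance from load/conditions.
    serial = [await one_search_serial(random.choice(TERMS)) for _ in range(n)]
    parallel = [await one_search_parallel(random.choice(TERMS)) for _ in range(n)]
    return statistics.mean(serial), statistics.mean(parallel)

serial_avg, parallel_avg = asyncio.run(simulate(5))
print(f"serial avg {serial_avg:.3f}s, parallel avg {parallel_avg:.3f}s")
```

The per-run timing samples could be written to two files and fed to ministat to check whether an observed difference is statistically significant, as suggested above.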

@pgulley pgulley self-assigned this Jul 3, 2024
@pgulley pgulley added this to the July milestone Jul 3, 2024
@pgulley pgulley moved this from Todo to In Progress in Ingest + Index Infrastructure Jul 8, 2024
@pgulley (Member)

pgulley commented Jul 23, 2024

I've begun the work to break this out in #89

@pgulley pgulley modified the milestones: 2 - July, 3 - August Jul 31, 2024
@pgulley pgulley modified the milestones: 3 - August, 4 - September Aug 28, 2024
Labels: enhancement
Projects: Ingest + Index Infrastructure (Status: In Progress)
Development: No branches or pull requests
3 participants