Right now the `overview` endpoint does a lot of lifting. It generates daily counts, top languages, top domains, top TLDs, and the total count, all in one query to ES. Would splitting these apart into separate endpoints be faster in real-world system performance?
Our web server UI and architecture assumes these can all be fetched in parallel. In fact, to avoid duplicative queries, when someone clicks "search" in the web UI we currently wait for the first results to come back and be cached, so that subsequent calls hit the cache (since they all call `overview` under the hood).
The end result of changing approaches would be more parallel ES queries, but each would do less work. We should figure out whether there is some way to test if this would improve user-facing search performance without making the requisite changes all up and down the stack.
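For illustration, here is a minimal sketch of what "splitting" might mean at the ES level. The index name, field names, and query shape below are assumptions, not the real schema: the point is that the combined request carries several aggregations, while the split version issues one small request per aggregation in parallel (the total hit count comes back with any of them).

```python
# Sketch only: index name and field names are placeholders, not the real mapping.
import asyncio
import aiohttp

ES_URL = "http://localhost:9200/stories/_search"  # hypothetical index

# Roughly what the single "overview" query might look like: one request, many aggs.
COMBINED = {
    "size": 0,
    "query": {"query_string": {"query": "climate change"}},
    "aggs": {
        "daily":   {"date_histogram": {"field": "publication_date", "calendar_interval": "day"}},
        "langs":   {"terms": {"field": "language", "size": 10}},
        "domains": {"terms": {"field": "domain", "size": 10}},
        "tlds":    {"terms": {"field": "tld", "size": 10}},
    },
}

# The split alternative: one lighter request per aggregation, issued concurrently.
SPLIT = [
    {"size": 0, "query": COMBINED["query"], "aggs": {name: agg}}
    for name, agg in COMBINED["aggs"].items()
]

async def run_split():
    async with aiohttp.ClientSession() as session:
        async def one(body):
            async with session.post(ES_URL, json=body) as resp:
                return await resp.json()
        # Fire all of the single-aggregation searches in parallel.
        return await asyncio.gather(*(one(b) for b in SPLIT))

if __name__ == "__main__":
    asyncio.run(run_split())
```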
If the N-S-A doesn't already have a way to issue these queries individually, add endpoints for them: this will allow further testing, and reduce waste in cases where an N-S-A client doesn't care about some parts of the "overview" results.
Once the queries can be issued separately, try serial vs. parallel queries; a "simulated user" script using aiohttp (sketched below) could be a useful tool in testing parallel queries. To assess system impact, I'd suggest having the ability to generate random search terms (inserted into "typical research queries"), and the ability to loop (with a command-line option for how many times) in order to get a sampling of times. If an individual call returns very quickly, the samples could be swamped by variance due to system load/conditions, and the average of multiple calls (each with new random values) might be needed.
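A minimal sketch of such a "simulated user" script follows. The base URL, endpoint part names, and query templates are all hypothetical placeholders for whatever the real split endpoints end up being; the structure (random terms, a `--loops` option, serial vs. parallel modes, one timing sample per line for ministat) is the part that matters.

```python
#!/usr/bin/env python3
"""Sketch of a "simulated user" timing script (endpoint paths are placeholders)."""
import argparse
import asyncio
import random
import time

import aiohttp

BASE = "http://localhost:8000/api/search"  # hypothetical base URL
PARTS = ["daily-counts", "top-languages", "top-domains", "top-tlds", "total"]
TEMPLATES = ['"climate {}"', '"election {}" AND policy', '{} OR vaccination']
WORDS = ["mitigation", "coverage", "misinformation", "funding", "hesitancy"]

def random_query() -> str:
    # Insert a random term into a "typical research query" template.
    return random.choice(TEMPLATES).format(random.choice(WORDS))

async def fetch(session, part, q):
    async with session.get(f"{BASE}/{part}", params={"q": q}) as resp:
        await resp.read()

async def one_search(session, q, parallel: bool) -> float:
    start = time.monotonic()
    if parallel:
        await asyncio.gather(*(fetch(session, p, q) for p in PARTS))
    else:
        for p in PARTS:
            await fetch(session, p, q)
    return time.monotonic() - start

async def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--loops", type=int, default=10, help="number of timing samples")
    ap.add_argument("--parallel", action="store_true", help="issue part queries concurrently")
    args = ap.parse_args()
    async with aiohttp.ClientSession() as session:
        for _ in range(args.loops):
            elapsed = await one_search(session, random_query(), args.parallel)
            print(f"{elapsed:.3f}")  # one sample per line, ready to feed to ministat

if __name__ == "__main__":
    asyncio.run(main())
```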
To assess the system impact of parallelization with multiple active users, run multiple instances of the "simulated user" script.
The "ministat" command might be useful in seeing whether any given change has statistical significance.
Look at the server "load averages" in the "Server stats" grafana dashboard while running tests!
As a server person, I have misgivings about assuming things will always be faster if you throw more requests at the server in parallel. It may work in "good" situations and bring the server to its knees in others: servers have finite resources, and tarbell is easily taxed. Using a "thread per request" model (as the Django backend currently does, with 1043 processes and 37938 threads) consumes memory: tarbell often has 80% of memory in use and, ISTR, very little swap space, and we have in the past seen load averages (average runnable processes) of 90 and above when the system has only 32 CPU cores.
Pushing all parallelism out to the JS App in the browser can also mean different user experiences based on the user's browser: different revisions of different browsers may have very different concurrent connection limits.