Server-side aggregation of matches for many pieces of content when testing rules #344
Replies: 3 comments
-
This seems like the right approach to me (handling server-side rather than client-side).
-
Is it possible this will create enough work for CAPI to affect its performance? Should we check in with that team? I ask because we'll probably want to check against a very large number of articles, probably more than most queries that go to CAPI. On the other hand, we don't create new rules that often, so we might not need to run large numbers of corpus checks.
-
We can, and probably should do that! My assumption here is that, as Michael B. once said to me, 'CAPI is hardcore'. Taking a look at the status page, it's currently serving ~283 reqs/second across private and public accounts. It's safe to say that we'll want to cache our results for some duration no matter what happens. I suspect we could get away with a simple time-expired cache that keeps pages for some reasonable duration, TBC.

One place that will be harder: the checker. We should consider the load there, as lots of checking may affect PROD Typerighter users. Having said that, a cache might be useful here too, as I suspect users will go backwards and forwards between the same pattern often, especially if they're working to understand the difference between pattern A and pattern B. Even if there's no real impact on load, the speed benefit will help our users. There are standard, powerful cache implementations available as part of Play, so this shouldn't be too much work.

We might also want to look at prioritising traffic within the rule management service, but I think we should look at the real impact of checking rules on the PROD service before we take this step. The service routinely checks 5,000-word pieces with ~13,000 rules, with a p95 check duration of 500ms at most and 1–200ms on average, so 5,000,000 words with 1 rule feels like it'll be within an order of magnitude (roughly 5,000 × 13,000 ≈ 65M word-rule comparisons for a routine check, versus 5,000,000 × 1 = 5M for a corpus check). We'll find out.

I think there are some product questions to answer here. On my mind: do we have different sorts of checks? For example, a deterministic, 'standard' check with a large but predefined search, plus a CAPI search check to cover particular pieces? The 'standard' check is a good noise check, but may be inadequate when checking matches against neologisms etc.
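As a rough sketch of the Play option (a time-expired cache wrapped around the corpus check, using Play's `AsyncCacheApi`): the `CorpusChecker` trait, `CorpusCheckResult` type and the 15-minute TTL below are placeholders, not decisions.

```scala
import javax.inject.Inject
import scala.concurrent.Future
import scala.concurrent.duration._
import play.api.cache.AsyncCacheApi

// Placeholder types for illustration; the real checker interface will differ.
case class CorpusCheckResult(ruleId: String, matchCount: Int)

trait CorpusChecker {
  def run(ruleId: String, pattern: String): Future[CorpusCheckResult]
}

class CachedCorpusChecker @Inject() (
    cache: AsyncCacheApi,
    checker: CorpusChecker
) {
  // Time-expired cache: long enough to help a user flicking back and forth
  // between patterns, short enough that results don't go too stale. TTL TBC.
  private val ttl = 15.minutes

  def check(ruleId: String, pattern: String): Future[CorpusCheckResult] =
    cache.getOrElseUpdate(s"corpus-check:$ruleId:${pattern.hashCode}", ttl) {
      checker.run(ruleId, pattern)
    }
}
```

The same sort of key scheme would let us cache CAPI page fetches separately from checker results if we end up wanting different TTLs for each.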
-
We would like to be able to test a new rule against existing content in CAPI. The diagram below shows one way we might do this.
We must communicate with CAPI and the checker service to do this. I think we should prefer orchestrating on the server:
a single request from the client can then cover n articles in CAPI, with the server fetching the content, calling the checker, and aggregating the matches.
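A minimal sketch of that server-side orchestration, assuming hypothetical `CapiClient` and `CheckerClient` interfaces (the real CAPI client and checker APIs will look different):

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical models and clients, for illustration only.
case class Article(id: String, bodyText: String)
case class RuleMatch(articleId: String, matchedText: String)

trait CapiClient {
  def searchArticles(query: String, page: Int, pageSize: Int): Future[Seq[Article]]
}

trait CheckerClient {
  def check(ruleId: String, articles: Seq[Article]): Future[Seq[RuleMatch]]
}

class CorpusTestService(capi: CapiClient, checker: CheckerClient)(implicit ec: ExecutionContext) {

  // Page through CAPI sequentially, send each page of articles to the
  // checker, and aggregate the matches into a single response for the client.
  def testRuleAgainstCorpus(
      ruleId: String,
      query: String,
      pages: Int,
      pageSize: Int = 50
  ): Future[Seq[RuleMatch]] =
    (1 to pages).foldLeft(Future.successful(Seq.empty[RuleMatch])) { (acc, page) =>
      for {
        matchesSoFar <- acc
        articles     <- capi.searchArticles(query, page, pageSize)
        matches      <- checker.check(ruleId, articles)
      } yield matchesSoFar ++ matches
    }
}
```

Paging sequentially rather than in parallel is deliberate in this sketch: it trades a slower corpus check for a bounded request rate against CAPI and the checker, which fits the performance concerns raised above.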