validate references #30

tomreitz · 2024-05-08T19:32:58Z

This PR implements reference validation - a new feature whereby lightbeam validate will attempt to resolve Reference properties within a payload by first looking up the natural key values in local JSONL files for the referenced resource ("local references"), and if none are found, then doing a GET against the Ed-Fi API with the same natural key values ("remote references"). If a given reference is neither local nor remote, the payload will fail validation.
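
To illustrate the lookup order described above, here is a minimal sketch rather than the code in this PR. The names `natural_key`, `reference_resolves`, `local_keys`, and `remote_lookup` are hypothetical. It assumes local JSONL data has already been reduced to a set of natural keys per resource, and that the remote check is a callable wrapping a GET against the Ed-Fi API.

```python
from typing import Callable, Dict, Set, Tuple

NaturalKey = Tuple[Tuple[str, object], ...]


def natural_key(reference: dict) -> NaturalKey:
    """Hashable, order-independent form of a reference's natural key values.

    The optional "link" property is ignored (an assumption of this sketch).
    """
    return tuple(sorted((k, v) for k, v in reference.items() if k != "link"))


def reference_resolves(
    resource: str,
    reference: dict,
    local_keys: Dict[str, Set[NaturalKey]],
    remote_lookup: Callable[[str, dict], bool],
) -> bool:
    """Return True if the reference is local or, failing that, remote."""
    key = natural_key(reference)
    if key in local_keys.get(resource, set()):
        return True  # "local reference": found in local JSONL data
    # "remote reference": ask the API with the same natural key values
    return remote_lookup(resource, dict(key))
```

A payload would fail validation when `reference_resolves` returns False for any of its Reference properties.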

Because this feature can be slow (see comment 1 below), by default it is disabled. This is achieved by adding a new optional structure in lightbeam.yaml as follows:

validate:
  methods:
    - schema # checks that payloads conform to the Swagger definitions from the API
    - descriptors # checks that descriptor values are either locally-defined or exist in the remote API
    - uniqueness # checks that local payloads are unique by the required property values
    - references # NEW checks that references resolve

(In addition to the above, one more basic validation method is done before any of these: that the payload is valid JSON. This validation method cannot be disabled because all the others require a valid JSON payload.) The default validate.methods (if unspecified by the user in lightbeam.yaml) are ["schema", "descriptors", "uniqueness"].
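
As a rough sketch of how the default could be applied, assuming the parsed lightbeam.yaml is available as a plain dict: the function name `resolve_validation_methods` and the rejection of unknown method names are assumptions of this example, not necessarily what lightbeam does.

```python
DEFAULT_METHODS = ["schema", "descriptors", "uniqueness"]
KNOWN_METHODS = DEFAULT_METHODS + ["references"]


def resolve_validation_methods(config: dict) -> list:
    """Return the validation methods to run, falling back to the defaults."""
    validate_cfg = config.get("validate") or {}
    methods = validate_cfg.get("methods") or DEFAULT_METHODS
    # Hypothetical behavior: reject method names this sketch does not know.
    unknown = [m for m in methods if m not in KNOWN_METHODS]
    if unknown:
        raise ValueError(f"unknown validation methods: {unknown}")
    return list(methods)
```

With this shape, `references` only runs when the user lists it explicitly.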

I've tested the functionality and it works, however I'm leaving this PR as a draft for several reasons:

1. The feature is slow - due to the serial nature of lookups, it took 3+ minutes to validate the schoolReferences and studentReferences in 954 studentSchoolAssociation payloads. I have a few ideas for how to improve performance, but I'm unsure how much effort to invest in this optimization work.
2. Because of 1 above, it seems prudent to allow reference validation to fail quickly once some user-specifiable threshold is reached. I propose a configuration structure (in lightbeam.yaml) like this:

validate:
  references:
    max_failures: 10 # stop testing after 10 failed payloads ("fail fast")

An open question is whether a default value of 10 max_failures is reasonable.
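
A minimal sketch of the proposed "fail fast" behavior, under the assumption that payloads are checked one at a time. The function `validate_with_fail_fast` and the `is_valid` callable are hypothetical names for this example only.

```python
def validate_with_fail_fast(payloads, is_valid, max_failures=10):
    """Check payloads in order, stopping once max_failures have failed.

    Returns (failed_line_numbers, number_of_payloads_checked).
    Passing max_failures=None disables the early exit.
    """
    failures = []
    checked = 0
    for line_number, payload in enumerate(payloads, start=1):
        checked += 1
        if not is_valid(payload):
            failures.append(line_number)
            if max_failures is not None and len(failures) >= max_failures:
                break  # "fail fast": skip the remaining payloads
    return failures, checked
```

The trade-off is that the user sees at most `max_failures` problems per run, in exchange for not waiting on lookups for a file that is already known to be bad.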

Another open question is whether we should also "succeed fast": if

An unresolved issue is how to handle "typed" resources. For example, in studentEducationOrganizationAssociation the educationOrganizationReference property may be

{
  "educationOrganizationId": 255901,
  "link": {
    "rel": "LocalEducationAgency",
    "href": "/ed-fi/localEducationAgencies/5ec280c188db4f0bae9ef60e2ae5c231"
  }
}

or

{
  "educationOrganizationId": 255901044,
  "link": {
    "rel": "School",
    "href": "/ed-fi/school/7381540c0eff4b778d0399ce8b397c9a"
  }
}

Currently, this reference validation implementation determines what resource is being referenced based on the reference property name, i.e., schoolReference must be a reference to a school. For educationOrganizationReference, in principle one could derive the resource from link.rel which more specifically denotes the type of educationOrganization (localEducationAgency, school, etc.). However lightbeam JSON payloads created with earthmover often omit the link property and instead simply look like

{
  "educationOrganizationId": 255901044
}

from where it is impossible to determine what type of educationOrganization is being referenced. Possible solutions might be

* having lightbeam know internally which Ed-Fi resources are members of which "type" and iteratively look for the referenced values across all of them - so first look for schools with schoolId=255901, then for localEducationAgencies with localEducationAgencyId=255901, etc.
* requiring link.rel (at least) to be present in JSONL payloads when using reference validation
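
The first option could look roughly like the sketch below. The `EDORG_RESOURCES` list is a partial, illustrative set of educationOrganization subtypes, and `resolve_ed_org` and the `exists` callable (standing in for the local-then-remote lookup) are hypothetical names.

```python
# Partial, illustrative list of educationOrganization subtypes and their
# ID properties; the order here is an assumption of this sketch.
EDORG_RESOURCES = [
    ("schools", "schoolId"),
    ("localEducationAgencies", "localEducationAgencyId"),
    ("stateEducationAgencies", "stateEducationAgencyId"),
]


def resolve_ed_org(ed_org_id, exists):
    """Return the first resource in which ed_org_id is found, else None.

    `exists(resource, params)` is expected to return True when a record
    with those natural key values exists locally or remotely.
    """
    for resource, id_property in EDORG_RESOURCES:
        if exists(resource, {id_property: ed_org_id}):
            return resource
    return None
```

The cost of this approach is up to one lookup per subtype for every unresolved educationOrganizationReference, so the ordering of the list matters for performance.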

I welcome feedback on the above questions and this PR in general.

tomreitz · 2024-05-10T19:32:43Z

Per the team discussion yesterday, I've made further changes to this branch, including:

* performance improvements on local reference checks: load the relevant properties of the referenced endpoints from local files into an in-memory cache once, rather than looping over and reading files with every payload
* performance improvements on remote reference checks: (asynchronously) GETting remote references in batches and caching responses
* implementing both "fail fast" and "succeed fast" features
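
The remote-side idea (deduplicate, check the cache, then fetch the rest concurrently) can be sketched as follows. This is an illustration under assumptions, not the code on this branch: `resolve_remote_references`, `fetch`, and `pool_size` as a function argument are hypothetical, and `fetch` stands in for an async GET against the API.

```python
import asyncio


async def resolve_remote_references(lookups, fetch, cache, pool_size=8):
    """Resolve hashable lookup keys, reusing cached answers.

    `fetch(key)` is an async callable returning True/False; at most
    `pool_size` fetches run concurrently. Results are stored in `cache`.
    """
    semaphore = asyncio.Semaphore(pool_size)
    # Deduplicate while preserving order, and skip anything already cached.
    pending = [key for key in dict.fromkeys(lookups) if key not in cache]

    async def fetch_one(key):
        async with semaphore:
            cache[key] = await fetch(key)

    await asyncio.gather(*(fetch_one(key) for key in pending))
    return {key: cache[key] for key in lookups}
```

Because the cache is passed in, a caller decides how long cached answers live, for example per endpoint or for the whole run.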

Details including configuration options can be found in the updated README.

I've tested this: it works and is significantly faster (the same processing of ~960 studentSchoolAssociation payloads that took 3+ minutes before the above performance changes now takes <1.5 minutes). Marking the PR ready for review.

tomreitz · 2024-05-13T12:52:43Z

lightbeam/validate.py

+                # to comparatively datasets (sections, schools, students).
+                self.load_local_reference_data(endpoint)
+                # create a structure which remote reference lookups can populate to prevent repeated lookups for the same thing
+                self.remote_reference_cache = {}


This resets the cache for each endpoint. Perhaps we only want to do it once, outside the loop (at line 35), since multiple endpoints may reference the same (cacheable) resources?
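
A toy sketch of the difference, with hypothetical names (`count_remote_lookups`, `remote_exists`, `share_cache`) and no claim about how validate.py is actually structured: when the cache is created once outside the loop, a reference shared by two endpoints is only looked up once.

```python
def count_remote_lookups(endpoints, references_by_endpoint, remote_exists,
                         share_cache=True):
    """Count remote lookups made while validating several endpoints."""
    remote_calls = 0
    cache = {}  # created once, outside the endpoint loop
    for endpoint in endpoints:
        if not share_cache:
            cache = {}  # per-endpoint reset, as in the diff above
        for ref in references_by_endpoint[endpoint]:
            if ref not in cache:
                cache[ref] = remote_exists(ref)
                remote_calls += 1
    return remote_calls
```

The shared cache trades some memory for fewer requests, which matters most when many endpoints reference the same students or schools.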

johncmerfeld

This is looking great! I appreciate that it's a very complex problem and I like your overall approach. I didn't find any bugs, so IMO this is ready to merge as-is.
I'll approve once I've conducted my own testing; I plan to do so this afternoon.

That said, I noted some ideas for simplification that you can take or leave. I've tried to fully flesh out the ones I feel more strongly about, so they hopefully aren't too heavy a lift to implement. Also if you'd prefer, I'm happy to make any of these changes myself -- just let me know!

johncmerfeld · 2024-07-01T17:23:22Z

lightbeam/validate.py

+        for k in payload.keys():
+            cache_key += f"{payload[k]}~~~"


Might it be worth spinning this out into its own even smaller static method so that writing to and reading from the cache are always syntactically aligned? I'm not sure

Are you pointing out that this is similar to the functionality in references_data_to_cache()? It is, and importantly, it's slightly different: the keys are sorted alphabetically there but not here, which could lead to inconsistent cache keys for the same structure.

TBH there's probably a more performant way to build and reference this local cache, but I propose we leave that as a future refactor/improvement. Is that ok by you, @johncmerfeld ?

Definitely not a blocker to merging!

Yeah, I think the main thing I'm reacting to is just the way keys are constructed (f"{payload[k]}~~~"). Since that's a novel format, it could be preferable for it to be further abstracted away from the developer. Totally fair for the two functions to use those keys in different ways; I'm just wondering about a hypothetical future where the triple-tilde syntax is written in many more places.

But very possibly a case of premature abstraction - like I say, it's totally good as is
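
For reference, the kind of helper being discussed might look like the sketch below. The names `KEY_SEPARATOR` and `build_cache_key` are hypothetical, and sorting the property names is an assumption borrowed from the description of references_data_to_cache() in this thread, so that writes and reads produce the same key regardless of property order.

```python
KEY_SEPARATOR = "~~~"


def build_cache_key(payload: dict) -> str:
    """Build a cache key from a payload's values, ordered by property name."""
    return "".join(f"{payload[k]}{KEY_SEPARATOR}" for k in sorted(payload))
```

With a single helper, the triple-tilde format lives in one place and both the cache writer and the cache reader call it.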

johncmerfeld · 2024-07-01T17:36:24Z

README.md

+The `references` `method` can be slow, as a separate `GET` request may be made to your API for each reference. (Therefore the validation method is disabled by default.) `lightbeam` tries to improve efficiency by:
+* batching requests and sending several concurrently (based on `connection`.`pool_size` of `lightbeam.yaml`)
+* caching responses and first checking the cache before making another (potentially identical) request


IMO this information doesn't belong here. It's describing Lightbeam's internals, but this section is really about how the user interacts with Lightbeam. Mixing the two steepens the learning curve

I think it's important for a user to understand that using the reference validation method will be slow. If we don't document that, a user might enable it and then wonder why lightbeam is so darn slow.

Maybe your comment here is more that there's too much detail in this section, not that it should be removed entirely? I'm open to persuasion on that, but I generally think that being explicit in documentation about features related to performance is best.

Gotcha, I definitely appreciate the motivation to warn the user of slowness. The principle I'm speaking from here is just separation of concerns. IMO the best readmes adhere to the inverted pyramid structure, with essential points delivered as densely as possible at the top and contextual information provided later.

Personally I'd vote for the explanation behind the slowness - and especially the measures taken to alleviate it - to be in the performance section of the readme. All the user really needs here is a warning.

johncmerfeld · 2024-07-01T17:37:47Z

README.md

+**Note:** Reference validation efficiency may be improved by first `lightbeam fetch`ing certain resources to have a local copy. `lightbeam validate` checks local JSONL files to resolve references before trying the remote API, and `fetch` retrieves many records per  `GET`, so total runtime can be faster in this scenario. The downsides include
+* more data movement
+* `fetch`ed data becoming stale over time
+* needing to track which data is your own vs. was `fetch`ed (all the data must coexist in the `config.data_dir` to be discoverable by `lightbeam validate`)


Mayyybe belongs elsewhere too. This is a good tip but it's kind of an advanced usage

Possibly. Where would you suggest as "elsewhere"? (I'm not sure we have any alternate place to document things like this at the moment, other than this README.)

Ah yeah my bad. I really mean "elsewhere in this document." We don't exactly have a section for advanced usage but it might belong in Performance and limitations?

johncmerfeld

Happy to keep the conversations going but as far as I'm concerned, this code is ready to roll. Great stuff!

Tom Reitz added 4 commits May 8, 2024 11:24

inital commit with a working implementation

801c7b1

improvements per discussion in yesterday meeting, update README

75b963d

update README

9c049d1

change order of edOrgRef resolution to prevalence

b28eeb4

tomreitz marked this pull request as ready for review May 10, 2024 19:32

Tom Reitz added 6 commits May 10, 2024 14:34

update README

4dc8188

update README

c903fd6

update README

33c15b7

update README

82d28d3

update README

56ac652

update README

b5f5ee8

tomreitz requested a review from ejoranlienea May 10, 2024 19:38

tomreitz changed the title ~~inital commit with a working implementation~~ validate references May 13, 2024

tomreitz commented May 13, 2024

tomreitz requested a review from jalvord1 May 16, 2024 13:52

tomreitz and others added 8 commits June 3, 2024 16:54

Merge branch 'main' into feature/reference_validation

769eeb8

update to current main branch, fix bugs found in testing

f0a3dcc

clean up comment

80781c2

fixes to descriptor validation with local descriptors

accd4ac

make remote reference lookup synchronous, only check endpoints with data

63ac1a5

nested reference fixes and performance improvements

0aa59f4

bugfixes and performance improvements to reference validation

f7bd739

remove succeed fast feature (for now, at least) from reference valida…

6a66d37

…tion, based on discussion with Jules and development of student ID matching bundle

johncmerfeld reviewed Jul 1, 2024

johncmerfeld reviewed Jul 2, 2024

updates per review from John

bcec891

tomreitz requested a review from johncmerfeld July 10, 2024 16:02

johncmerfeld approved these changes Jul 10, 2024

updates per review from John

e58942d

Tom Reitz added 4 commits July 10, 2024 14:57

updates per review from John

fb5ecab

bugfix

7d880bf

bugfix

c4a62b4

bugfix

b97e203

This was referenced Jul 12, 2024

Consider not resetting local cache for each endpoint when validating records #40

Open

Better handling of leftover tasks in a (potentially large) task queue #41

Open

Streamline error handling across lightbeam validation methods #42

Open

tomreitz merged commit 7c8c395 into main Jul 12, 2024

tomreitz deleted the feature/reference_validation branch July 12, 2024 19:01

tomreitz mentioned this pull request Jul 17, 2024

adding a test suite; fixing version output #45

Merged

johncmerfeld mentioned this pull request Jul 18, 2024

Hotfix: bug with get_endpoints_with_data #48

Merged
