Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate references #30

Merged
merged 24 commits into from
Jul 12, 2024
Merged

validate references #30

merged 24 commits into from
Jul 12, 2024

Conversation

tomreitz
Copy link
Collaborator

@tomreitz tomreitz commented May 8, 2024

This PR implements reference validation - a new feature whereby lightbeam validate will attempt to resolve Reference properties within a payload by first looking up the natural key values in local JSONL files for the referenced resource ("local references"), and if none are found, then doing a GET against the Ed-Fi API with the same natural key values ("remote references"). If a given reference is neither local nor remote, the payload will fail validation.

Because this feature can be slow (see comment 1 below), by default it is disabled. This is achieved by adding a new optional structure in lightbeam.yaml as follows:

validate:
  methods:
    - schema # checks that payloads conform to the Swagger definitions from the API
    - descriptors # checks that descriptor values are either locally-defined or exist in the remote API
    - uniqueness # checks that local payloads are unique by the required property values
    - references # NEW checks that references resolve

(In addition to the above, one more basic validation method is done before any of these: that the payload is valid JSON. This validation method cannnot be disabled because all the others require a valid JSON payload.) Default validate.methods (if unspecified by the user in lightbeam.yaml are ["schema", "descriptors", "uniqueness"].

I've tested the functionality and it works, however I'm leaving this PR as a draft for several reasons:

  1. The feature is slow - due to the serial nature of lookups, it took 3+ minutes to validate the schoolReferences and studentReferences in 954 studentSchoolAssociation payloads. I have a few ideas for how to improve performance, but I'm unsure how much effort to invest in this optimization work.

  2. Because of 1 above, it seems prudent to try to allow reference validation to fail quickly, when some user-specifiable threshold is reached. I propose a configuration structure (in lightbeam.yaml) like this:

validate:
  references:
    max_failures: 10 # stop testing after 10 failed payloads ("fail fast")

An open question is whether a default value of 10 max_failures is reasonable.

Another open question is whether we should also "succeed fast": if

  1. An unresolved issue is how to handle "typed" resources. For example, in studentEducationOrganizationAssociation the educationOrganizationReference property may be
{
  "educationOrganizationId": 255901,
  "link": {
    "rel": "LocalEducationAgency",
    "href": "/ed-fi/localEducationAgencies/5ec280c188db4f0bae9ef60e2ae5c231"
  }
}

or

{
  "educationOrganizationId": 255901044,
  "link": {
    "rel": "School",
    "href": "/ed-fi/school/7381540c0eff4b778d0399ce8b397c9a"
  }
}

Currently, this reference validation implementation determines what resource is being referenced based on the reference property name, i.e., schoolReference must be a reference to a school. For educationOrganizationReference, in principle one could derive the resource from link.rel which more specifically denotes the type of educationOrganization (localEducationAgency, school, etc.). However lightbeam JSON payloads created with earthmover often omit the link property and instead simply look like

{
  "educationOrganizationId": 255901044
}

from where it is impossible to determine what type of educationOrganization is being referenced. Possible solutions might be

  • having lightbeam know internally which Ed-Fi resources are members of what "type" and iteratively looks for the referenced values across all of them - so first look for schools with schoolId=255901, then for localEducationAgencies with localEducationAgencyId=255901, etc.
  • requiring link.rel (at least) to be present in JSONL payloads when using reference validation

I welcome feedback on the above questions and this PR in general.

@tomreitz
Copy link
Collaborator Author

tomreitz commented May 10, 2024

Per the team discussion yesterday, I've made further changes to this branch, including:

  • performance improvements on local references checks: load relevant properties of relevant ref'd endpoints from local files into an in-memory cache once, rather than looping over and reading files with every payload
  • performance improvements on remote reference checks: (asynchronously) GETting remote references in batches and caching responses
  • implement both "fail fast" and "succeed fast" features

Details including configuration options can be found in the updated README.

I've tested this, it works and is significantly faster (same processing of ~960 stuSchoolAssns that took 3+ mins before making the above performance changes now takes <1.5 mins). Marking the PR ready for review.

@tomreitz tomreitz marked this pull request as ready for review May 10, 2024 19:32
@tomreitz tomreitz requested a review from ejoranlienea May 10, 2024 19:38
@tomreitz tomreitz changed the title inital commit with a working implementation validate references May 13, 2024
# to comparatively datasets (sections, schools, students).
self.load_local_reference_data(endpoint)
# create a structure which remote reference lookups can populate to prevent repeated lookups for the same thing
self.remote_reference_cache = {}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This resets the cache for each endpoint, perhaps we only want to do it once outside the loop (at line 35) since multiple endpoints may reference the same (cacheable) resources?

@tomreitz tomreitz requested a review from jalvord1 May 16, 2024 13:52
Copy link
Contributor

@johncmerfeld johncmerfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great! I appreciate that it's a very complex problem and I like your overall approach. I didn't find any bugs, so IMO this is ready to merge as-is.
I'll approve once I've conducted my own testing; I plan to do so this afternoon.

That said, I noted some ideas for simplification that you can take or leave. I've tried to fully flesh out the ones I feel more strongly about, so they hopefully aren't too heavy a lift to implement. Also if you'd prefer, I'm happy to make any of these changes myself -- just let me know!

lightbeam/lightbeam.py Show resolved Hide resolved
lightbeam/validate.py Outdated Show resolved Hide resolved
lightbeam/validate.py Show resolved Hide resolved
lightbeam/validate.py Show resolved Hide resolved
lightbeam/validate.py Show resolved Hide resolved
Comment on lines +369 to +370
for k in payload.keys():
cache_key += f"{payload[k]}~~~"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be worth spinning this out into its own even smaller static method so that writing to and reading from the cache are always syntactically aligned? I'm not sure

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you pointing out that this is similar to the functionality in references_data_to_cache()? it is, and importantly, it's slightly different - the keys are sorted alphabetically there but not here... which could lead to inconsistent cache keys for the same structure.

TBH there's probably a more performant way to build and reference this local cache, but I propose we leave that as a future refactor/improvement. Is that ok by you, @johncmerfeld ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not a blocker to merging!

Yeah I think the main thing I'm reacting to is just the way keys are constructed (f"{payload[k]}~~~"). Since that's a novel format it could be preferable for it to be further abstracted away from the developer. Totally fair that the two functions to use those keys in different ways; just wondering about a hypothetical future where the triple-tilde syntax is written in many more places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But very possibly a case of premature abstraction - like I say, it's totally good as is

lightbeam/validate.py Show resolved Hide resolved
Comment on lines +153 to +155
The `references` `method` can be slow, as a separate `GET` request may be made to your API for each reference. (Therefore the validation method is disabled by default.) `lightbeam` tries to improve efficiency by:
* batching requests and sending several concurrently (based on `connection`.`pool_size` of `lightbeam.yaml`)
* caching responses and first checking the cache before making another (potentially identical) request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this information doesn't belong here. It's describing Lightbeam's internals, but this section is really about how the user interacts with Lightbeam. Mixing the two steepens the learning curve

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important for a user to understand that using the reference validation method will be slow. If we don't document that, a user might enable that and they wonder why lightbeam is so darn slow.

Maybe your comment here is more that there's too much detail in this section, not that it should be removed entirely? I'm open to persuasion on that, but I generally think that being explicit in documentation about features related to performance is best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, I definitely appreciate the motivation to warn the user of slowness. The principle I'm speaking from here is just separation of concerns. IMO the best readmes adhere to the inverted pyramid structure, with essential points delivered as densely as possible at the top and contextual information provided later.

Personally I'd vote for the explanation behind the slowness - and especially the measures taken to alleviate it - to be in the performance section of the readme. All the user really needs here is a warning.

README.md Outdated Show resolved Hide resolved
Comment on lines +165 to +168
**Note:** Reference validation efficiency may be improved by first `lightbeam fetch`ing certain resources to have a local copy. `lightbeam validate` checks local JSONL files to resolve references before trying the remote API, and `fetch` retrieves many records per `GET`, so total runtime can be faster in this scenario. The downsides include
* more data movement
* `fetch`ed data becoming stale over time
* needing to track which data is your own vs. was `fetch`ed (all the data must coexist in the `config.data_dir` to be discoverable by `lightbeam validate`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mayyybe belongs elsewhere too. This is a good tip but it's kind of an advanced usage

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. Where would you suggest as "elsewhere"? (I'm not sure we have any alternate place to document things like this at the moment, other than this README.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah my bad. I really mean "elsewhere in this document." We don't exactly have a section for advanced usage but it might belong in Performance and limitations?

lightbeam/validate.py Outdated Show resolved Hide resolved
@tomreitz tomreitz requested a review from johncmerfeld July 10, 2024 16:02
Copy link
Contributor

@johncmerfeld johncmerfeld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to keep the conversations going but as far as I'm concerned, this code is ready to roll. Great stuff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants