-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
validate references #30
Conversation
Per the team discussion yesterday, I've made further changes to this branch, including:
Details including configuration options can be found in the updated README. I've tested this, it works and is significantly faster (same processing of ~960 stuSchoolAssns that took 3+ mins before making the above performance changes now takes <1.5 mins). Marking the PR ready for review. |
lightbeam/validate.py
Outdated
# to comparatively datasets (sections, schools, students). | ||
self.load_local_reference_data(endpoint) | ||
# create a structure which remote reference lookups can populate to prevent repeated lookups for the same thing | ||
self.remote_reference_cache = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This resets the cache for each endpoint, perhaps we only want to do it once outside the loop (at line 35) since multiple endpoints may reference the same (cacheable) resources?
…tion, based on discussion with Jules and development of student ID matching bundle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! I appreciate that it's a very complex problem and I like your overall approach. I didn't find any bugs, so IMO this is ready to merge as-is.
I'll approve once I've conducted my own testing; I plan to do so this afternoon.
That said, I noted some ideas for simplification that you can take or leave. I've tried to fully flesh out the ones I feel more strongly about, so they hopefully aren't too heavy a lift to implement. Also if you'd prefer, I'm happy to make any of these changes myself -- just let me know!
for k in payload.keys(): | ||
cache_key += f"{payload[k]}~~~" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it be worth spinning this out into its own even smaller static method so that writing to and reading from the cache are always syntactically aligned? I'm not sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you pointing out that this is similar to the functionality in references_data_to_cache()
? it is, and importantly, it's slightly different - the keys are sorted alphabetically there but not here... which could lead to inconsistent cache keys for the same structure.
TBH there's probably a more performant way to build and reference this local cache, but I propose we leave that as a future refactor/improvement. Is that ok by you, @johncmerfeld ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely not a blocker to merging!
Yeah I think the main thing I'm reacting to is just the way keys are constructed (f"{payload[k]}~~~"
). Since that's a novel format it could be preferable for it to be further abstracted away from the developer. Totally fair that the two functions to use those keys in different ways; just wondering about a hypothetical future where the triple-tilde syntax is written in many more places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But very possibly a case of premature abstraction - like I say, it's totally good as is
The `references` `method` can be slow, as a separate `GET` request may be made to your API for each reference. (Therefore the validation method is disabled by default.) `lightbeam` tries to improve efficiency by: | ||
* batching requests and sending several concurrently (based on `connection`.`pool_size` of `lightbeam.yaml`) | ||
* caching responses and first checking the cache before making another (potentially identical) request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this information doesn't belong here. It's describing Lightbeam's internals, but this section is really about how the user interacts with Lightbeam. Mixing the two steepens the learning curve
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important for a user to understand that using the reference validation method will be slow. If we don't document that, a user might enable that and they wonder why lightbeam
is so darn slow.
Maybe your comment here is more that there's too much detail in this section, not that it should be removed entirely? I'm open to persuasion on that, but I generally think that being explicit in documentation about features related to performance is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, I definitely appreciate the motivation to warn the user of slowness. The principle I'm speaking from here is just separation of concerns. IMO the best readmes adhere to the inverted pyramid structure, with essential points delivered as densely as possible at the top and contextual information provided later.
Personally I'd vote for the explanation behind the slowness - and especially the measures taken to alleviate it - to be in the performance section of the readme. All the user really needs here is a warning.
**Note:** Reference validation efficiency may be improved by first `lightbeam fetch`ing certain resources to have a local copy. `lightbeam validate` checks local JSONL files to resolve references before trying the remote API, and `fetch` retrieves many records per `GET`, so total runtime can be faster in this scenario. The downsides include | ||
* more data movement | ||
* `fetch`ed data becoming stale over time | ||
* needing to track which data is your own vs. was `fetch`ed (all the data must coexist in the `config.data_dir` to be discoverable by `lightbeam validate`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mayyybe belongs elsewhere too. This is a good tip but it's kind of an advanced usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly. Where would you suggest as "elsewhere"? (I'm not sure we have any alternate place to document things like this at the moment, other than this README.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah my bad. I really mean "elsewhere in this document." We don't exactly have a section for advanced usage but it might belong in Performance and limitations
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to keep the conversations going but as far as I'm concerned, this code is ready to roll. Great stuff!
This PR implements reference validation - a new feature whereby
lightbeam validate
will attempt to resolve Reference properties within a payload by first looking up the natural key values in local JSONL files for the referenced resource ("local references"), and if none are found, then doing a GET against the Ed-Fi API with the same natural key values ("remote references"). If a given reference is neither local nor remote, the payload will fail validation.Because this feature can be slow (see comment 1 below), by default it is disabled. This is achieved by adding a new optional structure in
lightbeam.yaml
as follows:(In addition to the above, one more basic validation method is done before any of these: that the payload is valid JSON. This validation method cannnot be disabled because all the others require a valid JSON payload.) Default
validate.methods
(if unspecified by the user inlightbeam.yaml
are["schema", "descriptors", "uniqueness"]
.I've tested the functionality and it works, however I'm leaving this PR as a draft for several reasons:
The feature is slow - due to the serial nature of lookups, it took 3+ minutes to validate the
schoolReferences
andstudentReferences
in 954studentSchoolAssociation
payloads. I have a few ideas for how to improve performance, but I'm unsure how much effort to invest in this optimization work.Because of 1 above, it seems prudent to try to allow reference validation to fail quickly, when some user-specifiable threshold is reached. I propose a configuration structure (in
lightbeam.yaml
) like this:An open question is whether a default value of
10
max_failures
is reasonable.Another open question is whether we should also "succeed fast": if
studentEducationOrganizationAssociation
theeducationOrganizationReference
property may beor
Currently, this reference validation implementation determines what resource is being referenced based on the reference property name, i.e.,
schoolReference
must be a reference to aschool
. ForeducationOrganizationReference
, in principle one could derive the resource fromlink.rel
which more specifically denotes the type ofeducationOrganization
(localEducationAgency
,school
, etc.). Howeverlightbeam
JSON payloads created with earthmover often omit thelink
property and instead simply look likefrom where it is impossible to determine what type of
educationOrganization
is being referenced. Possible solutions might belightbeam
know internally which Ed-Fi resources are members of what "type" and iteratively looks for the referenced values across all of them - so first look for schools withschoolId=255901
, then for localEducationAgencies withlocalEducationAgencyId=255901
, etc.link.rel
(at least) to be present in JSONL payloads when using reference validationI welcome feedback on the above questions and this PR in general.