validate references #30
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -135,13 +135,36 @@ Like [selectors](#selectors), `keep-keys` and `drop-keys` are comma-separated li | |
```bash | ||
lightbeam validate -c path/to/config.yaml | ||
``` | ||
You may `validate` your JSONL before transmitting it. This checks that the payloads | ||
1. are valid JSON | ||
1. conform to the structure described in the Swagger documents for [resources](https://api.ed-fi.org/v5.3/api/metadata/data/v3/resourcess/swagger.json) and [descriptors](https://api.ed-fi.org/v5.3/api/metadata/data/v3/descriptors/swagger.json) fetched from your API | ||
1. contain valid descriptor values (fetched from your API and/or from descriptor values in your JSONL files) | ||
1. contain unique values for any natural key | ||
You may `validate` your JSONL before transmitting it. Configuration for `validate` goes in its own section of `lightbeam.yaml`: | ||
```yaml | ||
validate: | ||
methods: | ||
- schema # checks that payloads conform to the Swagger definitions from the API | ||
- descriptors # checks that descriptor values are either locally-defined or exist in the remote API | ||
- uniqueness # checks that local payloads are unique by the required property values | ||
- references # checks that references resolve, either locally or in the remote API | ||
# or | ||
# methods: "*" | ||
``` | ||
The default `validate`.`methods` are `["schema", "descriptors", "uniqueness"]` (not `references`; see below). Before running any of these methods, `lightbeam validate` also checks that each payload is valid JSON. | |
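To make the defaulting rule concrete, here is a minimal sketch of how a `methods` setting could be resolved into a list of checks. The function `resolve_methods` and the two constants are hypothetical names used for illustration only; they are not taken from `lightbeam`'s source, and the real behavior for unknown method names may differ.

```python
# Hypothetical illustration; names below are NOT lightbeam's actual API.
ALL_METHODS = ["schema", "descriptors", "uniqueness", "references"]
DEFAULT_METHODS = ["schema", "descriptors", "uniqueness"]  # `references` is opt-in


def resolve_methods(validate_config):
    """Turn the `validate` section of the config (a dict, or None) into a method list."""
    methods = (validate_config or {}).get("methods")
    if methods is None:
        # No `methods` key: fall back to the defaults, which exclude `references`.
        return list(DEFAULT_METHODS)
    if methods == "*":
        # The wildcard enables every method, including `references`.
        return list(ALL_METHODS)
    unknown = [m for m in methods if m not in ALL_METHODS]
    if unknown:
        # Assumption: unknown names are rejected rather than silently ignored.
        raise ValueError(f"unknown validation methods: {unknown}")
    return list(methods)
```

With this sketch, an absent `validate` section yields the three default methods, while `methods: "*"` yields all four.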
|
||
The `references` `method` can be slow, as a separate `GET` request may be made to your API for each reference. (Therefore the validation method is disabled by default.) `lightbeam` tries to improve efficiency by: | ||
* batching requests and sending several concurrently (based on `connection`.`pool_size` of `lightbeam.yaml`) | ||
* caching responses and first checking the cache before making another (potentially identical) request | ||
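The two optimizations above can be sketched roughly as follows. This is an illustration of the general technique (a response cache plus a semaphore that caps concurrent requests), not `lightbeam`'s actual implementation. `ReferenceChecker` and `lookup` are hypothetical names, and `lookup` stands in for whatever coroutine would issue the `GET` request.

```python
import asyncio


class ReferenceChecker:
    """Hypothetical sketch: cache lookups and cap the number in flight."""

    def __init__(self, lookup, pool_size=8):
        self.lookup = lookup        # async callable: ref -> bool (stand-in for a GET)
        self.pool_size = pool_size  # plays the role of `connection`.`pool_size`
        self.cache = {}             # ref -> bool, reused across calls to check()
        self.requests_made = 0

    async def check(self, refs):
        semaphore = asyncio.Semaphore(self.pool_size)

        async def resolve(ref):
            async with semaphore:   # at most pool_size lookups run concurrently
                self.requests_made += 1
                self.cache[ref] = await self.lookup(ref)

        # Skip anything already cached, and de-duplicate within this batch.
        pending = [r for r in dict.fromkeys(refs) if r not in self.cache]
        await asyncio.gather(*(resolve(r) for r in pending))
        return {r: self.cache[r] for r in refs}
```

Because the cache lives on the object, a second call to `check()` with a reference that was already resolved makes no further request.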
|
||
Even with these optimizations, checking `references` can easily take minutes for even relatively small amounts of data. Therefore `lightbeam.yaml` also accepts a further configuration option: | ||
```yaml | ||
validate: | ||
references: | ||
max_failures: 10 # stop testing after X failed payloads ("fail fast") | ||
``` | ||
This is optional; if absent, references in every payload are checked, no matter how many fail. | ||
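A minimal sketch of the fail-fast idea follows. The function `validate_references` and the `is_valid` callback are hypothetical, and the exact stopping condition (stopping as soon as the number of failed payloads reaches `max_failures`) is an assumption rather than something the text above specifies.

```python
def validate_references(payloads, is_valid, max_failures=None):
    """Hypothetical sketch: check payloads in order, stopping early if configured.

    Returns (failed_payloads, number_of_payloads_checked).
    """
    failures = []
    checked = 0
    for payload in payloads:
        checked += 1
        if not is_valid(payload):
            failures.append(payload)
            # Assumption: stop once the failure count reaches max_failures.
            if max_failures is not None and len(failures) >= max_failures:
                break
    return failures, checked
```

With `max_failures=None` (the option absent), every payload is checked regardless of how many fail.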
|
||
**Note:** Reference validation efficiency may be improved by first `lightbeam fetch`ing certain resources to have a local copy. `lightbeam validate` checks local JSONL files to resolve references before trying the remote API, and `fetch` retrieves many records per `GET`, so total runtime can be faster in this scenario. The downsides include | ||
* more data movement | ||
* `fetch`ed data becoming stale over time | ||
* needing to track which data is your own vs. was `fetch`ed (all the data must coexist in the `config.data_dir` to be discoverable by `lightbeam validate`) | ||
Comment on lines +163 to +166

Mayyybe belongs elsewhere too. This is a good tip but it's kind of an advanced usage

Possibly. Where would you suggest as "elsewhere"? (I'm not sure we have any alternate place to document things like this at the moment, other than this README.)

Ah yeah my bad. I really mean "elsewhere in this document." We don't exactly have a section for advanced usage but it might belong in |
||
|
||
This command will not find invalid reference errors, but is helpful for finding payloads that are invalid JSON, are missing required fields, or have other structural issues. | ||
|
||
## `send` | ||
```bash | ||
|
IMO this information doesn't belong here. It's describing Lightbeam's internals, but this section is really about how the user interacts with Lightbeam. Mixing the two steepens the learning curve

I think it's important for a user to understand that using the reference validation method will be slow. If we don't document that, a user might enable that and they wonder why `lightbeam` is so darn slow.
Maybe your comment here is more that there's too much detail in this section, not that it should be removed entirely? I'm open to persuasion on that, but I generally think that being explicit in documentation about features related to performance is best.

Gotcha, I definitely appreciate the motivation to warn the user of slowness. The principle I'm speaking from here is just separation of concerns. IMO the best readmes adhere to the inverted pyramid structure, with essential points delivered as densely as possible at the top and contextual information provided later.
Personally I'd vote for the explanation behind the slowness - and especially the measures taken to alleviate it - to be in the performance section of the readme. All the user really needs here is a warning.