validate references #30

Merged Jul 12, 2024 (24 commits)
README.md (35 changes: 29 additions, 6 deletions)

@@ -135,13 +135,36 @@ Like [selectors](#selectors), `keep-keys` and `drop-keys` are comma-separated li
```bash
lightbeam validate -c path/to/config.yaml
```
```diff
-You may `validate` your JSONL before transmitting it. This checks that the payloads
-1. are valid JSON
-1. conform to the structure described in the Swagger documents for [resources](https://api.ed-fi.org/v5.3/api/metadata/data/v3/resourcess/swagger.json) and [descriptors](https://api.ed-fi.org/v5.3/api/metadata/data/v3/descriptors/swagger.json) fetched from your API
-1. contain valid descriptor values (fetched from your API and/or from descriptor values in your JSONL files)
-1. contain unique values for any natural key
```
You may `validate` your JSONL before transmitting it. Configuration for `validate` goes in its own section of `lightbeam.yaml`:
```yaml
validate:
  methods:
    - schema      # checks that payloads conform to the Swagger definitions from the API
    - descriptors # checks that descriptor values are either locally-defined or exist in the remote API
    - uniqueness  # checks that local payloads are unique by the required property values
    - references  # checks that references resolve, either locally or in the remote API
  # or
  # methods: "*"
```
Default `validate`.`methods` are `["schema", "descriptors", "uniqueness"]` (not `references`; see below). In addition to the above methods, `lightbeam validate` will also (first) check that each payload is valid JSON.

The `references` method can be slow, since a separate `GET` request may be made to your API for each reference; for this reason it is disabled by default. `lightbeam` tries to improve efficiency by:
* batching requests and sending several concurrently (based on `connection`.`pool_size` of `lightbeam.yaml`)
* caching responses and first checking the cache before making another (potentially identical) request
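The batching-and-caching approach described in these bullets can be sketched roughly as follows. This is an illustrative sketch, not `lightbeam`'s actual code: `check_references` and `fetch` are hypothetical names, and the semaphore bound plays the role of `connection.pool_size`.

```python
import asyncio

async def check_references(urls, fetch, pool_size=8):
    """Resolve each distinct reference once, with at most pool_size requests in flight.

    `fetch` is any async callable returning True if the referenced record exists
    (in lightbeam it would wrap a GET against the Ed-Fi API).
    """
    semaphore = asyncio.Semaphore(pool_size)
    cache = {}  # url -> Future, so identical lookups share one request

    async def fetch_once(url):
        async with semaphore:  # bounded concurrency ("pool_size")
            return await fetch(url)

    async def check(url):
        if url not in cache:
            cache[url] = asyncio.ensure_future(fetch_once(url))
        return await cache[url]

    return await asyncio.gather(*(check(u) for u in urls))
```

Storing a future (rather than a finished result) in the cache means concurrent lookups of the same reference share a single in-flight request instead of racing to issue duplicates.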
Comment on lines +151 to +153

**Contributor:** IMO this information doesn't belong here. It's describing Lightbeam's internals, but this section is really about how the user interacts with Lightbeam. Mixing the two steepens the learning curve.

**Collaborator (author):** I think it's important for a user to understand that using the reference validation method will be slow. If we don't document that, a user might enable it and wonder why lightbeam is so darn slow.

Maybe your comment here is more that there's too much detail in this section, not that it should be removed entirely? I'm open to persuasion on that, but I generally think that being explicit in documentation about features related to performance is best.

**Contributor:** Gotcha, I definitely appreciate the motivation to warn the user of slowness. The principle I'm speaking from here is just separation of concerns. IMO the best readmes adhere to the inverted pyramid structure, with essential points delivered as densely as possible at the top and contextual information provided later.

Personally I'd vote for the explanation behind the slowness - and especially the measures taken to alleviate it - to be in the performance section of the readme. All the user really needs here is a warning.

Even with these optimizations, checking `references` can easily take minutes, even for relatively small amounts of data. Therefore `lightbeam.yaml` also accepts a further configuration option:
```yaml
validate:
  references:
    max_failures: 10 # stop testing after X failed payloads ("fail fast")
```
This is optional; if absent, references in every payload are checked, no matter how many fail.
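In effect, `max_failures` short-circuits the validation loop. A minimal sketch of the behavior (hypothetical function names, not lightbeam's actual implementation; `has_invalid_reference` stands in for the local-then-remote reference resolution):

```python
def find_reference_failures(payloads, has_invalid_reference, max_failures=None):
    """Yield payloads whose references fail to resolve, stopping early
    once max_failures is reached. max_failures=None (the default when the
    config option is absent) means every payload is checked.
    """
    failures = 0
    for payload in payloads:
        if has_invalid_reference(payload):
            failures += 1
            yield payload
            if max_failures is not None and failures >= max_failures:
                return  # "fail fast": stop after max_failures bad payloads
```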

**Note:** Reference validation can be made faster by first `lightbeam fetch`ing certain resources so that a local copy exists. `lightbeam validate` checks local JSONL files to resolve references before trying the remote API, and `fetch` retrieves many records per `GET`, so total runtime can be lower in this scenario. The downsides include:
* more data movement
* `fetch`ed data becoming stale over time
* needing to track which data is your own vs. `fetch`ed (all the data must coexist in `config.data_dir` to be discoverable by `lightbeam validate`)
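The local-first resolution order described in this note can be sketched as follows (illustrative only; `resolve_reference`, `local_index`, and `remote_lookup` are assumed names, not lightbeam's):

```python
def resolve_reference(natural_key, local_index, remote_lookup):
    """Local-first reference resolution: consult the natural keys indexed
    from local JSONL files (your own data plus anything fetched into
    config.data_dir) before falling back to a GET against the API.
    """
    if natural_key in local_index:
        return True  # resolved locally; no network round-trip needed
    return remote_lookup(natural_key)  # one GET per unresolved reference
```

Pre-fetching shifts work from many single-record `GET`s at validation time to a few bulk `GET`s up front, which is why it can win despite moving more data.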
Comment on lines +163 to +166

**Contributor:** Mayyybe belongs elsewhere too. This is a good tip but it's kind of an advanced usage.

**Collaborator (author):** Possibly. Where would you suggest as "elsewhere"? (I'm not sure we have any alternate place to document things like this at the moment, other than this README.)

**Contributor:** Ah yeah my bad. I really mean "elsewhere in this document." We don't exactly have a section for advanced usage but it might belong in Performance and limitations?


This command will not catch invalid references, but it is helpful for finding payloads that are invalid JSON, are missing required fields, or have other structural issues.

## `send`
lightbeam/api.py (4 changes: 2 additions, 2 deletions)

@@ -53,10 +53,10 @@ def prepare(self):
```diff
         self.config["open_api_metadata_url"] = api_base["urls"]["openApiMetadata"]

         # load all endpoints in dependency-order
-        all_endpoints = self.get_sorted_endpoints()
+        self.lightbeam.all_endpoints = self.get_sorted_endpoints()

         # filter down to only selected endpoints
-        self.lightbeam.endpoints = self.apply_filters(all_endpoints)
+        self.lightbeam.endpoints = self.apply_filters(self.lightbeam.all_endpoints)


     def apply_filters(self, endpoints=[]):
```
lightbeam/delete.py (3 changes: 2 additions, 1 deletion)

@@ -88,7 +88,8 @@ async def do_deletes(self, endpoint):
```diff
                 data = line.strip()
                 # fill out the required fields from the data payload
                 # (so we can search for matching records in the API)
-                params = util.interpolate_params(params_structure, data)
+                payload = json.loads(data)
+                params = util.interpolate_params(params_structure, payload)

                 # check if we've posted this data before
                 data_hash = hashlog.get_hash(data)
```
lightbeam/lightbeam.py (29 changes: 25 additions, 4 deletions)

@@ -160,12 +160,33 @@ def get_data_files_for_endpoint(self, endpoint):
```diff
         return file_list

     # Prunes the list of endpoints down to those for which .jsonl files exist in the config.data_dir
-    def get_endpoints_with_data(self, endpoints):
+    def get_endpoints_with_data(self):
         self.logger.debug("discovering data...")
         endpoints_with_data = []
-        for endpoint in endpoints:
-            if self.get_data_files_for_endpoint(endpoint):
-                endpoints_with_data.append(endpoint)
+        data_dir_list = os.listdir(self.config["data_dir"])
+        for data_dir_item in data_dir_list:
+            data_dir_item_path = os.path.join(self.config["data_dir"], data_dir_item)
+            if os.path.isfile(data_dir_item_path):
+                filename = os.path.basename(data_dir_item)
+                extension = filename.rsplit(".", 1)[-1]
+                filename_without_extension = filename.rsplit(".", 1)[0]
+                if extension in self.DATA_FILE_EXTENSIONS and filename_without_extension in self.all_endpoints:
+                    endpoints_with_data.append(filename_without_extension)
+            elif os.path.isdir(data_dir_item_path):
+                if data_dir_item in self.all_endpoints:
+                    has_data_file = False
+                    sub_dir_list = os.listdir(data_dir_item_path)
+                    for sub_dir_item in sub_dir_list:
+                        sub_dir_item_path = os.path.join(data_dir_item_path, sub_dir_item)
+                        if os.path.isfile(sub_dir_item_path):
+                            filename = os.path.basename(sub_dir_item)
+                            extension = filename.rsplit(".", 1)[-1]
+                            if extension in self.DATA_FILE_EXTENSIONS:
+                                has_data_file = True
+                                break
+                    if has_data_file:
+                        endpoints_with_data.append(data_dir_item)

         return endpoints_with_data

     # Returns a generator which produces json lines for a given endpoint based on relevant files in config.data_dir
```
lightbeam/util.py (6 changes: 5 additions, 1 deletion)

@@ -28,12 +28,16 @@ def singularize_endpoint(endpoint):
```diff
     if endpoint[-3:]=="ies": return endpoint[0:-3] + "y"
     elif endpoint=="people": return "person"
     else: return endpoint[0:-1]
+def pluralize_endpoint(endpoint):
+    if endpoint[-1:]=="y": return endpoint[0:-1] + "ies"
+    elif endpoint=="person": return "people"
+    else: return endpoint+"s"

 # Takes a params structure and interpolates values from a (string) JSON payload
 def interpolate_params(params_structure, payload):
     params = {}
     for k,v in params_structure.items():
-        value = json.loads(payload)
+        value = payload.copy()
         for key in v.split('.'):
             value = value[key]
         params[k] = value
```
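With this change, callers pass an already-parsed dict, so the payload is deserialized once instead of once per parameter. A standalone sketch of the updated function (the `return` statement falls outside the visible hunk and is assumed; the example params structure is illustrative, not from lightbeam):

```python
import json

def interpolate_params(params_structure, payload):
    # payload is now a dict: json.loads happens once in the caller
    # instead of once per parameter inside this loop
    params = {}
    for k, v in params_structure.items():
        value = payload.copy()
        for key in v.split('.'):  # walk a dotted path like "studentReference.studentUniqueId"
            value = value[key]
        params[k] = value
    return params  # assumed; not shown in the visible hunk
```

For example, with a dotted path into a nested reference:

```python
payload = json.loads('{"studentReference": {"studentUniqueId": "12345"}, "schoolYear": 2024}')
interpolate_params({"studentUniqueId": "studentReference.studentUniqueId"}, payload)
# → {"studentUniqueId": "12345"}
```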
Expand Down
Loading