This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

Populate dataset #844

Merged
merged 25 commits into from
Jul 14, 2022

Conversation

@galvana (Collaborator) commented Jul 8, 2022

Purpose

Adds an update_dataset function that takes a ConnectionConfig, a DatasetConfig, and API data, and updates the existing Dataset. This is meant to be used for SaaS connector development.

Functionality

  • Updates an existing dataset based on the API data
    • Can also write the generated dataset to a separate file from the existing dataset, for comparison
  • The collection order is derived from the collection order in the SaaS config
    • The default behavior preserves the collection order of the existing dataset
  • Preserves any user-defined data categories during dataset generation
  • Invoked from the test_*_access_request_task after run_access_request has returned a response
```python
v = graph_task.run_access_request(
    privacy_request,
    policy,
    graph,
    [connection_config],
    {"email": identity_email},
)

update_dataset(
    connection_config,
    dataset_config,
    v,
    "generated_dataset.yml",
)
```
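A minimal sketch of the field-regeneration step described above, preserving user-defined data categories. The helper name `regenerate_fields` and the dict shapes are illustrative, not the actual fidesops implementation:

```python
from typing import Any, Dict, List

def regenerate_fields(row: Dict[str, Any],
                      existing_fields: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rebuild a collection's field list from one API row, carrying over
    any user-defined data_categories from the existing dataset."""
    existing = {field["name"]: field for field in existing_fields}
    fields: List[Dict[str, Any]] = []
    for name in row:
        field: Dict[str, Any] = {"name": name}
        previous = existing.get(name)
        if previous and "data_categories" in previous:
            field["data_categories"] = previous["data_categories"]
        fields.append(field)
    return fields
```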
    

Checklist

  • Good unit test/integration test coverage
  • The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

@galvana galvana marked this pull request as ready for review July 8, 2022 23:44
return collections


def generate_fields(
@galvana (Collaborator, Author) Jul 8, 2022

We have a merge_datasets function, but we need to preserve the order of the fields, and we lose control of that as soon as we convert a plain dict to a FidesopsDataset: the fields are alphabetized by Pydantic's default behavior.
pydantic/pydantic#593
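A minimal illustration of why the utilities work on plain dicts before any model conversion: Python dicts preserve insertion order, so an order-sensitive merge can be done first. The helper below is a sketch, not fidesops code:

```python
def merge_preserving_order(existing: dict, generated: dict) -> dict:
    """Merge generated entries into existing ones without disturbing order.

    Existing keys keep their positions (and values); keys that only appear
    in the generated dict are appended at the end, in their own order.
    """
    merged = dict(existing)            # copy keeps existing insertion order
    for key, value in generated.items():
        merged.setdefault(key, value)  # new keys append; existing keys win
    return merged
```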

@galvana galvana requested a review from adamsachs July 9, 2022 00:01
@galvana galvana added the run unsafe ci checks Triggers running of unsafe CI checks label Jul 9, 2022
@galvana galvana added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Jul 11, 2022
},
],
}
api_data = {}
Contributor

would a better test here include an api_data object that places at least one of the existing collections in a different order than in existing_dataset?
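A hypothetical fixture along the lines of this suggestion: "orders" comes first in api_data even though the existing dataset lists "users" first, so a test can assert that the existing order wins. The `order_collections` helper and the collection names are illustrative only:

```python
def order_collections(existing_order, api_data):
    """Order output collections by the existing dataset, appending any
    collections that only appear in the API data at the end."""
    ordered = [name for name in existing_order if name in api_data]
    new = [name for name in api_data if name not in existing_order]
    return ordered + new

api_data = {
    "orders": [{"id": 1, "total": 9.99}],   # listed first in api_data
    "users": [{"id": 1, "email": "a@example.com"}],
}
# default behavior described in the PR: existing order is preserved
assert order_collections(["users", "orders"], api_data) == ["users", "orders"]
```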

@@ -0,0 +1,667 @@
from tests.test_helpers.dataset_utils import generate_collections, generate_dataset


Contributor

nice job with these tests - they're a great starting point for understanding the expected functionality of the utilities.

one follow up - would it be feasible/useful to include more of an "integration" level test that actually tests and demonstrates how the utility can work on a real API response, i.e. using a saas provider we already integrate with as an example? that test would obviously need to be marked as an integration_saas test since it would require external communication, but just seems like it could be a nice-to-have. (if you think this is feasible, maybe we can split out a separate issue/PR for it, since i don't think it should necessarily block this PR)

# into generate_collections
generated_collections = generate_collections(
{
collection_name.replace(f"{existing_dataset['fides_key']}:", ""): collection
@adamsachs (Contributor) Jul 12, 2022

is this logic/processing covered in the unit tests? i know we may not be able to cover every corner of the dataset generation in unit tests, but it's some behavior that i think would be good to have covered both to ensure correct functionality and to clarify expected behavior
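A sketch of the kind of unit test being asked for here: collection keys in the access-request output are prefixed with "<fides_key>:", and the prefix is stripped before the rows are handed to generate_collections. The fides_key value below is hypothetical:

```python
# Hypothetical fixture; "stripe_connector" is illustrative, not from the PR.
existing_dataset = {"fides_key": "stripe_connector"}
raw = {"stripe_connector:customers": [{"id": "cus_1"}]}

# Mirror of the prefix-stripping expression under review.
stripped = {
    name.replace(f"{existing_dataset['fides_key']}:", ""): rows
    for name, rows in raw.items()
}
assert stripped == {"customers": [{"id": "cus_1"}]}
```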

Contributor

also, i think my suggestion about an "integration" level test may help in providing coverage for this piece of things? that would address my concern, no need to make it specifically a unit test.

maybe just another reason to look into having that sort of "integration" test with more realistic data

the existing collections if no API data is available.
"""

# convert FidesopsDataset to Dataset to be able to use the Collection helpers
Contributor

just noting that i don't totally follow this, but probably due to my lack of understanding, so may be good to review together offline

return object_list


def get_data_type(value) -> Optional[str]:
@adamsachs (Contributor) Jul 12, 2022

to me, it feels like this function should live more centrally. what would you think of defining it somewhere within fidesops.graph.data_type.py? from what i can tell, there's nothing yet defined there that takes an arbitrary input value/variable and gives the corresponding proper Fidesops DataType, but I think that's what you want here.

If you add that utility method over there, then I think here you should be relying upon the proper DataType enum - passing around string literals here just feels a bit sloppy to me. Specifically, I think it's ripe for unnoticed regressions if/when any of the typing logic changes in the fidesops "taxonomy" (for lack of a better term), and it will be burdensome to maintain.
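A hedged sketch of the utility the reviewer suggests: map an arbitrary value to a type name through an enum instead of scattering string literals. The DataType members here are assumptions for illustration, not necessarily the real fidesops enum:

```python
from enum import Enum
from typing import Any, Optional

class DataType(Enum):
    string = "string"
    integer = "integer"
    float = "float"
    boolean = "boolean"
    object = "object"

def get_data_type(value: Any) -> Optional[str]:
    """Return the type name for a value, or None if it can't be inferred."""
    # bool must be checked before int: isinstance(True, int) is True
    if isinstance(value, bool):
        return DataType.boolean.value
    if isinstance(value, int):
        return DataType.integer.value
    if isinstance(value, float):
        return DataType.float.value
    if isinstance(value, str):
        return DataType.string.value
    if isinstance(value, dict):
        return DataType.object.value
    return None
```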

collection_map.keys(),
):
if len(rows := api_data.get(collection_name, [])):
fields = generate_fields(rows[0], collection_name, collection_map)
Contributor

not necessarily a feature for now, but a potential enhancement could be to allow for a configurable "sampling size" here so that potentially more than just the first row of response data is being used here. obviously we'd then need to build in some more advanced type-inference logic to take into account multiple values for each column, but just calling it out as an area for fine tuning.
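One way the "sampling size" enhancement could look: inspect up to N rows and only commit to a type when every sampled non-null value agrees. This is a sketch of the idea, not part of the PR:

```python
from typing import Any, Dict, List, Optional

def infer_field_type(rows: List[Dict[str, Any]], field: str,
                     sample_size: int = 10) -> Optional[str]:
    """Infer a field's Python type name from up to sample_size rows.

    Returns None when the sampled values disagree or are all missing,
    leaving the caller to fall back to the existing dataset's type.
    """
    types = {
        type(row[field]).__name__
        for row in rows[:sample_size]
        if row.get(field) is not None
    }
    return types.pop() if len(types) == 1 else None
```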

@adamsachs (Contributor)

it looks like there are some unit tests failing that i think are real failures? once that's addressed, i think i'm good to approve - there is one additional nitpick i threw on there for the get_data_type, let me know what you think. i don't think it should block the PR but if you agree with it and it's a quick change, may as well get it in there.

@galvana galvana requested a review from adamsachs July 14, 2022 00:06
@adamsachs (Contributor) left a comment

nice, this all looks great - thanks for making those updates to move the data type logic more centrally, i think it will help for maintainability.

excited to see this in action!

@galvana galvana merged commit 43e4134 into main Jul 14, 2022
@galvana galvana deleted the populate-dataset branch July 14, 2022 02:52
sanders41 pushed a commit that referenced this pull request Sep 22, 2022