This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

Populate dataset #844

Merged
merged 25 commits into from
Jul 14, 2022

Conversation

@galvana (Collaborator) commented Jul 8, 2022

Purpose

Adds an update_dataset function that takes a ConnectionConfig, a DatasetConfig, and API data, and updates the existing Dataset. This is meant to be used for SaaS connector development.

Functionality

  • Updates an existing dataset based on the API data
    • Can also write the generated dataset to a separate file from the existing dataset, for comparison
  • The collection order is derived from the collection order in the SaaS config
    • The default behavior preserves the collection order of the existing dataset
  • Preserves any user-defined data categories during dataset generation
  • Invoked from the test_*_access_request_task after run_access_request has returned a response
```python
v = graph_task.run_access_request(
    privacy_request,
    policy,
    graph,
    [connection_config],
    {"email": identity_email},
)

update_dataset(
    connection_config,
    dataset_config,
    v,
    "generated_dataset.yml",
)
```
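A minimal sketch of the field-regeneration step described above, preserving user-defined data categories. The helper name `regenerate_fields` and the dict shapes are illustrative, not the actual fidesops implementation:

```python
from typing import Any, Dict, List

def regenerate_fields(row: Dict[str, Any],
                      existing_fields: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rebuild a collection's field list from one API row, carrying over
    any user-defined data_categories from the existing dataset."""
    existing = {field["name"]: field for field in existing_fields}
    fields: List[Dict[str, Any]] = []
    for name in row:
        field: Dict[str, Any] = {"name": name}
        previous = existing.get(name)
        if previous and "data_categories" in previous:
            field["data_categories"] = previous["data_categories"]
        fields.append(field)
    return fields
```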
    

Checklist

  • Good unit test/integration test coverage
  • The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

@galvana galvana marked this pull request as ready for review July 8, 2022 23:44
return collections


def generate_fields(
@galvana (Collaborator, Author) Jul 8, 2022

We have a merge_datasets function, but we need to preserve the order of the fields, and we lose control of that as soon as we convert a plain dict to a FidesopsDataset: the fields are alphabetized by Pydantic's default behavior.
pydantic/pydantic#593
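A minimal illustration of why the utilities work on plain dicts before any model conversion: Python dicts preserve insertion order, so an order-sensitive merge can be done first. The helper below is a sketch, not fidesops code:

```python
def merge_preserving_order(existing: dict, generated: dict) -> dict:
    """Merge generated entries into existing ones without disturbing order.

    Existing keys keep their positions (and values); keys that only appear
    in the generated dict are appended at the end, in their own order.
    """
    merged = dict(existing)            # copy keeps existing insertion order
    for key, value in generated.items():
        merged.setdefault(key, value)  # new keys append; existing keys win
    return merged
```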

@galvana galvana requested a review from adamsachs July 9, 2022 00:01
@galvana galvana added the run unsafe ci checks Triggers running of unsafe CI checks label Jul 9, 2022
@galvana galvana added run unsafe ci checks Triggers running of unsafe CI checks and removed run unsafe ci checks Triggers running of unsafe CI checks labels Jul 11, 2022
},
],
}
api_data = {}
Contributor

would a better test here include an api_data object that places at least one of the existing collections in a different order than in existing_dataset?
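A hypothetical fixture along the lines of this suggestion: "orders" comes first in api_data even though the existing dataset lists "users" first, so a test can assert that the existing order wins. The `order_collections` helper and the collection names are illustrative only:

```python
def order_collections(existing_order, api_data):
    """Order output collections by the existing dataset, appending any
    collections that only appear in the API data at the end."""
    ordered = [name for name in existing_order if name in api_data]
    new = [name for name in api_data if name not in existing_order]
    return ordered + new

api_data = {
    "orders": [{"id": 1, "total": 9.99}],   # listed first in api_data
    "users": [{"id": 1, "email": "a@example.com"}],
}
# default behavior described in the PR: existing order is preserved
assert order_collections(["users", "orders"], api_data) == ["users", "orders"]
```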

@@ -0,0 +1,667 @@
from tests.test_helpers.dataset_utils import generate_collections, generate_dataset


Contributor

nice job with these tests - they're a great starting point for understanding the expected functionality of the utilities.

one follow up - would it be feasible/useful to include more of an "integration" level test that actually tests and demonstrates how the utility can work on a real API response, i.e. using a saas provider we already integrate with as an example? that test would obviously need to be marked as an integration_saas test since it would require external communication, but just seems like it could be a nice-to-have. (if you think this is feasible, maybe we can split out a separate issue/PR for it, since i don't think it should necessarily block this PR)

# into generate_collections
generated_collections = generate_collections(
{
collection_name.replace(f"{existing_dataset['fides_key']}:", ""): collection
@adamsachs (Contributor) Jul 12, 2022

is this logic/processing covered in the unit tests? i know we may not be able to cover every corner of the dataset generation in unit tests, but it's some behavior that i think would be good to have covered both to ensure correct functionality and to clarify expected behavior
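A sketch of the kind of unit test being asked for here: collection keys in the access-request output are prefixed with "<fides_key>:", and the prefix is stripped before the rows are handed to generate_collections. The fides_key value below is hypothetical:

```python
# Hypothetical fixture; "stripe_connector" is illustrative, not from the PR.
existing_dataset = {"fides_key": "stripe_connector"}
raw = {"stripe_connector:customers": [{"id": "cus_1"}]}

# Mirror of the prefix-stripping expression under review.
stripped = {
    name.replace(f"{existing_dataset['fides_key']}:", ""): rows
    for name, rows in raw.items()
}
assert stripped == {"customers": [{"id": "cus_1"}]}
```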

Contributor

also, i think my suggestion about an "integration" level test may help in providing coverage for this piece of things? that would address my concern, no need to make it specifically a unit test.

maybe just another reason to look into having that sort of "integration" test with more realistic data

the existing collections if no API data is available.
"""

# convert FidesopsDataset to Dataset to be able to use the Collection helpers
Contributor

just noting that i don't totally follow this, but probably due to my lack of understanding, so may be good to review together offline

return object_list


def get_data_type(value) -> Optional[str]:
@adamsachs (Contributor) Jul 12, 2022

to me, it feels like this function should live more centrally. what would you think of defining it somewhere within fidesops.graph.data_type.py? from what i can tell, there's nothing yet defined there that takes an arbitrary input value/variable and gives the corresponding proper Fidesops DataType, but I think that's what you want here.

If you add that utility method over there, then I think here you should be relying upon the proper DataType enum - passing around string literals here just feels a bit sloppy to me. Specifically, I think it's ripe for unnoticed regressions if/when any of the typing logic changes in the fidesops "taxonomy" (for lack of a better term), and it will be burdensome to maintain.
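A hedged sketch of the utility the reviewer suggests: map an arbitrary value to a type name through an enum instead of scattering string literals. The DataType members here are assumptions for illustration, not necessarily the real fidesops enum:

```python
from enum import Enum
from typing import Any, Optional

class DataType(Enum):
    string = "string"
    integer = "integer"
    float = "float"
    boolean = "boolean"
    object = "object"

def get_data_type(value: Any) -> Optional[str]:
    """Return the type name for a value, or None if it can't be inferred."""
    # bool must be checked before int: isinstance(True, int) is True
    if isinstance(value, bool):
        return DataType.boolean.value
    if isinstance(value, int):
        return DataType.integer.value
    if isinstance(value, float):
        return DataType.float.value
    if isinstance(value, str):
        return DataType.string.value
    if isinstance(value, dict):
        return DataType.object.value
    return None
```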

collection_map.keys(),
):
if len(rows := api_data.get(collection_name, [])):
fields = generate_fields(rows[0], collection_name, collection_map)
Contributor

not necessarily a feature for now, but a potential enhancement could be to allow for a configurable "sampling size" here so that potentially more than just the first row of response data is being used here. obviously we'd then need to build in some more advanced type-inference logic to take into account multiple values for each column, but just calling it out as an area for fine tuning.
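One way the "sampling size" enhancement could look: inspect up to N rows and only commit to a type when every sampled non-null value agrees. This is a sketch of the idea, not part of the PR:

```python
from typing import Any, Dict, List, Optional

def infer_field_type(rows: List[Dict[str, Any]], field: str,
                     sample_size: int = 10) -> Optional[str]:
    """Infer a field's Python type name from up to sample_size rows.

    Returns None when the sampled values disagree or are all missing,
    leaving the caller to fall back to the existing dataset's type.
    """
    types = {
        type(row[field]).__name__
        for row in rows[:sample_size]
        if row.get(field) is not None
    }
    return types.pop() if len(types) == 1 else None
```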

@adamsachs (Contributor)

it looks like there are some unit tests failing that i think are real failures? once that's addressed, i think i'm good to approve - there is one additional nitpick i threw on there for the get_data_type, let me know what you think. i don't think it should block the PR but if you agree with it and it's a quick change, may as well get it in there.

@galvana galvana requested a review from adamsachs July 14, 2022 00:06
@adamsachs (Contributor) left a comment

nice, this all looks great - thanks for making those updates to move the data type logic more centrally, i think it will help for maintainability.

excited to see this in action!

@galvana galvana merged commit 43e4134 into main Jul 14, 2022
@galvana galvana deleted the populate-dataset branch July 14, 2022 02:52
sanders41 pushed a commit that referenced this pull request Sep 22, 2022