Generate Dataset Squashes Tables From Multiple Schemas #173

Closed
brentonmallen1 opened this issue Oct 19, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@brentonmallen1
Contributor

Issue:

  • When using generate-dataset on the slice database (mysql and/or postgres), which has multiple user-created databases/schemas (mypizza and test), the resulting auto-generated dataset.yml file contains only the mypizza schema, with the tables from the test schema appended to its list of collection objects.

Expected:

  • The resulting dataset.yml file should have two databases/schemas, each containing its own tables under its collection list.

Investigation:

Impact:

  • Users who take advantage of the generate-dataset functionality in the CLI may not notice extra tables associated with a given dataset in the output file.
  • Potential impact on annotate-dataset, though I will try to make sure it is not affected either way.
  • Not sure what impact, if any, there will be on the fides system in general.

Potential solutions:

  • Fix the nested dict highlighted above so that it handles multiple schemas/databases appropriately
  • Enforce one dataset.yml file per database/schema
  • I'm sure there are other approaches
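
To make the first option concrete, here is a minimal sketch of grouping tables by schema so that each schema becomes its own dataset instead of sharing one collection list. It is not the code in generate_dataset.py: the function name `group_by_schema`, the `(schema, table, column)` input rows, and the `fides_key`/`name`/`collections`/`fields` keys are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input: (schema, table, column) rows, roughly what a query
# against a database's column catalog might return.
ROWS = [
    ("mypizza", "customers", "id"),
    ("mypizza", "customers", "email"),
    ("mypizza", "orders", "id"),
    ("test", "visits", "id"),
]


def group_by_schema(rows):
    """Build one dataset dict per schema rather than one shared collection list."""
    # schema -> table -> list of column names
    nested = defaultdict(lambda: defaultdict(list))
    for schema, table, column in rows:
        nested[schema][table].append(column)

    datasets = []
    for schema, tables in nested.items():
        collections = [
            {"name": table, "fields": [{"name": col} for col in columns]}
            for table, columns in tables.items()
        ]
        # The key names here are assumed, not taken from the fides models.
        datasets.append(
            {"fides_key": schema, "name": schema, "collections": collections}
        )
    return datasets


datasets = group_by_schema(ROWS)
for dataset in datasets:
    print(dataset["name"], [c["name"] for c in dataset["collections"]])
```

Keying the intermediate dict on the schema first is what keeps the tables of test from ending up under mypizza.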
iamkelllly added the bug label Oct 19, 2021
@brentonmallen1
Contributor Author

brentonmallen1 commented Oct 19, 2021

Further impact/consideration:

  • Currently, write_manifest is expecting a single dataset dictionary and will write a single file
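
If the writer stays single-dataset, one way around it would be to call it once per schema. The sketch below only plans the split: `split_manifests`, the `<name>_dataset.yml` naming, and the top-level `dataset` list are hypothetical choices for illustration, and the actual write_manifest signature is not assumed here.

```python
def split_manifests(datasets):
    """Map a hypothetical output filename to a manifest holding exactly one dataset."""
    manifests = {}
    for dataset in datasets:
        # Assumed naming scheme: one file per database/schema.
        filename = f"{dataset['name']}_dataset.yml"
        manifests[filename] = {"dataset": [dataset]}
    return manifests


example = [
    {"name": "mypizza", "collections": [{"name": "customers"}]},
    {"name": "test", "collections": [{"name": "visits"}]},
]
manifests = split_manifests(example)
for filename, manifest in manifests.items():
    print(filename, len(manifest["dataset"]))
```

Each value could then be handed to the existing writer unchanged, at the cost of producing several files instead of one.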

@brentonmallen1
Contributor Author

I looked into this a bit more due to the conversation here: #176 (comment)

It seems to me that a few functions need to be refactored to support multiple datasets in a single dataset.yml file, specifically the following functions in generate_dataset.py:

They seem to be written with the expectation that only a single dataset exists.

I'm not sure whether manifests.write_manifest is actually impacted, since it is just a yaml dump. I imagine that if the dataset dictionary is compiled correctly, that function shouldn't have to change.

For a few reasons, most importantly due to the proximity to the web summit, I don't feel comfortable addressing this at this time. I at least wanted to scope out the level of effort here as the result will cause a refactor in annotate_dataset.py

@brentonmallen1
Contributor Author

brentonmallen1 commented Oct 28, 2021

Taking care of this in #176 since it's dependent

ThomasLaPiana pushed a commit that referenced this issue Aug 17, 2022
…173)

* updates docs for supported masking strategies and associated configs

* formatting to make each masking strategy more obvious

* missed a spot

* cr changes
ThomasLaPiana pushed a commit that referenced this issue Sep 26, 2022
…173)

* updates docs for supported masking strategies and associated configs

* formatting to make each masking strategy more obvious

* missed a spot

* cr changes