tap-test-data #90

Open
pnadolny13 opened this issue Sep 12, 2023 · 6 comments

@pnadolny13
Collaborator

pnadolny13 commented Sep 12, 2023

Many taps have statically defined schemas, so you can run --discover without credentials to get the catalog schema for all available streams. Given that catalog as input, it would be nice to have a tap that generates fake data conforming to it. This would be very helpful for testing pipelines without credentials. https://github.com/lk-geimfari/mimesis or https://github.com/joke2k/faker can be used to generate fake data.

This is somewhat similar to meltano/sdk#1009 but that requires the tap to pull real data in order to mask it.

@edgarrmondragon any thoughts? I want to test out https://hub.meltano.com/extractors/tap-amazon-sp but I don't have credentials, which made me wonder if we could generate data as if I did. This would unlock anything downstream. I would expect the tap to conform to its own schema; otherwise that's a bug anyway.
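
As a rough illustration of the idea, here is a minimal sketch that walks a catalog and emits Singer SCHEMA and RECORD messages as JSON lines. The catalog contents, the `singer_messages` and `fixed_record` names, and the hand-written record are hypothetical. The sketch also assumes `key_properties` sits directly on each catalog entry; some taps only carry it in the entry's metadata.

```python
import json
from typing import Callable, Iterator


def singer_messages(
    catalog: dict,
    make_record: Callable[[str, dict], dict],
    rows_per_stream: int = 2,
) -> Iterator[str]:
    """Yield one SCHEMA message and some RECORD messages per catalog stream."""
    for entry in catalog.get("streams", []):
        stream = entry.get("tap_stream_id") or entry.get("stream")
        schema = entry.get("schema", {})
        yield json.dumps(
            {
                "type": "SCHEMA",
                "stream": stream,
                "schema": schema,
                # Assumption: key_properties is present on the entry itself.
                "key_properties": entry.get("key_properties", []),
            }
        )
        for _ in range(rows_per_stream):
            record = make_record(stream, schema)
            yield json.dumps({"type": "RECORD", "stream": stream, "record": record})


# Hypothetical catalog, shaped like the output of a tap's --discover.
CATALOG = {
    "streams": [
        {
            "tap_stream_id": "orders",
            "key_properties": ["id"],
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "status": {"type": "string"},
                },
            },
        }
    ]
}


def fixed_record(stream: str, schema: dict) -> dict:
    """Placeholder record factory; a real one would derive values from the schema."""
    return {"id": 1, "status": "shipped"}


lines = list(singer_messages(CATALOG, fixed_record))
for line in lines:
    print(line)
```

The record factory is injected so that the message envelope stays separate from whatever fake-data library ends up producing the values.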

pnadolny13 self-assigned this Sep 12, 2023
@edgarrmondragon
Member

@pnadolny13 I've used the faker approach to test tap-dbt, for which I don't have credentials at the moment.

It's allowed me to convince myself that the tap is tested 😅 but a bunch of issues get swept under the rug by not having access to the production system: which fields are nullable, how incremental replication works in practice, etc.

I guess GIGO is very true for this approach, so the better the source system documentation, the better the mocks.

@pnadolny13
Collaborator Author

> It's allowed me to convince myself that the tap is tested 😅 but a bunch of issues get swept under the rug by not having access to the production system: which fields are nullable, how incremental replication works in practice, etc.

@edgarrmondragon Your use case sounds like it was more about mocking API data to build the tap, while mine is more about simulating an existing tap's output without credentials. I was thinking of using it to mock the tap completely at the integration test level rather than mocking API responses.

> I guess GIGO is very true for this approach

Yeah, assuming the tap developer, who built the tap using real data with their valid credentials, properly defined the stream schemas, then theoretically we could run a pipeline that simulates the data the tap would generate.

@edgarrmondragon
Member

@pnadolny13 I see, so you'd like a generic tap that can be configured by a developer of a different tap to output messages that simulate the latter's?

> who built the tap using real data with their valid credentials, properly defined the stream schemas

If you know the schemas, is this approach much better than generating the data once [1] and, for example, sharing around a Singer JSON lines file, or is there a benefit to having this wrapped by a tap?

[1]: We could still help with that somehow by offering wrappers for mimesis or faker that allow a developer to define a single source of truth (SSOT) for both the JSON schema and the generator function, something like https://gist.github.com/edgarrmondragon/9b6d962e232a37a883e577150a723c09.
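
To make the single-source-of-truth idea concrete, here is a minimal sketch in which one field definition yields both the JSON schema and the generated record. It is not taken from the linked gist; the `Field` class, `USER_FIELDS`, and the helper names are hypothetical, and the standard library's `random` stands in for mimesis or faker so the example runs without extra dependencies.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Field:
    """One field definition: its JSON type and how to fake a value for it."""

    json_type: str
    generate: Callable[[random.Random], Any]


# Hypothetical stream definition; in practice the lambdas would call
# mimesis or faker providers instead of random.
USER_FIELDS: Dict[str, Field] = {
    "id": Field("integer", lambda rng: rng.randint(1, 10_000)),
    "name": Field("string", lambda rng: rng.choice(["Ada", "Grace", "Linus"])),
    "active": Field("boolean", lambda rng: rng.random() < 0.5),
}


def to_json_schema(fields: Dict[str, Field]) -> dict:
    """Derive the stream's JSON schema from the field definitions."""
    return {
        "type": "object",
        "properties": {name: {"type": f.json_type} for name, f in fields.items()},
        "required": list(fields),
    }


def make_record(fields: Dict[str, Field], rng: random.Random) -> dict:
    """Generate one record from the same field definitions."""
    return {name: f.generate(rng) for name, f in fields.items()}


rng = random.Random(42)  # seeded so repeated runs produce the same data
schema = to_json_schema(USER_FIELDS)
record = make_record(USER_FIELDS, rng)
print(schema)
print(record)
```

Because the schema and the record come from the same dictionary, the two cannot drift apart, which is the point of defining them once.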

@pnadolny13
Collaborator Author

> @pnadolny13 I see, so you'd like a generic tap that can be configured by a developer of a different tap to output messages that simulate the latter's?

Yes - I think 😅. Given a catalog for any tap, I should be able to generate records that conform to the JSON schema for those streams.

> If you know the schemas, is this approach much better than generating the data once [1] and, for example, sharing around a Singer JSON lines file, or is there a benefit to having this wrapped by a tap?

This is more for the multi-tenant use case, where my customer's data is involved, i.e. running the tap with their credentials. The customer could share their catalog and we could generate a test data set for ourselves, with no need for credentials and no PII to worry about.

@edgarrmondragon
Member

I see. That's definitely doable by mapping JSON types to generators. Without further annotations on the fields it would generate nonsense (e.g. name is a string, and any string passes as a valid name), but that's probably OK as a rough approximation of the true source.
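
A minimal sketch of such a type-to-generator mapping, using only the standard library, might look like the following. The `fake_value` function, the 20% null rate, the value ranges, and the example schema are all arbitrary assumptions; only a small subset of JSON Schema is handled (`type`, `enum`, `properties`, `items`, and the `date-time` format).

```python
import random
import string
from datetime import datetime, timezone


def fake_value(schema: dict, rng: random.Random):
    """Return a random value matching a small subset of JSON Schema."""
    if "enum" in schema:
        return rng.choice(schema["enum"])

    # "type" may be a string or a list such as ["string", "null"].
    # Assumption: a schema without "type" is treated as a string.
    types = schema.get("type", "string")
    if isinstance(types, str):
        types = [types]

    # Nullable fields come back as None some of the time (arbitrary 20%).
    if "null" in types and (len(types) == 1 or rng.random() < 0.2):
        return None
    json_type = next(t for t in types if t != "null")

    if json_type == "object":
        props = schema.get("properties", {})
        return {name: fake_value(sub, rng) for name, sub in props.items()}
    if json_type == "array":
        items = schema.get("items", {})
        return [fake_value(items, rng) for _ in range(rng.randint(0, 3))]
    if json_type == "integer":
        return rng.randint(0, 1000)
    if json_type == "number":
        return round(rng.uniform(0, 1000), 2)
    if json_type == "boolean":
        return rng.random() < 0.5
    if json_type == "string":
        if schema.get("format") == "date-time":
            ts = rng.randint(0, 1_700_000_000)
            return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
        # Any string is schema-valid, so this is nonsense by design.
        return "".join(rng.choices(string.ascii_lowercase, k=8))
    raise ValueError(f"unsupported JSON type: {json_type}")


# Hypothetical stream schema for demonstration.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": ["string", "null"]},
        "tags": {"type": "array", "items": {"type": "string"}},
        "created_at": {"type": "string", "format": "date-time"},
    },
}

example = fake_value(EXAMPLE_SCHEMA, random.Random(0))
print(example)
```

The `name` field illustrates the caveat above: the output is a random lowercase string, valid against the schema but not a plausible name without extra annotations.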

@pnadolny13
Collaborator Author

I hacked together a version of this: https://github.com/pnadolny13/tap-test-data.

I wasn't able to find a good implementation in Python that turns a JSON schema into fake data, so I ended up using the npm package json-schema-faker (https://json-schema-faker.js.org/), specifically the CLI version: https://github.com/oprogramador/json-schema-faker-cli.

There are still challenges with its output, and the data is nonsense because the package is primarily meant for generating fake data for unit tests that try to break your code... but it does pass validation.
