
Data Caterer - Test Data Management Tool

Overview

A test data management tool with automated data generation, validation and clean-up.

Basic data flow for Data Caterer

Generate data for databases, files, messaging systems or HTTP requests via the UI, Scala/Java SDK or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or consumed data in downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and deep dive into issues from the generated report.
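
As a rough end-to-end sketch with the Scala SDK (the PlanRun entry point, import path and execute call are assumptions based on the pattern in the examples repo; connection names and URLs are placeholders):

import io.github.datacatering.datacaterer.api.PlanRun  // assumed import path, may differ by version

// Minimal plan: generate accounts into Postgres, then validate the Parquet output
class CustomerPlan extends PlanRun {
  val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

  val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
    .validation(validation.count.isEqual(1000))

  execute(postgresTask, parquetValidation)
}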

Full docs can be found here.

Scala/Java examples can be found here.

A demo of the UI can be found here.

Features

Basic flow

Quick start

  1. Docker
    docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.12.1
    Open localhost:9898.
  2. Run Scala/Java examples
    git clone git@github.com:data-catering/data-caterer-example.git
    cd data-caterer-example && ./run.sh
    # check the results by opening docker/sample/report/index.html
  3. UI App: Mac download
  4. UI App: Windows download
    1. After downloading, go to the 'Downloads' folder and select 'Extract All' on data-caterer-windows
    2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
    3. Click 'More info', then at the bottom click 'Run anyway'
    4. Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
    5. If your browser doesn't open, go to http://localhost:9898 in your preferred browser
  5. UI App: Linux download

Integrations

Supported data sources

Data Caterer supports the below data sources. Check here for the full roadmap.

Data Source Type | Data Source
Cloud Storage    | AWS S3
Cloud Storage    | Azure Blob Storage
Cloud Storage    | GCP Cloud Storage
Database         | Cassandra
Database         | MySQL
Database         | Postgres
Database         | Elasticsearch
Database         | MongoDB
File             | CSV
File             | Delta Lake
File             | JSON
File             | Iceberg
File             | ORC
File             | Parquet
File             | Hudi
HTTP             | REST API
Messaging        | Kafka
Messaging        | Solace
Messaging        | ActiveMQ
Messaging        | Pulsar
Messaging        | RabbitMQ
Metadata         | Data Contract CLI
Metadata         | Great Expectations
Metadata         | Marquez
Metadata         | OpenAPI/Swagger
Metadata         | OpenMetadata
Metadata         | Open Data Contract Standard (ODCS)
Metadata         | Amundsen
Metadata         | Datahub
Metadata         | Solace Event Portal

Sponsorship

Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer as an enterprise, you are required to sponsor the project.

Find out more about sponsorship here.

Contributing

View details here about how you can contribute to the project.

Additional Details

Run Configurations

Different ways to run Data Caterer based on your use case:

Types of run configurations

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list.

Mildly Quick Start

Generate and validate data

// Step 1: define the connection (name and URL)
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")

// Step 2: define the fields to generate
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

// Step 3: validate the record count of the downstream Parquet output
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count.isEqual(1000))

// Step 4: join the Parquet output back to the generated Postgres data before validating
val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )

// Step 5: wait for the Parquet output to exist before running validations
val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
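
The file wait condition holds off the validations until the Parquet path exists, so checks don't run before the downstream job has written its output.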

Generate same data across data sources

// Step 1: define the Kafka connection, topic and fields
kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

// Step 2: link the Postgres and Kafka tasks via a foreign key relationship
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(kafkaTask -> List("account_id"))
)
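
The foreign key relationship reuses the account_id values generated for Postgres when producing the Kafka messages, so both data sources receive consistent records.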

Generate data and clean up

// Step 1: generate 5 transactions per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerField(5, "account_id"))

// Step 2: generate between 1 and 5 transactions per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(1).max(5), "account_id"))

// Step 3: delete the previously generated records instead of generating new data
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

// Step 4: also delete records your job ingested into Cassandra, linked via account_id
val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("account_id"))
)

// Step 5: the same, where the ingested records store a transformed account_id
val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
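
With enableDeleteGeneratedRecords(true) and enableGenerateData(false), a run deletes the previously generated Postgres records rather than creating new ones, and the delete-only foreign key relationship extends the clean-up to the records ingested into Cassandra, including rows matched via the transformed account_id in the last step.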

Generate data with schema from metadata source

// Generate fields from an Open Data Contract Standard (ODCS) file
parquet("customer_parquet", "/data/parquet/customer")
  .fields(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))

// Generate fields from an OpenAPI/Swagger spec
http("my_http")
  .fields(metadataSource.openApi("/data/http/petstore.json"))
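
In both cases the field definitions come from the metadata source (the ODCS contract or the OpenAPI spec) rather than being defined by hand.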

Validate data using validations from metadata source

parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))
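
Here the existing Great Expectations suite supplies the validation rules, so expectations don't need to be rewritten in Data Caterer.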
