
Data Caterer - Test Data Management Tool

Overview

A test data management tool with automated data generation, validation and clean-up.

Basic data flow for Data Caterer

Generate data for databases, files, messaging systems or HTTP requests via the UI, Scala/Java SDK or YAML input, executed via Spark. Run data validations after generating data to ensure it is consumed correctly. Clean up generated data, or consumed data in downstream data sources, to keep your environments tidy. Define alerts to get notified when failures occur, and deep dive into issues from the generated report.
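
As a rough end-to-end sketch with the Scala SDK (the PlanRun entry point, import path and execute call are assumptions based on the pattern in the examples repo; connection names and URLs are placeholders):

import io.github.datacatering.datacaterer.api.PlanRun  // assumed import path, may differ by version

// Minimal plan: generate accounts into Postgres, then validate the Parquet output
class CustomerPlan extends PlanRun {
  val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

  val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
    .validation(validation.count.isEqual(1000))

  execute(postgresTask, parquetValidation)
}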

Full docs can be found here.

Scala/Java examples can be found here.

A demo of the UI can be found here.

Features

Basic flow

Quick start

  1. Docker
    docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.12.1
    Open localhost:9898.
  2. Run Scala/Java examples
    git clone git@github.com:data-catering/data-caterer-example.git
    cd data-caterer-example && ./run.sh
    # check the results by opening docker/sample/report/index.html
  3. UI App: Mac download
  4. UI App: Windows download
    1. After downloading, go to the 'Downloads' folder and select 'Extract All' on data-caterer-windows
    2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
    3. Click 'More info', then at the bottom click 'Run anyway'
    4. Go to the '/Program Files/DataCaterer' folder and run the DataCaterer application
    5. If your browser doesn't open, go to http://localhost:9898 in your preferred browser
  5. UI App: Linux download

Integrations

Supported data sources

Data Caterer supports the below data sources. Check here for the full roadmap.

Data Source Type | Data Source
Cloud Storage    | AWS S3
Cloud Storage    | Azure Blob Storage
Cloud Storage    | GCP Cloud Storage
Database         | Cassandra
Database         | MySQL
Database         | Postgres
Database         | Elasticsearch
Database         | MongoDB
File             | CSV
File             | Delta Lake
File             | JSON
File             | Iceberg
File             | ORC
File             | Parquet
File             | Hudi
HTTP             | REST API
Messaging        | Kafka
Messaging        | Solace
Messaging        | ActiveMQ
Messaging        | Pulsar
Messaging        | RabbitMQ
Metadata         | Data Contract CLI
Metadata         | Great Expectations
Metadata         | Marquez
Metadata         | OpenAPI/Swagger
Metadata         | OpenMetadata
Metadata         | Open Data Contract Standard (ODCS)
Metadata         | Amundsen
Metadata         | Datahub
Metadata         | Solace Event Portal

Sponsorship

Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer as an enterprise, you are required to sponsor the project.

Find out more about sponsorship here.

Contributing

View details here about how you can contribute to the project.

Additional Details

Run Configurations

Different ways to run Data Caterer based on your use case:

Types of run configurations

Design

Design motivations and details can be found here.

Roadmap

Check here for the full list.

Mildly Quick Start

Generate and validate data

// Step 1: define the connection (name and URL)
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")

// Step 2: define the fields to generate
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

// Step 3: validate the record count of the downstream Parquet output
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))

val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(validation.count.isEqual(1000))

// Step 4: join the Parquet output back to the generated Postgres data before validating
val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )

// Step 5: wait for the Parquet output to exist before running validations
val parquetValidation = parquet("output_parquet", "/data/parquet/customer")
  .validation(
    validation.upstreamData(postgresTask)
      .joinFields("account_id")
      .withValidation(validation.count().isEqual(1000))
  )
  .validationWait(waitCondition.file("/data/parquet/customer"))
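
The file wait condition holds off the validations until the Parquet path exists, so checks don't run before the downstream job has written its output.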

Generate same data across data sources

// Step 1: define the Kafka connection, topic and fields
kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

// Step 2: link the Postgres and Kafka tasks via a foreign key relationship
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .fields(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
  .topic("account-topic")
  .fields(...)

plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(kafkaTask -> List("account_id"))
)
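
The foreign key relationship reuses the account_id values generated for Postgres when producing the Kafka messages, so both data sources receive consistent records.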

Generate data and clean up

// Step 1: generate 5 transactions per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerField(5, "account_id"))

// Step 2: generate between 1 and 5 transactions per account_id
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(1).max(5), "account_id"))

// Step 3: delete the previously generated records instead of generating new data
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("account", "transactions")
  .count(count.recordsPerFieldGenerator(generator.min(0).max(5), "account_id"))

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)

// Step 4: also delete records your job ingested into Cassandra, linked via account_id
val cassandraTxns = cassandra("ingested_data", "localhost:9042")
  .table("account", "transactions")

val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("account_id"))
)

// Step 5: the same, where the ingested records store a transformed account_id
val deletePlan = plan.addForeignKeyRelationship(
  postgresTask, List("account_id"),
  List(),
  List(cassandraTxns -> List("SUBSTR(account_id, 3) AS account_number"))
)

val conf = configuration
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
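
With enableDeleteGeneratedRecords(true) and enableGenerateData(false), a run deletes the previously generated Postgres records rather than creating new ones, and the delete-only foreign key relationship extends the clean-up to the records ingested into Cassandra, including rows matched via the transformed account_id in the last step.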

Generate data with schema from metadata source

// Generate fields from an Open Data Contract Standard (ODCS) file
parquet("customer_parquet", "/data/parquet/customer")
  .fields(metadataSource.openDataContractStandard("/data/odcs/full-example.odcs.yaml"))

// Generate fields from an OpenAPI/Swagger spec
http("my_http")
  .fields(metadataSource.openApi("/data/http/petstore.json"))
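
In both cases the field definitions come from the metadata source (the ODCS contract or the OpenAPI spec) rather than being defined by hand.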

Validate data using validations from metadata source

parquet("customer_parquet", "/data/parquet/customer")
  .validations(metadataSource.greatExpectations("/data/great-expectations/taxi-expectations.json"))
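
Here the existing Great Expectations suite supplies the validation rules, so expectations don't need to be rewritten in Data Caterer.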
