Clean top-level directory files #75

Merged
merged 5 commits into from
Oct 16, 2024
2 changes: 1 addition & 1 deletion Dockerfile
@@ -8,7 +8,7 @@ RUN addgroup -S app \
&& apk add --no-cache libc6-compat bash \
&& mkdir -p /opt/app /opt/DataCaterer/connection /opt/DataCaterer/plan /opt/DataCaterer/execution /opt/DataCaterer/report \
&& chown -R app:app /opt/app /opt/DataCaterer/connection /opt/DataCaterer/plan /opt/DataCaterer/execution /opt/DataCaterer/report
COPY --chown=app:app script app/src/main/resources app/build/libs /opt/app/
COPY --chown=app:app misc/docker-image app/src/main/resources app/build/libs /opt/app/

USER app
WORKDIR /opt/app
209 changes: 90 additions & 119 deletions README.md
@@ -6,7 +6,7 @@

A test data management tool with automated data generation, validation and cleanup.

![Basic data flow for Data Caterer](design/high_level_flow-run-config-basic-flow.svg)
![Basic data flow for Data Caterer](misc/design/high_level_flow-run-config-basic-flow.svg)

[Generate data](https://data.catering/setup/generator/data-generator/) for databases, files, messaging systems or HTTP
requests via UI, Scala/Java SDK or YAML input and executed via Spark. Run
@@ -34,21 +34,21 @@ and deep dive into issues [from the generated report](https://data.catering/samp
- [Alerts to be notified of results](https://data.catering/setup/report/alert/)
- [Run as GitHub Action](https://github.com/data-catering/insta-integration)

![Basic flow](design/basic_data_caterer_flow_medium.gif)
![Basic flow](misc/design/basic_data_caterer_flow_medium.gif)

## Quick start

1. [Mac download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-mac.zip)
2. [Windows download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-windows.zip)
1. [UI App: Mac download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-mac.zip)
2. [UI App: Windows download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-windows.zip)
1. After downloading, go to 'Downloads' folder and 'Extract All' from data-caterer-windows
2. Double-click 'DataCaterer-1.0.0' to install Data Caterer
3. Click on 'More info' then at the bottom, click 'Run anyway'
4. Go to '/Program Files/DataCaterer' folder and run DataCaterer application
5. If your browser doesn't open, go to [http://localhost:9898](http://localhost:9898) in your preferred browser
3. [Linux download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-linux.zip)
3. [UI App: Linux download](https://nightly.link/data-catering/data-caterer/workflows/build/main/data-caterer-linux.zip)
4. Docker
```shell
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer-basic:0.11.9
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer-basic:0.11.11
```
[Open localhost:9898](http://localhost:9898).

@@ -64,105 +64,61 @@ cd data-caterer-example && ./run.sh

### Supported data sources

Data Caterer supports the below data sources. Additional data sources can be added on a demand basis. [Check here for
the full roadmap](#roadmap).

| Data Source Type | Data Source | Support | Free |
|------------------|------------------------------------|---------|------|
| Cloud Storage | AWS S3 | ✅ | ✅ |
| Cloud Storage | Azure Blob Storage | ✅ | ✅ |
| Cloud Storage | GCP Cloud Storage | ✅ | ✅ |
| Database | Cassandra | ✅ | ✅ |
| Database | MySQL | ✅ | ✅ |
| Database | Postgres | ✅ | ✅ |
| Database | Elasticsearch | ❌ | ✅ |
| Database | MongoDB | ❌ | ✅ |
| File | CSV | ✅ | ✅ |
| File | Delta Lake | ✅ | ✅ |
| File | JSON | ✅ | ✅ |
| File | Iceberg | ✅ | ✅ |
| File | ORC | ✅ | ✅ |
| File | Parquet | ✅ | ✅ |
| File | Hudi | ❌ | ✅ |
| HTTP | REST API | ✅ | ❌ |
| Messaging | Kafka | ✅ | ❌ |
| Messaging | Solace | ✅ | ❌ |
| Messaging | ActiveMQ | ❌ | ❌ |
| Messaging | Pulsar | ❌ | ❌ |
| Messaging | RabbitMQ | ❌ | ❌ |
| Metadata | Great Expectations | ✅ | ❌ |
| Metadata | Marquez | ✅ | ❌ |
| Metadata | OpenAPI/Swagger | ✅ | ❌ |
| Metadata | OpenMetadata | ✅ | ❌ |
| Metadata | Open Data Contract Standard (ODCS) | ✅ | ❌ |
| Metadata | Amundsen | ❌ | ❌ |
| Metadata | Datahub | ❌ | ❌ |
| Metadata | Data Contract CLI | ❌ | ❌ |
| Metadata | Solace Event Portal | ❌ | ❌ |


## Supported use cases

1. Insert into single data sink
2. Insert into multiple data sinks
1. Foreign keys associated between data sources
2. Number of records per column value
3. Set random seed at column and whole data generation level
4. Generate real-looking data (via DataFaker) and edge cases
1. Names, addresses, places etc.
2. Edge cases for each data type (e.g. newline character in string, maximum integer, NaN, 0)
3. Nullability
5. Send events progressively
6. Automatically insert data into data source
1. Read metadata from data source and insert for all sub data sources (e.g. tables)
2. Get statistics from existing data in data source if exists
7. Track and delete generated data
8. Extract data profiling and metadata from given data sources
1. Calculate the total number of combinations
9. Validate data
1. Basic column validations (not null, contains, equals, greater than)
2. Aggregate validations (group by account_id and sum amounts should be less than 100, each account should have at
least one transaction)
3. Upstream data source validations (generate data and then check same data is inserted in another data source with
potential transformations)
4. Column name validations (check count and ordering of column names)
10. Data migration validations
1. Ensure row counts are equal
2. Check both data sources have same values for key columns
Data Caterer supports the below data sources. [Check here for the full roadmap](#roadmap).

| Data Source Type | Data Source | Support |
|------------------|------------------------------------|---------|
| Cloud Storage | AWS S3 | ✅ |
| Cloud Storage | Azure Blob Storage | ✅ |
| Cloud Storage | GCP Cloud Storage | ✅ |
| Database | Cassandra | ✅ |
| Database | MySQL | ✅ |
| Database | Postgres | ✅ |
| Database | Elasticsearch | ❌ |
| Database | MongoDB | ❌ |
| File | CSV | ✅ |
| File | Delta Lake | ✅ |
| File | JSON | ✅ |
| File | Iceberg | ✅ |
| File | ORC | ✅ |
| File | Parquet | ✅ |
| File | Hudi | ❌ |
| HTTP | REST API | ✅ |
| Messaging | Kafka | ✅ |
| Messaging | Solace | ✅ |
| Messaging | ActiveMQ | ❌ |
| Messaging | Pulsar | ❌ |
| Messaging | RabbitMQ | ❌ |
| Metadata | Data Contract CLI | ✅ |
| Metadata | Great Expectations | ✅ |
| Metadata | Marquez | ✅ |
| Metadata | OpenAPI/Swagger | ✅ |
| Metadata | OpenMetadata | ✅ |
| Metadata | Open Data Contract Standard (ODCS) | ✅ |
| Metadata | Amundsen | ❌ |
| Metadata | Datahub | ❌ |
| Metadata | Solace Event Portal | ❌ |

## Run Configurations

Different ways to run Data Caterer based on your use case:

![Types of run configurations](design/high_level_flow-run-config.svg)

## Sponsorship

Data Caterer is set up under a sponsorware model where all features are available to sponsors. The core features
are available here in this project for all to use/fork/update/improve etc., as the open core.

Sponsors have access to the following features:

- All data sources (see [here for all data sources](https://data.catering/setup/connection/))
- Batch and Event generation
- [Auto generation from data connections or metadata sources](https://data.catering/setup/guide/scenario/auto-generate-connection/)
- Suggest data validations
- [Clean up generated and consumed data](https://data.catering/setup/guide/scenario/delete-generated-data/)
- Run as many times as you want, not charged by usage
- Metadata discovery
- [Plus more to come](#roadmap)
Data Caterer is set up under a sponsorship model. If you require support or additional features from Data Caterer
as an enterprise, you are required to be a sponsor for the project.

[Find out more details here to help with sponsorship.](https://data.catering/sponsor)

This is inspired by the [mkdocs-material project](https://github.com/squidfunk/mkdocs-material) which
[follows the same model](https://squidfunk.github.io/mkdocs-material/insiders/).

## Contributing

[View details here about how you can contribute to the project.](CONTRIBUTING.md)
[View details here about how you can contribute to the project.](misc/CONTRIBUTING.md)

## Additional Details

## Run Configurations

Different ways to run Data Caterer based on your use case:

![Types of run configurations](misc/design/high_level_flow-run-config.svg)

### Design

[Design motivations and details can be found here.](https://data.catering/setup/design)
@@ -171,40 +127,55 @@ This is inspired by the [mkdocs-material project](https://github.com/squidfunk/m

[Can check here for full list.](https://data.catering/use-case/roadmap/)

#### UI
### Mildly Quick Start

1. Allow the application to run with UI enabled
2. Runs as a long-lived app with UI that interacts with the existing app as a single container
3. Ability to run as UI, Spark job or both
4. Persist data in files or database (Postgres)
5. UI will show the history of data generation/validation runs, delete generated data, create new scenarios, define data connections
#### I want to generate data in Postgres

#### Distribution
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer") //name and url
```

##### Docker
#### But I want `account_id` to follow a pattern

```shell
gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
#open localhost:9898
```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
.schema(field.name("account_id").regex("ACC[0-9]{10}"))
```
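
To see what the `ACC[0-9]{10}` pattern allows, here is a minimal shell sketch that checks a few values against it. The helper name `matches_account_id` and the sample values are illustrative assumptions and not part of Data Caterer. The pattern is anchored with `^` and `$` on the assumption that a generated value matches the whole pattern.

```shell
# Illustrative helper (not part of Data Caterer): succeeds when the value is
# "ACC" followed by exactly ten digits, i.e. the whole string matches ACC[0-9]{10}.
matches_account_id() {
  printf '%s\n' "$1" | grep -Eq '^ACC[0-9]{10}$'
}

# Hypothetical sample values.
for value in ACC0123456789 ACC123 XYZ0123456789; do
  if matches_account_id "$value"; then
    echo "$value: matches"
  else
    echo "$value: does not match"
  fi
done
```

Only the first sample value has the `ACC` prefix followed by exactly ten digits, so it is the only one reported as a match.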

##### Jpackage
#### I also want to generate events

```bash
JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"
```scala
kafka("my_kafka", "localhost:29092")
.topic("account-topic")
.schema(...)
```

##### Java 17 VM Options
#### But I want the same `account_id` to show in Postgres and Kafka

```shell
--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
```scala
val postgresTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
.schema(field.name("account_id").regex("ACC[0-9]{10}"))

val kafkaTask = kafka("my_kafka", "localhost:29092")
.topic("account-topic")
.schema(...)

plan.addForeignKeyRelationship(
postgresTask, List("account_id"),
List(kafkaTask -> List("account_id"))
)
```
-Dlog4j.configurationFile=classpath:log4j2.properties

#### I want to generate 5 transactions per `account_id`

```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
.count(count.recordsPerColumn(5, "account_id"))
```

#### But I want to generate 0 to 5 transactions per `account_id`

```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
.count(count.recordsPerColumnGenerator(generator.min(0).max(5), "account_id"))
```
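
The two count settings differ only in how many rows each `account_id` receives. As a rough sketch of the resulting totals, the shell arithmetic below prints the smallest and largest possible row counts. The helper name `total_rows_range` and the figure of 1000 accounts are hypothetical, and the sketch assumes the `min` and `max` bounds are inclusive.

```shell
# Illustrative helper (not part of Data Caterer): prints the minimum and maximum
# total row counts for a number of accounts and per-account row bounds.
total_rows_range() {
  accounts=$1
  min_per_account=$2
  max_per_account=$3
  echo "$((accounts * min_per_account)) $((accounts * max_per_account))"
}

total_rows_range 1000 5 5   # fixed 5 rows per account_id
total_rows_range 1000 0 5   # between 0 and 5 rows per account_id
```

With a fixed count both numbers are equal; with a range, the total can fall anywhere between the two.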
20 changes: 0 additions & 20 deletions local-docker-build.sh

This file was deleted.

File renamed without changes.
File renamed without changes.
File renamed without changes
29 changes: 29 additions & 0 deletions misc/distribution/README.md
@@ -0,0 +1,29 @@
#### Distribution

##### Docker

```shell
gradle clean :api:shadowJar :app:shadowJar
docker build --build-arg "APP_VERSION=0.7.0" --build-arg "SPARK_VERSION=3.5.0" --no-cache -t datacatering/data-caterer:0.7.0 .
docker run -d -i -p 9898:9898 -e DEPLOY_MODE=standalone -v data-caterer-data:/opt/data-caterer --name datacaterer datacatering/data-caterer:0.7.0
#open localhost:9898
```

##### Jpackage

```bash
JPACKAGE_BUILD=true gradle clean :api:shadowJar :app:shadowJar
# Mac
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-mac.cfg"
# Windows
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-windows.cfg"
# Linux
jpackage "@misc/jpackage/jpackage.cfg" "@misc/jpackage/jpackage-linux.cfg"
```

##### Java 17 VM Options

```shell
--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
```
-Dlog4j.configurationFile=classpath:log4j2.properties
File renamed without changes.
5 changes: 0 additions & 5 deletions run-docker.sh

This file was deleted.
