Streaming data with Kafka. Current projects:
The workflow is as follows:
- Stream data from Postgres to Kafka using Debezium (log-based CDC), ksqlDB, and Kafka Connect, provided by Confluent Platform (a hedged sketch of the connector statement follows this list). Code: `ksql/source/source__postgres__airbnb.sql`
- Sink data from Kafka to Google Cloud Storage using Kafka Connect; data is stored with Hive-style partitioning (see the example config under "Create a connector" below). Code: `connectors/sink/gcp/gcs-sink.json`
- Automatically detect new topics and register them as external tables in BigQuery using Dagster (a sketch of the idea also follows this list). Code: `kafka-dagster/kafka_dagster/airbnb__gcs_to_bigquery_asset.py`
- Example of the DAG created on Dagster:
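For reference, a minimal sketch of what the source connector definition in `ksql/source/source__postgres__airbnb.sql` could look like; the hostname, credentials, database, and table list are placeholder assumptions, not the actual file contents:

```sql
-- Sketch: register a Debezium Postgres source connector through ksqlDB.
-- Host, credentials, database, and table list are placeholder assumptions.
CREATE SOURCE CONNECTOR source__postgres__airbnb WITH (
  'connector.class'      = 'io.debezium.connector.postgresql.PostgresConnector',
  'database.hostname'    = 'postgres',
  'database.port'        = '5432',
  'database.user'        = 'postgres',
  'database.password'    = 'postgres',
  'database.dbname'      = 'airbnb',
  'table.include.list'   = 'public.listings',
  -- renamed to 'topic.prefix' in Debezium 2.x
  'database.server.name' = 'airbnb'
);
```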
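And a hedged sketch of the idea behind `airbnb__gcs_to_bigquery_asset.py`: list the topic prefixes that have landed in GCS and register each one as a BigQuery external table. Bucket, dataset, and naming conventions below are assumptions, not the repo's actual code, and Hive-partition options are omitted for brevity:

```python
# Sketch of the auto-registration idea; bucket, dataset, and naming
# conventions are assumptions, not the repo's actual code.
from dagster import asset
from google.cloud import bigquery, storage

BUCKET = "my-airbnb-bucket"   # assumed GCS bucket written by the sink
DATASET = "airbnb_raw"        # assumed BigQuery dataset

@asset
def gcs_topics_as_external_tables() -> list[str]:
    """Register each topic folder in GCS as a BigQuery external table."""
    gcs = storage.Client()
    bq = bigquery.Client()
    # Top-level "folders" in the bucket correspond to Kafka topics
    # (the Connect sink writes <topic>/dt=YYYY-MM-dd/... objects).
    blobs = gcs.list_blobs(BUCKET, delimiter="/")
    list(blobs)  # consume the iterator so .prefixes is populated
    topics = [p.rstrip("/") for p in blobs.prefixes]

    for topic in topics:
        table_id = f"{bq.project}.{DATASET}.{topic.replace('.', '_')}"
        external = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
        external.source_uris = [f"gs://{BUCKET}/{topic}/*"]
        external.autodetect = True
        table = bigquery.Table(table_id)
        table.external_data_configuration = external
        bq.create_table(table, exists_ok=True)  # idempotent registration
    return topics
```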
Useful references and commands:
- Install Kafka Connect connector plugins in Docker: https://rmoff.net/2020/06/19/how-to-install-connector-plugins-in-kafka-connect/
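A common pattern from that post is to bake plugins into the image with `confluent-hub`; a minimal sketch, assuming the Confluent base image and the GCS sink plugin used here:

```dockerfile
# Sketch: bake the GCS sink connector plugin into the Connect image.
# Base image tag is an assumption; pin the version you actually run.
FROM confluentinc/cp-kafka-connect:7.5.0

# confluent-hub installs into /usr/share/confluent-hub-components;
# make sure that directory is on the worker's plugin.path.
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-gcs:latest
```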
- Monitor Kafka Connect and connectors: https://docs.confluent.io/platform/current/connect/monitoring.html
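Alongside the JMX metrics described in that doc, the Kafka Connect REST API reports connector and task state; a quick check, assuming the connector name `gcs-sink` and Connect listening on localhost:8083:

```sh
# Show connector and task state (RUNNING / FAILED, plus any stack trace).
curl -s http://localhost:8083/connectors/gcs-sink/status
```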
- Streaming ETL pipeline tutorial: https://docs.ksqldb.io/en/latest/tutorials/etl/#create-the-ksqldb-source-streams
- Start the ksqlDB CLI: `docker exec -it ksqldb-cli ksql http://ksqldb-server:8088`
- Create a connector: `curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @opt/connectors/sink/gcp/gcs-sink.json`
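JSON allows no comments, so to be clear up front: the following is only a sketch of what `connectors/sink/gcp/gcs-sink.json` could contain; the topic, bucket, credentials path, and flush settings are assumptions. The `TimeBasedPartitioner` with `path.format` `'dt'=YYYY-MM-dd` is what produces the Hive-style layout mentioned above:

```json
{
  "name": "gcs-sink",
  "config": {
    "connector.class": "io.confluent.connect.gcs.GcsSinkConnector",
    "topics": "airbnb.public.listings",
    "gcs.bucket.name": "my-airbnb-bucket",
    "gcs.credentials.path": "/opt/credentials/gcp.json",
    "flush.size": "1000",
    "storage.class": "io.confluent.connect.gcs.storage.GcsStorage",
    "format.class": "io.confluent.connect.gcs.format.json.JsonFormat",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'dt'=YYYY-MM-dd",
    "partition.duration.ms": "86400000",
    "locale": "en-US",
    "timezone": "UTC",
    "timestamp.extractor": "Record"
  }
}
```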
- Update a connector (note: the PUT body is the bare config map, without the `name`/`config` wrapper used by POST; see the note below): `curl --request PUT -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/gcs-sink/config -d @opt/connectors/sink/gcp/archive/gcs-sink-update.json`
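Concretely, `PUT /connectors/<name>/config` replaces the whole configuration with the flat map it receives, so a `gcs-sink-update.json` has no top-level `name` or `config` keys. A minimal illustration of the shape, reusing the assumed values from the sketch above:

```json
{
  "connector.class": "io.confluent.connect.gcs.GcsSinkConnector",
  "topics": "airbnb.public.listings",
  "gcs.bucket.name": "my-airbnb-bucket",
  "flush.size": "2000"
}
```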
- List current connectors: `curl --request GET -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/`
- CI/CD with Cloud Build and Compute Engine: https://beranger.medium.com/automate-deployment-with-google-compute-engine-and-cloud-build-cccd5c3eb93c
- Build the Dagster image: `docker build --tag tototus-dagster --file Dockerfile-dagster .`
- Run the dbt-dagster project locally: `cd dbt_dagster/airbnb && dagster dev`