Avalanche is an automatic materialization process designed to transform raw, semi-structured data into structured, relational tables. It ensures data completeness while providing a streamlined approach to managing Snowflake Kafka connector-based schemas. It is a Python-based solution that integrates with Snowflake, Kafka, and other data sources to facilitate the transformation and loading of data into structured formats. While not strictly required, Avalanche was conceptualized at WW Tech as part of a broader stack like the one below.
The recommendation for a production setup is similar to the stack above, and this document therefore assumes familiarity with Kafka and the Snowflake Kafka Connector. For a non-standard setup, refer to config/sample_nyc_taxi_data.yaml for guidance.
The purpose of Avalanche is to transform raw, semi-structured data into relational, structured data. While doing so, it also checks for data completeness.
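To make the transformation concrete, here is a minimal illustrative sketch (not Avalanche's actual code). The Snowflake Kafka connector lands each message as two VARIANT columns, RECORD_METADATA and RECORD_CONTENT, and materialization flattens the payload into typed columns; the field names (order_id, status, total) are hypothetical:

```python
import json

# A raw row as the Snowflake Kafka connector lands it: RECORD_METADATA carries
# Kafka coordinates, RECORD_CONTENT carries the semi-structured payload.
raw_row = {
    "RECORD_METADATA": {"topic": "orders", "partition": 0, "offset": 42},
    "RECORD_CONTENT": json.dumps(
        {"order_id": 1001, "status": "SHIPPED", "total": "19.99"}
    ),
}

def materialize(row: dict) -> dict:
    """Flatten one semi-structured record into typed, relational columns."""
    content = json.loads(row["RECORD_CONTENT"])
    return {
        "order_id": int(content["order_id"]),
        "status": content["status"],
        "total": float(content["total"]),
        # Kafka coordinates are kept so completeness (no gaps in offsets)
        # can be verified downstream.
        "_kafka_offset": row["RECORD_METADATA"]["offset"],
    }

print(materialize(raw_row))
```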
This is a one-time setup in which the base Avalanche system tables are created. These tables serve as the foundation for all Avalanche deployments. This step is executed using the initialize_system.py module and is required only once per new system setup.
Refer to docs/system_initialization.md for details on how to run this script.
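As an illustration of what initialization amounts to, the sketch below creates a bookkeeping table once against a Snowflake account using the standard snowflake-connector-python API. The connection parameters and the table definition here are hypothetical; the real procedure is documented in docs/system_initialization.md:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical connection parameters; in practice these come from your
# environment/config as described in docs/system_initialization.md.
conn = snowflake.connector.connect(
    account="my_account",
    user="avalanche_admin",
    password="***",
    warehouse="AVALANCHE_WH",
    database="AVALANCHE_DB",
    schema="SYSTEM",
)

# One-time creation of a (hypothetical) system table that deployments build on.
conn.cursor().execute(
    """
    CREATE TABLE IF NOT EXISTS AVALANCHE_DEPLOYMENTS (
        deployment_name STRING,
        source STRING,
        created_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
    )
    """
)
conn.close()
```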
Once the system is initialized, you are ready to deploy the Avalanche service. Deployments are the core of Avalanche's functionality, allowing it to process data from various sources and materialize it into structured tables in Snowflake. Avalanche deployments are designed to materialize RAW tables (Snowflake Kafka connector-based schemas) into structured, queryable data tables. Each deployment is containerized and can be grouped by source (e.g., replicating an order transactions database).
Refer to docs/avalanche_service_deployment.md for details on how to deploy the Avalanche service.
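Conceptually, the core of a deployment is a recurring flatten-and-load pass over a RAW table. The sketch below is illustrative rather than Avalanche's implementation: the table and column names are hypothetical, while the RECORD_CONTENT/RECORD_METADATA VARIANT columns and the colon path syntax are standard Snowflake Kafka connector and Snowflake SQL conventions:

```python
import snowflake.connector

# Hypothetical incremental materialization of one RAW Kafka-connector table
# into a structured table; ':' extracts fields from VARIANT columns.
MATERIALIZE_SQL = """
INSERT INTO STRUCTURED.ORDERS (order_id, status, total, kafka_offset)
SELECT
    RECORD_CONTENT:order_id::NUMBER,
    RECORD_CONTENT:status::STRING,
    RECORD_CONTENT:total::FLOAT,
    RECORD_METADATA:offset::NUMBER
FROM RAW.ORDERS_TOPIC
WHERE RECORD_METADATA:offset > %(last_offset)s
"""

def run_materialization(
    conn: "snowflake.connector.SnowflakeConnection", last_offset: int
) -> None:
    """One incremental pass: load everything past the last processed offset."""
    conn.cursor().execute(MATERIALIZE_SQL, {"last_offset": last_offset})
```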
This section provides instructions for setting up a local development environment for Avalanche. It is designed to help developers get started quickly with Avalanche development and testing. The setup relies heavily on the make command to automate dependency installation, environment variable generation, and configuration.
Refer to docs/local_development_environment_setup.md for details on how to set up a local development environment.
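For local testing, a natural first step is to load one of the sample configs shipped with the repo, such as the NYC taxi example referenced earlier. A minimal sketch, assuming PyYAML is installed and making no assumptions about the file's actual keys:

```python
import yaml  # pip install pyyaml

# Load a sample deployment config from the repo; its schema is defined by
# Avalanche itself, so this sketch only parses and pretty-prints it.
with open("config/sample_nyc_taxi_data.yaml") as f:
    config = yaml.safe_load(f)

print(yaml.safe_dump(config, sort_keys=False))
```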
Avalanche in the wild
Avalanche is currently used at WW to ingest terabytes of data, supporting over 1400 topics spanning multiple data sources: Postgres, MySQL, Oracle, MongoDB, and schematized application events.
This section provides references to additional components in the recommended stack:
- Debezium Connector
- Snowflake Kafka Connector
- Apache Kafka Documentation
- Confluent Kafka - quick start guide
- Confluent Kafka Schema Registry
Thanks to all the people who have contributed to this project! Maintainers:
Star Contributors:
Want to contribute? For the time being, the best way is to open an issue in the repo, and we will get back to you.