Inspired by the original 1BRC challenge created by Gunnar Morling: https://www.morling.dev/blog/one-billion-row-challenge
- Docker Engine and Docker Compose
- about XXGB of free space
- the challenge will run only on these supported architectures:
- Linux - x86_64
- Darwin (Mac) - x86_64 and arm
- Windows
- A Kafka cluster with 3 brokers. The cluster must be local only. Reserve approximately XXGB for data.
- An input topic named data with 32 partitions, replication factor 3 and LogAppendTime timestamps
- An output topic named results with 32 partitions and replication factor 3
- The Kafka cluster must be started with the script run/bootstrap.sh from this repository; bootstrap.sh also creates the input and output topics (an illustrative topic-creation sketch follows this list).
- Brokers will listen on ports 9092, 9093 and 9094. No authentication, no SSL.
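The topics above are created for you by run/bootstrap.sh. Purely to make the required settings concrete, a minimal Java AdminClient sketch that creates equivalent topics could look like the following (the class name and the use of the Java AdminClient are illustrative assumptions, not the actual contents of the script):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // the three local brokers from the prerequisites above
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "localhost:9092,localhost:9093,localhost:9094");
        try (AdminClient admin = AdminClient.create(props)) {
            // input topic: 32 partitions, replication factor 3, broker-side LogAppendTime timestamps
            NewTopic data = new NewTopic("data", 32, (short) 3)
                    .configs(Map.of(TopicConfig.MESSAGE_TIMESTAMP_TYPE_CONFIG, "LogAppendTime"));
            // output topic: 32 partitions, replication factor 3
            NewTopic results = new NewTopic("results", 32, (short) 3);
            admin.createTopics(List.of(data, results)).all().get();
        }
    }
}
```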
Implement a solution with the Kafka APIs, Kafka Streams, Flink, ksqlDB, Spark, NiFi, camel-kafka, spring-kafka... reading input data from the data topic and sinking results to the results topic, and run it! This is not limited to Java!
Ingest data into a Kafka topic:
- Create 10 CSV files using the script run/data.sh or run/windows/data.exe from this repository. Reserve approximately 19GB for them. This will take several minutes to complete.
- Each row is one record in the format <string: customer id>;<string: order id>;<double: price in EUR>, with the price value having exactly 2 fractional digits:
```
ID672;IQRWG;363.81
ID016;OEWET;9162.02
ID002;IOIUD;15017.20
...
```
- There are 999 distinct customers
- Price value: a non-null double between 0.00 (inclusive) and 50000.00 (inclusive), always with 2 fractional digits
- Read from the CSV files AND continuously send data to the data topic using the script run/producer.sh from this repository (an illustrative Java producer sketch follows this list)
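run/producer.sh is the script that actually publishes the rows. As a rough sketch only, a minimal Java producer streaming one CSV file into the data topic might look like this (the file name, the choice of customer id as message key, and the configs are illustrative assumptions, not what the script actually does):

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CsvProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "localhost:9092,localhost:9093,localhost:9094");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             // "file-1.csv" is a hypothetical file name, not one produced by data.sh
             BufferedReader reader = Files.newBufferedReader(Path.of("file-1.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // each line is "<customer id>;<order id>;<price>"; keying by customer id
                // is an assumption, not necessarily what run/producer.sh does
                String customerId = line.substring(0, line.indexOf(';'));
                producer.send(new ProducerRecord<>("data", customerId, line));
            }
        }
    }
}
```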
The output topic must contain messages with key/value and no additional headers:
- Key: the customer id, for example ID672
- Value: order count | count of orders with price > 40000 | min price | max price, grouped by key. Example: 1212 | 78 | 4.22 | 48812.22 (the sketch after this list shows one way to compute these)
- Expected: 999 distinct messages, one per customer
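To make the expected aggregation concrete, here is a minimal, independent Kafka Streams sketch that computes the four statistics per customer and writes them to results (the application id, the string-encoded accumulator and the output formatting are illustrative assumptions; the sample implementation in the challenge folder may differ):

```java
import java.util.Locale;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class AggregationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-aggregation-sketch"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
                "localhost:9092,localhost:9093,localhost:9094");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("data", Consumed.with(Serdes.String(), Serdes.String()))
                // re-key each "<customer id>;<order id>;<price>" row by customer id,
                // keeping only the price as the value
                .map((k, v) -> {
                    String[] f = v.split(";");
                    return KeyValue.pair(f[0], f[2]);
                })
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // accumulator encoded as "count|countOver40000|min|max" to avoid a custom serde
                .aggregate(() -> "0|0|" + Double.MAX_VALUE + "|" + (-Double.MAX_VALUE),
                        (customerId, priceStr, agg) -> {
                            String[] a = agg.split("\\|");
                            double price = Double.parseDouble(priceStr);
                            long count = Long.parseLong(a[0]) + 1;
                            long over = Long.parseLong(a[1]) + (price > 40000 ? 1 : 0);
                            double min = Math.min(Double.parseDouble(a[2]), price);
                            double max = Math.max(Double.parseDouble(a[3]), price);
                            return count + "|" + over + "|" + min + "|" + max;
                        },
                        Materialized.with(Serdes.String(), Serdes.String()))
                .toStream()
                // render as "count | countOver40000 | min | max", prices with 2 fractional digits
                .mapValues(v -> {
                    String[] a = v.split("\\|");
                    return String.format(Locale.ROOT, "%s | %s | %.2f | %.2f",
                            a[0], a[1], Double.parseDouble(a[2]), Double.parseDouble(a[3]));
                })
                .to("results", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The string-encoded accumulator keeps the sketch short and serde-free; a real solution would more likely aggregate into a dedicated class with its own serde.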
💡 The Kafka cluster runs cp-kafka, the official Confluent Docker image for Kafka (Community Version), version 7.6.0, shipping Apache Kafka 3.6.x
💡 Verify the messages published to the data topic with the run/consumer.sh script, which uses https://raw.githubusercontent.com/confluentinc/librdkafka/master/examples/consumer.c. To run the consumer, verify that you have librdkafka installed.
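If you would rather spot-check from Java than build the librdkafka C example, a throwaway consumer like the following sketch would also do (the group id and the short poll loop are arbitrary assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class VerifyConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "localhost:9092,localhost:9093,localhost:9094");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "verify-data"); // hypothetical group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("data"));
            // poll a few times; the first polls may only complete the group rebalance
            for (int i = 0; i < 10; i++) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("%s -> %s%n", r.key(), r.value());
                }
            }
        }
    }
}
```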
- Run the script run/data.sh or run/windows/data.exe to create 1B rows split across 10 CSV files.
- Run the script run/bootstrap.sh to set up the Kafka cluster and the required topics.
- Deploy your solution and run it, publishing data to the results topic.
- Run the script run/producer.sh in a new terminal. The producer will read from the input files and publish to the data topic.
At the end, clean up with the script run/tear-down.sh
- Fork this repo
- Add your solution to a folder named challenge-YOURNAME, for example challenge-hifly
- Open a Pull Request detailing your solution with instructions on how to deploy it
✅ Your solution will be tested using the same docker-compose file. Results will be published on this page.
💻 Solutions will be tested on a (TODO) server
💡 A sample implementation built with Kafka Streams is present in the folder challenge. Test it with:
```bash
cd challenge
mvn clean compile && mvn exec:java -Dexec.mainClass="io.hifly.onebrcstreaming.SampleApp"
```