# streaming-data-pipeline

[CircleCI build status badge]

Streaming pipeline repo for data engineering training program

See the setup READMEs for the producers and consumers in their respective directories.

# Local environment setup

### Prerequisites

- Make sure you have `sbt` installed.
- Make sure you have Docker installed and running.
- Make sure you don't have a previous instance of Zookeeper, Kafka, or Spark running before executing the script; otherwise the script won't be able to allocate the ports it needs. A quick check is sketched below.
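
To make the port warning concrete, here is a minimal pre-flight sketch. The port numbers (2181 for Zookeeper, 9092 for Kafka, 8080 for the Spark master UI) are assumptions based on common defaults, not something this README pins down, and `lsof` may need to be installed separately on your machine.

```bash
#!/usr/bin/env bash
# Sketch: verify prerequisites before running the local setup.
# The ports below are assumed defaults (Zookeeper 2181, Kafka 9092,
# Spark master UI 8080) -- adjust them to your setup.

command -v sbt >/dev/null || { echo "sbt not found"; exit 1; }
command -v docker >/dev/null || { echo "docker not found"; exit 1; }
docker info >/dev/null 2>&1 || { echo "docker daemon not running"; exit 1; }

for port in 2181 9092 8080; do
  if lsof -i ":$port" >/dev/null 2>&1; then
    echo "Port $port is already in use -- stop the process holding it first"
  fi
done
```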

### Steps

1. Run `./sbin/buildAndRunLocal.sh`. This creates various Docker containers (each with an independent purpose) for running and testing this setup on your local machine.

2. If everything is up and running, you should be able to see data in Hadoop. To check for data (a combined one-liner follows this list):

   1. `docker ps | grep hadoop` - you should see at least one container referencing hadoop (we can ignore `hadoop_seed` for now)
   2. `docker exec -it $CONTAINER_ID bash`
   3. `/usr/local/hadoop/bin/hadoop fs -ls /free2wheelers/stationMart/data`
   4. Tada! We have data! (If you don't, something went wrong; check "Considerations".)
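
The checks above can be collapsed into a non-interactive one-liner. This is just a convenience sketch: it assumes exactly one running hadoop container (besides `hadoop_seed`), and uses the same HDFS path documented above.

```bash
# Sketch: non-interactive version of the checks above.
CONTAINER_ID=$(docker ps --format '{{.ID}} {{.Names}}' \
  | grep hadoop | grep -v hadoop_seed | awk '{print $1}' | head -n 1)

# List the pipeline's output directory in HDFS from outside the container.
docker exec "$CONTAINER_ID" /usr/local/hadoop/bin/hadoop fs -ls /free2wheelers/stationMart/data
```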

### Considerations

- Your Docker machine may need at least 2 CPUs, 4 GiB of memory, and 512 MiB of swap; remember to "Apply & Restart" after changing these settings.
- While the script is running, `docker stats` gives some insight into resource usage.
- There's a script for stopping everything: `./sbin/stopAndRemoveLocal.sh`. Try stopping and restarting (a sketch of the cycle follows this list).
- If you're interested in execution logs: `docker logs $CONTAINER_ID`.
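
As an example of the stop/restart cycle mentioned above, here is a minimal sketch. The `name=hadoop` filter is only an illustration of picking one container to inspect; substitute whichever container you're debugging.

```bash
# Sketch: stop everything, rebuild and restart, then follow one container's logs.
./sbin/stopAndRemoveLocal.sh
./sbin/buildAndRunLocal.sh

# Pick a container to inspect (the "hadoop" filter is just an example).
CONTAINER_ID=$(docker ps -q --filter "name=hadoop" | head -n 1)
docker logs --follow "$CONTAINER_ID"
```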
