
hadoop-playground

📚 Learning and exploring core Apache Hadoop and its surrounding ecosystem.

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

-- https://hadoop.apache.org

Pre-requisites

  • Docker
  • Java 11

Instructions

This project makes use of Docker. Hadoop is a framework for distributed computing, so to effectively learn and explore Hadoop in its natural form, it's necessary to simulate a distributed computing environment. Docker to the rescue: we can use Docker containers to virtualize computers and create a distributed environment on our own personal computer. There is an open-source project for running Hadoop in Docker: docker-hadoop. This project uses docker-hadoop via a Git submodule.


WARNING: I've experienced significant slowness with the cluster. The terminal will block for a while (10+ seconds) while a simple operation like -ls or -put completes. And the "WordCount" job takes around 2 minutes to execute! It might be a file system issue caused by the combination of an M1 Mac with Docker... I'm not sure. I've added the time command to most of the commands to illustrate how long they take.

Also, the instances themselves are flaky and it might be due to more than data corruption (see the later note about volumes). I think there are multiple race conditions and environmental problems. I've had each of "namenode", "datanode" and "historyserver" fail to become healthy. This is a frustrating experience!


  1. Initialize the docker-hadoop Git sub-module
    • git submodule update --init
  2. Destroy old volumes
    • I've found that subsequent attempts to start the Hadoop-in-Docker cluster will not work after the first. A few containers never come up as "healthy" and I don't know why; there are no obvious errors when you tail the logs. I suspect it is a data corruption problem, because when I blow away the data from the previous containers, the new containers come up healthy. Specifically, the way to delete the old data is to delete the old Docker volumes. I wish I understood Hadoop better so I could solve this without deleting data, but this works.
    • docker-compose --project-directory docker-hadoop down --volumes
  3. Start the "namenode" Hadoop Docker container
    • docker-compose --project-directory docker-hadoop up --detach namenode
    • Continually run docker container ls until the container "STATUS" shows "healthy"
  4. Perform the inaugural HDFS format operation
    • Why? From the Hadoop Cluster Setup docs:

      The first time you bring up HDFS, it must be formatted.

    • docker exec -it namenode bash -c 'hdfs namenode -format'
    • It will prompt with a "Y/N" question. Answer Y to proceed with the format.
    • Confirm that you see this message in the last 10 lines of output:

      Storage directory /hadoop/dfs/name has been successfully formatted

  5. Start the "datanode" container:
    • Why start it by itself? Starting it before the other containers helps avoid the startup race conditions mentioned in the warning above.
    • docker-compose --project-directory docker-hadoop up --detach datanode
    • Continually run docker container ls until the container "STATUS" shows "healthy"
  6. Start the rest of the Docker containers:
    • docker-compose --project-directory docker-hadoop -f docker-hadoop/docker-compose.yml -f docker-compose.override.yml up --detach resourcemanager nodemanager1 historyserver
    • Continually run docker container ls until all containers show "healthy". It will take over one minute.
  7. Set up some test data that will later be consumed by a MapReduce job:
    • This is taken from the official Hadoop WordCount example
    • docker cp word-count-map-reduce-job/data/input/file01.txt namenode:/
      docker cp word-count-map-reduce-job/data/input/file02.txt namenode:/
      
      time docker exec namenode bash -c 'hadoop fs -mkdir /input'
      time docker exec namenode bash -c 'hadoop fs -put file01.txt /input'
      time docker exec namenode bash -c 'hadoop fs -put file02.txt /input'
      
      docker exec namenode bash -c 'rm file01.txt'
      docker exec namenode bash -c 'rm file02.txt'
      
  8. Build a "WordCount" MapReduce job:
    • ./gradlew word-count-map-reduce-job:jar
    • Note: A MapReduce "job" can take many forms. For this project, it takes the form of a Java .jar file. The other forms are out of scope for this playground repo (for example, a MapReduce job can also be written in Python). See the sketch after these instructions for a rough idea of what the job code looks like.
  9. Copy the jar to the Hadoop cluster and submit the job for execution:
    • docker cp word-count-map-reduce-job/build/libs/word-count-map-reduce-job.jar namenode:/
      time docker exec -it namenode bash -c 'hadoop jar word-count-map-reduce-job.jar dgroomes.WordCount /input /output'
      
    • Wait patiently for it to complete
  10. Verify the output results:
    • time docker exec -it namenode bash -c 'hadoop fs -cat /output/part-r-00000'
    • It should look something like this:
      Bye        1
      Goodbye    1
      Hadoop     2
      Hello      2
      World      2
      
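For reference, here is a sketch of what the "WordCount" job looks like. It's adapted from the official Hadoop WordCount example, with the package and class name (dgroomes.WordCount) taken from the hadoop jar command in step 9. The real source lives in word-count-map-reduce-job/ and is authoritative; this sketch may differ from it in the details.

    package dgroomes;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // The mapper emits a (word, 1) pair for every whitespace-delimited token in the input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // The reducer (also used as a combiner) sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper emits a (word, 1) pair for each token, and the combiner/reducer sums the counts per word, which is exactly the shape of the /output/part-r-00000 file shown in step 10.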

Notes

  • Jump into a Bash shell session in one of the Hadoop Docker containers and explore. Use an alias, too! A few commands worth trying once inside are sketched below.
    • alias doBash="docker exec -it resourcemanager bash"
      doBash
      
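  • Once inside, the standard Hadoop and YARN CLI commands are a nice way to poke around. A small sketch (whether each command works from the resourcemanager container depends on how the cluster is configured):
    • hadoop fs -ls /          # list the root of HDFS
      yarn node -list          # list the registered NodeManagers
      yarn application -list   # list YARN applications
      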

Wish List

General clean-ups, changes and things I wish to implement for this project:

  • DONE Expose the ResourceManager port to the host (port 8088)
    • The approach: layer in an override file (docker-compose.override.yml) on top of docker-hadoop's docker-compose.yml, which avoids making source file changes to docker-hadoop. This is what the docker-compose command in step 6 of the instructions does. See the sketch after this list for roughly what such an override looks like.
  • Reduce the memory limits. I think the containers are hogging memory relative to what I've given to Docker (well, 8GB is kind of a lot...), so maybe it's paging to disk and being super slow. I would really like to make this thing faster and more reliable. Tuning memory is my best idea right now, short of trying a non-M1 computer with stable Docker.
  • Sprinkle in YARN somehow
    • Or is YARN just implicitly there anyway if I run a MapReduce job?
  • Sprinkle in Oozie for scheduling
  • Figure out secrets management.
    • I.e. figure out passwords files, file permissions and the "CredentialProvider" thing.
  • Add a Spark example job
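
Related to the ResourceManager port item above: a minimal sketch of what an override file like docker-compose.override.yml might look like. This is an illustrative guess, not the actual file in this repo, and it assumes the service name (resourcemanager) and compose file version ("3") match docker-hadoop's docker-compose.yml.

    version: "3"  # should match the version declared in docker-hadoop/docker-compose.yml

    services:
      resourcemanager:
        ports:
          - "8088:8088"  # publish the ResourceManager web UI to the host

Because docker-compose merges the files passed with -f, the override only needs to declare the bits it adds on top of the base file.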
