
hadoop-playground

📚 Learning and exploring core Apache Hadoop and its surrounding ecosystem.

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

-- https://hadoop.apache.org

Pre-requisites

  • Docker
  • Java 11

Instructions

This project makes use of Docker. Hadoop is a framework for distributed computing, so to effectively learn and explore Hadoop in its natural form, it's necessary to simulate a distributed computing environment. Docker to the rescue: we can use Docker containers to virtualize computers and create a distributed environment on our own personal computer. There is an open-source project for running Hadoop in Docker: docker-hadoop. This project uses docker-hadoop via a Git submodule.


WARNING: I've experienced significant slowness with the cluster. The terminal will block for a while (10+ seconds) while a simple operation like -ls or -put completes. And the "WordCount" job takes around 2 minutes to execute! It might be a file system issue caused by the combination of an M1 Mac with Docker... I'm not sure. I've added the time command to most of the commands to illustrate how long they take.

Also, the instances themselves are flaky and it might be due to more than data corruption (see the later note about volumes). I think there are multiple race conditions and environmental problems. I've had each of "namenode", "datanode" and "historyserver" fail to become healthy. This is a frustrating experience!


  1. Initialize the docker-hadoop Git sub-module
    • git submodule update --init
  2. Destroy old volumes
    • I've found that subsequent attempts to start the Hadoop-in-Docker cluster will not work after the first. A few containers never come up as "healthy" and I don't know why; there are no obvious errors when you tail the logs. I suspect it is a data corruption problem, because when I blow away the data from the previous containers, the new containers come up healthy. Specifically, the way to delete the old data is to delete the old Docker volumes. I wish I understood Hadoop better so I could solve this without deleting data, but this works.
    • docker-compose --project-directory docker-hadoop down --volumes
  3. Start the "namenode" Hadoop Docker container
    • docker-compose --project-directory docker-hadoop up --detach namenode
    • Continually run docker container ls until the container "STATUS" shows "healthy"
  4. Perform the inaugural HDFS format operation
    • Why? From the Hadoop Cluster Setup docs:

      The first time you bring up HDFS, it must be formatted.

    • docker exec -it namenode bash -c 'hdfs namenode -format'
    • It will prompt with a "Y/N" question. Answer Y to proceed with the format.
    • Confirm that you see this message in the last 10 lines of output:

      Storage directory /hadoop/dfs/name has been successfully formatted

  5. Start the "datanode" container:
    • Why start it by itself? Starting it before the other containers helps avoid the startup race conditions mentioned in the warning above.
    • docker-compose --project-directory docker-hadoop up --detach datanode
    • Continually run docker container ls until the container "STATUS" shows "healthy"
  6. Start the rest of the Docker containers:
    • docker-compose --project-directory docker-hadoop -f docker-hadoop/docker-compose.yml -f docker-compose.override.yml up --detach resourcemanager nodemanager1 historyserver
    • Continually run docker container ls until all containers show "healthy". It will take over one minute.
  7. Set up some test data that will later be consumed by a MapReduce job:
    • This is taken from the official Hadoop WordCount example
    • docker cp word-count-map-reduce-job/data/input/file01.txt namenode:/
      docker cp word-count-map-reduce-job/data/input/file02.txt namenode:/
      
      time docker exec namenode bash -c 'hadoop fs -mkdir /input'
      time docker exec namenode bash -c 'hadoop fs -put file01.txt /input'
      time docker exec namenode bash -c 'hadoop fs -put file02.txt /input'
      
      docker exec namenode bash -c 'rm file01.txt'
      docker exec namenode bash -c 'rm file02.txt'
      
  8. Build a "WordCount" MapReduce job:
    • ./gradlew word-count-map-reduce-job:jar
    • Note: A MapReduce "job" can take many forms. For this project, it takes the form of a Java .jar file. The other forms are out of scope for this playground repo (for example, a MapReduce job can also be written in Python). See the sketch after these instructions for a rough idea of what the job code looks like.
  9. Copy the jar to the Hadoop cluster and submit the job for execution:
    • docker cp word-count-map-reduce-job/build/libs/word-count-map-reduce-job.jar namenode:/
      time docker exec -it namenode bash -c 'hadoop jar word-count-map-reduce-job.jar dgroomes.WordCount /input /output'
      
    • Wait patiently for it to complete
  10. Verify the output results:
    • time docker exec -it namenode bash -c 'hadoop fs -cat /output/part-r-00000'
    • It should look something like this:
      Bye        1
      Goodbye    1
      Hadoop     2
      Hello      2
      World      2
      
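For reference, here is a sketch of what the "WordCount" job looks like. It's adapted from the official Hadoop WordCount example, with the package and class name (dgroomes.WordCount) taken from the hadoop jar command in step 9. The real source lives in word-count-map-reduce-job/ and is authoritative; this sketch may differ from it in the details.

    package dgroomes;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // The mapper emits a (word, 1) pair for every whitespace-delimited token in the input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // The reducer (also used as a combiner) sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper emits a (word, 1) pair for each token, and the combiner/reducer sums the counts per word, which is exactly the shape of the /output/part-r-00000 file shown in step 10.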

Notes

  • Jump into a Bash shell session in one of the Hadoop Docker containers and explore. Use an alias, too! A few commands worth trying once inside are sketched below.
    • alias doBash="docker exec -it resourcemanager bash"
      doBash
      
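  • Once inside, the standard Hadoop and YARN CLI commands are a nice way to poke around. A small sketch (whether each command works from the resourcemanager container depends on how the cluster is configured):
    • hadoop fs -ls /          # list the root of HDFS
      yarn node -list          # list the registered NodeManagers
      yarn application -list   # list YARN applications
      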

Wish List

General clean-ups, changes and things I wish to implement for this project:

  • DONE Expose the ResourceManager port to the host (port 8088)
    • The approach: layer in an override file (docker-compose.override.yml) on top of docker-hadoop's docker-compose.yml, which avoids making source file changes to docker-hadoop. This is what the docker-compose command in step 6 of the instructions does. See the sketch after this list for roughly what such an override looks like.
  • Reduce the memory limits. I think the containers are hogging memory relative to what I've given to Docker (well, 8GB is kind of a lot...), so maybe it's paging to disk and being super slow. I would really like to make this thing faster and more reliable. Tuning memory is my best idea right now, short of trying a non-M1 computer with stable Docker.
  • Sprinkle in YARN somehow
    • Or is YARN just implicitly there anyway if I run a MapReduce job?
  • Sprinkle in Oozie for scheduling
  • Figure out secrets management.
    • I.e. figure out passwords files, file permissions and the "CredentialProvider" thing.
  • Add a Spark example job
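
Related to the ResourceManager port item above: a minimal sketch of what an override file like docker-compose.override.yml might look like. This is an illustrative guess, not the actual file in this repo, and it assumes the service name (resourcemanager) and compose file version ("3") match docker-hadoop's docker-compose.yml.

    version: "3"  # should match the version declared in docker-hadoop/docker-compose.yml

    services:
      resourcemanager:
        ports:
          - "8088:8088"  # publish the ResourceManager web UI to the host

Because docker-compose merges the files passed with -f, the override only needs to declare the bits it adds on top of the base file.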
