Apache Hadoop is a distributed computing framework: it stores data across multiple remote machines, runs computations on them in parallel, and gathers the results.
Hadoop components | Description |
---|---|
Hadoop Common | Utility module that supports other modules |
Hadoop Distributed File System (HDFS) | Distributed, fault-tolerant file system for storing data |
Hadoop Yet Another Resource Negotiator (YARN) | Cluster manager and job scheduler |
Hadoop MapReduce | Module for parallel processing |
$: sudo apt-get install ssh
$: sudo apt-get install rsync
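# Note: Hadoop also needs a working Java installation (the required version depends on the Hadoop release); a quick check:
$: java -version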
# Note: download Hadoop
$: tar -xzvf hadoop-x.y.z.tar.gz
# Note: edit /path/to/your/hadoop/etc/hadoop/hadoop-env.sh and add
# export JAVA_HOME=/path/to/your/java/jre_x.y.z
$: bin/hadoop # displays the usage documentation for the hadoop script
Hadoop can run in one of three modes:
- standalone (default, as a single Java process)
- pseudo-distributed
- fully-distributed
Execute a MapReduce job in standalone mode:
$: mkdir input
$: cp etc/hadoop/*.xml input
$: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.y.z.jar grep input output 'dfs[a-z.]+'
$: cat output/*
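The examples jar bundles several ready-made jobs (wordcount is another classic besides grep). Writing a custom job means providing a Mapper, a Reducer and a driver that configures the Job; the sketch below follows the standard WordCount tutorial pattern (class names are illustrative, and compiling and packaging it into a jar of your own is assumed):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

# Note: once packaged into a jar (here assumed to be called wc.jar), it is submitted like the bundled examples
$: bin/hadoop jar wc.jar WordCount input output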
To run Hadoop in pseudo-distributed mode:
# Note: add to etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
# Note: add to etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
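# Note: one way to double-check the values Hadoop actually picks up from these files
$: bin/hdfs getconf -confKey fs.defaultFS
$: bin/hdfs getconf -confKey dfs.replication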
$: ssh localhost # check that you can ssh to localhost without a passphrase
$: exit
# Note: if you cannot connect without a passphrase, generate and authorize a key
$: ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$: chmod 0600 ~/.ssh/authorized_keys
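# Note: verify that passphraseless ssh now works
$: ssh localhost
$: exit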
$: bin/hdfs namenode -format # format the file system
$: sbin/start-dfs.sh # start NameNode and DataNode daemons
# Note: visit the NameNode at http://localhost:50070/
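# Note: basic cluster status is also available on the command line
$: bin/hdfs dfsadmin -report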
# Note: make the HDFS directories required to execute MapReduce jobs
$: bin/hdfs dfs -mkdir /user
$: bin/hdfs dfs -mkdir /user/<username>
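# Note: optionally confirm the directories were created
$: bin/hdfs dfs -ls /user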
# Note: execute a MapReduce job
$: bin/hdfs dfs -put etc/hadoop input # copy the input files into the distributed file system
$: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.y.z.jar grep input output 'dfs[a-z.]+'
# Note: view the output
$: bin/hdfs dfs -get output output
$: cat output/*
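# Note: alternatively, view the output files directly on HDFS
$: bin/hdfs dfs -cat output/*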
$: sbin/stop-dfs.sh # stop the daemons
Set up YARN in pseudo-distributed mode:
# Note: add to etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
# Note: add to etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
$: sbin/start-dfs.sh # start NameNode and DataNode daemons
$: sbin/start-yarn.sh # start ResourceManager and NodeManager daemons
# Note: visit the ResourceManager at http://localhost:8088/
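# Note: optionally confirm that the NodeManager has registered with the ResourceManager
$: bin/yarn node -list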
# Note: execute a MapReduce job
$: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.y.z.jar grep input output 'dfs[a-z.]+'
# Note: view the output
$: bin/hdfs dfs -get output output
$: cat output/*
$: sbin/stop-yarn.sh # stop the YARN daemons
$: sbin/stop-dfs.sh # stop the HDFS daemons
The most important site configuration files are:
- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/yarn-site.xml
- etc/hadoop/mapred-site.xml
TODO: cover these configuration files in greater detail.
$: sbin/start-dfs.sh # start NameNode and DataNode daemons
$: sbin/start-yarn.sh # start ResourceManager and NodeManager daemons
$: bin/hadoop fs -ls / # list the HDFS root directory
# Note: visit the NameNode at http://localhost:50070/
# Note: visit the ResourceManager at http://localhost:8088/