Thursday, August 14, 2014

Hadoop Installation

The steps below were tested on Ubuntu Linux 10 as the user 'yarn'; any other user works just as well.
  • Download the Hadoop 2.3.0 tarball into /home/yarn/hadoop and extract it:
           tar -xvzf hadoop-2.3.0.tar.gz
  • This will create the directory "/home/yarn/hadoop/hadoop-2.3.0"
  • Set the environment variables for Java and Hadoop, and add Hadoop to the PATH. For bash, the entries in ~/.bashrc look like:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
export HADOOP_HOME=/home/yarn/hadoop/hadoop-2.3.0
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=/home/yarn/hadoop/hadoop-2.3.0
export HADOOP_HDFS_HOME=/home/yarn/hadoop/hadoop-2.3.0
export HADOOP_CONF_DIR=/home/yarn/hadoop/hadoop-2.3.0/etc/hadoop
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export PATH=$PATH:$HADOOP_HOME/bin
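
To make the variables take effect in the current shell (new terminals pick them up from ~/.bashrc automatically), re-export and sanity-check them. The paths below mirror the exports above and assume the 'yarn' user, for whom $HOME expands to /home/yarn:

```shell
# Re-create the key exports in the current shell and verify they resolve.
# $HOME/hadoop/hadoop-2.3.0 matches /home/yarn/hadoop/hadoop-2.3.0 above.
export HADOOP_HOME=$HOME/hadoop/hadoop-2.3.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
echo "$HADOOP_CONF_DIR"
```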
  • Check that Hadoop is installed properly by running the 'hadoop version' command; the output should look like:
Hadoop 2.3.0
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1567123
Compiled by jenkins on 2014-02-11T13:40Z
Compiled with protoc 2.5.0
From source with checksum dfe46336fbc6a044bc124392ec06b85
This command was run using /home/yarn/hadoop/hadoop-2.3.0/share/hadoop/common/hadoop-common-2.3.0.jar
  • Hadoop is configured through XML files in the etc/hadoop directory of the installation. The important ones to configure for a single-node cluster are:
mapred-site.xml

Go to $HADOOP_HOME/etc/hadoop and create mapred-site.xml from the shipped template:
cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml.

Between the <configuration> tags, put the following properties (create the local and temp directories at the given locations first):

  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>file:/home/yarn/hadoop/temp</value>
    <description>No description</description>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>file:/home/yarn/hadoop/local</value>
    <description>No description</description>
    <final>true</final>
  </property>
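
The two directories referenced above must exist before jobs run. A quick way to create them; BASE mirrors /home/yarn/hadoop for the 'yarn' user, so point it elsewhere if you installed under a different user:

```shell
# Create the directories referenced by mapreduce.cluster.temp.dir and
# mapreduce.cluster.local.dir. BASE assumes the 'yarn' user's home layout.
BASE=${BASE:-$HOME/hadoop}
mkdir -p "$BASE/temp" "$BASE/local"
ls -d "$BASE/temp" "$BASE/local"
```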
yarn-site.xml
<!-- Site specific YARN configuration properties -->
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>localhost:8000</value>
    <description>host is the hostname of the resource manager and 
    port is the port on which the NodeManagers contact the Resource Manager.
    </description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>localhost:8001</value>
    <description>host is the hostname of the resourcemanager and port is the port
    on which the Applications in the cluster talk to the Resource Manager.
    </description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    <description>In case you do not want to use the default scheduler</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8002</value>
    <description>the host is the hostname of the ResourceManager and the port is the port on
    which the clients can talk to the Resource Manager. </description>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:/home/yarn/hadoop/nodemanager</value>
    <description>the local directories used by the nodemanager</description>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>localhost:8003</value>
    <description>the nodemanagers bind to this port</description>
  </property>  
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>the amount of memory on the NodeManager, in MB (10240 MB = 10 GB)</description>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>file:/home/yarn/hadoop/app-logs</value>
    <description>directory on hdfs where the application logs are moved to </description>
  </property>
   <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:/home/yarn/hadoop/app-logs</value>
    <description>the directories used by Nodemanagers as log directories</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>shuffle service that needs to be set for Map Reduce to run </description>
  </property>
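
The NodeManager local and log directories configured above must also exist. A sketch, again assuming the 'yarn' user's layout under /home/yarn/hadoop:

```shell
# Create the directories referenced by yarn.nodemanager.local-dirs and
# yarn.nodemanager.log-dirs. Adjust BASE for a different user.
BASE=${BASE:-$HOME/hadoop}
mkdir -p "$BASE/nodemanager" "$BASE/app-logs"
ls -d "$BASE/nodemanager" "$BASE/app-logs"
```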
capacity-scheduler.xml: Make the following changes
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>unfunded,default</value>
  </property>
  
  <property>
    <name>yarn.scheduler.capacity.root.capacity</name>
    <value>100</value>
  </property>
  
  <property>
    <name>yarn.scheduler.capacity.root.unfunded.capacity</name>
    <value>50</value>
  </property>
  
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
  </property>
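
The Capacity Scheduler requires the capacities of a queue's child queues to add up to exactly 100; a trivial arithmetic check of the values above:

```shell
# Capacities of root's child queues from capacity-scheduler.xml above.
unfunded=50
default=50
echo $((unfunded + default))   # must total 100, or the scheduler rejects the configuration
```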
  • Start the ResourceManager and NodeManager:
cd /home/yarn/hadoop/hadoop-2.3.0/sbin

./yarn-daemon.sh start resourcemanager

./yarn-daemon.sh start nodemanager

  • Run an example job. Go to the Hadoop installation directory and run:
             bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar randomwriter out
