Thursday, August 14, 2014

Hadoop Installation

The steps below were tested on Ubuntu Linux 10 as the user 'yarn'; any other user works just as well.
  • Download the Hadoop 2.3.0 tarball into /home/yarn/hadoop and extract it:
           tar -xvzf hadoop-2.3.0.tar.gz
  • This will create the directory "/home/yarn/hadoop/hadoop-2.3.0"
  • Set the environment variables for Java and Hadoop, and add Hadoop to the PATH. For bash, the entries in ~/.bashrc look like:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
export HADOOP_HOME=/home/yarn/hadoop/hadoop-2.3.0
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=/home/yarn/hadoop/hadoop-2.3.0
export HADOOP_HDFS_HOME=/home/yarn/hadoop/hadoop-2.3.0
export HADOOP_CONF_DIR=/home/yarn/hadoop/hadoop-2.3.0/etc/hadoop
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export PATH=$PATH:$HADOOP_HOME/bin
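
To make the variables take effect in the current shell (new terminals pick them up from ~/.bashrc automatically), re-export and sanity-check them. The paths below mirror the exports above and assume the 'yarn' user, for whom $HOME expands to /home/yarn:

```shell
# Re-create the key exports in the current shell and verify they resolve.
# $HOME/hadoop/hadoop-2.3.0 matches /home/yarn/hadoop/hadoop-2.3.0 above.
export HADOOP_HOME=$HOME/hadoop/hadoop-2.3.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
echo "$HADOOP_CONF_DIR"
```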
  • Check that Hadoop is installed properly by running the 'hadoop version' command; the output should look like:
Hadoop 2.3.0
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1567123
Compiled by jenkins on 2014-02-11T13:40Z
Compiled with protoc 2.5.0
From source with checksum dfe46336fbc6a044bc124392ec06b85
This command was run using /home/yarn/hadoop/hadoop-2.3.0/share/hadoop/common/hadoop-common-2.3.0.jar
  • Hadoop is configured through XML files in the etc/hadoop directory of the installation. The important ones to configure for a single-node cluster are:
mapred-site.xml

Go to $HADOOP_HOME/etc/hadoop and create mapred-site.xml from the shipped template:
cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml.

Between the <configuration> tags, put the following properties (create the local and temp directories at the given locations first):

  <property>
    <name>mapreduce.cluster.temp.dir</name>
    <value>file:/home/yarn/hadoop/temp</value>
    <description>No description</description>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>file:/home/yarn/hadoop/local</value>
    <description>No description</description>
    <final>true</final>
  </property>
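
The two directories referenced above must exist before jobs run. A quick way to create them; BASE mirrors /home/yarn/hadoop for the 'yarn' user, so point it elsewhere if you installed under a different user:

```shell
# Create the directories referenced by mapreduce.cluster.temp.dir and
# mapreduce.cluster.local.dir. BASE assumes the 'yarn' user's home layout.
BASE=${BASE:-$HOME/hadoop}
mkdir -p "$BASE/temp" "$BASE/local"
ls -d "$BASE/temp" "$BASE/local"
```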
yarn-site.xml
<!-- Site specific YARN configuration properties -->
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>localhost:8000</value>
    <description>host is the hostname of the resource manager and 
    port is the port on which the NodeManagers contact the Resource Manager.
    </description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>localhost:8001</value>
    <description>host is the hostname of the resourcemanager and port is the port
    on which the Applications in the cluster talk to the Resource Manager.
    </description>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    <description>In case you do not want to use the default scheduler</description>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8002</value>
    <description>the host is the hostname of the ResourceManager and the port is the port on
    which the clients can talk to the Resource Manager. </description>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:/home/yarn/hadoop/nodemanager</value>
    <description>the local directories used by the nodemanager</description>
  </property>
  <property>
    <name>yarn.nodemanager.address</name>
    <value>localhost:8003</value>
    <description>the nodemanagers bind to this port</description>
  </property>  
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>10240</value>
    <description>the amount of memory on the NodeManager, in MB (10240 MB = 10 GB)</description>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>file:/home/yarn/hadoop/app-logs</value>
    <description>directory on hdfs where the application logs are moved to </description>
  </property>
   <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:/home/yarn/hadoop/app-logs</value>
    <description>the directories used by Nodemanagers as log directories</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>shuffle service that needs to be set for Map Reduce to run </description>
  </property>
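
The NodeManager local and log directories configured above must also exist. A sketch, again assuming the 'yarn' user's layout under /home/yarn/hadoop:

```shell
# Create the directories referenced by yarn.nodemanager.local-dirs and
# yarn.nodemanager.log-dirs. Adjust BASE for a different user.
BASE=${BASE:-$HOME/hadoop}
mkdir -p "$BASE/nodemanager" "$BASE/app-logs"
ls -d "$BASE/nodemanager" "$BASE/app-logs"
```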
capacity-scheduler.xml: Make the following changes
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>unfunded,default</value>
  </property>
  
  <property>
    <name>yarn.scheduler.capacity.root.capacity</name>
    <value>100</value>
  </property>
  
  <property>
    <name>yarn.scheduler.capacity.root.unfunded.capacity</name>
    <value>50</value>
  </property>
  
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
  </property>
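
The Capacity Scheduler requires the capacities of a queue's child queues to add up to exactly 100; a trivial arithmetic check of the values above:

```shell
# Capacities of root's child queues from capacity-scheduler.xml above.
unfunded=50
default=50
echo $((unfunded + default))   # must total 100, or the scheduler rejects the configuration
```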
  • Start the ResourceManager and NodeManager:
cd /home/yarn/hadoop/hadoop-2.3.0/sbin

./yarn-daemon.sh start resourcemanager

./yarn-daemon.sh start nodemanager

  • Run an example job. Go to the Hadoop installation directory and run:
             bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar randomwriter out
