
Hadoop Cluster Configuration 2

Download and configure Hadoop

You can download Hadoop in two different ways:
          1. By using your web browser.
          2. By using the Terminal (command prompt).

I will choose the second option, i.e. the Terminal.

1. Go to http://hadoop.apache.org/releases.html
          Click on the ‘Download’ link
          Click on the ‘Download a release now!’ link
          Click on ‘http://download.nextag.com/apache/hadoop/common’
          Click on ‘Stable’
          Download hadoop-1.1.2-bin.tar.gz
OR
 You can copy the location of the file and download it through the terminal
          (Right click - Copy Link Location; the path will be like http://download.nextag.com/apache/hadoop/common/stable/hadoop-1.1.2-bin.tar.gz)

2. Open the master machine and open a Terminal using CTRL+ALT+T.
Type the command below:
          $ wget http://download.nextag.com/apache/hadoop/common/stable/hadoop-1.1.2-bin.tar.gz
          This command downloads the Hadoop archive; it takes some time.
It will be saved in the current directory (by default your home folder).
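
If you want to make sure the archive came down completely before extracting it, a small optional check (the md5sum is only useful if you also fetched the checksum file from the same mirror):
          $ ls -lh hadoop-1.1.2-bin.tar.gz
          $ md5sum hadoop-1.1.2-bin.tar.gz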

3. After downloading the Hadoop archive you can extract it in two ways:
one by using the TERMINAL, and the other by unzipping it with some archive software.
          I will extract it using the TERMINAL:
          $ tar xzf hadoop-1.1.2-bin.tar.gz    or    $ tar xzvf hadoop-1.1.2-bin.tar.gz
          It will extract the files into a hadoop-1.1.2 folder in the current (HOME) directory.
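
A quick way to confirm the extraction worked is to list the folder; you should see the bin and conf directories among the extracted files. Moving the folder somewhere else (for example /usr/local) is optional, and the rest of this tutorial assumes it stays in the home directory:
          $ ls hadoop-1.1.2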

4. Now go to hadoop-1.1.2/conf/
          You generally need to change three files.
 
          a. hadoop-env.sh
                   First you need to set the Java path (JAVA_HOME for the JVM); see the sketch after this list for how to find it.
                   eg: export JAVA_HOME=/usr/lib/jvm/java-6-sun

          b. core-site.xml
                   There are three modes in which you can run Hadoop:
1.      Standalone or Local Mode: you need not change anything - you just start working.
2.      Pseudo-Distributed Mode: NN (NameNode), SNN (Secondary NameNode), JT (JobTracker), TT (TaskTracker) and DN (DataNode) all run on the same machine.
3.      Fully Distributed or Cluster Mode: the NameNode runs on the master machine, the Secondary NameNode can run on some other machine, and the DataNodes and TaskTrackers run on the other (slave) machines.
          c. mapred-site.xml
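
If you are not sure what to put in JAVA_HOME, on Ubuntu you can usually find the installed JVM like this (the exact path depends on which JDK you installed, so treat java-6-sun above as an example only):
          $ readlink -f $(which java)
          JAVA_HOME is that path without the trailing /bin/java (and without /jre if it appears), e.g. /usr/lib/jvm/java-6-sun.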


How to configure Hadoop on a single system:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

5. Change the owner and permissions of the hadoop folder
          $ chown -R yash hadoop-1.1.2 -> the owner of this folder is changed to yash
          $ chmod -R 755 hadoop-1.1.2 -> the owner gets full access; everyone else gets read and execute
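
If chown complains about permissions you may need to run it with sudo. Afterwards you can verify the owner and mode of the folder with:
          $ ls -ld hadoop-1.1.2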

6. Now open core-site.xml and copy the code below inside the configuration tag
  <property>
          <name>hadoop.tmp.dir</name>
          <value>/home/yash/tempdir</value>
          <description>A base for other temporary directories.</description>
  </property>

  <property>
          <name>fs.default.name</name>
          <value>hdfs://yeshwanth1:9000</value>
          <description>The name of the default file system.  A URI whose
          scheme and authority determine the FileSystem implementation.  The
          uri's scheme determines the config property (fs.SCHEME.impl) naming
          the FileSystem implementation class.  The uri's authority is used to
          determine the host, port, etc. for a filesystem.</description>
  </property>

  **To run on the local machine, fs.default.name should point to localhost
   eg:
   <property>  
          <name>fs.default.name</name>
          <value>hdfs://localhost:9000</value>
   </property>
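
Since core-site.xml points hadoop.tmp.dir at /home/yash/tempdir, it is a good idea to create that folder up front so Hadoop does not fail when it first tries to use it (adjust the path and user to match your own setup):
          $ mkdir -p /home/yash/tempdir
          $ chown -R yash /home/yash/tempdir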

 7. Now open the mapred-site.xml file
  <property>
          <name>mapred.job.tracker</name>
          <value>yeshwanth1:9001</value>
          <description>The host and port that the MapReduce job tracker runs
          at.  If "local", then jobs are run in-process as a single map
          and reduce task.
          </description>
  </property>

  **To run on the local machine, mapred.job.tracker should point to localhost
  <property>
          <name>mapred.job.tracker</name>
          <value>localhost:9001</value>
  </property>

8. Open hdfs-site.xml and add the configuration properties below (the replication factor "dfs.replication" should not be more than the number of DataNodes)
  <!-- Set replication factor -->
  <property>
          <name>dfs.replication</name>
          <value>1</value>
          <description>Default block replication.
          The actual number of replications can be specified when the file is created.
          The default is used if replication is not specified in create time.
          </description>
  </property>

  <!-- Here the NameNode data will be stored -->
  <property>
          <name>dfs.name.dir</name>
          <value>/home/yash/namenodeanddatanode</value>
  </property>

  <!-- Data will be stored here. If you do not specify this, by default a folder will be created inside the /tmp directory to store the data -->
  <property>
          <name>dfs.data.dir</name>
          <value>/home/yash/namenodeanddatanode</value>
  </property>
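
As with the temp directory, you can create the dfs.name.dir / dfs.data.dir folder before starting Hadoop; the NameNode is picky about permissions, so 755 is a safe choice (again, change the path and user if yours differ):
          $ mkdir -p /home/yash/namenodeanddatanode
          $ chown -R yash /home/yash/namenodeanddatanode
          $ chmod 755 /home/yash/namenodeanddatanode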

9. Open the masters file
          Add the text "yeshwanth1" -> because my master is running on yeshwanth1

10. Open the slaves file -> add the names below
          yeshwanth1
          yeshwanth2
 I will keep the master machine as a slave as well, so that a DataNode also runs on the same machine
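
If you prefer to edit these two files from the terminal instead of a text editor, something like the following works (run from inside the hadoop-1.1.2 folder; the hostnames are my machines, use your own):
          $ echo "yeshwanth1" > conf/masters
          $ printf "yeshwanth1\nyeshwanth2\n" > conf/slaves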

11. Now open the TERMINAL
          You can copy the hadoop folder to the slave machine (i.e. master to slave)
          $ scp -r hadoop-1.1.2 yash@yeshwanth2:/home/yash
          Or
you can use any other method to copy this folder (copy-paste)
          After running the above command you will be able to see hadoop-1.1.2 on the slave
          machine called yeshwanth2

          In the masters file - located in hadoop-1.1.2/conf
          you should see the text below:
          yeshwanth1

          In the slaves file
          you should see the text below:
          yeshwanth1
          yeshwanth2
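
scp and start-all.sh are much smoother if the master can SSH to every slave (and to itself) without a password. If you have not set that up yet, a minimal sketch, assuming the user yash and the hosts yeshwanth1/yeshwanth2:
          $ ssh-keygen -t rsa -P ""
          $ ssh-copy-id yash@yeshwanth1
          $ ssh-copy-id yash@yeshwanth2
          $ ssh yash@yeshwanth2     (should log in without asking for a password)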

12. Now format the NameNode using the commands below
          $ cd hadoop-1.1.2/
          $ bin/hadoop namenode -format

13. Now start all the daemons using the command below
          $ bin/start-all.sh

 $ jps (lists the running Java processes)
          It should show the daemons below running:
          a. JobTracker
          b. NameNode
          c. SecondaryNameNode
          d. TaskTracker
          In my case the DataNode was not running on the master; see below for how to investigate this.
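
To see which DataNodes the NameNode can actually reach, and to dig into why one is missing, two things usually help: the dfsadmin report and the DataNode log. A common cause after re-running the format command is a namespaceID mismatch; in that case clearing the dfs.data.dir on the affected node (this deletes its HDFS blocks) and restarting usually fixes it.
          $ bin/hadoop dfsadmin -report
          $ less logs/hadoop-yash-datanode-yeshwanth1.log     (the log file name depends on your user and hostname)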

14. Now go to the slave machine, i.e. yeshwanth2
          Open a TERMINAL and run the command below
          $ jps
          You should see the daemons below running on the slave:
          a. DataNode
          b. TaskTracker

15. Now go back to yeshwanth1
          $ jps
          $ bin/start-all.sh (it starts the processes on the master as well as on the slaves automatically)
          $ jps
          Now you can see all the daemons running:
          NN, DN, TT, JT and SNN

** If you have three nodes, make the second or third node the SNN, so that if the master goes down the checkpointed metadata held by the SNN on the other machine can be used for recovery.

16. Now go to yeshwanth2
          $ jps
          You should see the daemons below running on this machine:
          a. DataNode
          b. TaskTracker
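
When you are done, you can stop all the daemons from the master the same way they were started; running jps afterwards should show nothing but Jps itself:
          $ bin/stop-all.sh
          $ jps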
