Thursday, October 30, 2014

Hadoop Pseudo Distributed Mode Cluster

In this blog I have recorded detailed steps, with supporting screenshots, to install and set up a Hadoop cluster in Pseudo Distributed Mode using your Windows 64-bit PC or laptop

This is a 3 step process
Step 1 – Install VM Player
Step 2 – Setup Lubuntu Virtual Machine
Step 3 – Install Hadoop


Step 1 – Install VM Player

1. Do a Google search for VMware Player and download VMware Player. You can also download it from the website
http://vmware-player.joydownload.com


Start the installation process by double-clicking the downloaded exe file



2. Continue clicking on Next button to complete the installation

3. Click on the VM Player desktop icon to open VM Player tool




STEP 2 – SETUP LUBUNTU VIRTUAL MACHINE

1. Download Lubuntu 12.04 image (lubuntu-12.04-alternate-i386.iso) file from http://cdimage.ubuntu.com/lubuntu/releases/12.04/release/




2. In VM Player click on “Create New Virtual Machine”

3. The New Virtual Machine Wizard will pop up

4. Select the radio button “Installer disc image file (iso):”, browse to and select the file “lubuntu-12.04-alternate-i386.iso”, and click NEXT.



5. On the Next window make no changes and just click NEXT



6. On the next window change the virtual machine name to “HadoopPseudoMode” (or a name of your choice) and click NEXT



7. On the Next window make no changes and just click NEXT



8. On the Next window make no changes and just click FINISH



9. Your Lubuntu VM is ready



10. Next we have to install the OS in this VM

11. Click on Play Virtual Machine link

12. You will get a popup asking to install updates. Do not do it; just click “Remind me later”

13. The mouse will not work inside the virtual machine until the OS is completely installed, so you have to use the keyboard only. At any point, to come out of the VM and access your Windows machine, use the hotkey CTRL + ALT

14. On the Language screen select “English” and press ENTER button.




15. On the next screen just press ENTER button


16. On the Language screen select “English” and press ENTER button.


17. On the next screen select “India” as location and press ENTER button.


18. On the next “Configure the Keyboard” screen, select “Yes” and press ENTER button 

19. On the next “Configure the Keyboard” screen, press the “Y” key on your keyboard 


20. On the next “Configure the Keyboard” screen, press the “W” key on your keyboard 


21. On the next set of screens, select the “NO” option and keep pressing ENTER until you get the “Configure Keyboard” confirmation screen


22. Select “Continue” and the OS components will start loading


23. On the next screen, just select “Continue” and press ENTER button


24. On the next screen, enter the username “adminuser”, select “Continue” and press ENTER. The same screen will pop up again to re-enter the username. Enter “adminuser”, select “Continue” and press ENTER


25. On the next screen, enter the password “adminuser”, select “Continue” and press ENTER. The same screen will pop up again to re-enter the password. Enter “adminuser”, select “Continue” and press ENTER


26. On the next screen, select “NO” and press ENTER button



27. On the next screen, select “YES” and press ENTER button



28. On the next screen, press ENTER button



29. On the next screen, press ENTER button



30. On the next screen, press ENTER button



31. On the next screen, press ENTER button



32. On the next screen, select “YES” and press ENTER button



33. On the next screen, select “Continue” and press ENTER



34. The installation will continue for almost 15 minutes. Wait for it to finish



35. On the next screen, select “YES” and press ENTER



36. On the next screen, select “YES” and press ENTER



37. On the next screen, select “Continue” and press ENTER



38. This will complete the OS installation and the login screen will show up



39. Log in to the adminuser account and your Lubuntu machine is up and ready

STEP 3 – INSTALL HADOOP


1. We are doing the pseudo mode setup, in which one system hosts the namenode, secondary namenode, datanode, jobtracker and tasktracker. However, each of them runs in a separate JVM (Java Virtual Machine)


2. Log in to the Lubuntu machine with the adminuser username and password


3. From the Menu Bar go to Accessories >> LXTerminal





4. Open LXTerminal to type commands





5. Let’s add a new group called "hadoop" using the following command

sudo addgroup hadoop


6. You will be prompted for a password. Give the password of the user you are logged in as, i.e. adminuser. (sudo is used when we want to run a command as the superuser, i.e. the admin of the system.)

7. Let’s add a new user called hduser in the group hadoop
sudo adduser --ingroup hadoop hduser

8. You will be asked to enter a password for the new user hduser we are creating. Enter the password as hduser (or one of your choice, but do ensure you remember it)

9. You will be asked to enter a name and work details for the new user hduser. Leave them blank by just pressing ENTER. A confirmation will pop up for Y (yes) or N (no). Type Y and press ENTER. The new user hduser is created.

10. Let’s give admin rights to hduser
sudo adduser hduser sudo
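The user and group created above can be verified with read-only commands. Since hduser exists only inside the VM, the sketch below runs the same commands against the root account (which exists on any Linux system) purely to show their shape; on the VM, substitute hduser for root.

```shell
# Illustration only: "root" stands in for "hduser" here so the commands
# can be run anywhere; on the VM use: id hduser  and  getent group hadoop
id root | cut -d' ' -f1            # the uid field, e.g. uid=0(root)
getent group root | cut -d: -f1    # the group name from /etc/group
```

On the VM, `id hduser` should list both hadoop and sudo among the groups.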

11. Logout using the logout option from the menubar of your VM and re-login with the hduser account

12. Open the LXTerminal to type commands

13. Let’s install Java 6, as Hadoop is written in Java
sudo apt-get install openjdk-6-jdk

14. You will be prompted for a password. Enter the password hduser, and when a confirmation is prompted, type Y and press ENTER

15. Java 6 will be downloaded and installed. This will take around 5 minutes or more depending on your internet download speed

16. Next we need to install an SSH server. SSH stands for Secure Shell; it is used to log in remotely from one Linux machine to another and gives access to the shell of the remote machine. 
sudo apt-get install openssh-server

17. You might be prompted for a password. Enter the password hduser, and when a confirmation is prompted, type Y and press ENTER

18. SSH will be downloaded and installed. This will take a few minutes

19. We can log in to a remote machine using the following command
ssh <ip-address>

20. If you try ssh localhost, you will be prompted for a password. We want to make this login password-less. One way of doing that is to use keys, which we can generate using the following command.
ssh-keygen -t rsa -P ""



21. It will prompt you for a path to store the keys; don’t type anything, just press ENTER.

22. This command will generate two keys at the "/home/hduser/.ssh/" path: id_rsa and id_rsa.pub.
a) id_rsa is the private key.
b) id_rsa.pub is the public key.
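As a side note, key generation can be tried safely outside ~/.ssh. The sketch below is an illustration only, not part of the setup: it generates a throwaway key pair in a temporary directory so it cannot overwrite your real keys, whereas the actual setup uses the default path as in step 20.

```shell
# Demo only: generate an RSA key pair with an empty passphrase (-P "")
# into a temp dir so the real ~/.ssh keys are never touched.
tmpdir=$(mktemp -d)
ssh-keygen -q -t rsa -P "" -f "$tmpdir/id_rsa"
ls "$tmpdir"                      # id_rsa  id_rsa.pub
head -c 7 "$tmpdir/id_rsa.pub"    # public keys start with "ssh-rsa"
```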

23. To log in to a remote machine, I share my public key with that machine. In our case it is the local machine, so the following command is used.
ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@localhost



24. This will prompt for confirmation. Type yes and press ENTER, and then enter the password for hduser.

25. A “localhost added to the list of known hosts” confirmation message will show on screen



26. Now enter the below command and you will not be prompted for any password, which confirms the keys are shared properly
ssh localhost

27. The next step is to install Hadoop 1.0.3. You can download the file hadoop-1.0.3.tar.gz on your Windows system from the path listed below
    https://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz

28. To move the hadoop-1.0.3.tar.gz file from Windows to the Lubuntu system we can use the WinSCP or FileZilla tool. We will download the WinSCP installer from the path listed below and install it on the Windows system
    http://winscp.net/download/winscp556setup.exe



29. Continue clicking next to finish the installation



30. The host name will be the IP address of the Lubuntu machine. To find it, enter the following command in your Lubuntu machine
ifconfig
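ifconfig prints one block per network interface, and the IPv4 address appears on the "inet addr:" line. The sketch below extracts the address from a sample line (the address shown is made up for illustration); on the VM you would pipe the real ifconfig output instead.

```shell
# Hypothetical ifconfig line; on the VM, pipe the real output of ifconfig.
line="          inet addr:192.168.11.128  Bcast:192.168.11.255  Mask:255.255.255.0"
echo "$line" | sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p'   # prints 192.168.11.128
```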

31. The IP address of my Lubuntu machine is 192.168.xx.xxx



32. Enter the IP address of your Lubuntu machine in WinSCP and click on Login.



33. You will be prompted for a username and password. Enter both as hduser

34. You will be connected, with your Lubuntu system in the right pane and your Windows system in the left. You can drag and drop to move files between the systems

35. Move the hadoop-1.0.3.tar.gz file from Windows to the Lubuntu system under the path /home/hduser/downloads




36. In the Lubuntu VM, go to the folder path /home/hduser/downloads. Right-click on the hadoop-1.0.3.tar.gz file and click on Extract



37. Rename the extracted folder from hadoop-1.0.3 to hadoop

38. Cut and move the hadoop folder to the path /home/hduser



39. Now we need to make changes in the Hadoop configuration files. You will find these files in the "/home/hduser/hadoop/conf" folder.

40. There are 4 important files in this folder
hadoop-env.sh
hdfs-site.xml
mapred-site.xml
core-site.xml

41. hadoop-env.sh is a file which contains Hadoop environment-related properties. Here we can set things like where Java home is, the heap memory size, the Hadoop classpath, which IP version to use, etc. We will set Java home in this file. For me Java home is "/usr/lib/jvm/java-6-openjdk-i386". Open the file using Leafpad



42. Search for # export JAVA_HOME and replace the entire line in the file with the line below, then save.

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
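A quick sanity check, assuming the path above, is to confirm that bin/java actually exists under the JAVA_HOME you set; if your JVM landed somewhere else, adjust the line in hadoop-env.sh accordingly.

```shell
# Check that the JAVA_HOME from step 42 really contains bin/java.
# The path below is the tutorial's; yours may differ.
JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
if [ -x "$JAVA_HOME/bin/java" ]; then
  status="JAVA_HOME OK"
else
  status="no bin/java under $JAVA_HOME - fix the path in hadoop-env.sh"
fi
echo "$status"
```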



43. hdfs-site.xml is the file which contains properties related to HDFS (the Hadoop Distributed File System). We need to set the replication factor here. By default the replication factor is 3, but since we are installing Hadoop on a single machine, we will set it to 1. Copy the following in between the configuration tags in the file.


<property>
 <name>dfs.replication</name>
 <value>1</value>
 <description>Default block replication.
 The actual number of replications can be specified when the file is created.
 The default is used if replication is not specified in create time.
 </description>
</property>
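For reference, after pasting the snippet the whole hdfs-site.xml should look roughly like this (the two header lines and the empty configuration element are already present in the stock file shipped in the tarball):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
</configuration>
```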



44. mapred-site.xml is a file that contains properties related to MapReduce. Here we will set the address and port of the machine on which the job tracker runs. Copy the following in between the configuration tags


<property>
 <name>mapred.job.tracker</name>
 <value>localhost:54311</value>
 <description>The host and port that the MapReduce job tracker runs
 at.  If "local", then jobs are run in-process as a single map
 and reduce task.
 </description>
</property>



45. core-site.xml is the property file which contains properties that are common to, or used by, both MapReduce and HDFS. Here we will set the address and port of the machine on which the namenode will run. The other property tells Hadoop where to store files such as the fsimage, blocks, etc. Copy the following in between the configuration tags.

<property>
 <name>hadoop.tmp.dir</name>
 <value>/home/hduser/hadoop_tmp_files</value>
 <description>A base for other temporary directories.</description>
</property>

<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:54310</value>
 <description>The name of the default file system.  A URI whose
 scheme and authority determine the FileSystem implementation.  The
 uri's scheme determines the config property (fs.SCHEME.impl) naming
 the FileSystem implementation class.  The uri's authority is used to
 determine the host, port, etc. for a filesystem.</description>
</property>
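Hadoop creates hadoop.tmp.dir on first use, but the location must be creatable and writable by hduser. A minimal pre-check, using a scratch path here so the sketch is self-contained (on the VM use /home/hduser/hadoop_tmp_files):

```shell
# Scratch path for the demo; on the VM: dir=/home/hduser/hadoop_tmp_files
dir=$(mktemp -d)/hadoop_tmp_files
mkdir -p "$dir"
[ -w "$dir" ] && echo "hadoop.tmp.dir location is writable"
```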



46. Open a terminal and format the namenode with the following command. The namenode should be formatted only once, before you start using your Hadoop cluster; if you format the namenode later, you will lose all the data stored on HDFS. Notice that the "/home/hduser/hadoop/bin/" folder contains all the important scripts to start Hadoop, stop Hadoop, access HDFS, format HDFS, etc.

/home/hduser/hadoop/bin/hadoop namenode -format



47. Start Hadoop using the following command

/home/hduser/hadoop/bin/start-all.sh

48. Check if Hadoop is functioning using the command
jps

49. It should show all the Hadoop Java processes running (along with Jps itself). If not, something went wrong during installation

NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
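A quick way to read the jps output: every line except Jps itself should be one of the five Hadoop daemons. The block below runs that check against a simulated listing (the PIDs are made up); on the VM, replace the echo with the real jps command.

```shell
# Simulated output of `jps` on a healthy pseudo-distributed node (fake PIDs).
jps_output="12001 NameNode
12102 DataNode
12203 SecondaryNameNode
12304 JobTracker
12405 TaskTracker
12506 Jps"
echo "$jps_output" | grep -vc ' Jps$'   # prints 5: all daemons are up
```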



50. Hadoop has a web UI. Open the Chrome browser in your Lubuntu machine and go to localhost:50070. This will show the complete summary of the namenode and other details, like the number of live nodes. Since we did a Pseudo Mode setup there is only one live node (datanode)



Hope this helped. Do reach out to me if you have any questions.
If you found this blog useful, please convey your thanks by posting your comments.

I have posted another blog on how to set up a Hadoop Fully Distributed Mode (Multi Node) cluster. Link to view it: http://hadoopfullydistributedmode.blogspot.in/

Comments:

  1. This makes it very easy to install Hadoop in pseudo distributed mode. It is very helpful. Thanks a lot.

  2. Please let me know if I can install Hadoop on a Core2Duo processor with 3-4 GB of RAM.

    1. Yes. Treat it as a test system just for learning Hadoop

  3. Can I use Lubuntu in this configuration to install R, RStudio and PSQL?

    1. Yes you can, but you might face performance issues as it is a virtual system, so allocate more RAM to your VM

  4. Thanks, it's really helpful for Hadoop starters

  5. Thanks a lot. You really saved me.
    I tried the Cloudera sandbox on Ubuntu and was struggling for 30 hours.
    When I lost hope, your blog saved me. Thanks a lot.
    If possible, please tell how to run a simple word count program in Hadoop using the same setup

  6. I am getting a problem in step 26 (2nd half).
    The 26th step is "Now enter below command and you will not be prompted for any password which confirms the keys are shared properly
    ssh localhost"

    Up to step 25, it runs as per the steps. After 26 it asks for the 'passphrase for /home/hduser/.ssh/id_rsa'.
    Please help me solve this problem and download Hadoop.


    1. It seems the SSH key is not properly shared. Can you share a screenshot of the issue?

  7. Great, thanks for this useful article. Please, I would like to know whether I can apply the same steps for Hadoop 2.7 and Java 1.6. I only have 4 GB of physical RAM; how much memory should I allocate to
    the Lubuntu VM?

  8. { Step 46 }
    Permission denied on the namenode -format command. Please reply as soon as possible.
