In this blog I have recorded detailed steps, with supporting screenshots, to install and set up a Hadoop cluster in Pseudo-Distributed Mode using a 64-bit Windows PC or laptop.
This is a 3-step process:
Step 1 – Install VM Player
Step 2 – Setup Lubuntu Virtual Machine
Step 3 – Install Hadoop
Step 1 – Install VM Player
1. Do a Google search for VMware Player and download it. You can also download it from the website http://vmware-player.joydownload.com
2. Continue clicking on Next button to complete the installation
3. Click on the VM Player desktop icon to open VM Player tool
STEP 2 – SETUP LUBUNTU VIRTUAL MACHINE
1. Download the Lubuntu 12.04 image (lubuntu-12.04-alternate-i386.iso) file from http://cdimage.ubuntu.com/lubuntu/releases/12.04/release/
2. In VM Player click on “Create New Virtual Machine”
3. The New Virtual Machine Wizard will pop up
4. Select the radio button “Installer disc image file (iso):”, browse and select the file “lubuntu-12.04-alternate-i386.iso”, and click NEXT
5. On the next window make no changes and just click NEXT
6. On the next window change the virtual machine name to “HadoopPseudoMode” (or one of your choice) and click NEXT
7. On the next window make no changes and just click NEXT
8. On the next window make no changes and just click FINISH
9. Your Lubuntu VM is ready
10. Next we have to install the OS in this VM
11. Click on the "Play Virtual Machine" link
12. You will get a pop-up to install updates. Do not do it; just click “Remind me later”
13. The mouse will not work inside the virtual machine until the OS is completely installed; you have to use the keyboard only. At any point, to come out of the VM and get back to your Windows machine, use the hotkey CTRL + ALT
14. On the Language screen select “English” and press ENTER button.
15. On the next screen just press ENTER button
16. On the Language screen select “English” and press ENTER button.
17. On the next screen select “India” as location and press ENTER button.
18. On the next “Configure the Keyboard” screen, select “Yes” and press ENTER button
19. On the next “Configure the Keyboard” screen, press “Y” button from your keyboard
20. On the next “Configure the Keyboard” screen, press “W” button from your keyboard
21. On the next set of screens, select the “NO” option and keep pressing the ENTER button until you get the “Configure Keyboard” confirmation screen
22. Click on “Continue” and the OS components will start loading
23. On the next screen, just select “Continue” and press ENTER button
24. On the next screen, enter the username “adminuser”, select “Continue” and press ENTER button. The same screen will pop up to re-enter the username; enter “adminuser” again, select “Continue” and press ENTER button
25. On the next screen, enter the password “adminuser”, select “Continue” and press ENTER button. The same screen will pop up to re-enter the password; enter “adminuser” again, select “Continue” and press ENTER button
26. On the next screen, select “NO” and press ENTER button
27. On the next screen, select “YES” and press ENTER button
28. On the next screen, press ENTER button
29. On the next screen, press ENTER button
30. On the next screen, press ENTER button
31. On the next screen, press ENTER button
32. On the next screen, select “YES” and press ENTER button
33. On the next screen, select “Continue” and press ENTER button
34. The installation will continue for about 15 minutes. Wait for it to progress
35. On the next screen, select “YES” and press ENTER button
36. On the next screen, select “YES” and press ENTER button
37. On the next screen, select “Continue” and press ENTER button
38. This will complete the OS installation and the login screen will show up
39. Log in to the adminuser account and your Lubuntu machine is up and ready
STEP 3 – INSTALL HADOOP
1. We are doing the Pseudo Mode setup, in which one system hosts the namenode, secondary namenode, datanode, jobtracker and tasktracker. However, each of them will run in a separate JVM (Java Virtual Machine)
2. Log in to the Lubuntu machine with the adminuser username and password
3. From the Menu Bar go to Accessories >> LXTerminal
4. Open LXTerminal to type commands
5. Let’s add a new group called "hadoop" using the following command
sudo addgroup hadoop
6. You will be prompted for a password. Give the password of the user you are logged in as, i.e. adminuser. (sudo is used when we want to run a command as the super user, i.e. the admin of the system.)
7. Let’s add a new user called hduser in the hadoop group
sudo adduser --ingroup hadoop hduser
8. You will be asked to enter a password for the new user hduser we are creating. Enter the password hduser (or one of your choice, but make sure to remember it)
9. You will be asked to enter a name and work details for the new user hduser. Leave them blank by just pressing the ENTER button. A confirmation will pop up for Y (yes) or N (no). Type Y and press ENTER. The new user hduser is created.
10. Let’s give admin rights to hduser
sudo adduser hduser sudo
11. Log out using the logout option from the menu bar of your VM and log back in with the hduser account
12. Open the LXTerminal to type commands
13. Let’s install Java 6 as Hadoop is developed in Java
sudo apt-get install openjdk-6-jdk
14. You will be prompted for a password. Enter the password hduser, and then you will be asked for confirmation. Type Y and press ENTER
15. Java version 6 will be downloaded and installed. This will take around 5 minutes or more depending on your internet download speed
16. Next we need to install an SSH server. SSH stands for Secure Shell; it is used to log in remotely from one Linux machine to another and gives access to the remote machine's shell.
sudo apt-get install openssh-server
17. You might be prompted for a password. Enter the password hduser, and then you will be asked for confirmation. Type Y and press ENTER
18. SSH will be downloaded and installed. This will take a few minutes
19. We can log in to a remote machine using the following command
ssh <ip-address>
20. If you try ssh localhost, you will be prompted for a password. We want to make this login password-less. One way of doing that is to use keys, which we can generate using the following command (at the passphrase prompts, just press ENTER to leave the passphrase empty; a key with a passphrase will still prompt you at every login)
ssh-keygen -t rsa
21. It will prompt you for a path to store the keys; don't type anything, just press ENTER.
22. This command will generate two keys at the "/home/hduser/.ssh/" path: id_rsa and id_rsa.pub.
a) id_rsa is the private key
b) id_rsa.pub is the public key
23. To log in to a remote machine, we share our public key with that machine. In our case it is the local machine, so the following command is used.
ssh-copy-id -i /home/hduser/.ssh/id_rsa.pub hduser@localhost
24. You will be prompted for confirmation; type yes and press ENTER. Then you will be prompted for a password; give the password for hduser
25. A “localhost added to the list of known hosts” confirmation message will show on screen
26. Now enter the command below; you will not be prompted for any password, which confirms the keys were shared properly
ssh localhost
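Under the hood, ssh-copy-id simply appends your public key to the target user's authorized_keys file. The sketch below illustrates that mechanism with a throwaway /tmp directory and a placeholder key (both are assumptions for illustration; the real files live in /home/hduser/.ssh):

```shell
# Sketch of the ssh-copy-id mechanism: append the public key to
# authorized_keys on the target machine. Placeholder key and /tmp
# path are used here for illustration only.
AK_DIR=/tmp/demo_ssh
PUB="ssh-rsa AAAAB3FAKEKEY hduser@localhost"   # stand-in public key
mkdir -p "$AK_DIR"
rm -f "$AK_DIR/authorized_keys"
echo "$PUB" >> "$AK_DIR/authorized_keys"       # ssh-copy-id appends like this
chmod 600 "$AK_DIR/authorized_keys"            # sshd requires strict permissions
grep -c "hduser@localhost" "$AK_DIR/authorized_keys"   # prints 1
```

Because sshd silently ignores authorized_keys files with loose permissions, the chmod step matters as much as the copy itself.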
27. The next step is to install Hadoop 1.0.3. You can download the file hadoop-1.0.3.tar.gz on your Windows system from the path listed below
https://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
28. To move the hadoop-1.0.3.tar.gz file from Windows to the Lubuntu system we can use the WinSCP or FileZilla tool. We will download winscp556setup.exe from the path listed below and install it on the Windows system
http://winscp.net/download/winscp556setup.exe
29. Continue clicking Next to finish the installation
30. In WinSCP, the host name is the IP address of the Lubuntu machine. To find it, enter the following command on your Lubuntu machine
ifconfig
31. The IP address of my Lubuntu machine is 192.168.xx.xxx
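If you prefer to pull the address out of the ifconfig output with a script rather than reading it by eye, grep and cut can do it. The line in OUT below is a canned sample (a made-up address for illustration); on the VM you would capture real output with OUT=$(ifconfig eth0) instead:

```shell
# Extract the IPv4 address from old-style ifconfig output.
# OUT is a canned sample line; on the VM use: OUT=$(ifconfig eth0)
OUT="eth0  inet addr:192.168.10.105  Bcast:192.168.10.255  Mask:255.255.255.0"
IP=$(echo "$OUT" | grep -o 'inet addr:[0-9.]*' | cut -d: -f2)
echo "$IP"   # prints 192.168.10.105
```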
32. Enter the IP address of your Lubuntu machine in WinSCP and click on Login
33. You will be prompted for a username and password. Enter both as hduser
34. You will be connected, with your Lubuntu system shown in the right pane and your Windows system in the left. You can drag and drop files between the systems
35. Move the hadoop-1.0.3.tar.gz file from Windows to the Lubuntu system under the path /home/hduser/downloads
36. In the Lubuntu VM, go to the folder path /home/hduser/downloads. Right-click on the hadoop-1.0.3.tar.gz file and click on Extract
37. Rename the extracted folder from hadoop-1.0.3 to hadoop
38. Cut and move the hadoop folder to the path /home/hduser
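Steps 36-38 can also be done from LXTerminal instead of the file manager. The sketch below runs against a stand-in tarball in /tmp so it is safe to try anywhere; on the VM you would cd to /home/hduser/downloads and operate on the real hadoop-1.0.3.tar.gz:

```shell
# Extract, rename and move in one go. Demo tarball in /tmp; on the VM,
# replace /tmp/demo with /home/hduser/downloads and use the real tarball.
mkdir -p /tmp/demo/hadoop-1.0.3/conf          # build a stand-in archive
cd /tmp/demo
tar -czf hadoop-1.0.3.tar.gz hadoop-1.0.3 && rm -r hadoop-1.0.3
tar -xzf hadoop-1.0.3.tar.gz                  # step 36: extract
rm -rf /tmp/demo/hadoop
mv hadoop-1.0.3 /tmp/demo/hadoop              # steps 37-38: rename and move
ls /tmp/demo/hadoop                           # prints: conf
```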
39. Now we need to make changes in the Hadoop configuration files. You will find these files in the "/home/hduser/hadoop/conf" folder.
40. There are 4 important files in this folder
hadoop-env.sh
hdfs-site.xml
mapred-site.xml
core-site.xml
41. hadoop-env.sh is a file which contains Hadoop environment-related properties. Here we can set properties like where the Java home is, the heap memory size, Hadoop's classpath, which IP version to use, etc. We will set the Java home in this file. For me, the Java home is "/usr/lib/jvm/java-6-openjdk-i386". Open the file using Leafpad
42. Search for the line starting with "# export JAVA_HOME", replace the entire line with the line below, and save the file
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
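If you would rather not open Leafpad, the same edit can be made non-interactively with sed. The sketch below works on a scratch copy in /tmp (the exact wording of the default commented line is an assumption and may differ between Hadoop releases); on the VM, point it at /home/hduser/hadoop/conf/hadoop-env.sh instead:

```shell
# Replace the commented-out JAVA_HOME line using sed -i (in-place edit).
# Scratch copy in /tmp; the default line shown below is an assumption.
CONF=/tmp/hadoop-env.sh
printf '# export JAVA_HOME=/usr/lib/j2sdk1.5-sun\n' > "$CONF"
sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386|' "$CONF"
grep '^export JAVA_HOME' "$CONF"
# prints: export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386
```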
43. hdfs-site.xml is a file which contains properties related to HDFS (Hadoop Distributed File System). Here we need to set the replication factor. By default the replication factor is 3; since we are installing Hadoop on a single machine, we will set it to 1. Copy the following in between the configuration tags in the file.
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
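A typo in these XML files is a common source of startup failures, so it is worth a quick grep to confirm the value actually landed in the file. The sketch below writes a demo copy to /tmp; on the VM you would grep the real /home/hduser/hadoop/conf/hdfs-site.xml:

```shell
# Sanity-check that dfs.replication is set to 1. Demo file in /tmp;
# on the VM grep /home/hduser/hadoop/conf/hdfs-site.xml instead.
cat > /tmp/hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>
EOF
V=$(grep -o '<value>[0-9]*</value>' /tmp/hdfs-site.xml)
echo "$V"   # prints: <value>1</value>
```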
44. mapred-site.xml is a file that contains properties related to MapReduce. Here we will set the IP address and port of the machine on which the jobtracker runs. Copy the following in between the configuration tags
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
45. core-site.xml is a property file which contains properties that are common to, or used by, both MapReduce and HDFS. Here we will set the IP address and port number of the machine on which the namenode will be running. The other property tells Hadoop where to store files like the fsimage, blocks, etc. Copy the following in between the configuration tags.
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadoop_tmp_files</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
46. Open the terminal and format the namenode with the following command. The namenode should be formatted only once, before you start using your Hadoop cluster. If you format the namenode later, you will lose all the data stored on HDFS. Notice that the "/home/hduser/hadoop/bin/" folder contains all the important scripts to start Hadoop, stop Hadoop, access HDFS, format HDFS, etc.
/home/hduser/hadoop/bin/hadoop namenode -format
47. Start Hadoop using the following command
/home/hduser/hadoop/bin/start-all.sh
48. Check if Hadoop is functioning using the command
jps
49. It should show all of the following Java processes running. If not, something went wrong during the installation:
NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
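The step-49 check can be made explicit with a small loop over the jps output. OUT below is canned sample output (made-up PIDs) so the sketch runs anywhere; on the VM you would set OUT=$(jps) instead:

```shell
# Check that all five Hadoop 1.x daemons appear in jps output.
# OUT is canned sample output here; on the VM use: OUT=$(jps)
OUT="1234 NameNode
2345 DataNode
3456 SecondaryNameNode
4567 JobTracker
5678 TaskTracker"
STATUS=$(for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
  echo "$OUT" | grep -q "$d" && echo "$d: running" || echo "$d: MISSING"
done)
echo "$STATUS"   # five "...: running" lines for a healthy pseudo cluster
```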
50. Hadoop has a web UI. Open the Chrome browser on your Lubuntu machine and browse to localhost:50070. This will show the complete summary of the namenode and other details like the number of live nodes. Since we did a Pseudo Mode setup, there is only one live node (datanode)
Hope this helped. Do reach out to me if you have any questions
If you found this blog useful, please convey your thanks by posting your comments.
I have posted another blog on how to set up a Hadoop Fully Distributed Mode (Multi Node) cluster. Link to view it: http://hadoopfullydistributedmode.blogspot.in/
Thanks
This makes it very easy to install Hadoop in pseudo-distributed mode. It is very helpful. Thanks a lot.
Please let me know if I can install Hadoop on a Core2Duo processor with 3-4 GB of RAM.
Yes. Treat it as a test system, just for learning Hadoop.
Can I use the Lubuntu VM in this configuration to install R, RStudio and PSQL?
Yes, you can. But you might face performance issues as it is a virtual system, so allocate more RAM to your VM.
Thanks, it's really helpful for Hadoop starters.
Thanks a lot. You really saved me. I tried the Cloudera sandbox on Ubuntu and was struggling for 30 hours. When I lost hope, your blog saved me. If possible, please tell me how to run a simple word count program in Hadoop using the same setup.
Thanks a lot, brother!
I am getting a problem in step 26 (2nd half). The 26th step is "Now enter below command and you will not be prompted for any password which confirms the keys are shared properly: ssh localhost". Up to step 25, it runs as per the steps. After step 26 it asks for the 'passphrase for /home/hduser/.ssh/id_rsa'. Please help me to solve this problem so I can continue and download Hadoop.
It seems the SSH key is not properly shared. Can you share a screenshot of the issue?
Great, thanks for this useful article. Can I apply the same steps for Hadoop 2.7 and Java 1.6? I only have 4 GB of physical RAM. How much memory should I allocate to the Lubuntu VM?
{ Step 46 } Permission denied in the namenode -format command. Please reply as soon as possible.