What is Apache Hadoop? The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers.
Setting up Apache Hadoop in practice raises a lot of problems, because accurate documentation is scarce and some steps depend on the operating system you use. Everyone in our batch ran into trouble with it; in the end Thilina and I configured an Apache Hadoop cluster within 4 hours. I am sharing that knowledge here because it may help you. This tutorial has two parts: setting up the SSH connection between the nodes, and configuring Hadoop.
Ubuntu 11.04, 11.10
JDK 1.6.0_24
Please follow all of the steps below in this order.
1. Install the ssh and rsync packages on your machine.
$ sudo apt-get install ssh
$ sudo apt-get install rsync
If the installation fails, update your package lists as shown below and run the commands above again.
$ sudo apt-get update
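Before moving on it can help to confirm that the tools from step 1 are really on the PATH. This is only a small sketch; the function name missing_tools is my own and not part of any package.

```shell
#!/bin/sh
# Print the name of every given command that is NOT available on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# An empty result means both client tools from step 1 are usable.
missing=$(missing_tools ssh rsync)
if [ -n "$missing" ]; then
    echo "still missing: $missing"
else
    echo "ssh and rsync are installed"
fi
```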
2. First disable the firewall. Otherwise it may block ports that the Hadoop processes use, which makes Hadoop hard to configure.
$ sudo ufw disable
3. Create a user group as the superuser.
$ su root
$ sudo groupadd hadoop_user
4. Create the user hadoop and add that user to the group you just created.
$ sudo useradd --home-dir /home/hadoop --create-home --shell /bin/bash -U hadoop
$ sudo usermod -a -G hadoop_user hadoop
5. Set a password of your choice for the new user account. You will be asked to enter it twice.
$ passwd hadoop
6. Then check that the account works.
$ su hadoop
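Besides switching to the account, you can check that the group membership from step 4 took effect. The sketch below parses the output of id -nG; the helper name has_group is hypothetical.

```shell
#!/bin/sh
# Succeed if GROUP ($2) appears as a whole name in the space-separated LIST ($1).
has_group() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# id -nG prints the group names of a user; it prints nothing if the user is unknown.
groups_of_hadoop=$(id -nG hadoop 2>/dev/null)
if has_group "$groups_of_hadoop" hadoop_user; then
    echo "hadoop is a member of hadoop_user"
else
    echo "hadoop is not (yet) a member of hadoop_user"
fi
```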
7. Restart the machine and log in with the hadoop account.
8. Generate a key pair on the machine.
$ ssh-keygen -t rsa
Press Enter at each prompt. This creates a public key and a private key in the .ssh folder of the hadoop account. Do all of the steps above on the master node and on every slave machine as well.
9. Append the public key to the authorized_keys file on the same machine.
$ cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
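Running the cat command twice adds the key twice, and sshd with its default StrictModes setting refuses keys when the .ssh folder or authorized_keys is writable by others. A minimal sketch that avoids both problems, assuming the key file holds a single key; the function name authorize_key is my own.

```shell
#!/bin/sh
# Append the public key in PUBFILE ($1) to AUTHFILE ($2) unless it is already
# there, then tighten the permissions of the folder and the file.
authorize_key() {
    pubfile=$1
    authfile=$2
    mkdir -p "$(dirname "$authfile")"
    touch "$authfile"
    grep -qxF -- "$(cat "$pubfile")" "$authfile" || cat "$pubfile" >> "$authfile"
    chmod 700 "$(dirname "$authfile")"
    chmod 600 "$authfile"
}

# Usage for step 9:
#   authorize_key /home/hadoop/.ssh/id_rsa.pub /home/hadoop/.ssh/authorized_keys
```

The same function works for step 11 on a slave node if you pass master.pub as the first argument.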
10. Copy the master's public key to every slave node.
$ scp /home/hadoop/.ssh/id_rsa.pub IPADDRESS of slave:/home/hadoop/.ssh/master.pub
11. Then log in to each slave node from the master node and run this command, which appends the master's public key to the slave's authorized_keys file.
$ cat /home/hadoop/.ssh/master.pub >> /home/hadoop/.ssh/authorized_keys
12. Check that you can now log in to each slave machine, and to localhost, without a password.
$ ssh IPADDRESS of slave
$ ssh localhost
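With several slaves it is easier to test the logins in a loop. The sketch below relies on the ssh option BatchMode=yes, which makes ssh fail instead of asking for a password. The function names and the SSH_CMD variable are my own, and the address in the usage comment is hypothetical.

```shell
#!/bin/sh
SSH_CMD=${SSH_CMD:-ssh}

# Succeed only if HOST ($1) accepts a key-based login without any prompt.
can_login_without_password() {
    "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Print one result line per host.
check_hosts() {
    for host in "$@"; do
        if can_login_without_password "$host" >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}

# Usage:
#   check_hosts localhost 192.168.1.11
```

A host you have never connected to may also show FAILED until you have accepted its host key once by hand.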
13. Log in to the hadoop account and create a project folder in its home directory.
$ mkdir -p /home/hadoop/project
14. Install Hadoop into the project folder (copy hadoop-0.20.0.tar.gz into the project folder first).
$ cd /home/hadoop/project
$ tar -xzvf ./hadoop-0.20.0.tar.gz
15. Set the environment variables in the .profile or .bashrc file. The JAVA_HOME line is needed only if you installed the JDK externally; if you did not, leave that line out. HADOOP_HOME must point to the folder that the archive actually unpacked to.
export JAVA_HOME=/home/hadoop/jdk1.6.0_24
export HADOOP_HOME=/home/hadoop/project/hadoop-0.20.2
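The later steps call hadoop without a path, so the bin folder of Hadoop has to be on the PATH as well. The PATH line below is my addition and an assumption about how the commands were run. The sketch appends each line only once, so it is safe to run again; the function names are hypothetical.

```shell
#!/bin/sh
# Append LINE ($1) to FILE ($2) only if that exact line is not already present.
append_once() {
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Add the Hadoop settings to PROFILE ($1), for example /home/hadoop/.bashrc.
add_hadoop_env() {
    profile=$1
    append_once 'export HADOOP_HOME=/home/hadoop/project/hadoop-0.20.2' "$profile"
    # Assumption: lets you type "hadoop" instead of the full path to bin/hadoop.
    append_once 'export PATH=$PATH:$HADOOP_HOME/bin' "$profile"
}

# Usage:
#   add_hadoop_env /home/hadoop/.bashrc
```

Open a new shell or source the file afterwards so that the variables take effect.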
16. Set the JAVA_HOME environment variable in hadoop/conf/hadoop-env.sh.
Uncomment the JAVA_HOME line and change it to the location of your Java installation.
export JAVA_HOME=/home/hadoop/jdk1.6.0_24
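If you have to repeat this edit on many nodes, sed can do it for you. This is a sketch under the assumption that hadoop-env.sh already contains a JAVA_HOME line, commented out or not; if there is no such line, the file is left unchanged. The function name set_java_home is my own.

```shell
#!/bin/sh
# Replace the (possibly commented-out) JAVA_HOME line in ENVFILE ($1)
# with an active export that points to JAVA_DIR ($2).
# JAVA_DIR must not contain the characters | or &.
set_java_home() {
    envfile=$1
    java_dir=$2
    tmpfile=$(mktemp)
    sed "s|^#* *export JAVA_HOME=.*|export JAVA_HOME=$java_dir|" "$envfile" > "$tmpfile" &&
        cat "$tmpfile" > "$envfile"
    rm -f "$tmpfile"
}

# Usage:
#   set_java_home /home/hadoop/project/hadoop-0.20.2/conf/hadoop-env.sh /home/hadoop/jdk1.6.0_24
```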
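17. Configure the Hadoop parameters in the XML files in the conf folder.
Set the core parameters in conf/core-site.xml (the available parameters and their defaults are listed in HADOOP_HOME/src/core/core-default.xml). Using the master node's IP address here is fine. If you use a domain name instead, you also need to add it to the hosts file in the /etc/ folder.
Set the HDFS parameters in conf/hdfs-site.xml (defaults: HADOOP_HOME/src/hdfs/hdfs-default.xml).
Set the MapReduce parameters in conf/mapred-site.xml (defaults: HADOOP_HOME/src/mapred/mapred-default.xml). You can run the job tracker on a different machine if you want; you only need to change the IP address of the job tracker.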
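This step does not show the property values, so here is a rough sketch of what the three override files in the conf folder (core-site.xml, hdfs-site.xml, mapred-site.xml) could contain in Hadoop 0.20. The property names fs.default.name, dfs.replication and mapred.job.tracker are the real 0.20 names. The ports 9000 and 9001, the replication factor of 2 and the function name are assumptions of mine, not values from this tutorial.

```shell
#!/bin/sh
# Write minimal site overrides into CONF_DIR ($1) for a cluster whose
# namenode and job tracker both run on MASTER ($2).
# Ports 9000/9001 and the replication factor 2 are illustrative assumptions.
write_site_configs() {
    conf_dir=$1
    master=$2
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property><name>fs.default.name</name><value>hdfs://$master:9000</value></property>
</configuration>
EOF
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
</configuration>
EOF
    cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property><name>mapred.job.tracker</name><value>$master:9001</value></property>
</configuration>
EOF
}

# Usage (hypothetical master address):
#   write_site_configs /home/hadoop/project/hadoop-0.20.2/conf 192.168.1.10
```

To run the job tracker on another machine, only the address in mapred.job.tracker has to change.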
After changing the XML files, put the master's IP address in the masters file in the conf folder, and put the IP addresses of the slaves in the slaves file in the conf folder.
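In Hadoop 0.20 these two files are named masters and slaves and hold one address per line. A minimal sketch that writes both; the function name and the addresses in the usage comment are hypothetical.

```shell
#!/bin/sh
# Write conf/masters and conf/slaves. Arguments: CONF_DIR, the master
# address, then one or more slave addresses.
write_node_lists() {
    conf_dir=$1
    master=$2
    shift 2
    printf '%s\n' "$master" > "$conf_dir/masters"
    printf '%s\n' "$@" > "$conf_dir/slaves"
}

# Usage:
#   write_node_lists /home/hadoop/project/hadoop-0.20.2/conf 192.168.1.10 192.168.1.11 192.168.1.12
```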
18. After configuring the master node, you can copy Hadoop to the other machines.
$ scp -r /home/hadoop/project IPADDRESS of slave:/home/hadoop/
Run the command above once for each node, changing the IP address of the slave each time.
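Instead of typing the scp command once per node, you can loop over a file that lists the slave addresses, such as conf/slaves. The sketch below only prints the commands by default; set DRY_RUN=0 to really run them. The helper names run and copy_to_slaves are my own.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Copy SRC_DIR ($2) to /home/hadoop/ on every host listed in SLAVES_FILE ($1).
# One address per line; blank lines and lines starting with # are skipped.
copy_to_slaves() {
    slaves_file=$1
    src_dir=$2
    while IFS= read -r host; do
        case $host in ''|'#'*) continue ;; esac
        # </dev/null keeps scp from swallowing the rest of the host list.
        run scp -r "$src_dir" "$host:/home/hadoop/" < /dev/null
    done < "$slaves_file"
}

# Usage:
#   copy_to_slaves /home/hadoop/project/hadoop-0.20.2/conf/slaves /home/hadoop/project
```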
19. Now change the environment variables on each slave node in the same way as above, in the .bashrc or .profile file and in the hadoop/conf/hadoop-env.sh file.
20. Now log in to the master node with the hadoop account and format the namenode (run this from inside the bin folder).
$ hadoop namenode -format
21. Start the server.
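The step above does not name a command. Hadoop 0.20 ships a start-all.sh script in its bin folder that starts all daemons (and stop-all.sh to stop them), so that is probably what is meant here, but treat it as my assumption. After starting, the JDK tool jps lists the running Java processes, and the sketch below reports which expected daemons are missing. The function name missing_daemons is my own.

```shell
#!/bin/sh
# Read jps output on stdin and print every expected daemon name
# (given as arguments) that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in "$@"; do
        printf '%s\n' "$jps_output" | grep -qw -- "$daemon" || echo "$daemon"
    done
}

# Usage on the master node:
#   jps | missing_daemons NameNode SecondaryNameNode JobTracker
# Usage on a slave node:
#   jps | missing_daemons DataNode TaskTracker
```

No output means every daemon you asked about is running.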
22. Run your jar file on the Hadoop cluster.
$ hadoop jar /home/hadoop/smscount.jar org.sms.SmsCount /home/hadoop/smscount/input /home/hadoop/smscount/output
Before running it, you need to know how to handle HDFS folders:
Create folder : $ hadoop dfs -mkdir /home/hadoop/smscount/input
List files and folders : $ hadoop dfs -ls /home/hadoop/smscount/input
Remove folder : $ hadoop dfs -rmr /home/hadoop/smscount/input
Put file : $ hadoop dfs -put /home/hadoop/file1 /home/hadoop/smscount/input
After the smscount jar has run, the HDFS output folder is created automatically. Remember that if you need to run the job a second time, you must remove the output folder first; otherwise the job throws an exception.
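To avoid that exception you can wrap the two commands in a small function that always removes the old output folder first. This is only a sketch: it prints the commands by default, and DRY_RUN=0 runs them for real. The helper names run and rerun_smscount are my own.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

IN=/home/hadoop/smscount/input
OUT=/home/hadoop/smscount/output

# Remove the output folder of a previous run, then start the job again.
rerun_smscount() {
    # On the very first run the folder does not exist yet; ignore that error.
    run hadoop dfs -rmr "$OUT" || true
    run hadoop jar /home/hadoop/smscount.jar org.sms.SmsCount "$IN" "$OUT"
}

# Usage:
#   DRY_RUN=0 rerun_smscount
```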
23. To view the result:
$ hadoop dfs -cat /home/hadoop/smscount/output/part-00000
24. Stop the server.
25. While the cluster is running, you can check the following links. Enjoy!
Hadoop Distributed File System (HDFS):
http://IPADDRESS of namenode:50070
Hadoop Jobtracker:
http://IPADDRESS of jobtracker:50030
Hadoop Tasktracker:
http://IPADDRESS of map-reduce processor:50060
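If you prefer to have the three links printed with your own addresses filled in, a tiny helper like this will do. The function name and the example addresses are hypothetical; the ports are the ones listed above.

```shell
#!/bin/sh
# Print the web UI address of each daemon.
# Arguments: namenode address, job tracker address, task tracker address.
ui_urls() {
    echo "HDFS: http://$1:50070"
    echo "Jobtracker: http://$2:50030"
    echo "Tasktracker: http://$3:50060"
}

# Example with made-up addresses:
ui_urls 192.168.1.10 192.168.1.10 192.168.1.11
```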