Wednesday, May 14, 2014

Ubuntu Server 14.04 - Install single node Hadoop 2.4.0

Follow the steps below to install single-node Hadoop 2.4.0.
Prerequisites:
- Installed JDK 1.7.x (if not, see this link: Install JDK 1.7.x)

A. System Configuration

1. Add Hadoop system user (Optional step)
This step is optional, but we recommend creating a dedicated Hadoop system user to keep Hadoop separate from other software applications.
- Add new group
root@ubuntu:~# addgroup hadoop

- Add a new user (hduser) in the hadoop group
root@ubuntu:~# adduser --ingroup hadoop hduser
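
Optionally, confirm that the new account exists and belongs to the hadoop group (the exact uid/gid values will vary per machine):
root@ubuntu:~# id hduser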

2. Config SSH access
Prerequisites:
- Make sure that SSH is up and running on your machine and that it is configured to allow public key authentication.

Generate an SSH key for the hduser:
root@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$

Enable SSH access with the newly created key above:
hduser@ubuntu:~$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
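
If key-based login later still prompts for a password, the usual cause is overly permissive modes on the .ssh directory; tightening them is a safe, optional extra step:
hduser@ubuntu:~$ chmod 700 /home/hduser/.ssh
hduser@ubuntu:~$ chmod 600 /home/hduser/.ssh/authorized_keys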

Verify that SSH access to the hduser account works:
root@ubuntu:~# ssh hduser@localhost
hduser@localhost's password:
Welcome to Ubuntu 14.04 LTS (GNU/Linux 3.13.0-24-generic i686)

 * Documentation:  https://help.ubuntu.com/

  System information as of Tue May 13 11:36:31 ICT 2014

  System load:  0.08              Processes:           85
  Usage of /:   5.3% of 37.04GB   Users logged in:     2
  Memory usage: 2%                IP address for eth0: 192.168.1.101
  Swap usage:   0%

  Graph this data and manage this system at:
    https://landscape.canonical.com/

Last login: Tue May 13 11:36:32 2014 from localhost
hduser@ubuntu:~$

3. Disable IPV6
"Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster.If your organisation moves to IPv6 only, you will encounter problems."

As the root user, edit /etc/sysctl.conf to disable IPv6:
root@ubuntu:~# vi /etc/sysctl.conf

Add the following lines to the end of the file
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save and quit (:wq)

Reboot the system so that the new configuration takes effect.
root@ubuntu:~# reboot
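
After the reboot, you can optionally confirm that IPv6 is really disabled; a value of 1 means the setting took effect:
root@ubuntu:~# cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1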

B. Hadoop Installation
1. Download hadoop-2.4.0.tar.gz
root@ubuntu:~# wget http://apache.mirrors.pair.com/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
--2014-05-13 11:16:02--  http://apache.mirrors.pair.com/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
Resolving apache.mirrors.pair.com (apache.mirrors.pair.com)... 216.92.2.131
Connecting to apache.mirrors.pair.com (apache.mirrors.pair.com)|216.92.2.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 138943699 (133M) [application/x-gzip]
Saving to: hadoop-2.4.0.tar.gz

100%[======================================>] 138,943,699  247KB/s   in 9m 19s

2014-05-13 11:25:22 (243 KB/s) - hadoop-2.4.0.tar.gz saved [138943699/138943699]

root@ubuntu:~#
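
Optionally, verify the integrity of the download before installing. Apache publishes checksums for every release on its download pages; compare the locally computed digest against the published value (not reproduced here):
root@ubuntu:~# sha1sum hadoop-2.4.0.tar.gz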

2. Move the downloaded file to /usr/local
root@ubuntu:~# mv hadoop-2.4.0.tar.gz /usr/local/

3. Extract file
root@ubuntu:~# cd /usr/local/
root@ubuntu:/usr/local# tar xzf hadoop-2.4.0.tar.gz

4. Rename the hadoop-2.4.0 folder to hadoop
root@ubuntu:/usr/local# mv hadoop-2.4.0 hadoop

5. Change owner of all files in hadoop folder
root@ubuntu:/usr/local# chown -R hduser:hadoop hadoop
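
A quick listing confirms that hduser:hadoop now owns the directory:
root@ubuntu:/usr/local# ls -ld hadoop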

6. Configure the Hadoop files for a single node:
hduser@ubuntu:/usr/local# cd hadoop/etc/hadoop/

a. Modify yarn-site.xml
hduser@ubuntu:/usr/local/hadoop/etc/hadoop# vi yarn-site.xml

Insert code below
<configuration>
 <!-- Site specific YARN configuration properties -->
 <property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
 </property>
 <property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
</configuration>

b. Modify core-site.xml
hduser@ubuntu:/usr/local/hadoop/etc/hadoop# vi core-site.xml

Insert code below
<configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
 </property>
</configuration>

c. Create mapred-site.xml
hduser@ubuntu:/usr/local/hadoop/etc/hadoop# vi mapred-site.xml

Insert code below
<configuration>
 <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
 </property>
</configuration>
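
Note: the Hadoop 2.4.0 distribution ships a mapred-site.xml.template in this same directory; if you prefer to keep the stock comments, you can copy the template first and then add the property above (optional):
hduser@ubuntu:/usr/local/hadoop/etc/hadoop# cp mapred-site.xml.template mapred-site.xml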

d. Modify hdfs-site.xml
hduser@ubuntu:/usr/local/hadoop/etc/hadoop# vi hdfs-site.xml

Insert code below
<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
 <property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
 </property>
 <property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
 </property>
</configuration>

7. Create the folders that will store the NameNode and DataNode data (the same paths referenced in hdfs-site.xml)
hduser@ubuntu:/usr/local/hadoop/etc/hadoop# mkdir -p /usr/local/hadoop/yarn_data/hdfs/namenode
hduser@ubuntu:/usr/local/hadoop/etc/hadoop# mkdir -p /usr/local/hadoop/yarn_data/hdfs/datanode

8. Update $HOME/.bashrc for the hduser user
# Hadoop variables
export JAVA_HOME=/usr/local/java/jdk1.7.0_55/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
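
If you do not want to wait for the reboot in step 10, the new variables can also be loaded into the current shell right away (optional):
hduser@ubuntu:~$ source ~/.bashrc
hduser@ubuntu:~$ echo $HADOOP_HOME
/usr/local/hadoop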

9. Open and modify hadoop-env.sh
hduser@ubuntu:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Modify JAVA_HOME like this
export JAVA_HOME=/usr/local/java/jdk1.7.0_55/
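
Double-check that this path matches where the JDK is actually installed on your machine (jdk1.7.0_55 is the version used throughout this tutorial; adjust if yours differs):
hduser@ubuntu:~$ /usr/local/java/jdk1.7.0_55/bin/java -version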

10. Reboot the system to apply the new configuration
root@ubuntu:~# reboot

11. After rebooting, verify the installed Hadoop version with the following command in the terminal
hduser@ubuntu:~$ hadoop version
Hadoop 2.4.0
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1583262
Compiled by jenkins on 2014-03-31T08:29Z
Compiled with protoc 2.5.0
From source with checksum 375b2832a6641759c6eaf6e3e998147
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.4.0.jar
hduser@ubuntu:~$

C. Hadoop Start-up
1. Format the HDFS filesystem via the NameNode
The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which consists of only your local machine if you followed this tutorial). You only need to do this the first time you set up a Hadoop cluster.

Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!


hduser@ubuntu:~$ hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/05/14 09:36:44 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.4.0
STARTUP_MSG:   classpath = ........................................
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common -r 1583262; compiled by 'jenkins' on 2014-03-31T08:29Z
STARTUP_MSG:   java = 1.7.0_55
************************************************************/
14/05/14 09:36:44 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
14/05/14 09:36:44 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-e9602272-9da3-414e-a3b2-0003f94927eb
14/05/14 09:36:45 INFO namenode.FSNamesystem: fsLock is fair:true
14/05/14 09:36:45 INFO namenode.HostFileManager: read includes:
HostSet(
)
14/05/14 09:36:45 INFO namenode.HostFileManager: read excludes:
HostSet(
)
14/05/14 09:36:45 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
14/05/14 09:36:45 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
14/05/14 09:36:45 INFO util.GSet: Computing capacity for map BlocksMap
14/05/14 09:36:45 INFO util.GSet: VM type       = 32-bit
14/05/14 09:36:45 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
14/05/14 09:36:45 INFO util.GSet: capacity      = 2^22 = 4194304 entries
14/05/14 09:36:45 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
14/05/14 09:36:45 INFO blockmanagement.BlockManager: defaultReplication         = 1
14/05/14 09:36:45 INFO blockmanagement.BlockManager: maxReplication             = 512
14/05/14 09:36:45 INFO blockmanagement.BlockManager: minReplication             = 1
14/05/14 09:36:45 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
14/05/14 09:36:45 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
14/05/14 09:36:45 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
14/05/14 09:36:45 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
14/05/14 09:36:45 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
14/05/14 09:36:45 INFO namenode.FSNamesystem: fsOwner             = hduser (auth:SIMPLE)
14/05/14 09:36:45 INFO namenode.FSNamesystem: supergroup          = supergroup
14/05/14 09:36:45 INFO namenode.FSNamesystem: isPermissionEnabled = true
14/05/14 09:36:45 INFO namenode.FSNamesystem: HA Enabled: false
14/05/14 09:36:45 INFO namenode.FSNamesystem: Append Enabled: true
14/05/14 09:36:45 INFO util.GSet: Computing capacity for map INodeMap
14/05/14 09:36:45 INFO util.GSet: VM type       = 32-bit
14/05/14 09:36:45 INFO util.GSet: 1.0% max memory 966.7 MB = 9.7 MB
14/05/14 09:36:45 INFO util.GSet: capacity      = 2^21 = 2097152 entries
14/05/14 09:36:45 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/05/14 09:36:45 INFO util.GSet: Computing capacity for map cachedBlocks
14/05/14 09:36:45 INFO util.GSet: VM type       = 32-bit
14/05/14 09:36:45 INFO util.GSet: 0.25% max memory 966.7 MB = 2.4 MB
14/05/14 09:36:45 INFO util.GSet: capacity      = 2^19 = 524288 entries
14/05/14 09:36:45 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
14/05/14 09:36:45 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
14/05/14 09:36:45 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
14/05/14 09:36:45 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
14/05/14 09:36:45 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
14/05/14 09:36:45 INFO util.GSet: Computing capacity for map NameNodeRetryCache
14/05/14 09:36:45 INFO util.GSet: VM type       = 32-bit
14/05/14 09:36:45 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
14/05/14 09:36:45 INFO util.GSet: capacity      = 2^16 = 65536 entries
14/05/14 09:36:45 INFO namenode.AclConfigFlag: ACLs enabled? false
Re-format filesystem in Storage Directory /usr/local/hadoop/yarn_data/hdfs/namenode ? (Y or N)

14/05/14 09:38:10 INFO namenode.FSImage: Allocated new BlockPoolId: BP-140501697-127.0.1.1-1400035090698
14/05/14 09:38:10 INFO common.Storage: Storage directory /usr/local/hadoop/yarn_data/hdfs/namenode has been successfully formatted.
14/05/14 09:38:11 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/05/14 09:38:11 INFO util.ExitUtil: Exiting with status 0
14/05/14 09:38:11 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:~$

2. Start Single-Node Cluster
hduser@ubuntu:~$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 64:0d:71:90:1b:7a:af:93:55:39:45:5d:ec:16:c7:44.
Are you sure you want to continue connecting (yes/no)? no
0.0.0.0: Host key verification failed.
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-ubuntu.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-ubuntu.out
hduser@ubuntu:~$
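
Note: in the transcript above the secondary namenode on 0.0.0.0 did not start because the unknown host key was rejected ("no" was answered). Answering "yes" once, or pre-adding the key with ssh-keyscan, avoids the prompt on later starts (an optional convenience, not part of the original steps):
hduser@ubuntu:~$ ssh-keyscan -H 0.0.0.0 >> ~/.ssh/known_hosts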

3. Check whether the Hadoop processes started successfully
hduser@ubuntu:~$ jps
1898 NodeManager
1407 NameNode
1556 DataNode
2179 Jps
1769 ResourceManager
hduser@ubuntu:~$
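
If a daemon is missing from the jps list (for example the SecondaryNameNode, as in this run), the per-daemon files under /usr/local/hadoop/logs are the first place to look; the file names match those printed by start-all.sh above:
hduser@ubuntu:~$ ls /usr/local/hadoop/logs/
hduser@ubuntu:~$ tail -n 50 /usr/local/hadoop/logs/hadoop-hduser-datanode-ubuntu.out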

4. Stop Single-Node Cluster
hduser@ubuntu:~$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 64:0d:71:90:1b:7a:af:93:55:39:45:5d:ec:16:c7:44.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: no secondarynamenode to stop
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
hduser@ubuntu:~$ 

5. View Hadoop Web-Interface
Go to http://localhost:50070 or http://ubuntu_ip_address:50070 to open the Hadoop web interface (the NameNode status page).
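
As a final smoke test, you can also run a couple of HDFS commands from the shell and, if you want to watch YARN, open the ResourceManager web UI (port 8088 is its default in Hadoop 2.x):
hduser@ubuntu:~$ hdfs dfs -mkdir -p /user/hduser
hduser@ubuntu:~$ hdfs dfs -ls /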

