Build a Hadoop Cluster on Windows Azure

Getting started with Hadoop can be tough. There’s a steep learning curve involved with getting even vaguely familiar with all the ins and outs.

If you want to get hands-on experience, the first step is finding a platform to play with. Fortunately there are a lot of good options available. The Hortonworks quick start VM, Amazon Elastic MapReduce and Microsoft's HDInsight are all easy ways to hit the ground running.

If those preconfigured solutions seem a little bit too magical and/or if you’re a do-it-yourself-er then building your own cluster can be a fun (and educational) exercise. And if you’re one of those unlucky few who don’t have a spare rack of servers lying around, you’re still in luck! All you need to do is turn to the cloud.

This post will tackle building your own Hadoop cluster in the cloud. Specifically, I'll be using Windows Azure and Hortonworks HDP 2.1.

0 – Sign up for Windows Azure

If you haven’t already, sign up for Windows Azure. The free trial gives you $220 worth of cloud services for 28 days. You’re limited to a maximum of 20 CPU cores during the trial, but that should be enough for our purposes.

1 – Name The Components

The first step is to write down some basic information about your cluster. When you start adding the pieces in Azure, it’s going to ask you to name things, and I find it easier if you write down all the names first.

  • The domain name you plan to use. I chose albertlockett.ca
  • A naming scheme for the nodes in your cluster. I like to name my Namenodes and Datanodes differently and number them, so I called them agl-datanode<#> and agl-namenode<#>
  • A hostname for your DNS Server. I called mine agl-dns1
  • Name of the network you plan to use. I called mine agl-network

You should also choose where you want everything to be located. Azure has a number of data centres all over the world, so choose one close to you. I’m using East US because I live in Halifax and that one is closest.

With that information, here is a picture of what we’re going to build: 3 datanodes, 1 Namenode and 1 DNS Server.




2 – Set up the Network

I like to keep my Namenodes, Datanodes and DNS on separate subnets, so we’re going to set up a network with 3 subnets:

  • Subnet 1 – 10.124.1.0/24 for the DNS
  • Subnet 2 – 10.124.2.0/24 for the Datanodes
  • Subnet 3 – 10.124.3.0/24 for the Namenodes
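One detail worth knowing up front: Azure reserves the first few addresses in every subnet (x.x.x.0 through x.x.x.3), so the first VM you place in a 10.124.x.0/24 subnet gets 10.124.x.4, which is why all the node IPs later in this post start at .4. Here's a quick sketch of the resulting plan; the `first_host` helper is hypothetical, just for illustration:

```shell
# Print the first assignable host address in each planned /24 subnet.
# Azure reserves x.x.x.0 through x.x.x.3 in every subnet, so the first
# VM placed in a subnet gets x.x.x.4 (hence agl-dns1 = 10.124.1.4 later on).
first_host() {
  echo "10.124.$1.4"
}

for s in 1 2 3; do
  echo "Subnet $s: first usable host $(first_host "$s")"
done
```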

Go to your Azure management portal and choose:

+NEW -> NETWORK SERVICES -> VIRTUAL NETWORK -> CUSTOM CREATE

Configure the network like this:

Virtual Network Details

  • Name – The name of your network. I used agl-network
  • Location – The location you plan to use. I used East US

DNS Servers and VPN Connectivity

  • I’m going to add 2 DNS Servers: an internal DNS and a public one.
  • In the first line add your internal DNS information:
    Name – agl-dns1  IP – 10.124.1.4
  • In the next line add a public DNS. I used Google’s public DNS:
    Name – google-public-dns-a.google.com  IP – 8.8.8.8

Virtual Network Address Spaces

  • Add 3 subnets with the following information:
    Name – Subnet 1  Starting Address – 10.124.1.0  CIDR – /24
    Name – Subnet 2  Starting Address – 10.124.2.0  CIDR – /24
    Name – Subnet 3  Starting Address – 10.124.3.0  CIDR – /24



3 – Build the DNS

We’re going to use Ubuntu Server 12.04 with Bind9 for the DNS. The first step is to create the virtual server.

Create DNS Virtual Server

In the Azure management portal choose

+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY

Choose an Image

  • Pick Ubuntu from the list on the left
  • Pick Ubuntu Server 12.04 LTS from the list of Ubuntu Images

Virtual machine Configuration

  • Version Release Date – Just choose the newest
  • Virtual Machine Name – Choose the name of your DNS Server. I chose agl-dns1
  • Tier – Basic
  • Size – A1 (1 core, 1.75 GB Memory)
  • New User Name – Choose your favourite username. I chose albertlockett
  • Authentication – Select Provide a Password and then type in your favourite password (twice)

Virtual machine Configuration (2)

  • Cloud Service – Create a new cloud service
  • Cloud Service DNS Name – Choose the name of your DNS Server. I chose agl-dns1
  • Region/Affinity Group/Virtual Network – Choose the Virtual Network you created. I chose agl-network
  • Virtual Network Subnets – Subnet 1 (10.124.1.0 /24)
  • Storage Account – Use an automatically generated storage account
  • Availability Set – (None)
  • Endpoints – Leave the default (SSH/TCP/22/22)

Virtual machine Configuration (3)

  • Install the VM Agent
  • Don’t install Chef, unless you want it

Configure the DNS Server

We’re going to use Bind9 as the DNS Server for our internal domain; external requests will be forwarded to a public DNS.

Go ahead and log into the new DNS server to get started.

ssh albertlockett@agl-dns1.cloudapp.net

The first step is to do some network configuration. Set up resolv.conf to point DNS requests to this machine:

sudo vim /etc/resolv.conf
nameserver 127.0.0.1
search albertlockett.ca

Next, set up your hosts file so it will recognize the hostname and fully qualified domain name of this box as localhost:

sudo vim /etc/hosts
127.0.0.1 localhost agl-dns1 agl-dns1.albertlockett.ca

Configure the nameserver and search domain on the main Ethernet interface:

sudo vim /etc/network/interfaces.d/eth0.cfg
auto eth0
iface eth0 inet dhcp
        dns-nameservers 127.0.0.1
        dns-search albertlockett.ca
        dns-domain albertlockett.ca

Install Bind9

sudo apt-get install bind9

Next we need to configure Bind9. First, set up forwarding for external DNS lookups: find the forwarders section in the Bind9 configuration options file

sudo vim /etc/bind/named.conf.options
    forwarders {
        8.8.8.8;
    };

Now we need to configure the zones. We need to add 2 zones, one for forward and one for reverse lookups on our internal network.

sudo vim /etc/bind/named.conf.local
zone "albertlockett.ca" IN {
    type master;
    file "/etc/bind/zones/db.albertlockett.ca";
};

zone "124.10.in-addr.arpa" {
    type master;
    file "/etc/bind/zones/rev.124.10.in-addr.arpa";
};

Next we need to add a configuration file for each zone. I like to keep them in a directory within the Bind9 directory.

sudo mkdir /etc/bind/zones

Create the file for the forward zone. I like to list the hosts in alphabetical order, but any order works.

sudo vim /etc/bind/zones/db.albertlockett.ca
; BIND data file for local loopback interface
;
$TTL 604800
@ IN SOA localhost. root.localhost. (
 2 ; Serial
 604800 ; Refresh
 86400 ; Retry
 2419200 ; Expire
 604800 ) ; Negative Cache TTL
;
@ IN NS localhost.
@ IN A 127.0.0.1
@ IN AAAA ::1
agl-datanode0 IN A 10.124.2.4
agl-datanode1 IN A 10.124.2.5
agl-datanode2 IN A 10.124.2.6
agl-dns1 IN A 10.124.1.4
agl-namenode IN A 10.124.3.4
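Before moving on, it's cheap insurance to sanity-check that the node A records all point into the right private range. The snippet below is only a rough sketch with a hypothetical `check_zone_addrs` helper, run against sample records copied from the zone file above; for a real syntax check use `named-checkzone` from the bind9utils package.

```shell
# Flag any agl-* A record whose address is outside 10.124.0.0/16.
# Exits non-zero (and prints the offending line) if a bad record is found.
check_zone_addrs() {
  awk '$1 ~ /^agl-/ && $3 == "A" && $4 !~ /^10\.124\./ { bad = 1; print "bad record:", $0 }
       END { exit bad }' "$1"
}

zonefile=$(mktemp)
cat > "$zonefile" <<'EOF'
agl-datanode0 IN A 10.124.2.4
agl-datanode1 IN A 10.124.2.5
agl-datanode2 IN A 10.124.2.6
agl-namenode IN A 10.124.3.4
EOF
check_zone_addrs "$zonefile" && echo "zone addresses OK"
```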

Configure the reverse zone. Remember to write the host portion of each IP address backwards.

sudo vim /etc/bind/zones/rev.124.10.in-addr.arpa
; IP Address-to-Host DNS Pointers
@ IN SOA agl-dns1.albertlockett.ca. hostmaster.albertlockett.ca. (
 2008080901 ; serial
 8H ; refresh
 4H ; retry
 4W ; expire
 1D ; minimum
)
; define the authoritative name server
 IN NS agl-dns1.albertlockett.ca.
; our hosts, in numeric order
4.2 IN PTR agl-datanode0.albertlockett.ca.
5.2 IN PTR agl-datanode1.albertlockett.ca.
6.2 IN PTR agl-datanode2.albertlockett.ca.
4.1 IN PTR agl-dns1.albertlockett.ca.
4.3 IN PTR agl-namenode.albertlockett.ca.
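The octet reversal trips people up, so here's a hypothetical helper (not part of the BIND setup itself) that derives the full in-addr.arpa name for an address; it's handy for double-checking the PTR labels above:

```shell
# Reverse the octets of an IPv4 address to get its in-addr.arpa name.
# e.g. 10.124.2.4 -> 4.2.124.10.in-addr.arpa. In a zone named
# 124.10.in-addr.arpa, the record label is just the first two labels ("4.2").
ip_to_ptr() {
  echo "$1" | awk -F. '{ printf "%s.%s.%s.%s.in-addr.arpa\n", $4, $3, $2, $1 }'
}

ip_to_ptr 10.124.2.4
```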

The final step is to restart Bind9 and bounce the network interface

sudo service bind9 restart
nohup sh -c "ifdown eth0 && ifup eth0"

Now our DNS Server is configured. Next we’ll create an image to use as a template for our Datanodes and Namenode.


3 – Create Node Image

In this step we’re going to create a base image to copy for each node in the cluster. The machine we create this image from will also serve as our first datanode, so we’ll configure it as such. We’ll still have to do some configuration on each node individually, but creating this image will save us repeating a few steps. We’re going to use CentOS 6 as the base operating system for our cluster, which Azure calls OpenLogic 6.5.

First, create and configure the new image from the gallery:

+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY

Choose an Image

  • Pick CentOS-based from the list on the left
  • Pick OpenLogic 6.5 from the list of CentOS-based images

Virtual machine Configuration

  • Virtual Machine Name – Choose the name of your first datanode. I chose agl-datanode0
  • Tier – Basic
  • Size – A3 (4 cores, 7 GB Memory)
  • New User Name – Choose your favourite username. I chose albertlockett
  • Authentication – Select Provide a Password and then type in your favourite password (twice)

Virtual machine Configuration (2)

  • Cloud Service – Create a new cloud service
  • Cloud Service DNS Name – Choose the name of your first datanode. I chose agl-datanode0
  • Region/Affinity Group/ Virtual Network – Choose the Virtual Network you created. I chose agl-network
  • Virtual Network Subnets – Subnet 2 (10.124.2.0 /24)
  • Storage Account – Use an automatically generated storage account
  • Availability Set – (None)
  • Endpoints – Leave the default (SSH/TCP/22/22)

Virtual machine Configuration (3)

  • Install the VM Agent



Configure the Image

Now log into the new node and we’ll do some base configuration

ssh albertlockett@agl-datanode0.cloudapp.net

Disable SELinux.

sudo setenforce 0

And make sure it stays disabled after a reboot

sudo vim /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled

Now edit the Hosts file.

sudo vim /etc/hosts
10.124.1.4 agl-dns1 agl-dns1.albertlockett.ca
10.124.2.4 agl-datanode0 agl-datanode0.albertlockett.ca
10.124.2.5 agl-datanode1 agl-datanode1.albertlockett.ca
10.124.2.6 agl-datanode2 agl-datanode2.albertlockett.ca
10.124.3.4 agl-namenode agl-namenode.albertlockett.ca
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
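The same five cluster entries go into every node's hosts file, so you could generate them from a single list instead of typing them each time. A sketch, using the node names and IPs assumed in this post (the `gen_hosts` helper is mine, not a standard tool):

```shell
# Emit /etc/hosts lines (IP, short name, FQDN) for every cluster node.
gen_hosts() {
  while read -r ip host; do
    printf '%s %s %s.albertlockett.ca\n' "$ip" "$host" "$host"
  done <<'EOF'
10.124.1.4 agl-dns1
10.124.2.4 agl-datanode0
10.124.2.5 agl-datanode1
10.124.2.6 agl-datanode2
10.124.3.4 agl-namenode
EOF
}

gen_hosts
```

On a real node you would append the output to /etc/hosts (keeping the localhost lines) rather than print it.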

Finally, deprovision the machine and shut it down so we can capture the image

sudo waagent -deprovision
sudo shutdown -h now

Capture the Image

Log into the Windows Azure manager. Select the virtual machine and click the Capture button


And name the image whatever you like. I called it hadoop_img



4 – Create Namenode and Datanodes

In this section we’ll create the name node and the other two datanodes. We’re going to create them from the image we produced in the previous section and we’ll configure them in the next section. Repeat the following procedure three times.

Create and configure the new image from the gallery:

+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY

Choose an Image

  • Pick My Images from the list on the left
  • Pick the image you saved in the previous section from the list of images. I called mine hadoop_img

Virtual machine Configuration

  • Virtual Machine Name – Choose the name of your new machine. I called my datanodes agl-datanode1 & agl-datanode2 and I called my namenode agl-namenode
  • Tier – Basic
  • Size – I used A3 (4 cores, 7 GB Memory) for the datanodes and A4 (8 cores, 14 GB Memory) for the namenode.
    Note: if you’re on the Azure free trial, use an A3 machine for the namenode so you don’t exceed the 20-core limit.

Virtual machine Configuration (2)

  • Cloud Service – Create a new cloud service
  • Region/Affinity Group/ Virtual Network – Choose the Virtual Network you created. I chose agl-network
  • Virtual Network Subnets – For the datanodes use Subnet 2 (10.124.2.0 /24) and for the namenode use Subnet 3 (10.124.3.0 /24)
  • Availability Set – (None)
  • Endpoints – Leave the default (SSH/TCP/22/22)

Virtual machine Configuration (3)

  • Install the VM Agent



Add Storage

We’ll attach storage to each node (3 datanodes and the namenode). Select the node in the list of virtual machines and click ‘Attach’ then choose ‘Add empty disk’.

In the next window, add some storage. I added 100GB to each node.



Configure Each Node

Log into each node and do the following configurations.

First set the root password

sudo passwd root

Type in your login password, then set the new root password to whatever you like.

Next, set the hostname on each machine. I’ll use the name node as an example, but be sure to repeat this procedure for each datanode (replacing agl-namenode with agl-datanode<#>, or your own node names).

sudo hostname agl-namenode.albertlockett.ca
sudo vim /etc/sysconfig/network
HOSTNAME=agl-namenode.albertlockett.ca
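If you'd rather not repeat this on every node by hand, you can generate the per-node commands. This sketch (node names assumed from this post) only prints the commands rather than running anything over SSH, so you can review them first:

```shell
# Print a hostname command for each node; review the output, then run the
# commands yourself (or pipe them to sh once passwordless SSH is set up).
gen_hostname_cmds() {
  for h in agl-namenode agl-datanode0 agl-datanode1 agl-datanode2; do
    echo "ssh root@$h \"hostname $h.albertlockett.ca\""
  done
}

gen_hostname_cmds
```

Remember you would still need to edit /etc/sysconfig/network on each node for the name to survive a reboot.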

Disable the firewall as well

sudo service iptables stop
sudo chkconfig iptables off

Attach Storage Disks

Next we’ll set up a place to mount the attached storage disks and make sure they mount automatically when our nodes are booted. Perform this procedure on each node.

First log in as root if you haven’t already

su root

Then get a list of the disks. Look for one that is about 100GB.

fdisk -l
Disk /dev/sdb: 128.8 GB, 128849018880 bytes
255 heads, 63 sectors/track, 15665 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x22dab996

 Device Boot Start End Blocks Id System
/dev/sdb1 * 1 15665 125827072 83 Linux

Disk /dev/sda: 32.2 GB, 32212254720 bytes
255 heads, 63 sectors/track, 3916 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c23d3

 Device Boot Start End Blocks Id System
/dev/sda1 * 1 3789 30432256 83 Linux
/dev/sda2 3789 3917 1024000 82 Linux swap / Solaris

Disk /dev/sdc: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
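With several disks attached it's easy to grab the wrong device, so it helps to pull just the device and size lines out of the `fdisk -l` output. A sketch (the `list_disks` helper is hypothetical; the sample lines are copied from the listing above):

```shell
# Print "device - size" for each disk in saved fdisk -l output,
# making it easy to spot the ~100 GB data disk.
list_disks() {
  awk -F'[:,]' '/^Disk \/dev\// { print $1, "-", $2 }' "$1"
}

fdisk_out=$(mktemp)
cat > "$fdisk_out" <<'EOF'
Disk /dev/sdb: 128.8 GB, 128849018880 bytes
Disk /dev/sda: 32.2 GB, 32212254720 bytes
Disk /dev/sdc: 107.4 GB, 107374182400 bytes
EOF
list_disks "$fdisk_out"
```

On a live node you would feed it the real output: `fdisk -l 2>/dev/null > /tmp/disks && list_disks /tmp/disks`.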

I’m using /dev/sdc. Next we’ll partition the disk. At the various prompts you’ll enter: p, n, p, 1, <blank>, <blank>, p, w

[root@agl-datanode1 albertlockett]# fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x95a48226.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
 switch off the mode (command 'c') and change display units to
 sectors (command 'u').

Command (m for help): p

Disk /dev/sdc: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x95a48226

 Device Boot Start End Blocks Id System

Command (m for help): n
Command action
 e extended
 p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-13054, default 1): 
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-13054, default 13054): 
Using default value 13054

Command (m for help): p

Disk /dev/sdc: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x95a48226

 Device Boot Start End Blocks Id System
/dev/sdc1 1 13054 104856223+ 83 Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

Next we’ll format the new partition and mount it

mkfs.ext3 /dev/sdc1
mkdir /mnt/data
mount /dev/sdc1 /mnt/data

Finally we’ll edit the fstab file so the drive mounts automatically at boot

vi /etc/fstab
#
# /etc/fstab
# Created by anaconda on Wed Jan 15 04:45:47 2014
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=6d089360-3e14-401d-91d0-378f3fd09332 / ext4 defaults 1 1
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
/dev/sdc1 /mnt/data ext3 defaults 1 2
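Note that fstab wants the device in the first column and the mount point in the second; swapping them is an easy slip that can hang the next boot. Here's a rough heuristic check, purely a sketch (the `check_fstab` helper is mine): it flags any line whose device column looks like a mount path outside /dev.

```shell
# Flag fstab lines whose first (device) field is an absolute path outside /dev,
# which usually means the device and mount-point columns were swapped.
check_fstab() {
  awk '!/^#/ && NF >= 2 && $1 ~ /^\// && $1 !~ /^\/dev\// { print "suspicious:", $0; bad = 1 }
       END { exit bad }' "$1"
}

fstab_copy=$(mktemp)
cat > "$fstab_copy" <<'EOF'
/dev/sdc1 /mnt/data ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
EOF
check_fstab "$fstab_copy" && echo "fstab looks sane"
```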

5 – Set up Passwordless SSH

Next we’re going to set up passwordless SSH from the namenode to the datanodes.

Log into the namenode as root and generate the SSH keys

ssh-keygen

Leave all the prompts as default.

Copy the public key to each of the datanodes and the namenode

scp /root/.ssh/id_rsa.pub root@agl-datanode0:id_rsa.pub
scp /root/.ssh/id_rsa.pub root@agl-datanode1:id_rsa.pub
scp /root/.ssh/id_rsa.pub root@agl-datanode2:id_rsa.pub

Log into each datanode and run the following

mkdir /root/.ssh
chmod 700 /root/.ssh
cat /root/id_rsa.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
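The four commands above are essentially what `ssh-copy-id` does for you. As a sketch, they can also be wrapped in a small idempotent helper; the demo below runs it against a scratch directory with a fake key so you can try it without touching /root:

```shell
# Append a public key to a home directory's authorized_keys,
# creating .ssh with the permissions sshd requires.
install_key() {  # $1 = public key file, $2 = target home directory
  mkdir -p "$2/.ssh"
  chmod 700 "$2/.ssh"
  cat "$1" >> "$2/.ssh/authorized_keys"
  chmod 600 "$2/.ssh/authorized_keys"
}

# Demo against a scratch directory instead of a real /root.
demo_home=$(mktemp -d)
echo "ssh-rsa AAAA...fake-key root@agl-namenode" > "$demo_home/id_rsa.pub"
install_key "$demo_home/id_rsa.pub" "$demo_home"
ls -l "$demo_home/.ssh/authorized_keys"
```

sshd silently ignores authorized_keys files with loose permissions, which is why the chmod steps matter.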

Add the public key to the authorized keys file on the namenode too

cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys

6 – Install Ambari

SSH into the namenode as root, then download the Ambari repository file and add it to the Yum repo list

wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo
cp ambari.repo /etc/yum.repos.d

Next, install Ambari Server

yum install ambari-server

Now run the setup

ambari-server setup

Take the default installation options.

By default the installer will download the Oracle JDK 7.

Now start the Ambari server in preparation to run the cluster setup wizard

ambari-server start

7 – Run Ambari Wizard to Create Cluster

In the Windows Azure dashboard, we’ll need to open an endpoint on the namenode for port 8080 so we can access the setup wizard.

Select the namenode from the list of virtual machines, select endpoints from the menu in the dashboard and then in the bottom menu click ‘+ ADD’.

Select ‘Add a Standalone Endpoint’ in the first menu

Name the endpoint whatever you like; the protocol is TCP, and the public and private ports should both be 8080.



Now open your favorite web browser and navigate to your namenode at port 8080.

http://agl-namenode.cloudapp.net:8080

Log into Ambari using the default credentials (admin/admin)




Next, name your cluster whatever you like. I named mine aglhadoop.


Choose whatever version of the stack you like. I chose the newest, HDP 2.1. Click Next.

If you get an error, make sure the namenode can connect to the internet and can perform DNS lookups.



Next type the fully qualified domain name of each node in the text box. Include the namenode too.
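Ambari wants one fully qualified domain name per line in that box, so it's quickest to print the list and paste it in. A sketch, using the node names assumed throughout this post:

```shell
# Print the FQDN of every cluster node, one per line,
# ready to paste into Ambari's target-hosts text box.
fqdn_list() {
  for h in agl-namenode agl-datanode0 agl-datanode1 agl-datanode2; do
    echo "$h.albertlockett.ca"
  done
}

fqdn_list
```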

Paste the SSH private key into the second text box. The key is on the namenode in the file /root/.ssh/id_rsa. Copy and paste the whole contents of that file into the text box, or download the file to your laptop and point to it using the Choose File button.

Click Register and Confirm



Ambari will try to connect to each host and run some registration scripts.

This step can be a little flaky. If a node fails, check the logs by clicking Failed in the status column, then search Google for the error. Sometimes failures can be resolved by simply re-running this step a few times.

I usually ignore the warnings, but you can deal with them if you like.


Check the boxes for the services you want to install. I like to install them all; you can stop the ones you don’t want later.


Assign the masters in this next step. I like to assign all the masters to the namenode but install zookeeper servers on every node.

Click Next



Assign Clients and Slaves to each datanode and click Next


In this step you can customize the configuration of each service. Fill in the boxes that are highlighted red, and accept the defaults if you like.

The only default I like to change is setting the YARN remote log location to /tmp/logs, but this is entirely optional.

Click Next


Ambari will start installing all the services. This step can take a while so sit back and relax.

Sometimes this step will fail too, so try re-running it if it doesn’t work.

Click Next when it’s finished


Click through the final confirmation screens and you’ll see your Ambari dashboard



What’s next?

Congratulations! Hopefully you now have a working Hadoop cluster on Azure.

You can write some MapReduce jobs or Pig scripts, set up a Hive database, and process some streaming data with Storm.

Stay tuned for my next post where I’ll cover what it takes to write your own YARN application.
