Getting started with Hadoop can be tough. There’s a steep learning curve involved with getting even vaguely familiar with all the ins and outs.
If you want to get hands-on experience, the first step is going to be finding a platform to play with. Fortunately there are a lot of good options available. The Hortonworks quick start VM, Amazon Elastic MapReduce and Microsoft's HDInsight are all easy ways to hit the ground running.
If those preconfigured solutions seem a little bit too magical and/or if you’re a do-it-yourself-er then building your own cluster can be a fun (and educational) exercise. And if you’re one of those unlucky few who don’t have a spare rack of servers lying around, you’re still in luck! All you need to do is turn to the cloud.
This post will tackle building your own Hadoop cluster in the cloud. Specifically, I'll be using Windows Azure and Hortonworks HDP 2.1.
0 – Sign up for Windows Azure
If you haven't already, sign up for Windows Azure. You can get a free trial that includes $220 worth of cloud services for 28 days. You can only use a maximum of 20 CPU cores during the free trial, but that should be enough for our purposes.
1 – Name The Components
The first step is to write down some basic information about your cluster. When you start adding the bits in Azure, it's going to ask you to name things. I find it easier if you write down all the names first.
- The domain name you plan to use. I chose albertlockett.ca
- A naming scheme for the nodes in your cluster. I like to name my Namenodes and Datanodes differently and number them, so I called them agl-datanode<#> and agl-namenode<#>
- A hostname for your DNS Server. I called mine agl-dns1
- Name of the network you plan to use. I called mine agl-network
You should also choose where you want everything to be located. Azure has a number of data centres all over the world, so choose one close to you. I’m using East US because I live in Halifax and that one is closest.
With that information, here is a picture of what we're going to build: 3 datanodes, 1 namenode and 1 DNS server.
2 – Setup the Network
I like to keep my Namenodes, Datanodes and DNS on separate subnets, so we're going to set up a network with 3 subnets:
- Subnet 1 10.124.1.0 /24 for the DNS
- Subnet 2 10.124.2.0 /24 for the Datanodes
- Subnet 3 10.124.3.0 /24 for the Namenodes
Go to your Azure management portal and choose new
+NEW -> NETWORK SERVICES -> VIRTUAL NETWORK -> CUSTOM CREATE
Configure the network like this:
Virtual Network Details
- Name – The name of your network. I used agl-network
- Location – The location you plan to use. I used East-US
DNS Servers and VPN Connectivity
- I’m going to add 2 DNS Servers, an internal DNS and a public one.
- In the first line add your internal DNS information:
Name – agl-dns1 IP – 10.124.1.4
- In the next line add a public DNS. I used Google's public DNS:
Name – google-public-dns-a.google.com IP – 8.8.8.8
Virtual Network Address Spaces
- Add 3 subnets with the following information:
- Name – Subnet 1 Starting Address – 10.124.1.0 CIDR – /24
- Name – Subnet 2 Starting Address – 10.124.2.0 CIDR – /24
- Name – Subnet 3 Starting Address – 10.124.3.0 CIDR – /24
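One Azure quirk worth knowing before you lay out subnets: Azure reserves the first four addresses in every subnet (the network address, the default gateway, and two platform addresses), so the first VM you place in each /24 will be handed the .4 address by DHCP — which is why agl-dns1 ends up at 10.124.1.4. A quick sketch of the resulting plan:

```shell
# Azure reserves the first four addresses in each subnet (network address,
# default gateway, and two platform addresses), so DHCP gives the first VM
# placed in each /24 the .4 address.
plan=""
for prefix in 10.124.1 10.124.2 10.124.3; do
  plan="$plan$prefix.0/24 -> first assignable VM address: $prefix.4
"
done
printf '%s' "$plan"
```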
3 – Build the DNS
We're going to use Ubuntu Server 12.04 with Bind9 as the DNS server. The first step is to create the virtual server.
Create DNS Virtual Server
In the Azure management portal choose
+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY
Choose an Image
- Pick Ubuntu from the list on the left
- Pick Ubuntu Server 12.04 LTS from the list of Ubuntu Images
Virtual machine Configuration
- Version Release Date – Just choose the newest
- Virtual Machine Name – Choose the name of your DNS Server. I chose agl-dns1
- Tier – Basic
- Size – A1 (1 core, 1.75 GB Memory)
- New User Name – Choose your favourite username. I chose albertlockett
- Authentication – Select Provide a Password and then type in your favourite password (twice)
Virtual machine Configuration (2)
- Cloud Service – Create a new cloud service
- Cloud Service DNS Name – Choose the name of your DNS Server. I chose agl-dns1
- Region/Affinity Group/ Virtual Network – Choose the Virtual Network you created. I chose agl-network
- Virtual Network Subnets – Subnet 1 (10.124.1.0 /24)
- Storage Account – Use an automatically generated storage account
- Availability Set – (None)
- Endpoints – Leave the default (SSH/TCP/22/22)
Virtual machine Configuration (3)
- Install the VM Agent
- Don’t install Chef, unless you want it
Configure the DNS Server
We're going to use Bind9 as the DNS server for our internal domain; external requests will be forwarded to a public DNS.
Go ahead and log into the new DNS server to get started.
ssh albertlockett@agl-dns1.cloudapp.net
The first step is to do some network configuration. Set up /etc/resolv.conf to point DNS requests to this machine.
sudo vim /etc/resolv.conf
nameserver 127.0.0.1
search albertlockett.ca
Next, set up your hosts file so it will recognize the hostname and fully qualified domain name of this box as localhost.
sudo vim /etc/hosts
127.0.0.1 localhost agl-dns1 agl-dns1.albertlockett.ca
Configure the DNS server and search domain on the main Ethernet interface.
sudo vim /etc/network/interfaces.d/eth0.cfg
auto eth0
iface eth0 inet dhcp
    dns-nameservers 127.0.0.1
    dns-search albertlockett.ca
    dns-domain albertlockett.ca
Install Bind9
sudo apt-get install bind9
Next we need to configure Bind9. First, set up the forwarding for external DNS lookups. Find the forwarders section in the Bind9 configuration options file.
sudo vim /etc/bind/named.conf.options
forwarders { 8.8.8.8; };
Now we need to configure the zones. We need to add 2 zones, one for forward and one for reverse lookups on our internal network.
sudo vim /etc/bind/named.conf.local
zone "albertlockett.ca" IN {
    type master;
    file "/etc/bind/zones/db.albertlockett.ca";
};

zone "124.10.in-addr.arpa" {
    type master;
    file "/etc/bind/zones/rev.124.10.in-addr.arpa";
};
Next we need to add a configuration file for each zone. I like to keep them in a directory within the Bind9 directory.
sudo mkdir /etc/bind/zones
Create the file for the forward zone. I like to list the hosts in alphabetical order, but any order will work.
sudo vim /etc/bind/zones/db.albertlockett.ca
;
; BIND data file for local loopback interface
;
$TTL    604800
@       IN      SOA     localhost. root.localhost. (
                              2         ; Serial
                         604800         ; Refresh
                          86400         ; Retry
                        2419200         ; Expire
                         604800 )       ; Negative Cache TTL
;
@       IN      NS      localhost.
@       IN      A       127.0.0.1
@       IN      AAAA    ::1
agl-datanode0   IN      A       10.124.2.4
agl-datanode1   IN      A       10.124.2.5
agl-datanode2   IN      A       10.124.2.6
agl-dns1        IN      A       10.124.1.4
agl-namenode    IN      A       10.124.3.4
Configure the reverse zone. Remember that the last two octets of each IP address are written in reverse order.
sudo vim /etc/bind/zones/rev.124.10.in-addr.arpa
;
; IP Address-to-Host DNS Pointers
;
$TTL    604800
@       IN      SOA     agl-dns1.albertlockett.ca. hostmaster.albertlockett.ca. (
                        2008080901      ; serial
                        8H              ; refresh
                        4H              ; retry
                        4W              ; expire
                        1D )            ; minimum
; define the authoritative name server
        IN      NS      agl-dns1.albertlockett.ca.
; our hosts, in numeric order
4.1     IN      PTR     agl-dns1.albertlockett.ca.
4.2     IN      PTR     agl-datanode0.albertlockett.ca.
5.2     IN      PTR     agl-datanode1.albertlockett.ca.
6.2     IN      PTR     agl-datanode2.albertlockett.ca.
4.3     IN      PTR     agl-namenode.albertlockett.ca.
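Keeping the forward and reverse zones in sync by hand is error-prone. As a sketch (the host list, domain and fragment file names here are illustrative — adjust before use), you can generate both sets of records from a single list and paste each fragment under the corresponding zone's SOA header:

```shell
# Generate matching A and PTR records from one host list, so the forward
# and reverse zones can't drift apart. Writes two fragment files in the
# current directory; paste each under the matching zone's SOA header.
DOMAIN="albertlockett.ca"

printf '%s\n' \
  "agl-datanode0 10.124.2.4" \
  "agl-datanode1 10.124.2.5" \
  "agl-datanode2 10.124.2.6" \
  "agl-dns1 10.124.1.4" \
  "agl-namenode 10.124.3.4" |
awk -v d="$DOMAIN" '{
  printf "%-16s IN A %s\n", $1, $2 > "forward.fragment"
  split($2, o, ".")
  # reverse zone label for 124.10.in-addr.arpa: last two octets reversed
  printf "%s.%s IN PTR %s.%s.\n", o[4], o[3], $1, d > "reverse.fragment"
}'

cat forward.fragment reverse.fragment
```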
The final step is to restart Bind9 and bounce the network interface.
sudo service bind9 restart
nohup sh -c "ifdown eth0 && ifup eth0"
Now our DNS Server is configured. Next we can create an image to use as a template for our Datanodes and Namenode
4 – Create Node Image
In this step we're going to create a base image that we can copy for each node in the cluster. The machine we create this image from will also serve as our first datanode, so we'll configure it as such. We'll still have to do some configuration for each node individually, but creating this image will save us having to repeat a few steps. We're going to use CentOS 6 as the base operating system for our cluster, but Azure calls it OpenLogic 6.5.
First create/configure the new image from the gallery:
+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY
Choose an Image
- Pick CentOS-based from the list on the left
- Pick OpenLogic 6.5 from the list of CentOS-based images
Virtual machine Configuration
- Virtual Machine Name – Choose the name of your first datanode. I chose agl-datanode0
- Tier – Basic
- Size – A3 (4 cores, 7 GB Memory)
- New User Name – Choose your favourite username. I chose albertlockett
- Authentication – Select Provide a Password and then type in your favourite password (twice)
Virtual machine Configuration (2)
- Cloud Service – Create a new cloud service
- Cloud Service DNS Name – Choose the name of your first datanode. I chose agl-datanode0
- Region/Affinity Group/ Virtual Network – Choose the Virtual Network you created. I chose agl-network
- Virtual Network Subnets – Subnet 2 (10.124.2.0 /24)
- Storage Account – Use an automatically generated storage account
- Availability Set – (None)
- Endpoints – Leave the default (SSH/TCP/22/22)
Virtual machine Configuration (3)
- Install the VM Agent
Configure the Image
Now log into the new node and we’ll do some base configurations
ssh albertlockett@agl-datanode0.cloudapp.net
Disable SELinux.
sudo setenforce 0
And make sure it’s permanently disabled
sudo vim /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
Now edit the Hosts file.
sudo vim /etc/hosts
10.124.1.4    agl-dns1       agl-dns1.albertlockett.ca
10.124.2.4    agl-datanode0  agl-datanode0.albertlockett.ca
10.124.2.5    agl-datanode1  agl-datanode1.albertlockett.ca
10.124.2.6    agl-datanode2  agl-datanode2.albertlockett.ca
10.124.3.4    agl-namenode   agl-namenode.albertlockett.ca
127.0.0.1     localhost localhost.localdomain localhost4 localhost4.localdomain4
::1           localhost localhost.localdomain localhost6 localhost6.localdomain6
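A mismatch between the short-name and FQDN columns is easy to introduce when editing hosts files by hand. As a quick sanity check, this sketch (the here-doc holds a copy of the cluster's entries) verifies that every FQDN starts with its short name:

```shell
# Verify each hosts-file line's FQDN (column 3) starts with its short name
# (column 2) -- an easy mismatch to introduce when editing by hand.
bad=$(awk 'index($3, $2 ".") != 1 {print $2}' <<'EOF'
10.124.1.4 agl-dns1 agl-dns1.albertlockett.ca
10.124.2.4 agl-datanode0 agl-datanode0.albertlockett.ca
10.124.2.5 agl-datanode1 agl-datanode1.albertlockett.ca
10.124.2.6 agl-datanode2 agl-datanode2.albertlockett.ca
10.124.3.4 agl-namenode agl-namenode.albertlockett.ca
EOF
)
if [ -z "$bad" ]; then echo "hosts entries consistent"; else echo "mismatched: $bad"; fi
```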
Finally, capture the image so we can reuse it, then shut down.
sudo waagent -deprovision
sudo shutdown -h now
Capture the Image
Log into the Windows Azure manager. Select the virtual machine and click the Capture button
And name the image whatever you like. I called it hadoop_img
5 – Create Namenode and Datanodes
In this section we'll create the namenode and the other two datanodes. We're going to create them from the image we produced in the previous section and we'll configure them in the next section. Repeat the following procedure three times.
Create/configure the new image from the gallery:
+NEW -> COMPUTE -> VIRTUAL MACHINE -> FROM GALLERY
Choose an Image
- Pick My Images from the list on the left
- Pick the image you saved in the previous section from the list of images. I called mine hadoop_img
Virtual machine Configuration
- Virtual Machine Name – Choose the name of your new machine. I called my datanodes agl-datanode1 & agl-datanode2 and I called my namenode agl-namenode
- Tier – Basic
- Size – I used A3 (4 cores, 7 GB Memory) for the datanodes and A4 (8 cores, 14 GB Memory) for the namenode.
Note: if you're on the Azure free trial, you should use an A3 machine for the namenode too, so you don't exceed the trial's 20-core limit.
Virtual machine Configuration (2)
- Cloud Service – Create a new cloud service
- Region/Affinity Group/ Virtual Network – Choose the Virtual Network you created. I chose agl-network
- Virtual Network Subnets – For the datanodes use Subnet 2 (10.124.2.0 /24) and for the namenode use Subnet 3 (10.124.3.0 /24)
- Availability Set – (None)
- Endpoints – Leave the default (SSH/TCP/22/22)
Virtual machine Configuration (3)
- Install the VM Agent
Add Storage
We’ll attach storage to each node (3 datanodes and the namenode). Select the node in the list of virtual machines and click ‘Attach’ then choose ‘Add empty disk’.
In the next window, add some storage. I added 100GB to each node.
Configure Each Node
Log into each node and do the following configurations.
First set the root password
sudo passwd root
Then type in your login password, and then set the new root password to whatever you like.
Next, set the hostname on each machine. I’ll use the name node as an example, but be sure to repeat this procedure for each datanode (replacing agl-namenode with agl-datanode<#>, or your own node names).
sudo hostname agl-namenode.albertlockett.ca
sudo vim /etc/sysconfig/network
HOSTNAME=agl-namenode.albertlockett.ca
Disable the firewall as well
sudo service iptables stop
sudo chkconfig iptables off
Attach Storage Disks
Next we’ll set up a place to mount the attached storage disks and make sure they mount automatically when our nodes are booted. Perform this procedure on each node.
First log in as root if you haven’t already
su root
Then get a list of the disks. Look for one that is about 100GB.
fdisk -l
Disk /dev/sdb: 128.8 GB, 128849018880 bytes
255 heads, 63 sectors/track, 15665 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x22dab996

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1       15665   125827072   83  Linux

Disk /dev/sda: 32.2 GB, 32212254720 bytes
255 heads, 63 sectors/track, 3916 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c23d3

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        3789    30432256   83  Linux
/dev/sda2            3789        3917     1024000   82  Linux swap / Solaris

Disk /dev/sdc: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
I'm using /dev/sdc. Next we'll partition the disk. At the various prompts you'll enter: p, n, p, 1, <blank>, <blank>, p, w
[root@agl-datanode1 albertlockett]# fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x95a48226.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c') and change display units to
         sectors (command 'u').

Command (m for help): p

Disk /dev/sdc: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x95a48226

   Device Boot      Start         End      Blocks   Id  System

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-13054, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-13054, default 13054):
Using default value 13054

Command (m for help): p

Disk /dev/sdc: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x95a48226

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1       13054   104856223+  83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
Next we'll format the new partition and mount it.
mkfs.ext3 /dev/sdc1
mkdir /mnt/data
mount /dev/sdc1 /mnt/data
Finally we’ll edit the fstab file to mount this drive automatically
vi /etc/fstab
#
# /etc/fstab
# Created by anaconda on Wed Jan 15 04:45:47 2014
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=6d089360-3e14-401d-91d0-378f3fd09332 /       ext4    defaults        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
/dev/sdc1               /mnt/data               ext3    defaults        1 2
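Note the field order in fstab: device first, mount point second. Swapping them is an easy mistake, and the disk will silently fail to mount at boot. A small sanity check you can run after editing (the helper function here is hypothetical, just for illustration):

```shell
# check_fstab_line: hypothetical helper that verifies an fstab entry has the
# device (or UUID=/LABEL=) in field 1 and an absolute mount point in field 2,
# to catch the classic swapped-field mistake.
check_fstab_line() {
  set -- $1
  case "$1" in
    /dev/*|UUID=*|LABEL=*) ;;
    *) return 1 ;;
  esac
  case "$2" in
    /*) return 0 ;;
    *) return 1 ;;
  esac
}

# The entry we just added:
check_fstab_line "/dev/sdc1 /mnt/data ext3 defaults 1 2" && echo "entry looks sane"
```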
6 – Setup Passwordless SSH
Next we’re going to set up passwordless SSH from the namenode to the datanodes.
On the namenode, generate the SSH keys. Log into the namenode as root and run this
ssh-keygen
Leave all the prompts as default.
Copy the public key to each of the datanodes and the namenode
scp /root/.ssh/id_rsa.pub root@agl-datanode0:id_rsa.pub
scp /root/.ssh/id_rsa.pub root@agl-datanode1:id_rsa.pub
scp /root/.ssh/id_rsa.pub root@agl-datanode2:id_rsa.pub
Log into each datanode and run the following
mkdir /root/.ssh
chmod 700 /root/.ssh
cat /root/id_rsa.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
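The chmod steps matter: sshd is picky about these modes, and if .ssh or authorized_keys is group- or world-accessible it can silently ignore the key. A self-contained sketch (scratch directory and placeholder key, not a real node) showing the layout the commands above produce:

```shell
# Recreate the layout in a scratch directory to see the expected modes.
# The key string is a placeholder, not a real public key.
scratch=$(mktemp -d)
mkdir -p "$scratch/.ssh"
chmod 700 "$scratch/.ssh"
printf 'ssh-rsa PLACEHOLDER root@agl-namenode\n' >> "$scratch/.ssh/authorized_keys"
chmod 600 "$scratch/.ssh/authorized_keys"

# 700 on .ssh and 600 on authorized_keys -- anything looser and sshd may
# refuse to use the key.
dirmode=$(stat -c '%a' "$scratch/.ssh")
filemode=$(stat -c '%a' "$scratch/.ssh/authorized_keys")
echo "dir=$dirmode file=$filemode"
```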
Add the public key to the authorized keys file on the namenode too
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
7 – Install Ambari
SSH into the namenode as root, download the Ambari repository and add it to the yum repo list.
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.1/ambari.repo
cp ambari.repo /etc/yum.repos.d
Next, install Ambari Server
yum install ambari-server
Now run the setup
ambari-server setup
Take the default installation options.
By default the installer will download the Oracle JDK 7.
Now start the Ambari server in preparation to run the cluster setup wizard
ambari-server start
8 – Run Ambari Wizard to Create Cluster
In the Windows Azure dashboard, we’ll need to open an endpoint on the namenode for port 8080 so we can access the setup wizard.
Select the namenode from the list of virtual machines, select endpoints from the menu in the dashboard and then in the bottom menu click ‘+ ADD’.
Select ‘Add a Standalone Endpoint’ in the first menu
Name the endpoint whatever you like, TCP is the protocol and public/private ports should both be 8080
Now open your favorite web browser and navigate to your namenode at port 8080.
http://agl-namenode.cloudapp.net:8080
Log into Ambari using the default credentials (admin/admin)
Next Name your cluster whatever you like. I named mine aglhadoop
Choose whatever version of the stack you like. I chose the newest, HDP 2.1. Click Next.
If you get an error, make sure the namenode can connect to the internet and can perform DNS lookups.
Next type the fully qualified domain name of each node in the text box. Include the namenode too.
Put the SSH private key into the second text box. The key is on the namenode in the file /root/.ssh/id_rsa. Copy and paste the whole contents of that file into the second text box, or you can download the file to your laptop and point to it using the Choose File button.
Click Register and Confirm
Ambari will try to connect to each host and run some registration scripts.
This step can be a little flaky. If a node fails, check the logs by clicking Failed in the status column, then search Google for the error. Sometimes failures can be resolved by simply re-running this step a few times.
I usually ignore the warnings, but you can deal with them if you like.
Check the boxes for the services you want to install. I like to install them all. You can stop the ones you don't want later.
Assign the masters in this next step. I like to assign all the masters to the namenode but install zookeeper servers on every node.
Click Next
Assign Clients and Slaves to each datanode and click Next
In this step you can customize the configuration of each service. Fill in the boxes that are highlighted red, and accept the defaults if you like.
The only default I like to change is to set the YARN remote log location to /tmp/logs, but this is entirely optional.
Ambari will start installing all the services. This step can take a while so sit back and relax.
Sometimes this step will fail too, so try re-running it if it doesn’t work.
Click through the final confirmation screens and you’ll see your Ambari dashboard
What’s next?
Congratulations, hopefully you now have a working Hadoop cluster on Azure.
You can write some MapReduce jobs or Pig Scripts, set up a Hive Database and process some streaming data with Storm.
Stay tuned for my next post where I’ll cover what it takes to write your own YARN application.