Cloudera Hadoop on Amazon: a howto

Intro: I've published a couple of howtos on this in the past, and a lot has changed over at AWS and in Cloudera's hadoop distribution in the last year, so it's time for an update.


Why did I want to update my old setup? Simple: new and updated tools. A lot of people are using Hadoop 0.20.2 or higher, Hive and Pig have come a long way lately, and there was this new R-for-hadoop thing called Rhipe that I heard about at Hadoop World in December. It sounded like (and I can now say it is) one of the best Hadoop tools ever. Just as Hive gives you a kind of MySQL shell for Hadoop, Rhipe gives you an R shell for Hadoop. A thousand thanks to Saptarshi Guha for this awesome tool. The hadoop-lzo package, thanks to Kevin Weil and Todd Lipcon, is more solid and easier to use than ever. There's a new cluster launcher called Whirr. And Dumbo just works on this platform, so there's no more patching my own version like before, which was a bit of a pain, as I described in one of those previous howtos. I've now got all this working with the Cloudera hadoop 0.20.2 distribution (CDH3B3, to be specific), and it seems to be all kinds of awesome.


Step 1: Launch a hadoop cluster.

This used to work great with Tom White's python scripts. Now it has all changed, and the replacement for those scripts, Whirr, is still emerging as a useful product. It's much more powerful and flexible than the old scripts, but at least for now you need to obtain the not-yet-released version 0.4 from subversion to get the most out of it. In versions prior to that, you had to put your customized scripts in a publicly readable S3 location, and that doesn't work for me. Documentation is evolving because the project is changing quickly, but the responsiveness of the committers on the user list is superb. I recently reported an issue with overriding a certain type of hadoop configuration property: it took about 24 hours of back and forth to confirm it was a problem, then Andrei Savu had a patch up within a couple of hours, and it was discussed, reviewed by Tom White, and committed in very short order. This is how open-source software should always be!

With Whirr 0.4+, the basic procedure is to untar (for now, check out) the distribution, set a WHIRR_HOME environment variable to the root of it, copy the vanilla customization scripts from 'services/cdh/src/main/resources/functions' to a directory where you can version them or whatever you want, and symlink from $WHIRR_HOME/functions to that directory.  Then set up a configuration file.  There are 'recipes' for various types of clusters in the distribution.  I started with a plain one, and hacked on it a bit until I got this:


    whirr.cluster-name=snoopy
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,6 hadoop-datanode+hadoop-tasktracker
    whirr.hadoop-install-function=install_cdh_hadoop
    whirr.hadoop-configure-function=configure_cdh_hadoop
    whirr.provider=aws-ec2
    whirr.identity=...
    whirr.credential=...
    whirr.private-key-file=${sys:user.home}/.ssh/my_key_file_name
    whirr.public-key-file=${sys:user.home}/.ssh/my_key_file_name.pub
    whirr.cluster-user=myhadoopusername
    # Ubuntu 10.04 LTS Lucid, 64-bit (Alestic image):
    whirr.image-id=us-east-1/ami-da0cf8b3
    whirr.hardware-id=c1.xlarge
    # Ubuntu 10.04 LTS Lucid, 32-bit:
    # whirr.image-id=us-east-1/ami-7000f019
    # whirr.hardware-id=c1.medium
    # Amazon Linux, S3-backed, 64-bit:
    # whirr.image-id=us-east-1/ami-221fec4b
    # whirr.hardware-id=c1.xlarge
    # Amazon Linux, S3-backed, 32-bit:
    # whirr.image-id=us-east-1/ami-2a1fec43
    # whirr.hardware-id=c1.medium
    whirr.location-id=us-east-1d
    hadoop-hdfs.dfs.permissions=false
    hadoop-hdfs.dfs.replication=2
    hadoop-mapreduce.mapred.child.env=JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native
    hadoop-common.io.compression.codecs=...,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
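
With a recipe in hand, launching and tearing down are each one command. A minimal sketch, assuming the file above is saved as hadoop.properties and WHIRR_HOME points at your 0.4 checkout:

    # launch the cluster described in the recipe
    $WHIRR_HOME/bin/whirr launch-cluster --config hadoop.properties
    # ...and destroy it when you're finished
    $WHIRR_HOME/bin/whirr destroy-cluster --config hadoop.properties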

Notice the four flavors of linux I've tried: Amazon Linux 32-bit and 64-bit, and Ubuntu images from Alestic. The 64-bit Alestic one was the default in the whirr recipe. I usually roll a bit more Redhat/yum, so I wanted to try 'Amazon Linux', which gives you a stripped-down, yum-managed linux with python 2.6. This was a bit tricky. Amazon has two kinds of 'Amazon Linux': ebs-backed and S3-backed. The ebs-backed ones have a small root partition and no extra disk space at all; I guess they assume you will just spin up an ebs volume and mount that if you need more space. Fair enough. The S3-backed ones have a big extra partition. However, it's not mounted on /mnt, as it is on every other Amazon machine image in the world, but rather on /media/ephemeral0. Amazon Linux is eccentric that way. Or maybe, since it's Amazon, it's the rest of the world that's eccentric. Amazon documents the layout in their AMI docs. Suffice it to say there are a *lot* of scripts out there that assume /mnt will be big enough for a truckload of data. So in your cdh install script you'll need something like

    rm -rf /mnt
    ln -s /media/ephemeral0 /mnt

and you'll want that to happen before the script starts installing hadoop.
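
Since the same install script may run on more than one image type, I'd guard the remap so it only fires where the ephemeral disk actually lives at /media/ephemeral0 (a sketch of my own, not something from the cdh scripts):

    # remap /mnt only on images where the big partition is /media/ephemeral0
    if [ -d /media/ephemeral0 ] && [ ! -L /mnt ]; then
        rm -rf /mnt
        ln -s /media/ephemeral0 /mnt
    fi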

My script installs a few things. On Ubuntu it looks like this:

    apt-get install -y hadoop-hive
    apt-get install -y hadoop-pig
    apt-get install -y hadoop-hbase
    apt-get install -y s3cmd libxaw7-dev libcairo2-dev r-base liblzo2-dev lzop \
        python-setuptools git-arch subversion
    easy_install dumbo
    wget ...                          # URL for protobuf-2.3.0.tar.gz
    tar zxf protobuf-2.3.0.tar.gz
    cd protobuf-2.3.0
    ./configure
    make
    make install
    wget ...                          # URL for Rhipe_0.65.3.tar.gz
    R CMD INSTALL Rhipe_0.65.3.tar.gz

and, on Amazon Linux, like this:

    yum install -y hadoop-hive
    yum install -y hadoop-pig
    yum install -y hadoop-hbase
    yum -y install gcc python-devel
    easy_install dumbo
    cd /etc/yum.repos.d && wget ...   # URL for a yum repo file
    i386count=`uname -a | egrep "i386|i686" | wc -l`
    if [ $i386count = '1' ]; then
        rpm -Uvh ...                  # 32-bit rpm URL
    else
        rpm -Uvh ...                  # 64-bit rpm URL
    fi
    yum install -y s3cmd screen readline-devel texinfo gcc-c++ gcc-gfortran \
        cairo-devel lzo-devel lzop subversion protobuf-devel
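
Given how fussy Rhipe is about the protocol buffers version (see the notes below), a sanity check at the end of the install script is cheap insurance:

    # Rhipe currently wants protocol buffers 2.3 exactly; warn if protoc disagrees
    protoc --version | grep -q '2\.3\.' || echo "WARNING: protoc is not 2.3.x" >&2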

There are a few more things I put in the script, but they are very much in the personal preference category. In particular, I set some users, groups and permissions, so that the user I'm doing all this as can move files around to appropriate locations on the slaves. You can do that stuff after the cluster is up too, so it's not supremely important to get everything happening on startup.
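
For example, something in this vein (the user, group, and directory names here are hypothetical placeholders of mine, not anything the CDH packages create):

    # hypothetical: give my working user group-write access to the data area
    groupadd -f datamovers
    usermod -a -G datamovers myhadoopusername
    chgrp -R datamovers /mnt/data
    chmod -R g+w /mnt/data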

This takes care of some of the tools we want, and all of the prerequisites for the remaining ones.

Notes on R and Rhipe:

Rhipe's documentation says it needs protocol buffers version 2.3 or above, but at least for now you need exactly version 2.3. To get Rhipe to work, you'll also need to set LD_LIBRARY_PATH: put something in the install script that appends a line to /usr/lib/hadoop/conf/hadoop-env.sh. In my case that's this for Amazon Linux, where both R and Rhipe are compiled from source:

    echo "export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib/R/lib:/usr/local/lib/R/library/Rhipe/libs:\$LD_LIBRARY_PATH" >> /usr/lib/hadoop/conf/hadoop-env.sh

On Ubuntu, where R-base is a deb and only Rhipe is compiled from source, the path is /usr/local/lib:/usr/lib/R/lib:/usr/local/lib/R/site-library/Rhipe/libs.
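
A quick way to confirm the path took effect is to load the library once from the shell (rhinit() is Rhipe's initializer; if LD_LIBRARY_PATH is wrong you'll get an 'unable to load shared object' error instead):

    # should start up cleanly rather than failing to load Rhipe's shared objects
    R -q -e 'library(Rhipe); rhinit()'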

Step 2: Tweak configuration post-launch. There are a few things you should put in place to make your life easier. First, you need to be able to run hadoop's 'slaves.sh' helper to run commands on all the slaves: 'slaves.sh uptime' or 'slaves.sh sudo yum install -y puppet', for example. That means the ip addresses or host names of the slaves need to be in /usr/lib/hadoop/conf/slaves, one per line. I used 'whirr list-cluster', the 'cut' command, and some hackish unix tricks to make this happen. This should ideally be a whirr command, as it was in the old python scripts, but I'm so grateful for all the rest of the whirry goodness that I'm not really complaining about that.
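
Something along these lines does it (a sketch, not my exact incantation: which column of 'whirr list-cluster' holds the address varies, so eyeball the output and adjust the cut field first):

    # build the slaves file from whirr's view of the cluster;
    # the field number below is a guess -- check list-cluster's output format
    $WHIRR_HOME/bin/whirr list-cluster --config hadoop.properties \
        | cut -d' ' -f4 > /tmp/all-nodes
    # drop this (master) node's own address, keep the rest
    grep -v "$(hostname -i)" /tmp/all-nodes > /usr/lib/hadoop/conf/slaves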

What type of thing might you want to do, in this category? On the Ubuntu images, I have the install script download and install Rhipe, since R is already there. On Amazon Linux, though, R won't rpm-install, so I download the source, compile and install it after the cluster is up. The compilation takes a long time, and I don't want things to time out too badly on startup. For that, you definitely want to be able to do something like this (pseudo-code alert):

    slaves.sh 'wget ... && tar zxf R.tar.gz && cd R && ./configure --with-x=no --with-cairo=yes --enable-R-shlib && make && sudo make install'

and *then* install Rhipe. Also, if you want to change any Hadoop configuration properties, you'll edit a file such as /usr/lib/hadoop/conf/core-site.xml and push it out to the slaves:

    for i in `cat /usr/lib/hadoop/conf/slaves`; do
        scp /usr/lib/hadoop/conf/core-site.xml $i:/usr/lib/hadoop/conf
    done

Then restart all the data nodes, and then the name node. Those commands are:

    sudo /etc/init.d/hadoop-0.20-datanode restart
    sudo /etc/init.d/hadoop-0.20-tasktracker restart
    # pause to make sure the data nodes have had a chance to start.
    # Otherwise you may encounter
    # "/tmp/hadoop....jobtracker-info file could
    # only be replicated to 0 nodes instead of 1"
    # when you try to run your first map-reduce job
    sudo /etc/init.d/hadoop-0.20-namenode restart
    sudo /etc/init.d/hadoop-0.20-jobtracker restart
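
Since the first two restarts happen on the slaves and the last two on the master, slaves.sh collapses the whole dance nicely (same assumptions as above):

    # bounce the worker daemons everywhere...
    slaves.sh sudo /etc/init.d/hadoop-0.20-datanode restart
    slaves.sh sudo /etc/init.d/hadoop-0.20-tasktracker restart
    sleep 30   # let the datanodes report in before bouncing the master
    # ...then the master daemons locally
    sudo /etc/init.d/hadoop-0.20-namenode restart
    sudo /etc/init.d/hadoop-0.20-jobtracker restart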


Step 3: Run a hello-world test. Don't skip this step. The one I use takes a text file and uses 'cat' as the mapper and 'wc -l' as the reducer. I run it on a plain text file and on a properly indexed .lzo file, to make sure my compression is working and that the .lzo.index file isn't being counted as data.
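
Spelled out as streaming jobs, it's something like the following (a sketch: the streaming jar path is where CDH3 puts it, the input paths are made up, and DeprecatedLzoTextInputFormat comes from the hadoop-lzo jar):

    # hello world on plain text: cat as mapper, wc -l as reducer
    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
        -input /tmp/hello.txt -output /tmp/hello-out \
        -mapper /bin/cat -reducer 'wc -l'

    # the same job over an indexed .lzo file; the input format makes the file
    # splittable and keeps the index file from being read as data
    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
        -libjars /usr/lib/hadoop/lib/hadoop-lzo-0.4.9.jar \
        -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
        -input /tmp/hello.txt.lzo -output /tmp/hello-lzo-out \
        -mapper /bin/cat -reducer 'wc -l'

If the line counts from the two runs agree, compression and indexing are healthy.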

Step 4: Process some data. We get data from a lot of sources. If I had a hadoop cluster running constantly, I would probably do this very differently, but for me the whole point of using Amazon is to be able to fire up a big cluster, process some data, produce some results, store them in some handy place like S3, and shut the cluster down. How best to make things available to hadoop? I came up with the following expedient:

First, fire up an ec2 instance and a big ebs volume, mounting the latter on the former. Grab your raw files-up-to-yesterday and store them on the volume. Then write a script or three that pre-process the raw files for consumption by hadoop. There are a few things you may want to accomplish in this step. For example, you might take small files and cat them together, then lzop the resulting big file, so you have your data in the large chunks Hadoop prefers. Store these ready-for-hadoop files in another part of the volume. Then, set up a procedure that can grab everything between yesterday and today, or everything back three days, because your previous load might not have final results, or something like that. Finally, make a snapshot of this ebs volume.
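
The cat-and-compress step might look like this (a sketch; the directory layout and per-day naming are inventions of mine):

    # glue one day's small files into a single hadoop-friendly chunk, then compress
    day=2011-02-14                        # hypothetical naming scheme
    cat /vol/raw/$day/*.log > /vol/ready/$day.log
    lzop -U /vol/ready/$day.log           # -U deletes the uncompressed original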

When you want to process some data, fire up a hadoop cluster, create a volume from the snapshot, mount the volume on the name node, run the 'get-me-the-latest' procedure, and load the lzop files into hdfs. The compression cuts the file sizes way down, which saves a lot of time at this point. Then, supposing you are in a directory holding all the files you just loaded into hdfs, run a quick script along these lines to index them:

    for f in `ls`; do
        hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.9.jar \
            com.hadoop.compression.lzo.LzoIndexer myhdfsinputdir/$f
    done

Do your processing, store your results, and shut everything down.
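
For completeness, the 'load the lzop files into hdfs' step above is just a put (paths are the same hypothetical ones as in the indexing loop):

    hadoop fs -mkdir myhdfsinputdir
    hadoop fs -put /vol/ready/*.lzo myhdfsinputdir/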

One last thing: you might ask, why not Elastic MapReduce? This setup is a lot like EMR, but I think I have more freedom to install the tools I want and to automate jobs. Somebody please correct me if I'm wrong about that. I'm certainly in a better position to repurpose anything I develop for non-Amazon use, and I want the freedom to do that.

YMMV, but I'm getting a lot of cost-effective data processing, and useful information, out of this procedure these days.