The new mad men

Me: Information Week says I'm the Don Draper of Wayfair. Wife: I don't think that's what they actually said.

Me: Maybe not, but I don't think you can avoid that conclusion.  Jeff Bertolucci wrote a piece saying marketing now lives and dies on good math, not good creative, and data scientists are the new Mad Men. Before Don Draper became a partner at that advertising firm in the TV show, he was the head of the creative department.  I took a couple of math classes in college, I can do linear regressions in R, and more importantly I manage a band of data scientists, so....

Wife: For a data scientist, you're not too logical.

Me: Fuzzy logic. It's the next big thing.

Wife: Don't get any ideas.

No marriages were harmed in the making of this blog post.

A year of Wayfair Engineering

I joined Wayfair as a full-time employee on August 1st, 2011, to run internal search and customer recommendations, after a brief stint as a consultant.  I have other responsibilities now as well, but over at the company tech blog, http://engineering.wayfair.com, I've written mostly about the search and recommendations systems.  We've done a lot over there in a year.  Let's recap a few highlights:

It's hard work, I suppose, but mostly it just feels exhilarating and fun.

Wayfair

From 2009 to mid 2011 I was running Dalton, Clark & Associates LLC full time, and consulting to some awesome clients, especially boston.com/Boston Globe, and StyleFeeder (now retired), which was acquired by Time, Inc., where I did some things for Stylefind (now retired). But in May of 2011, I did a small consultation for Wayfair (then CSN Stores, rebranded as Wayfair in September) and by August 1st I had joined full time to be the manager/architect of the search and recommendations group within Wayfair Engineering. My joke name for the group is the department of Applied Calculus and Linear Algebra / SOLR and Hadoop / NumPy/SciPy Hackery / Search and Recommendations / Finding Useful Stuff. Anyway, I've handed off the day-to-day operations of Dalton Clark & Associates LLC to the other owner, Abbe Dalton Clark. The views expressed in this blog are my own, and do not reflect the views of my employer.

Cloudera hadoop amazon howto

INTRO

I've published some howtos here and here in the past, and a lot has changed over at AWS and Cloudera hadoop in the last year, so it's time for an update.

WHY

Why did I want to update my old setup? Simple: new and updated tools. A lot of people are using Hadoop 0.20.2 or higher, Hive and Pig have come a long way lately, and there was this new R-for-hadoop thing called Rhipe that I heard about at Hadoop World in December. It sounded like (and I can now say is) one of the best Hadoop tools ever. Just as Hive gives you a kind of mysql shell for Hadoop, so Rhipe gives you an R shell for Hadoop. A thousand thanks to Saptarshi Guha for this awesome tool. The hadoop-lzo package, thanks to Kevin Weil and Todd Lipcon, is more solid and easier to use than ever. There's a new cluster launcher called Whirr. Dumbo just works on this platform, so no more patching my own version like before, which was a bit of a pain, as I described in one of those previous howtos.  I've now got all this working with the Cloudera hadoop 0.20.2 distribution, CDH3B3, to be specific, and it seems to be all kinds of awesome.

HOWTO

Step 1: Launch hadoop clusters.

This used to work great with Tom White's python scripts.  Now it has all changed, and the replacement for those scripts, Whirr, is still in the process of emerging as a useful product.  It's much more powerful and flexible than the old scripts, but at least for now, you need to obtain the not-yet-released version 0.4 from subversion, to get the most out of it. In versions prior to that, you had to put your customized scripts in a publicly readable S3 location, and that doesn't work for me.  Documentation is evolving because the project is changing quickly, but the responsiveness of the committers on the user list is superb.  I recently reported an issue with overriding a type of hadoop configuration property: it took about 24 hours of back and forth to confirm it was a problem, and then Andrei Savu had a patch up within a couple of hours, and it was discussed, reviewed by Tom White and committed in very short order.  This is how opensource software should always be!

With Whirr 0.4+, the basic procedure is to untar (for now, check out) the distribution, set a WHIRR_HOME environment variable to the root of it, copy the vanilla customization scripts from 'services/cdh/src/main/resources/functions' to a directory where you can version them or whatever you want, and symlink from $WHIRR_HOME/functions to that directory.  Then set up a configuration file.  There are 'recipes' for various types of clusters in the distribution.  I started with a plain one, and hacked on it a bit until I got this:

 

whirr.cluster-name=snoopy
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,6 hadoop-datanode+hadoop-tasktracker
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop
whirr.provider=aws-ec2
whirr.identity=...
whirr.credential=...
whirr.private-key-file=${sys:user.home}/.ssh/my_key_file_name
whirr.public-key-file=${sys:user.home}/.ssh/my_key_file_name.pub
whirr.cluster-user=myhadoopusername
# Ubuntu 10.04 LTS Lucid, 64-bit. See http://alestic.com/
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.hardware-id=c1.xlarge
# Ubuntu 10.04 LTS Lucid, 32-bit. See http://alestic.com/
# whirr.image-id=us-east-1/ami-7000f019
# whirr.hardware-id=c1.medium
# Amazon linux, S3-backed, 64-bit:
# whirr.image-id=us-east-1/ami-221fec4b
# whirr.hardware-id=c1.xlarge
# Amazon linux, S3-backed, 32-bit:
# whirr.image-id=us-east-1/ami-2a1fec43
# whirr.hardware-id=c1.medium
whirr.location-id=us-east-1d
hadoop-hdfs.dfs.permissions=false
hadoop-hdfs.dfs.replication=2
hadoop-mapreduce.mapred.compress.map.output=true
hadoop-mapreduce.mapred.child.env=JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native
hadoop-mapreduce.mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
hadoop-common.io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
hadoop-common.io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
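For concreteness, here's roughly what that setup-and-launch sequence looks like as shell commands. Treat it as a sketch: the checkout location and the properties file name (hadoop-snoopy.properties) are just examples, and the exact bin/whirr invocation may vary a bit between Whirr versions.

# Assumes Whirr trunk (the 0.4 snapshot) is already checked out and built
# per its own instructions; paths and file names below are examples.
export WHIRR_HOME=$HOME/src/whirr-trunk

# Keep the CDH customization scripts somewhere you can version, and symlink back.
cp -r $WHIRR_HOME/services/cdh/src/main/resources/functions $HOME/whirr-functions
ln -sfn $HOME/whirr-functions $WHIRR_HOME/functions

# Launch the cluster described by the properties file above, check on it,
# and eventually tear it down.
$WHIRR_HOME/bin/whirr launch-cluster --config hadoop-snoopy.properties
$WHIRR_HOME/bin/whirr list-cluster --config hadoop-snoopy.properties
# $WHIRR_HOME/bin/whirr destroy-cluster --config hadoop-snoopy.properties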

Notice the four flavors of linux that I have tried out: Amazon linux 32-bit and 64-bit, and Ubuntu images from Alestic. The 64-bit Alestic one was the default in the whirr recipe. I usually roll a bit more Redhat/yum, so I wanted to try 'Amazon linux', which gives you a stripped down, yum-managed linux with python 2.6. This was a bit tricky. Amazon has two kinds of 'Amazon linux': ebs-backed and S3-backed. The ebs-backed ones have a small root partition and no extra disk space at all. I guess they assume you will just spin up an ebs volume and mount that if you need more space--fair enough. The S3-backed ones have a big extra partition. However, it's not mounted on /mnt, as it is on every other Amazon machine image in the world, but rather on /media/ephemeral0. Amazon linux is eccentric that way. Or maybe, since it's Amazon, it's the rest of the world that's eccentric. They explain it all here. Suffice it to say there are a *lot* of scripts out there that assume that /mnt will be big enough for a truckload of data. So in your cdh install script you'll need something like

rm -rf /mnt
ln -s /media/ephemeral0 /mnt

and you'll want that to happen before the script starts installing hadoop.
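If you want to be a little defensive about it, a guard like this (my own habit, not anything CDH requires) keeps the remap from firing on images that don't have the /media/ephemeral0 mount:

# Only remap /mnt on images where the big ephemeral partition
# is mounted at /media/ephemeral0 (e.g. S3-backed Amazon linux).
if [ -d /media/ephemeral0 ]; then
  rm -rf /mnt
  ln -s /media/ephemeral0 /mnt
fi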

My script installs a few things. On ubuntu it looks like this:

apt-get install -y hadoop-hive
apt-get install -y hadoop-pig
apt-get install -y hadoop-hbase
apt-get install -y s3cmd libxaw7-dev libcairo2-dev r-base liblzo2-dev lzop python-setuptools git-arch subversion
easy_install dumbo
wget http://protobuf.googlecode.com/files/protobuf-2.3.0.tar.gz
tar zxf protobuf-2.3.0.tar.gz
cd protobuf-2.3.0
./configure
make
make install
wget http://www.stat.purdue.edu/~sguha/rhipe/dn/Rhipe_0.65.3.tar.gz
R CMD INSTALL Rhipe_0.65.3.tar.gz

and, on Amazon linux, like this:

yum install -y hadoop-hive
yum install -y hadoop-pig
yum install -y hadoop-hbase
yum -y install gcc python-devel
easy_install dumbo
cd /etc/yum.repos.d && wget http://s3tools.org/repo/RHEL_6/s3tools.repo
i386count=`uname -a | egrep "i386|i686" | wc -l`
if [ $i386count = '1' ]; then
  rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
else
  rpm -Uvh http://download.fedora.redhat.com/pub/epel/6/x86_64/epel-release-6-5.noarch.rpm
fi
yum install -y s3cmd screen readline-devel texinfo gcc-c++ gcc-gfortran cairo-devel lzo-devel lzop subversion protobuf-devel

There are a few more things I put in the script, but they are very much in the personal preference category. In particular, I set some users, groups and permissions, so that the user I'm doing all this as can move files around to appropriate locations on the slaves. You can do that stuff after the cluster is up too, so it's not supremely important to get everything happening on startup.

This takes care of some of the tools we want, and all of the prerequisites for the remaining ones.

Notes on R and Rhipe:

Rhipe's documentation says it needs protocol buffers version 2.3 and above, but at least for now you need exactly version 2.3. In order to get Rhipe to work, you'll need to set LD_LIBRARY_PATH. Put something in the install script that appends a line to /usr/lib/hadoop/conf/hadoop-env.sh. In my case that's this for Amazon Linux, where both R and Rhipe are compiled from source:

echo "export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib/R/lib:/usr/local/lib/R/library/Rhipe/libs:\$LD_LIBRARY_PATH" >> /usr/lib/hadoop/conf/hadoop-env.sh

On Ubuntu, where R-base is a deb, and only Rhipe is compiled from source, the path is /usr/local/lib:/usr/lib/R/lib:/usr/local/lib/R/site-library/Rhipe/libs.
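Once that line is in, a quick sanity check I like (just a habit of mine, and it assumes Rscript is on the path) is to source the file and try loading Rhipe:

# Pick up the new LD_LIBRARY_PATH and make sure Rhipe's shared objects load.
source /usr/lib/hadoop/conf/hadoop-env.sh
echo $LD_LIBRARY_PATH
Rscript -e 'library(Rhipe)'   # should load without 'unable to load shared object' errors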

Step 2: Tweak configuration post-launch. There are a few things you should put in place to make your life easy. First, you need to be able to run 'slaves.sh' to run commands on all the slaves: 'slaves.sh uptime' or 'slaves.sh sudo yum install -y puppet', for example. This means you need the ip addresses or host names of the slaves to be in /usr/lib/hadoop/conf/slaves, one per line. I used 'whirr list-cluster', the 'cut' command, and some hackish unix tricks to make this happen. This should ideally be a whirr command, as it was in the old python scripts, but I'm so grateful for all the rest of the whirry goodness that I'm not really complaining about that.
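For what it's worth, the hack looks roughly like this. The column that holds the address (and whether roles are printed at all) depends on your Whirr version, so the field number here is an assumption; eyeball the list-cluster output first.

# Dump the cluster description once and look at it.
$WHIRR_HOME/bin/whirr list-cluster --config hadoop-snoopy.properties > /tmp/cluster.txt
cat /tmp/cluster.txt

# Keep the datanode/tasktracker lines (or just delete the master's line by hand),
# cut out the address column -- field 3 is a guess, adjust to your output --
# and write the slaves file.
grep hadoop-datanode /tmp/cluster.txt | cut -f3 > /usr/lib/hadoop/conf/slaves

# Sanity check: run a trivial command on every slave.
slaves.sh uptime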

What type of thing might you want to do in this category? On the Ubuntu images, I have the install script download Rhipe and install it, since R is already there. On Amazon linux, though, R won't rpm-install, so I download it from source, compile and install it after the cluster is up. The compilation takes a long time, and I don't want things to time out too badly on startup. For that, you definitely want to be able to do something like (pseudo-code alert):

slaves.sh 'wget R && tar zxf R.tar.gz && cd R && ./configure --with-x=no --with-cairo=yes --enable-R-shlib && make && sudo make install'

or something like that, and *then* install Rhipe. Also, if you want to change any Hadoop configuration properties, you'll change a file such as /usr/lib/hadoop/conf/core-site.xml, and then you'll want to do something like

for i in `cat /usr/lib/hadoop/conf/slaves`; do scp /usr/lib/hadoop/conf/core-site.xml $i:/usr/lib/hadoop/conf; done

and then restart all the data nodes, and then the name node. Those commands are:

slaves.sh sudo /etc/init.d/hadoop-0.20-datanode restart
slaves.sh sudo /etc/init.d/hadoop-0.20-tasktracker restart
# Pause to make sure the data nodes have had a chance to start.
# Otherwise you may encounter
# "java.io.IOException: /tmp/hadoop....jobtracker-info file could
# only be replicated to 0 nodes instead of 1"
# when you try to run your first map-reduce job.
sudo /etc/init.d/hadoop-0.20-namenode restart
sudo /etc/init.d/hadoop-0.20-jobtracker restart

 

Step 3: Run a hello-world test. Don't skip this step. The one I use takes a text file and uses 'cat' as the mapper and 'wc -l' as the reducer. I run this on a plain text file and a .lzo file, properly indexed, to make sure my compression is working and it's not counting the .lzo_index file as data.
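Concretely, mine is a streaming job along these lines. The streaming jar path is where CDH3 puts it on my machines, and the file and directory names are made up, so adjust accordingly.

# Plain text version: 'cat' as mapper, 'wc -l' as reducer; the single output
# value should equal the line count of sample.txt.
hadoop fs -mkdir hello
hadoop fs -put sample.txt hello/
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
  -input hello/sample.txt -output hello-out-plain \
  -mapper cat -reducer 'wc -l'

# LZO version: load an lzop-compressed copy, index it, and run the same job
# with the splittable input format from the hadoop-lzo package.  Same count
# means compression is wired up and the index file isn't being read as data.
hadoop fs -mkdir hello-lzo
hadoop fs -put sample.txt.lzo hello-lzo/
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.9.jar com.hadoop.compression.lzo.LzoIndexer hello-lzo/sample.txt.lzo
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
  -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
  -input hello-lzo -output hello-out-lzo \
  -mapper cat -reducer 'wc -l'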

Step 4: Process some data. We get data from a lot of sources. If I had a hadoop cluster running constantly, I would probably do this very differently, but for me the whole point of using Amazon is to be able to fire up a big cluster, process some data, produce some results, store them in some handy place like S3, and shut down the cluster. How best to make things available to hadoop? I came up with the following expedient:

First, fire up an ec2 instance and a big ebs volume, mounting the latter on the former. Grab your raw files-up-to-yesterday and store them on the volume. Then write a script or three that pre-process the raw files for consumption by hadoop. There are a few things you may want to accomplish in this step. For example, you might take small files and cat them together, then lzop the resulting big file, so you have your data in the large chunks Hadoop prefers. Store these ready-for-hadoop files in another part of the volume. Then, set up a procedure that can grab everything between yesterday and today, or everything back three days, because your previous load might not have final results, or something like that. Finally, make a snapshot of this ebs volume.
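To make that pre-processing step a little more concrete, here's the shape of the thing as a sketch; the directory layout and file names are invented, not anything from my actual scripts:

# /vol/raw/<date>/ holds the many small raw files for a day;
# /vol/ready/ holds the concatenated, lzop-compressed files hadoop will consume.
for day in /vol/raw/*; do
  d=`basename $day`
  out=/vol/ready/events-$d.log
  if [ ! -f $out.lzo ]; then
    cat $day/*.log > $out      # many small files -> one big file
    lzop $out && rm $out       # compress it; hadoop likes big .lzo chunks
  fi
done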

When you want to process some data, fire up a hadoop cluster, create a volume from the snapshot, mount the volume on the name node, run the 'get-me-the-latest' procedure, and load the lzop files into hdfs. The compression reduces the file sizes a lot, which saves a lot of time at this point. Run a quick script along the lines of "for f in `ls`; do hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.9.jar com.hadoop.compression.lzo.LzoIndexer myhdfsinputdir/$f; done" (supposing you are in a directory where you have all the files you just loaded into hdfs). Do your processing, store your results, and shut everything down.
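Filling in the mount-and-load part with a quick sketch (the device name and directories are assumptions; I attach the volume from the console or the API tools first):

# After attaching the volume created from the snapshot to the name node:
sudo mkdir -p /data
sudo mount /dev/sdf /data        # or /dev/xvdf, depending on the image

# ... run the 'get-me-the-latest' procedure against /data here ...

# Load the ready-to-go lzop files into hdfs, then index them as above.
hadoop fs -mkdir myhdfsinputdir
hadoop fs -put /data/ready/*.lzo myhdfsinputdir/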

One last thing: you might ask, why not Elastic Map Reduce? This setup is a lot like EMR, but I think I have more freedom to install the tools I want and automate jobs. Somebody please correct me if I'm wrong about that. I'm certainly in a better position to repurpose anything I develop for non-Amazon use, and I want the freedom to do that.

YMMV, but I'm getting a lot of cost-effective data processing, and useful information, out of this procedure these days.

Hadoop World NYC 2010

I went to Hadoop World in NYC last week. There was a lot of big talk about how this was where the alpha geeks had gathered. Self-serving, sure, but not false. There were a lot of excellent talks. I'll divide the highlights into two categories.

First, the practical items.

  • I followed all of @kevinweil of Twitter's suggestions from last year, and they were very helpful, so I went back for more this year. He spoke about file formats. Sexy! But all my experience of the last 1.5 years confirms this is very important in Hadoop world. He's using protocol buffers + lzo compression, and has a project called Elephant Bird for protobuf-handling code generation and other things. As usual he puts it all up on github, and I'll be using it. It's interesting that they're not using Avro or Thrift, the other versions of the same binary-format idea as Google's protocol buffers (Avro is the Hadoop version, Thrift is from Facebook). Twitter does not otherwise fear to use the always-poorly-documented but sometimes excellent Facebook offerings, including Scribe for log collection. They "work closely with Facebook" to make that work. Hmmm.
  • I will not be using Scribe, because the Flume project (an agent-based collector for log files) seems to have achieved critical mass. That's, at least, what Otis Gospodnetić was saying at his excellent talk about search analytics. His reputation, company, and awesome on-topic demo site Search Hadoop, made for a packed room. He's using Flume to collect everything, so we'll give that a try at some point. For the moment, I will no doubt continue to get good results on ec2-based clusters, by pushing files to S3 with cron or a daemon of some kind, and using SQS to manage consumption by Hadoop and other things. (Who needs an Enterprise messaging infrastructure any more, now that we have SQS! Heh.) If I start running an all-internal cluster, I would think Flume would help a lot.
  • Jonathan Gray from Facebook, who is an HBase committer, spoke about HBase. A lot of good stuff using Hadoop is going live soon at FB. He was secretive about which parts of Facebook were involved, but not about the core technology. Hive + HBase is growing quickly, as a way not to have to stream everything you have, when you want to look at some little thing in a Hive query. I'll be doing more of that this year.
  • Chris Gillette of Visible Measures took us all to school on how to manage your own Hadoop cluster on a budget, through hairy things like major version upgrades. All I can say is that with more control than you have on ec2, comes more responsibility.
  • The MPP database vendors were there: Asterdata, Greenplum (now part of EMC), Vertica, Netezza. An emerging pattern is something like this:
    1. Install a Hadoop cluster.
    2. Realize that the Hive and Pig shells, although useful, don't quite give your analysts everything they need. They might want a full-featured SQL interface, perhaps with some SAS-like features. Maybe if you're very advanced with Hive+HBase+R you could keep up with their demands, but without a decent-sized team of crack people, this could easily lead to expertise bottlenecks, delay and disappointment.
    3. Install a Vertica/Asterdata/Greenplum/Netezza appliance/cluster on the side, into which your Hadoop results, less fully baked than they would have to be without such a thing, are being fed. This might work pretty well with Infobright as well. At least in the case of Greenplum, you can do an interesting variant of this idea by installing a Greenplum cluster, side by side with your Hadoop cluster, on the same nodes. That would give you a full-featured Postgres-style sql interface (or whatever the vendors are giving you) to sharded data in column stores, on which you could do joins without waiting very long for results. And that would free the analysts, perhaps to a significant degree. YMMV. Make sure you're getting your file formats and storage engines right, because more data size in these MPP DBs will cost you a lot in license fees.

Secondly, the more theoretical stuff.

  • RHIPE is an R + Hadoop thing by Saptarshi Guha of Purdue University, that allows you to define Map-Reduce jobs as list mapping operations without leaving the R shell. Spooky. You can't exactly just install it from CRAN yet, but this could be awesome for the modeler crowd.
  • My prize for the most fascinating talk goes to Daniel Abadi of Yale University, expanding on a blog post from last year. He humorously revisited the whole Stonebraker/DeWitt blog post that stoked an infamous Hadoop/DBMS controversy a while ago. Asking for a show of hands, he surveyed the room as to whether Hadoop and the MPP DBs were cooperative or competitive technologies. The sense of the room was, 'cooperative'. See the last of the practical suggestions above: there's a lot of momentum in the field around that cooperation. But Professor Abadi is a contrarian, and, in fact, a contrarian with benchmarks! He was asserting that Hadoop and the MPP DBs are actually competitive technologies, even if they are successful to highly varying degrees in different areas at the moment (and so are usefully being made to cooperate for now). In particular, Hadoop does not do well on performance of certain kinds of SQL-query-like MR jobs, and the MPP DBs do not do as well at scaling past a certain point. Enter Hadoop DB. The idea is to integrate Hadoop and a good column-store engine at a deeper level than what the MPP DB vendors have done thus far through superficial systems integration with Hadoop. The researchers (Professor Abadi's students, if I have this right) tried a couple of column stores, and got very good performance results with an Ingres-based one, while preserving the linear scalability of Hadoop. IMHO this is potentially very disruptive, and very threatening to the MPP DB vendors. We'll be keeping an eye on this effort.

Thanks to Cloudera for putting this on.

Me on Hadoop Setup, at StyleFeeder

My colleagues and clients at StyleFeeder are good enough to let me post on their tech blog from time to time. I'm exploring Hadoop on their behalf, as partially described here: http://seventhfloor.whirlycott.com/2010/01/14/hadoop-for-the-lone-analyst/. That's basically a HOWTO for Hadoop 0.20 + Apache logs + MySQL on EC2, with tips on streaming, compression, Pig, Redhat/CentOS and the Cloudera Python scripts for EC2.

Windows 7 Upgrade: worth it?

After a couple of virtual machine upgrades that worked pretty well, I decided to upgrade my main workhorse laptop (a 15" Lenovo W500) from Windows Vista 64-bit Ultimate to Windows 7 64-bit Ultimate. You can only upgrade to an identical subtype, without 3rd-party tools or a lot of hacking. The result? More disk space than when I started, and suddenly all my devices work. Skype is no longer helpless at finding the built-in video camera. It had been complaining that the camera was already in use. When I plug in headphones, the speakers go silent without my having to find that really obscure control panel check box. The ATI Radeon HD 3650, which Lenovo is calling the ATI Mobility FireGL V5700, seems to have discovered the DisplayPort connection to my Dell 2408WFP, and the external monitor picture is really crisp now. You have to change the display settings twice: first have it duplicate the display on both monitors, then extend it to the other monitor. Now it can find the external monitor without having to reboot, and without getting a link failure at some point (usually when you're working on something really interesting). I upgraded to VMWare 7 while I was at it, and instead of uselessly (and in a way comically) splitting the guest display across two screens, it's now letting me run Centos or Ubuntu across two monitors. I think it's the better drivers/OS combo, not the VMware, that's working now. Come to think of it, it's pretty outrageous what all was broken before. Better late than never.

Hadoop in the Enterprise

At Hadoop World NYC 2009, one of the most interesting presentations, from a business point of view, was by JP Morgan Chase. They couldn't share too many details for obvious reasons, but they were talking about cost savings of one, two or three orders of magnitude compared to existing technology. Peter Krey said humorously that anyone can save 30-40%: if you demand at least an order of magnitude, it takes a lot of fluff projects off the table. 'Fluff' wasn't the word he used, but you get the idea. Heh. To state the obvious, Hadoop is a disruptive technology. One way this might play out is as a replacement for ETL and data warehousing setups in big companies. Picture a pipeline of (1) DB2 tables, (2) VSAM and other structured files, (3) Oracle OLTP databases, (4) Informatica/Ab Initio/Data Stage/whatever jobs filling up (5) Oracle data warehouses, and finally (6) SQL Server cubes connected to front-end applications in the hands of analysts. There are a lot of variations on this idea out there, but let's call it an example of a common pattern. 1, 2, 3 and 6 are hard to dislodge, because they're actual operations and high-level-user-facing apps, respectively, but 4 and 5 are pretty ripe, in many cases, to be moved from the special-purpose clusters they tend to run on to a general-purpose Hadoop cluster, on commodity hardware, with probable increased parallelization and massive cost savings. There's a lot of Oracle in that space, and Oracle now has a commanding position relative to the fate of java. So let's work this angle, but not make it too obvious, or go too far, or maybe java will start languishing like MySQL. I jest. Sort of.

eBay's Mobius Query Language at Hadoop World

I went to Cloudera's Hadoop World NYC 2009 on Friday: it was quite a show. One theme that played out through many presentations was abstraction layers on top of raw Map Reduce. The two biggest are Pig and Hive, which are Yahoo's and Facebook's solutions to the same basic problem of how to write less code for repetitive Map Reduce tasks. There's a lot of good commentary out there on those. Hive is more like a sql shell, and if you want to extend it, I think you're going to be writing, say, Python mappers/reducers and streaming them into/out of your Hive setup. With Pig, you're operating more at the level of a SQL query optimizer, as they put it in the training, the documentation and the O'Reilly book (which collectively document Pig very well). You have some iteration facilities, and you can extend it with java. Pig does more exactly what you tell it to do, and Hive is something you 'hint' at. These are general-purpose tools.
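To put a little flesh on the Hive half of that: extending Hive with a streaming script tends to look like the sketch below. The table, columns and script are all invented here, purely to show the shape of it.

# weblogs is an imaginary table; tokenize.py is a tiny stdin->stdout Python
# script you'd write yourself.  Run from the shell against the Hive CLI.
hive -e "
  ADD FILE tokenize.py;
  SELECT TRANSFORM (url, referrer)
         USING 'python tokenize.py'
         AS (url, tokens)
  FROM weblogs;
"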

In the more specialized area of web analytics, eBay has a very interesting internal tool, called Mobius Query Language, on which Neel Sundaresan gave a fascinating talk. I'll update with a link if Cloudera posts the presentation, but it helps you model visits with landmarks, duration, and some other concepts I didn't take notes on. It clearly helps them wrap their code around the maddeningly amorphous user visit: participating in an auction, bidding, winning, abandoning, etc. The language seemed general-purpose enough for application to any user-behavior modeling. The interface is a SQL-like query language that seems, like Hive, to generate Map Reduce jobs based on nicely abstracted view of exactly the sorts of questions you want to ask your web analytics system. For the moment, I'm doing what web analytics I'm doing by extending Pig, but I hereby declare the Movement to Get eBay to Opensource the Mobius Query Language. Who's with me?

On the conference in general, there is some good commentary out there, from Dan Milstein, Steve Laniel, Hilary Mason, and no doubt others.