I went to Hadoop World in NYC last week. There was a lot of big talk about how this was where the alpha geeks had gathered. Self-serving, sure, but not false. There were a lot of excellent talks. I'll divide the highlights into two categories.
First, the practical items.
- I followed all of @kevinweil of Twitter's suggestions from last year, and they were very helpful, so I went back for more this year. He spoke about file formats. Sexy! But all my experience of the last 1.5 years confirms this is very important in Hadoop world. He's using protocol buffers + lzo compression, and has a project called Elephant Bird for protobuf-handling code generation and other things. As usual he puts it all up on github, and I'll be using it. It's interesting that they're not using Avro or Thrift, the other versions of the same binary-format idea as Google's protocol buffers (Avro is the Hadoop version, Thrift is from Facebook). Twitter does not otherwise fear to use the always-poorly-documented but sometimes excellent Facebook offerings, including Scribe for log collection. They "work closely with Facebook" to make that work. Hmmm.
- I will not be using Scribe, because the Flume project (an agent-based collector for log files) seems to have achieved critical mass. That's, at least, what Otis Gospodnetić was saying at his excellent talk about search analytics. His reputation, company, and awesome on-topic demo site Search Hadoop, made for a packed room. He's using Flume to collect everything, so we'll give that a try at some point. For the moment, I will no doubt continue to get good results on ec2-based clusters, by pushing files to S3 with cron or a daemon of some kind, and using SQS to manage consumption by Hadoop and other things. (Who needs an Enterprise messaging infrastructure any more, now that we have SQS! Heh.) If I start running an all-internal cluster, I would think Flume would help a lot.
- Jonathan Gray from Facebook, who is an HBase committer, spoke about HBase. A lot of good stuff using Hadoop is going live soon at FB. He was secretive about which parts of Facebook were involved, but not about the core technology. Hive + HBase is growing quickly, as a way not to have to stream everything you have, when you want to look at some little thing in a Hive query. I'll be doing more of that this year.
- Chris Gillette of Visible Measures took us all to school on how to manage your own Hadoop cluster on a budget, through hairy things like major version upgrades. All I can say is that with more control than you have on ec2, comes more responsibility.
- The MPP database vendors were there: Asterdata, Greenplum (now part of EMC), Vertica, Netezza. An emerging pattern is something like this:
- Install a Hadoop cluster.
- Realize that the Hive and Pig shells, although useful, don't quite give your analysts everything they need. They might want a full-featured SQL interface, perhaps with some SAS-like features. Maybe if you're very advanced with Hive+HBase+R you could keep up with their demands, but without a decent-sized team of crack people, this could easily lead to expertise bottlenecks, delay and disappointment.
- Install a Vertica/Asterdata/Greenplum/Netezza appliance/cluster on the side, into which your Hadoop results, less fully baked than they would have to be without such a thing, are being fed. This might work pretty well with Infobright as well. At least in the case of Greenplum, you can do an interesting variant of this idea by installing a Greenplum cluster, side by side with your Hadoop cluster, on the same nodes. That would give you a full-featured Postgres-style sql interface (or whatever the vendors are giving you) to sharded data in column stores, on which you could do joins without waiting very long for results. And that would free the analysts, perhaps to a significant degree. YMMV. Make sure you're getting your file formats and storage engines right, because more data size in these MPP DBs will cost you a lot in license fees.
Secondly, the more theoretical stuff.
- RHIPE is an R + Hadoop thing by Saptarshi Guha of Purdue University, that allows you to define Map-Reduce jobs as list mapping operations without leaving the R shell. Spooky. You can't exactly just install it from CRAN yet, but this could be awesome for the modeler crowd.
- My prize for the most fascinating talk goes to Daniel Abadi of Yale University, expanding on a blog post from last year. He humorously revisited the whole Stonebraker/DeWitt blog post that stoked an infamous Hadoop/DBMS controversy a while ago. Asking for a show of hands, he surveyed the room as to whether Hadoop and the MPP DBs were cooperative or competitive technologies. The sense of the room was, 'cooperative'. See the last of the practical suggestions above: there's a lot of momentum in the field around that cooperation. But Professor Abadi is a contrarian, and, in fact, a contrarian with benchmarks! He was asserting that Hadoop and the MPP DBs are actually competitive technologies, even if there are successful to highly varying degrees in different areas at the moment (and so are usefully being made to cooperate for now). In particular, Hadoop does not do well on performance of certain kinds of SQL-query-like MR jobs, and the MPP DBs do not do as well at scaling past a certain point. Enter Hadoop DB. The idea is to integrate Hadoop and a good column-store engine at a deeper level than what the MPP DB vendors have done thus far through superficial systems integration with Hadoop. The researchers (Professor Abadi's students, if I have this right) tried a couple of column stores, and got very good performance results with an Ingres-based one, while preserving the linear scalability of Hadoop. IMHO this is potentially very disruptive, and very threatening to the MPP DB vendors. We'll be keeping an eye on this effort.
Thanks to Cloudera for putting this on.