<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dalton Clark &#187; hadoop</title>
	<atom:link href="http://www.daltonclark.com/blog/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.daltonclark.com/blog</link>
	<description>Ben Clark&#039;s technology blog</description>
	<lastBuildDate>Fri, 22 Jan 2010 14:59:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Me on Hadoop setup at StyleFeeder, part 2</title>
		<link>http://www.daltonclark.com/blog/2010/01/22/hadoop-setup-at-stylefeeder-part-2-patch-rpm/</link>
		<comments>http://www.daltonclark.com/blog/2010/01/22/hadoop-setup-at-stylefeeder-part-2-patch-rpm/#comments</comments>
		<pubDate>Fri, 22 Jan 2010 14:56:02 +0000</pubDate>
		<dc:creator>Ben Clark</dc:creator>
				<category><![CDATA[technology]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.daltonclark.com/blog/?p=47</guid>
		<description><![CDATA[This post on the StyleFeeder tech blog is a HOWTO for taking a Cloudera Hadoop distribution in the 0.20 series, patching it for yourself, and running a Hadoop cluster on EC2 based on it.]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.tech.stylefeeder.com/2010/01/22/hadoop-for-the-lone-analyst-part-2-patching-and-releasing-to-yourself/">This post</a> on the StyleFeeder tech blog is a HOWTO for taking a Cloudera Hadoop distribution in the 0.20 series, patching it for yourself, and running a Hadoop cluster on EC2 based on it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daltonclark.com/blog/2010/01/22/hadoop-setup-at-stylefeeder-part-2-patch-rpm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Me on Hadoop Setup, at StyleFeeder</title>
		<link>http://www.daltonclark.com/blog/2010/01/14/ben-clark-on-hadoop-setup-at-stylefeeder/</link>
		<comments>http://www.daltonclark.com/blog/2010/01/14/ben-clark-on-hadoop-setup-at-stylefeeder/#comments</comments>
		<pubDate>Thu, 14 Jan 2010 15:06:08 +0000</pubDate>
		<dc:creator>Ben Clark</dc:creator>
				<category><![CDATA[technology]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.daltonclark.com/blog/?p=37</guid>
		<description><![CDATA[My colleagues and clients at StyleFeeder are good enough to let me post on their tech blog from time to time. I&#8217;m exploring Hadoop on their behalf, as partially described here: http://blog.tech.stylefeeder.com/2010/01/14/hadoop-for-the-lone-analyst/. That&#8217;s basically a HOWTO for Hadoop 0.20 + Apache logs + MySQL on EC2, with tips on streaming, compression, Pig, Red Hat/CentOS and the [...]]]></description>
			<content:encoded><![CDATA[<p>My colleagues and clients at <a href="http://www.stylefeeder.com">StyleFeeder</a> are good enough to let me post on their tech blog from time to time.  I&#8217;m exploring Hadoop on their behalf, as partially described here: <a href="http://blog.tech.stylefeeder.com/2010/01/14/hadoop-for-the-lone-analyst/">http://blog.tech.stylefeeder.com/2010/01/14/hadoop-for-the-lone-analyst/</a>.  That&#8217;s basically a HOWTO for Hadoop 0.20 + Apache logs + MySQL on EC2, with tips on streaming, compression, Pig, Red Hat/CentOS and the <a href="http://www.cloudera.com">Cloudera</a> Python scripts for EC2.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daltonclark.com/blog/2010/01/14/ben-clark-on-hadoop-setup-at-stylefeeder/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop in the Enterprise</title>
		<link>http://www.daltonclark.com/blog/2009/10/08/hadoop-in-the-enterprise/</link>
		<comments>http://www.daltonclark.com/blog/2009/10/08/hadoop-in-the-enterprise/#comments</comments>
		<pubDate>Thu, 08 Oct 2009 12:06:57 +0000</pubDate>
		<dc:creator>Ben Clark</dc:creator>
				<category><![CDATA[technology]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.daltonclark.com/blog/?p=17</guid>
		<description><![CDATA[At Hadoop World NYC 2009, one of the most interesting presentations, from a business point of view, was by JP Morgan Chase. They couldn&#8217;t share too many details for obvious reasons, but they were talking about cost savings of one, two or three orders of magnitude compared to existing technology. Peter Krey said humorously that [...]]]></description>
			<content:encoded><![CDATA[<p>At Hadoop World NYC 2009, one of the most interesting presentations, from a business point of view, was by JP Morgan Chase.  They couldn&#8217;t share too many details for obvious reasons, but they were talking about cost savings of one, two or three orders of magnitude compared to existing technology.  Peter Krey said humorously that anyone can save 30-40%: if you demand at least an order of magnitude, it takes a lot of fluff projects off the table.  &#8216;Fluff&#8217; wasn&#8217;t the word he used, but you get the idea.  Heh.</p>
<p>To state the obvious, Hadoop is a disruptive technology.  One way this might play out is as a replacement for ETL and data warehousing setups in big companies.  Picture a pipeline of (1) DB2 tables, (2) VSAM and other structured files, (3) Oracle OLTP databases, (4) Informatica/Ab Initio/DataStage/whatever jobs filling up (5) Oracle data warehouses, and finally (6) SQL Server cubes connected to front-end applications in the hands of analysts.  There are a lot of variations on this idea out there, but let&#8217;s call it an example of a common pattern.  1, 2 and 3 are hard to dislodge because they&#8217;re actual operations, and 6 because it&#8217;s a high-level-user-facing app, but 4 and 5 are pretty ripe, in many cases, to be moved from the special-purpose clusters they tend to run on to a general-purpose Hadoop cluster on commodity hardware, probably with increased parallelization and massive cost savings.  There&#8217;s a lot of Oracle in that space, and Oracle now has a commanding position relative to the fate of Java.  So let&#8217;s work this angle, but not make it too obvious, or go too far, or maybe Java will start languishing like MySQL.  I jest.  Sort of.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daltonclark.com/blog/2009/10/08/hadoop-in-the-enterprise/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>eBay&#8217;s Mobius Query Language at Hadoop World</title>
		<link>http://www.daltonclark.com/blog/2009/10/08/ebays_mobius_query_language_at_hadoop_world/</link>
		<comments>http://www.daltonclark.com/blog/2009/10/08/ebays_mobius_query_language_at_hadoop_world/#comments</comments>
		<pubDate>Thu, 08 Oct 2009 11:39:06 +0000</pubDate>
		<dc:creator>Ben Clark</dc:creator>
				<category><![CDATA[technology]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.daltonclark.com/blog/?p=13</guid>
		<description><![CDATA[I went to Cloudera&#8217;s Hadoop World NYC 2009 on Friday: it was quite a show. One theme that played out through many presentations was abstraction layers on top of raw Map Reduce. The two biggest are Pig and Hive, which are Yahoo&#8217;s and Facebook&#8217;s solutions to the same basic problem, of how to write less [...]]]></description>
			<content:encoded><![CDATA[<p>I went to <a href="http://www.cloudera.com" title="Cloudera--services for Hadoop">Cloudera</a>&#8217;s Hadoop World NYC 2009 on Friday: it was quite a show.</p>
<p>One theme that played out through many presentations was abstraction layers on top of raw Map Reduce.  The two biggest are <a href="http://hadoop.apache.org/pig/">Pig</a> and <a href="http://hadoop.apache.org/hive/">Hive</a>, which are Yahoo&#8217;s and Facebook&#8217;s solutions to the same basic problem of how to write less code for repetitive Map Reduce tasks.   There&#8217;s a lot of good commentary out there on those.  Hive is more like a SQL shell, and if you want to extend it, I think you&#8217;re going to be writing, say, Python mappers/reducers and streaming them into/out-of your Hive setup.  With Pig, you&#8217;re operating more at the level of a SQL query optimizer, as they put it in the training, the documentation and the O&#8217;Reilly book, which collectively document Pig very well.  You have some iteration facilities, and you can extend it with Java.   Pig does more exactly what you tell it to do, and Hive is something you &#8216;hint&#8217; at. These are general-purpose tools.</p>
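<p>To make the streaming idea concrete, here is a minimal sketch of a Python mapper and reducer of the kind you might stream rows through.  It assumes rows arrive as tab-separated lines on stdin and leave the same way on stdout, and that the reducer sees its input sorted by key.  The function names and the choice to count rows by their first column are made up for illustration; they are not taken from Hive or from any of the talks.</p>

```python
import sys


def map_rows(lines):
    """Emit 'key TAB 1' for each tab-separated row, keyed on the first column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank rows and rows with an empty first column
        yield "%s\t1" % fields[0].lower()


def reduce_rows(lines):
    """Sum the counts per key; assumes the input is already sorted by key."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield "%s\t%d" % (current, total)


def run(args, stdin, stdout):
    """Hypothetical entry point: act as the reducer if args is ['reduce'], else as the mapper."""
    step = reduce_rows if args == ["reduce"] else map_rows
    for out in step(stdin):
        stdout.write(out + "\n")
```

<p>Saved as a script, this would end with a call to <code>run(sys.argv[1:], sys.stdin, sys.stdout)</code>, so the same file can serve as either step.  Keeping the logic in plain generator functions also means it can be tried on a list of strings without a cluster.</p>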
<p>In the more specialized area of web analytics, eBay has a very interesting internal tool, called Mobius Query Language, on which Neel Sundaresan gave a fascinating talk.  I&#8217;ll update with a link if Cloudera posts the presentation, but it helps you model visits with landmarks, duration, and some other concepts I didn&#8217;t take notes on.  It clearly helps them wrap their code around the maddeningly amorphous user visit: participating in an auction, bidding, winning, abandoning, etc.  The language seemed general-purpose enough for application to any user-behavior modeling.  The interface is a SQL-like query language that seems, like Hive, to generate Map Reduce jobs based on a nicely abstracted view of exactly the sorts of questions you want to ask your web analytics system.  For the moment, I&#8217;m doing what web analytics I&#8217;m doing by extending Pig, but I hereby declare the Movement to Get eBay to Open-Source the Mobius Query Language.  Who&#8217;s with me?</p>
<p>On the conference in general, there is some good commentary out there, from <a href="http://dev.hubspot.com/bid/27047/Hadoop-World-NYC-2009" title="Comments on Hadoop World NYC 2009 by Dan Milstein">Dan Milstein</a>, <a href="http://dev.hubspot.com/bid/27054/Hadoop-World-impressions" title="Comments on Hadoop World NYC 2009 by Steve Laniel">Steve Laniel</a>, <a href="http://www.hilarymason.com/blog/hadoop-world-nyc/" title="Comments on Hadoop World NYC 2009 by Hilary Mason">Hilary Mason</a>, and no doubt others.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daltonclark.com/blog/2009/10/08/ebays_mobius_query_language_at_hadoop_world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
