Analyze This! Big Data | Apache Hadoop | Dell | Cloudera | Crowbar

This article about Target using buying patterns to reveal that a teen was pregnant before she told her parents puts big data analysis into everyday terms better than the following 555 words do (of course, I recommend that you read both).

Recently, I had the pleasure of being one of the team presenting Dell’s BIG DATA story at an internal conference. From the questions and buzz, it’s clear that big data is big news this year. My team is at the center of that storm because we are responsible for the Dell | Cloudera Apache™ Hadoop™ solution. The solution is significant because we’ve integrated the many pieces necessary to build and sustain a Hadoop cluster: Dell servers, the Cloudera Hadoop distribution, the Crowbar framework, and the services to make it useful.

Big Data Analytics spins data straws into information gold.

Before I jump into technical details, it’s worth stating the big data analytics value proposition. The problem is that we are awash in a tsunami of data: we’ve grown beyond the neat rows and columns of application databases, and data today ranges from website click logs and emails to call records and cash register receipts to social media tweets and posts. While much of the data is unstructured noise, there is also incredibly valuable information buried in it.  (video of my Hadoop “escalator pitch”)

Value is not just hidden inside the bulk data; it lies in correlations between sets of the data.

The big data analytics value proposition is to provide a system that holds a lot of loosely structured information (thus “big data”) and then sifts and correlates that information (thus “analytics”). The result is a technology that helps us make data driven decisions. In many applications, the analysis is fed directly back into applications so they can alter behavior in near real-time. For example, an online retail store could offer you purple bunny slippers as you browse for crowbars in the hardware section, knowing that you’re reading this post. That is the type of correlation across disparate data that I’m talking about.
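
Here is a toy sketch of that idea in Python. Everything in it is invented for illustration (the visitor, the click log, and the co-purchase table); a real system would derive these correlations from billions of rows, not a hand-written dictionary.

```python
# Hypothetical illustration: correlate two unrelated data sets to drive an offer.
# All data, field names, and the "purple bunny slippers" rule are made up.
browsing_log = [
    {"visitor": "v42", "category": "hardware", "item": "crowbar"},
]

# Items bought by shoppers who also browsed crowbars
# (pretend this table came out of an earlier analytics job).
co_purchases = {
    "crowbar": ["purple bunny slippers", "work gloves"],
}

for event in browsing_log:
    for suggestion in co_purchases.get(event["item"], []):
        print(f"offer {suggestion} to {event['visitor']}")
```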

This is really two problems: storing a lot of data and then computing over it.

Hadoop, the leading open source big data analytics project, is a suite of applications that implement and extend two core capabilities: a distributed file system (HDFS) and the map-reduce (M-R) algorithm. My point is not to define Hadoop (others have done that better); instead, I want to highlight that big data analysis is a merger of storage and compute. When learning about any big data analysis solution, you cannot decouple how the data is stored from how the data is analyzed – storage and compute are fundamentally linked.
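
To make the M-R half concrete, here is a minimal sketch of the map-reduce pattern in plain Python (no Hadoop required): a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group. The word-count task and sample records are just illustrations.

```python
# Toy illustration of the map-reduce algorithm in a single process.
# Hadoop distributes these same steps across a cluster and HDFS;
# here everything runs locally just to show the data flow.
from collections import defaultdict

def map_step(record):
    # map: emit (key, value) pairs -- here, (word, 1) for each word
    for word in record.split():
        yield word.lower(), 1

def reduce_step(key, values):
    # reduce: aggregate all values seen for one key
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "the fox again"]

# shuffle/sort: group mapped values by key (Hadoop does this between phases)
groups = defaultdict(list)
for record in records:
    for key, value in map_step(record):
        groups[key].append(value)

results = dict(reduce_step(k, v) for k, v in groups.items())
print(results)  # e.g. {'the': 3, 'fox': 2, ...}
```

Hadoop runs the same map and reduce functions in parallel across the cluster, moving the computation to the nodes where HDFS already holds the data.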

For that reason, the architecture of a Hadoop cluster is different from either a traditional database cluster or a compute cluster: the IO and resiliency patterns are different. Since Hadoop is a distributed system, hardware redundancy is less important and eliminating IO bottlenecks is paramount. That is why our Hadoop clusters use a lot of local, non-RAID drives, with a target of a 1:1 CPU core to spindle ratio (ratios are tuned based on planned loads).
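
A back-of-the-envelope sketch of that sizing rule, using hypothetical node specs (these numbers are examples, not a Dell reference configuration):

```python
# Hypothetical sizing math for the 1:1 core-to-spindle target.
cores_per_node = 12   # e.g. two 6-core sockets (assumed)
target_ratio = 1.0    # 1 spindle per core; tune for the planned workload
drive_size_tb = 2     # assumed local drive size

spindles = int(cores_per_node * target_ratio)
raw_capacity_tb = spindles * drive_size_tb
print(f"{spindles} local drives per node, "
      f"{raw_capacity_tb} TB raw (before HDFS's default 3x replication)")
```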

Imagine that you are looking for correlations in web click data. To do that analysis, Hadoop needs to spend a lot of time cracking open log files, sifting for specific data, and then reporting back its results. That process involves thousands of tasks, each doing disk IO, consuming CPU and RAM, and then transferring results over the network; consequently, contention between network and disk demands reduces performance.
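
For a feel of what those thousands of tasks are doing, here is a hedged sketch of a Hadoop Streaming-style mapper: it reads click-log lines on stdin and emits tab-separated key/value pairs on stdout for the framework to shuffle to reducers. The log layout (whitespace-delimited, URL in the seventh field) is assumed; a real mapper would match the actual log schema.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: sift click-log lines and emit
# (page URL, 1) pairs. The field positions below are assumptions.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 7:
        continue              # skip malformed lines
    url = fields[6]           # assumed position of the requested URL
    print(f"{url}\t1")        # tab-separated key/value for the shuffle
```

A matching reducer would sum the counts (or compute the correlation) per key, and the job would be launched through Hadoop’s streaming jar with input and output paths on HDFS.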

Wow… that’s a lot of description, and it just scratches the surface of Big Data Analytics. I’m going to have to cover the technical details of the Dell solution architecture (hardware) and software components (Cloudera & Crowbar) in another post.

Seattle Cloud Camp, Dec 2010

While I was in Seattle for Azure training preparing for Dell’s Azure Appliance, Dave @McCrory suggested that we also attend the Seattle Cloud Camp (SCC Tweets).  The event was very well attended (200 people!).  With heavy attendance by Amazon (at their HQ), Microsoft (in the ‘hood), and Google, there was a substantial cloud vendor presence (>25% from those vendors alone).  Notable omission: VMware.

My reflections on the event, by segment:

Opening Sessions:

  • Most of the opening sessions were too light for the audience.  I thought we were past the “what is cloud” level, sigh.
  • Of note, the Amazon security presentation by Steve Riley was fun and entertaining.
  • Picking on a Dell competitor specifically: calling your cloud solution “WAS” is a branding #fail (not that DCSWA is much better).

Unpanel of self-appointed cloud experts:

  • The unpanel covered some decent topics (@adronbh captured them on twitter); unfortunately, none of the answers really stood out to me, except for NoSQL.
  • The unpanel discussion about NoSQL drew two answers: 1) it’s not about “no SQL,” it’s about being eventually consistent instead of strictly consistent (note: I’ve been calling it “Storage++”); 2) we’ll see more and more choices in this area as we tune the models for utility, and then we’ll see some consolidation.  The suggestion was that NoSQL would follow the same explosion/contraction pattern as SQL databases.

Session on Cloud APIs (my suggested topic)

  • The Cloud API topic was well attended (30+).  The overwhelming majority of the attendees were using Amazon.
  • The idea of having “standard” APIs for cloud functions was not well received because it was felt that standards would stifle innovation.  We are still too early.
  • It was postulated, but not generally agreed, that cloud aggregation (DeltaCloud, RightScale, etc.) is workable.  This was considered a reason not to require standard cloud APIs.
  • CloudCamp sponsor Skytap has its own API.  These APIs are value-added and provide extra abstraction levels.
  • It was said that there are a LOT (50 now, 500 soon) of smaller hosts that want to enter the cloud space.  These hosts will need an API – some are inventing their own.
  • I brought up the concept discussed at OpenStack that the logical abstraction for cloud network APIs is a “vlan.”  This created confusion because some thought that I meant actual 802.1q tags.  NO!  I just meant the ABSTRACTION of a VLAN connecting VMs together.
  • There was agreement from the clouderati in the room that cloud networking was f’ed up, but most people were not ready to discuss it.
  • Cloud APIs have some basics that are working (semantics around VMs) but still have lots of holes, notably networking, applications, services, and identity.  (A minimal sketch of those VM basics follows this list.)
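
As an illustration of the VM semantics that do work today, here is a minimal sketch using boto, the Python library most of that Amazon-heavy room would have been using. The region choice, AMI ID, and key pair name are placeholders; credentials are assumed to come from the environment or a boto config file.

```python
# Minimal sketch of the VM-level cloud API basics that already work,
# using boto's EC2 bindings. The AMI ID and key pair name are placeholders.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
reservation = conn.run_instances(
    "ami-00000000",           # placeholder image ID
    instance_type="m1.small",
    key_name="my-keypair",    # placeholder key pair
)
instance = reservation.instances[0]
print(instance.id, instance.state)
```

There is no equivalently simple, portable call for networking, identity, or higher-level services – which was exactly the hole the session kept circling back to.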

Session on Google App Engine (GAE)

  • GAE has got a lot going on, especially in the social/mobile space.
  • Do not think that a lack of news about GAE means they are going slow; it’s just the opposite.  It looks like they are totally kicking ass with a very focused strategy.  I suspect that they are just waiting for the market to catch up.
  • GAE understands what a “platform” really is.  They talk about their platform as the SERVICES that they are offering.  The code is just code.  The services are impressive and include identity, mail, analysis, SQL (business only), map (as in Map-Reduce), prediction (yes, prediction!), storage, etc.  The total list was nearly 20 distinct services.  (See the sketch after this list.)
  • GAE compared themselves to Azure, not Amazon.
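
A hedged sketch of what “the services are the platform” looks like in practice: a few lines of GAE Python (the webapp framework from the SDK of that era) wiring the identity and mail services into a request handler. The handler path, sender address, and message text are made up, not from the talk.

```python
# Minimal sketch of leaning on GAE's built-in services (identity + mail)
# instead of writing that plumbing yourself. Paths and addresses are
# placeholders; the sender must be an address authorized for the app.
from google.appengine.api import users, mail
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class WelcomeHandler(webapp.RequestHandler):
    def get(self):
        user = users.get_current_user()          # identity service
        if not user:
            self.redirect(users.create_login_url(self.request.uri))
            return
        mail.send_mail(                          # mail service
            sender="app-admin@example.com",      # placeholder sender
            to=user.email(),
            subject="Welcome",
            body="Thanks for signing in.")
        self.response.out.write("Hello, %s" % user.nickname())

application = webapp.WSGIApplication([("/", WelcomeHandler)])

if __name__ == "__main__":
    run_wsgi_app(application)
```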