Recently, I had the pleasure of being part of the team presenting Dell’s BIG DATA story at an internal conference. From the questions and buzz, it’s clear that big data is big news this year. My team is at the center of that storm because we are responsible for the Dell | Cloudera Apache™ Hadoop™ solution. The solution is significant because we’ve integrated the many pieces necessary to build and sustain a Hadoop cluster: Dell servers, the Cloudera Hadoop distribution, the Crowbar framework and the Services to make it useful.
Big Data Analytics spins data straws into information gold.
Before I jump into technical details, it’s worth stating the big data analytics value proposition. The problem is that we are awash in a tsunami of data. We’ve grown beyond the neat rows and columns of application databases: data today includes sources ranging from website click logs and emails to call records and cash register receipts, as well as social media tweets and posts. While much of the data is unstructured noise, there is also incredibly valuable information. (video of my Hadoop “escalator pitch”)
Value is not just hidden inside the bulk data; it lies in correlations between sets of the data.
The big data analytics value proposition is to provide a system to hold a lot of loosely structured information (thus “big data”) and then sift and correlate the information (thus “analytics”). The result is a technology that helps us make data-driven decisions. In many applications, the analysis is fed directly back into applications so they can alter behavior in near real-time. For example, an online retail store could offer you purple bunny slippers as you browse for crowbars in the hardware section, knowing that you’re reading this post. That is the type of correlation across disparate data that I’m talking about.
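To make the idea of correlating disparate data concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the two data sets, the user names and the helper functions are all hypothetical, and a real system would run this kind of join across a cluster rather than in one process.

```python
from collections import Counter

# Hypothetical data from two unrelated sources, keyed by user.
PAGE_VIEWS = {  # e.g. derived from website click logs
    "alice": {"hadoop-post"},
    "bob": {"hadoop-post", "openstack-post"},
    "carol": {"openstack-post"},
}
PURCHASES = {  # e.g. derived from cash register receipts
    "alice": ["bunny-slippers", "crowbar"],
    "bob": ["bunny-slippers"],
    "carol": ["crowbar"],
}


def items_bought_by_readers(page, page_views, purchases):
    """Count the items bought by every user who read the given page."""
    counts = Counter()
    for user, pages in page_views.items():
        if page in pages:
            counts.update(purchases.get(user, []))
    return counts


def suggest(page, page_views, purchases):
    """Return the item most often bought by readers of the page, or None."""
    counts = items_bought_by_readers(page, page_views, purchases)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    print(suggest("hadoop-post", PAGE_VIEWS, PURCHASES))
```

Neither data set says anything about slippers and blog posts on its own; the suggestion only appears once the two are joined on the user.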
This is really two problems: storing a lot of data and then computing over it.
Hadoop, the leading open source big data analytics project, is a suite of applications that implement and extend two core capabilities: a distributed file system (HDFS) and the map-reduce (M-R) algorithm. My point is not to define Hadoop (others have done that better); instead, I want to highlight that big data analysis is a merger of storage and compute. When learning about any big data analysis solution, you cannot decouple how the data is stored from how the data is analyzed: storage and compute are fundamentally linked.
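For readers new to map-reduce, the following is a minimal single-process sketch of the pattern in Python. It is not Hadoop's API: the log format, the sample lines and the function names are assumptions chosen for illustration, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical click-log lines in the form "<user> <url>".
LOG_LINES = [
    "alice /hardware/crowbar",
    "bob /slippers/purple-bunny",
    "alice /slippers/purple-bunny",
    "carol /hardware/crowbar",
]


def map_phase(line):
    """Map: emit one (url, 1) pair for each log line."""
    _user, url = line.split()
    yield url, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

Because each map call looks at only one line and each reduce call looks at only one key, the work can be split across many machines, which is why where the data lives matters so much.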
For that reason, the architecture of a Hadoop cluster is different from that of either a traditional database or a compute cluster. The IO and the resiliency patterns are different. Since Hadoop is a distributed system, hardware redundancy is less important and eliminating IO bottlenecks is paramount. Consequently, our Hadoop clusters use a lot of local, non-RAID drives with a target of delivering a 1:1 CPU core to spindle ratio (ratios are tuned based on planned loads).
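The core to spindle target is simple arithmetic, sketched below in Python. The function names and the sample node sizes are hypothetical and only illustrate the calculation; they do not describe any specific Dell configuration.

```python
import math


def core_to_spindle_ratio(cpu_cores, data_drives):
    """Return the number of CPU cores per data drive (spindle) in a node."""
    if data_drives <= 0:
        raise ValueError("a node needs at least one data drive")
    return cpu_cores / data_drives


def drives_needed(cpu_cores, target_ratio=1.0):
    """Return the fewest drives that keep cores per spindle at or below the target."""
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")
    return math.ceil(cpu_cores / target_ratio)


if __name__ == "__main__":
    # A hypothetical 12-core node with 12 local drives meets the 1:1 target.
    print(core_to_spindle_ratio(12, 12))
    # The same node tuned for a lighter IO load at 2 cores per spindle.
    print(drives_needed(12, target_ratio=2.0))
```

Raising the target ratio trades disk bandwidth for fewer drives per node, which is the kind of tuning that depends on the planned load.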
Imagine that you are looking for correlations in web click data. To do that analysis, Hadoop needs to spend a lot of time cracking open log files, sifting for specific data and then reporting back its results. That process involves thousands of jobs each doing disk IO, CPU & RAM workload and then network transfer; consequently, contention between network and disk demands reduces performance.
Wow… that’s a lot of description and just scratching the surface of Big Data Analytics. I’m going to have to add the technical details about the Dell solution architecture (Hardware) and software components (Cloudera & Crowbar) in another post.
Here’s something from my employer (Dell) that may be interesting to you: it’s about using GPUs for Big Data Analytics. I meant to discuss/post this earlier, but… oh well. Here’s the information:
Premieres LIVE: 2pm EST (11 AM PST) TODAY Free - Register Now!
Please don’t confuse a lack of posts with a lack of activity! I’ve been in the center of a whirlwind of Crowbar, OpenStack and Hadoop for my team at Dell. I’ve also been working on an interesting side project with Liquid Leadership author (and would-be star ship captain) Brad Szollose.
I just don’t have time to post all of the awesomeness. I can tell you that my team is very focused on Hadoop (RHEL 6.2/CentOS 6.2 + open Cloudera Distro) barclamps as we get some Diablo deployments done. Also the Crowbar list has been very active about Diablo. If you’re looking for advanced information, there is some inside scoop on the Crowbar FoodFight podcast I did with Bryan Berry & Matt Ray.
I’ll be in BOSTON THIS WEDNESDAY 2/1 for the OpenStack Meetup there. We’re going to be talking about Quantum and the OpenStack Foundation. I suspect that Keystone will come up too (but that’s the subject of another post). Of course, it’s not just your humble blogger: the whole Dell CloudEdge OpenStack/Crowbar team will be on hand! So put on your cloud geek hat and take a trip to Harvard for the meetup!
I don’t usually call out my credentials, but knowing that I have a Masters in Industrial Engineering helps (partially) explain my passion for process as being essential to successful software delivery. One of my favorite authors, Mary Poppendieck, explains undeployed code as perishable inventory that you need to get to market before it loses value. The big lessons (low inventory, high quality, system perspective) from Lean manufacturing translate directly into software and, lately, into operations as DevOps.
What we have observed from delivering our own cloud products, and working with customers on theirs, is that the operations process for deployment is as important as the software and hardware. It is simply not acceptable for us to market clouds without a compelling model for maintaining the solution into the future. Clouds are simply moving too fast to be delivered without a continuous delivery story.
This white paper [link here!] has been available since the OpenStack conference, but not linked to the rest of our OpenStack or Crowbar content.
With the holiday rush, I neglected to post about Monday’s Crowbar v1.2 release (ISO here)!
The core focus for this release was to support the OpenStack Diablo Final bits (which my employer, Dell, includes as part of the “Dell OpenStack Powered Cloud Solution”); however, we added a lot of other capability as we continue to iterate on Crowbar.
I’m proud of our team’s efforts on this release, on both features and quality. I’m equally delighted about the Crowbar community engagement via the Crowbar list server. Crowbar is not hardware or operating system specific, so it’s encouraging to hear about deployments on other gear and see the community helping us port to new operating system versions.
We are driving more and more content to Crowbar’s Github as we work to improve community visibility for Crowbar. As such, I’ve been regularly updating the Crowbar Roadmap. I’m also trying to make videos for Crowbar training (suggestions welcome!). Please check back for updates about upcoming plans and sprint activity.
Crowbar Added Features in v1.2:
Central feature was OpenStack Diablo Final barclamps (tag “openstack-os-build”)
Improved barclamp packaging
Added concepts for “meta” barclamps that are suites of other barclamps
Proposal queue and ordering
New UI states for nodes & barclamps (LED spinner!)
Install includes self-testing
Service monitoring (bluepill)
Looking forward
Dell has a long list of pending Hadoop and OpenStack deployments using these bits, so you can expect to see updates and patches matching our field experiences. We are very sensitive to community input and want to make Crowbar the best way to deliver a sustainable, repeatable reference deployment of OpenStack, Hadoop and other cloud technologies.
We’re not limiting the agenda to OpenStack! We’ll happily talk about Hadoop, Crowbar, Opscode or any other cloud technology that’s on your mind. For 90 minutes, we’re offering Cloud Geeking as a Service (CGaaS).
This release raises the bar on open Hadoop deployments by making them faster, more scalable, more integrated and more repeatable.
These barclamps were developed in conjunction with our licensed Dell | Cloudera Solution. The licensed solution is for customers seeking large scale and professionally supported big data solutions. The purpose of the open barclamps (which pull the open source parts from the Cloudera distro) is to help you get started with Hadoop and reduce your learning curve. Our team invested significant testing effort in ensuring that these barclamps work smoothly because they are the foundational layer of our for-pay Hadoop solution.
Included in the Hadoop barclamp suite are Hadoop Map Reduce, Hive, Pig, ZooKeeper and Sqoop running on RHEL 5.7. These barclamps cover the core parts of the Hadoop suite. Like other Crowbar deployments (see OpenStack), the barclamps automatically discover the service configurations and interoperate. One of our team members (call him Scott Jensen) said it very simply: “I can deploy a fully integrated Hadoop cluster in a few hours. That friggin’ rocks!” I just can’t put it more eloquently than that!
I’ll post again when we flip the “open” bit and invite our community to dig in and help us continue to set the standards on open Hadoop deployments.
My team at Dell has been getting a great response from our community about Crowbar. Thanks! We’re actively working on a rock-solid OpenStack deployment that will raise the bar on ease of deployment and drive operational excellence.
We have also heard that we need to improve access to the team; consequently, I’m delighted to announce a long list of places and dates where you can access us online AND in person.
Opscode Community Summit 11/29-30. Greg and Rob are attending. We’re thinking about a Crowbar session (for attendees) and informal non-summit evening drinks downtown on 11/30.