Analyze This! Big Data | Apache Hadoop | Dell | Cloudera | Crowbar February 20, 2012
Posted by Rob H in Big Data Analytics, Cloudera, Hadoop, Open source.Tags: Big Data, CloudEra, hadoop, HDFS, map-reduce, Unstructured Data
add a comment
This article about Target using buying patterns to expose a teen was pregnant before she told her parents puts big data analysis into everyday terms better than the following 555 words (of course, I recommend that you read both).
Recently, I had the pleasure of being one of our team presenting Dell’s BIG DATA story at an internal conference. From the questions and buzz, it’s clear that the big data is big news this year. My team is at the center of that storm because we are responsible for the Dell | Cloudera Apache™ Hadoop™ solution. The solution is significant because we’ve integrated many pieces necessary to build and sustain a Hadoop cluster: that includes Dell servers, the Cloudera Hadoop distribution, the Crowbar framework and Services to make it useful.
Big Data Analytics spins data straws into information gold.
Before I jump into technical details, it’s worth stating the big data analytics value proposition. The problem is that we are awash in a tsunami of data: we’ve grown beyond the neat rows and columns of application databases, data today include source like website click logs and emails to call records and cash register receipts to including social media tweets and posts. While much of the data is unstructured noise, there is also incredibility valuable information. (video of my Hadoop “escalator pitch”)
Value is not just hidden inside the bulk data; it lies in correlations between sets of the data.
The big data analytics value proposition is to provide a system to hold a lot of loosely structured information (thus “big data”) and then sift and correlate the information (thus “analytics”). The result is a technology that helps us make data driven decisions. In many applications, the analysis is fed directly back into applications so they can alter behavior in near real-time. For example, an online retail store could offer you purple bunny slippers as you browse for crowbars in the hardware section knowing that you’re reading this post. That is the type of correlations on disparate data that I’m talking about.
This is really two problems: storing a lot of data and then computing over it.
Hadoop, the leading open source big data analytics project, is a suite of applications that implement and extend two core capabilities: a distributed file system (HDFS) and the map-reduce (M-R) algorithm. My point is not to define Hadoop (others have done better and here); instead, I want to highlight that it’s a combination big data analysis is a merger of storage and compute. When learning about any big data analysis solution, you cannot decouple how the data is stored from how the data is analyzed – storage and compute are fundamentally linked.
For that reason, the architecture of a Hadoop cluster is different than either a traditional database or compute cluster. The IO and the resiliency patterns are different. Since Hadoop is a distributed system, hardware redundancy is less important and eliminating IO bottlenecks is paramount. For this reason, our Hadoop clusters use a lot of local, non-RAID drives with a target of delivering a 1:1 CPU core to spindle ratio (ratios are tuned based on planned loads).
Imagine that you are looking for correlations in web click data. To do that analysis, Hadoop need to spend a lot of time cracking open log files, sifting for specific data and then reporting back its results. That process involves thousands of jobs each doing disk IO, CPU & RAM workload and then network transfer; consequently, contention between network and disk demands reduces performance.
Wow… that’s a lot of description and just scratching the surface of Big Data Analytics. I’ll going to have to add the technical details about the Dell solution architecture (Hardware) and software components (Cloudera & Crowbar) in another post.
Why Governance Matters in Open Source: Discussing the OpenStack Foundation February 8, 2012
Posted by Rob H in RackSpace, OpenStack, Open source, Citrix, Meetup, Dell.Tags: Foundation, meetup, OpenStack, rackspace
1 comment so far
This post is part of my notes from the 2/1 Boston OpenStack meetup.
OpenStack Foundation
Your’s truly (Rob Hirschfeld) gave the presentation about the OpenStack Foundation. To readers of this blog, it’s obvious that I’m a believer in the OpenStack mission; however, it’s not obvious how creating a foundation helps with that mission and why OpenStack needs its own. As one person at the meetup put it, “Why not? Every major project needs a foundation!”
Governance does not sound sexy compared to writing code and deploying clouds, but it’s very important to the success of the project.
Here are my notes without the poetic elocution I exuded during the meetup…
The basics:
- What: Creating a neutral body to govern OpenStack. Rackspace has been leading OpenStack. This means that they own the copyrights, name and also pay the people who organize the community. They committed (to executives at Dell and others) that they would ultimately setup a standalone body to govern the project before the project was public and endorsed by those early partners. Dell (my employer), Citrix, Accenture and NASA were some of biggest names at the Austin conference launch.
- Why: A neutral body is needed because a lot of companies are committing significant time and money to the project. They cannot risk their investments on Rackspace good will alone. This may mean many things. It could be they don’t like Rackspace direction or they feel that Rackspace is not investing enough.
- When: Right now and over the next few releases. You should give feedback right now on the OpenStack Foundations mission. The actual foundation will take more time to establish because it requires legal work and funding commitments.
- Who: The community – all stakeholders. This is important stuff! While trying to standup a financially independent Foundation, which requires moneys, the little guys are not left out. There is a clear realization and desire to enable independent developers and contributors and small players to have a seat at the table.
- How Much: The amounts are unclear, but establishing a foundation will require a significant ongoing investment from highly involved and moneyed parties (Rackspace, Dell, Cisco, HP, Citrix, NTT, startups?, etc). The funding will pay salaries for people dedicated to the community doing the things that I’ll discuss below. Overall, the ROI for those investments must be clear!
The foundation does “governance.” But, what does that mean? Here is a list of vitally important work that the foundation is responsible for.
- Branding – Protecting, certifying, and promoting the OpenStack brand is important because it ensures that “OpenStack” has a valuable and predictable meaning to contributors and users. A strong the brand also means a stronger temptation for people to abuse the brand by claiming compatibility, participation and integration.
- API – Many would assume that the OpenStack API is the very heart of the project and there is merit to this position. As more and more OpenStack implementations emerge, it is essential that we have a body that can certify which implementations (and even which versions of the implementation!) are valid. This is a substantial value to the community because API integrity ensures project continuity and helps the ecosystem monetize the project. Note: my opinion differs from others here because I think we should favor API over implementation
- Community – The OpenStack community is not an accident. It is the function of deliberate actions and choices made by Rackspace and supported by key contributors. That community requires virtual and physical places to coalesce and leaders to organize and manage those meeting places. The excellent conferences, wikis, blogs, media awareness, documentation and meetups are a product of consistent community management.
- Arbitration – An open source community is a family and siblings do not always get along. Today, Rackspace must be very careful about balancing their own interests because they are like the oldest sibling playing the parent role – you can get away with it until something serious happens. We need a neutral party so that Rackspace can protect their own interests (alternate spin: because Rackspace protects their own interests at the expense of the community).
- Leadership – OpenStack today is a collection of projects with individual leadership. We will increasingly need coordinated leadership as the number of projects and users increases. Centralized leadership is essential because the good of the project as a whole may mean sacrifices within individual projects. It may even mean that some projects chose to leave the OpenStack tent. Stewarding these challenges will require a new level of leadership.
- Legal – This is a function of all the above but also something more. From a legal stand point, OpenStack be able to represent itself. There is a significant amount of intellectual property being created. It would be foolish to overlook that this property is valuable and needs adequate legal representation.
I used “vitally important” to describe the above items. Is that an exaggeration? Our goal is collaboration and that requires some infrastructure and rules to make it sustainable. We must have a foundation that encourages innovation (multiple implementations) and collaboration (discourages forking). Innovation and collaboration are the heartbeat of an open source project.

The foundation is vitally important because collaboration by competitors is fragile.
In addition to the core areas above, the foundation needs to handle routine tactical items such as:
- Delivering on milestones & releases
- Moving new subprojects into OpenStack
- Electing and maintaining Project Policy Board
- Electing and maintaining Project Technical Leads
- Ensuring adherence and extensions to the current bylaws
At the end of the day, OpenStack monetization is the central value for the Foundation.
In order for the OpenStack project, and thus its foundation, to flourish, the contributors, ecosystem, sponsors and users of the project must be able to see a reasonable return (ROI) on their investment. I would love to believe that the foundation is allow about people banding together to solve important problems for the benefit of all; however, it is more realistic to embrace that we can both collaborate and profit simultaneously. Acknowledging the pragmatic self-interested view allows us to create the right incentives and processes as embodied by the OpenStack foundation.
Concerns about SOPA January 18, 2012
Posted by Rob H in Culture, Open source.Tags: Blackout, Protest, SOPA
add a comment
Joining in the blackout; however, I’ll defer to more articulate bloggers on this topic.
2012: A year of Cloud Coalescence (whatever that means) January 5, 2012
Posted by Rob H in Clouds, Hadoop, Linux, Open source, OpenStack.Tags: 2012, DevOps, hadoop, OpenStack, PaaS, quantum
1 comment so far
This post is a collaboration between three Dell Cloud activists: Rob Hirschfeld (@zehicle), Joseph B George (@jbgeorge) and Stephen Spector (@SpectoratDell).
We’re not making predictions for the “whole” Cloud market, this is a relatively narrow perspective based on technologies that on our daily radar. These views are strictly our own and based on publicly available data. They do not reflect plans, commitments, or internal data from our
employer (Dell).
The major 2012 theme is cloud coalescence. However, Rob worries that we’ll see slower adoption due to lack of engineers and confusing names/concepts.
Here are our twelve items for 2012:
- Open source continues to be a disruptive technology delivery model. It’s not “free” software – there’s an emerging IT culture that is doing business differently, including a number of large enterprises. The stable of sleeping giant vendors are waking up to this in 2012 but full engagement will take time.
- Linux. It is the cloud operating system and had a great 2012. It seems silly pointing this out since it seems obvious, but it’s the foundation for open source acceleration.
- Tight market for engineering and product development talent will get tighter. The catch-22 of this is that potential mentors are busy breaking new ground and writing code, making it hard for new experts to be developed.
- On track, OpenStack moves into its awkward adolescence. It is still gangly and rebelling against authority, but coming into its own. Expect to see a groundswell of installations and an expected wave of issues and challenges that will drive the community. By the “F” release, expect to see OpenStack cement itself as a serious, stable contender with notable public deployments and a significant international private deployment foot print.
- We’ll start seeing OpenStack Quantum (networking) in near-production pilots by year end. OpenStack Quantum is the glue that holds the big players in OpenStack Nova together. The potential for next generation cloud networking based on open standards is huge, but it will emerge without a killer app (OpenStack Nova in this case) pushing it forward. The OpenStack community will pull together to keep Quantum on track.
- Hadoop will cross into mainstream awareness as the need for big data analysis grows exponentially along with the data. Hadoop is on fire in select circles and completely obscure in others. The challenge for Hadoop is there are not enough engineers who know how to operate it. We suspect that lack of expertise will throttle demand until we get more proprietary tools to simplify analysis. We also predict a lot of very rich entrepreneurs and VCs emerging from this market segment.
- DevOps will enter mainstream IT discussions. Marketers from major IT brands will struggle and fail to find a better name for the movement. Our prediction is that by 2015, it will just be the way that “IT” is done and the name won’t matter.
- KVM continues to gain believers as the open source hypervisor. In 2011, I would not have believed this prediction but KVM making great strides and getting a lot of love from the OpenStack community, though Xen is also a key open source technology as well. I believe that Libvirt compatibility between LXE & KVM will further accelerate both virtualization approaches.
Big Data and NoSQL will continue to converge. While NoSQL enthusiasm as a universal replacement for structured databases appears to be deflating, real applications will win.- Java will continue to encounter turbulence as a software platform under Oracle’s overly heady handed management.
- PaaS continues to be a confusing term. Cloud players will struggle with a definition but I don’t think a common definition will surface in 2012. I think the big news will be convergence between DevOps and PaaS; however, that will be under the radar since most of the market is still getting educated on both of those concepts.
- Hybrid cloud will continue to make strides but will not truly emerge in 2012 – we’ll try to develop this technology, and expose gaps that will get us there ultimately (see PaaS and Quantum above)
Thoughts? We’d love to hear your comments.
Rob, JBG, and Stephen
You can follow Rob at www.RobHirschfeld.com or @zehicle on Twitter.
You can follow Joseph at www.JBGeorge.net or @jbgeorge on Twitter.
You can follow Stephen at http://en.community.dell.com/members/dell_2d00_stephen-sp/blogs/default.aspx or @SpectoratDell on Twitter.


I’m very excited to announce that 
