In scale-out infrastructure, tools & automation matter

Scale-out platforms like Hadoop have different operating rules. I heard an interesting story today in which overall system performance improved threefold (a run went from 15 minutes down to 5) after the removal of a single node.

In a distributed system that coordinates work between multiple nodes, it only takes one bad node to dramatically impact the overall performance of the entire system.

Finding and correcting this type of failure can be difficult. While natural variability, hardware faults, and bugs cause some issues, human error is by far the most likely cause. If you can turn down the noise injected by human error, then you have a chance to find the real system-related issues.

Consequently, I’ve found that management tooling and automation are essential for success. Management tools help diagnose the cause of an issue, and automation creates repeatable configurations that reduce the risk of human-injected variability.
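To make the automation point concrete, here is a minimal sketch of one way repeatable configuration pays off: fingerprint each node’s configuration and flag any node that has drifted from the majority. The node names and config keys below are hypothetical, not from any specific tool.

```python
import hashlib
import json
from collections import Counter

def fingerprint(config):
    """Stable hash of a configuration dict so differences stand out."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]

def find_drifted(node_configs):
    """Return nodes whose config differs from the most common one.

    Assumes the majority fingerprint is the intended configuration.
    """
    prints = {node: fingerprint(cfg) for node, cfg in node_configs.items()}
    baseline, _ = Counter(prints.values()).most_common(1)[0]
    return sorted(node for node, fp in prints.items() if fp != baseline)

# Hypothetical cluster: one node was hand-edited and drifted.
configs = {
    "node-01": {"mtu": 9000, "ntp": "10.0.0.1"},
    "node-02": {"mtu": 9000, "ntp": "10.0.0.1"},
    "node-03": {"mtu": 1500, "ntp": "10.0.0.1"},  # human-injected change
}
print(find_drifted(configs))  # -> ['node-03']
```

With automation writing the configs, every fingerprint matches by construction; a mismatch immediately points at the human element.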

I’d also like to give a shout-out to benchmarks as part of your tooling suite. Without a reasonable benchmark, it is impossible to know whether your changes actually improved performance.
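Benchmarks also make the bad-node hunt tractable: run the same workload everywhere and compare per-node timings. A minimal sketch of flagging a straggler (node names and timings are made up):

```python
from statistics import median

def find_stragglers(node_times, threshold=2.0):
    """Flag nodes whose benchmark time exceeds threshold x the median.

    node_times: dict of node name -> task completion time in seconds.
    """
    typical = median(node_times.values())
    return [node for node, t in node_times.items() if t > threshold * typical]

# Hypothetical per-node timings from one benchmark pass:
times = {"node-01": 62, "node-02": 58, "node-03": 61, "node-04": 240}
print(find_stragglers(times))  # -> ['node-04']
```

In a system that coordinates work across nodes, the whole run waits on that one outlier, which is exactly the 15-minutes-to-5 story above.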

Teaming-Related Post Script: In considering the concept of system performance, I realized that distributed human systems (aka teams) have a very similar characteristic. A single person can have a disproportionate impact on overall team performance.

Open Source is The Power of We (Blog Action Day)

This post is part of a worldwide “Blog Action Day” where thousands of bloggers post their unique insights about a single theme. For 2012, the theme is “the power of we”: a celebration of people working together to make a positive difference in the world, either for their own communities or for people they will never meet halfway around the world.

I’ve chosen open source software because I think we are establishing models for building ideas collaboratively that can be extended beyond technology into broader use. The way we solve open source challenges translates broadly because we are the toolmakers of global interaction.

I started using open source¹ as a way to solve a problem; I did not understand community or how groups of loosely connected people came together to create something new. Frankly, the whole process of creating free software seemed to be some hybrid combination of ninja coders and hippie hackers. That changed when I got involved on the ground floor of the OpenStack project (of which I am now a Foundation board member).

I was not, could not have been, prepared for the power and reality of community and collaboration that fuels OpenStack and other projects. We have the same problems as any non-profit project except that we are technologists: we can make new tools to solve our teaming and process problems.

It is not just that open source projects solve problems that help people. The idea of OpenStack and Hadoop being used by medical researchers to find cures for cancer is important; however, learning how to build collaboratively is another critical dimension. Our world is getting more connected and interconnected by technology, but the actual tools for social media are only in their earliest stages.

Not only are the tools evolving, the people using the tools are changing too! We are training each other to work together in ways that were beyond our imagination even 10 years ago. It’s the combination of both new technology and new skills that is resetting the rules for collaboration.

Just a few years ago, open source technology was considered low quality, risky, and fringe. Today, open source projects like OpenStack and Hadoop are seen as more innovative and equally secure and supportable compared to licensed products. This transformation represents a surprising alignment and collaboration between individuals and entities that would normally be competing. While the motivation for this behavior comes from many sources, we all share the desire to collaborate effectively.

I don’t think we have figured out the best way to do this yet. We are making progress and getting better and better. We are building tools (like Etherpad, wikis, IRC, Twitter, GitHub, Jenkins, etc.) that improve collaboration. We are also learning to build a culture of collaboration.

Right now, I’m on a train bound for the semi-annual OpenStack summit that brings a worldwide audience together for 4½ days of community work. The discussions will require a new degree of openness from people and companies that are normally competitive and secretive about product development. During the summit, we’ll be doing more than designing OpenStack; we will be learning the new skills of working together. Perhaps those are the most important deliverables.

Open source projects’ combination of both new technologies and new skills creates the Power of We.

——————

PS¹: Open source software is a growing class of applications in which the authors publish the software’s source code publicly so that other people can use it. Sometimes (but not always) this includes a usage license that allows other people to run the software without paying the author royalties. In many cases, the author’s motivation is that other users will help them test, modify, and improve the software so that it improves more quickly than a single creator could manage alone.

Big Data to tame Big Government? The answer is the Question.

Today my boss at Dell, John Igoe, is part of announcing the report from the TechAmerica Federal Big Data Commission (direct pdf). I was fully expecting the report to be a real snoozer brimming with corporate synergies and win-win externalities. Instead, I found myself reading a practical guide to applying Big Data to government. Flipping past the short obligatory “what is…” section, the report drives right into a survey of practical applications for big data spanning nearly every governmental service. Over half of the report is dedicated to case studies with specific recommendations and buying criteria.

Ultimately, the report calls for agencies to treat data as an asset. An asset that can improve how government operates.

There are a few items that stand out in this report:

  1. Need for standards on privacy and governance. The report calls out a review and standardization of cross-agency privacy policy (pg 35) and a Chief Data Officer position in each agency (pg 37).
  2. Clear tables of case studies on page 16 and characteristics on page 11 that help pinpoint a path through the options.
  3. Definitive advice to focus on a single data vector (velocity, volume, or variety) for initial success on page 28 (and elsewhere).

I strongly agree with one repeated point in the report: although there is more data available, our ability to comprehend this data is reduced. The sheer volume of examples the report cites is proof enough that agencies are, and will continue to be, inundated with data.

One shortcoming of this report is that it does not flag the extreme shortage of data scientists. Many of the cases discussed assume a ready army of engineers to implement these solutions; however, I’m uncertain how the government will fill those positions in a very tight labor market. Ultimately, I think we will have to simply open the data for citizen and non-governmental analysis because, as the report clearly states, data is growing faster than our capability to use it.

I commend the TechAmerica commission for their Big Data clarity: success comes from starting with a narrow scope. So the answer, ironically, is in knowing which questions we want to ask.

Crowbar’s early twins: Cloudera Hadoop & OpenStack Essex

I’m proud to see my team announce the twin arrival of the Dell | Cloudera Apache Hadoop (Manager v4) and Dell OpenStack-Powered Cloud (Essex) solutions.

Not only are we simultaneously releasing both of these solutions, they reflect a significant acceleration in pace of delivery.  Both solutions had beta support for their core technologies (Cloudera 4 & OpenStack Essex) when the components were released and we have dramatically reduced the lag from component RC to solution release compared to past (3.7 & Diablo) milestones.

As before, the core deployment logic of these open source based solutions was developed in the open on Crowbar’s github.  You are invited to download and try these solutions yourself.   For Dell solutions, we include validated reference architectures, hardware configuration extensions for Crowbar, services and support.

The latest versions of Hadoop and OpenStack represent great strides for both solutions. It’s great to have made them more deployable and faster to evaluate and manage.

Crowbar 2.0 Objectives: Scalable, Heterogeneous, Flexible and Connected

The seeds for Crowbar 2.0 have been in the 1.x code base for a while and were recently accelerated by SuSE. With the Dell | Cloudera 4 Hadoop and Essex OpenStack-powered releases behind us, we will now be totally focused on bringing these seeds to fruition in the next two months.

Getting the core Crowbar 2.0 changes working is not a major refactoring effort in calendar time; however, it will impact current Crowbar developers by changing and improving the programming APIs. The Dell Crowbar team decided to treat this as a focused refactoring effort because several important changes are tightly coupled. We cannot solve them independently without causing a larger disruption.

All of the Crowbar 2.0 changes address issues and concerns raised in the community and are needed to support the expansion of our OpenStack and Hadoop application deployments.

Our technical objective for Crowbar 2.0 is to simplify and streamline development efforts as the development and user community grows. We are seeking to:

  1. simplify our use of Chef and eliminate Crowbar requirements in our Opscode Chef recipes.
    1. reduce the initial effort required to leverage Crowbar
    2. open Crowbar to a broader audience (see Upstreaming)
  2. provide heterogeneous / multiple operating system deployments. This enables:
    1. multiple versions of the same OS running for upgrades
    2. different operating systems operating simultaneously (and deal with heterogeneous packaging issues)
    3. accommodation of no-agent systems like locked systems (e.g.: virtualization hosts) and switches (aka external entities)
    4. UEFI booting in Sledgehammer
  3. strengthen networking abstractions
    1. allow networking configurations to be created dynamically (so that users are not locked into choices made before Crowbar deployment)
    2. better manage connected operations
    3. enable pull-from-source deployments that are ahead of (or forked from) available packages.
  4. improve Crowbar’s core database and state machine to enable
    1. larger scale concerns
    2. controlled production migrations and upgrades
  5. other important items
    1. make documentation more coupled to current features and easier to maintain
    2. upgrade to Rails 3 to simplify code base, security and performance
    3. deepen automated test coverage and capabilities

Beyond these great technical targets, we want Crowbar 2.0 to address barriers to adoption that have been raised by our community, customers, and partners. We have been tracking concerns about the learning curve for adding barclamps, the complexity of networking configuration, and packaging into a single ISO.

We will kick off the community part of this effort with an online review on 7/16 (details).

PS: why a refactoring?

My team at Dell does not take on any refactoring changes lightly because they are disruptive to our community; however, a convergence of requirements has made it necessary to update several core components simultaneously. Specifically, we found that desired changes in networking, operating systems, packaging, configuration management, scale and hardware support all required interlocked changes. We have been bringing many of these changes into the code base in preparation and have reached a point where the next steps require changing Crowbar 1.0 semantics.

We are first and foremost an incremental architecture & lean development team – Crowbar 2.0 will have the smallest footprint needed to begin the transformations that are currently blocking us. There is significant room during and after the refactor for the community to shape Crowbar.

With Dell ARM-based “Copper” servers, Crowbar footprint grows

One of my Dell team’s most critical lessons from hyperscale cloud deployments was that DevOps tooling and operations processes are key to success.  Our Crowbar project was born out of this realization.

I have been tracking the progress of the Copper ARM-based server from design to implementation internally.  Now, I’m excited to see it getting some deserved attention.

The Copper platform is really cool because the cost, power, and density ratios of the nodes are unparalleled.  This makes it an ideal platform for distributed mixed compute/store workloads like Hadoop.  The nodes in the platform have excellent RAM/CPU/Spindle ratios.

While Copper is driving huge density, it also drives forward the same hyperscale challenges that we’ve been trying to address with Crowbar; consequently, we’re already working to ensure that we can deploy and manage Copper with Crowbar at scale.

Copper and Crowbar make a natural team and we’re excited to be part of today’s announcement:

Dell is staging clusters of the Dell “Copper” ARM server within the Dell Solution Centers and with TACC so developers may book time on the platforms. Dell also will deliver an ARM-supported version of Crowbar, Dell’s open-source management infrastructure software, to the industry in the future.

Congratulations to the Copper team!

Quick turn OpenStack Essex on Crowbar (BOOM, now we’re at v1.4!)

Don’t blink if you’ve been watching the Crowbar release roadmap!

My team at Dell is about to turn another release of Crowbar. Version 1.3 released 5/14 (focused on Cloudera Apache Hadoop) and our original schedule showed several sprints of work on OpenStack Essex. Upon evaluation, we believe that the current community developed Essex barclamps are ready now.

The healthy state of the OpenStack Essex deployment is a reflection of 1) the quality of Essex and 2) our early community activity in creating deployments based on Essex RC1 and Ubuntu Beta1.

We are planning many improvements to our OpenStack Essex deployment and the Crowbar framework; however, most deployments can proceed without these enhancements.  This release also supports participants in the 5/31 OpenStack Essex Deploy Day.

By releasing a core stable Essex reference deployment, we are accelerating field deployments and enabling the OpenStack ecosystem. In terms of previous posts, we are eliminating release interlocks to enable more downstream development. Ultimately, we hope that we are also creating a baseline OpenStack deployment.

We are also reducing the pressure to rush more disruptive Crowbar changes (like enabling high availability, adding multiple operating systems, moving to Rails 3, fewer crowbarisms in cookbooks and streamlining networking). With this foundational Essex release behind us (we call it an MVP), we can work on more depth and breadth of capability in OpenStack.

One small challenge: some of the changes that we’d expected to drop have been postponed slightly. Specifically, markdown-based documentation (/docs) and some new UI pages (/network/nodes, /nodes/families). All are already in the product but not wired into the default UI (basically, a split test).

On the bright side, we did manage to expose 10g networking awareness for barclamps; however, we have not yet refactored the barclamps to leverage the change.