
About Rob H

A Baltimore transplant to Austin, Rob thinks about ways of building scale infrastructure for the clouds using Agile processes. He sat on the OpenStack Foundation board for four years. He co-founded RackN to enable software that creates hyperscale converged infrastructure.

BlackOps: 7 tenets for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale

Traditional IT Ops

In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list just behind world dominance, er, peace.  Basically, a cloud taxonomy is a layer cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch.  Unfortunately, our clock-like layer cake has evolved into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.

The key to de-spaghettifying our cloud taxonomy was to realize that clouds have two distinct sides: an external well-known API and internal “black box” operations.  Each side has different objectives that together create an elastic, scalable cloud.

The objective of the API side is to provide the smallest usable “surface area” for cloud consumers.  Surface area describes the scope of the interface that is published to the users.  The smaller the area, the easier it is for users to comprehend and the harder it is for them to break.  Amazon’s EC2 & S3 APIs set the standards for small surface area design and spawned a huge cloud ecosystem.

Hyperscale Cloud (APIs!)

To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem.  The cloud ecosystem exists primarily beyond the API of the cloud.  It provides users with flexible options that address their specific use cases.  Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.

The objective of the internal side of clouds is to create a perfect black box to give API users the illusion of a perfectly performing, strictly partitioned and totally elastic resource pool.  To the consumer, it should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it does matter a great deal to the cloud operator.

Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps). 

Cloud APIs spawn Ecosystems

The BlackOps challenge is that clouds cannot wait until all of the answers are known because issues (or solutions) to scale architecture are difficult to predict in advance.  Even worse, taking the time to solve them in advance likely means that you will miss the market.

Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale.  To cope with constant changes, BlackOps live by seven tenets that help manage their infrastructure efficiently in a dynamic environment.

  1. Operational ownership – don’t wait for all the king’s horses and consultants to put you back together again (but asking for help is OK).
  2. Simple APIs – reduce the ways that consumers can stress the system making the scale challenges more predictable.
  3. Efficiency based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
  4. Automated processes & verification – ensures that changes and fixes can propagate at scale while errors are self-correcting.
  5. Frequent incremental rolling adjustments – prevents the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”)
  6. Passion for operational simplicity – at hyperscale, technical debt compounds very quickly.  Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
  7. Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work.  You must catch problems early before they infect your scale infrastructure.  The only thing more frustrating than fixing a problem at scale is fixing the same problem multiple times.

It’s no surprise that these are exactly the Agile & Lean principles.  The pace of change of cloud is so fast and fluid, that BlackOps must use an operational model that embraces iterative and rolling deployment.
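To make tenets 4 and 5 a little more concrete, here is a minimal Python sketch of a rolling change with automated verification.  It is only an illustration under my own assumptions: the function name and the `apply_change`, `verify` and `rollback` callbacks are hypothetical placeholders, not part of any real tool.

```python
from typing import Callable, List


def rolling_update(
    nodes: List[str],
    apply_change: Callable[[str], None],
    verify: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 2,
) -> List[str]:
    """Apply a change one small batch at a time (hypothetical sketch).

    Each batch is verified automatically. If any node in a batch fails
    verification, the whole batch is rolled back and the rollout stops,
    so a bad change never reaches the rest of the fleet.
    Returns the nodes that were changed and verified.
    """
    updated: List[str] = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            apply_change(node)
        failed = [node for node in batch if not verify(node)]
        if failed:
            for node in batch:
                rollback(node)
            break
        updated.extend(batch)
    return updated
```

The point of the small batch size is that a mistake costs you a couple of nodes instead of the whole data center, and the feedback arrives while the change is still fresh.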

Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.

I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies.  BlackOps is a highly disciplined process based on well understood principles from just-in-time (JIT) and lean manufacturing.  Best of all, they are fast to market, able to deliver high quality and capable of responding to change.

Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project.  I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.

OpenStack is ready, but are you? Get some operational cloud mojo and get started!

NOTE: This post is not intended as an endorsement of the company “CloudOps.”

This week, I’ve been working to describe the “cloud operation model” or “cloud ops” to Dell internal and external customers.  CloudOps is really just DevOps but packaged more broadly to help explain how hardware, software, and operations interact.  The critical concept I’m trying to convey is that we’re not spending enough time working with customers on operations.

Running a cloud is driven by operational processes and choices.

Back in 2001 when virtualization was a shiny new thing, no one had any idea on how to operate a virtualized data center.  My company (now owned by Quest) struggled to win deals outside of our own data center because our customers did not know how to operate virtualized hardware.  Ultimately, VMware created the SAN based data center consolidation pattern and sales exploded.  That solution is much more about operations than hardware (SANs) or software (ESX).

So here in 2011, we have the same challenge with cloud.  (The majority of) Dell’s customers do not know how to operate a hyperscale data center because there is no commonly accepted pattern.  That’s where the cloud operation model comes into play – we have cloud proven hardware and cloud proven software, but we had been missing a description of the operational cloud mojo.

My team’s first OpenStack project started as a cloud installer (aka Crowbar), but we’ve learned that it is more fundamental than that.  To achieve “4 hours to cloud,” our approach embraced the DevOps philosophy that deployments should be automated, dynamic and repeatable.  Our choice to extend Opscode’s Chef Server allowed us to bring in more than just a software capability: it delivers a core operational foundation that enables customers to manage their data center at significant scale.

We had to deliver a CloudOps Foundation because Cloud is not a static configuration that can be distilled in a 10 page white paper!

Cloud scale requires an Operations Foundation that can respond and react because deployed software and infrastructure is constantly evolving and adapting.  I do not mean moving around assets like VMs.  I am talking about something that is closer to refactoring code and writing software features.   Like the applications that run on the cloud, we need to recognize that cloud is a moving target and build systems that can handle that.

We’re delivering OpenStack using an operational platform that can respond to the code as it changes and expands.  There is more than enough stable code and proven capability in OpenStack for our customers with CloudOps mojo to start building their operational foundation and to create commercial public clouds.  These first providers are not waiting for a “final release” of OpenStack where it’s suddenly “production ready.”

The beauty of an open source cloud with an active community is that it will be constantly improving.

Some may be hoping that in 5 years we will have established patterns for hyperscale; however, I think those days are past.  Instead, we’ll see tools that accelerate infrastructure agility.  We already have those for public cloud deployments and now it’s time to bring those into the data center itself.  But that is the subject for another post (BlackOps).

Use the 80/80 rule to crush your competition: you have to know WHICH 20% matters. #Lean #Agile

 In software, the 80/20 rule is a harsh reality.  It has two equally distressing parts:

  1. 80% of your feature set is common while 20% is unique. 
  2. 80% of your time goes into creating 20% of the features.

Part 1 should be a good thing – 80% of what you build will help all your customers.  Unfortunately, “unique” means that 20% of what you invest in will only help a fraction of your audience.  No problem you say? 

How do you know WHICH 20% is the unique part and which is the 80% common part?

Not knowing the 80 from the 20 is where Part 2 is particularly unkind.  Since you spend the majority of your investment on features for a narrow audience, you’d better pick your top features wisely.

The cold reality is that it’s not obvious which features are included in the 80% and which are in the 20%.  If you want to build a successful product, you need a way to pick the right features.

At most 50% of the features for a product are obvious in advance.

Let me explain using my last “next big thing” as an example.  I built a mobile sandwich application called sAndroidwich™.  Here are my product manager’s 10 features (in rank order):

  1. Bread (top)
  2. Bread (bottom)
  3. Bacon
  4. Romaine Lettuce
  5. Tomato
  6. Tuna
  7. Smoked Turkey
  8. Hummus
  9. Pepper Jack Cheese
  10. Cheddar Cheese (developers think Cheddar is easy if you already know Jack)

It’s pretty obvious that we’d identified BLT as our core market because everyone loves bacon, but what about the next 5 features?  Our product manager has 25 years of experience consuming sandwiches and swears that he knows this market inside and out.  Will these features put me into the top 3 social food apps?  You bet!  Call up Y Combinator, we’re going to IPO!

My potential feature list should have looked more like this:

  Feature #    Features
  1-5          Bread, Bread, Bacon, Lettuce, Tomato
  6-8          Turkey Market: Turkey, Jack, Mustard
  9-11         Beef Market: Beef, Cheddar, Mayo
  12-14        Tuna Market: Tuna, Munster, Pickles
  15-16        Veggie Market: Sprouts, Hummus

That’s 16 features even though I only have time for 10!  In addition to simply listing more features, I’ve also added market segments.  It’s important to remember that the 80/20 rule also applies to features by market, so features for 1 market may not help (or may even hurt) sales in an adjacent market.

The challenge to picking features is that 50% of them are common to all users and their use is obvious while 30% of them are common to all users but you can’t distinguish them from the unique features.  I consider these to be “nonobvious common.”  You should take the time to list 160% of your potential features if you hope to find the real 80%.
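The percentages above can be turned into quick arithmetic.  This Python sketch simply applies the 50/30/20 split and the 160% rule from this post to a planned feature count; the function name and dictionary keys are mine, made up for illustration.

```python
def feature_budget(planned: int) -> dict:
    """Split a planned feature count using this post's percentages.

    Integer arithmetic keeps the results exact for round numbers.
    """
    return {
        "obvious_common": planned * 50 // 100,      # everyone sees these coming
        "nonobvious_common": planned * 30 // 100,   # the hidden 30%
        "unique": planned * 20 // 100,              # only help a narrow audience
        "candidates_to_list": planned * 160 // 100, # list 160% to find the real 80%
    }


budget = feature_budget(10)
print(budget)
```

For the 10 features sAndroidwich™ had time for, that works out to 5 obvious common features (the BLT), 3 nonobvious common features to discover, 2 unique features, and 16 candidates on the list.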

To figure out the 30% nonobvious common features, you must accept that your own experience and bias clouds your judgment.

If you make the assumption that you can predict which of the features are in the 80% and which are in the 20% then you will be wrong about 50% of your feature set!  If you accept that the second 50% of your features can only be discovered by customer interactions then you’re open to discovering the hidden 30% of common features.

Discovering this hidden 30% is critical to success because they are your market differentiation!

If you can find the hidden 30% then your competitor is probably handing you the golden goose.  In most cases, they are waiting while their engineering team is building the wrong features or focusing their 80% effort on the less critical 20% features.  This behavior ultimately causes feature fan out – which will have to wait for a future post.

BTW: sAndroidwich™ never made it into the top 10 apps – my team’s bias toward tuna and hummus (omega 3s AND delicious) meant that we missed the super-hot Beef and Jack market.   If only we’d shipped the BLT features (using Lean) then market tested and added incrementally, we may have been able to adjust before iSubpad and Po’Berry got all the users.

Go read “Liquid Leadership” (@bradszollose, http://bit.ly/eaTWa6): gaming=job skillz, teams=privilege & coopetition

I like slow media that takes time to build and explain a point (aka books) and I have read plenty of business media that I think are important (Starfish & Spider, Peopleware, Coders At Work, Predictably Irrational) and fun to discuss; however, few have been as immediately practical as Brad Szollose’s Liquid Leadership.

On the surface, Liquid Leadership is about helping Boomers work better with Digital Natives (netizens).  Just below that surface, the book hits at the intersection of our brave new digital world and the workplace.  Szollose’s insights are smart, well supported and relevant.  Even better, I found that the deeper I penetrated into this ocean of insight, the more I got from it.

If you want to transform (or save) your company, read this book.

To whet your appetite, I will share the conversational points that have interested my peers at work, wife, friends and mother-in-law.

  • Membership on a team is a privilege: you have to earn it.  Not everyone shows up with the trust, enthusiasm, humility and leadership needed.
  • Video games position digital natives for success.  They teach risk taking, iterative attempts, remote social teaming and digital pacing.
  • Netizens leave organizations with hierarchical management.  Management in 2010 is about team leadership and facilitation.
  • Smart people are motivated more by trust and autonomy than by pay and status.
  • Relationship and social marketing puts the focus back on quality and innovation, not messaging and glossies.  Broadcast (uni-directional) marketing is dead.
  • Using speed of execution to manage risk. Szollose loves Agile (does not call it that) and mirrors the same concepts that I expound about Lean.
  • Being creative in business means working with your competitors.  My #1 project at Dell right now, OpenStack, requires this and it’s the best way to drive customer value.  The customers don’t care about your competitor – they just want good solutions.

PS: If you like reading books like this and are interested in a discussion group in Austin, please comment on this post.

Substituting Action for Knowledge – adopting “ready, fire, aim” as a strategy (and when to run like hell)

Today my mother-in-law (a practicing psychiatrist) was bemoaning the current medical practice of substituting action for knowledge. In her world, many doctors will make rapid changes to their patients’ therapy. Their goal is to address the issues immediately presented (patient feels sad so Dr prescribes antidepressants) rather than taking time to understand the patients’ history or make changes incrementally and measure impacts. It feels like another example of our cultural compulsion to fix problems as quickly as possible.

Her comments made me question the core way that I evangelize!

Do Lean and Agile substitute action for knowledge? No. We use action to acquire knowledge.

The fundamental assumption that drives poor decision-making is that we have enough information to make a design, solve a problem or define a market. Lean and Agile’s most core tenet is that we must attack this assumption. We must assume that we can’t gather enough information to fully define our objective. The good news is that even without much analysis we know a lot! We know:

  • roughly what we want to do (road map)
  • the first steps we should take (tactics)
  • who will be working on the problem (team members)
  • generally how much effort it will take (time & team size)
  • who has the problem that we are trying to solve (market)

We also know that we’ll learn a lot more as we get closer to our target. Every delay in starting effectively pushes our “day of clarity” further into the future. For that reason, it is essential that we build a process that constantly reviews and adjusts its targets.

We need to build a process that acquires knowledge as progress is made and makes rapid progress.

In Agile, we translate this need into the decorations of our process: reviews for learning, retrospectives for adjustments, planning for taking action and short iterations to drive the feedback loop.  Agile’s mantra is “ready, fire, aim, fire, aim, fire, aim, …” which is very different from simply jumping out of a plane without a parachute and hoping you’ll find a haystack to land in.

For cloud deployments, this means building operational knowledge in stages.  Technology is simply evolving too quickly and best practices too slowly for anyone to wait for a packaged solution to solve all their cloud infrastructure problems.  We tried this and it does not work: clouds are a mixture of hardware, software and operations.  More accurately, clouds are an operational model supported by hardware and software.

Currently, 80% of cloud deployment effort is operations (or “DevOps”).

When I listen to people’s plans about building product or deploying cloud, I get very skeptical when they take a lot of time to aim at objects far off on the horizon.  Perhaps they are worried that they will substitute action for knowledge; however, I think they would be better served to test their knowledge with a little action.

My MIL agrees – she sees her patients frequently and makes small adjustments to their treatment as needed.  Wow, that’s an Rx for Agile!

My EV, RAVolt, rides again. Brakes no longer broken.

My EV blog (RAVolt.com) is basically inactive so I’m cross posting the positive news that I’ve got the RAVolt road worthy again!  I was able to repair the brake line and learned how to bleed the brake lines (surprisingly easy).  Next step is to order some new batteries.

The RAVolt has been idle for over 18 months.  With $4 gas around the corner, my timing is looking very good.

Rackspace will balance control of OpenStack. It takes time & strong partners

Rick Clark’s post “Why I Left Rackspace and What About Openstack” (+ his softer post script) is part of a longer conversation that started when Rackspace acquired Anso Labs and was expanded with the resignation of Chris Kemp (NASA CTO & OpenStack #1 fanboy).

Building a community is a delicate balance: you need to show leadership while you cultivate leadership.

Putting aside the context (resigning from Rackspace to join Cisco) of his post, I think that Rick’s comments do resonate with parts of the community.  OpenStack governance became unbalanced when Anso became Rackspace.  The governance board formed at the Austin conference was dominated by a small number (2: NASA/Anso & Rackspace) of highly committed voices but there was no single master.

Considering OpenStack’s momentum, we are in a very good position to fix the single master problem.  However, it takes time.  While companies like Dell (my employer), NTT, Citrix, Cisco (Rick’s employer), and Microsoft are clearly investing in OpenStack, none have yet achieved NASA or Rackspace’s level of technical commitment.

The challenge for Rackspace is to expand the OpenStack market and ecosystem so that partners are motivated to jump in more and more quickly.  If my experiences inside Dell are indicative of the broader community, Rackspace’s leadership makes it much easier for partners to increase their own commitment.  Like teaching my daughter to ride her bike, she needed to know that I was running next to her before she would pedal hard enough to balance by herself.

Like teaching bike riding – you can’t lead communities too hard or too lightly.

To build a community around OpenStack, we (the partners) need to stand up our own capability.  Until we have demonstrated more leadership, Rackspace must cultivate both a community and a market.  This is a challenging role to balance.  While the community wants distributed ownership, the market wants leadership.  Rick’s governance comments are evidence of this struggle and Rick’s move to Cisco is an indication of leadership diversification.

I believe that Rackspace is committed to distributed ownership – we, in the community, need to rise to the challenge!

OpenStack still needs strong leadership from Rackspace because the market needs someone to be accountable for releases and features.  That allows new partners to depend on someone to run beside them while they wobble their way along to independence.  As the community leaders stand up, we’ll see a balanced community emerge.  The challenge is on us to make that happen (and happen quickly).

How OpenStack installer (crowbar + chefops) works (video from 3/14 demo)

July 24th 2012 Update:

This page is very very old and Crowbar has progressed significantly since this was posted.  For better information, please visit the Crowbar wiki and  review my Crowbar 2 writeups.

August 5th 2011 Update:

While still relevant and accurate, the information on this page does not reflect the latest information about the now Apache 2 released Crowbar code.  In the 4+ months following this post, we substantially refactored the code to make it more modular (see Barclamps), better looking, and multi-vendor/multi-application (Hadoop & RHEL).  If you want more information, I recommend that you try Crowbar for yourself.

Original March 14th 2011 Text:

I’ve been getting some “how does Crowbar work” inquiries and wanted to take a shot at adding some technical detail.   Before I launch into technical babble, there are some important things to note:

  1. Dell has committed to an open source release of the code for Crowbar (Apache 2)
  2. Crowbar is an extension of Chef Server – it does not function stand alone and uses Chef’s APIs to store all its data.
  3. The OpenStack components install is managed by Chef cookbooks & recipes jointly developed by Dell, Opscode and Rackspace.
  4. Crowbar can be used to simply bootstrap your data center; however, we believe it is the start of a cloud operational model that I described in the hyperscale cloud white paper.

LIVE DEMO (video via Barton George): If you’re at SXSW on 3/14 @ 2pm in Kung Fu Saloon, you can ask Greg Althaus to explain it – he does a better job than I do.

Here’s what you need to know to understand Crowbar:

Crowbar is a PXE state machine.

The primary function of Crowbar is to get new hardware into a state where it can be managed by Chef.   To get hardware into a “Chef Ready” state, there are several steps that must be performed.  We need to setup the BIOS, RAID, figure out where the server is racked, install an operating system, assign IP networking and names, synchronize clocks (NTP) and setup a chef client linked to our server.  That’s a lot of steps!

In order to do these steps, we need to boot the server through a series of controlled images (stages) and track the progress through each state.  That means that each state corresponds to a PXE boot image.  The images have a simple script that uses WGET to update the Crowbar server (which stores its data in Chef) when the script completes.  When a state is finished, Crowbar will change the PXE server to provide the next image in the sequence.
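The flow described above can be sketched in a few lines of Python.  This is not Crowbar’s code: the stage names, the `.img` naming, and the class and method names are all invented for illustration, and the real state data lives in Chef rather than in a Python dictionary.

```python
from typing import Dict, List, Optional

# Hypothetical stage names; Crowbar's real states are not listed in this post.
STAGES: List[str] = ["discovery", "bios", "raid", "os_install"]
FINAL_STATE = "chef_ready"


class PxeStateMachine:
    """Tracks which boot image each node should receive next (a sketch)."""

    def __init__(self, stages: List[str]) -> None:
        self.stages = list(stages)
        self.position: Dict[str, int] = {}

    def register(self, node: str) -> None:
        """A newly seen node starts at the first stage."""
        self.position[node] = 0

    def current_state(self, node: str) -> str:
        index = self.position[node]
        return self.stages[index] if index < len(self.stages) else FINAL_STATE

    def next_boot_image(self, node: str) -> Optional[str]:
        """Image the PXE server should hand out; None once the node is done."""
        state = self.current_state(node)
        return None if state == FINAL_STATE else f"{state}.img"

    def report_complete(self, node: str, state: str) -> str:
        """Called when a node's script reports in (the WGET step in the post)."""
        current = self.current_state(node)
        if current == FINAL_STATE or state != current:
            raise ValueError(f"{node} reported {state!r} but is in {current!r}")
        self.position[node] += 1
        return self.current_state(node)
```

Each reboot asks `next_boot_image` what to load, and each completed script calls `report_complete` to move the node one step closer to being “Chef Ready.”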

During the Crowbar managed part of the install, the servers will reboot several times.  Once all of the hardware configuration is complete, Crowbar will use an operating system install image to create the base configuration.  For the first release, we are only planning to have a single Operating System (Ubuntu 10.10); however, we expect to be adding more operating system options.

The current architecture of Crowbar (and the Chef Server that it extends) is to use a dedicated server in the system for administration.  Our default install adds PXE, DHCP, NTP, DNS, Nagios, & Ganglia to the admin server.  For small systems, you can use Chef to add other infrastructure capabilities to the admin server; unfortunately, adding components makes it harder to redeploy the components.  For dynamic configurations where you may want to rehearse deployments while building Chef recipes, we recommend installing other infrastructure services on the admin server.

Of course, the hardware configuration steps are vendor specific.  We had to make the state machine (stored in Chef data bags) configurable so that you can add or omit steps.  Since hardware config is slow, error prone and painful, we see this as a big value add.  Making it work for open source will depend on community participation.
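As a rough illustration of what “configurable” means here, this Python sketch builds a boot sequence from a default step list plus a JSON document that adds or omits steps.  Chef data bags hold JSON, but the `skip`/`add` layout, the step names and the function name below are my own invention, not Crowbar’s actual schema.

```python
import json
from typing import List

# Hypothetical default steps; real hardware steps are vendor specific.
DEFAULT_STEPS: List[str] = ["discovery", "bios", "raid", "os_install"]


def build_sequence(default_steps: List[str], config_json: str) -> List[str]:
    """Return the step sequence after applying a JSON override document.

    "skip" lists steps to omit; "add" lists {"step", "after"} entries that
    insert a new step right after an existing one. Raises ValueError if an
    "after" target is missing from the sequence (for example, it was skipped).
    """
    config = json.loads(config_json)
    skip = set(config.get("skip", []))
    steps = [step for step in default_steps if step not in skip]
    for extra in config.get("add", []):
        index = steps.index(extra["after"]) + 1
        steps.insert(index, extra["step"])
    return steps


# Example: a vendor with no RAID controller but an extra firmware step.
override = '{"skip": ["raid"], "add": [{"step": "firmware", "after": "bios"}]}'
print(build_sequence(DEFAULT_STEPS, override))
```

Keeping the sequence in data rather than in code is what lets different hardware vendors plug in their own steps without touching the state machine itself.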

Once Chef has control of the servers, you can use Chef (on the local Chef Server) to complete the OpenStack installation.  From there, you can continue to use Chef to deploy VMs into the environment.  Because Chef encourages a DevOps automation mindset, I believe there is a significant ROI to your investment in learning how this tool operates if you want to manage hyperscale clouds.

Crowbar effectively extends the reach of Chef earlier into the cloud management life cycle.

3/21 Note: Updated graphic to show WGET.

Demo Redux: OpenStack installer SXSW demo of Chef + Crowbar

If you missed the OpenStack installer demo at Cloud Connect Event then you’ll have another chance to see us go from bare iron to provisioning VMs in under 30 minutes at SXSW on Monday 3/14 from 2-4 pm at Kung Fu Saloon.

Note: Rackspace rented out the Kung Fu Saloon all day Monday, and are doing various events — from live webinars to a Scoble tweetup to a happy hour and more VIP after hours event.

The demo will be orchestrated by Greg Althaus from my team at Dell.  Greg is the primary architect for Crowbar and responsible for some of its amazing capabilities including the Chef integrations, network discovery and rockin’ PXE state machine.  Dell Cloud Evangelist, Barton George, will also be on hand.

Of course, our friends from Opscode & Rackspace will be there too – this is Rackspace’s party (they are a Platinum SXSW sponsor)

For more information (outside of this blog, of course), check out http://www.Dell.com/OpenStack.

Notes from 2011 Cloud Connect Event Day 2 (#ccevent)

With the OpenStack launch behind me, I have some time to attend the Cloud Connect Event.  I missed all the DevOps sessions, but was getting to geek out on the NoSQL & Big Data sessions.   I jumped to the private cloud track (based on Twitter traffic) and was rewarded for the shift.

I’m surprised at how much of this cloud conference is dedicated to private cloud.  At other cloud conferences I’ve attended, the focus has been on learning how to use the cloud (specifically the public cloud).  This is the first cloud show I’ve attended that has so much emphasis, dialog and vendor feeding around private.  This was a suits & slacks show with few jeans, t-shirts, and pony tails.  Perhaps private cloud is where the $$$ is being spent now?

It definitely feels like using cloud has become assumed, but the best practices and tools are just emerging.

The twitter #ccevent stream is interesting but temporal.  I’m posting my raw (spelling optional) notes (below the more tag) because there is a lot of great content from the show to support and extend the twitter stream.  I’ll try to italicize some of the better lines.
