BlackOps: 7 tenants for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale

Traditional IT Ops

In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list just behind world dominance peace.  Basically, a cloud taxonomy is layer cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch.  Unfortunately, our clock-like layer cake has evolved to into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.

The key to de-spaghettifying our cloud taxomony was to realize that clouds have two distinct sides: an external well-known API and internal “black box” operations.  Each side has different objectives that together create an elastic, scalable cloud.

The objective of the API side is to provide the smallest usable “surface area” for cloud consumers.  Surface area describes the scope of the interface that is published to the users.  The smaller the area, the easier it is for users to comprehend and harder it is for them to break.  Amazon’s EC2 & S3 APIs set the standards for small surface area design and spawned a huge cloud ecosystem.

Hyperscale Cloud (APIs!)

To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem.  The cloud ecosystem exists primarily beyond the API of the cloud.  It provides users with flexible options that address their specific use cases.  Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.

The objective of the internal side of clouds is to create a prefect black box to give API users the illusion of a perfectly performing, strictly partitioned and totally elastic resource pool.  To the consumer, it does should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it does matter a great deal to the cloud operator. 

Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps). 

Cloud APIs spawn Ecosystems

The BlackOps challenge is that clouds cannot wait until all of the answers are known because issues (or solutions) to scale architecture are difficult to predict in advance.  Even worse, taking the time to solve them in advance likely means that you will miss the market.

Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale.  To cope with constant changes, BlackOps live by seven tenants that help manage their infrastructure efficiently in a dynamic environment.

  1. Operational ownership – don’t wait for all the king’s horses and consultants to put your back together again (but asking for help is OK).
  2. Simple APIs – reduce the ways that consumers can stress the system making the scale challenges more predictable.
  3. Efficiency based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
  4. Automated processes & verification – ensures that changes and fixes can propagate at scale while errors are self-correcting.
  5. Frequent incremental rolling adjustments – prevents the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”)
  6. Passion for operational simplicity – at hyperscale, technical debt compounds very quickly.  Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
  7. Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work.  You must catch problems early before they infect your scale infrastructure.  The only thing more frustrating than fixing a problem at scale, if fixing the same problem multiple times.

It’s no surprise that these are exactly the Agile & Lean principles.  The pace of change of cloud is so fast and fluid, that BlackOps must use an operational model that embraces iterative and rolling deployment.

Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.

I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies.  BlackOps is a highly disciplined process based on well understood principles from just-in-time (JIT) and lean manufacturing.  Best of all, they are fast to market, able to deliver high quality and capable of responding to change.

Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project.  I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.

Adding #11 to RightScale CEO’s Top 10 Cloud Myths

Generally, I think of a “Top 10 Cloud Myths” post as pure self-serving marketing fluffery, so I was pleasantly surprised to see Michael Crandal (RightScale’s CEO) producing a list with some substance.   Don’t get me wrong, the list is still a RightScale value prop 101.  It’s just that they have the good fortune to be addressing real problems and creating real value.

So, here’s my Myth #11 “We have to re-write our applications to run in the cloud.”  While that’s largely a myth; it may be a good myth to keep around because many applications SHOULD be rewritten – not for the cloud, but for changing usage patterns (more mobile users, more remote users, more SOA clients, etc)

API vs. API: How Amazon EC2 kicks VMware, RackSpace, and Microsoft

My day job is to try and choose and influence Cloud technologies so it’s no surprise when to hear different vendors pitching why their cloud API is more open, standards based, or performant.  They have convincing yet irrelevant arguments: the primary measure of a cloud API is the size of its ecosystem.

The API’s ecosystem is the number (and vitality) of the upstream partners, SaaS services, PaaS vendors, and ISVs that have built their business on top of that API.  The fundamental truth of this model, like all ad hoc IT standards, is that success is built on business traction, not on technical merit or endorsement by standards bodies.

So which Cloud API will be the winner?  We’re just rounding the first turn and Amazon is ahead.  Let’s look at the lead fillies

  • Amazon EC2/S3 has the clear leadership.  Their API is widely copied (without clear license to do so!), includes storage and their billing model is highly innovative.
  • Microsoft Azure is making a big push.  Windows continues to dominate as a platform and their SQL cloud helps address application porting.  In addition, their PaaS integration provides a forward migration.
  • VMware vCloud has taken to high road through the official standards bodies.  VMware dominates the private cloud space and their vCenter API represents a larger ecosystem than any other virtualization API.   This ecosystem guarantees that vCloud will be widely adopted but if they can cross over into public clouds is fuzzier.
  • RackSpace has an interesting position by offering both dedicated and shared hosting.  Their service and API have been along for a long time.  They have just not created the buzz that Amazon gets.  They could be a swing vote depending on their future decisions around Cloud APIs.

But maybe we don’t have to pick the winner!  Perhaps there’s an option for a trifecta bet where we don’t have to pick a single winner.  This scenario of building a multi-API abstraction layer is getting a lot of interest and creating a lot of value.  Vendors include RightScale, DeltaCloud (was RedHat, now Apache), and jCloud.

Right now, I’m sitting in the Delta Cloud session at RedHat Summit/JBoss World.  One of my concerns about API aggregation is that the API abstraction has to be either least common denominator (LCD) or have strange exceptions.  For example, the speaker is saying that approaches to Firewalls are very different or completely missing.  This creates a serious aggravation for aggregation:  does the API leave a gap, favor one API, or invent yet another way to solve the problem.

I believe the cloud API race is not just a single horse race for the Cloud Computing Cup, it’s more like the Triple Crown.   The real winning API will cover compute, network, and storage management.   

Then again, accelerating PaaS adoption could make these IaaS Clouds into buggy whip manufacturers.

Disclosure:  My employeer, Dell, is a partner with many of the companies listed above.