OpenStack is caught in a snowstorm – it’s status quo for ops implementations to be snowflakes

OpenStack got into exactly the place we expected: operations started with fragmented and divergent data centers (aka snowflaked) and OpenStack did nothing to change that. Can we fix that? Yes, but the answer involves relying on Amazon as our benchmark.

In advance of my OpenStack Summit Demo/Presentation (video!) [slides], I’ve spent the last few weeks mapping seven (and counting) OpenStack implementations into the cloud provider subsystem of the Digital Rebar provisioning platform. Before I started working on adding OpenStack integration, RackN had already created a hybrid DevOps baseline. We are able to run the same Kubernetes and Docker Swarm provisioning extensions on multiple targets including Amazon, Google, Packet and directly on physical systems (aka metal).

Before we talk about OpenStack challenges, it’s important to understand that data centers and clouds are messy, heterogeneous environments.

These variations are so significant and operationally challenging that they are the fundamental design driver for Digital Rebar. The platform uses a composable operational approach to isolate and then chain automation tasks together. That allows configurations, like networking, from infrastructure specific functions to be passed into common building blocks without user intervention.

Composability is critical because it allows operators to isolate variations into modular pieces and then expose common configuration elements. Since the pattern works successfully for crossing other clouds and metal, I anticipated success with OpenStack.
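To make the composable pattern concrete, here is a minimal sketch in Python. It is not Digital Rebar code; the adapter names, the raw input keys and the `install_docker` step are all hypothetical, chosen only to show infrastructure-specific pieces feeding a common building block.

```python
from typing import Callable, Dict

# Hypothetical adapters: each one isolates the quirks of a single
# infrastructure and returns the same normalized keys.
def from_cloud(raw: Dict[str, str]) -> Dict[str, str]:
    return {"address": raw["public_ip"], "user": "cloud-user"}

def from_metal(raw: Dict[str, str]) -> Dict[str, str]:
    return {"address": raw["discovered_ip"], "user": "root"}

ADAPTERS: Dict[str, Callable[[Dict[str, str]], Dict[str, str]]] = {
    "cloud": from_cloud,
    "metal": from_metal,
}

def install_docker(node: Dict[str, str]) -> str:
    # Common building block: it only ever sees the normalized keys,
    # so it does not care where the node came from.
    return f"ssh {node['user']}@{node['address']} 'install docker'"

def provision(provider: str, raw: Dict[str, str]) -> str:
    # Chain the infrastructure-specific step into the common step.
    node = ADAPTERS[provider](raw)
    return install_docker(node)
```

Adding a new target in this sketch means writing one more adapter; the common steps stay untouched.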

The challenge is that there is not “one standard OpenStack” implementation.  This issue is well documented under OpenStack as Project Shade.

If you only plan to operate a mono-cloud then these are not concerns; however, everyone I’ve met is using at least AWS and one other cloud. This operational fact means that AWS provides the common service behavior baseline. This is not an API statement – it’s about being able to operate on the systems delivered by the API.

While the OpenStack API worked consistently on each tested cloud (win for DefCore!), it frequently delivered systems that could not be deployed or were unusable for later steps.

While these are not directly OpenStack API concerns, I do believe that additional metadata in the API could help expose material configuration choices. The challenge becomes defining those choices in a reference architecture way. The OpenStack principle of leaving implementation choices open makes it challenging to drive these options to a narrow set of choices. Unfortunately, it means it is difficult to create an intra-OpenStack hybrid automation without hard-coded vendor identities or exploding configuration flags.

A series of individually reasonable options dominoes together to create these challenges.  These are real issues that made the integration difficult.

  • No default of externally accessible systems. I have to assign floating IPs (an anti-pattern for individual VMs) or be on the internal networks. No consistent naming pattern for networks, types (flavors) or starting images.  In several cases, the “private” network is the publicly accessible one and the “external” network is visible but unusable.
  • No consistent naming for access user accounts.  If I want to ssh to a system, I have to fail my first login before I learn the right user name.
  • No data to determine which networks provide which functions.  And there’s no metadata about which networks are public or private.  
  • Incomplete post-provisioning processes because they are left open to user customization.
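The workaround for the variations listed above tends to look like the sketch below: a hard-coded profile per OpenStack cloud. This is an illustration only; the cloud names, network names and login users are invented, and `can_login` stands in for a real ssh attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class CloudProfile:
    """Per-vendor facts that the API does not expose (all values hypothetical)."""
    name: str
    public_network: str          # which network is actually reachable
    needs_floating_ip: bool      # whether a VM is reachable without one
    login_users: List[str] = field(
        default_factory=lambda: ["ubuntu", "centos", "root"]
    )

PROFILES: Dict[str, CloudProfile] = {
    # Here the "private" network is the publicly accessible one.
    "cloud-a": CloudProfile("cloud-a", public_network="private",
                            needs_floating_ip=False),
    "cloud-b": CloudProfile("cloud-b", public_network="ext-net",
                            needs_floating_ip=True,
                            login_users=["cloud-user"]),
}

def first_working_user(profile: CloudProfile,
                       can_login: Callable[[str], bool]) -> Optional[str]:
    # Fail logins until one works, which is what the operator does by hand today.
    for user in profile.login_users:
        if can_login(user):
            return user
    return None
```

Every new cloud adds another entry to `PROFILES`, which is exactly the "hard-coded vendor identities or exploding configuration flags" problem.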

There is a defensible and logical reason for each example above; sadly, those reasons do nothing to make OpenStack more operationally accessible.  While intra-OpenStack interoperability is helpful, I believe that ecosystems and users benefit from Amazon-like behavior.

What should you do?  Help broaden the OpenStack discussions to seek interoperability with the whole cloud ecosystem.

 

At RackN, we will continue to refine and adapt to these variations.  Creating a consistent experience that copes with variability is the raison d’etre for our efforts with Digital Rebar. That means that we ultimately use AWS as the yardstick for configuration of any infrastructure from physical, OpenStack and even Amazon!

 

To avoid echo chamber, OpenStack must embrace competitive cloud ecosystem

[Photo: Japanese Bullet Train View]

I was in Japan before the Tokyo summit on a bullet train to Kyoto watching the mix of heavy industry and bucolic mountains pass by. That scene reflects an OpenStack duality: we want to be both a dominant platform delivering core cloud services and an open source values driven collective.

First, I fundamentally believe in the success of OpenStack as the open virtual infrastructure management platform.

I believe that we have solved the virtual compute/storage/network problem sufficiently to become the de facto open IaaS platform. While not perfect, the technologies are sufficient assuming we continue to improve ease of use and operational hardening. Pursuing that base capability is my primary motivation for DefCore work.

I don’t believe that the OpenStack community is, or should try to become, the authority on “all things cloud.”

In the presence of Amazon, VMware, Microsoft and Google, we cannot make that claim with any degree of self-respect. Even newcomers like DigitalOcean have an undeniable footprint and influence. Those vendor platforms drive cloud ecosystems and technologies which foster fast innovation because there is no friction to joining their ecosystems and they are large and stable enough to represent a target market. We’ve seen clear signs from Rackspace, HP and others that platform diversity improves cloud strength.

I continue to think we (OpenStack) spend too much time evaluating what is “in” or “out” of the project and too little time talking about what’s “on,” “under” and “with” the project like Kubernetes, Mesos, Docker, SDN, Hadoop and Ceph. That type of thinking creates distance between OpenStack efforts and the majority of the market.

What motivates the drive to an all open captive community? It’s the reasonable concern that critical parts of the infrastructure will become pay-to-play. For example, what if a non-OpenStack alternative to Heat Orchestration gained popularity with OpenStack implementers, perhaps something that also ran on Amazon? That would create external pressure that would drive internal priorities. These “non-OpenStack” products would then have influence without having to contribute back to upstream.

Can we afford to have external entities driving internal priorities? Hell yes, that’s what customer adoption looks like.

OpenStack does not own the market sufficiently to create a cloud echo chamber. The next wave of cloud innovation (my money is on container platforms) will follow the path of least resistance and widest adoption. We need to embrace that these innovations will not all be inside our community so that we can welcome them as part of our ecosystem. The community needs to find peace with that.

The Upstream Imperative: paving the way for content creators is required for platform success

Since content is king, platform companies (like Google, Microsoft, Twitter, Facebook and Amazon) win by attracting developers to build on their services.  Open source tooling and frameworks are the critical interfaces for these adopters; consequently, they must invest in building communities around those platforms even if it means open sourcing previously internal only tools.

This post expands on one of my OSCON observations: companies who write lots of code have discovered an imperative to upstream their internal projects.   For background, review my thoughts about open source and supply chain management.

Huh?  What is an “upstream imperative?”  If it sounds like what salmon do during spawning, then read the post-script!

Historically, companies with a lot of internal development tools had no incentive to open those projects.  In fact, the “collaboration tax” of open source discouraged companies from sharing code for essential operations.   Open source was also considered less featured and slower than commercial or internal projects; however, this perception has been totally shattered.  So companies are faced with a balance between the overhead of supporting external needs (aka collaboration) and the innovation those users bring into the effort.

Until recently, this balance usually tipped towards opening a project but under-investing in the community to keep the collaboration costs low.  The change I saw at OSCON is that companies understand that making open projects successful brings communities closer to their products and services.

That’s a huge boon to the overall technology community.

Being able to leverage and extend tools that have been proven by these internal teams strengthens and accelerates everyone. These communities act as free laboratories that breed new platforms and build deep relationships with critical influencers.  The upstream savvy companies see returns from both innovation around their tools and more content that’s well matched to their platforms.

Oh, and companies that fail to upstream will find it increasingly hard to attract critical mind share.  Thinking through the alternatives gives us a Windows into how open source impacts past incumbents.

That leads to a future post about XaaS dog fooding and “pure-play” aaS projects like OpenStack and CloudFoundry.

Post Script about Upstreaming:


The unexpected openness of OpenStack: why it’s important to learn from others’ operations experience.

During the OpenStack Design Conference, Forrester’s James Staten (@Staten7) raved about OpenStack’s transparency compared to AWS.  Within the enclave of OpenStack fan boys, er, supporters (Dell alone sent >14 people to the summit), his post drew considerable attention but did little to really further the value proposition.

“Open deployments” are a much more significant value to implementors than transparency from open source code.

For any technology solution, there are significant challenges that will only be understood when the system is under stress.  In some cases, these challenges are code defects; however, many will be related to configuration and deployment choices that are site specific.  It is correcting these issues that results in design patterns and practices that create a robust infrastructure; consequently, the process of hardening a solution is critical to its ultimate stability and success.

When a solution, like AWS, is deployed and managed by a single entity, it is extremely rare for operational lessons learned and best practices to make it to the larger community.  Amazon’s recent post mortem is a welcome exception.   This is not a bad thing (Roman Stanek’s contrasting point); it is just the reality of a proprietary cloud.  AWS operates as a black box and I don’t believe that Amazon’s operational experience would be relevant to others unless they were also operationally transparent.

While it makes business sense to remain operationally opaque, service providers lose the benefit of external lessons learned when there is no community working in parallel with them.

OpenStack’s community has an opportunity to iterate on CloudOps patterns and practices at a dramatically faster rate than any single provider.  This creates distinct value for OpenStack adopters because they can shorten or eliminate their own challenges because other adopters will have the same pains and benefit from the same fixes.

It is critical to understand that the benefit is conferred to both the party sharing the problem (they get advice and support) and the party lending assistance (they avoid the problem).  This is distinctly different from proprietary clouds where sharing is likely to cause embarrassment and unlikely to create helpful outcomes.

I am not advocating that all OpenStack deployments be the same or follow prescriptive patterns.

I believe that each installation will be unique in some way; however, there will  be enough commonalities and shared code to make sharing worthwhile.  This is especially true for adopters who start with tools like Crowbar that leverage community based Chef Recipes and automating scripts.  Tools that encourage automation and shared scripts help accelerate the establishment of robust deployment patterns and practices.

Ultimately, the ability to collaborate on cloud operation practice does more to strengthen OpenStack than developers, code reviews or corporate endorsements.

BlackOps: 7 tenets for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale

[Figure: Traditional IT Ops]

In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list just behind world dominance, er, peace.  Basically, a cloud taxonomy is a layer cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch.  Unfortunately, our clock-like layer cake has evolved into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.

The key to de-spaghettifying our cloud taxonomy was to realize that clouds have two distinct sides: an external well-known API and internal “black box” operations.  Each side has different objectives that together create an elastic, scalable cloud.

The objective of the API side is to provide the smallest usable “surface area” for cloud consumers.  Surface area describes the scope of the interface that is published to the users.  The smaller the area, the easier it is for users to comprehend and the harder it is for them to break.  Amazon’s EC2 & S3 APIs set the standards for small surface area design and spawned a huge cloud ecosystem.

[Figure: Hyperscale Cloud (APIs!)]

To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem.  The cloud ecosystem exists primarily beyond the API of the cloud.  It provides users with flexible options that address their specific use cases.  Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.

The objective of the internal side of clouds is to create a perfect black box to give API users the illusion of a perfectly performing, strictly partitioned and totally elastic resource pool.  To the consumer, it should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it does matter a great deal to the cloud operator.

Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps). 

[Figure: Cloud APIs spawn Ecosystems]

The BlackOps challenge is that clouds cannot wait until all of the answers are known because issues (or solutions) to scale architecture are difficult to predict in advance.  Even worse, taking the time to solve them in advance likely means that you will miss the market.

Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale.  To cope with constant changes, BlackOps live by seven tenets that help manage their infrastructure efficiently in a dynamic environment.

  1. Operational ownership – don’t wait for all the king’s horses and consultants to put you back together again (but asking for help is OK).
  2. Simple APIs – reduce the ways that consumers can stress the system making the scale challenges more predictable.
  3. Efficiency based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
  4. Automated processes & verification – ensures that changes and fixes can propagate at scale while errors are self-correcting.
  5. Frequent incremental rolling adjustments – prevents the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”)
  6. Passion for operational simplicity – at hyperscale, technical debt compounds very quickly.  Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
  7. Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work.  You must catch problems early before they infect your scale infrastructure.  The only thing more frustrating than fixing a problem at scale is fixing the same problem multiple times.

It’s no surprise that these are exactly the Agile & Lean principles.  The pace of change of cloud is so fast and fluid, that BlackOps must use an operational model that embraces iterative and rolling deployment.

Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.

I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies.  BlackOps is a highly disciplined process based on well understood principles from just-in-time (JIT) and lean manufacturing.  Best of all, they are fast to market, able to deliver high quality and capable of responding to change.

Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project.  I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.

Seattle Cloud Camp, Dec 2010

While I was in Seattle for Azure training preparing for Dell’s Azure Appliance, Dave @McCrory suggested that we also attend the Seattle Cloud Camp (SCC Tweets).  This event was very well attended (200 people!).  With heavy attendance by Amazon (at their HQ), Microsoft (in the ‘hood), and Google, there was a substantial cloud vendor presence (>25% from those vendors alone).  Notable omission: VMware.

My reflection about the event by segment.

Opening Sessions:

  • Most of the opening sessions were too light for the audience.  I thought we were past the “what is cloud” level, sigh.
  • Of note, the Amazon security presentation by Steve Riley was fun and entertaining.
  • Picking on a Dell competitor specifically: calling your cloud solution “WAS” is a branding #fail (not that DCSWA is much better).

Unpanel of self-appointed cloud extroverts, er, experts:

  • The unpanel covered some decent topics (@adronbh captured them on Twitter); unfortunately, none of the answers really stood out to me.  Except for NoSQL.
  • The unpanel discussion about NoSQL drew 2 answers.  1) It’s not NoSQL, it’s eventually consistent instead of strictly consistent.  (note: I’ve been calling it “Storage++”) 2) We’ll see more and more choices in this area as we tune the models for utility then we’ll see some consolidation.  The suggestion was that NoSQL would follow the same explosion/contraction pattern of SQL databases.

Session on Cloud APIs (my suggested topic)

  • The Cloud API topic was well attended (30+).  The overwhelming majority of the attendees were using Amazon.
  • There was some interest in having “standard” APIs for cloud functions, but the idea was not well received because it was felt to stifle innovation.  We are still too early.
  • It was postulated but not generally agreed that cloud aggregation (DeltaCloud, RightScale, etc) is workable.  This was considered a reason to not require standard clouds.
  • CloudCamp sponsor, Skytap, has their own API.  These APIs are value added and provide extra abstraction levels.
  • It was said that there are a LOT (50 now, 500 soon) of smaller hosts that want to enter the cloud space.  These hosts will need an API – some are inventing their own.
  • I brought up the concept discussed at OpenStack that the logical abstraction for cloud network APIs is a “vlan.”  This created confusion because some thought that I meant actual 802.1q tags.  NO!  I just meant that it was the ABSTRACTION of a VLAN connecting VMs together.
  • There was agreement from the clouderati in the room that cloud networking was f’ed up, but most people were not ready to discuss.
  • Cloud APIs have some basics that are working (semantics around VMs) but still have lots of holes (notably: networking, application, services, and identity).

Session on Google App Engine (GAE)

  • GAE has got a lot going on, especially in the social/mobile space.
  • Do not think a lack of news about GAE means that they are going slow, it’s just the opposite.  It looks like they are totally kicking ass with a very focused strategy.  I suspect that they are just waiting for the market to catch-up.
  • GAE understands what a “platform” really is.  They talk about their platform as the SERVICES that they are offering.  The code is just code.  The services are impressive and include identity, mail, analysis, SQL (business only), map (as in Map-Reduce), prediction (yes, prediction!), storage, etc.  The total list was nearly 20 distinct services.
  • GAE compared themselves to Azure, not Amazon.

VM != Cloud! Comparison draws ire, misses point

Having the requirement, er, benefit of working with both Dave McCrory and Joyent on a daily basis at Dell, I cannot resist weighing in on the blog pong between them.

Dave’s post comparing VM pricing prompted Joyent to blog that VMs are not the only measure of cloud.

While I completely agree that clouds are not all about VMs, I think that Joyent is too limited in their definition of cloud in their reply.  We’re seeing an emergence of services as the differentiator between clouds.

Looking at Amazon, Azure, and Google, the clear way to reduce cloud spend is to migrate applications to consume their services (SQL, Storage, Bus, etc).

If cloud users are primarily concerned about price per hour (which I’m not convinced is the case) then they have real motivation to migrate from purely VM (or SmartMachine(tm) ) based applications to ones that use services.

PaaS, much ado about network services

There’s a surprising amount of hair pulling regarding IaaS vs PaaS.  People in the industry get into shouting matches about this topic as if it mattered more than Lindsay Lohan’s journey through rehab.

The cold hard reality is that while pundits are busy writing XaaS white papers, developers are off just writing software.  We are writing software that fits within cloud environments (weak SLA, small VMs), saves money (hosted data instead of data in VMs), and changes quickly (interpreted languages).  We’re doing it using an expanding tool kit of networked components like databases, object stores, shared cache, message queue, etc.

Using network components in an application architecture is about as novel as building houses made of bricks.  So, what makes cloud architectures any better or different?

Nothing!  There is no difference if you buy VMs, install services, and wire together your application in its own little cloud bubble.  If I wanted to bait trolls, I’d call that an IaaS deployment.

However, there’s an emerging economic driver to leverage lower cost and more elastic infrastructure by using services provided by hosts rather than standing them up in a VM.  These services replace dedicated infrastructure with managed network attached services and they have become a key differentiator for all the cloud vendors:

  • At Google App Engine, they include Big Tables, Queues, MemCache, etc
  • At Microsoft Azure, they include SQL Azure, Azure Storage, AppFabric, etc
  • At Amazon AWS, they include S3, SimpleDB, RDS (MySQL), Queue & Notify, etc

Using these services allows developers to focus on the business problems we are solving instead of building out infrastructure to run our applications.  We also save money because consuming an elastic managed network service is less expensive (and more consumption based) than standing up dedicated VMs to operate the services.

Ultimately, an application can be written as stateless code (really “externalized state” is a more accurate description) that relies on these services for persistence.  If a host were to dynamically instantiate instances of that code based on incoming requests then my application resource requirements would become strictly consumption based.   I would describe that as true cloud architecture.
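As a rough sketch of what “externalized state” means in code (Python here; the `KeyValueStore` interface and `InMemoryStore` stand-in are hypothetical, not any vendor’s API), the handler below keeps nothing between calls and leans entirely on a store that would be a hosted service in practice:

```python
from typing import Dict, Protocol

class KeyValueStore(Protocol):
    """Assumed shape of a managed network service used for persistence."""
    def get(self, key: str, default: int = 0) -> int: ...
    def put(self, key: str, value: int) -> None: ...

class InMemoryStore:
    """Local stand-in so the sketch runs without a real hosted service."""
    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str, default: int = 0) -> int:
        return self._data.get(key, default)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value

def handle_visit(store: KeyValueStore, user: str) -> int:
    # The handler holds no state of its own: every call reads from and
    # writes to the external store, so any instance can serve any request.
    count = store.get(user) + 1
    store.put(user, count)
    return count
```

Because `handle_visit` remembers nothing, a host could start or stop copies of it per request, which is the consumption-based model described above.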

On a bold day, I would even consider an environment that offered that architecture to be a platform.  Some may even dare to describe that as a PaaS; however, I think it’s a mistake to look to the service offering for the definition when it’s driven by the application designers’ decisions to use network services.

While we argue about PaaS vs IaaS, developers are just doing what they need.  Today they may stand-up their own services and tomorrow they incorporate 3rd party managed services.  The choice is not a binary switch, a layer cake, or a holy war.

The choice is about choosing the most cost effective and scalable resource model.

API vs. API: How Amazon EC2 kicks VMware, RackSpace, and Microsoft

My day job is to try and choose and influence Cloud technologies so it’s no surprise to hear different vendors pitching why their cloud API is more open, standards based, or performant.  They have convincing yet irrelevant arguments: the primary measure of a cloud API is the size of its ecosystem.

The API’s ecosystem is the number (and vitality) of the upstream partners, SaaS services, PaaS vendors, and ISVs that have built their business on top of that API.  The fundamental truth of this model, like all ad hoc IT standards, is that success is built on business traction, not on technical merit or endorsement by standards bodies.

So which Cloud API will be the winner?  We’re just rounding the first turn and Amazon is ahead.  Let’s look at the lead fillies:

  • Amazon EC2/S3 has the clear leadership.  Their API is widely copied (without clear license to do so!), includes storage and their billing model is highly innovative.
  • Microsoft Azure is making a big push.  Windows continues to dominate as a platform and their SQL cloud helps address application porting.  In addition, their PaaS integration provides a forward migration.
  • VMware vCloud has taken the high road through the official standards bodies.  VMware dominates the private cloud space and their vCenter API represents a larger ecosystem than any other virtualization API.   This ecosystem guarantees that vCloud will be widely adopted, but whether they can cross over into public clouds is fuzzier.
  • RackSpace has an interesting position by offering both dedicated and shared hosting.  Their service and API have been around for a long time.  They have just not created the buzz that Amazon gets.  They could be a swing vote depending on their future decisions around Cloud APIs.

But maybe we don’t have to pick the winner!  Perhaps there’s an option for a trifecta bet where we don’t have to pick a single winner.  This scenario of building a multi-API abstraction layer is getting a lot of interest and creating a lot of value.  Vendors include RightScale, DeltaCloud (was RedHat, now Apache), and jclouds.

Right now, I’m sitting in the Delta Cloud session at RedHat Summit/JBoss World.  One of my concerns about API aggregation is that the API abstraction has to be either least common denominator (LCD) or have strange exceptions.  For example, the speaker is saying that approaches to Firewalls are very different or completely missing.  This creates a serious aggravation for aggregation:  does the API leave a gap, favor one API, or invent yet another way to solve the problem?
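The LCD trade-off is easy to see with a toy model. The sketch below uses made-up cloud names and capability labels (not any real vendor’s feature list or the Delta Cloud API): the aggregated API can only promise the intersection, and everything else becomes a gap or an exception.

```python
from typing import Dict, FrozenSet

# Hypothetical capability sets for three back-end cloud APIs.
CAPABILITIES: Dict[str, FrozenSet[str]] = {
    "cloud_x": frozenset({"create_vm", "delete_vm", "firewall", "snapshots"}),
    "cloud_y": frozenset({"create_vm", "delete_vm", "firewall"}),
    "cloud_z": frozenset({"create_vm", "delete_vm", "snapshots"}),
}

def least_common_denominator(caps: Dict[str, FrozenSet[str]]) -> FrozenSet[str]:
    # The only features an aggregation layer can offer on every back end.
    if not caps:
        return frozenset()
    return frozenset.intersection(*caps.values())

def gaps(caps: Dict[str, FrozenSet[str]]) -> Dict[str, FrozenSet[str]]:
    # What each back end can do that the aggregated API must either
    # drop or expose as a "strange exception".
    common = least_common_denominator(caps)
    return {name: features - common for name, features in caps.items()}
```

In this toy example, firewalls fall out of the common API as soon as a single back end lacks them, which is the aggravation the speaker described.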

I believe the cloud API race is not just a single horse race for the Cloud Computing Cup, it’s more like the Triple Crown.   The real winning API will cover compute, network, and storage management.   

Then again, accelerating PaaS adoption could make these IaaS Clouds into buggy whip manufacturers.

Disclosure:  My employer, Dell, is a partner with many of the companies listed above.

Network World on Ubuntu Cloud

My team at Dell is working on solutions around this cloud strategy.  I like the approach that Canonical & Eucalyptus are taking concerning the use of open source (KVM), ad hoc API standards (Amazon), and flexible storage configurations (DAS or SAN).

Looking at usage trends, stateless server designs (as we get closer to PaaS) will allow us to rethink how we architect hypervisor based clouds.  Of course, this requires us to rethink application architectures and the OS choices that we make to run them. 

Thanks to BartonGeorge.net for the link that got this thought started.  Network World says…

“Ubuntu Enterprise Cloud provides tight integration between Ubuntu and Eucalyptus and a series of CLI tools (made even more simple by apps like HybridFox which gives them a GUI) that follows along Amazon’s construction. Work done for Ubuntu Enterprise Cloud ends up being somewhat reusable if you’re transporting your work to Amazon.”