Post-OpenStack DefCore, I’m Chasing “open infrastructure” via cross-platform Interop

Like my previous DefCore interop windmill tilting, this is not something that can be done alone. Open infrastructure is a collaborative effort and I’m looking for your help and support. I believe solving this problem benefits us as an industry and individually as IT professionals.

So, what is open infrastructure? It’s not about running on open source software. It’s about creating platform choice and control. In my experience, that’s what defines open for users (and developers are not users).

I’ve spent several years helping lead OpenStack interoperability (aka DefCore) efforts to ensure that OpenStack cloud APIs are consistent between vendors. I strongly believe that effort is essential to build an ecosystem around the project; however, in talking to enterprise users, I’ve learned that their real interoperability gap is between the many platforms they use every day: AWS, Google, VMware, OpenStack and metal.

Instead of focusing inward to one platform, I believe the bigger enterprise need is to address automation across platforms. It is something I’m starting to call hybrid DevOps because it allows users to mix platforms, service APIs and tools.

Open infrastructure in that context is being able to work across platforms without being tied into one platform choice even when that platform is based on open source software. API duplication is not sufficient: the operational characteristics of each platform are different enough that we need a different abstraction approach.

We have to be able to compose automation in a way that tolerates substitution based on infrastructure characteristics. This is required for metal because of variation between hardware vendors and data center networking and services. It is equally essential for cloud because of variation between IaaS capabilities and service delivery models. Basically, those  minor  differences between clouds create significant challenges in interoperability at the operational level.
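
To make the composition idea concrete, here is a minimal, hypothetical sketch in plain Python (it is not Digital Rebar’s actual model; the platform names, traits and roles are invented): the workload roles stay constant while the provisioning step is substituted based on the target infrastructure’s characteristics.

```python
# Illustrative sketch only: composing a deployment plan from roles whose
# implementations are substituted per target. Not a real tool's API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Target:
    platform: str           # e.g. "aws", "openstack", "metal"
    has_dhcp: bool = True   # operational traits differ even when APIs look alike


# Each platform supplies its own provisioning step behind a common signature.
PROVISIONERS: Dict[str, Callable[[Target], str]] = {
    "aws": lambda t: "boot AMI with cloud-init",
    "openstack": lambda t: "boot Glance image with cloud-init",
    "metal": lambda t: "PXE boot" if t.has_dhcp else "virtual-media boot",
}


def compose(target: Target, roles: List[str]) -> List[str]:
    """Build a plan: platform-specific provisioning, then the shared roles."""
    plan = [PROVISIONERS[target.platform](target)]
    plan += [f"apply role '{r}' on {target.platform}" for r in roles]
    return plan


print(compose(Target("metal", has_dhcp=False), ["docker", "kubernetes"]))
# ['virtual-media boot', "apply role 'docker' on metal", "apply role 'kubernetes' on metal"]
```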

Rationalizing APIs does little to address these more structural differences.

The problem is compounded because the differences are not nicely segmented behind abstraction layers. If you work to build and sustain a fully integrated application, you must account for site-specific needs throughout your application stack, including networking, storage, access and security. I’ve described it this way: all deployments have 80% of the work in common, but the remaining 20% is mixed in with the 80% instead of being nicely layered. So, ops is cookie dough, not vinaigrette.

Getting past this problem for initial provisioning on a single platform is a false victory. The real need is portable and upgrade-ready automation that can be reused and shared. Critically, we also need to build upon the existing foundations instead of requiring a blank slate. There is openness value in heterogeneous infrastructure so we need to embrace variation and design accordingly.

This is the vision the RackN team has been working towards with the open source Digital Rebar project. We are now able to showcase workload deployments (Docker, Kubernetes, Ceph, etc.) on multiple cloud platforms that also translate to full bare metal deployments. Unlike previous generations of this tooling (some will remember Crowbar), we’ve been careful to avoid injecting external dependencies into the DevOps scripts.

While we’re able to demonstrate a high degree of portability (or fidelity) across multiple platforms, this is just the beginning. We are looking for users and collaborators who want to build open infrastructure from an operational perspective.

You are invited to join us in making open cross-platform operations a reality.

OpenStack Shared Community Values? Here’s my seven, let’s compare

The recent discussion about OpenStack API vs Implementation has led to several discussions about “OpenStack Values.”  While entertaining, they ultimately show that we have a lot of conflicting desires and opinions about the project.  In fact, the term “values” is itself hard to define.

Consequently, I wanted to try to capture what I see as OpenStack’s current values (not my personal ones for the project – those are in the post script). I’ve tried to put everything in positive terms, but value choices always have positive and negative impacts.

Here they are, ranked, with what each value provides and its possible downsides:

  1. Upstreaming. Provides: a shared code base and community effort. Possible downsides: “first in” wins; latest is favored over stability; value is measured in commits.
  2. Vendors taking initiative. Provides: broad participation; free marketing buzz. Possible downsides: no one wants to say no for fear of being seen as vendor-biased.
  3. End-to-end open source. Provides: no licenses required for scale users and developers. Possible downsides: build-don’t-buy wastes a lot of effort; does not align with users who want to pay for services.
  4. Developer leadership. Provides: lots of code being created. Possible downsides: not many user requirements being considered.
  5. Figuring out the API via implementation. Provides: fast iterations. Possible downsides: frustrating APIs; API deprecation.
  6. Passionate discussion. Provides: diversity of opinion; drama attracts attention. Possible downsides: an “unfriendly” community; the loudest voice wins; cross-culture challenges.
  7. Being able to contribute broadly. Provides: a generally maintainable platform. Possible downsides: de-emphasizes deep subject-matter skills and best-tool-for-the-job choices.

In my experience, if you don’t align with a community’s values then you’re going to be very unhappy in that community.  I’ve watched this happen to project founders as the community changed around them.  Let’s all RAGE QUIT!

So, this makes me reflect on my own open source values. I’d start with pragmatic utility, transparent action, principle-driven decisions, iterative design and data-driven decisions.

What do you value most in open communities?

APIs and Implementations collide at OpenStack Interop: The Oracle Zones vs VMs Debate

I strive to stay neutral as OpenStack DefCore co-chair; however, as someone asking for another Board term, it’s important to review my thinking so that you can make an informed voting decision.

DefCore, while always on the edge of controversy, recently became ground zero for the “what is OpenStack” debate [discussion write up]. My preferred small-core “it’s an IaaS product” answer is only one side. Others favor “it’s an open cloud community” while another faction champions an “open cloud platform.” I’m struggling to find a way for it to be all of these at the same time.

The TL;DR is that, today, OpenStack vendors are required to implement a system that can run Linux guests. This is an example of an implementation over API bias because there’s nothing in the API that drives that specific requirement.

From a pragmatic “get it done” perspective, OpenStack needs to remain implementation driven for now. That means that we care that “OpenStack” clouds run VMs.

While there are pragmatic reasons for this, I think that long term success will require OpenStack to become an API specification. So today’s “right answer” actually undermines the long term community value. This has been a long standing paradox in OpenStack.

Breaking the API to implementation link allows an ecosystem to grow with truly alternate implementations (not just plug-ins). This is a threat to the community “upstream first” mentality.  OpenStack needs to be confident enough in the quality and utility of the shared code base that it can allow competitive implementations. Open communities should not need walls to win but they do need clear API definition.

What is my posture for this specific issue?  It’s complicated.

First, I think that user and ecosystem expectations are being largely ignored in these discussions. Many of the controversial items here are vendor initiatives, not user needs. Right now, I’ve heard clearly that those expectations are for OpenStack to be an IaaS that runs VMs. OpenStack really needs to focus on delivering a reliably operable VM-based IaaS experience. Until that’s solid, the other efforts are vendor noise.

Second, I think that there are serious test gaps that jeopardize the standard. The fundamental premise of DefCore is that we can use the development tests for API and behavior validation. We chose this path instead of creating an independent test suite. We either need to address tests for interop within the current body of tests or discuss splitting the efforts. Both require more investment than we’ve been willing to make.
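
To make that premise concrete, here is a toy sketch in Python (the capability and test names are invented, and this is not the actual DefCore guideline schema): interop capabilities are defined as sets of upstream development tests, and a cloud satisfies a capability only if every mapped test passes.

```python
# Toy model of "capabilities defined by development tests" (names are made up).
REQUIRED_CAPABILITIES = {
    "compute-servers-create": {"test_create_server", "test_delete_server"},
    "identity-tokens": {"test_issue_token"},
}


def interop_report(passed_tests: set) -> dict:
    """Pass/fail per capability, given the set of tests a cloud passed."""
    return {cap: tests <= passed_tests for cap, tests in REQUIRED_CAPABILITIES.items()}


print(interop_report({"test_create_server", "test_issue_token"}))
# {'compute-servers-create': False, 'identity-tokens': True}
```

A test gap, in this framing, is a required behavior with no upstream test to map to it, so the capability cannot be expressed in the standard at all.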

We have mechanisms in place to collect data from test results and expand the test base.  Instead of creating new rules or guidelines, I think we can work within the current framework.

The simple answer would be to block non-VM implementations; however, I trust that cloud consumers will make good decisions when given sufficient information.  I think we need to fix the tests and accept non-VM clouds if they pass the corrected tests.

For this and other reasons, I want OpenStack vendors to be specific about the configurations that they test and support. We took steps to address this in DefCore last year but pulled back from being specific about requirements.  In this particular case, I believe we should require the official OpenStack vendor to state clear details about their supported implementation.  Customers will continue to vote with their wallets about which configuration details are important.

This is a complex issue and we need community input.  That means that we need to hear from you!  Here’s the TC Position and the DefCore Patch.

12 Predictions for ’16: mono-cloud ambitions die as containers drive more hybrid IT

I expect 2016 to be a confusing year for everyone in IT.  For 2015, I predicted that new uses for containers are going to upset cloud’s apple cart; however, the replacement paradigm is not clear yet.  Consequently, I’m doing a prognostication mix and match: five predictions and seven items on a “container technology watch list.”

TL;DR: In 2016, Hybrid IT arrives on Containers’ wings.

Considering my expectations below, I think it’s time to accept that all IT is heterogeneous and stop trying to box everything into a mono-cloud.  Accepting hybrid as current state unblocks many IT decisions that are waiting for things to settle down.

Here’s the memo: “Stop waiting.  It’s not going to converge.”

2016 Predictions

  1. Container Adoption Seen As Two Stages:  We will finally accept that Containers have strength for both infrastructure (first stage adoption) and application life-cycle (second stage adoption) transformation.  Stage one offers value so we will start talking about legacy migration into containers without shaming teams that are not also rewriting apps as immutable microservice unicorns.
  2. OpenStack continues to bump and grow.  Adoption is up and open alternatives are disappearing.  For dedicated/private IaaS, OpenStack will continue to gain in 2016 for basic VM management.  Both competitive and internal pressures continue to threaten the project, but I believe they will not come to a head in 2016.  Here’s my complete OpenStack 2016 post.
  3. Amazon, GCE and Azure make everything else questionable.  These services are so deep and rich that I’d question anyone who is not using them.  At least one of them simply has to be part of everyone’s IT strategy for financial, talent and technical reasons.
  4. Cloud API becomes irrelevant. Cloud API is so 2011!  There are now so many reasonable clients that abstract the various infrastructures that cloud APIs are less relevant (see the multi-cloud sketch after this list).  Capability, interoperability and consistency remain critical factors, but the APIs themselves are not interesting.
  5. Metal aaS gets interesting.  I’m a big believer in the power of operating metal via an API and the RackN team delivers it for private infrastructure using Digital Rebar.  Now there are several companies (Packet.net, Ubiquity Hosting and others) that offer hosted metal.
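
As an illustration of point 4, here is a short sketch using Apache Libcloud, one of several client libraries that hide provider APIs behind a common interface (credentials, regions and endpoints below are placeholders, and constructor details vary by Libcloud version):

```python
# Sketch: the same list_nodes() call works against different infrastructures,
# so the native cloud APIs never surface in the automation. Placeholders only.
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver


def list_all_nodes():
    clouds = [
        get_driver(Provider.EC2)("ACCESS_KEY", "SECRET_KEY", region="us-east-1"),
        get_driver(Provider.OPENSTACK)(
            "user", "password",
            ex_force_auth_url="https://keystone.example.com:5000",
            ex_force_service_region="RegionOne",
        ),
    ]
    # Identical calls per driver; provider-specific details stay behind the library.
    return {type(d).__name__: [n.name for n in d.list_nodes()] for d in clouds}
```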

2016 Container Tech Watch List

I’m planning posts about all these key container ecosystems for 2016.  I think they are all significant contributors to the emerging application life-cycle paradigm.

  1. Service Containers (& VMs): There’s an emerging pattern of infrastructure-managed containers that provide critical host services like networking, logging, and monitoring.  I believe this pattern will provide significant value and generate its own ecosystem.
  2. Networking & Storage Services: Gaps in networking and storage for containers need to get solved in a consistent way.  Expect a lot of thrash and innovation here.
  3. Container Orchestration Services: This is the current battleground for container mind share.  Kubernetes, Mesos and Docker Swarm get headlines but there are other interesting alternatives.
  4. Containers on Metal: Removing the virtualization layer reduces complexity, overhead and cost.  Container workloads are good choices to re-purpose older servers that have too little CPU or RAM to serve as VM hosts.  Who can say no to free infrastructure?!  While an obvious win to many, we’ll need to make progress on standardized scale and upgrade operations first.
  5. Immutable Infrastructure: Even as this term wins the “most confusing” concept in cloud award, it is an important one for container designers to understand.  The unfortunate naming paradox is that immutable infrastructure drives disciplines that allow fast turnover, better security and more dynamic management.
  6. Microservices: The latest generation of service oriented architecture (SOA) benefits from a new class of distributed service registration platforms (etcd and Consul) that bring new life into SOA (a minimal registration sketch follows this list).
  7. Paywall Registries: The importance of container registries is easy to overlook because they seem to be version 2.0 of package caches; however, container layering makes these services much more dynamic and central than many realize.  (more?  Bernard Golden and I already posted about this)
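
Here is the minimal registration sketch promised in item 6, using Consul’s HTTP agent API via Python requests (the service names, ports and health endpoint are hypothetical; etcd supports an analogous pattern through its key/value API):

```python
# Sketch of dynamic service registration and discovery against a local Consul
# agent. Names, ports and health-check URL are placeholders.
import requests

CONSUL = "http://127.0.0.1:8500"


def register(name: str, port: int) -> None:
    """Register a service instance with a simple HTTP health check."""
    payload = {
        "Name": name,
        "ID": f"{name}-{port}",
        "Port": port,
        "Check": {"HTTP": f"http://127.0.0.1:{port}/health", "Interval": "10s"},
    }
    requests.put(f"{CONSUL}/v1/agent/service/register", json=payload).raise_for_status()


def discover(name: str) -> list:
    """Return address:port pairs for healthy instances of a service."""
    entries = requests.get(
        f"{CONSUL}/v1/health/service/{name}", params={"passing": "true"}
    ).json()
    return [
        f'{e["Service"]["Address"] or e["Node"]["Address"]}:{e["Service"]["Port"]}'
        for e in entries
    ]
```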

What two items did not make the 2016 cut?  1) Special-purpose container-focused operating systems like CoreOS or RancherOS.  While interesting, I don’t think these deployment technologies have architecture-level influence.  2) Container security via VMs.  The claim that containers need a VM wrapper to be secure is largely FUD from people with a vested interest in virtualization; I’m seeing patterns where containers may actually be more secure than VMs.

Did I miss something? I’d love to know what you think I got right or wrong!

My OpenStack 2016 Analysis: Continue Core, Stop Confusing Ecosystem, Change Hybrid Approach

Note: I’ve served on the OpenStack Foundation board since its formation.  There I’ve led the “define the core” DefCore efforts.  I’m on the 2016 ballot for another term.

I love using end-of-year posts to reflect (2015, I got 6 of 7!) and try to set direction (OpenStack needed to prioritize).  This year, I wanted to use a simple “Continue, Stop, Change” format that I’ve used for employee reviews in the past.  These three items reflect how I think OpenStack needs to respond to the industry in 2016.

Continue: Focus on Core

OpenStack adoption continues around the legacy projects that traditionally define it for most users.  A lot of work and focus is needed around those projects including better representation of user, operator and product interests.

Towards that end, we’ve made amazing progress on DefCore implementation and I’m excited about the discussions that it’s been generating.  It’s driving pragmatic decisions about what is required (running a VM?) and how to verify compliance.  It’s also driving conceptual thinking around OpenStack principles and ecosystem priorities.

DefCore’s focus on using community tests to define OpenStack creates a very concrete and defensible standard.  Ultimately, it comes back to users and operators demanding compliance for the work to remain meaningful.

Overall, to focus on core function, OpenStack needs to empower new groups within the community.  Expanding the roles of the Product Group, Operators, and User Committee is key to giving a voice to these constituents.

OpenStack core must transition into a consistent platform or it risks becoming irrelevant.

Stop: Confusing The Ecosystem

I’m concerned that the “big tent” governance change puts OpenStack into conflict with both community vendors and the larger cloud market.  I believe we’re creating an echo chamber of OpenStack-on-OpenStack focus that forces adjacent efforts (like software-defined networking, storage and container orchestration) to be either inside or outside the community circle.  While that artificially grows the apparent contributor base, it creates artificial walls between OpenStack and the dominant cloud platforms.

Let me illustrate using my own company, RackN.  We create cross-platform DevOps orchestration based on an open source project, Digital Rebar.  We consider ourselves to be part of the OpenStack community and have supported deploying the core.  We also provision bare metal and deploy Kubernetes, Docker Swarm and Cloud Foundry.  That creates apparent conflicts with the big tent Ironic and Magnum projects.  Does that make RackN competitive with OpenStack or not?

It hurts OpenStack when competitive alignment is unclear because vendors, users and operators are uncertain about where to make investments.  In the end, users will choose simpler alternatives.

I believe the Board needs to define the OpenStack ecosystem strategy in a clear and actionable way.  If re-elected, that will be my Board priority for 2016.

Change: Hybrid Approach

My top 2016 prediction (post coming) is that we accept “hybrid IT as the new normal.”  That means that we stop driving towards an IT mono-culture and start working towards tools that embrace heterogeneity.  Along those lines, OpenStack needs to evaluate our relative position and strengths in a hybrid cloud landscape.

Interoperability between OpenStack implementations is important because it reduces friction; however, we need to expand our thinking to ensure interoperability with other platforms.  That does not mean simply cloning the AWS APIs!  It means that we need to consider users and operator needs against a spectrum of private and public infrastructures.

A broader hybrid approach also suggests that duplicating cloud-locked adjacent services (e.g. Cloud Formation vs. Heat) does not address user needs.

I am advocating that OpenStack encourage a cloud-neutral ecosystem, outside of the OpenStack tent, that works across a wide range of platforms.  That leads to user choice and creates a truly open platform.

And, of course, more Community Discussion!

I want to thank the many people who participated in a heated twitter discussion in advance of this post.  There are many great ideas and counter-points covered in that lengthy dialog.

Do you have an opinion about what OpenStack should stop, accelerate or change?  I’d love to hear it!

¡Sí, Sí! That’s a Two Hundred Node Metal Docker Swarm Deployment

Today, RackN and Ubiquity Hosting announced a 200 node Docker Swarm deployment on hosted bare metal.

Leveraging the current Digital Rebar core and the RackN Swarm workload, this reference deployment was automatically configured using the same components that also work on a desktop VM deployment. That high-fidelity deployment allows operators to start learning quickly on small systems, then grow to AWS and, if warranted, potentially transition smoothly to metal at scale.

This deployment represents RackN starting a new chapter with Digital Rebar because it demonstrates a commitment to deploy on any infrastructure: cloud, metal or something in between.

The RackN team started this journey with a “composable ops” vision that allows operators to mix and match. That spans both vendor physical resources and software components such as operating systems, software defined networking and platforms. In the 200 node Swarm cluster, physical infrastructure is provisioned by Ubiquity Hosting not Digital Rebar or RackN.  Historically, RackN focused on private infrastructure.  Now, users get the option of best-in-class metal deployment without having to own the infrastructure.

We experienced the futility of making Ops homogeneous and declared defeat.

Accepting that each data center has its own individual ops was pivotal. Digital Rebar embraces heterogeneity at the most fundamental architectural level. Our system approach and unique composable abstractions allow users to make deployments portable between any infrastructure with existing tooling and operational processes. Portability means that we can eliminate the fidelity gap both as we scale and between deployments.

When multiple scales and sites can share deployment automation, we can finally work together on addressing critical operational issues like scale, high availability and upgrades.

This 200 node deployment demonstrates more than scale and the deployment of the latest Docker technology. It is a milestone on the path toward sharable production operations.

Operators, they don’t want to swim Upstream

Nov 10, Palo Alto Operators Dinner

Last Tuesday, I had the honor of joining an OpenStack scale operators dinner. Foundation executives, Jonathan Bryce and Lauren Sell, were also on the guest list so talk naturally turned to “how can OpenStack better support operators.” Notably, the session was distinctly not OpenStack bashing.

The conversation was positive, enthusiastic and productive, but one thing was clear: the OpenStack default “we’ll fix it in the upstream” answer does not work for this group of operators.

What is upstreaming?  A sans nuance answer is that OpenStack drives fixes and changes in the next community release (longer description).  The project and community have a tremendous upstream imperative that pervades the culture so deeply that we take it for granted.  Have an issue with OpenStack?  Submit a patch!  Is there any other alternative?

Upstreaming [to trunk] makes perfect sense considering the project’s vendor structure and governance; however, it is a very frustrating experience for operators.  OpenStack does have robust processes to backport fixes and sustain past releases and documentation; yet, the feeling at the table was that they are not sufficiently operator-focused.

Operators want fast, incremental and pragmatic corrections to the code and docs they are deploying (which is often two releases back).  They want it within the community, not from individual vendors.

There are great reasons for focusing on upstream trunk.  It encourages vendors to collaborate and makes it much easier to add and expand the capabilities of the project.  Allowing independent activity on past releases creates a forward integration mess and could make upgrades even harder.  It will create divergence on APIs and implementation choices.

The risk of having a stable, independently sustained release is that operators have less reason to adopt the latest shiny release.  And that is EXACTLY what they are asking for.

Upstreaming is a core value to OpenStack and essential to our collaborative success; however, we need to consider that it is not the right answer to all questions.  Discussions at that dinner reinforced that pushing everything to latest trunk creates a significant barrier for OpenStack operators and users.

What are your experiences?  Is there a way to balance upstreaming with forking?  How can we better serve operators?