SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters

Originally posted on Kubernetes Blog.  I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early.  Please join us in this effort!

openWe think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project by weekly participation and collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

The Cluster Ops Special Interest Group meets weekly at 13:00PT on Thursdays, you can join us via the video hangout and see latest meeting notes for agendas and topics covered.

AWS Ops patterns set the standard: embrace that and accelerate

RackN creates infrastructure agnostic automation so you can run physical and cloud infrastructure with the same elastic operational patterns.  If you want to make infrastructure unimportant then your hybrid DevOps objective is simple:

Create multi-infrastructure Amazon equivalence for ops automation.

Ecosystem View of AWSEven if you are not an AWS fan, they are the universal yardstick (15 minute & 40 minute presos) That goes for other clouds (public and private) and for physical infrastructure too. Their footprint is simply so pervasive that you cannot ignore “works on AWS” as a need even if you don’t need to work on AWS.  Like PCs in the late-80s, we can use vendor competition to create user choice of infrastructure. That requires a baseline for equivalence between the choices. In the 90s, the Windows’ monopoly provided those APIs.

Why should you care about hybrid DevOps? As we increase operational portability, we empower users to make economic choices that foster innovation.  That’s valuable even for AWS locked users.

We’re not talking about “give me a VM” here! The real operational need is to build accessible, interconnected systems – what is sometimes called “the underlay.” It’s more about networking, configuration and credentials than simple compute resources. We need consistent ways to automate systems that can talk to each other and static services, have access to dependency repositories (code, mirrors and container hubs) and can establish trust with other systems and administrators.

These “post” provisioning tasks are sophisticated and complex. They cannot be statically predetermined. They must be handled dynamically based on the actual resource being allocated. Without automation, this process becomes manual, glacial and impossible to maintain. Does that sound like traditional IT?

Side Note on Containers: For many developers, we are adding platforms like Docker, Kubernetes and CloudFoundry, that do these integrations automatically for their part of the application stack. This is a tremendous benefit for their use-cases. Sadly, hiding the problem from one set of users does not eliminate it! The teams implementing and maintaining those platforms still have to deal with underlay complexity.

I am emphatically not looking for AWS API compatibility: we are talking about emulating their service implementation choices.  We have plenty of ways to abstract APIs. Ops is a post-API issue.

In fact, I believe that red herring leads us to a bad place where innovation is locked behind legacy APIs.  Steal APIs where it makes sense, but don’t blindly require them because it’s the layer under them where the real compatibility challenge lurk.  

Side Note on OpenStack APIs (why they diverge): Trying to implement AWS APIs without duplicating all their behaviors is more frustrating than a fresh API without the implied AWS contracts.  This is exactly the problem with OpenStack variation.  The APIs work but there is not a behavior contract behind them.

For example, transitioning to IPv6 is difficult to deliver because Amazon still relies on IPv4. That lack makes it impossible to create hybrid automation that leverages IPv6 because they won’t work on AWS. In my world, we had to disable default use of IPv6 in Digital Rebar when we added AWS. Another example? Amazon’s regional AMI pattern, thankfully, is not replicated by Google; however, their lack means there’s no consistent image naming pattern.  In my experience, a bad pattern is generally better than inconsistent implementations.

As market dominance drives us to benchmark on Amazon, we are stuck with the good, bad and ugly aspects of their service.

For very pragmatic reasons, even AWS automation is highly fragmented. There are a large and shifting number of distinct system identifiers (AMIs, regions, flavors) plus a range of user-configured choices (security groups, keys, networks). Even within a single provider, these options make impossible to maintain a generic automation process.  Since other providers logically model from AWS, we will continue to expect AWS like behaviors from them.  Variation from those norms adds effort.

Failure to follow AWS without clear reason and alternative path is frustrating to users.

Do you agree?  Join us with Digital Rebar creating real a hybrid operations platform.

Fast Talk: Creating Operating Environments that Span Clouds and Physical Infrastructures

This short 15-minute talk pulls together a few themes around composability that you’ll see in future blogs where I lay out the challenges and solutions for hybrid DevOps practices.  Like any DevOps concept – it’s a mix of technology, attitude (culture) and process.

Our hybrid DevOps objective is simple: We need multi-infrastructure Amazon equivalence for ops automation.

IT perspective of AWSHere’s the summary:

  • Hybrid Infrastructure is new normal
  • Amazon is the Ops benchmark
  • Embrace operations automation
  • Invest in making IT composable

 

Want to listen to it?  Here’s the voice over:

 

One Cloud, Many Providers: The OpenStack Interop Challenge

We want OpenStack to work as a universal cloud API but it’s hard!  What’s the problem? 

Clouds DownThis post, written before the Tokyo Summit but not published, talks about how we got here without a common standard and offers some pointers.  At the Austin Summit, I’ve got a talk on hybrid Open Infrastructure Wednesday @ 2:40 where I talk specifically about solutions.  I’ve been working on multi-infrastructure hybrid – that means making ops portable between OpenStack, Google, Amazon, Physical and other options.

The Problem: At a fundamental level, OpenStack has yet to decide if it’s an infrastructure (IaaS) product or a open software movement.

Is there a path to be both? There are many vendors who are eager to sell you their distinct flavor of OpenStack; however, lack of consistency has frustrated users and operators. OpenStack faces internal and external competition if we do not address this fragmentation. Over the next few paragraphs, we’ll explore the path the Foundation has planned to offer users a consistent core product while fostering its innovative community.

How did we get down this path?  Here’s some background how how we got here.

Before we can discuss interoperability (interop), we need to define success for OpenStack because interop is the means, not the end. My mark for success is when OpenStack has created a sustainable market for products that rely on the platform. In business plan speak, we’d call that a serviceable available market (SAM). In practical terms, OpenStack is successful when businesses targets the platform as the first priority when building integrations over cloud behemoths: Amazon and VMware.

The apparent dominance of OpenStack in terms of corporate contribution and brand position does not translate into automatic long term success.  

While apparently united under a single brand, intentional technical diversity in OpenStack has lead to incompatibilities between different public and private implementations. While some of these issues are accidents of miscommunication, others were created by structural choices inherent to the project’s formation. No matter the causes, they frustrate users and limit the network effect of widespread adoption.

Technical diversity was both a business imperative and design objective for OpenStack during formation.

In order to quickly achieve critical mass, the project needed to welcome a diverse and competitive set of corporate sponsors. The commitment to support operating systems, multiple hypervisors, storage platforms and networking providers has been essential to the project’s growth and popularity. Unfortunately, it also creates combinatorial complexity and political headaches.

With all those variables, it’s best to think of interop as a spectrum.  

At the top of that spectrum is basic API compatibility and the boom is fully integrated operation where an application could run site unaware in multiple clouds simultaneously.  Experience shows that basic API compatibility is not sufficient: there are significant behavioral impacts due to implementation details just below the API layer that must also be part of any interop requirement.  Variations like how IPs are assigned and machines are initialized matter to both users and tools.  Any effort to ensure consistency must go beyond simple API checks to validate that these behaviors are consistent.

OpenStack enforces interop using a process known as DefCore which vendors are required to follow in order to use the trademark “OpenStack” in their product name.

The process is test driven – vendors are required to pass a suite of capability tests defined in DefCore Guidelines to get Foundation approval. Guidelines are published on a 6 month cadence and cover only a “core” part of OpenStack that DefCore has defined as the required minimum set. Vendors are encouraged to add and extend beyond that base which then leads for DefCore to expand the core based on seeing widespread adoption.

What is DefCore?  Here’s some background about that too!

By design, DefCore started with a very small set of OpenStack functionality.  So small in fact, that there were critical missing pieces like networking APIs from the initial guideline.  The goal for DefCore is to work through the coabout mmunity process to expand based identified best practices and needed capabilities.  Since OpenStack encourages variation, there will be times when we have to either accept or limit variation.  Like any shared requirements guideline, DefCore becomes a multi-vendor contract between the project and its users.

Can this work?  The reality is that Foundation enforcement of the Brand using DefCore is really a very weak lever. The real power of DefCore comes when customers use it to select vendors.

Your responsibility in this process is to demand compliance from your vendors. OpenStack interoperability efforts will fail if we rely on the Foundation to enforce compliance because it’s simply too late at that point. Think of the healthy multi-vendor WiFi environment: vendors often introduce products on preliminary specifications to get ahead of market. For success, OpenStack vendors also need to be racing to comply with the upcoming guidelines. Only customers can create that type of pressure.

From that perspective, OpenStack will only be as interoperable as the market demands.

That creates a difficult conundrum: without common standards, there’s no market and OpenStack will simply become vertical vendor islands with limited reach. Success requires putting shared interests ahead of product.

That brings us full circle: does OpenStack need to be both a product and a community?  Yes, it clearly does.  

The significant distinction for interop is that we are talking about a focus on the user community voice over the vendor and developer community.  For that, OpenStack needs to focus on product definition to grow users.

I want to thank Egle Sigler and Shamail Tahir for their early review of this post.  Even beyond the specific content, they have helped shape my views on this topics.  Now, I’d like to hear your thoughts about this!  We need to work together to address Interoperability – it’s a community thing.

OpenStack Brief Histories: Austin 2011 and DefCore

These two short items are sidebars for my “One Cloud, Many Providers: The OpenStack Interp Challenge” post.  They provide additional context for the more focused question in the post: “At a fundamental level, OpenStack has yet to decide if it’s an infrastructure product or a open software movement. Is there a path to be both?” 

Background 1: OpenStack, The Early Days

How did we get here?  It’s worth noting that 2011 OpenStack was structured as a heterogenous vendor playground.  At the inaugural OpenStack summit in Austin when the project was just forming around NASA’s Nova and Rackspace’s Swift projects, monolithic cloud stacks were a very real threat.  VMware and Amazon were the de facto standards but closed and proprietary.  The open alternatives, CloudStack (Cloud.com), Eucalyptus and OpenNebula were too tied to single vendors or lacking in scale.  Having a multi-vendor, multi-contributor project without a dictatorial owner was a critical imperative for the community and it continues to be one of the most distinctive OpenStack traits.

Background 2:  DefCore, The Community Interoperability Process

What is DefCore?  The name DefCore is a portmanteau of the committee’s job to “define core” functions of OpenStack.  The official explanation says “DefCore sets base requirements by defining 1) capabilities, 2) code and 3) must-pass tests for all OpenStack products. This definition uses community resources and involvement to drive interoperability by creating the minimum standards for products labeled OpenStack.”  Fundamentally, it’s an OpenStack Board committee with membership open to the community.  In very practical terms, DefCore picks which features and implementation details of OpenStack are required by the vendors; consequently, we’ve designed a governance process to ensure transparency and, hopefully, prevent individual vendors from exerting too much influence.

We need DevOps without Borders! Is that “Hybrid DevOps?”

The RackN team has been working on making DevOps more portable for over five years.  Portable between vendors, sites, tools and operating systems means that our automation needs be to hybrid in multiple dimensions by design.

Why drive for hybrid?  It’s about giving users control.

launch!I believe that application should drive the infrastructure, not the reverse.  I’ve heard may times that the “infrastructure should be invisible to the user.”  Unfortunately, lack of abstraction and composibility make it difficult to code across platforms.  I like the term “fidelity gap” to describe the cost of these differences.

What keeps DevOps from going hybrid?  Shortcuts related to platform entangled configuration management.

Everyone wants to get stuff done quickly; however, we make the same hard-coded ops choices over and over again.  Big bang configuration automation that embeds sequence assumptions into the script is not just technical debt, it’s fragile and difficult to upgrade or maintain.  The problem is not configuration management (that’s a critical component!), it’s the lack of system level tooling that forces us to overload the configuration tools.

What is system level tooling?  It’s integrating automation that expands beyond configuration into managing sequence (aka orchestration), service orientation, script modularity (aka composibility) and multi-platform abstraction (aka hybrid).

My ops automation experience says that these four factors must be solved together because they are interconnected.

What would a platform that embraced all these ideas look like?  Here is what we’ve been working towards with Digital Rebar at RackN:

Mono-Infrastructure IT “Hybrid DevOps”
Locked into a single platform Portable between sites and infrastructures with layered ops abstractions.
Limited interop between tools Adaptive to mix and match best-for-job tools.  Use the right scripting for the job at hand and never force migrate working automation.
Ad hoc security based on site specifics Secure using repeatable automated processes.  We fail at security when things get too complex change and adapt.
Difficult to reuse ops tools Composable Modules enable Ops Pipelines.  We have to be able to interchange parts of our deployments for collaboration and upgrades.
Fragile Configuration Management Service Oriented simplifies API integration.  The number of APIs and services is increasing.  Configuration management is not sufficient.
 Big bang: configure then deploy scripting Orchestrated action is critical because sequence matters.  Building a cluster requires sequential (often iterative) operations between nodes in the system.  We cannot build robust deployments without ongoing control over order of operations.

Should we call this “Hybrid Devops?”  That sounds so buzz-wordy!

I’ve come to believe that Hybrid DevOps is the right name.  More technical descriptions like “composable ops” or “service oriented devops” or “cross-platform orchestration” just don’t capture the real value.  All these names fail to capture the portability and multi-system flavor that drives the need for user control of hybrid in multiple dimensions.

Simply put, we need devops without borders!

What do you think?  Do you have a better term?

Post-OpenStack DefCore, I’m Chasing “open infrastructure” via cross-platform Interop

Like my previous DefCore interop windmill tilting, this is not something that can be done alone. Open infrastructure is a collaborative effort and I’m looking for your help and support. I believe solving this problem benefits us as an industry and individually as IT professionals.

2013-09-13_18-56-39_197So, what is open infrastructure?   It’s not about running on open source software. It’s about creating platform choice and control. In my experience, that’s what defines open for users (and developers are not users).

I’ve spent several years helping lead OpenStack interoperability (aka DefCore) efforts to ensure that OpenStack cloud APIs are consistent between vendors. I strongly believe that effort is essential to build an ecosystem around the project; however, in talking to enterprise users, I’ve learned that that their  real  interoperability gap is between that many platforms, AWS, Google, VMware, OpenStack and Metal, that they use everyday.

Instead of focusing inward to one platform, I believe the bigger enterprise need is to address automation across platforms. It is something I’m starting to call hybrid DevOps because it allows users to mix platforms, service APIs and tools.

Open infrastructure in that context is being able to work across platforms without being tied into one platform choice even when that platform is based on open source software. API duplication is not sufficient: the operational characteristics of each platform are different enough that we need a different abstraction approach.

We have to be able to compose automation in a way that tolerates substitution based on infrastructure characteristics. This is required for metal because of variation between hardware vendors and data center networking and services. It is equally essential for cloud because of variation between IaaS capabilities and service delivery models. Basically, those  minor  differences between clouds create significant challenges in interoperability at the operational level.

Rationalizing APIs does little to address these more structural differences.

The problem is compounded because the differences are not nicely segmented behind abstraction layers. If you work to build and sustain a fully integrated application, you must account for site specific needs throughout your application stack including networking, storage, access and security. I’ve described this as all deployments have 80% of the work common but the remaining 20% is mixed in with the 80% instead of being nicely layers. So, ops is cookie dough not vinaigrette.

Getting past this problem for initial provisioning on a single platform is a false victory. The real need is portable and upgrade-ready automation that can be reused and shared. Critically, we also need to build upon the existing foundations instead of requiring a blank slate. There is openness value in heterogeneous infrastructure so we need to embrace variation and design accordingly.

This is the vision the RackN team has been working towards with open source Digital Rebar project. We now able to showcase workload deployments (Docker, Kubernetes, Ceph, etc) on multiple cloud platforms that also translate to full bare metal deployments. Unlike previous generations of this tooling (some will remember Crowbar), we’ve been careful to avoid injecting external dependencies into the DevOps scripts.

While we’re able to demonstrate a high degree of portability (or fidelity) across multiple platforms, this is just the beginning. We are looking for users and collaborators who want to want to build open infrastructure from an operational perspective.

You are invited to join us in making open cross-platform operations a reality.

Smaller Nodes? Just the Right Size for Docker!

Container workloads have the potential to redefine how we think about scale and hosted infrastructure.

Last Fall, Ubiquity Hosting and RackN announced a 200 node Docker Swarm cluster as a phase one of our collaboration. Unlike cloud-based container workloads demonstrations, we chose to run this cluster directly on the bare metal.  

Why bare metal instead of virtualized? We believe that metal offers additional performance, availability and control.  

With the cluster automation ready, we’re looking for customers to help us prove those assumptions. While we could simply build on many VMs, our analysis is the a lot of smaller nodes will distribute work more efficiently. Since there is no virtualization overhead, lower RAM systems can still give great performance.

The collaboration with RackN allows us to offer customers a rapid, repeatable cluster capability. Their Digital Rebar automation works on a broad spectrum of infrastructure allow our users to rehearse deployments on cloud, quickly change components and iteratively tune the cluster.

We’re finding that these dedicated metal nodes have much better performance than similar VMs in AWS?  Don’t believe us – you can use Digital Rebar to spin up both and compare.   Since Digital Rebar is an open source platform, you can explore and expand on it.

The Docker Swarm deployment is just a starting point for us. We want to hear your provisioning ideas and work to turn them into reality.

OpenStack Shared Community Values? Here’s my seven, let’s compare

The recent discussion about OpenStack API vs Implementation had led to several discussions about “OpenStack Values.”  While entertaining, they ultimately show that we have a lot of conflicting desires and opinions about the project.  In fact, the term “values” is itself hard to define.

Consequently, I wanted to try to capture what I see as OpenStack’s current values (not my personal ones for the project – those are in the post script). I’ve tried to put everything in positive terms, but value choices always have positive and negative impacts.

Rank Value Provides Possible Downsides
1 Upstreaming Share code base and community effort “First in” wins

Latest over stability

Measure value in commits

2 Vendor’s taking initiative Broad Participation

Free marketing buzz

No one wants to say no because of Vendor bias perception
3 End-to-end open source No licenses required for scale users and developers Build, don’t buy wastes a lot of effort

Does not align with users who want to pay for services

4 Developer leadership Lots of code being created Not many user requirements being considered
5 Figuring out API via implementation Fast iterations Frustrating APIs

API depreciation

6 Passionate discussion Diversity of opinion

Drama attracts attention

“Unfriendly” community

Loudest voice wins

Cross culture challenges

7 Being able to contribute broadly Generally maintainable platform Deep skills in subject matters

Best tool for the job

In my experience, if you don’t align with a communities values then you’re going to be very unhappy in the community.  I’ve watched this happen to project founders and the community changed around them.  Let’s all RAGE QUIT!

So, this makes me reflect on my own open source values. I’d start with pragmatic utility, transparent action, principle driven decisions, iterative design and data driven decisions.

What do you value most in open communities?

APIs and Implementations collide at OpenStack Interop: The Oracle Zones vs VMs Debate

I strive to stay neutral as OpenStack DefCore co-chair; however, as someone asking for another Board term, it’s important to review my thinking so that you can make an informed voting decision.

DefCore, while always on the edge of controversy, recently became ground zero for the “what is OpenStack” debate [discussion write up]. My preferred small core “it’s an IaaS product” answer is only one side. Others favor “it’s an open cloud community” while another faction champions an “open cloud platform.” I’m struggling to find a way that it can be all together at the same time.

The TL;DR is that, today, OpenStack vendors are required to implement a system that can run Linux guests. This is an example of an implementation over API bias because there’s nothing in the API that drives that specific requirement.

From a pragmatic “get it done” perspective, OpenStack needs to remain implementation driven for now. That means that we care that “OpenStack” clouds run VMs.

While there are pragmatic reasons for this, I think that long term success will require OpenStack to become an API specification. So today’s “right answer” actually undermines the long term community value. This has been a long standing paradox in OpenStack.

Breaking the API to implementation link allows an ecosystem to grow with truly alternate implementations (not just plug-ins). This is a threat to the community “upstream first” mentality.  OpenStack needs to be confident enough in the quality and utility of the shared code base that it can allow competitive implementations. Open communities should not need walls to win but they do need clear API definition.

What is my posture for this specific issue?  It’s complicated.

First, I think that the user and ecosystem expectations are being largely ignored in these discussions. Many of the controversial items here are vendor initiatives, not user needs. Right now, I’ve heard clearly that those expectations are for OpenStack to be an IaaS the runs VMs. OpenStack really needs to focus on delivering a reliably operable VM based IaaS experience. Until that’s solid, the other efforts are vendor noise.

Second, I think that there are serious test gaps that jeopardize the standard. The fundamental premise of DefCore is that we can use the development tests for API and behavior validation. We chose this path instead of creating an independent test suite. We either need to address tests for interop within the current body of tests or discuss splitting the efforts. Both require more investment than we’ve been willing to make.

We have mechanisms in place to collects data from test results and expand the test base.  Instead of creating new rules or guidelines, I think we can work within the current framework.

The simple answer would be to block non-VM implementations; however, I trust that cloud consumers will make good decisions when given sufficient information.  I think we need to fix the tests and accept non-VM clouds if they pass the corrected tests.

For this and other reasons, I want OpenStack vendors to be specific about the configurations that they test and support. We took steps to address this in DefCore last year but pulled back from being specific about requirements.  In this particular case, I believe we should require the official OpenStack vendor to state clear details about their supported implementation.  Customers will continue vote with their wallet about which configuration details are important.

This is a complex issue and we need community input.  That means that we need to hear from you!  Here’s the TC Position and the DefCore Patch.