(re)Finding an Open Infrastructure Plan: Bridging OpenStack & Kubernetes

TL;DR: infrastructure operations is hard and we need to do a lot more to make these systems widely accessible, easy to sustain and lower risk.  We’re discussing these topics on twitter…please join in.  Themes include “do we really have consensus and will to act” and “already a solved problem” and “this hurts OpenStack in the end.”  

pexels-photo-229949I am always looking for ways to explain (and solve!) the challenges that we face in IT operations and open infrastructure.  I’ve been writing a lot about my concern that data center automation is not keeping pace and causing technical debt.  That concern led to my recent SRE blogging for RackN.

It’s essential to solve these problems in an open way so that we can work together as a community of operators.

It feels like developers are quick to rally around open platforms and tools while operators tend to be tightly coupled to vendor solutions because operational work is tightly coupled to infrastructure.  From that perspective, I’m been very involved in OpenStack and Kubernetes open source infrastructure platforms because I believe the create communities where we can work together.

This week, I posted connected items on VMblog and RackN that layout a position where we bring together these communities.

Of course, I do have a vested interest here.  Our open underlay automation platform, Digital Rebar, was designed to address a missing layer of physical and hybrid automation under both of these projects.  We want to help accelerate these technologies by helping deliver shared best practices via software.  The stack is additive – let’s build it together.

I’m very interested in hearing from you about these ideas here or in the context of the individual posts.  Thanks!

OpenStack Boston Day 1 Notes

Contrary to pundit expectations, OpenStack did not roll over and die during the keynotes yesterday.


In my 2011 Boston Summit shirt.

In fact, I saw the signs of a maturing project seeing real use and adoption. More critically, OpenStack leadership started the event with an acknowledgement of being part of, not owning, the vibrant open infrastructure community.

Continued Growth in Core Areas

Practical reasons for running dedicated infrastructure (compliance, control and cost) make OpenStack relevant for companies and governments with significant budgets. There is also a healthy shared infrastructure (aka public cloud) market living in the shadow of the big 3 players. It’s still unclear how this ecosystem will make money for the vendors.

What do customers buy? Should the Core be free?

My personal experience is that most customers are reluctant to (but grudgingly do) buy distros for the core open technology. They are much more willing to pay for adjacencies like security, storage and networking.

Emerging Challenges from Adjacent Technologies

Containers and Kubernetes are making a significant impact on the OpenStack community. At points, the OpenStack keynote was more about Kubernetes than OpenStack. It’s also clear that customers want to use containers as an abstraction layer to make infrastructure less visible or locked-in. That opens the market for using servers directly (bare metal) or other clouds. That portability is likely to help OpenStack more than hurt it because customers can exit workloads from the Big 3 players.

Friction for adoption remains a critical hurdle.

Containers, which are cloud first platforms, have much less friction than IaaS platforms. IaaS platforms, even managed ones, require physical infrastructure with the matching complexity and investment.

OpenStack: an open infrastructure software community

Overall, the summit remains an amazing community space for open infrastructure software and cloud alternatives to the Big 3 players. The Foundation’s pivot to embrace Kubernetes and foster several other open technologies helps maintain the central enthusiasm for open source infrastructure that gave birth to the platform in the first place.

A healthy pragmatic vibe

The summit may not have the same heady taking-on-the-world feeling as the early days; instead, it has a healthy pragmatic vibe. Considering how frothy this space remains, that may be a welcome relief.

What are your impressions? I’m looking forward to hearing from you!

10x Faster Today but 10x Harder to Maintain Tomorrow: the Cul-De-Sac problem

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way to scales dramatically better.  I’m not thinking about scale in number of nodes, but in operator efficiency.  The primary way to create that efficiency is limit site customization and to improve reuse.  Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Recently, RackN asked the question “Shouldn’t we have Shared Automation for Commodity Infrastructure?” which talked about fact that we, as an industry, keep writing custom automation for what should be commodity servers.  This “snow flaking” happens because there’s enough variation at the data center system level that it’s very difficult to share and reuse automation on an ongoing basis.

Since variation enables innovation, we need to solve this problem without limiting diversity of choice.


Happily, platforms like Kubernetes are designed to hide these infrastructure variations for developers.  That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms.  Unfortunately, that does nothing for the platforms or infrastructure bound applications.  For this lower level software, we need to accept that operations environments are heterogeneous.


I realized that we’re looking at a multidimensional problem after watching communities like OpenStack struggle to evolve operations practice.

It’s multidimensional because we are building the operations practice simultaneously with the software itself.  To make things even harder, the infrastructure and dependencies are also constantly changing.  Since this degree of rapid multi-factor innovation is the new normal, we have to plan that our operations automation itself must be as upgradable.

If we upgrade both the software AND the related deployment automation then each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current.  These challenges limit the virtuous feedback cycles that help communities grow.  

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

This is a hard problem because each deployment is unique and those differences make it hard to absorb community advances without being constantly broken.  That is one of the reasons why company opt out of the community and into vendor distributions. While Vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system level concerns.  Unlike project generated templates, this approach absorbs heterogeneity and brings in the external information that often complicate project deployment automation.  

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

What does it take to Operate Open Platforms? Answers in Datanaughts 72

Did I just let OpenStack ops off the hook….?  Kubernetes production challenges…?  

ix34grhy_400x400I had a lot of fun in this Datanaughts wide ranging discussion with unicorn herders Chris Wahl and Ethan Banks.  I like the three section format because it gives us a chance to deep dive into distinct topics and includes some out-of-band analysis by the hosts; however, that means you need to keep listening through the commercial breaks to hear the full podcast.

Three parts?  Yes, Chris and Ethan like to save the best questions for last.

In Part 1, we went deep into the industry operational and business challenges uncovered by the OpenStack project. Particularly, Chris and I go into “platform underlay” issues which I laid out in my “please stop the turtles” post. This was part of the build-up to my SRE series.

In Part 2, we explore my operations-focused view of the latest developments in container schedulers with a focus on Kubernetes. Part of the operational discussion goes into architecture “conceits” (or compromises) that allow developers to get the most from cloud native design patterns. I also make a pitch for using proven tools to run the underlay.

In Part 3, we go deep into DevOps automation topics of configuration and orchestration. We talk about the design principles that help drive “day 2” automation and why getting in-place upgrades should be an industry priority.  Of course, we do cover some Digital Rebar design too.

Take a listen and let me know what you think!

On Twitter, we’ve already started a discussion about how much developers should care about infrastructure. My opinion (posted here) is that one DevOps idea where developers “own” infrastructure caused a partial rebellion towards containers.

Beyond Expectations: OpenStack via Kubernetes Helm (Fully Automated with Digital Rebar)

RackN revisits OpenStack deployments with an eye on ongoing operations.

I’ve been an outspoken skeptic of a Joint OpenStack Kubernetes Environment (my OpenStack BCN presoSuper User follow-up and BOS Proposal) because I felt that the technical hurdles of cloud native architecture would prove challenging.  Issues like stable service positioning and persistent data are requirements for OpenStack and hard problems in Kubernetes.

I was wrong: I underestimated how fast these issues could be addressed.

youtube-thumb-nail-openstackThe Kubernetes Helm work out of the AT&T Comm Dev lab takes on the integration with a “do it the K8s native way” approach that the RackN team finds very effective.  In fact, we’ve created a fully integrated Digital Rebar deployment that lays down Kubernetes using Kargo and then adds OpenStack via Helm.  The provisioning automation includes a Ceph cluster to provide stateful sets for data persistence.  

This joint approach dramatically reduces operational challenges associated with running OpenStack without taking over a general purpose Kubernetes infrastructure for a single task.

sre-seriesGiven the rise of SRE thinking, the RackN team believes that this approach changes the field for OpenStack deployments and will ultimately dominate the field (which is already  mainly containerized).  There is still work to be completed: some complex configuration is required to allow both Kubernetes CNI and Neutron to collaborate so that containers and VMs can cross-communicate.

We are looking for companies that want to join in this work and fast-track it into production.  If this is interesting, please contact us at sre@rackn.com.

Why should you sponsor? Current OpenStack operators facing “fork-lift upgrades” should want to find a path like this one that ensures future upgrades are baked into the plan.  This approach provide a fast track to a general purpose, enterprise grade, upgradable Kubernetes infrastructure.

Closing note from my past presentations: We’re making progress on the technical aspects of this integration; however, my concerns about market positioning remain.

“Why SRE?” Discussion with Eric @Discoposse Wright

sre-series My focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

ericewrightI was a guest on Eric “@discoposse” Wright of the Green Circle Community #42 Podcast (my previous appearance).

LISTEN NOW: Podcast #42 (transcript)

In this action-packed 30 minute conversation, we discuss the industry forces putting pressure on operations teams.  These pressures require operators to be investing much more heavily on reusable automation.

That leads us towards why Kubernetes is interesting and what went wrong with OpenStack (I actually use the phrase “dumpster fire”).  We ultimately talk about how those lessons embedded in Digital Rebar architecture.

Spiraling Ops Debt & the SRE coding imperative

This post is part of an SRE series grounded in the ideas inspired by the Google SRE book.

2/13 Update: You can hear an INTERACTIVE DISCUSSION based on this post with Eric Wright on his podcast, GC Online.

Every Ops team I know is underwater and doesn’t have the time to catch their breath.

Why does the load increase and leave Ops behind?  It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors.  Many teams simply step around their struggling compatriots and spin up yet more Ops work adding to the backlog. Dashing off yet another Ansible playbook to install on AWS is empowering but ultimately adds to the Ops sustaining backlog.


Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms and those platforms are build on rapidly evolving components.  For example, Kubernetes has a 3 month release cycle.  That’s really fast; however, it built on other components like Docker, SDN and operating systems that also have fast release cycles.  That means that even your single Kubernetes infrastructure has many moving parts that may not be consistent in your own organization.  For example, cloud deploys may use CoreOS while internal ones use a Corporate approved Centos.

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen an container fueled explosion in developer productivity and AI driven-rise in new hardware-flavored instances. Both are power drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why the Google SRE teams require a 50% automation vs Ops ratio.  

If the ratio is >50 then the team slowly sinks under growing operational load.  If you are not actively decreasing the load via automation then your teams get underwater and basic ops hygiene fails.

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help.  Stop writing automation that you can buy or re-use.  Get help running it.  Don’t waste time solving problems that other people have solved.  That may mean some upfront learning and investment but if you aren’t getting out of your own way then you’ll be run over.