Why we can’t move past installers to talk about operations – the underlay gap

20 minutes.  That’s the amount of time most developers are willing to spend installing a tool or platform that could become the foundation for their software.  I’ve watched our industry obsess on the “out of box” experience which usually translates into a single CLI command to get started (and then fails to scale up).

Secure, scalable and robust production operations is complex.  In fact, most of these platforms are specifically designed to hide that fact from developers.  

That means that these platforms intentionally hide the very complexity that they themselves need to run effectively.  Adding that complexity, at best, undermines the utility of the platform and, at worst, causes distractions that keep us forever looping on “day 1” installation issues.

I believe that systems designed to manage ops process and underlay are different than the platforms designed to manage developer life-cycle.  This is different than the fidelity gap which is about portability. Accepting that allows us to focus on delivering secure, scalable and robust infrastructure for both users.

In a pair of DevOps.com posts, I lay out my arguments about the harm being caused by trying to blend these concepts in much more detail:

  1. It’s Time to Slay the Universal Installer Unicorn
  2. How the Lure of an ‘Easy Button’ Installer Traps Projects

Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

In part pt 1, we reviewed the RackN team’s hard won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make it more expensive to maintain operations.

aid1165255-728px-install-pergo-flooring-step-5-version-2If it’s more expensive isolate then why have we failed to create shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are a three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).  While they all come back to thinking of the whole system as a cluster instead of individual nodes. let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system, is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configure until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues.  Taking this to the next logical steps, doing upgrades, require sequence control with circuit breakers – that’s exactly what Digital Rebar was built to provide.

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are intertwined mix of service, node and cross-node configuration.  Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates.  Once the platform is installed, then we use the platform as a services.  On physical, there are even more including DNS, IPAM and Provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance.  Using a load balancer?  You can’t configure it until you’ve got the node addresses allocated.  And then it needs to be updated as you manage your cluster.  This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration.  The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence.  For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries.  The same applies for PKI, Load Balancer and Networking.

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us to into an unmaintainable lock step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing boot-strap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.
In summary, we want composition because we are totally against ops magic.  No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic why and see how it can help create a more portable and sustainable IT foundation.

Breaking Up is Hard To Do – Why I Believe Ops Decomposition (pt 1)

Over the summer, the RackN team took a radical step with our previous Ansible Kubernetes workload install: we broke it into pieces.  Why?  We wanted to eliminate all “magic happens here” steps in the deployment.

320px-dominos_fallingThe result, DR Kompos8, is a faster, leaner, transparent and parallelized installation that allows for pluggable extensions and upgrades (video tour). We also chose the operationally simplest configuration choice: Golang binaries managed by SystemDGolang binaries managed by SystemD.

Why decompose and simplify? Let’s talk about our hard earned ops automation battle scars that let to composability as a core value:

Back in the early OpenStack days, when the project was actually much simpler, we were part of a community writing Chef Cookbooks to install it. These scripts are just a sequence of programmable steps (roles in Ops-speak) that drive the configuration of services on each node in the cluster. There is an ability to find cross-cluster information and lookup local inventory so we were able to inject specific details before the process began. However, once the process started, it was pretty much like starting a dominoes chain. If anything went wrong anywhere in the installation, we had to reset all the dominoes and start over.

Like a dominoes train, it is really fun to watch when it works. Also, like dominoes, it is frustrating to set up and fix. Often we literally were holding our breath during installation hoping that we’d anticipated every variation in the software, hardware and environment. It is no surprise that the first and must critical feature we’d created was a redeploy command.

It turned out the the ability to successfully redeploy was the critical measure for success. We would not consider a deployment complete until we could wipe the systems and rebuild it automatically at least twice.

What made cluster construction so hard? There were a three key things: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).

We’ll explore these three reasons in detail for part 2 of this post tomorrow.

Even without the details, it easy to understand that we want to avoid all magic in a deployment.

For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Likewise, we need to eliminate “it worked from my desktop” automation too.  Those systems are impossible to maintain, share and scale. Composed cluster operations addresses this problem by making work modular, predictable and transparent.

5 Key Aspects of High Fidelity DevOps [repost from DevOps.com]

For all our cloud enthusiasm, I feel like ops automation is suffering as we increase choice and complexity.  Why is this happening?  It’s about loss of fidelity.

Nearly a year ago, I was inspired by a mention of “Fidelity Gaps” during a Cloud Foundry After Dark session.  With additional advice from DevOps leader Gene Kim, this narrative about the why and how of DevOps Fidelity emerged.

As much as we talk about how we should have shared goals spanning Dev and Ops, it’s not nearly as easy as it sounds. To fuel a DevOps culture, we have to build robust tooling, also.

That means investing up front in five key areas: abstraction, composability, automation, orchestration, and idempotency.

Together, these concepts allow sharing work at every level of the pipeline. Unfortunately, it’s tempting to optimize work at one level and miss the true system bottlenecks.

Creating production-like fidelity for developers is essential: We need it for scale, security and upgrades. It’s not just about sharing effort; it’s about empathy and collaboration.

But even with growing acceptance of DevOps as a cultural movement, I believe deployment disparities are a big unsolved problem. When developers have vastly different working environments from operators, it creates a “fidelity gap” that makes it difficult for the teams to collaborate.

Before we talk about the costs and solutions, let me first share a story from back when I was a bright-eyed OpenStack enthusiast…

Read the Full Article on DevOps.com including my section about Why OpenStack Devstack harms the project and five specific ways to improve DevOps fidelity.

my 8 steps that would improve OpenStack Interop w/ AWS

I’ve been talking with a lot of OpenStack people about frustrating my attempted hybrid work on seven OpenStack clouds [OpenStack Session Wed 2:40].  This post documents the behavior Digital Rebar expects from the multiple clouds that we have integrated with so far.  At RackN, we use this pattern for both cloud and physical automation.

Sunday, I found myself back in front of the the Board talking about the challenge that implementation variation creates for users.  Ultimately, the question “does this harm users?” is answered by “no, they just leave for Amazon.”

I can’t stress this enough: it’s not about APIs!  The challenge is twofold: implementation variance between OpenStack clouds and variance between OpenStack and AWS.

The obvious and simplest answer is that OpenStack implementers need to conform more closely to AWS patterns (once again, NOT the APIs).

Here are the eight Digital Rebar node allocation steps [and my notes about general availability on OpenStack clouds]:

  1. Add node specific SSH key [YES]
  2. Get Metadata on Networks, Flavors and Images [YES]
  3. Pick correct network, flavors and images [NO, each site is distinct]
  4. Request node [YES]
  5. Get node PUBLIC address for node [NO, most OpenStack clouds do not have external access by default]
  6. Login into system using node SSH key [PARTIAL, the account name varies]
  7. Add root account with Rebar SSH key(s) and remove password login [PARTIAL, does not work on some systems]
  8. Remove node specific SSH key [YES]

These steps work on every other cloud infrastructure that we’ve used.  And they are achievable on OpenStack – DreamHost delivered this experience on their new DreamCompute infrastructure.

I think that this is very achievable for OpenStack, but we’re doing to have to drive conformance and figure out an alternative to the Floating IP (FIP) pattern (IPv6, port forwarding, or adding FIPs by default) would all work as part of the solution.

For Digital Rebar, the quick answer is to simply allocate a FIP for every node.  We can easily make this a configuration option; however, it feels like a pattern fail to me.  It’s certainly not a requirement from other clouds.

I hope this post provides specifics about delivering a more portable hybrid experience.  What critical items do you want as part of your cloud ops process?

OpenStack is caught in a snowstorm – it’s status quo for ops implementations to be snowflakes

OpenStack got into exactly the place we expected: operations started with fragmented and divergent data centers (aka snowflaked) and OpenStack did nothing to change that. Can we fix that? Yes, but the answer involves relying on Amazon as our benchmark.

In advance of my OpenStack Summit Demo/Presentation (video!) [slides], I’ve spent the last few weeks mapping seven (and counting) OpenStack implementations into the cloud provider subsystem of the Digital Rebar provisioning platform. Before I started working on adding OpenStack integration, RackN already created a hybrid DevOps baseline. We are able to run the same Kubernetes and Docker Swarm provisioning extensions on multiple targets including Amazon, Google, Packet and directly on physical systems (aka metal).

Before we talk about OpenStack challenges, it’s important to understand that data centers and clouds are messy, heterogeneous environments.

These variations are so significant and operationally challenging that they are the fundamental design driver for Digital Rebar. The platform uses a composable operational approach to isolate and then chain automation tasks together. That allows configurations, like networking, from infrastructure specific functions to be passed into common building blocks without user intervention.

Composability is critical because it allows operators to isolate variations into modular pieces and the expose common configuration elements. Since the pattern works successfully for crossing other clouds and metal, I anticipated success with OpenStack.

The challenge is that there is not “one standard OpenStack” implementation.  This issue is well documented under OpenStack as Project Shade.

If you only plan to operate a mono-cloud then these are not concerns; however, everyone I’ve met is using at least AWS and one other cloud. This operational fact means that AWS provides the common service behavior baseline. This is not an API statement – it’s about being able to operate on the systems delivered by the API.

While the OpenStack API worked consistently on each tested cloud (win for DefCore!), it frequently delivered systems that could not be deployed or were unusable for later steps.

While these are not directly OpenStack API concerns, I do believe that additional metadata in the API could help expose material configuration choices. The challenge becomes defining those choices in a reference architecture way. The OpenStack principle of leaving implementation choices open makes it challenging to drive these options to a narrow set of choices. Unfortunately, it means it is difficult to create an intra-OpenStack hybrid automation without hard-coded vendor identities or exploding configuration flags.

As series of individually reasonable options dominoes together to make to these challenges.  These are real issues that I made the integration difficult.

  • No default of externally accessible systems. I have to assign floating IPs (an anti-pattern for individual VMs) or be on the internal networks. No consistent naming pattern for networks, types (flavors) or starting images.  In several cases, the “private” network is the publicly accessible one and the “external” network is visible but unusable.
  • No consistent naming for access user accounts.  If I want to ssh to a system, I have to fail my first login before I learn the right user name.
  • No data to determine which networks provide which functions.  And there’s no metadata about which networks are public or private.  
  • Incomplete post-provisioning processes because they are left open to user customization.

There is a defensible and logical reason for each example above; sadly, those reasons do nothing to make OpenStack more operationally accessible.  While intra-OpenStack interoperability is helpful, I believe that ecosystems and users benefit from Amazon-like behavior.

What should you do?  Help broaden the OpenStack discussions to seek interoperability with the whole cloud ecosystem.


At RackN, we will continue to refine and adapt to these variations.  Creating a consistent experience that copes with variability is the raison d’etre for our efforts with Digital Rebar. That means that we ultimately use AWS as the yardstick for configuration of any infrastructure from physical, OpenStack and even Amazon!


SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters

Originally posted on Kubernetes Blog.  I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early.  Please join us in this effort!

openWe think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project by weekly participation and collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

The Cluster Ops Special Interest Group meets weekly at 13:00PT on Thursdays, you can join us via the video hangout and see latest meeting notes for agendas and topics covered.