Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

Posted on October 7, 2016 by Rob H

In part pt 1, we reviewed the RackN team’s hard won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make it more expensive to maintain operations.

If it’s more expensive isolate then why have we failed to create shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are a three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration). While they all come back to thinking of the whole system as a cluster instead of individual nodes. let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system, is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configure until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues. Taking this to the next logical steps, doing upgrades, require sequence control with circuit breakers – that’s exactly what Digital Rebar was built to provide.

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are intertwined mix of service, node and cross-node configuration. Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates. Once the platform is installed, then we use the platform as a services. On physical, there are even more including DNS, IPAM and Provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance. Using a load balancer? You can’t configure it until you’ve got the node addresses allocated. And then it needs to be updated as you manage your cluster. This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration. The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence. For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries. The same applies for PKI, Load Balancer and Networking.

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us to into an unmaintainable lock step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing boot-strap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.
In summary, we want composition because we are totally against ops magic. No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic why and see how it can help create a more portable and sustainable IT foundation.

Breaking Up is Hard To Do – Why I Believe Ops Decomposition (pt 1)

Posted on October 6, 2016 by Rob H

Over the summer, the RackN team took a radical step with our previous Ansible Kubernetes workload install: we broke it into pieces. Why? We wanted to eliminate all “magic happens here” steps in the deployment.

The result, DR Kompos8, is a faster, leaner, transparent and parallelized installation that allows for pluggable extensions and upgrades (video tour). We also chose the operationally simplest configuration choice: Golang binaries managed by SystemD Golang binaries managed by SystemD.

Why decompose and simplify? Let’s talk about our hard earned ops automation battle scars that let to composability as a core value:

Back in the early OpenStack days, when the project was actually much simpler, we were part of a community writing Chef Cookbooks to install it. These scripts are just a sequence of programmable steps (roles in Ops-speak) that drive the configuration of services on each node in the cluster. There is an ability to find cross-cluster information and lookup local inventory so we were able to inject specific details before the process began. However, once the process started, it was pretty much like starting a dominoes chain. If anything went wrong anywhere in the installation, we had to reset all the dominoes and start over.

Like a dominoes train, it is really fun to watch when it works. Also, like dominoes, it is frustrating to set up and fix. Often we literally were holding our breath during installation hoping that we’d anticipated every variation in the software, hardware and environment. It is no surprise that the first and must critical feature we’d created was a redeploy command.

It turned out the the ability to successfully redeploy was the critical measure for success. We would not consider a deployment complete until we could wipe the systems and rebuild it automatically at least twice.

What made cluster construction so hard? There were a three key things: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).

We’ll explore these three reasons in detail for part 2 of this post tomorrow.

Even without the details, it easy to understand that we want to avoid all magic in a deployment.

For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Likewise, we need to eliminate “it worked from my desktop” automation too. Those systems are impossible to maintain, share and scale. Composed cluster operations addresses this problem by making work modular, predictable and transparent.

5 Key Aspects of High Fidelity DevOps [repost from DevOps.com]

Posted on May 5, 2016 by Rob H

For all our cloud enthusiasm, I feel like ops automation is suffering as we increase choice and complexity. Why is this happening? It’s about loss of fidelity.

Nearly a year ago, I was inspired by a mention of “Fidelity Gaps” during a Cloud Foundry After Dark session. With additional advice from DevOps leader Gene Kim, this narrative about the why and how of DevOps Fidelity emerged.

As much as we talk about how we should have shared goals spanning Dev and Ops, it’s not nearly as easy as it sounds. To fuel a DevOps culture, we have to build robust tooling, also.

That means investing up front in five key areas: abstraction, composability, automation, orchestration, and idempotency.

Together, these concepts allow sharing work at every level of the pipeline. Unfortunately, it’s tempting to optimize work at one level and miss the true system bottlenecks.

Creating production-like fidelity for developers is essential: We need it for scale, security and upgrades. It’s not just about sharing effort; it’s about empathy and collaboration.

But even with growing acceptance of DevOps as a cultural movement, I believe deployment disparities are a big unsolved problem. When developers have vastly different working environments from operators, it creates a “fidelity gap” that makes it difficult for the teams to collaborate.

Before we talk about the costs and solutions, let me first share a story from back when I was a bright-eyed OpenStack enthusiast…

Read the Full Article on DevOps.com including my section about Why OpenStack Devstack harms the project and five specific ways to improve DevOps fidelity.

my 8 steps that would improve OpenStack Interop w/ AWS

Posted on April 27, 2016 by Rob H

I’ve been talking with a lot of OpenStack people about frustrating my attempted hybrid work on seven OpenStack clouds [OpenStack Session Wed 2:40]. This post documents the behavior Digital Rebar expects from the multiple clouds that we have integrated with so far. At RackN, we use this pattern for both cloud and physical automation.

Sunday, I found myself back in front of the the Board talking about the challenge that implementation variation creates for users. Ultimately, the question “does this harm users?” is answered by “no, they just leave for Amazon.”

I can’t stress this enough: it’s not about APIs! The challenge is twofold: implementation variance between OpenStack clouds and variance between OpenStack and AWS.

The obvious and simplest answer is that OpenStack implementers need to conform more closely to AWS patterns (once again, NOT the APIs).

Here are the eight Digital Rebar node allocation steps [and my notes about general availability on OpenStack clouds]:

Add node specific SSH key [YES]
Get Metadata on Networks, Flavors and Images [YES]
Pick correct network, flavors and images [NO, each site is distinct]
Request node [YES]
Get node PUBLIC address for node [NO, most OpenStack clouds do not have external access by default]
Login into system using node SSH key [PARTIAL, the account name varies]
Add root account with Rebar SSH key(s) and remove password login [PARTIAL, does not work on some systems]
Remove node specific SSH key [YES]

These steps work on every other cloud infrastructure that we’ve used. And they are achievable on OpenStack – DreamHost delivered this experience on their new DreamCompute infrastructure.

I think that this is very achievable for OpenStack, but we’re doing to have to drive conformance and figure out an alternative to the Floating IP (FIP) pattern (IPv6, port forwarding, or adding FIPs by default) would all work as part of the solution.

For Digital Rebar, the quick answer is to simply allocate a FIP for every node. We can easily make this a configuration option; however, it feels like a pattern fail to me. It’s certainly not a requirement from other clouds.

I hope this post provides specifics about delivering a more portable hybrid experience. What critical items do you want as part of your cloud ops process?

OpenStack is caught in a snowstorm – it’s status quo for ops implementations to be snowflakes

Posted on April 21, 2016 by Rob H

OpenStack got into exactly the place we expected: operations started with fragmented and divergent data centers (aka snowflaked) and OpenStack did nothing to change that. Can we fix that? Yes, but the answer involves relying on Amazon as our benchmark.

In advance of my OpenStack Summit Demo/Presentation (video!) [slides], I’ve spent the last few weeks mapping seven (and counting) OpenStack implementations into the cloud provider subsystem of the Digital Rebar provisioning platform. Before I started working on adding OpenStack integration, RackN already created a hybrid DevOps baseline. We are able to run the same Kubernetes and Docker Swarm provisioning extensions on multiple targets including Amazon, Google, Packet and directly on physical systems (aka metal).

Before we talk about OpenStack challenges, it’s important to understand that data centers and clouds are messy, heterogeneous environments.

These variations are so significant and operationally challenging that they are the fundamental design driver for Digital Rebar. The platform uses a composable operational approach to isolate and then chain automation tasks together. That allows configurations, like networking, from infrastructure specific functions to be passed into common building blocks without user intervention.

Composability is critical because it allows operators to isolate variations into modular pieces and the expose common configuration elements. Since the pattern works successfully for crossing other clouds and metal, I anticipated success with OpenStack.

The challenge is that there is not “one standard OpenStack” implementation. This issue is well documented under OpenStack as Project Shade.

If you only plan to operate a mono-cloud then these are not concerns; however, everyone I’ve met is using at least AWS and one other cloud. This operational fact means that AWS provides the common service behavior baseline. This is not an API statement – it’s about being able to operate on the systems delivered by the API.

While the OpenStack API worked consistently on each tested cloud (win for DefCore!), it frequently delivered systems that could not be deployed or were unusable for later steps.

While these are not directly OpenStack API concerns, I do believe that additional metadata in the API could help expose material configuration choices. The challenge becomes defining those choices in a reference architecture way. The OpenStack principle of leaving implementation choices open makes it challenging to drive these options to a narrow set of choices. Unfortunately, it means it is difficult to create an intra-OpenStack hybrid automation without hard-coded vendor identities or exploding configuration flags.

As series of individually reasonable options dominoes together to make to these challenges. These are real issues that I made the integration difficult.

No default of externally accessible systems. I have to assign floating IPs (an anti-pattern for individual VMs) or be on the internal networks. No consistent naming pattern for networks, types (flavors) or starting images. In several cases, the “private” network is the publicly accessible one and the “external” network is visible but unusable.
No consistent naming for access user accounts. If I want to ssh to a system, I have to fail my first login before I learn the right user name.
No data to determine which networks provide which functions. And there’s no metadata about which networks are public or private.
Incomplete post-provisioning processes because they are left open to user customization.

There is a defensible and logical reason for each example above; sadly, those reasons do nothing to make OpenStack more operationally accessible. While intra-OpenStack interoperability is helpful, I believe that ecosystems and users benefit from Amazon-like behavior.

What should you do? Help broaden the OpenStack discussions to seek interoperability with the whole cloud ecosystem.

At RackN, we will continue to refine and adapt to these variations. Creating a consistent experience that copes with variability is the raison d’etre for our efforts with Digital Rebar. That means that we ultimately use AWS as the yardstick for configuration of any infrastructure from physical, OpenStack and even Amazon!

SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters

Posted on April 20, 2016 by Rob H

Originally posted on Kubernetes Blog. I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early. Please join us in this effort!

open We think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project by weekly participation and collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

Chat with us on the Cluster Ops Slack channel
Email us at the Cluster Ops SIG email list

The Cluster Ops Special Interest Group meets weekly at 13:00PT on Thursdays, you can join us via the video hangout and see latest meeting notes for agendas and topics covered.

AWS Ops patterns set the standard: embrace that and accelerate

Posted on April 18, 2016 by Rob H

RackN creates infrastructure agnostic automation so you can run physical and cloud infrastructure with the same elastic operational patterns. If you want to make infrastructure unimportant then your hybrid DevOps objective is simple:

Create multi-infrastructure Amazon equivalence for ops automation.

Even if you are not an AWS fan, they are the universal yardstick (15 minute & 40 minute presos) That goes for other clouds (public and private) and for physical infrastructure too. Their footprint is simply so pervasive that you cannot ignore “works on AWS” as a need even if you don’t need to work on AWS. Like PCs in the late-80s, we can use vendor competition to create user choice of infrastructure. That requires a baseline for equivalence between the choices. In the 90s, the Windows’ monopoly provided those APIs.

Why should you care about hybrid DevOps? As we increase operational portability, we empower users to make economic choices that foster innovation. That’s valuable even for AWS locked users.

We’re not talking about “give me a VM” here! The real operational need is to build accessible, interconnected systems – what is sometimes called “the underlay.” It’s more about networking, configuration and credentials than simple compute resources. We need consistent ways to automate systems that can talk to each other and static services, have access to dependency repositories (code, mirrors and container hubs) and can establish trust with other systems and administrators.

These “post” provisioning tasks are sophisticated and complex. They cannot be statically predetermined. They must be handled dynamically based on the actual resource being allocated. Without automation, this process becomes manual, glacial and impossible to maintain. Does that sound like traditional IT?

Side Note on Containers: For many developers, we are adding platforms like Docker, Kubernetes and CloudFoundry, that do these integrations automatically for their part of the application stack. This is a tremendous benefit for their use-cases. Sadly, hiding the problem from one set of users does not eliminate it! The teams implementing and maintaining those platforms still have to deal with underlay complexity.

I am emphatically not looking for AWS API compatibility: we are talking about emulating their service implementation choices. We have plenty of ways to abstract APIs. Ops is a post-API issue.

In fact, I believe that red herring leads us to a bad place where innovation is locked behind legacy APIs. Steal APIs where it makes sense, but don’t blindly require them because it’s the layer under them where the real compatibility challenge lurk.

Side Note on OpenStack APIs (why they diverge): Trying to implement AWS APIs without duplicating all their behaviors is more frustrating than a fresh API without the implied AWS contracts. This is exactly the problem with OpenStack variation. The APIs work but there is not a behavior contract behind them.

For example, transitioning to IPv6 is difficult to deliver because Amazon still relies on IPv4. That lack makes it impossible to create hybrid automation that leverages IPv6 because they won’t work on AWS. In my world, we had to disable default use of IPv6 in Digital Rebar when we added AWS. Another example? Amazon’s regional AMI pattern, thankfully, is not replicated by Google; however, their lack means there’s no consistent image naming pattern. In my experience, a bad pattern is generally better than inconsistent implementations.

As market dominance drives us to benchmark on Amazon, we are stuck with the good, bad and ugly aspects of their service.

For very pragmatic reasons, even AWS automation is highly fragmented. There are a large and shifting number of distinct system identifiers (AMIs, regions, flavors) plus a range of user-configured choices (security groups, keys, networks). Even within a single provider, these options make impossible to maintain a generic automation process. Since other providers logically model from AWS, we will continue to expect AWS like behaviors from them. Variation from those norms adds effort.

Failure to follow AWS without clear reason and alternative path is frustrating to users.

Do you agree? Join us with Digital Rebar creating real a hybrid operations platform.

Fast Talk: Creating Operating Environments that Span Clouds and Physical Infrastructures

Posted on April 14, 2016 by Rob H

This short 15-minute talk pulls together a few themes around composability that you’ll see in future blogs where I lay out the challenges and solutions for hybrid DevOps practices. Like any DevOps concept – it’s a mix of technology, attitude (culture) and process.

Our hybrid DevOps objective is simple: We need multi-infrastructure Amazon equivalence for ops automation.

Here’s the summary:

Hybrid Infrastructure is new normal
Amazon is the Ops benchmark
Embrace operations automation
Invest in making IT composable

Want to listen to it? Here’s the voice over:

Hybrid DevOps: Union of Configuration, Orchestration and Composability

Posted on March 8, 2016 by Rob H

Steven Spector and I talked about “Hybrid DevOps” as a concept. Our discussion led to a ‘there’s a picture for that!’ moment that often helped clarify the concept. We believe that this concept, like Rugged DevOps, is additive to existing DevOps thinking and culture. It’s about expanding our thinking to include orchestration and composability.

Hybrid DevOps 3 components (1) Here’s our write-up: Hybrid DevOps: Union of Configuration, Orchestration and Composability

Hybrid & Container Disruption [Notes from CTP Mike Kavis’ Interview]

Posted on February 23, 2016 by Rob H

Last week, Cloud Technology Partner VP Mike Kavis (aka MadGreek65) and I talked for 30 minutes about current trends in Hybrid Infrastructure and Containers.

Mike Kavis

Three of the top questions that we discussed were:

Why Composability is required for deployment? [5:45]
Is Configuration Management dead? [10:15]
How can containers be more secure than VMs? [23:30]

Here’s the audio matching the time stamps in my notes:

00:44: What is RackN? – scale data center operations automation
01:45: Digital Rebar is… 3^rd generation provisioning to manage data center ops & bring up
02:30: Customers were struggling on Ops more than code or hardware
04:00: Rethinking “open” to include user choice of infrastructure, not just if the code is open source.
05:00: Use platforms where it’s right for users.
05:45: Composability – it’s how do we deal with complexity. Hybrid DevOps
06:40: How do we may Ops more portable
07:00: Five components of Hybrid DevOps
07:27: Rob has “Rick Perry” Moment…
08:30: 80/20 Rule for DevOps where 20% is mixed.
10:15: “Is configuration management dead” > Docker does hurt Configuration Management
11:00: How Service Registry can replace Configuration.
11:40: Reference to John Willis on the importance of sequence.
12:30: Importance of Sequence, Services & Configuration working together
12:50: Digital Rebar intermixes all three
13:30: The race to have orchestration – “it’s always been there”
14:30: Rightscale Report > Enterprises average SIX platforms in use
15:30: Fidelity Gap – Why everyone will hybrid but need to avoid monoliths
16:50: Avoid hybrid trap and keep a level of abstraction
17:41: You have to pay some “abstraction tax” if you want to hybrid BUT you can get some additional benefits: hybrid + ops management.
18:00: Rob gives a shout out to Rightscale
19:20: Rushing to solutions does not create secure and sustained delivery
20:40: If you work in a silo, you loose the ability to collaborate and reuse other works
21:05: Rob is sad about “OpenStack explosion of installers”
21:45: Container benefit from services containers – how they can be MORE SECURE
23:00: Automation required for security
23:30: How containers will be more secure than containers
24:30: Rob bring up “cheese” again…
26:15: If you have more situational awareness, you can be more secure WITHOUT putting more work for developers.
27:00: Containers can help developers worry about as many aspects of Ops
27:45: Wrap up

What do you think? I’d love to hear your opinion on these topics!

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Category Archives: Agile