5 Key Aspects of High Fidelity DevOps [repost from DevOps.com]

For all our cloud enthusiasm, I feel like ops automation is suffering as we increase choice and complexity.  Why is this happening?  It’s about loss of fidelity.

Nearly a year ago, I was inspired by a mention of “Fidelity Gaps” during a Cloud Foundry After Dark session.  With additional advice from DevOps leader Gene Kim, this narrative about the why and how of DevOps Fidelity emerged.

As much as we talk about how we should have shared goals spanning Dev and Ops, it’s not nearly as easy as it sounds. To fuel a DevOps culture, we have to build robust tooling, also.

That means investing up front in five key areas: abstraction, composability, automation, orchestration, and idempotency.

Together, these concepts allow sharing work at every level of the pipeline. Unfortunately, it’s tempting to optimize work at one level and miss the true system bottlenecks.

Creating production-like fidelity for developers is essential: We need it for scale, security and upgrades. It’s not just about sharing effort; it’s about empathy and collaboration.

But even with growing acceptance of DevOps as a cultural movement, I believe deployment disparities are a big unsolved problem. When developers have vastly different working environments from operators, it creates a “fidelity gap” that makes it difficult for the teams to collaborate.

Before we talk about the costs and solutions, let me first share a story from back when I was a bright-eyed OpenStack enthusiast…

Read the Full Article on DevOps.com including my section about Why OpenStack Devstack harms the project and five specific ways to improve DevOps fidelity.

my 8 steps that would improve OpenStack Interop w/ AWS

I’ve been talking with a lot of OpenStack people about frustrating my attempted hybrid work on seven OpenStack clouds [OpenStack Session Wed 2:40].  This post documents the behavior Digital Rebar expects from the multiple clouds that we have integrated with so far.  At RackN, we use this pattern for both cloud and physical automation.

Sunday, I found myself back in front of the the Board talking about the challenge that implementation variation creates for users.  Ultimately, the question “does this harm users?” is answered by “no, they just leave for Amazon.”

I can’t stress this enough: it’s not about APIs!  The challenge is twofold: implementation variance between OpenStack clouds and variance between OpenStack and AWS.

The obvious and simplest answer is that OpenStack implementers need to conform more closely to AWS patterns (once again, NOT the APIs).

Here are the eight Digital Rebar node allocation steps [and my notes about general availability on OpenStack clouds]:

  1. Add node specific SSH key [YES]
  2. Get Metadata on Networks, Flavors and Images [YES]
  3. Pick correct network, flavors and images [NO, each site is distinct]
  4. Request node [YES]
  5. Get node PUBLIC address for node [NO, most OpenStack clouds do not have external access by default]
  6. Login into system using node SSH key [PARTIAL, the account name varies]
  7. Add root account with Rebar SSH key(s) and remove password login [PARTIAL, does not work on some systems]
  8. Remove node specific SSH key [YES]

These steps work on every other cloud infrastructure that we’ve used.  And they are achievable on OpenStack – DreamHost delivered this experience on their new DreamCompute infrastructure.

I think that this is very achievable for OpenStack, but we’re doing to have to drive conformance and figure out an alternative to the Floating IP (FIP) pattern (IPv6, port forwarding, or adding FIPs by default) would all work as part of the solution.

For Digital Rebar, the quick answer is to simply allocate a FIP for every node.  We can easily make this a configuration option; however, it feels like a pattern fail to me.  It’s certainly not a requirement from other clouds.

I hope this post provides specifics about delivering a more portable hybrid experience.  What critical items do you want as part of your cloud ops process?

OpenStack is caught in a snowstorm – it’s status quo for ops implementations to be snowflakes

OpenStack got into exactly the place we expected: operations started with fragmented and divergent data centers (aka snowflaked) and OpenStack did nothing to change that. Can we fix that? Yes, but the answer involves relying on Amazon as our benchmark.

In advance of my OpenStack Summit Demo/Presentation (video!) [slides], I’ve spent the last few weeks mapping seven (and counting) OpenStack implementations into the cloud provider subsystem of the Digital Rebar provisioning platform. Before I started working on adding OpenStack integration, RackN already created a hybrid DevOps baseline. We are able to run the same Kubernetes and Docker Swarm provisioning extensions on multiple targets including Amazon, Google, Packet and directly on physical systems (aka metal).

Before we talk about OpenStack challenges, it’s important to understand that data centers and clouds are messy, heterogeneous environments.

These variations are so significant and operationally challenging that they are the fundamental design driver for Digital Rebar. The platform uses a composable operational approach to isolate and then chain automation tasks together. That allows configurations, like networking, from infrastructure specific functions to be passed into common building blocks without user intervention.

Composability is critical because it allows operators to isolate variations into modular pieces and the expose common configuration elements. Since the pattern works successfully for crossing other clouds and metal, I anticipated success with OpenStack.

The challenge is that there is not “one standard OpenStack” implementation.  This issue is well documented under OpenStack as Project Shade.

If you only plan to operate a mono-cloud then these are not concerns; however, everyone I’ve met is using at least AWS and one other cloud. This operational fact means that AWS provides the common service behavior baseline. This is not an API statement – it’s about being able to operate on the systems delivered by the API.

While the OpenStack API worked consistently on each tested cloud (win for DefCore!), it frequently delivered systems that could not be deployed or were unusable for later steps.

While these are not directly OpenStack API concerns, I do believe that additional metadata in the API could help expose material configuration choices. The challenge becomes defining those choices in a reference architecture way. The OpenStack principle of leaving implementation choices open makes it challenging to drive these options to a narrow set of choices. Unfortunately, it means it is difficult to create an intra-OpenStack hybrid automation without hard-coded vendor identities or exploding configuration flags.

As series of individually reasonable options dominoes together to make to these challenges.  These are real issues that I made the integration difficult.

  • No default of externally accessible systems. I have to assign floating IPs (an anti-pattern for individual VMs) or be on the internal networks. No consistent naming pattern for networks, types (flavors) or starting images.  In several cases, the “private” network is the publicly accessible one and the “external” network is visible but unusable.
  • No consistent naming for access user accounts.  If I want to ssh to a system, I have to fail my first login before I learn the right user name.
  • No data to determine which networks provide which functions.  And there’s no metadata about which networks are public or private.  
  • Incomplete post-provisioning processes because they are left open to user customization.

There is a defensible and logical reason for each example above; sadly, those reasons do nothing to make OpenStack more operationally accessible.  While intra-OpenStack interoperability is helpful, I believe that ecosystems and users benefit from Amazon-like behavior.

What should you do?  Help broaden the OpenStack discussions to seek interoperability with the whole cloud ecosystem.

 

At RackN, we will continue to refine and adapt to these variations.  Creating a consistent experience that copes with variability is the raison d’etre for our efforts with Digital Rebar. That means that we ultimately use AWS as the yardstick for configuration of any infrastructure from physical, OpenStack and even Amazon!

 

SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters

Originally posted on Kubernetes Blog.  I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early.  Please join us in this effort!

openWe think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project by weekly participation and collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

The Cluster Ops Special Interest Group meets weekly at 13:00PT on Thursdays, you can join us via the video hangout and see latest meeting notes for agendas and topics covered.

Fast Talk: Creating Operating Environments that Span Clouds and Physical Infrastructures

This short 15-minute talk pulls together a few themes around composability that you’ll see in future blogs where I lay out the challenges and solutions for hybrid DevOps practices.  Like any DevOps concept – it’s a mix of technology, attitude (culture) and process.

Our hybrid DevOps objective is simple: We need multi-infrastructure Amazon equivalence for ops automation.

IT perspective of AWSHere’s the summary:

  • Hybrid Infrastructure is new normal
  • Amazon is the Ops benchmark
  • Embrace operations automation
  • Invest in making IT composable

 

Want to listen to it?  Here’s the voice over:

https://www.youtube.com/watch?v=uorHrvMgwc0

 

Problems with the “Give me a Wookiee” hybrid API

Greg Althaus, RackN CTO, creates amazing hybrid DevOps orchestration that spans metal and cloud implementations.  When it comes to knowing the nooks and crannies of data centers, his ops scar tissue has scar tissue.  So, I knew you’d all enjoy this funny story he wrote after previewing my OpenStack API report.  

“APIs are only valuable if the parameters mean the same thing and you get back what you expect.” Greg Althaus

The following is a guest post by Greg:

While building the Digital Rebar OpenStack node provider, Rob Hirschfeld tried to integrate with 7+ OpenStack clouds.  While the APIs matched across instances, there are all sorts of challenges with what comes out of the API calls.  

The discovery made me realize that APIs are not the end of interoperability.  They are the beginning.  

I found I could best describe it with a story.

I found an API on a service and that API creates a Wookiee!

I can tell the API that I want a tall or short Wookiee or young or old Wookiee.  I test against the Kashyyyk service.  I consistently get a 8ft Brown 300 year old Wookiee when I ask for a Tall Old Wookiee.  

I get a 6ft Brown 50 Year old Wookiee when I ask for a Short Young Wookiee.  Exactly what I want, all the time.  

My pointy-haired emperor boss says I need to now use the Forest Moon of Endor (FME) Service.  He was told it is the exact same thing but cheaper.  Okay, let’s do this.  It consistently gives me 5 year old 4 ft tall Brown Ewok (called a Wookiee) when I ask for the Tall Young Wookiee.  

This is a fail.  I mean, yes, they are both furry and brown, but the Ewok can’t reach the top of my bookshelf.  

The next service has to work, right?  About the same price as FME, the Tatooine Service claims to be really good too.  It passes tests.  It hands out things called Wookiees.  The only problem is that, while size is an API field, the service requires the use of petite and big instead of short and tall.  This is just annoying.  This time my tall (well big) young Wookiee is 8 ft tall and 50 years old, but it is green and bald (scales are like that).  

I don’t really know what it is.  I’m sure it isn’t a Wookiee.  

And while she is awesome (better than the male Wookiees), she almost froze to death in the arctic tundra that is Boston.  

My point: APIs are only valuable if the parameters mean the same thing and you get back what you expect.

 

Hybrid DevOps: Union of Configuration, Orchestration and Composability

Steven Spector and I talked about “Hybrid DevOps” as a concept.  Our discussion led to a ‘there’s a picture for that!’ moment that often helped clarify the concept.  We believe that this concept, like Rugged DevOps, is additive to existing DevOps thinking and culture.  It’s about expanding our thinking to include orchestration and composability.

Hybrid DevOps 3 components (1)Here’s our write-up: Hybrid DevOps: Union of Configuration, Orchestration and Composability

Is Hybrid DevOps Like The Tokyo Metro?

I LOVE OPS ANALOGIES!  The “Hybrid DevOps = Tokyo Metro” really works because it accepts that some complexity is inescapable.  It would be great if Tokyo was a single system, but it’s not.  Cloud and infrastructure are the same – they are not a single vendor system and going to converge.

With that intro…Dan Choquette writes how DevOps at scale like a major city’s subway system? Both require strict processes and operational excellence to move a lot of different parts at once. How else? If you had …

Source: Is Hybrid DevOps Like The Tokyo Metro?

Composability is Critical in DevOps: let’s break the monoliths

This post was inspired by my DevOps.com Git for DevOps post and is an evolution of my “Functional Ops (the cake is a lie)” talks.

git_logo2016 is the year we break down the monoliths.  We’ve spent a lot of time talking about monolithic applications and microservices; however, there’s an equally deep challenge in ops automation.

Anti-monolith composability means making our automation into function blocks that can be chained together by orchestration.

What is going wrong?  We’re building fragile tightly coupled automation.

Most of the automation scripts that I’ve worked with become very long interconnected sequences well beyond the actual application that they are trying to install.  For example, Kubernetes needs etcd as a datastore.  The current model is to include the etcd install in the install script.  The same is true for SDN install/configuation and post-install test and dashboard UIs.  The simple “install Kubernetes” quickly explodes into a kitchen sink of related adjacent components.

Those installs quickly become fragile and bloated.  Even worse, they have hidden dependencies.  What happens when etcd changes.  Now, we’ve got to track down all the references to it burried in etcd based applications.  Further, we don’t get the benefits of etcd deployment improvements like secure or scale configuration.

What can we do about it?  Resist the urge to create vertical silos.

It’s temping and fast to create automation that works in a very prescriptive way for a single platform, operating system and tool chain.  The work of creating abstractions between configuration steps seems like a lot of overhead.  Even if you create those boundaries or reuse upstream automation, you’re likely to be vulnerable to changes within that component.  All these concerns drive operators to walk away from working collaboratively with each other and with developers.

Giving up on collaborative Ops hurts us all and makes it impossible to engineer excellent operational tools.  

Don’t give up!  Like git for development, we can do this together.

DevOps workers, you mother was right: always bring a clean Underlay.

Why did your mom care about underwear? She wanted you to have good hygiene. What is good Ops hygiene? It’s not as simple as keeping up with the laundry, but the idea is similar. It means that we’re not going to get surprised by something in our environment that we’d taken for granted. It means that we have a fundamental level of control to keep clean. Let’s explore this in context.

l_1600_1200_9847591C-0837-4A7D-A69D-54041685E1C6.jpegI’ve struggled with the term “underlay” for infrastructure of a long time. At RackN, we generally prefer the term “ready state” to describe getting systems prepared for install; however, underlay fits very well when we consider it as the foundation for a more building up a platform like Kubernetes, Docker Swarm, Ceph and OpenStack. Even more than single operator applications, these community built platforms require carefully tuned and configured environments. In my experience, getting the underlay right dramatically reduces installation challenges of the platform.

What goes into a clean underlay? All your infrastructure and most of your configuration.

Just buying servers (or cloud instances) does not make a platform. Cloud underlay is nearly as complex, but let’s assume metal here. To turn nodes into a cluster, you need setup their RAID and BIOS. Generally, you’ll also need to configure out-of-band management IPs and security. Those RAID and BIOS settings specific to the function of each node, so you’d better get that right. Then install the operating system. That will need access keys, IP addresses, names, NTP, DNS and proxy configuration just as a start. Before you connect to the wide, make sure to update to your a local mirror and site specific requirements. Installing Docker or a SDN layer? You may have to patch your kernel. It’s already overwhelming and we have not even gotten to the platform specific details!

Buried in this long sequence of configurations are critical details about your network, storage and environment.

Any mistake here and your install goes off the rails. Imagine that your building a house: it’s very expensive to change the plumbing lines once the foundation is poured. Thankfully, software configuration is not concrete but the costs of dealing with bad setup is just as frustrating.

The underlay is the foundation of your install. It needs to be automated and robust.

The challenge compounds once an installation is already in progress because adding the application changes the underlay. When (not if) you make a deploy mistake, you’ll have to either reset the environment or make your deployment idempotent (meaning, able to run the same script multiple times safely). Really, you need to do both.

Why do you need both fast resets and component idempotency? They each help you troubleshoot issues but in different ways. Fast resets ensure that you understand the environment your application requires. Post install tweaks can mask systemic problems that will only be exposed under load. Idempotent action allows you to quickly iterate over individual steps to optimize and isolate components. Together they create resilient automation and good hygiene.

In my experience, the best deployments involved a non-recoverable/destructive performance test followed by a completely fresh install to reset the environment. The Ops equivalent of a full dress rehearsal to flush out issues. I’ve seen similar concepts promoted around the Netflix Chaos Monkey pattern.

If your deployment is too fragile to risk breaking in development and test then you’re signing up for an on-going life of fire fighting. In that case, you’ll definitely need all the “clean underware” you can find.