5 Key Aspects of High Fidelity DevOps [repost from DevOps.com]

For all our cloud enthusiasm, I feel like ops automation is suffering as we increase choice and complexity.  Why is this happening?  It’s about loss of fidelity.

Nearly a year ago, I was inspired by a mention of “Fidelity Gaps” during a Cloud Foundry After Dark session.  With additional advice from DevOps leader Gene Kim, this narrative about the why and how of DevOps Fidelity emerged.

As much as we talk about how we should have shared goals spanning Dev and Ops, it’s not nearly as easy as it sounds. To fuel a DevOps culture, we also have to build robust tooling.

That means investing up front in five key areas: abstraction, composability, automation, orchestration, and idempotency.

Together, these concepts allow sharing work at every level of the pipeline. Unfortunately, it’s tempting to optimize work at one level and miss the true system bottlenecks.

Creating production-like fidelity for developers is essential: We need it for scale, security and upgrades. It’s not just about sharing effort; it’s about empathy and collaboration.

But even with growing acceptance of DevOps as a cultural movement, I believe deployment disparities are a big unsolved problem. When developers have vastly different working environments from operators, it creates a “fidelity gap” that makes it difficult for the teams to collaborate.

Before we talk about the costs and solutions, let me first share a story from back when I was a bright-eyed OpenStack enthusiast…

Read the Full Article on DevOps.com including my section about Why OpenStack Devstack harms the project and five specific ways to improve DevOps fidelity.

Bugs Bunny, Prince and Enabling True Hybrid Infrastructure Consumption

OK- Stay with me on this. I’m drawing parallels again.  🙂

Like many from my generation, my initial exposure to classical music and opera was derived from Bugs Bunny on Saturday mornings (culturally deprived, I know). One of the cartoons I remember well was with Bugs trying to get even with the heavy-set opera singer who disrupts Bugs’ banjo playing. In order to exact his revenge, Bugs infiltrates the opera singer’s concert by impersonating the famous long-hared (hared…get it?) conductor, Leopold Stokowski. He proceeds to force the tenor to hit octaves that structurally compromise the amphitheater and as it crumbles leaves him bruised and battered. Bugs is as always, victorious.

In examining Bugs’ strategy (let’s assume he actually had one), Bugs took over operations of the orchestra’s musical program to achieve his goal of getting the tenor “in-line” so to speak. As I prepare to head down to the OpenStack Conference in Austin, TX next week, I’m seeing similar patterns develop in the cloud and data center infrastructure space which are very “Bugs/Leopold-like”. With organizations deciding on how to consolidate data centers, containerize apps and move to the cloud, vendors and open source technologies offer value; however, true operational, infrastructure and platform independence is not what it appears to be. For example, once you move your apps off the data center to AWS or VMware and then later determine you are paying too much or the workload is no longer appropriate for the infrastructure, good luck replicating the configuration work done on CloudFormation on another cloud or back in the data center. The same rationale applies to other technologies such as converged infrastructure and proprietary private cloud platforms. As the customer, to achieve scale and remove operational pain you must fall in line. That in itself is a big commitment to make in a still-evolving and maturing technology industry and a dynamic business climate.

On an unrelated topic, I was saddened to learn of the passing of Prince this past week. While not a die-hard fan, I liked his music. He was a great composer of songs and had a style all to his own. Beyond his music and sheer talent, I admired his business beliefs and deep desire to maintain creative ownership and control of his music and his brand.

Despite his fortune and fame, there was a period in the middle of Prince’s career in which he felt creatively and financially locked-in by the big record companies. Once Prince (and the unpronounceable symbol) broke away from Warner Music, he was able to produce music under his own label. This action enabled him to create music without a major record label dictating when he needed to produce a new album and what it needed to sound like. In addition, he was now able to market his new recordings to the distribution platform that supported his artistic and financial goals. While still having ties to Warner Music, he was no longer bound by their business practices. Along with starting his own music subscription service, Prince cut deals with Arista, Columbia, iTunes and Sony. Prince’s music production had operational portability, business agility and choice (seven Grammy awards and 100 million record sales also help create that kind of leverage).

While open APIs and containers offer some portability, at RackN we believe they do not offer a completely free market experience to the cloud and infrastructure consumer. If the business decides it is paying too much for AWS, the operational underlay and configuration complexity should not lock it in to the infrastructure provider. It should be able to transfer its business to Google, Azure, Rackspace or Dreamhost with ease. We believe technologies that create portable, composable operational workflows drive true infrastructure and platform independence and, as a benefit, reduce business risk. Choosing a platform and being forced to use it are two very different things.

In conclusion, when considering moving workloads to the cloud, converged infrastructure platforms or using DevOps automation tools, consider how you can achieve programmable operational portability and agility. Think about how you can best absorb new technologies without causing operational disruption in your infrastructure. Furthermore, ensure you can accomplish this in a repeatable, automated fashion. Analyze how you can abstract away complex configurations for security, networking and container orchestration technologies and make them adaptable from one infrastructure platform to another. Attempt to eliminate configuration versioning as much as possible and make upgrades simple and automated so your DevOps staff do not have to be experts (they are stressed out enough).

If you are attending the OpenStack Conference this week, look me up. While I am far from a music expert, I’ll be happy to share with you my insights on how to spot a technology vendor that likes to play a purple guitar as opposed to one that eats carrots and plays the banjo.

-Dan Choquette: Co-Founder, RackN

 

 

 

SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters

Originally posted on Kubernetes Blog.  I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early.  Please join us in this effort!

We think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project, through weekly participation and by collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

The Cluster Ops Special Interest Group meets weekly at 13:00 PT on Thursdays. You can join us via the video hangout and see the latest meeting notes for agendas and topics covered.

AWS Ops patterns set the standard: embrace that and accelerate

RackN creates infrastructure agnostic automation so you can run physical and cloud infrastructure with the same elastic operational patterns.  If you want to make infrastructure unimportant then your hybrid DevOps objective is simple:

Create multi-infrastructure Amazon equivalence for ops automation.

Even if you are not an AWS fan, they are the universal yardstick (15 minute & 40 minute presos). That goes for other clouds (public and private) and for physical infrastructure too. Their footprint is simply so pervasive that you cannot ignore “works on AWS” as a need even if you don’t need to work on AWS.  Like PCs in the late-80s, we can use vendor competition to create user choice of infrastructure. That requires a baseline for equivalence between the choices. In the 90s, the Windows monopoly provided those APIs.

Why should you care about hybrid DevOps? As we increase operational portability, we empower users to make economic choices that foster innovation.  That’s valuable even for AWS locked users.

We’re not talking about “give me a VM” here! The real operational need is to build accessible, interconnected systems – what is sometimes called “the underlay.” It’s more about networking, configuration and credentials than simple compute resources. We need consistent ways to automate systems that can talk to each other and static services, have access to dependency repositories (code, mirrors and container hubs) and can establish trust with other systems and administrators.

These “post” provisioning tasks are sophisticated and complex. They cannot be statically predetermined. They must be handled dynamically based on the actual resource being allocated. Without automation, this process becomes manual, glacial and impossible to maintain. Does that sound like traditional IT?
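To make “handled dynamically” concrete, here is a minimal sketch in Python. It assumes a hypothetical `Node` record and made-up task names; none of this is Digital Rebar's actual API. The point is only that the task list is computed from the resource that was actually allocated instead of being read from a static template.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """A freshly allocated resource. All fields are hypothetical."""
    name: str
    provider: str        # e.g. "aws" or "metal"
    private_ip: str
    has_public_ip: bool


def post_provision_plan(node: Node, site_mirror: str) -> list[str]:
    """Build the post-provisioning task list from the node we actually got."""
    tasks = [f"set-hostname {node.name}", f"configure-repos {site_mirror}"]
    if node.provider == "metal":
        # Physical nodes need out-of-band setup that cloud instances do not.
        tasks.insert(0, "configure-bmc")
    if not node.has_public_ip:
        # Without a public address, installs have to go through a proxy.
        tasks.append("configure-proxy")
    # DNS registration depends on the address assigned at allocation time.
    tasks.append(f"register-dns {node.name} {node.private_ip}")
    return tasks
```

Calling `post_provision_plan` for a metal node without a public IP yields a different plan than for a cloud instance with one, which is exactly why these steps cannot be predetermined.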

Side Note on Containers: For many developers, we are adding platforms like Docker, Kubernetes and CloudFoundry, that do these integrations automatically for their part of the application stack. This is a tremendous benefit for their use-cases. Sadly, hiding the problem from one set of users does not eliminate it! The teams implementing and maintaining those platforms still have to deal with underlay complexity.

I am emphatically not looking for AWS API compatibility: we are talking about emulating their service implementation choices.  We have plenty of ways to abstract APIs. Ops is a post-API issue.

In fact, I believe that red herring leads us to a bad place where innovation is locked behind legacy APIs.  Steal APIs where it makes sense, but don’t blindly require them because it’s the layer under them where the real compatibility challenges lurk.

Side Note on OpenStack APIs (why they diverge): Trying to implement AWS APIs without duplicating all their behaviors is more frustrating than a fresh API without the implied AWS contracts.  This is exactly the problem with OpenStack variation.  The APIs work but there is not a behavior contract behind them.

For example, transitioning to IPv6 is difficult to deliver because Amazon still relies on IPv4. That gap makes it impossible to create hybrid automation that leverages IPv6 because it won’t work on AWS. In my world, we had to disable default use of IPv6 in Digital Rebar when we added AWS. Another example? Amazon’s regional AMI pattern, thankfully, is not replicated by Google; however, its absence means there’s no consistent image naming pattern.  In my experience, a bad pattern is generally better than inconsistent implementations.

As market dominance drives us to benchmark on Amazon, we are stuck with the good, bad and ugly aspects of their service.

For very pragmatic reasons, even AWS automation is highly fragmented. There are a large and shifting number of distinct system identifiers (AMIs, regions, flavors) plus a range of user-configured choices (security groups, keys, networks). Even within a single provider, these options make it impossible to maintain a generic automation process.  Since other providers logically model from AWS, we will continue to expect AWS-like behaviors from them.  Variation from those norms adds effort.
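One common way to cope with shifting identifiers is to keep a logical name in the automation and resolve it as late as possible. The sketch below is an illustration only: the catalog contents, the image IDs and the `resolve_image` helper are all invented for this example and do not come from any provider or from Digital Rebar.

```python
# Hypothetical catalog: one logical image name, many provider-specific IDs.
# A region of None means the provider does not vary the ID by region.
IMAGE_CATALOG = {
    ("aws", "us-east-1"): {"ubuntu-lts": "ami-EXAMPLE-east"},
    ("aws", "us-west-2"): {"ubuntu-lts": "ami-EXAMPLE-west"},
    ("google", None): {"ubuntu-lts": "example-ubuntu-lts-image"},
}


def resolve_image(provider, region, logical_name):
    """Translate a logical image name into the ID a given provider expects."""
    key = (provider, region)
    if key not in IMAGE_CATALOG:
        # Fall back to a region-independent entry if the provider has one.
        key = (provider, None)
    try:
        return IMAGE_CATALOG[key][logical_name]
    except KeyError:
        raise LookupError(
            f"no image {logical_name!r} for {provider}/{region}"
        ) from None
```

The catalog still has to be maintained by someone, which is the fragmentation cost described above; the lookup only keeps that cost out of every individual script.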

Failure to follow AWS without clear reason and alternative path is frustrating to users.

Do you agree?  Join us with Digital Rebar creating a real hybrid operations platform.

Fast Talk: Creating Operating Environments that Span Clouds and Physical Infrastructures

This short 15-minute talk pulls together a few themes around composability that you’ll see in future blogs where I lay out the challenges and solutions for hybrid DevOps practices.  Like any DevOps concept – it’s a mix of technology, attitude (culture) and process.

Our hybrid DevOps objective is simple: We need multi-infrastructure Amazon equivalence for ops automation.

Here’s the summary:

  • Hybrid Infrastructure is the new normal
  • Amazon is the Ops benchmark
  • Embrace operations automation
  • Invest in making IT composable

 

Want to listen to it?  Here’s the voice over:

 

Problems with the “Give me a Wookiee” hybrid API

Greg Althaus, RackN CTO, creates amazing hybrid DevOps orchestration that spans metal and cloud implementations.  When it comes to knowing the nooks and crannies of data centers, his ops scar tissue has scar tissue.  So, I knew you’d all enjoy this funny story he wrote after previewing my OpenStack API report.  

“APIs are only valuable if the parameters mean the same thing and you get back what you expect.” Greg Althaus

The following is a guest post by Greg:

While building the Digital Rebar OpenStack node provider, Rob Hirschfeld tried to integrate with 7+ OpenStack clouds.  While the APIs matched across instances, there are all sorts of challenges with what comes out of the API calls.  

The discovery made me realize that APIs are not the end of interoperability.  They are the beginning.  

I found I could best describe it with a story.

I found an API on a service and that API creates a Wookiee!

I can tell the API that I want a tall or short Wookiee or young or old Wookiee.  I test against the Kashyyyk service.  I consistently get an 8ft Brown 300 year old Wookiee when I ask for a Tall Old Wookiee.

I get a 6ft Brown 50 Year old Wookiee when I ask for a Short Young Wookiee.  Exactly what I want, all the time.  

My pointy-haired emperor boss says I need to now use the Forest Moon of Endor (FME) Service.  He was told it is the exact same thing but cheaper.  Okay, let’s do this.  It consistently gives me a 5 year old 4 ft tall Brown Ewok (called a Wookiee) when I ask for the Tall Young Wookiee.

This is a fail.  I mean, yes, they are both furry and brown, but the Ewok can’t reach the top of my bookshelf.  

The next service has to work, right?  About the same price as FME, the Tatooine Service claims to be really good too.  It passes tests.  It hands out things called Wookiees.  The only problem is that, while size is an API field, the service requires the use of petite and big instead of short and tall.  This is just annoying.  This time my tall (well big) young Wookiee is 8 ft tall and 50 years old, but it is green and bald (scales are like that).  

I don’t really know what it is.  I’m sure it isn’t a Wookiee.  

And while she is awesome (better than the male Wookiees), she almost froze to death in the arctic tundra that is Boston.  

My point: APIs are only valuable if the parameters mean the same thing and you get back what you expect.
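The story can be sketched as a behavior check that goes beyond the call signature. Everything here is hypothetical: the `Creature` type, the two stand-in services and the thresholds for “tall” and “young” are assumptions chosen to match the story, not a real API.

```python
from dataclasses import dataclass


@dataclass
class Creature:
    species: str
    height_ft: float
    age_years: int


def kashyyyk(size, age):
    """Stand-in service that behaves as the caller expects."""
    return Creature("Wookiee", 8 if size == "tall" else 6, 300 if age == "old" else 50)


def endor(size, age):
    """Stand-in service with the same signature but different behavior."""
    return Creature("Ewok", 4, 5)


def meets_contract(service):
    """Check what comes back, not just that the call succeeds."""
    got = service("tall", "young")
    return (
        got.species == "Wookiee"
        and got.height_ft >= 7      # assumed meaning of "tall"
        and got.age_years <= 100    # assumed meaning of "young"
    )
```

Both services accept the same parameters, so an API-level compatibility test passes for each. Only `meets_contract(kashyyyk)` is true; `meets_contract(endor)` is false.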

 

Hybrid DevOps: Union of Configuration, Orchestration and Composability

Steven Spector and I talked about “Hybrid DevOps” as a concept.  Our discussion led to a ‘there’s a picture for that!’ moment that often helped clarify the concept.  We believe that this concept, like Rugged DevOps, is additive to existing DevOps thinking and culture.  It’s about expanding our thinking to include orchestration and composability.

Here’s our write-up: Hybrid DevOps: Union of Configuration, Orchestration and Composability

Composability is Critical in DevOps: let’s break the monoliths

This post was inspired by my DevOps.com Git for DevOps post and is an evolution of my “Functional Ops (the cake is a lie)” talks.

2016 is the year we break down the monoliths.  We’ve spent a lot of time talking about monolithic applications and microservices; however, there’s an equally deep challenge in ops automation.

Anti-monolith composability means making our automation into function blocks that can be chained together by orchestration.

What is going wrong?  We’re building fragile tightly coupled automation.

Most of the automation scripts that I’ve worked with become very long interconnected sequences well beyond the actual application that they are trying to install.  For example, Kubernetes needs etcd as a datastore.  The current model is to include the etcd install in the install script.  The same is true for SDN install/configuration and post-install test and dashboard UIs.  The simple “install Kubernetes” quickly explodes into a kitchen sink of related adjacent components.

Those installs quickly become fragile and bloated.  Even worse, they have hidden dependencies.  What happens when etcd changes?  Now, we’ve got to track down all the references to it buried in etcd based applications.  Further, we don’t get the benefits of etcd deployment improvements like secure or scale configuration.
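As a rough illustration of the alternative, the sketch below splits the install into separate function blocks and lets a small orchestrator chain them by declared dependency. The block names, the shared `state` dictionary and the placeholder endpoint are assumptions made up for this example; it uses Python's standard `graphlib` module (3.9+) and is not how any particular installer works.

```python
from graphlib import TopologicalSorter


def install_etcd(state):
    # Placeholder value; a real block would deploy etcd and report its endpoint.
    state["etcd_endpoint"] = "http://127.0.0.1:2379"


def install_kubernetes(state):
    # Consumes etcd through shared state instead of installing it inline.
    state["kubernetes"] = f"using {state['etcd_endpoint']}"


def run_smoke_test(state):
    state["tested"] = "kubernetes" in state


BLOCKS = {
    "etcd": install_etcd,
    "kubernetes": install_kubernetes,
    "smoke-test": run_smoke_test,
}

# Each block lists the blocks that must run before it.
DEPENDS_ON = {
    "etcd": set(),
    "kubernetes": {"etcd"},
    "smoke-test": {"kubernetes"},
}


def orchestrate(blocks, depends_on):
    """Run every block once, in an order that respects the dependencies."""
    state = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        blocks[name](state)
    return order, state
```

With this shape, an improved etcd block can be swapped in without touching the Kubernetes block, as long as it still publishes `etcd_endpoint`.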

What can we do about it?  Resist the urge to create vertical silos.

It’s tempting and fast to create automation that works in a very prescriptive way for a single platform, operating system and tool chain.  The work of creating abstractions between configuration steps seems like a lot of overhead.  Even if you create those boundaries or reuse upstream automation, you’re likely to be vulnerable to changes within that component.  All these concerns drive operators to walk away from working collaboratively with each other and with developers.

Giving up on collaborative Ops hurts us all and makes it impossible to engineer excellent operational tools.  

Don’t give up!  Like git for development, we can do this together.

DevOps workers, your mother was right: always bring a clean Underlay.

Why did your mom care about underwear? She wanted you to have good hygiene. What is good Ops hygiene? It’s not as simple as keeping up with the laundry, but the idea is similar. It means that we’re not going to get surprised by something in our environment that we’d taken for granted. It means that we have a fundamental level of control to keep clean. Let’s explore this in context.

I’ve struggled with the term “underlay” for infrastructure for a long time. At RackN, we generally prefer the term “ready state” to describe getting systems prepared for install; however, underlay fits very well when we consider it as the foundation for building up a platform like Kubernetes, Docker Swarm, Ceph or OpenStack. Even more than single operator applications, these community built platforms require carefully tuned and configured environments. In my experience, getting the underlay right dramatically reduces installation challenges of the platform.

What goes into a clean underlay? All your infrastructure and most of your configuration.

Just buying servers (or cloud instances) does not make a platform. Cloud underlay is nearly as complex, but let’s assume metal here. To turn nodes into a cluster, you need to set up their RAID and BIOS. Generally, you’ll also need to configure out-of-band management IPs and security. Those RAID and BIOS settings are specific to the function of each node, so you’d better get that right. Then install the operating system. That will need access keys, IP addresses, names, NTP, DNS and proxy configuration just as a start. Before you connect to the wider network, make sure to update to your local mirror and site specific requirements. Installing Docker or an SDN layer? You may have to patch your kernel. It’s already overwhelming and we have not even gotten to the platform specific details!

Buried in this long sequence of configurations are critical details about your network, storage and environment.

Any mistake here and your install goes off the rails. Imagine that you’re building a house: it’s very expensive to change the plumbing lines once the foundation is poured. Thankfully, software configuration is not concrete but the costs of dealing with bad setup are just as frustrating.

The underlay is the foundation of your install. It needs to be automated and robust.

The challenge compounds once an installation is already in progress because adding the application changes the underlay. When (not if) you make a deploy mistake, you’ll have to either reset the environment or make your deployment idempotent (meaning, able to run the same script multiple times safely). Really, you need to do both.

Why do you need both fast resets and component idempotency? They each help you troubleshoot issues but in different ways. Fast resets ensure that you understand the environment your application requires. Post install tweaks can mask systemic problems that will only be exposed under load. Idempotent action allows you to quickly iterate over individual steps to optimize and isolate components. Together they create resilient automation and good hygiene.
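For readers who have not written idempotent automation, here is a minimal sketch of one idempotent step. The `ensure_line` helper and the example file contents are hypothetical; the pattern is what matters: check the current state first, change it only if needed, and report whether anything changed.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is present in the file; return True only if we changed it."""
    target = Path(path)
    text = target.read_text() if target.exists() else ""
    if line in text.splitlines():
        # Already in the desired state, so running again is a safe no-op.
        return False
    # Keep existing content intact and avoid gluing onto an unterminated line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    target.write_text(text + separator + line + "\n")
    return True
```

Running the same call twice leaves the file identical to running it once, which is what lets you iterate on a single step without resetting the whole environment.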

In my experience, the best deployments involved a non-recoverable/destructive performance test followed by a completely fresh install to reset the environment. It’s the Ops equivalent of a full dress rehearsal to flush out issues. I’ve seen similar concepts promoted around the Netflix Chaos Monkey pattern.

If your deployment is too fragile to risk breaking in development and test then you’re signing up for an on-going life of fire fighting. In that case, you’ll definitely need all the “clean underware” you can find.

Full Metal DevOps: 12 things we needed beyond Cobbler

Almost a manifesto!

Rob Hirschfeld

The RackN team did not plan to replace Cobbler; we just needed something that responded to our need for full-cycle cross-platform DevOps automation.

Provisioning an O/S is never enough!  You need to coordinate a lot of operational activity to deploy a multi-node system, like OpenStack, Kubernetes, Docker Swarm or Ceph.  Since we believe an automated upgrade path is also required, there is a huge gap in provisioning.

So what was needed?  Here’s our (rather long!) list of gaps to fill for full Metal DevOps provisioning:

Gap | Commentary
1. Needs to work with Cobbler! | Improve? Yes.  Disrupt?  Hell No!  It has to be OK to leave Cobbler in place while we do something better.  I’d be OK to tweak my Cobbler to point it to the new stuff.
2. REST API & JSON CLI | Beyond the obvious API, we really want a way to write scripts that drive deployment proactively.
3. Modular Components | If…
