10x Faster Today but 10x Harder to Maintain Tomorrow: the Cul-De-Sac problem

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way to scales dramatically better.  I’m not thinking about scale in number of nodes, but in operator efficiency.  The primary way to create that efficiency is limit site customization and to improve reuse.  Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Recently, RackN asked the question “Shouldn’t we have Shared Automation for Commodity Infrastructure?” which talked about fact that we, as an industry, keep writing custom automation for what should be commodity servers.  This “snow flaking” happens because there’s enough variation at the data center system level that it’s very difficult to share and reuse automation on an ongoing basis.

Since variation enables innovation, we need to solve this problem without limiting diversity of choice.

 

Happily, platforms like Kubernetes are designed to hide these infrastructure variations for developers.  That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms.  Unfortunately, that does nothing for the platforms or infrastructure bound applications.  For this lower level software, we need to accept that operations environments are heterogeneous.

 

I realized that we’re looking at a multidimensional problem after watching communities like OpenStack struggle to evolve operations practice.

It’s multidimensional because we are building the operations practice simultaneously with the software itself.  To make things even harder, the infrastructure and dependencies are also constantly changing.  Since this degree of rapid multi-factor innovation is the new normal, we have to plan that our operations automation itself must be as upgradable.

If we upgrade both the software AND the related deployment automation then each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current.  These challenges limit the virtuous feedback cycles that help communities grow.  

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

This is a hard problem because each deployment is unique and those differences make it hard to absorb community advances without being constantly broken.  That is one of the reasons why company opt out of the community and into vendor distributions. While Vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system level concerns.  Unlike project generated templates, this approach absorbs heterogeneity and brings in the external information that often complicate project deployment automation.  

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

How scared do we need to be for Ops collaboration & investment?

Note: Yesterday RackN posted Are you impatient enough to be an SRE?  and then the CIA wikileaks news hit… perhaps the right question is “Are you scared enough to automate deeply yet?” 

Cia leak (1)As an industry, the CIA hacking release yesterday should be driving discussions about how to make our IT infrastructure more robust and fluid. It is not simply enough to harden because both the attack and the platforms are evolving to quickly.

We must be delivering solutions with continuous delivery and immutability assumptions baked in.

A more fluid IT that assumes constant updates and rebuilding from sources (immutable) is not just a security posture but a proven business benefit. For me, that means actually building from the hardware up where we patch and scrub systems regularly to shorten the half-life of all attach surfaces. It also means enabling existing security built into our systems that are generally ignored because of configuration complexity. These are hard but solvable automation challenges.

The problem is too big to fix individually: we need to collaborate in the open.

I’ve been really thinking deeply about how we accelerate SRE and DevOps collaboration across organizations and in open communities. The lack of common infrastructure foundations costs companies significant overhead and speed as teams across the globe reimplement automation in divergent ways. It also drags down software platforms that must adapt to each data center as a unique snowflake.

That’s why hybrid automation within AND between companies is an imperative. It enables collaboration.

Making automation portable able to handle the differences between infrastructure and environments is harder; however, it also enables sharing and reuse that creates allows us to improve collectively instead of individually.

That’s been a vision driving us at RackN with the open hybrid Digital Rebar project.  Curious?  Here’s RackN post that inspired this one:

From RackN’s Are you impatient enough to be an SRE?

“Like the hardware that runs it, the foundation automation layer must be commoditized. That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.”

Infrastructure Masons is building a community around data center practice

IT is subject to seismic shifts right now. Here’s how we cope together.

For a long time, I’ve advocated for open operations (“OpenOps”) as a way to share best practices about running data centers. I’ve worked hard in OpenStack and, recently, Kubernetes communities to have operators collaborate around common architectures and automation tools. I believe the first step in these efforts starts with forming a community forum.

I’m very excited to have the RackN team and technology be part of the newly formed Infrastructure Masons effort because we are taking this exact community first approach.

infrastructure_masons

Here’s how Dean Nelson, IM organizer and head of Uber Compute, describes the initiative:

An Infrastructure Mason Partner is a professional who develop products, build or support infrastructure projects, or operate infrastructure on behalf of end users. Like their end users peers, they are dedicated to the advancement of the Industry, development of their fellow masons, and empowering business and personal use of the infrastructure to better the economy, the environment, and society.

We’re in the midst of tremendous movement in IT infrastructure.  The change to highly automated and scale-out design was enabled by cloud but is not cloud specific.  This requirement is reshaping how IT is practiced at the most fundamental levels.

We (IT Ops) are feeling amazing pressure on operations and operators to accelerate workflow processes and innovate around very complex challenges.

Open operations loses if we respond by creating thousands of isolated silos or moving everything to a vendor specific island like AWS.  The right answer is to fund ways to share practices and tooling that is tolerant of real operational complexity and the legitimate needs for heterogeneity.

Interested in more?  Get involved with the group!  I’ll be sharing more details here too.

 

OpenStack Interop, Container Security, Install & Open Source Posts

In case you missed it, I posted A LOT of content this week on other sites covering topics for OpenStack Interop, Container Security, Anti-Universal Installers and Monetizing Open Source.  Here are link-bait titles & blurbs from each post so you can decide which topics pique your interest.

Thirteen Ways Containers are More Secure than Virtual Machines on TheNewStack.com

Last year, conventional wisdom had it that containers were much less secure than virtual machines (VMs)! Since containers have such thin separating walls; it was easy to paint these back door risks with a broad brush.  Here’s a reality check: Front door attacks and unpatched vulnerabilities are much more likely than these backdoor hacks.

It’s Time to Slay the Universal Installer Unicorn on DevOps.com 

While many people want a universal “easy button installer,” they also want it to work on their unique snowflake of infrastructures, tools, networks and operating systems.  Because there is so much needful variation and change, it is better to give up on open source projects trying to own an installer and instead focus on making their required components more resilient and portable.

King of the hill? Discussing practical OpenStack interoperability on OpenStack SuperUser

Can OpenStack take the crown as cloud king? In our increasingly hybrid infrastructure environment, the path to the top means making it easier to user to defect from the current leaders (Amazon AWS; VMware) instead of asking them to blaze new trails. Here are my notes from a recent discussion about that exact topic…

Have OpenSource, Will Profit?! 5 thoughts from Battery Ventures OSS event on RobHirschfeld.com

As “open source eats software” the profit imperative becomes ever more important to figure out.  We have to find ways to fund this development or acknowledge that software will simply become waste IP and largess from mega brands.  The later outcome is not particularly appealing or innovative.

AWS Ops patterns set the standard: embrace that and accelerate

RackN creates infrastructure agnostic automation so you can run physical and cloud infrastructure with the same elastic operational patterns.  If you want to make infrastructure unimportant then your hybrid DevOps objective is simple:

Create multi-infrastructure Amazon equivalence for ops automation.

Ecosystem View of AWSEven if you are not an AWS fan, they are the universal yardstick (15 minute & 40 minute presos) That goes for other clouds (public and private) and for physical infrastructure too. Their footprint is simply so pervasive that you cannot ignore “works on AWS” as a need even if you don’t need to work on AWS.  Like PCs in the late-80s, we can use vendor competition to create user choice of infrastructure. That requires a baseline for equivalence between the choices. In the 90s, the Windows’ monopoly provided those APIs.

Why should you care about hybrid DevOps? As we increase operational portability, we empower users to make economic choices that foster innovation.  That’s valuable even for AWS locked users.

We’re not talking about “give me a VM” here! The real operational need is to build accessible, interconnected systems – what is sometimes called “the underlay.” It’s more about networking, configuration and credentials than simple compute resources. We need consistent ways to automate systems that can talk to each other and static services, have access to dependency repositories (code, mirrors and container hubs) and can establish trust with other systems and administrators.

These “post” provisioning tasks are sophisticated and complex. They cannot be statically predetermined. They must be handled dynamically based on the actual resource being allocated. Without automation, this process becomes manual, glacial and impossible to maintain. Does that sound like traditional IT?

Side Note on Containers: For many developers, we are adding platforms like Docker, Kubernetes and CloudFoundry, that do these integrations automatically for their part of the application stack. This is a tremendous benefit for their use-cases. Sadly, hiding the problem from one set of users does not eliminate it! The teams implementing and maintaining those platforms still have to deal with underlay complexity.

I am emphatically not looking for AWS API compatibility: we are talking about emulating their service implementation choices.  We have plenty of ways to abstract APIs. Ops is a post-API issue.

In fact, I believe that red herring leads us to a bad place where innovation is locked behind legacy APIs.  Steal APIs where it makes sense, but don’t blindly require them because it’s the layer under them where the real compatibility challenge lurk.  

Side Note on OpenStack APIs (why they diverge): Trying to implement AWS APIs without duplicating all their behaviors is more frustrating than a fresh API without the implied AWS contracts.  This is exactly the problem with OpenStack variation.  The APIs work but there is not a behavior contract behind them.

For example, transitioning to IPv6 is difficult to deliver because Amazon still relies on IPv4. That lack makes it impossible to create hybrid automation that leverages IPv6 because they won’t work on AWS. In my world, we had to disable default use of IPv6 in Digital Rebar when we added AWS. Another example? Amazon’s regional AMI pattern, thankfully, is not replicated by Google; however, their lack means there’s no consistent image naming pattern.  In my experience, a bad pattern is generally better than inconsistent implementations.

As market dominance drives us to benchmark on Amazon, we are stuck with the good, bad and ugly aspects of their service.

For very pragmatic reasons, even AWS automation is highly fragmented. There are a large and shifting number of distinct system identifiers (AMIs, regions, flavors) plus a range of user-configured choices (security groups, keys, networks). Even within a single provider, these options make impossible to maintain a generic automation process.  Since other providers logically model from AWS, we will continue to expect AWS like behaviors from them.  Variation from those norms adds effort.

Failure to follow AWS without clear reason and alternative path is frustrating to users.

Do you agree?  Join us with Digital Rebar creating real a hybrid operations platform.

Hybrid DevOps: Union of Configuration, Orchestration and Composability

Steven Spector and I talked about “Hybrid DevOps” as a concept.  Our discussion led to a ‘there’s a picture for that!’ moment that often helped clarify the concept.  We believe that this concept, like Rugged DevOps, is additive to existing DevOps thinking and culture.  It’s about expanding our thinking to include orchestration and composability.

Hybrid DevOps 3 components (1)Here’s our write-up: Hybrid DevOps: Union of Configuration, Orchestration and Composability

Hybrid & Container Disruption [Notes from CTP Mike Kavis’ Interview]

Last week, Cloud Technology Partner VP Mike Kavis (aka MadGreek65) and I talked for 30 minutes about current trends in Hybrid Infrastructure and Containers.

leadership-photos-mike

Mike Kavis

Three of the top questions that we discussed were:

  1. Why Composability is required for deployment?  [5:45]
  2. Is Configuration Management dead? [10:15]
  3. How can containers be more secure than VMs? [23:30]

Here’s the audio matching the time stamps in my notes:

  • 00:44: What is RackN? – scale data center operations automation
  • 01:45: Digital Rebar is… 3rd generation provisioning to manage data center ops & bring up
  • 02:30: Customers were struggling on Ops more than code or hardware
  • 04:00: Rethinking “open” to include user choice of infrastructure, not just if the code is open source.
  • 05:00: Use platforms where it’s right for users.
  • 05:45: Composability – it’s how do we deal with complexity. Hybrid DevOps
  • 06:40: How do we may Ops more portable
  • 07:00: Five components of Hybrid DevOps
  • 07:27: Rob has “Rick Perry” Moment…
  • 08:30: 80/20 Rule for DevOps where 20% is mixed.
  • 10:15: “Is configuration management dead” > Docker does hurt Configuration Management
  • 11:00: How Service Registry can replace Configuration.
  • 11:40: Reference to John Willis on the importance of sequence.
  • 12:30: Importance of Sequence, Services & Configuration working together
  • 12:50: Digital Rebar intermixes all three
  • 13:30: The race to have orchestration – “it’s always been there”
  • 14:30: Rightscale Report > Enterprises average SIX platforms in use
  • 15:30: Fidelity Gap – Why everyone will hybrid but need to avoid monoliths
  • 16:50: Avoid hybrid trap and keep a level of abstraction
  • 17:41: You have to pay some “abstraction tax” if you want to hybrid BUT you can get some additional benefits: hybrid + ops management.
  • 18:00: Rob gives a shout out to Rightscale
  • 19:20: Rushing to solutions does not create secure and sustained delivery
  • 20:40: If you work in a silo, you loose the ability to collaborate and reuse other works
  • 21:05: Rob is sad about “OpenStack explosion of installers”
  • 21:45: Container benefit from services containers – how they can be MORE SECURE
  • 23:00: Automation required for security
  • 23:30: How containers will be more secure than containers
  • 24:30: Rob bring up “cheese” again…
  • 26:15: If you have more situationalleadership-photos-mike awareness, you can be more secure WITHOUT putting more work for developers.
  • 27:00: Containers can help developers worry about as many aspects of Ops
  • 27:45: Wrap up

What do you think?  I’d love to hear your opinion on these topics!