10x Faster Today but 10x Harder to Maintain Tomorrow: the Cul-De-Sac problem

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way to scales dramatically better.  I’m not thinking about scale in number of nodes, but in operator efficiency.  The primary way to create that efficiency is limit site customization and to improve reuse.  Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Recently, RackN asked the question “Shouldn’t we have Shared Automation for Commodity Infrastructure?” which talked about fact that we, as an industry, keep writing custom automation for what should be commodity servers.  This “snow flaking” happens because there’s enough variation at the data center system level that it’s very difficult to share and reuse automation on an ongoing basis.

Since variation enables innovation, we need to solve this problem without limiting diversity of choice.

 

Happily, platforms like Kubernetes are designed to hide these infrastructure variations for developers.  That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms.  Unfortunately, that does nothing for the platforms or infrastructure bound applications.  For this lower level software, we need to accept that operations environments are heterogeneous.

 

I realized that we’re looking at a multidimensional problem after watching communities like OpenStack struggle to evolve operations practice.

It’s multidimensional because we are building the operations practice simultaneously with the software itself.  To make things even harder, the infrastructure and dependencies are also constantly changing.  Since this degree of rapid multi-factor innovation is the new normal, we have to plan that our operations automation itself must be as upgradable.

If we upgrade both the software AND the related deployment automation then each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current.  These challenges limit the virtuous feedback cycles that help communities grow.  

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

This is a hard problem because each deployment is unique and those differences make it hard to absorb community advances without being constantly broken.  That is one of the reasons why company opt out of the community and into vendor distributions. While Vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system level concerns.  Unlike project generated templates, this approach absorbs heterogeneity and brings in the external information that often complicate project deployment automation.  

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

How scared do we need to be for Ops collaboration & investment?

Note: Yesterday RackN posted Are you impatient enough to be an SRE?  and then the CIA wikileaks news hit… perhaps the right question is “Are you scared enough to automate deeply yet?” 

Cia leak (1)As an industry, the CIA hacking release yesterday should be driving discussions about how to make our IT infrastructure more robust and fluid. It is not simply enough to harden because both the attack and the platforms are evolving to quickly.

We must be delivering solutions with continuous delivery and immutability assumptions baked in.

A more fluid IT that assumes constant updates and rebuilding from sources (immutable) is not just a security posture but a proven business benefit. For me, that means actually building from the hardware up where we patch and scrub systems regularly to shorten the half-life of all attach surfaces. It also means enabling existing security built into our systems that are generally ignored because of configuration complexity. These are hard but solvable automation challenges.

The problem is too big to fix individually: we need to collaborate in the open.

I’ve been really thinking deeply about how we accelerate SRE and DevOps collaboration across organizations and in open communities. The lack of common infrastructure foundations costs companies significant overhead and speed as teams across the globe reimplement automation in divergent ways. It also drags down software platforms that must adapt to each data center as a unique snowflake.

That’s why hybrid automation within AND between companies is an imperative. It enables collaboration.

Making automation portable able to handle the differences between infrastructure and environments is harder; however, it also enables sharing and reuse that creates allows us to improve collectively instead of individually.

That’s been a vision driving us at RackN with the open hybrid Digital Rebar project.  Curious?  Here’s RackN post that inspired this one:

From RackN’s Are you impatient enough to be an SRE?

“Like the hardware that runs it, the foundation automation layer must be commoditized. That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.”

What does it take to Operate Open Platforms? Answers in Datanaughts 72

Did I just let OpenStack ops off the hook….?  Kubernetes production challenges…?  

ix34grhy_400x400I had a lot of fun in this Datanaughts wide ranging discussion with unicorn herders Chris Wahl and Ethan Banks.  I like the three section format because it gives us a chance to deep dive into distinct topics and includes some out-of-band analysis by the hosts; however, that means you need to keep listening through the commercial breaks to hear the full podcast.

Three parts?  Yes, Chris and Ethan like to save the best questions for last.

In Part 1, we went deep into the industry operational and business challenges uncovered by the OpenStack project. Particularly, Chris and I go into “platform underlay” issues which I laid out in my “please stop the turtles” post. This was part of the build-up to my SRE series.

In Part 2, we explore my operations-focused view of the latest developments in container schedulers with a focus on Kubernetes. Part of the operational discussion goes into architecture “conceits” (or compromises) that allow developers to get the most from cloud native design patterns. I also make a pitch for using proven tools to run the underlay.

In Part 3, we go deep into DevOps automation topics of configuration and orchestration. We talk about the design principles that help drive “day 2” automation and why getting in-place upgrades should be an industry priority.  Of course, we do cover some Digital Rebar design too.

Take a listen and let me know what you think!

On Twitter, we’ve already started a discussion about how much developers should care about infrastructure. My opinion (posted here) is that one DevOps idea where developers “own” infrastructure caused a partial rebellion towards containers.

SRE role with DevOps for Enterprise [@HPE podcast]

sre-series

My focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

Yes, DevOps and SRE are complementary

In this short 16 minute podcast, HPE’s Stephen Spector and I discuss how DevOps and SRE thinking overlaps and where are the differences.  We also discuss how Enterprises should be evaluating Site Reliability Engineering as a function and where it fits in their organization.

Beyond Expectations: OpenStack via Kubernetes Helm (Fully Automated with Digital Rebar)

RackN revisits OpenStack deployments with an eye on ongoing operations.

I’ve been an outspoken skeptic of a Joint OpenStack Kubernetes Environment (my OpenStack BCN presoSuper User follow-up and BOS Proposal) because I felt that the technical hurdles of cloud native architecture would prove challenging.  Issues like stable service positioning and persistent data are requirements for OpenStack and hard problems in Kubernetes.

I was wrong: I underestimated how fast these issues could be addressed.

youtube-thumb-nail-openstackThe Kubernetes Helm work out of the AT&T Comm Dev lab takes on the integration with a “do it the K8s native way” approach that the RackN team finds very effective.  In fact, we’ve created a fully integrated Digital Rebar deployment that lays down Kubernetes using Kargo and then adds OpenStack via Helm.  The provisioning automation includes a Ceph cluster to provide stateful sets for data persistence.  

This joint approach dramatically reduces operational challenges associated with running OpenStack without taking over a general purpose Kubernetes infrastructure for a single task.

sre-seriesGiven the rise of SRE thinking, the RackN team believes that this approach changes the field for OpenStack deployments and will ultimately dominate the field (which is already  mainly containerized).  There is still work to be completed: some complex configuration is required to allow both Kubernetes CNI and Neutron to collaborate so that containers and VMs can cross-communicate.

We are looking for companies that want to join in this work and fast-track it into production.  If this is interesting, please contact us at sre@rackn.com.

Why should you sponsor? Current OpenStack operators facing “fork-lift upgrades” should want to find a path like this one that ensures future upgrades are baked into the plan.  This approach provide a fast track to a general purpose, enterprise grade, upgradable Kubernetes infrastructure.

Closing note from my past presentations: We’re making progress on the technical aspects of this integration; however, my concerns about market positioning remain.

“Why SRE?” Discussion with Eric @Discoposse Wright

sre-series My focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

ericewrightI was a guest on Eric “@discoposse” Wright of the Green Circle Community #42 Podcast (my previous appearance).

LISTEN NOW: Podcast #42 (transcript)

In this action-packed 30 minute conversation, we discuss the industry forces putting pressure on operations teams.  These pressures require operators to be investing much more heavily on reusable automation.

That leads us towards why Kubernetes is interesting and what went wrong with OpenStack (I actually use the phrase “dumpster fire”).  We ultimately talk about how those lessons embedded in Digital Rebar architecture.

Apparently IT death smells like kickstart files. Six Reasons why.

Today, I’m sharing a parable about always being focused on adding value.

Recently, I was on a call with an IT Ops manager who insisted that his team had their on-premises operations under control with “python scripts and manual kickstart files” because they “really don’t change their infrastructure setup.” He explained that he and his team was comfortable with this because it was something they understood and did not require learning new systems. While I understand his position, I was sort of sad for him and his employer because…

No value is created for his company by maintaining custom kickstart, preseeds or boot files.

Maintaining kickstarts is fatal for many reasons. Is there a way to make it less fatal? Yes, and it involves investing in learning tools that let you move up stack.

Contrary to popular IT mythology, managing physical infrastructure is still a reality for many IT teams and will remain a part of best practices until every workload simply runs on Amazon and it becomes their problem.  Since that “Utopian” future is unlikely, let’s deal with some practical realities of hybrid IT.

Here are my six reasons why custom kickstarts (and other site-specific boot provisioning scripts) are dangerous:

1. Creating Site Unique Processes

Every infrastructure is unique and that’s a practical reality that we have to accept because otherwise we would never be able to make improvements and corrects without touching everything that already deployed. However, we really want to work hard to minimize places where we inject variation into the environment. That means that server and site specific kickstarts with lots of post-provisioning steps forces operators to maintain additional information about each server.

2. Building Server Specific Configurations

When we create server specific templates, it becomes nearly impossible to recreate server builds. That directly leads to fragile infrastructure because teams cannot quickly redeploy or automate refreshes. Static IT infrastructure is a known fail pattern and makes enterprises vulnerable to staff changes, hacking and inability to manage and patch.

3. Having Opaque Configurations

Kickstart is hard to understand (and even harder to troubleshoot). When teams take actions during the provisioning process they are often not tracked or managed like other operational scripting tools. Failures or injections can easily go undetected. Even if they are tracked, the number of operators who can read and manage these scripts is limited. That means that critical aspects of your operational environment happen outside of your awareness.

4. Being Less Secure

Kickstart processes generally include injecting SSH keys, certificates and other authentication credentials. These embedded credentials are often hard coded into the process with minimal awareness of the operational team leaving you vulnerable at the most foundational level. This is not an acceptable security process; however, teams who hack kickstarts often don’t want to consider the implications.

Security side note: most teams don’t have the expertise to integrate TPM or HSM into their kickstart processes; consequently, these key security technologies are generally unused and ignored. If you want to talk about this, please contact me!

5. Diverging Provisioning Patterns

Cloud does not use kickstarts. Provisioning variation increases when teams keep/add logic and configuration into server provisioning instead of doing it as post-provision automation. If your physical provisioning team is not rehearsing on cloud then you’re in a serious IT hole because all workloads should be managed as hybrid-ready. Deployment fidelity helps accelerate teams and reduces cost.

6. Reusing Community Practice

Finally, managing your own kickstarts makes it impossible to leverage community patterns and practices. Kickstarts are not exactly a hive of innovation so you are not creating any competitive advantage by adding variation there. In cases like that, reusing community tooling is a net benefit to your organization. Why have we not done this already? Until recently, provisioning tools were not API driven or focused on reusable shared practice.

While Kickstart or similar is pretty much required for physical, we have a solution for these issues.

One of the key design elements of Digital Rebar is an templated, API driven boot provisioner. Our approach uses kickstarts, preseeds and other tools; however, we’ve worked hard to minimize their span and decompose them into reusable components. That allows users to inject site specific code as snippets that are centrally managed and hardware neutral.

Critically, our approach allows SRE and Ops teams to get out of the kickstart business and focus on provisioning workflow and automation. Yes, there’s some learning curve but there are a lot of benefits to moving up stack.

It’s not too late to “:q!” those kickstart edits and accelerate your infrastructure.