Packet Pushers 333: Orchestration v Automation < YES, this is what we are doing!

I highly recommend catching Packet Pushers 333, “Automation & Orchestration In Networking,” by Drew Conry-Murray and guests Pete Lumbis and Michael Damkot.

While the discussion is all about NETWORK DevOps, they do a good job of explaining WHY the current state of system orchestration is so sad – in a word: heterogeneity. It’s not going away because the alternative is lock-in. They also do a good job of describing the difference between automation and orchestration; however, I think there’s a middle tier of resource “scheduling” that better describes OpenStack and Kubernetes.

Around the 5:00 mark in the podcast, they effectively describe the composable design of Digital Rebar and the rationale for the way we’ve abstracted interfaces for automation. If you guys really do want to cash in by consulting with it (around the 10:00 mark), just give me a call.

It’s great to hear acknowledgement of both the complexity and need for solving these problems.   Thanks for the great podcast Drew, Pete and Michael!

Oh… and I’ll also be presenting at Interop ITX. Hopefully, I’ll get a chance to talk 1:1 with Drew.

Need PXE? Try out this Cobbler Replacement

DR Provision

Operators & SREs – we need your feedback on an open DHCP/PXE technical preview that will amaze you and can be easily tested right from your laptop.

We wanted to make open, basic provisioning API-driven, secure, scalable and fast. So we carved out the Provision & DHCP services as a standalone unit from the larger open Digital Rebar project. While this Golang service lacks orchestration, it is a complete service that remains part of the Digital Rebar infrastructure and supports the discovery boot process, templating, security and the extensive image library (Linux, ESX, Windows, … ) from the main project.

TL;DR: FIVE MINUTES TO REPLACE COBBLER?  YES.

The project APIs and CLIs are complete for all provisioning functions, with good Swagger definitions and docs. After all, it’s third-generation capability from the Digital Rebar project. The integrated UX is still evolving.
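
To give a feel for the API-driven workflow, here’s a minimal sketch of talking to the service from Python. The port, paths, payload fields and credentials below are illustrative assumptions only; the Swagger definitions in the docs are the authoritative reference.

```python
# Minimal sketch of driving the provisioning API from Python.
# Endpoint, paths, payload fields and credentials are assumptions for
# illustration; check the project's Swagger docs for the real definitions.
import requests

API = "https://localhost:8092/api/v3"     # assumed service endpoint
AUTH = ("rocketskates", "r0cketsk8ts")     # assumed default credentials

# List the machines the service already knows about.
machines = requests.get(f"{API}/machines", auth=AUTH, verify=False).json()
for machine in machines:
    print(machine.get("Name"), machine.get("Address"))

# Register a machine and point it at a boot environment for PXE install.
payload = {
    "Name": "node-01.example.com",
    "BootEnv": "ubuntu-16.04-install",     # hypothetical bootenv name
}
resp = requests.post(f"{API}/machines", json=payload, auth=AUTH, verify=False)
resp.raise_for_status()
print("created machine:", resp.json().get("Uuid"))
```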

Here’s a video of the quick install process.

 

Here are some examples from the documentation:

(Screenshots from the documentation: core services and install/discovery views.)

Open Source Collaboration: The Power of No

TL;DR: The days of using open software passively from vendors are past, users need to have a voice and opinion about project governance. This post is a joint effort with Rob Hirschfeld, RackN, and Chris Ferris, IBM, based on their IBM Interconnect 2017 “Open Cloud Architecture: Think You Can Out-Innovate the Best of the Rest?” presentation.

It’s a common misconception that open source collaboration means saying YES to all ideas; however, the reality of successful projects is the opposite.

Permissive open source licenses drive a delicate balance for projects. On one hand, projects that adopt permissive licenses should be accepting of contributions to build community and user base. On the other, maintainers need to adopt a narrow focus to ensure project utility and simplicity. If the project’s maintainers are too permissive, the project bloats and wanders without a clear purpose. If they are too restrictive then the project fails to build community.

It is human nature to say yes to all collaborators, but that can frustrate core developers and users.

For that reason, stronger open source projects have a clear, focused, shared vision. Historically, that vision was enforced by a benevolent dictator for life (BDFL); however, recent large projects have used a consensus of project elders to make the task more sustainable. These roles serve a critical need: they say “no” to work that does not align with the project’s mission and vision. The challenge of defining that vision can be a big one, but without a clear vision, it’s impossible for the community to sustain growth because new contributors can dilute the utility of projects. [author’s note: This is especially true of celebrity projects like OpenStack or Kubernetes that attract “shared glory” contributors]

There is tremendous social and commercial pressure driving this vision vs. implementation balance.

The most critical one is the threat of “forking.” Forking is what happens when the code/collaborator base of a project splits into multiple factions and stops working together on a single deliverable. The result is incompatible products with a shared history. While small forks are required to support releases and foster development, diverging community forks can have unpredictable impacts on a project.

Forks are not always bad: they provide a control mechanism for communities.

The fundamental nature of open source projects that adopt a permissive license is what allows forks to become the primary governance tool. The nature of permissive licenses allows anyone to create a new line of development that’s different than the original line. Forks can allow special interests in a code base to focus on their needs. That could be new features or simply stabilization. Many times, a major release version of a project evolves into forks where both old and newer versions have independent communities because of deployment inertia. It can also allow new leadership or governance without having to directly displace an entrenched “owner”.

But forking is expensive because it makes it harder for communities to collaborate.

To us, the antidote for forking is not simply vision but a strong focus on interoperability. Interoperability (or interop) means ensuring that different implementations remain compatible for users. A simplified example would be having automation that works on one OpenStack cloud also work on all the others without modification. Strong interop creates an ecosystem for a project by making users confident that their downstream efforts will not be disrupted by implementation variance or version changes.

Good Interop relieves the pressure of forking.

Interop can only work when a project defines what is expected behavior and creates tests that enforce those standards. That activity forces project contributors to agree on project priorities and scope. Projects that refuse to define interop expectations end up disrupting their user and collaborator base in frustrating ways that lead to forking (Rob’s commentary on the potential Docker fork of 2016).
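
As a concrete illustration, an interop suite is just the same behavioral assertions run unchanged against every implementation the community claims to support. This sketch uses pytest against two hypothetical endpoints; the URLs and response shape are placeholders, not any project’s actual API.

```python
# Sketch of an interop test: identical assertions run against every
# implementation. Endpoints and response shape are hypothetical.
import pytest
import requests

ENDPOINTS = [
    "https://cloud-a.example.com/v2",
    "https://cloud-b.example.com/v2",
]

@pytest.mark.parametrize("endpoint", ENDPOINTS)
def test_image_listing_is_interoperable(endpoint):
    """Every conforming implementation must answer the same call the same way."""
    resp = requests.get(f"{endpoint}/images", timeout=10)
    assert resp.status_code == 200
    body = resp.json()
    # Interop means users can rely on the shape of the response,
    # not on one vendor's extensions.
    assert all("id" in image and "name" in image for image in body["images"])
```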

Unfortunately, interop is not generally a developer priority.

In the end, interoperability is a user feature that competes with other features. Sadly, it is often seen as hurting feature development because new features must work to maintain existing interop standards. For that reason, new contributors may see interop demands as an impediment to forward progress; however, it’s a strong driver for user adoption and growth.

The challenge is that those users are typically more focused on their own implementation and less visible to the project leadership. Vendors have similar disincentives to do work that benefits other vendors in the community. These tensions will undermine the health of communities that do not have strong BDFL or Elders leadership. So, who then provides the adult supervision?

Ultimately, users must demand interop and provide commercial preference for vendors that invest in interop.

Open source has definitely had an enormous impact on the software industry; generally, a change for the better. But that change comes at a cost: it requires involvement not just from vendors and individual developers but, ultimately, from consumers and users.

Interop isn’t naturally a vendor priority because it levels the playing field for all vendors; however, vendors do prioritize what their customers want.

Ideally, customer needs translate into new features that have a broad base of consumer interest. Interop ensures that those features can be used broadly. Thus interop is an attribute that consumers should expect not only from vendors but also from the open source communities building the software. This alignment then serves as the foundation upon which (increasingly) that vendor software is based.

Customers should be actively and publicly supportive of interop efforts of projects on which their vendor’s offerings depend. If there isn’t such an initiative in those projects, then they should demand one be started through their vendor partners and in the public forums for the project.

Further, if consumers of an open source project sense that it lacks a strong, focused vision and is wandering off course, they need to get involved and say so, either directly or through their vendor partners.

While open source has changed the IT industry, it also has a cost. The days of using software passively from vendors are past; users need to have a voice and an opinion. They need to ensure that their chosen vendors are also supporting the health of the community.

What do you think? Reach out to Rob (@zehicle) and Chris (@christo4ferris) and let us know!

Note: Cross posted on IBM OpenTech site.

Starting Weekly SRE Update Posts!

Everyone at RackN is excited to talk about Site Reliability Engineering and we’re spinning up several efforts, including an interview series about it (contact me if you want to talk!).

Our first deliverable is a weekly SRE industry round-up blog post (here’s our first one!). You can subscribe to the updates on the RackN site.

Let us know if you find news that we should include by posting comments there or tweeting to @RackNGo.

If you want to hear more of my regular opinionated stuff, never fear! I’ve got some fun “server cage match” material coming together around DevOpsDays Austin talking about how Cloud Native, DevOps and SRE align. You can subscribe to this blog too (button on the left); we generally don’t double-post material, so you’ll get fresh insights from both.

 

Accelerating Community Ops on Kubernetes in Hybrid Style

Preface: RackN is looking for SRE teams who are enthusiastic about accelerating Kubernetes on-premises in a long-term, operational way that can be shared and reused across the community.

We’re excited to see and be part of the community’s progress towards enterprise-ready Kubernetes operations on both cloud and on-premises infrastructure. The RackN team is part of multiple groups establishing patterns with shareable/reusable automation. I strongly recommend watching (or, better, collaborating in) these efforts if you are deploying Kubernetes, even at experimental scale.

We’ve worked hard to make shared community ops work accessible, repeatable and multi-platform without compromising scale or security.

The RackN team has been an enthusiastic supporter of Kubernetes since the 1.0 launch, with our first deployments going back to June 2015 and updates for 1.2, 1.3 and now 1.5. I’m excited to report that we’ve fully converged the composable Digital Rebar approach with the Kubernetes Kargo Ansible playbooks (our 1.2 efforts leveraged the Kargo predecessor, “Kubespray”). This integration brings together the parallel hybrid operation and node-by-node control of Digital Rebar with the Ansible community’s efforts around Kargo.

Composable design is a key element of the RackN focus on SRE automation because it allows ecosystem collaboration.

That allows a fully integrated deploy where Digital Rebar stages the environment and then uses Kargo directly from upstream to install Kubernetes. Post-deployment, Digital Rebar is able to extend the cluster with packages like Helm, Deis, Dashboard and others.

Since Digital Rebar supports parallel deployments, it’s possible to fully exercise the options enabled by Kargo simultaneously for development and testing. What are the benefits?

For example, you can build-test-destroy coordinated Kubernetes installs on CentOS, Red Hat and Ubuntu as part of an automation pipeline. Unlike client-side approaches like Terraform or Ansible, our infrastructure allows transparent monitoring of the deployments, including Slack integration.
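
Here’s a rough sketch of what that build-test-destroy loop could look like in a pipeline. The three helper functions are placeholders for the real staging, Kargo invocation and teardown steps that Digital Rebar would drive; none of this is the actual Digital Rebar interface.

```python
# Illustrative build-test-destroy loop for exercising Kubernetes installs
# across several operating systems in one automation pipeline run.
import concurrent.futures

OS_TARGETS = ["centos-7", "rhel-7", "ubuntu-16.04"]

def deploy_kubernetes(os_name: str) -> str:
    # Placeholder: stage machines on the chosen OS and run the installer.
    return f"cluster-{os_name}"

def validate_cluster(cluster: str) -> None:
    # Placeholder: run smoke or conformance tests against the new cluster.
    print(f"validating {cluster}")

def destroy_cluster(cluster: str) -> None:
    # Placeholder: tear the cluster down and release the infrastructure.
    print(f"destroying {cluster}")

def build_test_destroy(os_name: str) -> str:
    cluster = deploy_kubernetes(os_name)
    try:
        validate_cluster(cluster)
        return f"{os_name}: PASS"
    finally:
        destroy_cluster(cluster)   # always runs, even if validation fails

# Parallel deployments let every variant be exercised simultaneously.
with concurrent.futures.ThreadPoolExecutor() as pool:
    for result in pool.map(build_test_destroy, OS_TARGETS):
        print(result)
```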

Flexibility between users is also important because Ops variation is both a benefit and a cost.

A key Digital Rebar design goal is for users to explore useful variation and still share operational best practices. We are proving that shared community automation can support many different scenarios, including variation between clouds, physical infrastructure, operating systems, networking and container engines.

If we cannot manage this variation in a consistent way then we’re doomed to operational fragmentation (like OpenStack has endured).

We’re inviting you to check out our open work supporting the Kubernetes Ops community. As Rob Hirschfeld says, we’re looking for “Day 2” minded operators who want to make sure that we are always able to share Kubernetes best practices.

10x Faster Today but 10x Harder to Maintain Tomorrow: the Cul-De-Sac problem

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way that scales dramatically better. I’m not thinking about scale in number of nodes, but in operator efficiency. The primary way to create that efficiency is to limit site customization and to improve reuse. Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Recently, RackN asked the question “Shouldn’t we have Shared Automation for Commodity Infrastructure?” which talked about the fact that we, as an industry, keep writing custom automation for what should be commodity servers. This “snowflaking” happens because there’s enough variation at the data center system level that it’s very difficult to share and reuse automation on an ongoing basis.

Since variation enables innovation, we need to solve this problem without limiting diversity of choice.

 

Happily, platforms like Kubernetes are designed to hide these infrastructure variations from developers. That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms. Unfortunately, that does nothing for the platforms themselves or for infrastructure-bound applications. For this lower level software, we need to accept that operations environments are heterogeneous.

 

I realized that we’re looking at a multidimensional problem after watching communities like OpenStack struggle to evolve operations practice.

It’s multidimensional because we are building the operations practice simultaneously with the software itself. To make things even harder, the infrastructure and dependencies are also constantly changing. Since this degree of rapid multi-factor innovation is the new normal, we have to plan for our operations automation itself to be equally upgradable.

Unless we can upgrade both the software AND the related deployment automation, each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current.  These challenges limit the virtuous feedback cycles that help communities grow.  

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

This is a hard problem because each deployment is unique, and those differences make it hard to absorb community advances without being constantly broken. That is one of the reasons why companies opt out of the community and into vendor distributions. While vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system-level concerns. Unlike project-generated templates, this approach absorbs heterogeneity and brings in the external information that often complicates project deployment automation.
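
To make the abstraction idea concrete, here’s a small sketch of isolating site-specific variables behind a profile that shared automation consumes. The field names and values are illustrative only and are not the Digital Rebar data model.

```python
# Sketch: site-specific variables live in one isolated profile, and shared
# automation consumes that profile instead of being edited per deployment.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SiteProfile:
    """System-level concerns that legitimately vary between deployments."""
    operating_system: str
    dns_servers: List[str] = field(default_factory=list)
    ntp_servers: List[str] = field(default_factory=list)
    proxy_url: Optional[str] = None

def render_install_params(site: SiteProfile) -> dict:
    # Shared, community-maintained automation reads the profile; upgrading
    # that automation does not require rewriting per-site templates.
    return {
        "os": site.operating_system,
        "dns": ",".join(site.dns_servers),
        "ntp": ",".join(site.ntp_servers),
        "proxy": site.proxy_url or "",
    }

austin_lab = SiteProfile(
    operating_system="ubuntu-16.04",
    dns_servers=["10.0.0.2"],
    ntp_servers=["10.0.0.3"],
)
print(render_install_params(austin_lab))
```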

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

How scared do we need to be for Ops collaboration & investment?

Note: Yesterday, RackN posted Are you impatient enough to be an SRE? and then the CIA WikiLeaks news hit… perhaps the right question is “Are you scared enough to automate deeply yet?”

Yesterday’s CIA hacking release should be driving industry-wide discussions about how to make our IT infrastructure more robust and fluid. It is not enough simply to harden, because both the attacks and the platforms are evolving too quickly.

We must be delivering solutions with continuous delivery and immutability assumptions baked in.

A more fluid IT that assumes constant updates and rebuilding from sources (immutability) is not just a security posture but a proven business benefit. For me, that means actually building from the hardware up, where we patch and scrub systems regularly to shorten the half-life of all attack surfaces. It also means enabling the security features already built into our systems that are generally ignored because of configuration complexity. These are hard but solvable automation challenges.

The problem is too big to fix individually: we need to collaborate in the open.

I’ve been thinking deeply about how we accelerate SRE and DevOps collaboration across organizations and in open communities. The lack of common infrastructure foundations costs companies significant overhead and slows them down as teams across the globe reimplement automation in divergent ways. It also drags down software platforms that must adapt to each data center as a unique snowflake.

That’s why hybrid automation within AND between companies is an imperative. It enables collaboration.

Making automation portable and able to handle the differences between infrastructures and environments is harder; however, it also enables the sharing and reuse that allow us to improve collectively instead of individually.

That’s been a vision driving us at RackN with the open hybrid Digital Rebar project. Curious? Here’s the RackN post that inspired this one:

From RackN’s Are you impatient enough to be an SRE?

“Like the hardware that runs it, the foundation automation layer must be commoditized. That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.”

Are you impatient enough to be an SRE?

Our focus on the SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

SRE minded teams are very impatient about eliminating manual, routine and non-differentiated work.

I’ve been talking to a lot of people about SRE lately in the context of helping Ops get out of the way while coping with increasing load and complexity. Why are they so impatient? Because they know that ops demand is constantly increasing, and there’s no “good enough” when it comes to finding ways to automate tasks and move up the stack. Without consistent improvement in automation, teams will get buried (see my post about Ops Debt).

The core SRE mantra needs to be “Own Ops, don’t be owned by Ops.”

Yet, outsourcing ops responsibility to a service is equally problematic for an SRE.  They cannot give up responsibility for the integrated system.  In fact, that’s one of the basic reasons why Google’s SRE teams went from just “web site reliability” to full system thinking.  Every aspect of the infrastructure stack needs to be considered when looking at system performance and reliability.  For example, something deep like SSD drive write behavior or GPU BIOS could make a critical difference.  SREs need to be able to root cause issues and black box infrastructure (a.k.a. Cloud) can get in the way.

SRE teams must balance owning the full stack versus focusing on what makes their job unique.

That’s why we have been rethinking how SRE teams approach infrastructure. Instead of trying to turn infrastructure into black box services, we’ve designed the Digital Rebar composable Ops platform that embraces and contains heterogeneity with a high degree of transparency and control. This is critical because SREs cannot afford to keep reinventing automation at the bottom of the stack. We must be able to share and leverage best practices on infrastructure provisioning and platform deployment.

Like the hardware that runs it, the foundation automation layer must be commoditized.

That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.

That’s what we are building at RackN.  Our primary goal is to reuse automation whenever possible.  That was our top design priority for Digital Rebar and it drives our customer engagement models.  If you’d like to hear more, download our SRE white paper.


What does it take to Operate Open Platforms? Answers in Datanauts 72

Did I just let OpenStack ops off the hook…? Kubernetes production challenges…?

I had a lot of fun in this wide-ranging Datanauts discussion with unicorn herders Chris Wahl and Ethan Banks. I like the three-section format because it gives us a chance to deep dive into distinct topics and includes some out-of-band analysis by the hosts; however, that means you need to keep listening through the commercial breaks to hear the full podcast.

Three parts?  Yes, Chris and Ethan like to save the best questions for last.

In Part 1, we went deep into the industry operational and business challenges uncovered by the OpenStack project. Particularly, Chris and I go into “platform underlay” issues which I laid out in my “please stop the turtles” post. This was part of the build-up to my SRE series.

In Part 2, we explore my operations-focused view of the latest developments in container schedulers with a focus on Kubernetes. Part of the operational discussion goes into architecture “conceits” (or compromises) that allow developers to get the most from cloud native design patterns. I also make a pitch for using proven tools to run the underlay.

In Part 3, we go deep into DevOps automation topics of configuration and orchestration. We talk about the design principles that help drive “day 2” automation and why getting in-place upgrades should be an industry priority.  Of course, we do cover some Digital Rebar design too.

Take a listen and let me know what you think!

On Twitter, we’ve already started a discussion about how much developers should care about infrastructure. My opinion (posted here) is that the DevOps idea of developers “owning” infrastructure caused a partial rebellion towards containers.

SRE role with DevOps for Enterprise [@HPE podcast]


My focus on the SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

Yes, DevOps and SRE are complementary

In this short, 16-minute podcast, HPE’s Stephen Spector and I discuss how DevOps and SRE thinking overlap and where the differences lie. We also discuss how enterprises should evaluate Site Reliability Engineering as a function and where it fits in their organization.