Spiraling Ops Debt & the SRE coding imperative

This post is part of an SRE series grounded in ideas from the Google SRE book.

2/13 Update: You can hear an interactive discussion based on this post with Eric Wright on his podcast, GC Online.

Every Ops team I know is underwater and doesn’t have time to catch its breath.

Why does the load keep increasing and leaving Ops behind?  It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors.  Many teams simply step around their struggling compatriots and spin up yet more Ops work.  Dashing off yet another Ansible playbook to install on AWS feels empowering, but it ultimately adds to the Ops sustaining backlog.


Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms, and those platforms are built on rapidly evolving components.  For example, Kubernetes has a three-month release cycle.  That’s really fast; however, it’s built on other components, like Docker, SDN and the operating system, that also have fast release cycles.  That means that even a single Kubernetes infrastructure has many moving parts that may not be consistent across your own organization.  For example, cloud deploys may use CoreOS while internal ones use a corporate-approved CentOS.
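Even a quick inventory of one node makes the point.  Here’s a minimal sketch that just prints the independently moving versions; it assumes the usual binaries are installed and on the PATH, so adjust the list for your own stack:

```bash
#!/usr/bin/env bash
# Sketch: list the independently versioned pieces under a single "Kubernetes" node.
# Assumes these binaries are installed and on the PATH; adjust for your own stack.

echo "OS:      $(. /etc/os-release && echo "$NAME $VERSION_ID")"
echo "Kernel:  $(uname -r)"
echo "Docker:  $(docker --version 2>/dev/null || echo 'not installed')"
echo "kubelet: $(kubelet --version 2>/dev/null || echo 'not installed')"
echo "etcd:    $({ etcd --version 2>/dev/null || echo 'not installed'; } | head -n1)"
```

Each of those lines can move on its own schedule, which is exactly the consistency problem.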

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen a container-fueled explosion in developer productivity and an AI-driven rise in new hardware-flavored instances.  Both are powerful drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why Google SRE teams cap operational work at 50% and reserve the rest of their time for automation and engineering.

If operational work climbs above 50%, the team slowly sinks under the growing load.  If you are not actively reducing the load via automation, your teams go underwater and basic ops hygiene fails.
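The arithmetic behind the rule is simple.  As a back-of-the-envelope sketch (the hours below are hypothetical placeholders for however you actually track ticket, interrupt and manual-ops time):

```bash
#!/usr/bin/env bash
# Back-of-the-envelope toil check; inputs are hypothetical placeholders.

TOIL_HOURS=${1:-150}    # hours spent on tickets, interrupts and manual ops this week
TOTAL_HOURS=${2:-240}   # total team capacity (e.g., 6 engineers x 40 hours)

RATIO=$(( 100 * TOIL_HOURS / TOTAL_HOURS ))
echo "Toil: ${RATIO}% of capacity"
if (( RATIO > 50 )); then
  echo "Over the 50% line: automation work is being starved and the load will compound."
fi
```

With the example numbers, toil is 62% of capacity, so the automation backlog never shrinks.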

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help.  Stop writing automation that you can buy or re-use.  Get help running it.  Don’t waste time solving problems that other people have solved.  That may mean some upfront learning and investment, but if you aren’t getting out of your own way, you’ll be run over.

 

Surgical Ansible & Script Injections before, during, or after deployment

I’ve been posting about the unique composable operations approach the RackN team has taken with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling.  The orchestration design (what we call annealing) allows us to dynamically add roles to the environment and execute them as single role/node interactions in operational chains.

With our latest patches (short demo videos below), you can now create single-role Ansible or Bash scripts dynamically and then incorporate them into the node execution.

That makes it very easy to extend an existing deployment on the fly for quick changes or as part of a development process.
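To give a sense of scale, a single-role script can be tiny.  The example below is hypothetical (the banner task and path are only illustrative), and the step where Digital Rebar binds it to a node is not shown:

```bash
#!/usr/bin/env bash
# Hypothetical single-role script: set a message-of-the-day banner.
# Kept small and idempotent so it can be injected into a node's run
# without touching anything else; how it is registered with Digital
# Rebar is not shown here.
set -euo pipefail

BANNER="Managed node: changes made outside the provisioning system may be overwritten."

if ! grep -qF "$BANNER" /etc/motd 2>/dev/null; then
  echo "$BANNER" >> /etc/motd
fi
```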

You can also run an ad hoc bash script against one machine or a group of machines.  If that script is something unique to your environment, you can manage it without having to push it back upstream, because Digital Rebar workloads are composable and designed to be safely integrated from multiple sources.
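An ad hoc script in this model is just a plain shell script, like the hypothetical health check below; selecting which machines it runs against is handled by Digital Rebar and is not shown here:

```bash
#!/usr/bin/env bash
# Hypothetical ad hoc check: report disk usage and pending reboots.
# Environment-specific details are illustrative (the reboot flag shown
# follows the Debian/Ubuntu convention).
set -euo pipefail

echo "== $(hostname) =="
df -h / | awk 'NR==2 {print "root fs: " $5 " used"}'
if [ -f /var/run/reboot-required ]; then
  echo "reboot required"
fi
uptime -p
```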

Beyond tweaking running systems, this is the fastest script development workflow that I’ve ever seen.  I can make fast, surgical, iterative changes to my scripts without having to rerun whole playbooks or runlists.  Even better, I can build multiple operating system environments side by side and test changes in parallel.

For secure environments, I don’t have to hand out user SSH access to systems because the actions run in the Digital Rebar context.  Digital Rebar can limit control per user or tenant.

I’m very excited about how this capability can be used for dev, test and production systems.  Check it out and let me know what you think.


Composability is Critical in DevOps: let’s break the monoliths

This post was inspired by my DevOps.com Git for DevOps post and is an evolution of my “Functional Ops (the cake is a lie)” talks.

2016 is the year we break down the monoliths.  We’ve spent a lot of time talking about monolithic applications and microservices; however, there’s an equally deep challenge in ops automation.

Anti-monolith composability means making our automation into function blocks that can be chained together by orchestration.
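As a minimal sketch of that idea (the commands, URLs and script names below are hypothetical stand-ins, not any real installer), each block does one job and exposes what the next block needs, so an orchestrator can chain them, swap them, and upgrade them independently:

```bash
#!/usr/bin/env bash
# Illustrative composable function blocks; names, URLs and commands are
# hypothetical stand-ins, not a real installer.
set -euo pipefail

install_etcd() {
  mkdir -p /opt/etcd
  curl -sL https://example.com/etcd.tar.gz | tar xz -C /opt/etcd
  /opt/etcd/etcd >/var/log/etcd.log 2>&1 &
  echo "http://127.0.0.1:2379"            # expose the endpoint for the next block
}

install_kubernetes() {
  local etcd_url=$1                        # consume the previous block's output
  mkdir -p /opt/k8s
  curl -sL https://example.com/kubernetes.tar.gz | tar xz -C /opt/k8s
  /opt/k8s/setup.sh --etcd "$etcd_url"
}

# The orchestrator chains the blocks; either one can be swapped or
# upgraded (say, for a secure or clustered etcd) without editing the other.
ETCD_URL=$(install_etcd)
install_kubernetes "$ETCD_URL"
```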

What is going wrong?  We’re building fragile, tightly coupled automation.

Most of the automation scripts that I’ve worked with become very long, interconnected sequences that reach well beyond the actual application they are trying to install.  For example, Kubernetes needs etcd as a datastore.  The current model is to include the etcd install in the Kubernetes install script.  The same is true for SDN install/configuration and for post-install tests and dashboard UIs.  The simple “install Kubernetes” quickly explodes into a kitchen sink of related adjacent components.

Those installs quickly become fragile and bloated.  Even worse, they have hidden dependencies.  What happens when etcd changes?  Now we’ve got to track down all the references to it buried in etcd-based applications.  Further, we don’t get the benefits of etcd deployment improvements, like secure or scaled configurations.
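In script form, the anti-pattern looks something like this deliberately simplified, hypothetical sketch: the etcd setup is inlined, so every etcd change means hunting through this script and every other script that copied the same block:

```bash
#!/usr/bin/env bash
# Hypothetical monolithic install: the etcd setup is buried inside the
# "install Kubernetes" script, creating a hidden dependency that has to
# be found and edited by hand whenever etcd changes.
set -euo pipefail

mkdir -p /opt/etcd /opt/k8s
curl -sL https://example.com/etcd.tar.gz | tar xz -C /opt/etcd      # inlined dependency
/opt/etcd/etcd >/var/log/etcd.log 2>&1 &
curl -sL https://example.com/kubernetes.tar.gz | tar xz -C /opt/k8s
/opt/k8s/setup.sh --etcd "http://127.0.0.1:2379"                    # hard-coded wiring
# ...followed by SDN configuration, post-install tests, dashboards, and so on.
```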

What can we do about it?  Resist the urge to create vertical silos.

It’s tempting and fast to create automation that works in a very prescriptive way for a single platform, operating system and tool chain.  The work of creating abstractions between configuration steps seems like a lot of overhead.  Yet even if you create those boundaries or reuse upstream automation, you’re still likely to be vulnerable to changes within each component.  All of these concerns drive operators to walk away from working collaboratively with each other and with developers.

Giving up on collaborative Ops hurts us all and makes it impossible to engineer excellent operational tools.  

Don’t give up!  Like git for development, we can do this together.

OpenCrowbar 2.1 Released Last Week with new integrations and support

Crowbar 2.1 release brings commercial support, hardware configs, Chef and SaltStack

Last week, the Crowbar community completed the OpenCrowbar “Broom” release and officially designated it as v2.1.  This release represents eight months of hardening of the core orchestration engine (including automated testing), the addition of true hardware support (in the optional hardware workload) and preliminary advanced integration with Chef and SaltStack.

Core Features:

  • RAID – Automatically set RAID configuration parameters depending on how the system will be used.
    • Support for LSI controllers
    • Single and Dual RAID configuration
  • BIOS – Automatically set BIOS settings depending on how the system will be used.
    • Configuration setting for Dell PE series systems
  • Out-of-Band Support – Configure and manage systems via their OOB interface
    • Support for IPMI and WSMan
  • RPM Installation (it riseth again!) – Install OpenCrowbar via a standard RPM instead of a Docker container

Integrations:

  • SaltStack integration – OpenCrowbar can install SaltStack as a configuration tool to take over after “Ready State”
  • Chef Provisioning (was Chef Metal) – OpenCrowbar driver allows Chef to build clusters on bare metal using the Crowbar API.

Infrastructure:

  • Automated smoke test and code coverage analysis for all pull requests.

And…v2.1 is the first release with commercial support!

RackN (rackn.com) offers consulting and support for the OpenCrowbar v2.1 release.  The company was started by Crowbar founders Greg Althaus, Scott Jensen, Dan Choquette, and myself specifically to productize and extend Crowbar.

Want to try it out?