Spiraling Ops Debt & the SRE coding imperative

This post is part of an SRE series inspired by ideas from the Google SRE book.

2/13 Update: You can hear an interactive discussion based on this post with Eric Wright on his podcast, GC Online.

Every Ops team I know is underwater and doesn’t have time to catch its breath.

Why does the load increase and leave Ops behind? It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors. Many teams simply step around their struggling compatriots and spin up yet more Ops work. Dashing off yet another Ansible playbook to install on AWS is empowering, but it ultimately adds to the backlog of things Ops has to sustain.


Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms, and those platforms are built on rapidly evolving components. For example, Kubernetes has a 3 month release cycle. That’s really fast; however, it is built on other components like Docker, SDN and operating systems that also have fast release cycles. That means that even a single Kubernetes infrastructure has many moving parts, and those parts may not be consistent within your own organization. For example, cloud deploys may use CoreOS while internal ones use a corporate-approved CentOS.
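To make the variation problem concrete, here is a minimal sketch that counts how many distinct stacks a single platform can turn into. The component names come from this post, but the version labels and the counts are made up for illustration, and `count_variations` and `list_variations` are hypothetical helpers, not part of any tool.

```python
from itertools import product

# Hypothetical inventory: component names are from the post, the version
# labels are placeholders invented for this example.
stack = {
    "kubernetes": ["release-A", "release-B"],
    "docker": ["release-A", "release-B", "release-C"],
    "sdn": ["option-A", "option-B"],
    "os": ["CoreOS", "CentOS"],
}


def count_variations(components):
    """Multiply the number of choices per component."""
    total = 1
    for versions in components.values():
        total *= len(versions)
    return total


def list_variations(components):
    """Enumerate every combination as a dict of component -> version."""
    names = list(components)
    combos = product(*(components[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


print(count_variations(stack))  # 2 * 3 * 2 * 2 = 24
```

Even with only a handful of choices per layer, the count multiplies quickly, and every new release of any one component multiplies it again.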

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen a container-fueled explosion in developer productivity and an AI-driven rise in new hardware-flavored instances. Both are powerful drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why Google SRE teams cap Ops work at 50% of their time and keep the rest for automation.

If Ops work takes more than 50% of the team’s time, then the team slowly sinks under growing operational load. If you are not actively decreasing the load via automation, then your teams end up underwater and basic ops hygiene fails.
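If you want to check where your own team stands, the arithmetic is simple. The sketch below assumes you can split a team’s logged hours into Ops work and automation work; the function names, the hour figures and the way the hours are collected are all assumptions for illustration, not something the SRE book prescribes.

```python
OPS_CAP = 0.5  # the 50% ceiling on Ops work discussed above


def ops_share(ops_hours, automation_hours):
    """Fraction of the team's logged time that went to Ops work."""
    total = ops_hours + automation_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return ops_hours / total


def is_sinking(ops_hours, automation_hours, cap=OPS_CAP):
    """True when Ops work exceeds the cap, i.e. the backlog is winning."""
    return ops_share(ops_hours, automation_hours) > cap


# Made-up example weeks for a hypothetical team.
print(ops_share(30, 10), is_sinking(30, 10))  # 0.75 True
print(ops_share(20, 20), is_sinking(20, 20))  # 0.5 False
```

A team sitting exactly at the cap is not sinking yet, but it has no slack left to reduce next month’s load.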

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help. Stop writing automation that you can buy or reuse. Get help running it. Don’t waste time solving problems that other people have already solved. That may mean some upfront learning and investment, but if you aren’t getting out of your own way then you’ll be run over.

 
