This post is part of an SRE series grounded in ideas from the Google SRE book.
2/13 Update: You can hear an INTERACTIVE DISCUSSION based on this post with Eric Wright on his podcast, GC Online.
Every Ops team I know is underwater and doesn’t have the time to catch their breath.
Why does the load increase and leave Ops behind? It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors. Many teams simply step around their struggling compatriots and spin up yet more Ops work. Dashing off yet another Ansible playbook to install on AWS is empowering, but ultimately it adds to the backlog that Ops has to sustain.
That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.
It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.
It’s unbounded because we keep rapidly evolving new platforms, and those platforms are built on rapidly evolving components. For example, Kubernetes has a three-month release cycle. That’s really fast; however, it is built on other components like Docker, SDN and operating systems that also have fast release cycles. That means that even a single Kubernetes infrastructure has many moving parts that may not be consistent within your own organization. For example, cloud deploys may use CoreOS while internal ones use a corporate-approved CentOS.
And the problem will get worse because infrastructure is cheap and developer productivity is improving.
Since then, we’ve seen a container-fueled explosion in developer productivity and an AI-driven rise in new hardware-flavored instances. Both are powerful drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).
That’s why the Google SRE teams cap Ops work at 50% of their time and spend the rest on automation.
If Ops work takes more than 50% of the team’s time, the team slowly sinks under growing operational load. If you are not actively decreasing the load through automation, your teams end up underwater and basic ops hygiene fails.
This is not optional – if you are behind now then it will just get worse!
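To make the 50% rule concrete, here is a minimal sketch of how a team might track it, assuming you log hours as either Ops work or automation work. The function names, the hour figures and the idea of tracking by logged hours are my illustration only, not something taken from the SRE book.

```python
# Hypothetical helpers for tracking the Ops vs automation split.
# Names and the hours-based approach are assumptions for illustration.

def ops_ratio(ops_hours: float, automation_hours: float) -> float:
    """Return the share of total logged time spent on Ops work (0.0 to 1.0)."""
    total = ops_hours + automation_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return ops_hours / total


def is_sinking(ops_hours: float, automation_hours: float, cap: float = 0.5) -> bool:
    """True when Ops work exceeds the cap (50% by default)."""
    return ops_ratio(ops_hours, automation_hours) > cap


if __name__ == "__main__":
    # Made-up example week: 30 hours of Ops work, 10 hours of automation.
    print(ops_ratio(30, 10))   # 0.75
    print(is_sinking(30, 10))  # True
```

A team sitting exactly at the cap (say 20 hours of each) is not flagged here; the check only trips once Ops work goes over 50%.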
The escape from the cycle is to get help. Stop writing automation that you can buy or re-use. Get help running it. Don’t waste time solving problems that other people have solved. That may mean some upfront learning and investment but if you aren’t getting out of your own way then you’ll be run over.