Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)

What is a Google SRE?  Charity Majors gave a great overview on Datanauts #65, Susan Fowler from Uber talks about “no ops” tensions and Patrick Hill from Atlassian wrote up a good review too.  This is not new: Ben Treynor defined it back in 2014.

DevOps is under attack.

Well, not DevOps exactly, but the common misconception that DevOps is about Developers doing Ops (it’s really about lean process, system thinking, and positive culture).  It turns out that Ops is hard and, as I recently discussed with John Furrier, developers really, really don’t want to be that focused on infrastructure.

In fact, I see containers and serverless as a “developers won’t waste time on ops revolt.”  (I discuss this more in my 2016 retrospective).

The tension between Ops and Dev goes way back and has been a source of confusion for me and my RackN co-founders.  We believe we are developers, except that we spend all our time writing code for operations.  With the rise of Site Reliability Engineers (SRE) as a job classification, our type of black swan engineer is being embraced as a critical skill.  It’s recognized as the only way to stay ahead of our ravenous appetite for computing infrastructure.

I’ve been writing about Site Reliability Engineering (SRE) tasks for nearly 5 years under a lot of different names, such as DevOps, Ready State, Open Operations and Underlay Operations. SRE is a term popularized by Google (there’s a book!) for the operators who build and automate their infrastructure. Their role is not administration; it is redefining how infrastructure is used and managed within Google.

Using infrastructure effectively is a competitive advantage for Google and their SREs carry tremendous authority and respect for executing on that mission.

Meanwhile, we’re in the midst of an Enterprise revolt against running infrastructure. Companies, for very good reasons, are shutting down internal IT efforts in favor of using outsourced infrastructure. Operations has simply not been able to compete with the capability, flexibility and breadth of infrastructure services offered by Amazon.

SRE is about operational excellence and keeping up with the increasingly rapid pace of IT.  It’s a recognition that we cannot scale people as quickly as we add infrastructure.  And, critically, it is not infrastructure specific.

Over the next year, I’ll continue to dig deeply into the skills, tools and processes around operations.  I think that SRE may be the right banner for these thoughts and I’d like to hear your thoughts about that.

MORE?  Here’s the next post in the series about Spiraling Ops Debt.  Or Skip to Podcasts with Eric Wright and Stephen Spector.

