I’m investing in these Site Reliability Engineering (SRE) discussions because I believe operations (and by extension DevOps) is facing a significant challenge in keeping up with development tooling. The links below have been getting a lot of interest on Twitter and driving some good discussion.
Addressing this Ops debt is our primary mission at my company, RackN: we believe that integrated, system-level tooling is required. We also believe that new tools should not disrupt existing environments, so we work very hard to adapt to the requirements of individual sites.
SRE is urgent because it provides a pragmatic path and rationale for investment.
Even if you don’t agree with Google’s term or all their practices, I think the fundamental concepts of systems thinking, status and pay, automation investment and developer collaboration are essential. It should come as no surprise that these are all Lean/DevOps concepts; however, SRE has the pragmatic advantage of being a job function.
Here are some recent relevant discussions I’ve been having about SREs with links to both the audio and my text show notes.
- Cloud Cast about SRE concepts and decomposing Ops
- Datanauts deep dive about SRE based on the “DevOps vs SRE” talk from DevOpsDays Austin (original post)
- Charity Majors and I debate the SRE name and pay equity for Ops
- Further Reading Podcasts
Of course, RackN is also doing a WEEKLY SRE update that captures general interest items. Check that out and subscribe.