TL;DR: Operators (DevOps & SREs) have a hard job; we need to make time and room for them to redefine their jobs in a much more productive way.
Brian Gracely and Aaron Delp of The Cloudcast.net bring deep experience and perspective to their discussions, drawing on their impressive technology careers and understanding of the subject matter. Their podcasts go deep quickly, with substantial questions that get to the heart of the issue. This was my third time on the show (previous notes).
In episode 301, we dig into the meaning and challenges of Site Reliability Engineering (SRE) functions. We also cover some popular technologies that are of general interest.
Author’s Note: For further information about SREs, listen to my discussion about “SRE vs DevOps vs Cloud Native” on the Datanauts podcast #89. (transcript pending)
Here are my notes from Cloudcast 301, with bold added for emphasis:
- 2:00 Rob defines SRE (more resources on RackN.com site).
- 2:30 Google’s SRE book gave a name to, and even changed the definition of, what I’ve been doing my whole career. The name evolved from being just about sites to a full system perspective.
- 3:30 SRE and DevOps are aligned at the core. While DevOps is about process and culture, SRE is more about the function and “factory.”
- 4:30 Developers don’t want to be shoving coal into the engine, but someone (the SREs) has to make sure that everything keeps running.
- 5:15 Brian asks about impedance mismatch between Dev and Ops. How do we fix that?
- 6:30 Rob talks about the brewing crisis of an operations innovation gap (link). Digital Rebar is designed to create site-to-site automation so Operators can share repeatable best practices.
- 7:30 OpenStack ran aground for Operators because we never created practices that could be repeated. “Managed service as the required pattern is a failure of building good operational software.”
- 8:00 RackN decomposes operations into isolated units so that individual changes don’t break the software on top.
- 9:20 Brian talks about how the increasing rate of releases means that operations doesn’t have the skills to keep up with patching.
- 10:10 That’s “underlay automation,” and it is even scarier because software is composed of all sorts of parts that have their own release cycles, which are not synchronized.
- 11:30 We need to get system-level patch and security update hygiene to be automatic.
- 12:20 This is really hard!
- 13:00 Brian asks: what are the baby steps?
- 13:20 We have to find baby steps where there are nice clean boundaries at every layer, starting from the most basic. For RackN, that’s DHCP and PXE and then up to Kubernetes.
- 15:15 Rob rants that renaming Ops teams as SRE is a failure because SRE has objectives like job equity that need to be included.
- 16:00 Org silos get in the way of automation; they have antibodies that make it difficult for SREs and DevOps to succeed.
- 17:10 Those people have to be empowered to make change.
- 17:40 The existing tools must be pluggable or you are hurting operators. There’s really no true greenfield, so we help people by making things work in existing data centers.
- 19:00 Scripts may have technical debt, but that does not mean they should just be discarded.
- 19:20 New and shiny does not equal better. For example, Container Linux (aka CoreOS) does not solve all problems.
- 20:10 We need to do better creating bridges between existing and new.
- 20:40 How do we make Day 2 compelling?
- 21:15 Brian asks about running OpenStack on Kubernetes.
- 22:00 Rob is a fan of Kubernetes on Metal, but really, we don’t want metal and VMs to be different. That means that Kubernetes can be a universal underlay, which is threatening to OpenStack.
- 23:00 This is no longer a JOKE: “Joint OpenStack Kubernetes Environments”
- 23:30 Running things on Kubernetes (or OpenStack) is great because the abstractions hide complexity of infrastructure; however, at the physical layer you need something that exposes that complexity (which is what RackN does).
- 25:00 Brian asks at what point you need to get past the easy abstractions.
- 25:30 You want to never care ever. But sometimes you need the information for special cases.
- 26:20 We don’t want to make the core APIs complex just to handle the special cases.
- 27:00 There’s still a class of people who need to care about hardware. These needs should not be embedded into the Kubernetes (or OpenStack) API.
- 28:00 Brian summarizes that we should not turn 1% use cases into complexity for everyone. We need to foster the skill of coding for operators.
- 28:45 For SREs, moving Operators into coding & automation is essential. That’s a key point in the 50% programming statement for SREs.
- In the closing, Rob suggested checking out Digital Rebar Provision as a Cobbler replacement.
We’re very invested in talking about SRE and want to hear from you! How is your company transforming operations work to make it more sustainable, robust and human? We want to hear your stories and questions.