Datanauts #89 dives deep on SRE approach and urgency

TL;DR: SRE makes Ops more Dev-like in critical ways, such as status equity and tooling approaches.

In Datanauts 089, Chris Wahl and Ethan Banks help me break down the concepts from my “DevOps vs SRE vs Cloud Native” presentation from DevOpsDays Austin last spring. They do a great job exploring the tough topics and concepts from the presentation. It’s almost like an extended Q&A, so you may want to review the slides or recording before diving into the podcast.

Advanced Reading: my follow-up discussion on SRE with the Cloudcast team and my previous Datanauts podcast.

Here are my notes from the podcast:

  • 01:00 “Doing infrastructure in a way that the robots can take over”
  • 01:51 Video where Charity & Rob debated the SRE term
  • 02:00 History of the SRE term from Google vs Sys Ops – if the site was not up, money was not flowing.  SRE culture fixed pay equity and the career ladder: ops would have automation/dev time, and devs would be on the hook for errors
  • 03:00 Google took a systems approach with lots of time for automation and coding
  • 03:20 Finding a 10x improvement in ops.  Go buy the book
  • 04:00 SRE is a new definition of System Op
  • 04:10 The S in SRE could be “system” or a physical location (not a web site).
  • 05:00 We’re seeing SRE teams showing up in companies of every size.  Replacing DevOps teams (which is a good thing).  Rob is hoping that SRE is replacing DevOps as a job title.  
  • 06:10 Don’t fall for a title change from Sys Op to SRE without actually getting the pay and authority
  • 06:45 Ethan believes that SRE is transforming to have a broad set of responsibilities.  Is it just a new System Admin definition?
  • 07:30 Rob thinks that the SRE expectation is for a much higher level of automation.  There’s a big thinking shift.
  • 08:00 SREs are still operators.  You have to walk the walk to know how to run the system.  Not developers who are writing the platform.
  • 08:30 Chris asks about the Ops technical debt
  • 09:00 We need to make Ops tooling “better enough” – we’re not solving this problem fast enough.  We have to do a better job – Rob talks about the WannaCry event.
  • 10:30 Chris asks how to fix this since complexity is increasing.  Rob plugs Digital Rebar as a way to solve this.
  • 11:00 People are excited about Digital Rebar but don’t have the time to fix the problem.  They are running from crisis to crisis so we never get to automation that actually improves things.
  • 12:00 At best, Ops is invisible.  SRE is different because it includes CI/CD with ongoing interactions.  There’s a lot coming with immutable operating systems and constantly term.
  • 13:00 The idea that a Linux system has been up for 10 years is an anti-pattern.  Rob would rather have people say that none of their servers has been up for more than a week (because they are constantly refreshed)
  • 13:19 Chris & Ethan – SECTION 1 REVIEW
    • SRE is not new, it’s about moving into a proactive stance (automatically reacting)
    • The power is the buy-in so that Ops has ownership of the stack
  • 15:00 SRE vs DevOps vs Cloud Native – not in conflict, but we love to create opposition
  • 15:40 There is a difference, they are not interchangeable.  SRE is a job title, DevOps is a process and Cloud Native is an architecture.
  • 16:30 We need to resist the idea that Cloud Native is a “new shiny” that replaces DevOps. We don’t have to take things away.
  • 17:00 Lean is a process where we’re trying to shorten the flow from ideation to delivery.  Read The Goal and The Phoenix Project.
  • 18:00 Bottlenecks (where we’ve added work or delays) really break our pipelines.  
  • 19:00 Ethan adds the insight: If you don’t have small steps then you don’t really understand your process
  • 20:00 Platform as a Service is not really reducing complexity, we’re just hiding/abstracting it.  That moves the complexity.  We may hide it from developers but may be passing it to the operators.
  • 21:00 Chris asks if this can be mapped to legacy.  Rob agrees that it’s a legacy architectural choice that was made to reduce incremental risk.  Today, we’re trying to break our risk into smaller steps, which means we will have smaller but more frequent breaks.
  • 22:40 The way we deliver systems is changing to require a much faster pace of taking changes
  • 23:00 SREs are data driven so they can feed information back to devs.  They can’t (shouldn’t) walk away from running systems.  This is an investment requirement so we can create data.
  • 24:00 We let a lot of problems lurk below the surface that eventually surface as a critical issue.  Cannot let toothaches turn into abscesses.  SREs should watch systems over time.
  • 25:20 If you are running under performance in the cloud, then you are wasting money.
  • 26:00 Cloud Native, an architecture?  What is it?  It means a ton of things.  For this preso, Rob made it about 12 factor and API driven infrastructure.
  • 26:50 “If you are not worried about rising debt then we are in trouble.”  We need to root cause!  If not, problems snowball and operators are just running fire to fire.  We need to stop having operators be heroes / grenade divers because it’s an anti-pattern.  Predictable systems do not create a lot of interrupts or crises.  Operators should not be event driven.
  • 28:40 Chris & Ethan – SECTION 2 REVIEW
    • Chris: Being data driven combats complexity
    • Ethan: Breaking down processes into smaller units reduces risk.  
  • 30:00 Cloud First is not Cloud Only.  CNCF projects are not VM specific, they are about abstractions that help developers be more productive.  Ideally, the abstractions remove infrastructure because developers don’t want to do any infrastructure.  We should not care about which type of infrastructure we are using.
  • 31:30 The similarities between the concepts are in their common outcomes/values.  Cloud First wants to be infrastructure agnostic.
  • 32:30 Chris asks how important CI/CD should be.  Are these still important in non-Cloud environments?  Rob thinks that Cloud Native may “cloud wash” architectures that are really just as important in traditional infrastructure.
  • 34:00 Cloud Native was a defensive architecture because early cloud was not very good.  CI/CD pipelines would be considered best practices in regular manufacturing. 
  • 35:00 These ideas are really good manufacturing process applied back to IT.  Thankfully, there’s really nothing unexpected from repeatable production.
  • 36:30 Lesson: Pay Equity.  Traditionally operators are not paid as well as developers and that means that we’re giving them less respect.  HiPPO (highest paid person in organization) is a very real effect where you can create a respect gap.
  • 38:00 Lesson: Disrupt Less.  We love the idea of disruption, but disruptions are very expensive, and disproportionately so for operators.  A change for developers may be small but have a big impact on operators.  More disruptive changes actually slow down adoption because that slows down inertia.  SREs should be able to push back to insist on migration paths.
  • 40:00 Rob talks about how Redfish, while a good replacement for IPMI, will take a long time to get there.  There are pros and cons.
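
To make the 13:00 note concrete (servers that are constantly refreshed instead of running for years), here is a minimal Python sketch of a refresh-policy check. It was not discussed on the podcast; the one-week window, the server names, and the function name are all hypothetical, and a real check would pull uptimes from your own inventory or monitoring system.

```python
from datetime import timedelta

# Assumed policy for illustration: no server should run longer than a week.
MAX_UPTIME = timedelta(days=7)


def servers_overdue_for_refresh(uptimes, max_uptime=MAX_UPTIME):
    """Return the names of servers whose uptime exceeds the refresh window.

    `uptimes` maps a server name to a timedelta describing how long that
    server has been up. Names are returned sorted for stable output.
    """
    return sorted(name for name, up in uptimes.items() if up > max_uptime)


# Hypothetical fleet data; in practice this would come from monitoring.
fleet = {
    "web-01": timedelta(days=2),
    "db-01": timedelta(days=3650),  # the "up for 10 years" anti-pattern
    "cache-01": timedelta(days=7, hours=1),
}

overdue = servers_overdue_for_refresh(fleet)
print(overdue)
```

A report like this only flags the servers. The point of the note is that rebuilding them should be routine automation, not a special event.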

 

This entry was posted in DevOps, SRE by Rob H.

About Rob H

A Baltimore transplant to Austin, Rob thinks about ways of building scale infrastructure for the clouds using Agile processes. He sat on the OpenStack Foundation board for four years. He co-founded RackN to enable software that creates hyperscale converged infrastructure.
