Let’s DevOps IRL: my SRE postings on RackN!

I’m investing in these Site Reliability Engineering (SRE) discussions because I believe operations (and by extension DevOps) is facing a significant challenge in keeping up with development tooling.   The links below have been getting a lot of interest on twitter and driving some good discussion.

datanauts_logo_300

15967

Addressing this Ops debt is our primary mission at my company, RackN: we believe that integrated system level tooling is required.  We also believe that new tools should not disrupt environments so we work very hard to adapt to requirements of individual sites.

SRE is urgent because it provides a pragmatic path and rationale for investment.

Even if you don’t agree with Google’s term or all their practices, I think fundamental concepts of system thinking, status/pay, automation investment and developer collaboration are essential.  It should come as no surprise that these are all Lean/DevOps concepts; however, SRE has the pragmatic side of being a job function.

Here are some recent relevant discussions I’ve been having about SREs with links to both the audio and my text show notes.

Of course, RackN is also doing a WEEKLY SRE update that captures general interest items.  Check that out and subscribe.

What makes ops hard? SRE/DevOps challenge & imperative [from Cloudcast 301]

TL;DR: Operators (DevOps & SREs) have a hard job, we need to make time and room for them to redefine their jobs in a much more productive way.

Cloudcast-Logo-2015-Banner-BlueThe Cloudcast.net by Brian Gracely and Aaron Delp brings deep experience and perspective into their discussions based on their impressive technology careers and understanding of the subject matter.  Their podcasts go deep quickly with substantial questions that get to the heart of the issue.  This was my third time on the show (previous notes).

In episode 301, we go deeply into the meaning and challenges for Site Reliability Engineering (SRE) functions.  We also cover some popular technologies that are of general interest.

Author’s Note; For further information about SREs, listen to my discussion about “SRE vs DevOps vs Cloud Native” on the Datanauts podcast #89.  (transcript pending)

Here are my notes from Cloudcast 301. with bold added for emphasis:

  • 2:00 Rob defines SRE (more resources on RackN.com site).
    • 2:30 Google’s SRE book gave a name, even changed the definition, to what I’ve been doing my whole career. Evolved name from being just about sites to a full system perspective.  
    • 3:30 SRE and DevOps are aligned at the core.  While DevOps is about process and culture, SRE is more about the function and “factory.”
    • 4:30 Developers don’t want to be shoving coal into the engine, but someone, SREs, have to make sure that everything keeps running
  • 5:15 Brian asks about impedance mismatch between Dev and Ops.  How do we fix that?
    • 6:30 Rob talks about the crisis brewing for operations innovation gap (link).  Digital Rebar is designed to create site-to-site automation so Operators can share repeatable best practices.
    • 7:30 OpenStack ran aground because Operators because we never created a the practices that could be repeated.  “Managed service as the required pattern is a failure of building good operational software.”
    • 8:00 RackN decomposes operations into isolated units so that individual changes don’t break the software on top

  • 9:20 Brian talks about the increasing rate of releases means that operations doesn’t have the skills to keep up with patching.
    • 10:10 That’s “underlay automation” and even scarier because software is composited with all sorts of parts that have their own release cycles that are not synchronized.
    • 11:30 We need to get system level patch/security.update hygiene to be automatic
    • 12:20 This is really hard!

  • 13:00 Brian asks what are the baby steps?
    • 13:20 We have to find baby steps where there are nice clean boundaries at every layer from the very most basic.  For RackN, that’s DHCP and PXE and then upto Kubernetes.
    • 15:15 Rob rants that renaming Ops teams as SRE is a failure because SRE has objectives like job equity that need to be included.
    • 16:00 Org silos get in the way of automation that have antibodies that make it difficult for SREs and DevOps to succeed.
    • 17:10 Those people have to be empowered to make change
    • 17:40 The existing tools must be pluggable or you are hurting operators.  There’s really no true greenfield, so we help people by making things work in existing data centers.
    • 19:00 Scripts may have technical debt but that does not mean they should just be disposed.
    • 19:20 New and shiney does not equal better.  For example, Container Linux (aka CoreOS) does not solve all problems.  
    • 20:10 We need to do better creating bridges between existing and new.
    • 20:40 How do we make Day 2 compelling?

  • 21:15 Brian asks about running OpenStack on Kubernetes.
    • 22:00 Rob is a fan of Kubernetes on Metal, but really, we don’t want metal and vms to be different.  That means that Kubernetes can be a universal underlay which is threatening to OpenStack.
    • 23:00 This is no longer a JOKE: “Joint OpenStack Kubernetes Environments”
    • 23:30 Running things on Kubernetes (or OpenStack) is great because the abstractions hide complexity of infrastructure; however, at the physical layer you need something that exposes that complexity (which is what RackN does).

  • 25:00 Brian asks at what point do you need to get past the easy abstractions
    • 25:30 You want to never care ever.  But sometimes you need the information for special cases.
    • 26:20 We don’t want to make the core APIs complex just to handle the special cases.
    • 27:00 There’s still a class of people who need to care about hardware.  These needs should not be embedded into the Kubernetes (or OpenStack) API.

  • 28:00 Brian summarizes that we should not turn 1% use cases into complexity for everyone.  We need to foster the skill of coding for operators
    • 28:45 For SREs, turning Operators into coding & automation is essential.  That’s a key point in the 50% programming statement for SREs.
    • In the closing, Rob suggested checking out Digital Rebar Provision as a Cobbler replacement.

We’re very invested in talking about SRE and want to hear from you! How is your company transforming operations work to make it more sustainable, robust and human?We want to hear your stories and questions.

June 16 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rack ngo)

SRE Items of the Week

The Cloudcast #301 – SRE and Infrastructure Operations
http://www.thecloudcast.net/2017/06/the-cloudcast-301-sre-and.html

Description: Brian talks with Rob Hirschfeld (@zehicle, Founder/CEO of @RackN) about the concepts of SRE (Site Reliability Engineering), the challenges of maintaining infrastructure software, emerging tools and the next-generation of operations.

Show Notes:

  • Topic 1 – Welcome back to the show. Let’s start by talking about the concept of SRE (Site Reliability Engineering). Give us the basics and maybe explain how it differs from what people define in DevOps.
  • Topic 2 – Application development has been moving faster for quite a while (agile development, etc.). But now infrastructure/operations teams have to deal with faster software – especially around updates (e.g. Kubernetes releases every 3 months). How are companies managing this?
  • Topic 3 – Given that this pace of operations change may not slow down, how do you think about the challenge in terms of process/operations versus technology/tools?
  • Topic 4 – What are some of the steps that companies take to better prepare for this type of operational model? Tools, process, skills, etc.
  • Topic 5 – Do you see SRE as being a progression for existing infrastructure/operations people, or is this more focused on sysadmins or developers that want to get away from building applications?

_____________

DevOps Enterprise Summit London: Tales of courage and community
https://techbeacon.com/devops-enterprise-summit-london-tales-courage-community

After spending two amazing days with 700 of my closest DevOps cohorts from Europe, the Middle East, Africa, and beyond, I learned all about the latest and greatest IT and technology transformation reports at the DevOps Enterprise Summit London. With substantial growth in attendance from the first year, in 2016, the buzz around the show was palpable. And, what a location! From the venue, the QEII Centre, we had 360-degree views of central London, from Big Ben to the London Eye and beyond.

Read more from Steve Brodie, CEO of Electric Cloud @stbrodie
_____________

.IO! .IO! It’s off to a Service Mesh you should go [Gluecon 2017 notes]
http://bit.ly/2rjw4We  

Gluecon turned out to be all about a microservice concept called a “service mesh” which was being promoted by Buoyant with Linkerd and IBM/Google/Lyft with Istio.  This class of services is a natural evolution of the rush to microservices and something that I’ve written microservice technical architecture on TheNewStack about in the past. READ MORE
_____________

A few things I’ve learned about Kubernetes
https://jvns.ca/blog/2017/06/04/learning-about-kubernetes/

I’ve been learning about Kubernetes at work recently. I only started seriously thinking about it maybe 6 months ago – my partner Kamal has been excited about Kubernetes for a few years (him: “julia! you can run programs without worrying what computers they run on! it is so cool!“, me: “I don’t get it, how is that even possible”), but I understand it a lot better now.

This isn’t a comprehensive explanation or anything, it’s some things I learned along the way that have helped me understand what’s going on.

Read more from Julia Evans @b0rk
_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

  • 2017 New York Venture Summit – LINK

OTHER NEWSLETTERS

 

 

Cloudcast.net gem about Cluster Ops Gap

15967Podcast juxtaposition can be magical.  In this case, I heard back-to-back sessions with pragmatic for cluster operations and then how developers are rebelling against infrastructure.

Last week, I was listening to Brian Gracely’s “Automatic DevOps” discussion with  John Troyer (CEO at TechReckoning, a community for IT pros) followed by his confusingly titled “operators” talk with Brandon Phillips (CTO at CoreOS).

John’s mid-recording comments really resonated with me:

At 16 minutes: “IT is going to be the master of many environments… If you have an environment is hybrid & multi-cloud, then you still need to care about infrastructure… we are going to be living with that for at least 10 years.”

At 18 minutes: “We need a layer that is cloud-like, devops-like and agile-like that can still be deployed in multiple places.  This middle layer, Cluster Ops, is really important because it’s the layer between the infrastructure and the app.”

The conversation with Brandon felt very different where the goal was to package everything “operator” into Kubernetes semantics including Kubernetes running itself.  This inception approach to running the cluster is irresistible within the community because the goal of the community is to stop having to worry about infrastructure.  [Brian – call me if you want to a do podcast of the counter point to self-hosted].

Infrastructure is hard and complex.  There’s good reason to limit how many people have to deal with that, but someone still has to deal with it.

I’m a big fan of container workloads generally and Kubernetes specifically as a way to help isolate application developers from infrastructure; consequently, it’s not designed to handle the messy infrastructure requirements that make Cluster Ops a challenge.  This is a good thing because complexity explodes when platforms expose infrastructure details.

For Kubernetes and similar, I believe that injecting too much infrastructure mess undermines the simplicity of the platform.

There’s a different type of platform needed for infrastructure aware cluster operations where automation needs to address complexity via composability.  That’s what RackN is building with open Digital Rebar: a the hybrid management layer that can consistently automate around infrastructure variation.

If you want to work with us to create system focused, infrastructure agnostic automation then take a look at the work we’ve been doing on underlay and cluster operations.

 

Container Migration 101: Cloudcast.net & Lachlan Evenson

Last week, the CloudCast.net interviewed Lachlan Evenson (now at Deis!).  I highly recommend listening to the interview because he has a unique and deep experience with OpenStack, Kubernetes and container migration.

15967I had the good fortune of lunching with Lachie just before the interview aired.  We got compare notes about changes going on in the container space.  Some of those insights will end up in my OpenStack Barcelona talk “Will it Blend? The Joint OpenStack Kubernetes Environment.”

There’s no practical way to rehash our whole lunch discussion as a post; however, I can point you to some key points [with time stamps] in his interview that I found highly insightful:

  • [7:20] In their pre-containers cloud pass, they’d actually made it clunky for the developers and it hurt their devops attempts.
  • [17.30] Developers advocating for their own use and value is a key to acceptance.  A good story follows…
  • [29:50] We’d work with the app dev teams and if it didn’t fit then we did not try to make it fit.

Overall, I think Lachie does a good job reinforcing that containers create real value to development when there’s a fit between the need and the technology.

Also, thanks Brian and Aaron for keeping such a great podcast going!

 

 

Cloudcast interview with Dell Cloud Solutions Team (quotes with time stamps)

Lloyds BarbershopTheCloudcast.net – Thank you for such a great series of questions. Wow, nearly 36+ minutes of cloudicious interview about the work my team at Dell is doing!

Thanks to our hosts for putting together a great series! They are:

Highlights from Episode 16: Dell, Dude you’re getting a cloud

  • 3:40 JBG “we are listening to our customers tell us what they want to accomplish”
  • 4:40 RAH “humility is part of [what we’re] doing … cloud is about learning and collaboration”
  • 6:40 RAH “OpenStack filled a niche. It was the first open source community cloud. … Not just open source, its open community.”
  • 7:15 RAH “We’re beyond critical mass. We’re seeing acceleration… we are transitioning into a community development.”
  • 7:30 RAH “It’s accelerating. It happening so fast.”
  • 8:00 RAH “We felt it was really important for people to be able to use it. We felt that it was important to get away from just people developing into people using. “
  • 8:57 – RAH “Cloud is not just one thing. You have to have all the pieces.”
  • 10: 15 – RAH “Cloud is always ready, never finished”
  • 10:50 – RAH “OpenStack is an alternative to public cloud including hosting providers seeking to offer their own cloud”
  • 12:40 – AD “Dell has been in the Big data space for many years now”
  • 20:15 – JBG “There’s a legacy of great partnerships that we leverage”
  • 20:48 – JBG “Conflicts have not come up because we are focused on the customer”
  • 21:30 – RAH “Shout out to Greg Althaus for solving these problems in such an elegant way. And we rewrote it 3 times”
  • 22:02 – RAH “Crowbar started from our frustration of bringing up a cloud quickly … so we took a DevOps approach.”
  • 22:41 – RAH “You had to have a system view AND a boot strapping view simultaneously”
  • 23:50 – JBG “Crowbar was born out of necessity because we were setting up and blowing away our clouds over and over and over again. “
  • 24:40 – JBG “We realized there were not many people thinking about all the pieces before OpenStack was installed”
  • 25:20 – RAH “We don’t think customers have all the answers before we show up. This is not unique to OpenStack.”
  • 28:20 – JBG “We’re seeing the community pick up Crowbar as a way to deploy”