August 18 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on August 18, 2017 by Rob H

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content, please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo).

SRE Items of the Week

Beyond Google SRE: What is Site Reliability Engineering like at Medium?
https://blog.netsil.com/beyond-google-sre-what-is-site-reliability-engineering-like-at-medium-71c65bd35f4e

We had the opportunity to sit down with Nathaniel Felsen, DevOps Engineer at Medium and the author of “Effective DevOps with AWS”. We are happy to share some practical insights from Nathaniel’s extensive experience as a seasoned DevOps and SRE practitioner.

While we hear a lot about these experiences from Google, Netflix, etc., we wanted to gather perspectives on DevOps and SRE life with other easily relatable companies. From tech-stack challenges to organization structure, Nathaniel provides a wide range of practical insights that we hope will be valuable in improving DevOps practices at your organization. READ MORE

GitHub seeks to spur innovation with Kubernetes migration
http://www.zdnet.com/article/github-seeks-to-spur-innovation-with-kubernetes-migration/

GitHub on Wednesday is sharing the details of the massive technical endeavor its engineers went through to migrate the infrastructure that powers github.com and api.github.com — some of its most critical workloads — from a set of manually-configured physical servers to Kubernetes clusters that run application containers.

GitHub is confident the move will allow for faster innovation on the online code sharing and development platform. READ MORE

SRE Thinking: Reframing Dev + Ops
http://bit.ly/2w2I53F

Last month, Eric Wright and I were able to complete a discussion that inspired my guest post for CapitalOne “How Platforms and SREs Change the DevOps Contract.” While our conversation ranged widely over the challenges of building and integrating IT processes, the key message is simple: we need to make investments in operations. READ MORE

Coal or Diamonds? Configuration Management is Under Pressure
http://bit.ly/2uTvADN

Cloud Native thinking is thankfully changing the way we approach traditional IT infrastructure. These profound changes in how we build applications with 12-factor design and containers have deep implications for how we manage configuration and the tools we use to do it. These are not cloud-only impacts – the changes impact every corner of IT data centers. READ MORE

Subscribe to our new daily DevOps, SRE, & Operations Newsletter https://paper.li/e-1498071701#/

_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

DevOpsDays Dallas – August 29 – 30: Rob Hirschfeld Talk
DevOps Summit – Oct 31 – Nov 2: Rob Hirschfeld Talk

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #85
The DevOps/WebOps Marketing Geek – LINK from @LukasHertig
Julie Evans Blog – LINK

SRE Thinking : Reframing Dev + Ops

Posted on August 16, 2017 by Rob H

This podcast explains why I’ve been using Site Reliability Engineering (SRE) as a proxy for this DevOps inspired rethinking of operations.

I hope you’ll take the time to listen to this deep conversation about very real IT issues. Eric and I are not shy about expressing our opinions, but we’re also anti-shaming. The simple reality is that building infrastructure is hard and we all make difficult choices. My hope is that we can start sharing the fixes and helping each other out.

Podcast Episode 50 – SRE Revisited plus the Challenges of Ops and more with Rob Hirschfeld (@zehicle)

Do these topics inspire you? Creating data center automation for SREs is our mission at RackN. We believe that well-run infrastructure requires building APIs from the ground up and keeping them simple. I hope that you’ll take 5 minutes to try our latest offering, Digital Rebar Provision, and join us on the quest to drive excellence in operations.

July 28 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on July 28, 2017 by Rob H

This week, we launched our new RackN website to provide more information on our solutions and services as well as provide customer examples. Click over to our new site and let us know your thoughts.

SRE Items of the Week

Site Reliability Engineer: Don’t fall victim to the bias blind spot
http://sdtimes.com/site-reliability-engineer-dont-fall-victim-to-the-bias-blind-spot/

To ensure websites and applications deliver consistently excellent speed and availability, some organizations are adopting Google’s Site Reliability Engineering (SRE) model. In this model, a Site Reliability Engineer (SRE) – usually someone with both development and IT Ops experience – institutes clear-cut metrics to determine when a website or application is production-ready from a user performance perspective. This helps reduce friction that often exists between the “dev” and “ops” sides of organizations. More specifically, metrics can eliminate the conflict between developers’ desire to “Ship it!” and operations desire to not be paged when they are on-call. If performance thresholds aren’t met, releases cannot move forward. READ MORE
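
The release gate described in the excerpt above can be sketched in a few lines. This is only an illustration and not something taken from the SD Times article: the metric names, the limits, and the `release_allowed` helper are all hypothetical, and a real gate would pull measurements from whatever monitoring system the team already runs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Threshold:
    """A hypothetical performance limit that a release must stay under."""
    name: str
    limit: float  # maximum acceptable measured value


def release_allowed(
    measured: Dict[str, float], thresholds: List[Threshold]
) -> Tuple[bool, List[str]]:
    """Return (ok, failed_metric_names) for a candidate release.

    A metric fails if it is missing or exceeds its limit, so an
    unmeasured release is blocked rather than waved through.
    """
    failures = []
    for t in thresholds:
        value = measured.get(t.name)
        if value is None or value > t.limit:
            failures.append(t.name)
    return (not failures, failures)


if __name__ == "__main__":
    # Illustrative limits only; each team would agree on its own numbers.
    gates = [Threshold("p95_latency_ms", 300.0), Threshold("error_rate_pct", 1.0)]
    ok, failed = release_allowed({"p95_latency_ms": 450.0, "error_rate_pct": 0.4}, gates)
    print("ship it" if ok else f"blocked by: {failed}")
```

Treating a missing metric as a failure is a deliberate choice in this sketch: the point of the model is that "Ship it!" has to be backed by data that both dev and ops agreed on in advance.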

Episode 50 – SRE Revisited plus the Challenge of Ops and more with Rob Hirschfeld
http://podcast.discoposse.com/e/ep-50-sre-revisited-plus-the-challenges-of-ops-and-more-with-rob-hirschfeld-zehicle/

This fun chat expands on what we started talking about in episode 42 (http://podcast.discoposse.com/e/ep-42-spiraling-ops-debt-sre-solutions-and-rackn-chat-with-rob-hirschfeld-zehicle/) as we dive into the challenges and potential solutions for thinking and acting with the SRE approach. Big thanks to Rob Hirschfeld from @RackN for sharing his thoughts and experiences from the field on this very exciting subject. LISTEN HERE

Site Reliability Engineering – Operators and Developers Working Together
http://bit.ly/2u7eSmm

Rob Hirschfeld, Co-Founder and CEO of RackN, provides his thoughts on how operators are equivalent to developers and how they work together to accomplish the critical task of keeping the infrastructure running and available amid constant changes in the data center.

Subscribe to our new daily DevOps, SRE, & Operations Newsletter https://paper.li/e-1498071701#/
_____________

UPCOMING EVENTS

DevOpsDays Dallas – August 29 – 30: Rob Hirschfeld Talk

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #82
The DevOps/WebOps Marketing Geek – LINK from @LukasHertig
Julie Evans Blog – LINK

July 14 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on July 14, 2017 by Rob H

SRE Items of the Week

Teradata Acquires San Diego-based Start-up StackIQ to Strengthen Teradata Everywhere and IntelliCloud Capabilities
http://prn.to/2vicpUb

SAN DIEGO, July 13, 2017 /PRNewswire/ — Teradata (NYSE: TDC), the leading data and analytics company, today announced the acquisition of StackIQ, developers of one of the industry’s fastest bare metal software provisioning platforms which has managed the deployment of cloud and analytics software at millions of servers in data centers around the globe. The deal will leverage StackIQ’s expertise in open source software and large cluster provisioning to simplify and automate the deployment of Teradata Everywhere. Offering customers the speed and flexibility to deploy Teradata solutions across hybrid cloud environments, allows them to innovate quickly and build new analytical applications for their business.

How Platforms and SREs Change the DevOps Contract on CapitalOne DevExchange
http://bit.ly/2uVXekf

DevOps struggles under a “fully shared responsibility” contract for Developers and Operations that drives a futile search for elusive “full-stack engineers.” It’s time to revisit how Dev and Ops are going to collaborate because these jobs often have different priorities. READ MORE

RackN Introduction Video
Rob Hirschfeld, CEO and Co-Founder introduces RackN in 48 seconds

Kubernauts Worldwide Meetup
This video is from our first Kubernauts Worldwide Meetup covering the new features in Kubernetes 1.7 presented by Ihor Dvoretskyi, Kubernetes Pain Points and Upgrade presented by Rob Hirschfeld and about Kubernauts Training presented by Des Drury. Arash Kaffamanesh moderated the online meetup and provided a short overview about what Kubernauts are about.

Rob starts at 38 minutes 50 seconds

Video Series w/ Packet.net
Three videos showing how to use Packet.net custom IPXE option with Digital Rebar IPXE provisioning

http://bit.ly/2t54J65      (Video 1 of 3)
http://bit.ly/2tO5WCy   (Video 2 of 3)
http://bit.ly/2vi5dXZ     (Video 3 of 3)

Let’s DevOps IRL: My SRE Postings on RackN by Rob Hirschfeld
http://bit.ly/2tzCvnj

I’m investing in these Site Reliability Engineering (SRE) discussions because I believe operations (and by extension DevOps) is facing a significant challenge in keeping up with development tooling. The links below have been getting a lot of interest on twitter and driving some good discussion. READ MORE

Subscribe to our new daily DevOps, SRE, & Operations Newsletter https://paper.li/e-1498071701#/
_____________

UPCOMING EVENTS

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #80
The DevOps/WebOps Marketing Geek – LINK from @LukasHertig
Julie Evans Blog – LINK

Let’s DevOps IRL: my SRE postings on RackN!

Posted on July 10, 2017 by Rob H

Addressing this Ops debt is our primary mission at my company, RackN: we believe that integrated system level tooling is required. We also believe that new tools should not disrupt environments so we work very hard to adapt to requirements of individual sites.

SRE is urgent because it provides a pragmatic path and rationale for investment.

Even if you don’t agree with Google’s term or all their practices, I think fundamental concepts of system thinking, status/pay, automation investment and developer collaboration are essential. It should come as no surprise that these are all Lean/DevOps concepts; however, SRE has the pragmatic side of being a job function.

Here are some recent relevant discussions I’ve been having about SREs with links to both the audio and my text show notes.

Cloud Cast about SRE concepts and decomposing Ops
Datanauts deep dive about SRE based on the “DevOps vs SRE” talk from DevOpsDays Austin (original post)
Charity Majors and I debate the SRE name and pay equity for Ops.
Further Reading Podcasts
- Turbonomic’s Eric Wright
- HPE’s Stephen Spector

Of course, RackN is also doing a WEEKLY SRE update that captures general interest items. Check that out and subscribe.

July 7 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on July 7, 2017 by Rob H

SRE Items of the Week

Presidential Campaigns & Immutable Infrastructure by @danielbryantuk
https://www.infoq.com/news/2017/06/presidential-infrastructure

At QCon New York 2017 Michael Fisher presented “Presidential Campaigns & Immutable Infrastructure” and discussed the implementation and challenges of provisioning infrastructure for the Hillary for America (HFA) campaign that ran during the 2015-2016 US regional and national elections. Immutable infrastructure was key to the technical success of the campaign – the team moved quickly, but were resilient against failure for the majority of the time. It can take more effort to apply the principle of immutability to everything being deployed, but it is beneficial and developers “like the handshake between SRE and dev”. READ MORE

So you want to be a SRE? by Ingo Averdunk @ingoa
https://hackernoon.com/so-you-want-to-be-an-sre-34e832357a8c

About 9 months ago I set out to leave my teaching career of six years to pursue a career as a Software Engineer. I attended a 3 month Programming Bootcamp called Hackbright Academy during which I not only learned the fundamentals of programming, but more importantly, the fundamentals of what type of work excites me. I realized that I loved design. I loved data-model design, user experience design, architectural design, system design… The list goes on, I love design. Because of this, I thought the best place for me would be as a Front End Engineer, boy was I wrong. READ MORE

LinkedIn Releases Open Source Tools
https://www.martechadvisor.com/news/search-social-ads/linkedin-releases-opensource-tools/

The social networking service for professionals, LinkedIn, has announced that it will be releasing a couple of key tools that will be available as open source projects. These have been primarily created to help businesses deal with issues regarding website outages. The new tools will also be enabling organizations to automatically connect with engineers whenever their applications fail. READ MORE
___________

Subscribe to our new daily DevOps, SRE, & Operations Newsletter https://paper.li/e-1498071701#/
_____________

UPCOMING EVENTS

2017 New York Venture Summit – LINK

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #79
The DevOps/WebOps Marketing Geek – LINK from @LukasHertig
Julie Evans Blog – LINK

June 30 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on June 30, 2017 by Rob H

SRE Items of the Week

Site Reliability Engineering at Dropbox with Tammy Butow @tammybutow

The mess and success of building open leadership (notes from Kubernetes Leadership Summit)
http://bit.ly/2tMTzEy

Three weeks ago, Kubernetes leaders met for a very busy day to reflect and plan how the community was growing. I was humbled to be part of the Kubernetes Leadership Summit due to my work as the Cluster Ops SIG co-chair. READ MORE

Ops integration will be scary, proceed with haste
http://bit.ly/2u2Wfhq

As CEO of RackN, I talk to a lot of operations teams who have big aspirations for automation that are faltering due to internal resistance. Generally, we’re talking to the SREs on the team. Sadly, those SREs are often stymied by narrowly scoped teams and house-of-cards technical debt. READ MORE

The Case for Ops Engineering Pay Equity with Charity Majors
http://bit.ly/2tZBjYD

Charity Majors is one of my DevOps and SRE heroes* so it was great fun to be able to debate SRE with her at Gluecon this spring. Encouraged by Mike Maney to retell the story, we got to recapture our disagreement about “Is SRE a Good Term?” from the evening before. READ MORE

Datanauts #89 Dives Deep on SRE Approach and Urgency
http://bit.ly/2tqmbGl

In Datanauts 089, Chris Wahl and Ethan Banks help me break down the concepts from my “DevOps vs SRE vs Cloud Native” presentation from DevOpsDays Austin last spring. They do a great job exploring the tough topics and concepts from the presentation. It’s almost like an extended Q&A so you may want to review the slides or recording before diving into the podcast.

Here are my notes from the podcast READ MORE

5 Laws every aspiring Devops engineer should know by @ChrisShort
https://opensource.com/open-organization/17/5/5-devops-laws

“A good engineer is a lazy engineer,” some will say. And to a certain extent, it’s true: Laziness is a great quality if you’re automating repetitive tasks. But laziness flies in the face of learning new technologies and getting new work done. Somewhere between Junior Systems Administrator and Senior DevOps Engineer, laziness no longer becomes an advantage.

Let’s discuss the five laws aspiring DevOps engineers should follow if they want to become great DevOps engineers. READ MORE
___________

Subscribe to our new daily DevOps, SRE, & Operations Newsletter https://paper.li/e-1498071701#/
____________

UPCOMING EVENTS

2017 New York Venture Summit – LINK

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #78
The DevOps/WebOps Marketing Geek – LINK from @LukasHertig
Julie Evans Blog – LINK

Datanauts #89 dives deep on SRE approach and urgency

Posted on June 29, 2017 by Rob H

TL;DR: SRE makes Ops more Dev-like in critical ways like status equity and tooling approaches.

In Datanauts 089, Chris Wahl and Ethan Banks help me break down the concepts from my “DevOps vs SRE vs Cloud Native” presentation from DevOpsDays Austin last spring. They do a great job exploring the tough topics and concepts from the presentation. It’s almost like an extended Q&A so you may want to review the slides or recording before diving into the podcast.

Advanced Reading: my follow-up discussion on SRE with the Cloudcast team and my previous Datanauts podcast.

Here are my notes from the podcast:

01:00 “Doing infrastructure in a way that the robots can take over”
01:51 Video where Charity & Rob Debated the SRE term
02:00 History of SRE term from Google vs Sys Ops – if site was not up, money was not flowing. SRE culture fixed pay equity and career ladder, ops would have automation/dev time, dev on hooks for errors
03:00 Google took a systems approach with lots of time for automation and coding
03:20 Finding a 10x improvement in ops. Go buy the book.
04:00 SRE is a new definition of System Op
04:10 The S in SRE could be “system” or physical location (not web site).
05:00 We’re seeing SRE teams showing up in companies of every size. Replacing DevOps teams (which is a good thing). Rob is hoping that SRE is replacing DevOps as a job title.
06:10 Don’t fall for a title change from Sys Op to SRE without actually getting the pay and authority
06:45 Ethan believes that SRE is transforming to have a broad set of responsibilities. Is it just a new System Admin definition?
07:30 Rob thinks that the SRE expectation is for a much higher level of automation. There’s a big thinking shift.
08:00 SREs are still operators. You have to walk the walk to know how to run the system. Not developers who are writing the platform.
08:30 Chris asks about the Ops technical debt
09:00 We need to make Ops tooling “better enough” – we’re not solving this problem fast enough. We have to do a better job – Rob talks about the Wannacry event.
10:30 Chris asks how to fix this since complexity is increasing. Rob plugs Digital Rebar as a way to solve this.
11:00 People are excited about Digital Rebar but don’t have the time to fix the problem. They are running crisis to crisis so we never get to automation that actually improves things.
12:00 At best, Ops is invisible. SRE is different because it includes CI/CD with ongoing interactions. There’s a lot coming with immutable operating systems and constantly term.
13:00 The idea that a Linux system has been up for 10 years is an anti-pattern. Rob would rather have people say that none of their servers has been up for more than a week (because they are constantly refreshed)
13:19 Chris & Ethan – SECTION 1 REVIEW
- SRE is not new, it’s about moving into a proactive stance (automatically reacting)
- The power is the buy in so that Ops has ownership of the stack
15:00 SRE vs DevOps vs Cloud Native – not in conflict, but we love to create opposition
15:40 There is a difference, they are not interchangeable. SRE is a job title, DevOps is a process and Cloud Native is an architecture.
16:30 We need to resist the idea that Cloud Native is a “new shiny” that replaces DevOps. We don’t have to take things away.
17:00 Lean is a process where we’re trying to shorten the flow from ideation to delivery. Read The Goal [links] and The Phoenix Project [links].
18:00 Bottlenecks (where we’ve added work or delays) really break our pipelines.
19:00 Ethan adds the insight: If you don’t have small steps then you don’t really understand your process
20:00 Platform as a Service is not really reducing complexity, we’re just hiding/abstracting it. That moves the complexity. We may hide it from developers but may be passing it to the operators.
21:00 Chris asks if this can be mapped to legacy? Rob agrees that it’s a legacy architectural choice that was made to reduce incremental risk. Today, we’re trying to make our risk into smaller steps which makes it so that we will have smaller but more frequent breaks.
22:40 The way we deliver systems is changing to require a much faster pace of taking changes
23:00 SREs are data driven so they can feed information back to devs. They can’t (shouldn’t) walk away from running systems. This is an investment requirement so we can create data.
24:00 We let a lot of problems lurk below the surface that eventually surface as a critical issue. Cannot let toothaches turn into abscesses. SREs should watch systems over time.
25:20 If you are running under performance in the cloud, then you are wasting money.
26:00 Cloud Native, an architecture? What is it? It means a ton of things. For this preso, Rob made it about 12 factor and API driven infrastructure.
26:50 “If you are not worried about rising debt then we are in trouble.” We need to root cause! If not, problems snowball and operators are just running fire to fire. We need to stop having operators be heroes / grenade divers because it’s an anti-pattern. Predictable systems do not create a lot of interrupts or crises. Operators should not be event driven.
28:40 Chris & Ethan – SECTION 2 REVIEW
- Chris: Being data driven combats complexity
- Ethan: Breaking down processes into smaller units reduces risk.
30:00 Cloud First is not Cloud Only. CNCF projects are not VM specific, they are about abstractions that help developers be more productive. Ideally, the abstractions remove infrastructure because developers don’t want to do any infrastructure. We should not care about which type of infrastructure we are using.
31:30 The similarities between the concepts is in their common outcomes/values. Cloud First wants to be infrastructure agnostic.
32:30 Chris asks how important CI/CD should be. Are these still important in non-Cloud environments? Rob thinks that Cloud Native may “cloud wash” architectures that are really just as important in traditional infrastructure.
34:00 Cloud Native was a defensive architecture because early cloud was not very good. CI/CD pipelines would be considered best practices in regular manufacturing.
35:00 These ideas are really good manufacturing process applied back to IT. Thankfully, there’s really nothing unexpected from repeatable production.
36:30 Lesson: Pay Equity. Traditionally operators are not paid as well as developers and that means that we’re giving them less respect. HiPPO (highest paid person in organization) is a very real effect where you can create a respect gap.
38:00 Lesson: Disrupt Less. We love the idea of disruption, but disruptions are very expensive and the cost falls disproportionately on the operators. Change for Developers may be small but have big impacts on operators. More disruptive changes actually slow down adoption because they sap momentum. SREs should be able to push back to insist on migration paths.
40:00 Rob talks about how RedFish, while good to replace IPMI, will take a long time before it does. There are pros and cons.
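
The 13:00 note (no server should be up for more than a week) is easy to turn into a check. Here is a minimal sketch, assuming you can already collect a last-boot timestamp per host from your inventory or monitoring system. The host names, the one-week limit as a constant, and the `overdue_for_refresh` helper are hypothetical, and nothing here is part of Digital Rebar.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumed policy from the podcast note: refresh every server at least weekly.
MAX_AGE = timedelta(days=7)


def overdue_for_refresh(
    boot_times: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> List[str]:
    """Return the hosts whose uptime exceeds max_age, sorted by name."""
    return sorted(
        host for host, booted in boot_times.items() if now - booted > max_age
    )


if __name__ == "__main__":
    # Hypothetical inventory; real data would come from your own tooling.
    inventory = {
        "web-1": datetime(2017, 6, 27, 9, 0),
        "db-1": datetime(2017, 6, 1, 0, 0),
    }
    for host in overdue_for_refresh(inventory, now=datetime(2017, 6, 29, 12, 0)):
        print(f"{host} is overdue for a rebuild")
```

The check reports long uptime as a problem to fix rather than a record to celebrate, which is the inversion the note is arguing for.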

Ops integration will be scary, proceed with haste!

Posted on June 29, 2017 by Rob H

TL;DR: Your own tool silos (and the teams supporting them) are blocking your progress.

Last week, I examined some of my DevOps scar tissue and tweeted: “consider, ops integration will be scary – you have to give up control of individual actions and silos. it’s hard to give up control”

The tweet seemed to strike a nerve with others because change and control are so often at war. It was based on a recurring theme that the RackN team sees from ops organizations: antibodies towards integrated solutions in favor of DIY projects combining disparate tools.

It makes sense to me that operators want a sense of control and ownership; however, those same motivations are counter to the automation imperative that should be driving them forward. Patching together a solution today is adding technical debt that becomes insurmountable when used in production.

This challenge is why so much DevOps content is targeted at organization culture instead of tools. While this is clearly the root, I also think that our tools are not designed to work together as a system. The fact that teams prefer it that way is a key part of the problem.

Let’s do ourselves a favor – let’s take the time to solve operations issues at the system level like we’ve been trying to do with Digital Rebar. We’ll all move faster together.

The Case for Ops Engineering Pay Equity w/ Charity Majors

Posted on June 28, 2017 by Rob H

TL;DR: Operators need pay/status equity to succeed.

Charity Majors is one of my DevOps and SRE heroes* so it was great fun to be able to debate SRE with her at Gluecon this spring. Encouraged by Mike Maney to retell the story, we got to recapture our disagreement about “Is SRE a Good Term?” from the evening before.

While it’s hard to fully recapture with adult beverages, we were able to recreate the key points.

First, we both strongly agree that we need status and pay equity for operators. That part of the SRE message is essential regardless of the name of the department.

Then it gets more nuanced. Charity, who’s more of a Silicon Valley insider, believes that SRE is tainted by the “Google for Everyone” cargo cult. She has trouble separating the term SRE from the specific Google practices that helped define it.

As someone who simply commutes to Silicon Valley, I do not see that bias in the discussions I’ve been having. I do agree that simply trying to copy Google (or other unicorns) in every way is a failure pattern.

Charity: “I don’t want to get paid to keep someone else’s ~~shit~~ site alive”

I think Google did a good job with the book by defining the term for a broad audience. Charity believes this signals that SRE means you are working for a big org. Charity suggested several better alternatives, such as Operations Engineer. In the end, the danger seems to be when Dev and Ops create silos instead of collaborating.

Consensus: Job Title? Who cares. The need is to make operations more respected and equal.

What did you think of the video? How is your team defining Operations titles and teams?

(*) yes, I’m working on an actual list – stay tuned.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Category Archives: SRE