What makes ops hard? SRE/DevOps challenge & imperative [from Cloudcast 301]

TL;DR: Operators (DevOps & SREs) have a hard job, we need to make time and room for them to redefine their jobs in a much more productive way.

Cloudcast-Logo-2015-Banner-BlueThe Cloudcast.net by Brian Gracely and Aaron Delp brings deep experience and perspective into their discussions based on their impressive technology careers and understanding of the subject matter.  Their podcasts go deep quickly with substantial questions that get to the heart of the issue.  This was my third time on the show (previous notes).

In episode 301, we go deeply into the meaning and challenges for Site Reliability Engineering (SRE) functions.  We also cover some popular technologies that are of general interest.

Author’s Note; For further information about SREs, listen to my discussion about “SRE vs DevOps vs Cloud Native” on the Datanauts podcast #89.  (transcript pending)

Here are my notes from Cloudcast 301. with bold added for emphasis:

  • 2:00 Rob defines SRE (more resources on RackN.com site).
    • 2:30 Google’s SRE book gave a name, even changed the definition, to what I’ve been doing my whole career. Evolved name from being just about sites to a full system perspective.  
    • 3:30 SRE and DevOps are aligned at the core.  While DevOps is about process and culture, SRE is more about the function and “factory.”
    • 4:30 Developers don’t want to be shoving coal into the engine, but someone, SREs, have to make sure that everything keeps running
  • 5:15 Brian asks about impedance mismatch between Dev and Ops.  How do we fix that?
    • 6:30 Rob talks about the crisis brewing for operations innovation gap (link).  Digital Rebar is designed to create site-to-site automation so Operators can share repeatable best practices.
    • 7:30 OpenStack ran aground because Operators because we never created a the practices that could be repeated.  “Managed service as the required pattern is a failure of building good operational software.”
    • 8:00 RackN decomposes operations into isolated units so that individual changes don’t break the software on top

  • 9:20 Brian talks about the increasing rate of releases means that operations doesn’t have the skills to keep up with patching.
    • 10:10 That’s “underlay automation” and even scarier because software is composited with all sorts of parts that have their own release cycles that are not synchronized.
    • 11:30 We need to get system level patch/security.update hygiene to be automatic
    • 12:20 This is really hard!

  • 13:00 Brian asks what are the baby steps?
    • 13:20 We have to find baby steps where there are nice clean boundaries at every layer from the very most basic.  For RackN, that’s DHCP and PXE and then upto Kubernetes.
    • 15:15 Rob rants that renaming Ops teams as SRE is a failure because SRE has objectives like job equity that need to be included.
    • 16:00 Org silos get in the way of automation that have antibodies that make it difficult for SREs and DevOps to succeed.
    • 17:10 Those people have to be empowered to make change
    • 17:40 The existing tools must be pluggable or you are hurting operators.  There’s really no true greenfield, so we help people by making things work in existing data centers.
    • 19:00 Scripts may have technical debt but that does not mean they should just be disposed.
    • 19:20 New and shiney does not equal better.  For example, Container Linux (aka CoreOS) does not solve all problems.  
    • 20:10 We need to do better creating bridges between existing and new.
    • 20:40 How do we make Day 2 compelling?

  • 21:15 Brian asks about running OpenStack on Kubernetes.
    • 22:00 Rob is a fan of Kubernetes on Metal, but really, we don’t want metal and vms to be different.  That means that Kubernetes can be a universal underlay which is threatening to OpenStack.
    • 23:00 This is no longer a JOKE: “Joint OpenStack Kubernetes Environments”
    • 23:30 Running things on Kubernetes (or OpenStack) is great because the abstractions hide complexity of infrastructure; however, at the physical layer you need something that exposes that complexity (which is what RackN does).

  • 25:00 Brian asks at what point do you need to get past the easy abstractions
    • 25:30 You want to never care ever.  But sometimes you need the information for special cases.
    • 26:20 We don’t want to make the core APIs complex just to handle the special cases.
    • 27:00 There’s still a class of people who need to care about hardware.  These needs should not be embedded into the Kubernetes (or OpenStack) API.

  • 28:00 Brian summarizes that we should not turn 1% use cases into complexity for everyone.  We need to foster the skill of coding for operators
    • 28:45 For SREs, turning Operators into coding & automation is essential.  That’s a key point in the 50% programming statement for SREs.
    • In the closing, Rob suggested checking out Digital Rebar Provision as a Cobbler replacement.

We’re very invested in talking about SRE and want to hear from you! How is your company transforming operations work to make it more sustainable, robust and human?We want to hear your stories and questions.

The mess and success of building open leadership (notes from Kubernetes Leadership Summit)

TL;DR: Working on building open governance that is both inclusive and able to make hard decisions.

building-joy-planning-plansThree weeks ago, Kubernetes leaders met for a very busy day to reflect and plan how the community was being growing.  I was humbled to be part of the Kubernetes Leadership Summit due to my work as the Cluster Ops SIG co-chair.    Please join us every other Thursday at 1 PT to share stories about running or planning to run Kubernetes.

This event had to thread a delicate balance for an open project:  we needed to limit attendance to focus discussions while ensuring that the community was represented.  Our notes (captured in Google Docs) are being transcribed to markdown here.

Here are some key topics that shaped the day from my perspective:

  • A consensus that core needed to focus on paying down debt and getting smaller.  The core project is seen as a bottleneck to growth.  The comes from number of people trying to interact in the repo and from having too much technical debt,  As a group, we agreed that paying this debt was very important; however, we did not define or authorize specific action to address it.  I felt that just acknowledging this focus by a show of hands was a positive action.
  • Moving forward on formation of a Steering Committee.  The bootstrapping committee reviewed their Steering Committee proposal.  The concepts here are to design a governing body that intentionally delegates their authority.  I think it’s an interesting approach that will help to empower more people in the project.  This design is different than a corporate board that’s focused on supervision.  Here’s the draft document we reviewed as input into the next phase proposal.
  • Continue using SIGs to divide work.  A consequence of the governance design is that we are (ab)using special interest groups (SIG) to organize the coding and feature work for Kubernetes.  They also carry the load for releases, product management and operations.  The push from the meeting was to have all SIGs with specific deliverables.  I think that works well for some SIGs, but more user/operator focused groups (like Cluster Ops) will feel that it’s harder to find the right engagement models.

Overall, the event was very positive with lively group discussions.  This group is focused on building Kubernetes, so there was very little vendor, marketing, user or operator focus.  As the project grows, I believe these other focus areas will be important to manage.  Likely, those concerns cannot be addressed until the Steering Committee is formed.

RackN is committed to helping make Kubernetes operable and improve the operator experience.  I’m interested in hearing about your remote or local impressions of this event.  What items should have gotten more discussion?  What is the project missing?

June 23 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rack ngo)

SRE Items of the Week

Datanauts 089: SRE vs Cloud Native vs DevOps
http://bit.ly/2txPXWV

Rob Hirschfeld joins the Datanauts to talk about the term Site Reliability Engineer (SRE) and what it means for IT operations.

Rob explores how the SRE designation is an effort to put operations teams on a more equal footing with developers within an organization. Rob and the Datanauts also discuss how SREs line up with other industry trends such as the cloud native and DevOps movements. LISTEN HERE

Why Does DevOps Require a New Operating Model? By Mustafa Kapadia @MKapadiaTweets
https://devops.com/why-should-cios-redesign-their-organizations/

For many, redesigning the operating model is table stakes for a successful DevOps transformation. But have you ever wondered why? Popular wisdom will have you believe that the main reason for operating model redesign are to…

“Improve collaboration between business and IT”
“Realign metrics”
“Take full advantage of the new tools”
“And even jump start culture change”

While these are all good reasons, frankly they miss the point. Experience suggests there is a more practical reason – match ownership with desired output.

What do we mean by that? Well first, let’s look at how the current model works. READ MORE

What can developers learn from being on call? By Julia Evans @b0rk http://jvns.ca/blog/2017/06/18/operate-your-software/

We often talk about being on call as being a bad thing. For example, the night before I wrote this my phone woke me up in the middle of the night because something went wrong on a computer. That’s no fun! I was grumpy.

In this post, though, we’re going to talk about what you can learn from being on call and how it can make you a better software engineer!. And to learn from being on call you don’t necessarily need to get woken up in the middle of the night. By “being on call”, here, I mean “being responsible for your code when it breaks”. It could mean waking up to issues that happened overnight and needing to fix them during your workday! READ MORE

Kargo Ansible Playbooks foster Collaborative Kubernetes Ops
http://bit.ly/2qENw3I   

Why Kargo?
Making Kubernetes operationally strong is a widely held priority and I track many deployment efforts around the project. The incubated Kargo project is of particular interest for me because it uses the popular Ansible toolset to build robust, upgradable clusters on both cloud and physical targets. I believe using tools familiar to operators grows our community.

We’re excited to see the breadth of platforms enabled by Kargo and how well it handles a wide range of options like integrating Ceph for StatefulSet persistence and Helm for easier application uploads. Those additions have allowed us to fully integrate the OpenStack Helm charts (demo video). READ MORE

newsletter

Subscribe to our new daily DevOps, SRE, & Operations Newsletter https://paper.li/e-1498071701#/

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

  • 2017 New York Venture Summit – LINK

OTHER NEWSLETTERS

June 16 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rack ngo)

SRE Items of the Week

The Cloudcast #301 – SRE and Infrastructure Operations
http://www.thecloudcast.net/2017/06/the-cloudcast-301-sre-and.html

Description: Brian talks with Rob Hirschfeld (@zehicle, Founder/CEO of @RackN) about the concepts of SRE (Site Reliability Engineering), the challenges of maintaining infrastructure software, emerging tools and the next-generation of operations.

Show Notes:

  • Topic 1 – Welcome back to the show. Let’s start by talking about the concept of SRE (Site Reliability Engineering). Give us the basics and maybe explain how it differs from what people define in DevOps.
  • Topic 2 – Application development has been moving faster for quite a while (agile development, etc.). But now infrastructure/operations teams have to deal with faster software – especially around updates (e.g. Kubernetes releases every 3 months). How are companies managing this?
  • Topic 3 – Given that this pace of operations change may not slow down, how do you think about the challenge in terms of process/operations versus technology/tools?
  • Topic 4 – What are some of the steps that companies take to better prepare for this type of operational model? Tools, process, skills, etc.
  • Topic 5 – Do you see SRE as being a progression for existing infrastructure/operations people, or is this more focused on sysadmins or developers that want to get away from building applications?

_____________

DevOps Enterprise Summit London: Tales of courage and community
https://techbeacon.com/devops-enterprise-summit-london-tales-courage-community

After spending two amazing days with 700 of my closest DevOps cohorts from Europe, the Middle East, Africa, and beyond, I learned all about the latest and greatest IT and technology transformation reports at the DevOps Enterprise Summit London. With substantial growth in attendance from the first year, in 2016, the buzz around the show was palpable. And, what a location! From the venue, the QEII Centre, we had 360-degree views of central London, from Big Ben to the London Eye and beyond.

Read more from Steve Brodie, CEO of Electric Cloud @stbrodie
_____________

.IO! .IO! It’s off to a Service Mesh you should go [Gluecon 2017 notes]
http://bit.ly/2rjw4We  

Gluecon turned out to be all about a microservice concept called a “service mesh” which was being promoted by Buoyant with Linkerd and IBM/Google/Lyft with Istio.  This class of services is a natural evolution of the rush to microservices and something that I’ve written microservice technical architecture on TheNewStack about in the past. READ MORE
_____________

A few things I’ve learned about Kubernetes
https://jvns.ca/blog/2017/06/04/learning-about-kubernetes/

I’ve been learning about Kubernetes at work recently. I only started seriously thinking about it maybe 6 months ago – my partner Kamal has been excited about Kubernetes for a few years (him: “julia! you can run programs without worrying what computers they run on! it is so cool!“, me: “I don’t get it, how is that even possible”), but I understand it a lot better now.

This isn’t a comprehensive explanation or anything, it’s some things I learned along the way that have helped me understand what’s going on.

Read more from Julia Evans @b0rk
_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

  • 2017 New York Venture Summit – LINK

OTHER NEWSLETTERS

 

 

.IO! .IO! It’s off to a Service Mesh you should go [Gluecon 2017 notes]

TL;DR: If you are containerizing your applications, you need to be aware of this “service mesh” architectural pattern to help manage your services.

Gluecon turned out to be all about a microservice concept called a “service mesh” which was being promoted by Buoyant with Linkerd and IBM/Google/Lyft with Istio.  This class of services is a natural evolution of the rush to microservices and something that I’ve written microservice technical architecture on TheNewStack about in the past.

servicemeshA service mesh is the result of having a dependency grid of microservices.  Since we’ve decoupled the application internally, we’ve created coupling between the services.  Hard coding those relationships causes serious failure risks so we need to have a service that intermediates the services.  This pattern has been widely socialized with this zipkin graphic (Srdan Srepfler’s microservice anatomy presentation)

IMHO, it’s healthy to find service mesh architecturally scary.

One of the hardest things about scaling software is managing the dependency graph.  This challenge is unavoidable from early days of Windows “DLL Hell” to the mixed joy/terror of working with Ruby Gem, Python Pip and Node.js NPM.  We get tremendous acceleration from using external modules and services, but we also pay a price to manage those dependencies.

For microservice and Cloud Native designs, the service mesh is that dependency management price tag.

A service mesh is not just a service injected between services.  It’s simplest function is to provide a reverse proxy so that multiple services can be consolidated under a single end-point.  That quickly leads to needing load balancers, discovery and encrypted back-end communication.  From there, we start thinking about circuit breaker patterns, advanced logging and A/B migrations.  Another important consideration is that service meshes are for internal services and not end-user facing, that means layers of load balancers.

It’s easy to see how a service mesh becomes a very critical infrastructure component.

If you are working your way through containerization then these may seem like very advanced concepts that you can postpone learning.  That blissful state will not last for long and I highly suggest being aware of the pattern before your development teams start writing their own versions of this complex abstraction layer.  Don’t assume this is a development concern: the service mesh is deeply tied to infrastructure and operations.

The service mesh is one of those tricky dev/ops intersections and should be discussed jointly.

Has your team been working with a service mesh?  We’d love to hear your stories about it!

Related Reading:

June 2 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo)

SRE Items of the Week

RackN and our Co-Founder and CEO Rob Hirschfeld openly called for a significant change to the OpenStack and Kubernetes communities in his VMBlog.com post, How is OpenStack so dead AND yet so very alive to SREs?

“We’re going to keep solving problems in and around the OpenStack community.  I’m excited to see the Foundation embracing that mission.  There are still many hard decisions to make.  For example, I believe that Kubernetes as an underlay is compelling for operators and will drive the OpenStack code base into a more limited role as a Kubernetes workload (check out my presentation about that at Boston).  While that may refocus the coding efforts, I believe it expands the relevance of the open infrastructure community we’ve been building.

Building infrastructure software is hard and complex.  It’s better to do it with friends so please join me in helping keep these open operations priorities very much alive.”

To provide more information on this idea, Rob posted a new blog, OpenStack’s Big Pivot: our suggestion to drop everything and focus on being a Kubernetes VM management workload.

“Sometimes paradigm changes demand a rapid response and I believe unifying OpenStack services under Kubernetes has become an such an urgent priority that we must freeze all other work until this effort has been completed.”

This proposal has caused significant readership for a typical RackN blog as well as on social media so Rob has posted a 2nd post to further the proposal. (re)Finding an Open Infrastructure Plan: Bridging OpenStack & Kubernetes.

It’s essential to solve these problems in an open way so that we can work together as a community of operators.”

As you would expect, RackN is very interested in your thoughts on this proposal and its impact not only on the OpenStack and Kubernetes communities but also how it can transform the ability of IT infrastructure teams to deploy complex technologies in a reliable and scalable manner.

Please contact @zehicle and @rackngo to join the conversation.
_____________

Using Containers and Kubernetes to Enable the Development of Object-Oriented Infrastructure: Brendan Burns GlueCon Presentation

Is SRE a Good Term?
Interview with Rob Hirschfeld (RackN) and Charity Majors (Honeycomb) at Gluecon 2017


_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

Velocity : June 19 – 20 in San Jose, CA

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly)Issue #74

(re)Finding an Open Infrastructure Plan: Bridging OpenStack & Kubernetes

TL;DR: infrastructure operations is hard and we need to do a lot more to make these systems widely accessible, easy to sustain and lower risk.  We’re discussing these topics on twitter…please join in.  Themes include “do we really have consensus and will to act” and “already a solved problem” and “this hurts OpenStack in the end.”  

pexels-photo-229949I am always looking for ways to explain (and solve!) the challenges that we face in IT operations and open infrastructure.  I’ve been writing a lot about my concern that data center automation is not keeping pace and causing technical debt.  That concern led to my recent SRE blogging for RackN.

It’s essential to solve these problems in an open way so that we can work together as a community of operators.

It feels like developers are quick to rally around open platforms and tools while operators tend to be tightly coupled to vendor solutions because operational work is tightly coupled to infrastructure.  From that perspective, I’m been very involved in OpenStack and Kubernetes open source infrastructure platforms because I believe the create communities where we can work together.

This week, I posted connected items on VMblog and RackN that layout a position where we bring together these communities.

Of course, I do have a vested interest here.  Our open underlay automation platform, Digital Rebar, was designed to address a missing layer of physical and hybrid automation under both of these projects.  We want to help accelerate these technologies by helping deliver shared best practices via software.  The stack is additive – let’s build it together.

I’m very interested in hearing from you about these ideas here or in the context of the individual posts.  Thanks!

OpenStack’s Big Pivot: our suggestion to drop everything and focus on being a Kubernetes VM management workload

TL;DR: Sometimes paradigm changes demand a rapid response and I believe unifying OpenStack services under Kubernetes has become an such an urgent priority that we must freeze all other work until this effort has been completed.

See Also Rob’s VMblog.com post How is OpenStack so dead AND yet so very alive

By design, OpenStack chose to be unopinionated about operations.

pexels-photo-422290That made sense for a multi-vendor project that was deeply integrated with the physical infrastructure and virtualization technologies.  The cost of that decision has been high for everyone because we did not converge to shared practices that would drive ease of operations, upgrade or tuning.  We ended up with waves of vendors vying to have the the fastest, simplest and openest version.  

Tragically, install became an area of competition instead an area of collaboration.

Containers and microservice architecture (as required for Kubernetes and other container schedulers) is providing an opportunity to correct this course.  The community is already moving towards containerized services with significant interest in using Kubernetes as the underlay manager for those services.  I’ve laid out the arguments for and challenges ahead of this approach in other places.  

These technical challenges involve tuning the services for cloud native configuration and immutable designs.  They include making sure the project configurations can be injected into containers securely and the infra-service communication can handle container life-cycles.  Adjacent concerns like networking and storage also have to be considered.  These are all solvable problems that can be more quickly resolved if the community acts together to target just one open underlay.

The critical fact is that the changes are manageable and unifying the solution makes the project stronger.

Using Kubernetes for OpenStack service management does not eliminate or even solve the challenges of deep integration.  OpenStack already has abstractions that manage vendor heterogeneity and those abstractions are a key value for the project.  Kubernetes solves a different problem: it manages the application services that run OpenStack with a proven, understood pattern.  By adopting this pattern fully, we finally give operators consistent, shared and open upgrade, availability and management tooling.

Having a shared, open operational model would help drive OpenStack faster.

There is a risk to this approach: driving Kubernetes as the underlay for OpenStack will force OpenStack services into a more narrow scope as an infrastructure service (aka IaaS).  This is a good thing in my opinion.   We need multiple abstractions when we build effective IT systems.  

The idea that we can build a universal single abstraction for all uses is a dangerous distraction; instead; we need to build platform layers collaborativity.  

While initially resisting, I have become enthusiatic about this approach.  RackN has been working hard on the upgradable & highly available Kubernetes on Metal prerequisite.  We’ve also created prototypes of the fully integrated stack.  We believe strongly that this work should be done as a community effort and not within a distro.

My call for a Kubernetes underlay pivot embraces that collaborative approach.  If we can keep these platforms focused on their core value then we can build bridges between what we have and our next innovation.  What do you think?  Is this a good approach?  Contact us if you’d like to work together on making this happen.

See Also Rob’s VMblog.com post How is OpenStack so dead AND yet so very alive to SREs? 

May 19 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo)

SRE Items of the Week


Kargo Ansible Playbooks foster Collaborative Kubernetes Ops
http://blog.kubernetes.io/2017/05/kargo-ansible-collaborative-kubernetes-ops.html

kubernetes

Making Kubernetes operationally strong is a widely held priority and I track many deployment efforts around the project. The incubated Kargo project is of particular interest for me because it uses the popular Ansible toolset to build robust, upgradable clusters on both cloud and physical targets. I believe using tools familiar to operators grows our community.

We’re excited to see the breadth of platforms enabled by Kargo and how well it handles a wide range of options like integrating Ceph for StatefulSet persistence and Helm for easier application uploads. Those additions have allowed us to fully integrate the OpenStack Helm charts (demo video). READ MORE
___________

Cybercrime for Profit? Five reasons why we need to start driving much more dynamic IT Operations
https://rackn.com/2017/05/16/cybercrime-for-profit-five-reasons-why-we-need-to-starting-driving-much-more-dynamic-it-operations/
pexels-photo-169617

There’s a frustrating cyberattack driven security awareness cycle in IT Operations. Exploits and vulnerabilities are neither new nor unexpected; however, there is a new element taking shape that should raise additional alarm.

Cyberattacks are increasingly profit generating and automated. READ MORE
_____________

Building the SRE Culture at LinkedIn
https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin

Being a Site Reliability Engineer (SRE) means having to talk about hard problems. Site outages, complex failure scenarios, and other technical emergencies are the things we have to be prepared to deal with every day. When we’re not dealing with problems, we’re discussing them. We regularly perform post-mortems and root cause analyses, and we generally dig into complex technical problems in an unflinching way. READ MORE
_____________
Virtual Panel: OpenStack Summit Boston 2017 Debriefing


_____________

SRE vs. DevOps — a False Distinction?
https://devops.com/sre-vs-devops-false-distinction/

Just a few days before he died at the beginning of the 1990s, a wise man taught us that “the show must go on.” Freddie Mercury’s parting words have long provided the guiding light for many, if not all, ops teams. In their eyes, the production environment should be exposed to minimum risk, even at the expense of new features and problem resolution.

About 10 years ago, Google decided to change its approach to production management. It took the company only a few years to realize that while R&D focused on creating new features and pushing them to production, the Operations group was trying to keep production as stable as possible—the two teams were pulling in opposite directions. This tension arose due to the groups’ different backgrounds, skill sets, incentives and metrics by which they were measured. READ MORE
_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

Gluecon : May 24 – 25, 2017 in Denver, CO

  • Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly)Issue #72

May 12 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo)

SRE Items of the Week

RobatOpenStack

OpenStack on Kubernetes: Will it blend? (OpenStack Summit Session) w/ Rob Hirschfeld

OpenStack and Kubernetes: Combining the Best of Both Worlds (OpenStack Summit Session) w/ Rob Hirschfeld

OpenStack Summit Boston Day 1 Notes by Rob Hirschfeld
https://robhirschfeld.com/2017/05/09/openstack-boston-day-1-notes/

Contrary to pundit expectations, OpenStack did not roll over and die during the keynotes yesterday.

In fact, I saw the signs of a maturing project seeing real use and adoption. More critically, OpenStack leadership started the event with an acknowledgement of being part of, not owning, the vibrant open infrastructure community. READ MORE

_______
Immutable Infrastructure Webinar

Attendees:

  • Greg Althaus, Co-Founder and CTO, RackN
  • Erica Windisch, Founder and CEO, Piston 
  • Christopher MacGown, Advisor, IOpipe
  • Riyaz Faizullabhoy,  Security Engineer, Docker
  • Sheng Liang, Founder and CEO Rancher Labs
  • Moderated by Stephen Spector, HPE, Cloud Evangelist

_______
SREies Part1: Configuration Management by Krishelle Hardson-Hurley

SREies is a series on topics related to my job as a Site Reliability Engineer (SRE). About a month ago, I wrote an article about what it means to be an SRE which included a compatibility quiz and resource list to those who were intrigued by the role. If you are unfamiliar with SRE, I would suggest starting there before moving on.

In this series, I will extend my description to include more specific summaries of concepts that I have learned during my first six months at Dropbox. In this edition, I will be discussing Configuration Management. READ MORE

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

Interop ITX : May 15 – 19, 2017 in Las Vegas, NV

Gluecon : May 24 – 25, 2017 in Denver, CO

  • Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly)Issue #71