The Case for Ops Engineering Pay Equity w/ Charity Majors

Posted on June 28, 2017 by Rob H

TL;DR: Operators need pay/status equity to succeed.

Charity Majors is one of my DevOps and SRE heroes* so it was great fun to be able to debate SRE with her at Gluecon this spring. Encouraged by Mike Maney to retell the story, we got to recapture our disagreement about “Is SRE is Good Term?” from the evening before.

While it’s hard to fully recapture with adult beverages, we were able to recreate the key points.

First, we both strongly agree that we need status and pay equity for operators. That part of the SRE message is essential regardless of the name of the department.

Then it get’s more nuanced. Charity, whose more of a Silicon Valley insider, believes that SRE is tainted by the “Google for Everyone” cargo cult. She has trouble separating the term SRE from the specific Google practices that helped define it.

As someone who simply commutes to Silicon Valley, I do not see that bias in the discussions I’ve been having. I do agree that companies that try to simply copy Google (or other unicorns) in every way is a failure pattern.

Charity: “I don’t want get paid to keep someone else’s ~~shit~~ site alive”

I think Google did a good job with the book by defining the term for a broad audience. Charity believes this signals that SRE means you are working for a big org. Charity suggested several better alternatives, Operations Engineer. At the end, the danger seems to be when Dev and Ops create silos instead of collaborating.

Consensus: Job Title? Who cares. The need to to make operations more respected and equal.

What did you think of the video? How is your team defining Operations titles and teams?

(*) yes, I’m working on an actual list – stay tuned.

June 16 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on June 16, 2017 by Rob H

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rack ngo)

SRE Items of the Week

The Cloudcast #301 – SRE and Infrastructure Operations
http://www.thecloudcast.net/2017/06/the-cloudcast-301-sre-and.html

Description: Brian talks with Rob Hirschfeld (@zehicle, Founder/CEO of @RackN) about the concepts of SRE (Site Reliability Engineering), the challenges of maintaining infrastructure software, emerging tools and the next-generation of operations.

Show Notes:

Topic 1 – Welcome back to the show. Let’s start by talking about the concept of SRE (Site Reliability Engineering). Give us the basics and maybe explain how it differs from what people define in DevOps.

Topic 2 – Application development has been moving faster for quite a while (agile development, etc.). But now infrastructure/operations teams have to deal with faster software – especially around updates (e.g. Kubernetes releases every 3 months). How are companies managing this?

Topic 3 – Given that this pace of operations change may not slow down, how do you think about the challenge in terms of process/operations versus technology/tools?

Topic 4 – What are some of the steps that companies take to better prepare for this type of operational model? Tools, process, skills, etc.

Topic 5 – Do you see SRE as being a progression for existing infrastructure/operations people, or is this more focused on sysadmins or developers that want to get away from building applications?

_____________

DevOps Enterprise Summit London: Tales of courage and community
https://techbeacon.com/devops-enterprise-summit-london-tales-courage-community

After spending two amazing days with 700 of my closest DevOps cohorts from Europe, the Middle East, Africa, and beyond, I learned all about the latest and greatest IT and technology transformation reports at the DevOps Enterprise Summit London. With substantial growth in attendance from the first year, in 2016, the buzz around the show was palpable. And, what a location! From the venue, the QEII Centre, we had 360-degree views of central London, from Big Ben to the London Eye and beyond.

Read more from Steve Brodie, CEO of Electric Cloud @stbrodie
_____________

.IO! .IO! It’s off to a Service Mesh you should go [Gluecon 2017 notes]
http://bit.ly/2rjw4We

Gluecon turned out to be all about a microservice concept called a “service mesh” which was being promoted by Buoyant with Linkerd and IBM/Google/Lyft with Istio. This class of services is a natural evolution of the rush to microservices and something that I’ve written microservice technical architecture on TheNewStack about in the past. READ MORE
_____________

A few things I’ve learned about Kubernetes
https://jvns.ca/blog/2017/06/04/learning-about-kubernetes/

I’ve been learning about Kubernetes at work recently. I only started seriously thinking about it maybe 6 months ago – my partner Kamal has been excited about Kubernetes for a few years (him: “julia! you can run programs without worrying what computers they run on! it is so cool!“, me: “I don’t get it, how is that even possible”), but I understand it a lot better now.

This isn’t a comprehensive explanation or anything, it’s some things I learned along the way that have helped me understand what’s going on.

Read more from Julia Evans @b0rk
_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

2017 New York Venture Summit – LINK

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #76
The DevOps/WebOps Marketing Geek – LINK from @LukasHertig
Julie Evans Blog – LINK

June 2 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on June 2, 2017 by Rob H

SRE Items of the Week

RackN and our Co-Founder and CEO Rob Hirschfeld openly called for a significant change to the OpenStack and Kubernetes communities in his VMBlog.com post, How is OpenStack so dead AND yet so very alive to SREs?

“We’re going to keep solving problems in and around the OpenStack community. I’m excited to see the Foundation embracing that mission. There are still many hard decisions to make. For example, I believe that Kubernetes as an underlay is compelling for operators and will drive the OpenStack code base into a more limited role as a Kubernetes workload (check out my presentation about that at Boston). While that may refocus the coding efforts, I believe it expands the relevance of the open infrastructure community we’ve been building.

Building infrastructure software is hard and complex. It’s better to do it with friends so please join me in helping keep these open operations priorities very much alive.”

To provide more information on this idea, Rob posted a new blog, OpenStack’s Big Pivot: our suggestion to drop everything and focus on being a Kubernetes VM management workload.

“Sometimes paradigm changes demand a rapid response and I believe unifying OpenStack services under Kubernetes has become an such an urgent priority that we must freeze all other work until this effort has been completed.”

This proposal has caused significant readership for a typical RackN blog as well as on social media so Rob has posted a 2nd post to further the proposal. (re)Finding an Open Infrastructure Plan: Bridging OpenStack & Kubernetes.

“It’s essential to solve these problems in an open way so that we can work together as a community of operators.”

As you would expect, RackN is very interested in your thoughts on this proposal and its impact not only on the OpenStack and Kubernetes communities but also how it can transform the ability of IT infrastructure teams to deploy complex technologies in a reliable and scalable manner.

Please contact @zehicle and @rackngo to join the conversation.
_____________

Using Containers and Kubernetes to Enable the Development of Object-Oriented Infrastructure: Brendan Burns GlueCon Presentation

Is SRE a Good Term?
Interview with Rob Hirschfeld (RackN) and Charity Majors (Honeycomb) at Gluecon 2017

_____________

UPCOMING EVENTS

Velocity : June 19 – 20 in San Jose, CA

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #74

May 26 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on May 26, 2017 by Rob H

SRE Items of the Week

Co-Founders of RackN Rob Hirschfeld and Greg Althaus at GlueCon

Reuven Cohen and Rob Hirschfeld Chat at GlueCon17
Reuven Cohen (@ruv) and Rob Hirschfeld discuss data center infrastructure trends concerning provisioning, automation and challenges. Rob highlights his company RackN and the open source project Digital Rebar sponsored by RackN.

_____________

Is SRE a Good Term?
Interview with Rob Hirschfeld (RackN) and Charity Majors (Honeycomb) at Gluecon 2017

_____________

How Google Runs its Production Systems – Get the Book
http://www.techrepublic.com/article/want-to-understand-how-google-runs-its-production-systems-read-this-free-book/

The book Site Reliability Engineering helps readers understand how some Googlers think: It contains the ideas of more than 125 authors. The four editors, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, managed to weave all of the different perspectives into a unified work that conveys a coherent approach to managing distributed production systems.

Site Reliability Engineering delivers 34 chapters—totaling more than 500 printed pages from O’Reilly Media—that encompass the principles and practices that keep Google’s production systems working. The entire book is available online at https://landing.google.com/sre/book.html, along with links to other talks, interviews, publications, and events.

UPCOMING EVENTS

Velocity : June 19 – 20 in San Jose, CA

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #73

May 19 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on May 19, 2017 by Rob H

SRE Items of the Week

Kargo Ansible Playbooks foster Collaborative Kubernetes Ops
http://blog.kubernetes.io/2017/05/kargo-ansible-collaborative-kubernetes-ops.html

Making Kubernetes operationally strong is a widely held priority and I track many deployment efforts around the project. The incubated Kargo project is of particular interest for me because it uses the popular Ansible toolset to build robust, upgradable clusters on both cloud and physical targets. I believe using tools familiar to operators grows our community.

We’re excited to see the breadth of platforms enabled by Kargo and how well it handles a wide range of options like integrating Ceph for StatefulSet persistence and Helm for easier application uploads. Those additions have allowed us to fully integrate the OpenStack Helm charts (demo video). READ MORE
___________

Cybercrime for Profit? Five reasons why we need to start driving much more dynamic IT Operations
https://rackn.com/2017/05/16/cybercrime-for-profit-five-reasons-why-we-need-to-starting-driving-much-more-dynamic-it-operations/
pexels-photo-169617

There’s a frustrating cyberattack driven security awareness cycle in IT Operations. Exploits and vulnerabilities are neither new nor unexpected; however, there is a new element taking shape that should raise additional alarm.

Cyberattacks are increasingly profit generating and automated. READ MORE
_____________

Building the SRE Culture at LinkedIn
https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin

Being a Site Reliability Engineer (SRE) means having to talk about hard problems. Site outages, complex failure scenarios, and other technical emergencies are the things we have to be prepared to deal with every day. When we’re not dealing with problems, we’re discussing them. We regularly perform post-mortems and root cause analyses, and we generally dig into complex technical problems in an unflinching way. READ MORE
_____________
Virtual Panel: OpenStack Summit Boston 2017 Debriefing

_____________

SRE vs. DevOps — a False Distinction?
https://devops.com/sre-vs-devops-false-distinction/

Just a few days before he died at the beginning of the 1990s, a wise man taught us that “the show must go on.” Freddie Mercury’s parting words have long provided the guiding light for many, if not all, ops teams. In their eyes, the production environment should be exposed to minimum risk, even at the expense of new features and problem resolution.

About 10 years ago, Google decided to change its approach to production management. It took the company only a few years to realize that while R&D focused on creating new features and pushing them to production, the Operations group was trying to keep production as stable as possible—the two teams were pulling in opposite directions. This tension arose due to the groups’ different backgrounds, skill sets, incentives and metrics by which they were measured. READ MORE
_____________

UPCOMING EVENTS

Gluecon : May 24 – 25, 2017 in Denver, CO

Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #72

OpenStack Boston Day 1 Notes

Posted on May 9, 2017 by Rob H

Contrary to pundit expectations, OpenStack did not roll over and die during the keynotes yesterday.

In my 2011 Boston Summit shirt.

In fact, I saw the signs of a maturing project seeing real use and adoption. More critically, OpenStack leadership started the event with an acknowledgement of being part of, not owning, the vibrant open infrastructure community.

Continued Growth in Core Areas

Practical reasons for running dedicated infrastructure (compliance, control and cost) make OpenStack relevant for companies and governments with significant budgets. There is also a healthy shared infrastructure (aka public cloud) market living in the shadow of the big 3 players. It’s still unclear how this ecosystem will make money for the vendors.

What do customers buy? Should the Core be free?

My personal experience is that most customers are reluctant to (but grudgingly do) buy distros for the core open technology. They are much more willing to pay for adjacencies like security, storage and networking.

Emerging Challenges from Adjacent Technologies

Containers and Kubernetes are making a significant impact on the OpenStack community. At points, the OpenStack keynote was more about Kubernetes than OpenStack. It’s also clear that customers want to use containers as an abstraction layer to make infrastructure less visible or locked-in. That opens the market for using servers directly (bare metal) or other clouds. That portability is likely to help OpenStack more than hurt it because customers can exit workloads from the Big 3 players.

Friction for adoption remains a critical hurdle.

Containers, which are cloud first platforms, have much less friction than IaaS platforms. IaaS platforms, even managed ones, require physical infrastructure with the matching complexity and investment.

OpenStack: an open infrastructure software community

Overall, the summit remains an amazing community space for open infrastructure software and cloud alternatives to the Big 3 players. The Foundation’s pivot to embrace Kubernetes and foster several other open technologies helps maintain the central enthusiasm for open source infrastructure that gave birth to the platform in the first place.

A healthy pragmatic vibe

The summit may not have the same heady taking-on-the-world feeling as the early days; instead, it has a healthy pragmatic vibe. Considering how frothy this space remains, that may be a welcome relief.

What are your impressions? I’m looking forward to hearing from you!

May 5 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on May 5, 2017 by Rob H

SRE Items of the Week

RackN Announcement
[PRESS RELEASE] RackN Ends DevOps Gridlock in Data Center

Today we announced the availability of Digital Rebar Provision, the industry’s first cloud-native physical provisioning utility. We’ve had this in the Digital Rebar community for a few weeks before offering support and response has been great! READ MORE
_______

Cloud Native PHYSICAL PROVISIONING? Come on! Really?! By Rob Hirschfeld

Today, RackN announce very low entry level support for Digital Rebar Provisioning – the RESTful Cobbler PXE/DHCP replacement. Having a company actually standing behind this core data center function with support is a big deal; however…

We’re making two BIG claims with Provision: breaking DevOps bottlenecks and cloud native physical provisioning. We think both points are critical to SRE and Ops success because our current approaches are not keeping pace with developer productivity and hardware complexity. READ MORE

RackN @ DevOpsDays Austin

Slides from Rob Hirschfeld’s talk – The Server Cage Match

SRE vs DevOps vs Cloud Native: The Server Cage Match by Rob Hirschfeld

I don’t believe in DevOps shaming. Our community seems compelled to correct use of DevOps as an adjective for tools, teams and teapots. The frustration is reasonable: DevOps clearly taps into head space for both devs and operators who see a brighter automated future together. For example, check out this excellent DevOps discourse by Cindy Sridharan.

As an industry, we crave artificial conflict so it’s natural to try and distill site reliability engineering (SRE), DevOps and cloud native into warring factions when they are not. They all share a focus on Lean process. READ MORE

SRE News

What is DevOps? By Cindy Sridharan @copyconstruct
https://medium.com/@cindysridharan/what-is-devops-5b0181fdb953

It happened again this week.

At this Wednesday’s Prometheus meetup I was hosting, I asked one of the attendees what he did for work.

He looked at me briefly before he barked one word in reply — DevOps — and then promptly made a beeline for the pizza at the back of the room. READ MORE
________

An Influx of Kubernetes Installers Raises Questions Around Conformance
https://thenewstack.io/kubernetes-installer-explosion-natural-enthusiasm/

For the Kubecon Europe last month, industry observer Joseph Jacks pulled together a list of over SIXTY (yes, 60) Kubernetes installers and services. This wealth of variation that made itself known as the conference, happily, kicked off a conformance effort to ensure that users get a consistent experience. I’m a strong believer that clear conformance builds ecosystems and have deep experience working on that from my OpenStack DefCore efforts.

In short, conformance is not a vendor issue: it’s a user experience and ecosystem issue. READ MORE

UPCOMING EVENTS

OpenStack Summit : May 8 – 11, 2017 in Boston, MA

OpenStack and Kubernetes. Combining the best of both worlds – Kubernetes Day

Interop ITX : May 15 – 19, 2017 in Las Vegas, NV during Open Source IT Summit – Tuesday, May 16, 9:00 – 5:00pm

3:15 – 4:05pm OpenStack and Kubernetes
4:05 – 5:00pm Kubernetes for All

Gluecon : May 24 – 25, 2017 in Denver, CO

Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #70

April 21 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on April 21, 2017 by Rob H

SRE Items of the Week

DigitalRebar Provision deploy Docker’s LinuxKit Kubernetes

_____________

Install Digital Rebar PXE Provision on a Mac OSX System and Test Boot using Virtual Box

_____________

Packet Pushers 333 Automation & Orchestration in Networking
http://packetpushers.net/podcast/podcasts/show-333-orchestration-vs-automation/

While the discussion is all about NETWORK DevOps, they do a good job of decrying WHY current state of system orchestration is so sad – in a word: heterogeneity. It’s not going away because the alternative is lock-in. They also do a good job of describing the difference between automation and orchestration; however, I think there’s a middle tier of resource “scheduling” that better describes OpenStack and Kubernetes.

Around 5:00 minutes into the podcast, they effectively describe the composable design of Digital Rebar and the rationale for the way that we’ve abstracted interfaces for automation. If you guys really do want to cash in by consulting with it (at 10 minutes), just contact Rob H.
_____________

Digital Magazine Launch: Increment On-Call
https://increment.com/on-call/

Increment is dedicated to covering how teams build and operate software systems at scale, one issue at a time. In this, our inaugural issue, we focus on industry best practices around on-call and incident response.
_____________

Need PXW? Try out this Cobbler Replacement
https://robhirschfeld.com/2017/04/11/provision-preview/

INTRO
We wanted to make open basic provisioning API-driven, secure, scalable and fast. So we carved out the Provision & DHCP services as a stand alone unit from the larger open Digital Rebar project. While this Golang service lacks orchestration, this complete service is part of Digital Rebar infrastructure and supports the discovery boot process, templating, security and extensive image library (Linux, ESX, Windows, … ) from the main project.

TL;DR: FIVE MINUTES TO REPLACE COBBLER? YES.

The project APIs and CLIs are complete for all provisioning functions with good Swagger definitions and docs. After all, it’s third generation capability from the Digital Rebar project. The integrated UX is still evolving.
_____________

UPCOMING EVENTS

DevOpsDays Austin : May 4-5, 2017 in Austin TX

CloudNative vs SRE vs DevOps: The Ultimate Server Cage Match
Not Actually a DevOps Talk with Michael Cote (May 4 at 4:50pm)

OpenStack Summit : May 8 – 11, 2017 in Boston, MA

OpenStack and Kubernetes. Combining the best of both worlds – Kubernetes Day

Interop ITX : May 15 – 19, 2017 in Las Vegas, NV

Open Source IT Summit – Tuesday, May 16, 9:00 – 5:00pm : Rob Hirschfeld to speak

Gluecon : May 24 – 25, 2017 in Denver, CO

Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #68

Open Source Collaboration: The Power of No

Posted on April 11, 2017 by Rob H

TL;DR: The days of using open software passively from vendors are past, users need to have a voice and opinion about project governance. This post is a joint effort with Rob Hirschfeld, RackN, and Chris Ferris, IBM, based on their IBM Interconnect 2017 “Open Cloud Architecture: Think You Can Out-Innovate the Best of the Rest?” presentation.

It’s a common misconception that open source collaboration means saying YES to all ideas; however, the reality of successful projects is the opposite.

Permissive open source licenses drive a delicate balance for projects. On one hand, projects that adopt permissive licenses should be accepting of contributions to build community and user base. On the other, maintainers need to adopt a narrow focus to ensure project utility and simplicity. If the project’s maintainers are too permissive, the project bloats and wanders without a clear purpose. If they are too restrictive then the project fails to build community.

It is human nature to say yes to all collaborators, but that can frustrate core developers and users.

For that reason, stronger open source projects have a clear, focused, shared vision. Historically, that vision was enforced by a benevolent dictator for life (BDFL); however, recent large projects have used a consensus of project elders to make the task more sustainable. These roles serve a critical need: they say “no” to work that does not align with the project’s mission and vision. The challenge of defining that vision can be a big one, but without a clear vision, it’s impossible for the community to sustain growth because new contributors can dilute the utility of projects. [author’s note: This is especially true of celebrity projects like OpenStack or Kubernetes that attract “shared glory” contributors]

There is tremendous social and commercial pressure driving this vision vs. implementation balance.

The most critical one is the threat of “forking.” Forking is what happens when the code/collaborator base of a project splits into multiple factions and stops working together on a single deliverable. The result is incompatible products with a shared history. While small forks are required to support releases, and foster development; diverging community forks can have unpredictable impacts for a project.

Forks are not always bad: they provide a control mechanism for communities.

The fundamental nature of open source projects that adopt a permissive license is what allows forks to become the primary governance tool. The nature of permissive licenses allows anyone to create a new line of development that’s different than the original line. Forks can allow special interests in a code base to focus on their needs. That could be new features or simply stabilization. Many times, a major release version of a project evolves into forks where both old and newer versions have independent communities because of deployment inertia. It can also allow new leadership or governance without having to directly displace an entrenched “owner”.

But forking is expensive because it makes it harder for communities to collaborate.

To us, the antidote for forking is not simply vision but a strong focus on interoperability. Interoperability (or interop) means ensuring that different implementations remain compatible for users. A simplified example would be having automation that works on one OpenStack cloud also work on all the others without modification. Strong interop creates an ecosystem for a project by making users confident that their downstream efforts will not be disrupted by implementation variance or version changes.

Good Interop relieves the pressure of forking.

Interop can only work when a project defines what is expected behavior and creates tests that enforce those standards. That activity forces project contributors to agree on project priorities and scope. Projects that refuse to define interop expectations end up disrupting their user and collaborator base in frustrating ways that lead to forking (Rob’s commentary on the potential Docker fork of 2016).

Unfortunately, Interop is not a generally a developer priority.

In the end, interoperability is a user feature that competes with other features. Sadly, it is often seen as hurting feature development because new features must work to maintain existing interop standards. For that reason, new contributors may see interop demands as a impediment to forward progress; however, it’s a strong driver for user adoption and growth.

The challenge is that those users are typically more focused on their own implementation and less visible to the project leadership. Vendors have similar disincentives to do work that benefits other vendors in the community. These tensions will undermine the health of communities that do not have strong BDFL or Elders leadership. So, who then provides the adult supervision?

Ultimately, users must demand interop and provide commercial preference for vendors that invest in interop.

Open source has definitely had an enormous impact on the software industry; generally, a change for the better. But, that change comes at a cost – the need for involvement, not just of vendors and individual developers, but, ultimately it demands the participation of consumers/users.

Interop isn’t naturally a vendor priority because it levels the playing field for all vendors; however, vendors do prioritize what their customers want.

Ideally, customer needs translate into new features that have a broad base of consumer interest. Interop ensure that features can be used broadly. Thus interop is an important attribute to consumers not only for vendors, but by the open source communities building the software. This alignment then serves as the foundation upon which (increasingly) that vendor software is based.

Customers should be actively and publicly supportive of interop efforts of projects on which their vendor’s offerings depend. If there isn’t such an initiative in those projects, then they should demand one be started through their vendor partners and in the public forums for the project.

Further, if consumers of an open source project sense that it lacks a strong, focused, vision and is wandering off course, they need to get involved and say so, either directly and/or through their vendor partners.

While open source has changing the IT industry, it also has a cost. The days of using software passively from vendors are past, users need to have a voice and opinion. The need to ensure that their chosen vendors are also supporting the health of the community.

What do you think? Reach out to Rob (@zehicle) and Chris (@christo4ferris) and let us know!

Note: Cross posted on IBM OpenTech site.

Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on April 7, 2017 by Rob H

pexels-photo-273011

Welcome to the first post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com.

SRE Items of the Week

Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites by Ian Miell

INTRO TO POST

For several years I managed the 3rd line site reliability operation for many of the world’s busiest gambling sites, working for a little-known company that built and ran the core backend online software for several businesses that each at peak could take tens of millions of pounds in revenue per hour. I left a couple of years ago, so it’s a good time to reflect on what I learned in the process.

In many ways, what we did was similar to what’s now called an SRE function (I’m going to call us SREs, but the acronym didn’t exist at the time). We were on call, had to respond to incidents, made recommendations for re-engineering, provided robust feedback to developers and customer teams, managed escalations and emergency situations, ran monitoring systems, and so on.

The team I joined was around 5 engineers (all former developers and technical leaders), which grew to around 50 of more mixed experience across multiple locations by the time I left.

I’m going to focus here on process and documentation, since I don’t think they’re talked about usefully enough where I do read about them.
_____

2017 trend lines: When DevOps and hybrid collide by Rob Hirschfeld (@zehicle)
IBM Cloud Computing News

INTRO TO POST

What happens when DevOps methods meet hybrid environments? Following are some emerging trends and my commentary on each.

There are two major casualties as the pace of innovation in IT continues to accelerate: manual processes (non-DevOps) and tightly-coupled software stacks (non-hybrid).

We are changing some things much too quickly for developers and operators to keep up using processes that require human intervention in routine activities like integrated testing or deployment. Furthermore, monolithic platforms—our traditional “duck-and-cover” protection from pace of change—are less attractive for numerous reasons, including slower pace, vendor lock-in and lack of choice.

RECENT SRE AND DEVOPS EVENTS

SRECon17 Americas

Videos of all Sessions from Woodland Hunter

CloudNativeCon + KubeCon 2017 March 29-30, 2017 in Berlin

YouTube Videos of all sessions

IBM Interconnect March in Las Vegas, NV

Christopher Ferris, IBM CTO Open Technology and Rob Hirschfeld “Open Cloud Architecture: Think You Can Out-Innovate the Best of the Rest” – SLIDES

DevOps Summit

“Best Practices in Operating Hybrid Infrastructure that Spans Clouds and the Data Center” – BLOG / SLIDES

UPCOMING MEETUPS & PODCASTS

Continuous Discussions (#c9d9) Episode 66: Scaling Agile and DevOps in the Enterprise – April 11, 2017 at 10am PT. Rob Hirschfeld a guest in this Electric Cloud podcast.

UPCOMING EVENTS FOR RACKN

DockerCon 2017 : April 17 – 20, 2017 in Austin, TX
DevOpsDays Austin : May 4-5, 2017 in Austin TX
OpenStack Summit : May 8 – 11, 2017 in Boston, MA
Interop ITX : May 15 – 19, 2017 in Las Vegas, NV
Open Source IT Summit – Tuesday, May 16, 9:00 – 5:00pm : Rob Hirschfeld to speak
Gluecon : May 24 – 25, 2017 in Denver, CO

Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #66

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Category Archives: Events

The Case for Ops Engineering Pay Equity w/ Charity Majors

June 16 – Weekly Recap of All Things Site Reliability Engineering (SRE)

_____________

June 2 – Weekly Recap of All Things Site Reliability Engineering (SRE)

May 26 – Weekly Recap of All Things Site Reliability Engineering (SRE)

May 19 – Weekly Recap of All Things Site Reliability Engineering (SRE)

OpenStack Boston Day 1 Notes

May 5 – Weekly Recap of All Things Site Reliability Engineering (SRE)

April 21 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Open Source Collaboration: The Power of No

Weekly Recap of All Things Site Reliability Engineering (SRE)