SRE role with DevOps for Enterprise [@HPE podcast]

My focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

Yes, DevOps and SRE are complementary

In this short 16-minute podcast, HPE’s Stephen Spector and I discuss where DevOps and SRE thinking overlap and where they differ.  We also discuss how Enterprises should evaluate Site Reliability Engineering as a function and where it fits in their organization.

“Why SRE?” Discussion with Eric @Discoposse Wright

I was a guest of Eric “@discoposse” Wright on the Green Circle Community Podcast #42 (my previous appearance).

LISTEN NOW: Podcast #42

In this action-packed 30-minute conversation, we discuss the industry forces putting pressure on operations teams.  These pressures require operators to invest much more heavily in reusable automation.

That leads us to why Kubernetes is interesting and what went wrong with OpenStack (I actually use the phrase “dumpster fire”).  We ultimately talk about how those lessons are embedded in the Digital Rebar architecture.

Apparently IT death smells like kickstart files. Six Reasons why.

Today, I’m sharing a parable about always being focused on adding value.

Recently, I was on a call with an IT Ops manager who insisted that his team had their on-premises operations under control with “python scripts and manual kickstart files” because they “really don’t change their infrastructure setup.” He explained that he and his team were comfortable with this because it was something they understood and it did not require learning new systems. While I understand his position, I was sort of sad for him and his employer because…

No value is created for his company by maintaining custom kickstart, preseeds or boot files.

Maintaining kickstarts is fatal for many reasons. Is there a way to make it less fatal? Yes, and it involves investing in learning tools that let you move up stack.

Contrary to popular IT mythology, managing physical infrastructure is still a reality for many IT teams and will remain a part of best practices until every workload simply runs on Amazon and it becomes their problem.  Since that “Utopian” future is unlikely, let’s deal with some practical realities of hybrid IT.

Here are my six reasons why custom kickstarts (and other site-specific boot provisioning scripts) are dangerous:

1. Creating Site Unique Processes

Every infrastructure is unique, and that’s a practical reality we have to accept because otherwise we would never be able to make improvements and corrections without touching everything that is already deployed. However, we really want to work hard to minimize the places where we inject variation into the environment. Server- and site-specific kickstarts with lots of post-provisioning steps force operators to maintain additional information about each server.

2. Building Server Specific Configurations

When we create server specific templates, it becomes nearly impossible to recreate server builds. That directly leads to fragile infrastructure because teams cannot quickly redeploy or automate refreshes. Static IT infrastructure is a known fail pattern and makes enterprises vulnerable to staff changes, hacking and inability to manage and patch.

3. Having Opaque Configurations

Kickstart is hard to understand (and even harder to troubleshoot). When teams take actions during the provisioning process they are often not tracked or managed like other operational scripting tools. Failures or injections can easily go undetected. Even if they are tracked, the number of operators who can read and manage these scripts is limited. That means that critical aspects of your operational environment happen outside of your awareness.

4. Being Less Secure

Kickstart processes generally include injecting SSH keys, certificates and other authentication credentials. These embedded credentials are often hard-coded into the process with minimal awareness on the part of the operations team, leaving you vulnerable at the most foundational level. This is not an acceptable security process; however, teams who hack kickstarts often don’t want to consider the implications.

Security side note: most teams don’t have the expertise to integrate TPM or HSM into their kickstart processes; consequently, these key security technologies are generally unused and ignored. If you want to talk about this, please contact me!
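As a rough illustration of how embedded credentials could at least be made visible, here is a minimal Python sketch that scans kickstart text for a few common patterns. The pattern list, the labels and the function name find_embedded_credentials are my own assumptions for illustration; this is a heuristic, not a complete secret scanner and not part of any tool mentioned in this post.

```python
import re

# Hypothetical, deliberately small pattern list (an assumption for this sketch).
PATTERNS = {
    "private key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # rootpw without --iscrypted or --lock carries the password in the clear.
    "plaintext root password": re.compile(r"^\s*rootpw\s+(?!--iscrypted|--lock)\S"),
    "embedded SSH public key": re.compile(r"ssh-(?:rsa|ed25519) AAAA[0-9A-Za-z+/=]+"),
}


def find_embedded_credentials(kickstart_text):
    """Return (line_number, label) pairs for lines that look like credentials."""
    findings = []
    for number, line in enumerate(kickstart_text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


sample = """\
rootpw --plaintext changeme
%post
echo "ssh-rsa AAAAB3NzaC1yc2E user@host" >> /root/.ssh/authorized_keys
%end
"""

for number, label in find_embedded_credentials(sample):
    print(f"line {number}: {label}")
```

Even a crude check like this, run in review or CI, turns hidden credentials into something the operations team knows about.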

5. Diverging Provisioning Patterns

Cloud does not use kickstarts. Provisioning variation increases when teams keep/add logic and configuration into server provisioning instead of doing it as post-provision automation. If your physical provisioning team is not rehearsing on cloud then you’re in a serious IT hole because all workloads should be managed as hybrid-ready. Deployment fidelity helps accelerate teams and reduces cost.

6. Reusing Community Practice

Finally, managing your own kickstarts makes it impossible to leverage community patterns and practices. Kickstarts are not exactly a hive of innovation so you are not creating any competitive advantage by adding variation there. In cases like that, reusing community tooling is a net benefit to your organization. Why have we not done this already? Until recently, provisioning tools were not API driven or focused on reusable shared practice.

While Kickstart or similar is pretty much required for physical, we have a solution for these issues.

One of the key design elements of Digital Rebar is a templated, API-driven boot provisioner. Our approach uses kickstarts, preseeds and other tools; however, we’ve worked hard to minimize their span and decompose them into reusable components. That allows users to inject site-specific code as snippets that are centrally managed and hardware neutral.
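To make the idea of decomposing a kickstart into reusable, parameterized snippets concrete, here is a minimal Python sketch using the standard library's string.Template. It is not Digital Rebar's API or template format; the snippet names, parameters and the render_kickstart and params_for helpers are hypothetical and exist only to illustrate the pattern.

```python
from string import Template

# Hypothetical shared snippets: each one is small, reusable and site neutral.
SNIPPETS = {
    "network": Template("network --bootproto=dhcp --hostname=$hostname"),
    "timezone": Template("timezone $timezone"),
    "packages": Template("%packages\n$packages\n%end"),
}

# Site-wide defaults live in one place instead of in every server's file.
SITE_DEFAULTS = {"timezone": "UTC", "packages": "@core\nopenssh-server"}


def params_for(hostname, **overrides):
    """Merge site defaults with per-server values (assumed helper)."""
    merged = dict(SITE_DEFAULTS, hostname=hostname)
    merged.update(overrides)
    return merged


def render_kickstart(snippet_names, params):
    """Compose a kickstart from named snippets; a missing parameter raises KeyError."""
    parts = [SNIPPETS[name].substitute(params) for name in snippet_names]
    return "\n".join(parts) + "\n"


kickstart = render_kickstart(["network", "timezone", "packages"], params_for("node01"))
print(kickstart)
```

The point of the design is that the per-server input shrinks to a handful of parameters, while the snippets themselves are shared and reviewed once.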

Critically, our approach allows SRE and Ops teams to get out of the kickstart business and focus on provisioning workflow and automation. Yes, there’s some learning curve but there are a lot of benefits to moving up stack.

It’s not too late to “:q!” those kickstart edits and accelerate your infrastructure.

Spiraling Ops Debt & the SRE coding imperative

This post is part of an SRE series grounded in the ideas inspired by the Google SRE book.

2/13 Update: You can hear an INTERACTIVE DISCUSSION based on this post with Eric Wright on his podcast, GC Online.

Every Ops team I know is underwater and doesn’t have the time to catch their breath.

Why does the load increase and leave Ops behind?  It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors.  Many teams simply step around their struggling compatriots and spin up yet more Ops work adding to the backlog. Dashing off yet another Ansible playbook to install on AWS is empowering but ultimately adds to the Ops sustaining backlog.

Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms, and those platforms are built on rapidly evolving components.  For example, Kubernetes has a 3-month release cycle.  That’s really fast; however, it is built on other components like Docker, SDN and operating systems that also have fast release cycles.  That means that even your single Kubernetes infrastructure has many moving parts that may not be consistent within your own organization.  For example, cloud deploys may use CoreOS while internal ones use a corporate-approved CentOS.

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen a container-fueled explosion in developer productivity and an AI-driven rise in new hardware-flavored instances. Both are powerful drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why the Google SRE teams cap operational work at 50% of their time and reserve the rest for automation.

If operational work takes more than 50% of the team’s time, the team slowly sinks under the growing operational load.  If you are not actively decreasing the load via automation then your teams get underwater and basic ops hygiene fails.
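The sinking effect can be shown with a toy model in Python. Everything in it is an assumption made for illustration: ops demand grows by a fixed percentage each week, and every hour spent on automation permanently removes a fixed fraction of an hour of recurring ops work. The rates are arbitrary, and I set them equal so that the break-even point of this toy lands at 50%; real teams will have very different numbers.

```python
def simulate_ops_ratio(capacity, ops_demand, growth, payoff, weeks):
    """Return the weekly share of team time spent on ops work.

    Toy model (assumed, not measured): demand grows by `growth` per week and
    each automation hour removes `payoff` hours of recurring ops demand.
    """
    history = []
    for _ in range(weeks):
        ops_hours = min(ops_demand, capacity)      # ops work is done first
        automation_hours = capacity - ops_hours    # whatever time is left
        history.append(ops_hours / capacity)
        ops_demand = max(ops_demand * (1 + growth) - automation_hours * payoff, 0.0)
    return history


# A 40-hour week, starting at 40% ops versus 75% ops.
healthy = simulate_ops_ratio(40, 16, growth=0.05, payoff=0.05, weeks=10)
sinking = simulate_ops_ratio(40, 30, growth=0.05, payoff=0.05, weeks=10)

print("start 40% ops:", [round(r, 2) for r in healthy])
print("start 75% ops:", [round(r, 2) for r in sinking])
```

In this sketch the team that starts below the break-even point slowly frees up more time, while the team that starts above it has less automation time every week until ops work consumes the whole week.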

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help.  Stop writing automation that you can buy or re-use.  Get help running it.  Don’t waste time solving problems that other people have solved.  That may mean some upfront learning and investment but if you aren’t getting out of your own way then you’ll be run over.

 

Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)

What is a Google SRE?  Charity Majors gave a great overview on Datanauts #65, Susan Fowler from Uber talks about “no ops” tensions and Patrick Hill from Atlassian wrote up a good review too.  This is not new: Ben Treynor defined it back in 2014.

DevOps is under attack.

Well, not DevOps exactly but the common misconception that DevOps is about Developers doing Ops (it’s really about lean process, system thinking, and positive culture).  It turns out that Ops is hard and, as I recently discussed with John Furrier, developers really, really don’t want to be that focused on infrastructure.

In fact, I see containers and serverless as a “developers won’t waste time on ops revolt.”  (I discuss this more in my 2016 retrospective).

The tension between Ops and Dev goes way back and has been a source of confusion for me and my RackN co-founders.  We believe we are developers, except that we spend our whole time focused on writing code for operations.  With the rise of Site Reliability Engineers (SRE) as a job classification, our type of black swan engineer is being embraced as a critical skill.  It’s recognized as the only way to stay ahead of our ravenous appetite for  computing infrastructure.

I’ve been writing about Site Reliability Engineering (SRE) tasks for nearly 5 years under a lot of different names such as DevOps, Ready State, Open Operations and Underlay Operations. SRE is a term popularized by Google (there’s a book!) for the operators who build and automate their infrastructure. Their role is not administration, it is redefining how infrastructure is used and managed within Google.

Using infrastructure effectively is a competitive advantage for Google and their SREs carry tremendous authority and respect for executing on that mission.

Meanwhile, we’re in the midst of an Enterprise revolt against running infrastructure. Companies, for very good reasons, are shutting down internal IT efforts in favor of using outsourced infrastructure. Operations has simply not been able to compete with the capability, flexibility and breadth of infrastructure services offered by Amazon.

SRE is about operational excellence and how we keep up with the increasingly rapid pace of IT.  It’s a recognition that we cannot scale people as quickly as we add infrastructure.  And, critically, it is not infrastructure specific.

Over the next year, I’ll continue to dig deeply into the skills, tools and processes around operations.  I think that SRE may be the right banner for these thoughts and I’d like to hear your thoughts about that.

MORE?  Here’s the next post in the series about Spiraling Ops Debt.  Or Skip to Podcasts with Eric Wright and Stephen Spector.