Let’s DevOps IRL: my SRE postings on RackN!

I’m investing in these Site Reliability Engineering (SRE) discussions because I believe operations (and by extension DevOps) is facing a significant challenge in keeping up with development tooling. The links below have been getting a lot of interest on Twitter and driving some good discussion.

Addressing this Ops debt is our primary mission at my company, RackN: we believe that integrated system level tooling is required.  We also believe that new tools should not disrupt environments so we work very hard to adapt to requirements of individual sites.

SRE is urgent because it provides a pragmatic path and rationale for investment.

Even if you don’t agree with Google’s term or all their practices, I think fundamental concepts of system thinking, status/pay, automation investment and developer collaboration are essential.  It should come as no surprise that these are all Lean/DevOps concepts; however, SRE has the pragmatic side of being a job function.

Here are some recent relevant discussions I’ve been having about SREs with links to both the audio and my text show notes.

Of course, RackN is also doing a WEEKLY SRE update that captures general interest items.  Check that out and subscribe.

Cloud Native PHYSICAL PROVISIONING? Come on! Really?!

We believe Cloud Native development disciplines are essential regardless of the infrastructure.

Today, RackN announced very low entry-level support for Digital Rebar Provisioning – the RESTful Cobbler PXE/DHCP replacement.  Having a company actually standing behind this core data center function with support is a big deal; however…

We’re making two BIG claims with Provision: breaking DevOps bottlenecks and cloud native physical provisioning.  We think both points are critical to SRE and Ops success because our current approaches are not keeping pace with developer productivity and hardware complexity.

I’m going to post more about how Provision can help address the political struggles of SRE and DevOps that I’ve been watching in our industry.  A hint is in the release, but the Cloud Native comment needs to be addressed.

First, Cloud Native is an architecture, not an infrastructure statement.

There is no requirement that we use VMs or AWS in Cloud Native.  From that perspective, “Cloud” is a useful but deceptive adjective.  Cloud Native is born from applications that had to succeed in hands-off, lower SLA infrastructure with fast delivery cycles on untrusted systems.  These are very hostile environments compared to “legacy” IT.

What makes Digital Rebar Provision Cloud Native?  A lot!

The following is a list of key attributes I consider essential for Cloud Native design.

Micro-services Enabled: The larger Digital Rebar project is a micro-services design.  Provision reflects a stand-alone bundling of two services: DHCP and Provision.  The new Provision service is designed to both stand alone (with embedded UX) and be part of a larger system.

Swagger RESTful API: We designed the APIs first based on years of experience.  We spent a lot of time making sure that the API conformed to spec and that includes maintaining the Swagger spec so integration is easy.

Remote CLI: We build and test our CLI extensively.  In fact, we expect that to be the primary user interface.

Security Designed In: We are serious about security even in challenging environments like PXE where options are limited by 20-year-old protocols.  HTTPS is required, and so is user or bearer token authentication.  That means that even API calls from machines can be secured.

12 Factor & API Config: There is no file configuration for Provision.  The system starts with command line flags or environment variables.  Deeper configuration is done via API/CLI.  That ensures that the system can be fully managed remotely and configured securely because credentials are required for configuration.

Fast Start / Golang:  Provision is a totally self-contained golang app including the UX.  Even so, it’s very small.  You can run it on a laptop from nothing in about 2 minutes including download.

CI/CD Coverage: We committed to deep test coverage for Provision and have consistently increased coverage with every commit.  It ensures quality and prevents regressions.

Documentation In-project Auto-generated: On-boarding is important since we’re talking about small, API-driven units.  A lot of Provision documentation is generated directly from the code into the actual project documentation.  Also, the written documentation is in reStructuredText in the project with good indexes and cross-references.  We regenerate the documentation with every commit.
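
To make the “12 Factor & API Config” and “Security Designed In” points concrete, here is a minimal sketch of the pattern.  It is not Provision’s code (Provision is golang; this is Python for brevity), and every name in it is hypothetical: the --api-port style flags, the EXAMPLE_* environment variables and the helper functions are assumptions for illustration only.  It shows a service that takes its start-up settings from flags with environment variable fallbacks and no config file, and a check that rejects any request lacking the expected bearer token.

```python
import argparse
import hmac


def load_config(argv, env):
    """Resolve start-up settings: flags win, environment variables are the fallback."""
    parser = argparse.ArgumentParser(description="hypothetical provisioning service")
    # Flag names and EXAMPLE_* variables are illustrative, not real Provision options.
    parser.add_argument("--api-port", type=int,
                        default=int(env.get("EXAMPLE_API_PORT", "8443")))
    parser.add_argument("--tls-cert", default=env.get("EXAMPLE_TLS_CERT", "server.crt"))
    parser.add_argument("--tls-key", default=env.get("EXAMPLE_TLS_KEY", "server.key"))
    return parser.parse_args(argv)


def is_authorized(headers, expected_token):
    """Return True only when the Authorization header carries the expected bearer token."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    # compare_digest avoids leaking how much of the token matched via timing.
    return scheme == "Bearer" and hmac.compare_digest(
        token.encode(), expected_token.encode())
```

In a real service you would call load_config(sys.argv[1:], os.environ) at start-up and run is_authorized on every API request before doing any work.  Anything deeper than these few boot settings would then be changed through the authenticated API rather than by editing files on the server.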

We believe these development disciplines are essential regardless of the infrastructure.  That’s why we made sure the v3 Provision (and ultimately every component of Digital Rebar as we iterate to v3) was built to these standards.

What do you think?  Is this Cloud Native?  What did we miss?

Why IBM’s hybrid “no-single-way” is a good plan

I got to spend a few days hearing IBM’s cloud plans at IBM Interconnect including a presentation, dinner and guest blogging.  Read below for links to that content.

As part of their CloudMinds group, we’re encouraged to look at the big picture of the conference and there’s a lot to take in. IBM has serious activity around machine learning, cognitive, serverless, functional languages, blockchain, platform and infrastructure as a service. Frankly, that’s a confusing array of technologies.

Does this laundry list of technologies fit into a unified strategy? No, and that’s THE POINT.

Anyone who thinks they can predict a definitive right mix of technologies to solve customer problems is not paying attention to the pace of innovation. IBM is listening to their customers and hearing that needs are expanding not consolidating. In this type of market, limiting choice hurts customers.

That means that a hybrid strategy with overlapping offerings serves their customers’ interests.

IBM has the luxury and scale of being able to chase multiple technologies to find winners. Of course, there’s also a danger of hanging on to losers for too long. So far, it looks like they are doing a good job of hitting that sweet spot. Their agility here may be the only way that they can reasonably find a chink in Amazon’s cloud armour.

While the hybrid story is harder to tell, it’s the right one for this market.

Four Posts For Deeper Reading

The posts below cover a broad range of topics! Chris Ferris and I did some serious writing about collaboration and my DevOps/Hybrid post has been getting some attention. It’s all recommended reading so I’ve included some highlights.

#CloudMinds tackle the future of cognitive in Las Vegas huddle

Rob is part of the IBM CloudMinds group that meets occasionally to discuss rising cloud, infrastructure and technology challenges.

“Cognitive cannot and will not exist without trust. Humans will not trust cognitive unless we can show that our cognitive solutions understand them.”

How open communities can hurt, and help, interoperability

“The days of using open software passively from vendors are past, users need to have a voice and opinion about project governance. This post is a joint effort with Rob Hirschfeld, RackN, and Chris Ferris, IBM, based on their IBM Interconnect 2017 “Open Cloud Architecture: Think You Can Out-Innovate the Best of the Rest?” presentation.”

When DevOps and hybrid collide (2017 trend lines)

“We’ve clearly learned that DevOps automation pays back returns in agility and performance. Originally, small-batch, lean thinking was counter-intuitive. Now it’s time to make similar investments in hybrid automation so that we can leverage the most innovation available in IT today.”

Open Source Collaboration: The Power of No & Interoperability

“Users and operators can put significant pressure on project leaders and vendors to ensure that the platforms are interoperable.”

Packet Pushers 333: Orchestration v Automation < YES, this is what we are doing!

I highly recommend catching Packet Pushers 333 “Automation & Orchestration In Networking” by Drew Conry-Murray and guests Pete Lumbis and Michael Damkot.

While the discussion is all about NETWORK DevOps, they do a good job of decrying WHY the current state of system orchestration is so sad – in a word: heterogeneity.  It’s not going away because the alternative is lock-in.  They also do a good job of describing the difference between automation and orchestration; however, I think there’s a middle tier of resource “scheduling” that better describes OpenStack and Kubernetes.

Around 5:00 minutes into the podcast, they effectively describe the composable design of Digital Rebar and the rationale for the way that we’ve abstracted interfaces for automation.  If you guys really do want to cash in by consulting with it (at 10 minutes), just give me a call.

It’s great to hear acknowledgement of both the complexity and need for solving these problems.   Thanks for the great podcast Drew, Pete and Michael!

Oh… and I’m going to be presenting at Interop ITX also.  Hopefully, I’ll get a chance to talk 1×1 with Drew.

How scared do we need to be for Ops collaboration & investment?

Note: Yesterday RackN posted Are you impatient enough to be an SRE? and then the CIA WikiLeaks news hit… perhaps the right question is “Are you scared enough to automate deeply yet?”

As an industry, the CIA hacking release yesterday should be driving discussions about how to make our IT infrastructure more robust and fluid. It is simply not enough to harden because both the attacks and the platforms are evolving too quickly.

We must be delivering solutions with continuous delivery and immutability assumptions baked in.

A more fluid IT that assumes constant updates and rebuilding from sources (immutable) is not just a security posture but a proven business benefit. For me, that means actually building from the hardware up where we patch and scrub systems regularly to shorten the half-life of all attack surfaces. It also means enabling the existing security features built into our systems that are generally ignored because of configuration complexity. These are hard but solvable automation challenges.

The problem is too big to fix individually: we need to collaborate in the open.

I’ve been really thinking deeply about how we accelerate SRE and DevOps collaboration across organizations and in open communities. The lack of common infrastructure foundations costs companies significant overhead and speed as teams across the globe reimplement automation in divergent ways. It also drags down software platforms that must adapt to each data center as a unique snowflake.

That’s why hybrid automation within AND between companies is an imperative. It enables collaboration.

Making automation portable enough to handle the differences between infrastructure and environments is harder; however, it also enables sharing and reuse that allows us to improve collectively instead of individually.

That’s been a vision driving us at RackN with the open hybrid Digital Rebar project.  Curious?  Here’s the RackN post that inspired this one:

From RackN’s Are you impatient enough to be an SRE?

“Like the hardware that runs it, the foundation automation layer must be commoditized. That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.”

What does it take to Operate Open Platforms? Answers in Datanauts 72

Did I just let OpenStack ops off the hook….?  Kubernetes production challenges…?  

I had a lot of fun in this wide-ranging Datanauts discussion with unicorn herders Chris Wahl and Ethan Banks.  I like the three-section format because it gives us a chance to deep dive into distinct topics and includes some out-of-band analysis by the hosts; however, that means you need to keep listening through the commercial breaks to hear the full podcast.

Three parts?  Yes, Chris and Ethan like to save the best questions for last.

In Part 1, we went deep into the industry operational and business challenges uncovered by the OpenStack project. Particularly, Chris and I go into “platform underlay” issues which I laid out in my “please stop the turtles” post. This was part of the build-up to my SRE series.

In Part 2, we explore my operations-focused view of the latest developments in container schedulers with a focus on Kubernetes. Part of the operational discussion goes into architecture “conceits” (or compromises) that allow developers to get the most from cloud native design patterns. I also make a pitch for using proven tools to run the underlay.

In Part 3, we go deep into DevOps automation topics of configuration and orchestration. We talk about the design principles that help drive “day 2” automation and why getting in-place upgrades should be an industry priority.  Of course, we do cover some Digital Rebar design too.

Take a listen and let me know what you think!

On Twitter, we’ve already started a discussion about how much developers should care about infrastructure. My opinion (posted here) is that one DevOps idea where developers “own” infrastructure caused a partial rebellion towards containers.

SRE role with DevOps for Enterprise [@HPE podcast]

My SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

Yes, DevOps and SRE are complementary

In this short 16-minute podcast, HPE’s Stephen Spector and I discuss where DevOps and SRE thinking overlap and where they differ.  We also discuss how Enterprises should be evaluating Site Reliability Engineering as a function and where it fits in their organization.