How scared do we need to be for Ops collaboration & investment?

Note: Yesterday RackN posted Are you impatient enough to be an SRE? and then the CIA WikiLeaks news hit… perhaps the right question is “Are you scared enough to automate deeply yet?”

As an industry, the CIA hacking release yesterday should be driving discussions about how to make our IT infrastructure more robust and fluid. It is not enough to simply harden, because both the attacks and the platforms are evolving too quickly.

We must be delivering solutions with continuous delivery and immutability assumptions baked in.

A more fluid IT that assumes constant updates and rebuilding from source (immutable) is not just a security posture but a proven business benefit. For me, that means actually building from the hardware up, where we patch and scrub systems regularly to shorten the half-life of all attack surfaces. It also means enabling the security features already built into our systems that are generally ignored because of configuration complexity. These are hard but solvable automation challenges.

The problem is too big to fix individually: we need to collaborate in the open.

I’ve been thinking deeply about how we accelerate SRE and DevOps collaboration across organizations and in open communities. The lack of common infrastructure foundations costs companies significant overhead and speed as teams across the globe reimplement automation in divergent ways. It also drags down software platforms that must adapt to each data center as if it were a unique snowflake.

That’s why hybrid automation within AND between companies is an imperative. It enables collaboration.

Making automation portable, able to handle the differences between infrastructures and environments, is harder; however, it also enables the sharing and reuse that allow us to improve collectively instead of individually.

That’s been a vision driving us at RackN with the open hybrid Digital Rebar project. Curious? Here’s the RackN post that inspired this one:

From RackN’s Are you impatient enough to be an SRE?

“Like the hardware that runs it, the foundation automation layer must be commoditized. That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.”

Apparently IT death smells like kickstart files. Six Reasons why.

Today, I’m sharing a parable about always being focused on adding value.

Recently, I was on a call with an IT Ops manager who insisted that his team had their on-premises operations under control with “python scripts and manual kickstart files” because they “really don’t change their infrastructure setup.” He explained that he and his team were comfortable with this because it was something they understood and did not require learning new systems. While I understand his position, I was sort of sad for him and his employer because…

No value is created for his company by maintaining custom kickstart, preseed or boot files.

Maintaining kickstarts is fatal for many reasons. Is there a way to make it less fatal? Yes, and it involves investing in learning tools that let you move up stack.

Contrary to popular IT mythology, managing physical infrastructure is still a reality for many IT teams and will remain a part of best practices until every workload simply runs on Amazon and it becomes their problem.  Since that “Utopian” future is unlikely, let’s deal with some practical realities of hybrid IT.

Here are my six reasons why custom kickstarts (and other site-specific boot provisioning scripts) are dangerous:

1. Creating Site Unique Processes

Every infrastructure is unique. That’s a practical reality we have to accept because otherwise we would never be able to make improvements and corrections without touching everything that is already deployed. However, we want to work hard to minimize the places where we inject variation into the environment. Server- and site-specific kickstarts with lots of post-provisioning steps force operators to maintain additional information about each server.

2. Building Server Specific Configurations

When we create server-specific templates, it becomes nearly impossible to recreate server builds. That directly leads to fragile infrastructure because teams cannot quickly redeploy or automate refreshes. Static IT infrastructure is a known fail pattern: it leaves enterprises vulnerable to staff changes and hacking, and unable to manage and patch effectively.

3. Having Opaque Configurations

Kickstart is hard to understand (and even harder to troubleshoot). Actions taken during the provisioning process are often not tracked or managed like other operational scripting, so failures or injections can easily go undetected. Even when they are tracked, the number of operators who can read and manage these scripts is limited. That means that critical aspects of your operational environment happen outside of your awareness.

4. Being Less Secure

Kickstart processes generally include injecting SSH keys, certificates and other authentication credentials. These embedded credentials are often hard coded into the process with minimal awareness by the operational team, leaving you vulnerable at the most foundational level. This is not an acceptable security practice; however, teams who hack kickstarts often don’t want to consider the implications.
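To make that concrete, here is a minimal sketch (the variable and template names are mine, not from any specific tool) of resolving a credential at render time instead of hard coding it into the kickstart:

```python
import os
import string

# Kickstart %post fragment with a placeholder instead of a baked-in key.
KICKSTART_POST = string.Template("""\
%post
mkdir -p /root/.ssh
echo "$ssh_public_key" >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
%end
""")

def render_post_section() -> str:
    """Resolve the credential at render time from a managed source.

    Reading an environment variable here stands in for querying a real
    secret store, so keys can rotate without anyone editing templates.
    """
    key = os.environ["PROVISION_SSH_PUBLIC_KEY"]  # fails loudly if unset
    return KICKSTART_POST.substitute(ssh_public_key=key)

if __name__ == "__main__":
    print(render_post_section())
```

The point is not the mechanism; it is that the operational team now has one managed place where credentials enter the process.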

Security side note: most teams don’t have the expertise to integrate TPM or HSM into their kickstart processes; consequently, these key security technologies are generally unused and ignored. If you want to talk about this, please contact me!

5. Diverging Provisioning Patterns

Cloud does not use kickstarts. Provisioning variation increases when teams bake logic and configuration into server provisioning instead of handling it as post-provision automation. If your physical provisioning team is not rehearsing on cloud, then you’re in a serious IT hole because all workloads should be managed as hybrid-ready. Deployment fidelity accelerates teams and reduces cost.
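The fix is to keep provisioning minimal and push logic into a post-provision step that is identical everywhere. A hypothetical sketch of that entry point:

```python
import subprocess

def post_provision(hostname: str) -> None:
    """Substrate-neutral configuration: identical on cloud and metal.

    Provisioning (kickstart or cloud image) only gets the OS booted;
    everything else happens here, so rehearsing on cloud exercises
    the exact code path later used on physical hardware.
    """
    subprocess.run(["hostnamectl", "set-hostname", hostname], check=True)
    # ...install packages, enroll in config management, start services...
```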

6. Blocking Community Reuse

Finally, managing your own kickstarts makes it impossible to leverage community patterns and practices. Kickstarts are not exactly a hive of innovation so you are not creating any competitive advantage by adding variation there. In cases like that, reusing community tooling is a net benefit to your organization. Why have we not done this already? Until recently, provisioning tools were not API driven or focused on reusable shared practice.

While kickstart or something similar is pretty much required for physical provisioning, we have a solution for these issues.

One of the key design elements of Digital Rebar is a templated, API-driven boot provisioner. Our approach still uses kickstarts, preseeds and other tools; however, we’ve worked hard to minimize their span and decompose them into reusable components. That allows users to inject site-specific code as snippets that are centrally managed and hardware neutral.
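As a toy illustration of the decomposition idea (my sketch, not Digital Rebar’s actual template engine), kickstart content becomes shared snippets plus a small site-data dictionary:

```python
import string

# Reusable, hardware-neutral snippets kept under central management.
SNIPPETS = {
    "base": "install\nlang en_US.UTF-8\nkeyboard us\n",
    "disk": "zerombr\nclearpart --all --initlabel\nautopart\n",
    "net": string.Template(
        "network --bootproto=$bootproto --hostname=$hostname\n"
    ),
}

def render_kickstart(site: dict) -> str:
    """Compose a kickstart from shared snippets; only `site` data varies."""
    return (
        SNIPPETS["base"]
        + SNIPPETS["disk"]
        + SNIPPETS["net"].substitute(
            bootproto=site.get("bootproto", "dhcp"),
            hostname=site["hostname"],
        )
    )

if __name__ == "__main__":
    # The site-specific input shrinks to a small data dictionary.
    print(render_kickstart({"hostname": "node01.example.com"}))
```

Because the snippets are shared, an improvement to the disk or network fragment benefits every site at once instead of being re-invented per kickstart.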

Critically, our approach allows SRE and Ops teams to get out of the kickstart business and focus on provisioning workflow and automation. Yes, there’s some learning curve but there are a lot of benefits to moving up stack.

It’s not too late to “:q!” those kickstart edits and accelerate your infrastructure.

Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)

What is a Google SRE?  Charity Majors gave a great overview on Datanauts #65, Susan Fowler from Uber talks about “no ops” tensions and Patrick Hill from Atlassian wrote up a good review too.  This is not new: Ben Treynor defined it back in 2014.

DevOps is under attack.

Well, not DevOps exactly, but the common misconception that DevOps is about Developers doing Ops (it’s really about lean process, system thinking, and positive culture). It turns out that Ops is hard and, as I recently discussed with John Furrier, developers really, really don’t want to be that focused on infrastructure.

In fact, I see containers and serverless as a “developers won’t waste time on ops” revolt. (I discuss this more in my 2016 retrospective.)

The tension between Ops and Dev goes way back and has been a source of confusion for me and my RackN co-founders.  We believe we are developers, except that we spend our whole time focused on writing code for operations.  With the rise of Site Reliability Engineers (SRE) as a job classification, our type of black swan engineer is being embraced as a critical skill.  It’s recognized as the only way to stay ahead of our ravenous appetite for  computing infrastructure.

I’ve been writing about Site Reliability Engineering (SRE) tasks for nearly 5 years under a lot of different names such as DevOps, Ready State, Open Operations and Underlay Operations. SRE is a term popularized by Google (there’s a book!) for the operators who build and automate their infrastructure. Their role is not administration; it is redefining how infrastructure is used and managed within Google.

Using infrastructure effectively is a competitive advantage for Google and their SREs carry tremendous authority and respect for executing on that mission.

Meanwhile, we’re in the midst of an Enterprise revolt against running infrastructure. Companies, for very good reasons, are shutting down internal IT efforts in favor of using outsourced infrastructure. Operations has simply not been able to compete with the capability, flexibility and breadth of infrastructure services offered by Amazon.

SRE is about operational excellence and how we keep up with the increasingly rapid pace of IT. It’s a recognition that we cannot scale people as quickly as we add infrastructure. And, critically, it is not infrastructure specific.

Over the next year, I’ll continue to dig deeply into the skills, tools and processes around operations.  I think that SRE may be the right banner for these thoughts and I’d like to hear your thoughts about that.

MORE?  Here’s the next post in the series about Spiraling Ops Debt.  Or Skip to Podcasts with Eric Wright and Stephen Spector.

Cloudcast.net gem about Cluster Ops Gap

Podcast juxtaposition can be magical. In this case, I heard back-to-back sessions: first a pragmatic case for cluster operations, then how developers are rebelling against infrastructure.

Last week, I was listening to Brian Gracely’s “Automatic DevOps” discussion with  John Troyer (CEO at TechReckoning, a community for IT pros) followed by his confusingly titled “operators” talk with Brandon Phillips (CTO at CoreOS).

John’s mid-recording comments really resonated with me:

At 16 minutes: “IT is going to be the master of many environments… If you have an environment [that] is hybrid & multi-cloud, then you still need to care about infrastructure… we are going to be living with that for at least 10 years.”

At 18 minutes: “We need a layer that is cloud-like, devops-like and agile-like that can still be deployed in multiple places.  This middle layer, Cluster Ops, is really important because it’s the layer between the infrastructure and the app.”

The conversation with Brandon felt very different: the goal was to package everything “operator” into Kubernetes semantics, including Kubernetes running itself. This inception approach to running the cluster is irresistible within the community because the goal of the community is to stop having to worry about infrastructure. [Brian – call me if you want to do a podcast on the counterpoint to self-hosted.]

Infrastructure is hard and complex.  There’s good reason to limit how many people have to deal with that, but someone still has to deal with it.

I’m a big fan of container workloads generally and Kubernetes specifically as a way to isolate application developers from infrastructure. Because that isolation is the point, Kubernetes is not designed to handle the messy infrastructure requirements that make Cluster Ops a challenge. This is a good thing, because complexity explodes when platforms expose infrastructure details.

For Kubernetes and similar, I believe that injecting too much infrastructure mess undermines the simplicity of the platform.

There’s a different type of platform needed for infrastructure-aware cluster operations, where automation needs to address complexity via composability. That’s what RackN is building with open Digital Rebar: the hybrid management layer that can consistently automate around infrastructure variation.

If you want to work with us to create system focused, infrastructure agnostic automation then take a look at the work we’ve been doing on underlay and cluster operations.


Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

In part 1, we reviewed the RackN team’s hard-won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make operations more expensive to maintain.

If it’s more expensive to isolate, then why have we failed to create a shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are three key things we have to solve together: cross-node dependencies (linking), service configuration (services) and isolating attribute chains (configuration). They all come back to thinking of the whole system as a cluster instead of individual nodes. Let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system is to create an interconnected system. For example, we want a database cluster with automated fail-over, or we want a storage system that predictably distributes redundant copies of our data. Most critically, and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster-building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configuration until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues. Taking this to the next logical step, upgrades require sequence control with circuit breakers – that’s exactly what Digital Rebar was built to provide.
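As a hedged illustration (my own sketch, not the Digital Rebar engine itself), sequence control with a circuit breaker can be as simple as a bounded wait between steps:

```python
import time

class CircuitOpen(Exception):
    """Raised when a step exhausts its retry budget."""

def wait_for(check, attempts: int = 10, delay: float = 5.0) -> None:
    """Poll `check()` until it returns True, with a hard stop.

    The bounded retry acts as a circuit breaker: a stuck dependency
    fails the workflow visibly instead of converging "eventually".
    """
    for _ in range(attempts):
        if check():
            return
        time.sleep(delay)
    raise CircuitOpen(f"dependency not ready after {attempts} attempts")

# Example: only join the database cluster once the primary responds.
# `primary_is_up` would be a probe supplied by the workflow (hypothetical).
# wait_for(primary_is_up)
```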

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are an intertwined mix of service, node and cross-node configuration. Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates. Once the platform is installed, then we use the platform as a service. On physical, there are even more services, including DNS, IPAM and provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance.  Using a load balancer?  You can’t configure it until you’ve got the node addresses allocated.  And then it needs to be updated as you manage your cluster.  This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar’s decomposition solves this problem because it is able to mix service and node configuration. The orchestration engine can use node-specific information to update services in the middle of a node configuration workflow. For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries. The same applies to PKI, load balancers and networking.
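A minimal sketch of what such a mixed step might look like, with the DNS client and IP allocator passed in as hypothetical dependencies:

```python
def bring_nic_online(node: dict, dns, allocate_ip) -> str:
    """Sketch of one mixed node/service step (all names hypothetical).

    The node-level action (allocating the NIC an address) immediately
    drives service-level updates (forward and reverse DNS records),
    so the cluster never sees an address without trusted entries.
    """
    ip = allocate_ip(node)              # node configuration
    dns.upsert("A", node["fqdn"], ip)   # service configuration
    dns.upsert("PTR", ip, node["fqdn"])
    return ip
```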

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front-loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component, making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming: each configuration step is an isolated unit with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers, then we can insert or remove modules without rewriting or complicating the rest of the chain.
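In code, the idea looks roughly like this (a minimal sketch; the step names and attributes are invented):

```python
def step_network(state: dict) -> dict:
    """Each step consumes and returns plain data; no hidden globals."""
    return {**state, "subnet": "10.0.0.0/24"}

def step_engine(state: dict) -> dict:
    # Swap this module (say, for a different container engine)
    # without touching the rest of the chain.
    return {**state, "engine": "docker"}

def run(steps, state=None) -> dict:
    """Fold the cluster state through an ordered list of steps."""
    state = dict(state or {})
    for step in steps:
        state = step(state)
    return state

print(run([step_network, step_engine]))
# -> {'subnet': '10.0.0.0/24', 'engine': 'docker'}
```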

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us into an unmaintainable lock-step provisioning model.

Historically, to avoid solving these three hard issues, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster-building problems within their own context. That leads to confusing bootstrap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.

In summary, we want composition because we are totally against ops magic. No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and pray” step where we are counting on timing or unknown configuration for success. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic way and see how it can help create a more portable and sustainable IT foundation.

Open Source as Reality TV and Burning Data Centers [gcOnDemand podcast notes]

During the OpenStack summit, Eric Wright (@discoposse) and I talked about a wide range of topics from scoring success of OpenStack early goals to burning down traditional data centers.

Why burn down your data center (and move to public cloud)? Because your ops processes are too hard to change. Rob talks about how hybrid provides a path if we can make ops more composable.

Here are my notes from the audio podcast (source):

1:30 Why “zehicle” as a handle? Portmanteau from electric cars… zero + vehicle

Let’s talk about OpenStack & Cloud…

  • OpenStack History
    • 2:15 Rob’s OpenStack history from Dell and Hyperscale
    • 3:20 Early thoughts of a Cloud API that could be reused
    • 3:40 The practical danger of Vendor lock-in
    • 4:30 How we implemented “no main corporate owner” by choice
  • About the Open in OpenStack
    • 5:20 Rob decomposes what “open” means because there are multiple meanings
    • 6:10 Price of having all open tools for “always open” choice and process
    • 7:10 Observation that OpenStack values having open over delivering product
    • 8:15 Community is great but a trade off. We prioritize it over implementation.
  • Q: 9:10 What if we started later? Would Docker make an impact?
    • Part of challenge for OpenStack was teaching vendors & corporate consumers “how to open source”
  • Q: 10:40 Did we accomplish what we wanted from the first summit?
    • Mixed results – some things we exceeded (like growing community) while some are behind (product adoption & interoperability).
  • 13:30 Interop, Refstack and Defcore Challenges. Rob is disappointed on interop based on implementations.
  • Q: 15:00 Who competes with OpenStack?
    • There are real alternatives. APIs do not matter as much as we thought.
    • 15:50 OpenStack vendor support is powerful
  • Q: 16:20 What makes OpenStack successful?
    • Big tent confuses the ecosystem & pushes the goal posts out
    • “Big community” is not a good definition of success for the project.
  • 18:10 Reality TV of open source – people like watching train wrecks
  • 18:45 Hybrid is the reality for IT users
  • 20:10 We have a need to define core and focus on composability. Rob has been focused on the link between hybrid and composability.
  • 22:10 Rob’s preference is that OpenStack would be smaller. Big tent is really ecosystem projects and we want that ecosystem to be multi-cloud.

Now, about RackN, bare metal, Crowbar and Digital Rebar….

  • 23:30 (re)Intro
  • 24:30 VC market is not metal friendly even though everything runs on metal!
  • 25:00 Lack of consistency translates into lack of shared ops
  • 25:30 Crowbar was an MVP – the key is to understand what we learned from it
  • 26:00 Digital Rebar started with composability and focus on operations
  • 27:00 What is hybrid now? Not just private to public.
  • 30:00 How do we make infrastructure not matter? Multi-dimensional hybrid.
  • 31:00 Digital Rebar is orchestration for composable infrastructure.
  • Q: 31:40 Do people get it?
    • Yes. Automation is moving to hybrid devops – “ops is ops” and it should not matter if it’s cloud or metal.
  • 32:15 “I don’t want to burn down my data center” – can you bring cloud ops to my private data center?

5 Key Aspects of High Fidelity DevOps [repost from DevOps.com]

For all our cloud enthusiasm, I feel like ops automation is suffering as we increase choice and complexity.  Why is this happening?  It’s about loss of fidelity.

Nearly a year ago, I was inspired by a mention of “Fidelity Gaps” during a Cloud Foundry After Dark session.  With additional advice from DevOps leader Gene Kim, this narrative about the why and how of DevOps Fidelity emerged.

As much as we talk about how we should have shared goals spanning Dev and Ops, it’s not nearly as easy as it sounds. To fuel a DevOps culture, we also have to build robust tooling.

That means investing up front in five key areas: abstraction, composability, automation, orchestration, and idempotency.

Together, these concepts allow sharing work at every level of the pipeline. Unfortunately, it’s tempting to optimize work at one level and miss the true system bottlenecks.
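Of the five, idempotency is the easiest to show in code. A minimal sketch (the file path and setting are invented examples):

```python
from pathlib import Path

def ensure_line(path: Path, line: str) -> bool:
    """Idempotent: one run or many runs produce the same file.

    Returns True only when a change was made, letting an orchestrator
    report drift instead of blindly rewriting files on every pass.
    """
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False  # already converged; nothing to do
    with path.open("a") as fh:
        fh.write(line + "\n")
    return True

# Safe to re-run; the second call is a no-op:
# ensure_line(Path("/etc/sysctl.d/99-tuning.conf"), "vm.swappiness=10")
```

Steps built this way can be shared across teams and pipelines because re-running them is always safe.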

Creating production-like fidelity for developers is essential: We need it for scale, security and upgrades. It’s not just about sharing effort; it’s about empathy and collaboration.

But even with growing acceptance of DevOps as a cultural movement, I believe deployment disparities are a big unsolved problem. When developers have vastly different working environments from operators, it creates a “fidelity gap” that makes it difficult for the teams to collaborate.

Before we talk about the costs and solutions, let me first share a story from back when I was a bright-eyed OpenStack enthusiast…

Read the Full Article on DevOps.com including my section about Why OpenStack Devstack harms the project and five specific ways to improve DevOps fidelity.