About Rob H

A Baltimore transplant to Austin, Rob thinks about ways of building scale infrastructure for the clouds using Agile processes. He sat on the OpenStack Foundation board for four years. He co-founded RackN to enable software that creates hyperscale converged infrastructure.

Can we control Hype & Over-Vendoring?

Posted on December 12, 2016 by Rob H

Q: Is over-vendoring when you’ve had too much to drink?
A: Yes, too much Kool Aid.

There’s a lot of information here – skip to the bottom if you want to see my recommendation.

Last week on TheNewStack, I offered eight ways to keep Kubernetes on the right track (abridged list here) and felt that item #6 needed more explanation and some concrete solutions.

DO: Focus on a Tight Core
DO: Build a Diverse Community
DO: Multi-cloud and Hybrid
DO: Be Humble and Honest
AVOID: “The One Ring” Universal Solution Hubris
AVOID: Over-Vendoring (discussed here)
AVOID: Coupling Installers, Brokers and Providers to the core
AVOID: Fast Release Cycles without LTS Releases

What is Over-Vendoring? It’s when vendors drive their companies’ brands ahead of the health of the project, generally by feeding an aggressive hype cycle in which every vendor tries to jump on the bandwagon.

Hype can be very dangerous for projects (David Cassel’s TNS article) because it is easy to bypass the user needs and boring scale/stabilization processes to focus on vendor differentiation. Unfortunately, common use-cases do not drive differentiation and are invisible when it comes to company marketing budgets. That boring common core has the effect of creating a tragedy of the commons, which undermines collaboration on shared code bases.

The solution is to aggressively keep the project core small so that vendors have specific and limited areas of coopetition.

A small core means we do not compel collaboration in many areas of the project. This drives competition and diversity that can be confusing. The temptation to endorse or nominate companion projects is risky due to the hype cycle. Endorsements can create a bias that actually hurts innovation because early or loud vendors do not generally create the best long term approaches. I’ve heard this described as “people doing the real work don’t necessarily have time to brag about it.”

Keeping a “small core” mantra drives a healthy plug-in model where vendors can differentiate. It also ensures that projects can succeed with a bounded set of core contributors and support infrastructure. That means that we should not measure success by commits, committers or lines of code because these will drop as projects successfully modularize. My recommendation for a key success metric is the ratio of committers to ecosystem members and users.

Tracking an improving ratio of core to ecosystem shows improving efficiency of investment. That’s a better sign of health than project growth.
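As a rough illustration of that metric, the sketch below compares two snapshots of an imaginary project. Every name and number here is invented for the example (none of it measures a real project), and it reports the ratio as ecosystem members per committer simply because that reads more naturally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time counts for a project (illustrative only)."""
    label: str
    core_committers: int
    ecosystem_members: int  # vendors, plug-in authors and users combined


def ecosystem_per_committer(snapshot: Snapshot) -> float:
    """Return how many ecosystem members each core committer supports."""
    if snapshot.core_committers <= 0:
        raise ValueError("a project needs at least one core committer")
    return snapshot.ecosystem_members / snapshot.core_committers


# Invented numbers: the committer count drops while the ecosystem grows.
before = Snapshot("before modularizing", core_committers=200, ecosystem_members=4000)
after = Snapshot("after modularizing", core_committers=80, ecosystem_members=6000)

for snap in (before, after):
    ratio = ecosystem_per_committer(snap)
    print(f"{snap.label}: {ratio:.1f} ecosystem members per committer")
```

In this made-up case a commit-count dashboard would show the project shrinking, while the ratio shows each core contributor supporting a larger ecosystem.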

It’s important to note that there is also a serious risk of under-vendoring too!

We must recognize and support vendors in open source communities because they sustain the project via direct contributions and by bringing users. For a healthy ecosystem, we need to ensure that vendors can fairly profit. That means they must be able to use their brand in combination with the project’s brand. The Apache project is the anti-pattern because it has very strict “no vendor” trademark marketing guidelines that can strand projects without good corporate support.

I’ve come to believe that it’s important to allow vendors to market open source project brands; however, they also need to have some limits on how they position the project.

How should this co-branding work? My thinking is that vendor claims about a project should be managed in a consistent and common way. Since we’re keeping the project core small, that should help limit the scope of the claims. Vendors that want to make ecosystem claims should be given clear spaces for marketing their own brand in participation with the project brand.

I don’t pretend that this is easy! Vendor marketing is planned quarters ahead of when open source projects are ready for them: that’s part of what feeds the hype cycle. That means that projects will be saying no to some free marketing from their ecosystem. Ideally, we’re saying yes to the right parts at the same time.

Ultimately, hype control means saying no to free marketing. For an open source project, that’s a hard but essential decision.

Cloudcast.net gem about Cluster Ops Gap

Posted on November 29, 2016 by Rob H

Podcast juxtaposition can be magical. In this case, I heard back-to-back sessions: first a pragmatic take on cluster operations and then one on how developers are rebelling against infrastructure.

Last week, I was listening to Brian Gracely’s “Automatic DevOps” discussion with John Troyer (CEO at TechReckoning, a community for IT pros) followed by his confusingly titled “operators” talk with Brandon Philips (CTO at CoreOS).

John’s mid-recording comments really resonated with me:

At 16 minutes: “IT is going to be the master of many environments… If you have an environment that is hybrid & multi-cloud, then you still need to care about infrastructure… we are going to be living with that for at least 10 years.”

At 18 minutes: “We need a layer that is cloud-like, devops-like and agile-like that can still be deployed in multiple places. This middle layer, Cluster Ops, is really important because it’s the layer between the infrastructure and the app.”

The conversation with Brandon felt very different: the goal was to package everything “operator” into Kubernetes semantics, including Kubernetes running itself. This inception approach to running the cluster is irresistible within the community because the goal of the community is to stop having to worry about infrastructure. [Brian – call me if you want to do a podcast on the counterpoint to self-hosted].

Infrastructure is hard and complex. There’s good reason to limit how many people have to deal with that, but someone still has to deal with it.

I’m a big fan of container workloads generally and Kubernetes specifically as a way to help isolate application developers from infrastructure; consequently, it’s not designed to handle the messy infrastructure requirements that make Cluster Ops a challenge. This is a good thing because complexity explodes when platforms expose infrastructure details.

For Kubernetes and similar, I believe that injecting too much infrastructure mess undermines the simplicity of the platform.

There’s a different type of platform needed for infrastructure-aware cluster operations where automation needs to address complexity via composability. That’s what RackN is building with open Digital Rebar: a hybrid management layer that can consistently automate around infrastructure variation.

If you want to work with us to create system focused, infrastructure agnostic automation then take a look at the work we’ve been doing on underlay and cluster operations.

DevOps vs Cloud Native: Damn, where did all this platform complexity come from?

Posted on November 17, 2016 by Rob H

Complexity has always been part of IT and it’s increasing as we embrace microservices and highly abstracted platforms. Making everyone cope with this challenge is unsustainable.

We’re just more aware of infrastructure complexity now that DevOps is exposing this cluster configuration to developers and automation tooling. We are also building platforms from more loosely connected open components. The benefit of customization and rapid development has the unfortunate side-effect of adding integration points. Even worse, those integrations generally require operations in a specific sequence.

The result is a developer rebellion against DevOps on low level (IaaS) platforms towards ones with higher level abstractions (PaaS) like Kubernetes.
This rebellion is taking the form of “cloud native” being in opposition to “devops” processes. I discussed exactly that point with John Furrier on theCUBE at Kubecon and again in my Messy Underlay presentation at Defrag Conf.

It is very clear that the DevOps mission to share ownership of messy production operations requirements is not widely welcomed. Unfortunately, there is no magic cure for production complexity because systems are inherently complex.

There is a (re)growing expectation that operations will remain operations instead of becoming a shared team responsibility. While this thinking apparently rolls back core principles of the DevOps movement, we must respect the perceived productivity impact of making operations responsibility overly broad.

What is the right way to share production responsibility between teams? We can start to leverage platforms like Kubernetes to hide underlay complexity and allow DevOps shared ownership in the right places. That means that operations still owns the complex underlay and platform jobs. Overall, I think that’s a workable division.

Provisioned Secure By Default with Integrated PKI & TLS Automation

Posted on November 16, 2016 by Rob H

Today, I’m presenting this topic (PKI automation & rotation) at Defragcon so I wanted to share this background more broadly as a companion for that presentation. I know this is a long post – hang with me, PKI is complex.

Building automation that creates a secure infrastructure is as critical as it is hard to accomplish. For all that we talk about repeatable automation, actually doing it securely is a challenge. Why? Because we cannot simply encode passwords, security tokens or trust into our scripts. Even more critically, secure configuration is antithetical to general immutable automation: it requires that each unit is different and unique.

Over the summer, the RackN team expanded open source Digital Rebar to include roles that build a service-by-service internal public key infrastructure (PKI).

This is a significant advance in provisioning infrastructure because it allows bootstrapping transport layer security (TLS) encryption without having to assume trust at the edges. This is not general PKI: the goal is for internal trust zones that have no external trust anchors.

Before I explain the details, it’s important to understand that RackN did not build a new encryption model! We leveraged the ones that exist and automated them. The challenge has been automating PKI local certificate authorities (CA) and tightly scoped certificates with standard configuration tools. Digital Rebar solves this by merging service management, node configuration and orchestration.

I’ll try and break this down into the key elements of encryption, keys and trust.

The goal is simple: we want to be able to create secure communications (that’s TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog (that’s PKI). These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

To stand up a secure REST API service, we need to create a private key held by the server and a public key that is given to each client that wants secure communication with the server.

With that key pair in place, the two parties can create secure communications (TLS) between the networked services.

Unfortunately, point-to-point key exchange is not enough to establish secure communications. It is too easy to impersonate a service or intercept traffic.

Part of the solution is to include holder identity information in the key itself, such as the name or IP address of the server. The more specific the information, the harder it is to break the trust. Unfortunately, many automation patterns simply use wildcard (or unspecific) identity because it is very difficult for them to predict the IP address or name of a server. To address that problem, we only generate certificates once the system details are known. Even better, it’s then possible to regenerate certificates (known as key rotation) after initial deployment.

While identity improves things, it’s still not sufficient. We need to have a trusted third party who can validate that the keys are legitimate to make the system truly robust. In this case, the certificate authority (CA) that issues the keys signs them so that both parties are able to trust each other. There’s no practical way to intercept communications between the trusted end points without signed keys from the central CA. The system requires that we can build and maintain these three way relationships. For public websites, we can rely on root certificates; however, that’s not practical or desirable for dynamic internal encryption needs.

So what did we do with Digital Rebar? We’ve embedded a certificate authority (CA) service into the core orchestration engine (called “trust me”).

The Digital Rebar CA can be told to generate a root certificate on a per service basis. When we add a server for that service, the CA issues a unique signed certificate matching the server identity. When we add a client for that service, the CA issues a unique signed public key for the client matching the client’s identity. The server will reject communication from unknown public keys. In this way, each service is able to ensure that it is only communicating with trusted end points.
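To make that flow concrete, here is a toy model of per-service issuance and rejection. It is only a sketch and not Digital Rebar code: the class, method and service names are invented for illustration, and an HMAC over the identity stands in for real certificate signing. In real PKI the verifier would hold only the CA’s public certificate, never a shared secret.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple

Credential = Tuple[str, str]  # (identity, signature)


class ToyServiceCA:
    """Toy per-service CA (hypothetical). HMAC stands in for X.509 signing."""

    def __init__(self) -> None:
        self._roots: Dict[str, bytes] = {}  # one independent root per service

    def create_root(self, service: str) -> None:
        """Create the root secret for a service if it does not exist yet."""
        self._roots.setdefault(service, secrets.token_bytes(32))

    def _sign(self, service: str, identity: str) -> str:
        return hmac.new(self._roots[service], identity.encode(), hashlib.sha256).hexdigest()

    def issue(self, service: str, identity: str) -> Credential:
        """Issue a credential bound to one identity within one service."""
        return identity, self._sign(service, identity)

    def verify(self, service: str, credential: Credential) -> bool:
        """Accept only credentials signed by this service's own root."""
        identity, signature = credential
        if service not in self._roots:
            return False
        return hmac.compare_digest(self._sign(service, identity), signature)


ca = ToyServiceCA()
ca.create_root("database")
ca.create_root("api")

db_client = ca.issue("database", "client-10.0.0.7")
print(ca.verify("database", db_client))                      # issued for this service
print(ca.verify("api", db_client))                           # wrong trust zone
print(ca.verify("database", ("client-10.0.0.7", "forged")))  # unsigned claim
```

The point of the sketch is the scoping: a credential is tied to a known identity and to a single service root, so a key that is valid in one trust zone buys nothing in another.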

Wow, that’s a lot of information! Getting security right is complex and often neglected. Our focus is provisioning automation, so these efforts do not cover all PKI lifecycle issues or challenges. We’ve got a long list of integrations, tools and next steps that we’d like to accomplish.

Our goal was to automate building secure communication as a default. We think these enhancements to Digital Rebar are a step in that direction. Please let us know if you think this approach is helpful.

Why RackN Is Joining Infrastructure Masons

Posted on November 15, 2016 by Rob H

A few months ago, Dean Nelson, who for many years ran the eBay data center strategy, shared his vision of Infrastructure Masons ( http://www.infrastructuremasons.org ) with me and asked us if we were interested in participating. He invited us to attend the very first Infrastructure Masons Leadership Summit, which is being held November 16th at Google HQ in Sunnyvale, CA. We were honored and are looking forward to it.

In short, the Infrastructure Masons organization is comprised of technologists, executives and partners entrusted with building and managing the physical and logical structures of the Digital Age. They are dedicated to the advancement of the industry, development of their fellow masons, and empowering business and personal use of infrastructure to better the economy, the environment, and society.

During our conversation, Dean explained his belief that, as with water, electricity, or transportation, the majority of internet users expect instant connectivity to their online services without much awareness of the complexity and scale of the physical infrastructure that makes those services possible (I try to explain this to my children and they don’t care). That infrastructure is taken for granted. Dean wants people and organizations that enable connectivity to receive more recognition for their contributions, share their ideas, collaborate and advance infrastructure technology to the next level. With leaders from Facebook, Google and Microsoft (RackN too!) participating, he has the right players at the table who are committed to help deliver on his vision.

Managing multiple clouds, data centers, services, vendors and operational models should not be an impediment to progress for CIOs, CTOs, cloud operators and IT administrators but an advantage. Melding networking, containers and security is necessary, and it should be simple rather than overwhelmingly complex. In the software-defined infrastructure, utility-based computing needs to be just like turning on a light or running a water faucet. RackN believes the ongoing lifecycle of automating, provisioning and managing hybrid clouds and data center infrastructure together under one operational control plane is possible. At RackN, we work hard to make that vision a reality.

We share Dean’s vision and look forward to helping drive the Infrastructure Masons mission.

Want to hear more? Read an open ops take by RackN CEO, Rob Hirschfeld.

Author: Dan Choquette, Co-Founder/COO of RackN

Infrastructure Masons is building a community around data center practice

Posted on November 15, 2016 by Rob H

IT is subject to seismic shifts right now. Here’s how we cope together.

For a long time, I’ve advocated for open operations (“OpenOps”) as a way to share best practices about running data centers. I’ve worked hard in OpenStack and, recently, Kubernetes communities to have operators collaborate around common architectures and automation tools. I believe the first step in these efforts starts with forming a community forum.

I’m very excited to have the RackN team and technology be part of the newly formed Infrastructure Masons effort because we are taking this exact community first approach.

Here’s how Dean Nelson, IM organizer and head of Uber Compute, describes the initiative:

An Infrastructure Mason Partner is a professional who develops products, builds or supports infrastructure projects, or operates infrastructure on behalf of end users. Like their end user peers, they are dedicated to the advancement of the Industry, development of their fellow masons, and empowering business and personal use of the infrastructure to better the economy, the environment, and society.

We’re in the midst of tremendous movement in IT infrastructure. The change to highly automated and scale-out design was enabled by cloud but is not cloud specific. This requirement is reshaping how IT is practiced at the most fundamental levels.

We (IT Ops) are feeling amazing pressure on operations and operators to accelerate workflow processes and innovate around very complex challenges.

Open operations loses if we respond by creating thousands of isolated silos or moving everything to a vendor specific island like AWS. The right answer is to fund ways to share practices and tooling that is tolerant of real operational complexity and the legitimate needs for heterogeneity.

Interested in more? Get involved with the group! I’ll be sharing more details here too.

Will OpenStack Go Supernova? It’s Time to Refocus on Core.

Posted on November 10, 2016 by Rob H

There’s no gentle way to put this but everyone (and I mean everyone) I’ve talked with thinks that this position should be heard.

OpenStack is bleeding off development resources (Networkworld) and that may be a good thing if the community responds by refocusing.

#AfterStack Crowd

I spent a fantastic week in Barcelona catching-up with many old and new friends at the OpenStack summit. The community continues to grow and welcome new participants. As one of the “project elders,” I was on the hallway track checking-in on both public and private plans around the project.

One trend was common: companies are scaling back or redirecting resources away from the project. While there are many reasons for this, the negative impact to development and test velocity is very clear.

When a sun goes nova, it blows off excess mass and is left with a dense energetic core. That would be better than going supernova in which the star burns intensely and then dies.

For OpenStack, a similar process would involve clearly redirecting technical efforts to the integrated Core from an increasingly frothy list of “big tent” extensions. This would both help focus resources and improve ecosystem collaboration. I believe OpenStack is facing a choice between going nova (core focus) and supernova (burning out).

I am highly in favor of a strong and diverse ecosystem around OpenStack as demonstrated by my personal investments in OpenStack Interoperability (aka DefCore). However, when I moved out of the OpenStack echo chamber, I heard clearly that users have a much broader desire for interoperability. They need tools that are both hybrid and multi-cloud because their businesses are not limited to single infrastructures.

The community needs to embrace multi-cloud tools because that is the reality for its users.

Building an OpenStack specific ecosystem (as per “big tent”) undermines an essential need for OpenStack users. Now is the time for OpenStack to course correct to a narrower mission that focuses on the integrated functional platform that is already widely adopted. Now is the time for OpenStack to live up to its original name and go “Nova.”

Czan we consider Ansible Inventory as simple service registry?

Posted on October 30, 2016 by Rob H

... "docker exec configure file" is a sad but common pattern ...

Interesting discussions happen when you hang out with straight-talking Paul Czarkowski. There’s a long chain of circumstances that led us from an Interop panel together at Barcelona (video) to bemoaning Ansible and Docker integration early one Sunday morning outside a gate in IAD.

What started as a rant about czray ways people find of injecting configuration into containers (we seemed to think file mounting configs was “least horrific”) turned into a discussion about how to retro-fit application registry features (like consul or etcd) into legacy applications.

Ansible Inventory is basically a static registry service.

While we both acknowledge that Ansible inventory is distinctly not a registry service, the idea is a useful way to help explain the interaction between registry and configuration. The most basic goal of a registry (there are others!) is to have system components be able to find and integrate with other system components. In that sense, the inventory allows operators to pre-wire this information in advance in a functional way.
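Here is a minimal sketch of that idea: read an INI-style inventory and answer the one question a registry must answer, namely “which hosts provide this service?” The host names, group names and the hand-rolled parser are all assumptions made for the example. It handles only plain groups (no :vars or :children sections) and is not how Ansible itself loads inventory.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical inventory; group names double as service names.
INVENTORY = """\
[database]
db1.example.internal
db2.example.internal

[web]
web1.example.internal ansible_host=10.0.0.11
"""


def parse_inventory(text: str) -> Dict[str, List[str]]:
    """Map each [group] header to the host names listed beneath it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    current = "ungrouped"
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        # The first token is the host; the rest are per-host variables.
        groups[current].append(line.split()[0])
    return dict(groups)


def lookup(registry: Dict[str, List[str]], service: str) -> List[str]:
    """Static lookup: returns whatever was pre-wired, with no health checks."""
    return registry.get(service, [])


registry = parse_inventory(INVENTORY)
print(lookup(registry, "database"))
print(lookup(registry, "cache"))  # nothing was pre-wired for this service
```

The lookup never changes after the file is written, which is exactly the limitation: there is no health signal and no way for a new node to announce itself.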

The utility quickly falls apart because it’s difficult to create re-runnable Ansible (people can barely pronounce idempotent as it is) that could handle incremental updates. Also, a registry provides many other important functions, like service health and basic cross node storage.

It may not be perfect, but I thought it was a @pczarkowski insight worth passing on. What do you think?

Why we can’t move past installers to talk about operations – the underlay gap

Posted on October 12, 2016 by Rob H

20 minutes. That’s the amount of time most developers are willing to spend installing a tool or platform that could become the foundation for their software. I’ve watched our industry obsess on the “out of box” experience which usually translates into a single CLI command to get started (and then fails to scale up).

Secure, scalable and robust production operations are complex. In fact, most of these platforms are specifically designed to hide that fact from developers.

That means that these platforms intentionally hide the very complexity that they themselves need to run effectively. Adding that complexity, at best, undermines the utility of the platform and, at worst, causes distractions that keep us forever looping on “day 1” installation issues.

I believe that systems designed to manage ops process and underlay are different than the platforms designed to manage developer life-cycle. This is different than the fidelity gap which is about portability. Accepting that allows us to focus on delivering secure, scalable and robust infrastructure for both users.

In a pair of DevOps.com posts, I lay out my arguments about the harm being caused by trying to blend these concepts in much more detail:

Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

Posted on October 7, 2016 by Rob H

In pt 1, we reviewed the RackN team’s hard-won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make it more expensive to maintain operations.

If it’s more expensive to isolate, then why have we failed to create shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration). While they all come back to thinking of the whole system as a cluster instead of individual nodes, let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configuration until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues. Taking this to the next logical step, doing upgrades requires sequence control with circuit breakers – that’s exactly what Digital Rebar was built to provide.
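To sketch what sequence control with a circuit breaker means in the simplest terms: order the steps by their declared dependencies, run them one at a time, and halt at the first failure instead of retrying until things converge. The step names and the run_in_order helper below are hypothetical, and this is not the Digital Rebar orchestration engine; it relies on Python’s standard graphlib module (3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_in_order(
    deps: Dict[str, Set[str]],
    actions: Dict[str, Callable[[], bool]],
) -> List[str]:
    """Run each step after its prerequisites; halt the run on first failure."""
    completed: List[str] = []
    for step in TopologicalSorter(deps).static_order():
        if not actions[step]():
            # Circuit breaker: stop before any later step can run.
            break
        completed.append(step)
    return completed


# Hypothetical cluster build: each step maps to the steps it depends on.
deps = {
    "issue-certs": {"allocate-addresses"},
    "configure-db": {"issue-certs"},
    "join-cluster": {"configure-db"},
}

actions = {
    "allocate-addresses": lambda: True,
    "issue-certs": lambda: True,
    "configure-db": lambda: False,  # simulate a failed step
    "join-cluster": lambda: True,
}

print(run_in_order(deps, actions))
```

Because the failing step stops the run, the join step never executes against a half-configured database, which is the behavior that blind retry loops cannot guarantee.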

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are an intertwined mix of service, node and cross-node configuration. Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates. Once the platform is installed, then we use the platform as a service. On physical, there are even more, including DNS, IPAM and Provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance. Using a load balancer? You can’t configure it until you’ve got the node addresses allocated. And then it needs to be updated as you manage your cluster. This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration. The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence. For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries. The same applies for PKI, Load Balancer and Networking.

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.
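A small sketch of that functional style follows. Each step is a plain function that reads the attributes produced so far and returns only its own outputs, so one module can be swapped without touching the rest of the chain. The step and attribute names are made up for illustration and do not come from Digital Rebar.

```python
from typing import Callable, Dict, List

Attrs = Dict[str, str]
Step = Callable[[Attrs], Attrs]


def docker_engine(attrs: Attrs) -> Attrs:
    """Hypothetical step: choose Docker as the container runtime."""
    return {"container_runtime": "docker"}


def containerd_engine(attrs: Attrs) -> Attrs:
    """Hypothetical drop-in replacement for the step above."""
    return {"container_runtime": "containerd"}


def overlay_network(attrs: Attrs) -> Attrs:
    """Depends only on the runtime attribute, not on which step set it."""
    return {"network_plugin": f"overlay-for-{attrs['container_runtime']}"}


def run_chain(steps: List[Step], initial: Attrs) -> Attrs:
    """Thread attributes through the steps without letting them share state."""
    attrs = dict(initial)
    for step in steps:
        # Each step gets a copy and contributes only its declared outputs.
        attrs = {**attrs, **step(dict(attrs))}
    return attrs


site_a = run_chain([docker_engine, overlay_network], {"cluster": "demo"})
site_b = run_chain([containerd_engine, overlay_network], {"cluster": "demo"})
print(site_a["network_plugin"])
print(site_b["network_plugin"])
```

Swapping the engine step changes the result for the second site while the networking step stays untouched, which is the kind of variation that front-loaded static configuration makes difficult.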

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us into an unmaintainable lock-step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing boot-strap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.
In summary, we want composition because we are totally against ops magic. No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and pray” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic way and see how it can help create a more portable and sustainable IT foundation.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Author Archives: Rob H