Cloudcast.net gem about Cluster Ops Gap

15967Podcast juxtaposition can be magical.  In this case, I heard back-to-back sessions with pragmatic for cluster operations and then how developers are rebelling against infrastructure.

Last week, I was listening to Brian Gracely’s “Automatic DevOps” discussion with  John Troyer (CEO at TechReckoning, a community for IT pros) followed by his confusingly titled “operators” talk with Brandon Phillips (CTO at CoreOS).

John’s mid-recording comments really resonated with me:

At 16 minutes: “IT is going to be the master of many environments… If you have an environment is hybrid & multi-cloud, then you still need to care about infrastructure… we are going to be living with that for at least 10 years.”

At 18 minutes: “We need a layer that is cloud-like, devops-like and agile-like that can still be deployed in multiple places.  This middle layer, Cluster Ops, is really important because it’s the layer between the infrastructure and the app.”

The conversation with Brandon felt very different where the goal was to package everything “operator” into Kubernetes semantics including Kubernetes running itself.  This inception approach to running the cluster is irresistible within the community because the goal of the community is to stop having to worry about infrastructure.  [Brian – call me if you want to a do podcast of the counter point to self-hosted].

Infrastructure is hard and complex.  There’s good reason to limit how many people have to deal with that, but someone still has to deal with it.

I’m a big fan of container workloads generally and Kubernetes specifically as a way to help isolate application developers from infrastructure; consequently, it’s not designed to handle the messy infrastructure requirements that make Cluster Ops a challenge.  This is a good thing because complexity explodes when platforms expose infrastructure details.

For Kubernetes and similar, I believe that injecting too much infrastructure mess undermines the simplicity of the platform.

There’s a different type of platform needed for infrastructure aware cluster operations where automation needs to address complexity via composability.  That’s what RackN is building with open Digital Rebar: a the hybrid management layer that can consistently automate around infrastructure variation.

If you want to work with us to create system focused, infrastructure agnostic automation then take a look at the work we’ve been doing on underlay and cluster operations.

 

DevOps vs Cloud Native: Damn, where did all this platform complexity come from?

Complexity has always part of IT and it’s increasing as we embrace microservices and highly abstracted platforms.  Making everyone cope with this challenge is unsustainable.

We’re just more aware of infrastructure complexity now that DevOps is exposing this cluster configuration to developers and automation tooling. We are also building platforms from more loosely connected open components. The benefit of customization and rapid development has the unfortunate side-effect of adding integration points. Even worse, those integrations generally require operations in a specific sequence.

The result is a developer rebellion against DevOps on low level (IaaS) platforms towards ones with higher level abstractions (PaaS) like Kubernetes.
11-11-16-hirschfeld-1080x675This rebellion is taking the form of “cloud native” being in opposition to “devops” processes. I discussed exactly that point with John Furrier on theCUBE at Kubecon and again in my Messy Underlay presentation Defrag Conf.

It is very clear that DevOps mission to share ownership of messy production operations requirements is not widely welcomed. Unfortunately, there is no magic cure for production complexity because systems are inherently complex.

There is a (re)growing expectation that operations will remain operations instead of becoming a shared team responsibility.  While this thinking apparently rolls back core principles of the DevOps movement, we must respect the perceived productivity impact of making operations responsibility overly broad.

What is the right way to share production responsibility between teams?  We can start to leverage platforms like Kubernetes to hide underlay complexity and allow DevOps shared ownership in the right places.  That means that operations still owns the complex underlay and platform jobs.  Overall, I think that’s a workable diversion.

Provisioned Secure By Default with Integrated PKI & TLS Automation

Today, I’m presenting this topic (PKI automation & rotation) at Defragcon  so I wanted to share this background more broadly as a companion for that presentation.  I know this is a long post – hang with me, PKI is complex.

Building automation that creates a secure infrastructure is as critical as it is hard to accomplish. For all the we talk about repeatable automation, actually doing it securely is a challenge. Why? Because we cannot simply encode passwords, security tokens or trust into our scripts. Even more critically, secure configuration is antithetical to general immutable automation: it requires that each unit is different and unique.

Over the summer, the RackN team expanded open source Digital Rebar to include roles that build a service-by-service internal public key infrastructure (PKI).

untitled-drawingThis is a significant advance in provisioning infrastructure because it allows bootstrapping transport layer security (TLS) encryption without having to assume trust at the edges.  This is not general PKI: the goal is for internal trust zones that have no external trust anchors.

Before I explain the details, it’s important to understand that RackN did not build a new encryption model!  We leveraged the ones that exist and automated them.  The challenge has been automating PKI local certificate authorities (CA) and tightly scoped certificates with standard configuration tools.  Digital Rebar solves this by merging service management, node configuration and orchestration.

I’ll try and break this down into the key elements of encryption, keys and trust.

The goal is simple: we want to be able to create secure communications (that’s TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog (that’s PKI). These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

To stand up a secure REST API service, we need to create a private key held by the server and a public key that is given to each client that wants secure communication with the server.

Now the parties can create secure communications (TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog. These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

Unfortunately, point-to-point key exchange is not enough to establish secure communications.  It too easy to impersonate a service or intercept traffic.  

Part of the solution is to include holder identity information into the key itself such as the name or IP address of the server.  The more specific the information, the harder it is to break the trust.  Unfortunately, many automation patterns simply use wildcard (or unspecific) identity because it is very difficult for them to predict the IP address or name of a server.   To address that problem, we only generate certificates once the system details are known.  Even better, it’s then possible to regenerate certificates (known as key rotation) after initial deployment.

While identity improves things, it’s still not sufficient.  We need to have a trusted third party who can validate that the keys are legitimate to make the system truly robust.  In this case, the certificate authority (CA) that issues the keys signs them so that both parties are able to trust each other.  There’s no practical way to intercept communications between the trusted end points without signed keys from the central CA.  The system requires that we can build and maintain these three way relationships.  For public websites, we can rely on root certificates; however, that’s not practical or desirable for dynamic internal encryption needs.

So what did we do with Digital Rebar?  We’ve embedded a certificate authority (CA) service into the core orchestration engine (called “trust me”).  

The Digital Rebar CA can be told to generate a root certificate on a per service basis.  When we add a server for that service, the CA issues a unique signed certificate matching the server identity.  When we add a client for that service, the CA issues a unique signed public key for the client matching the client’s identity.  The server will reject communication from unknown public keys.  In this way, each service is able to ensure that it is only communicating with trusted end points.

Wow, that’s a lot of information!  Getting security right is complex and often neglected.  Our focus is provisioning automation, so these efforts do not cover all PKI lifecycle issues or challenges.  We’ve got a long list of integrations, tools and next steps that we’d like to accomplish.

Our goal was to automate building secure communication as a default.  We think these enhancements to Digital Rebar are a step in that direction.  Please let us know if you think this approach is helpful.

Infrastructure Masons is building a community around data center practice

IT is subject to seismic shifts right now. Here’s how we cope together.

For a long time, I’ve advocated for open operations (“OpenOps”) as a way to share best practices about running data centers. I’ve worked hard in OpenStack and, recently, Kubernetes communities to have operators collaborate around common architectures and automation tools. I believe the first step in these efforts starts with forming a community forum.

I’m very excited to have the RackN team and technology be part of the newly formed Infrastructure Masons effort because we are taking this exact community first approach.

infrastructure_masons

Here’s how Dean Nelson, IM organizer and head of Uber Compute, describes the initiative:

An Infrastructure Mason Partner is a professional who develop products, build or support infrastructure projects, or operate infrastructure on behalf of end users. Like their end users peers, they are dedicated to the advancement of the Industry, development of their fellow masons, and empowering business and personal use of the infrastructure to better the economy, the environment, and society.

We’re in the midst of tremendous movement in IT infrastructure.  The change to highly automated and scale-out design was enabled by cloud but is not cloud specific.  This requirement is reshaping how IT is practiced at the most fundamental levels.

We (IT Ops) are feeling amazing pressure on operations and operators to accelerate workflow processes and innovate around very complex challenges.

Open operations loses if we respond by creating thousands of isolated silos or moving everything to a vendor specific island like AWS.  The right answer is to fund ways to share practices and tooling that is tolerant of real operational complexity and the legitimate needs for heterogeneity.

Interested in more?  Get involved with the group!  I’ll be sharing more details here too.

 

Will OpenStack Go Supernova? It’s Time to Refocus on Core.

There’s no gentle way to put this but everyone (and I mean everyone) I’ve talked with thinks that this position should be heard.

OpenStack is bleeding off development resources (Networkworld) and that may be a good thing if the community responds by refocusing.

cv4paiiwyaegopu

#AfterStack Crowd

I spent a fantastic week in Barcelona catching-up with many old and new friends at the OpenStack summit. The community continues to grow and welcome new participants. As one of the “project elders,” I was on the hallway track checking-in on both public and private plans around the project.

One trend was common: companies are scaling back or redirecting resources away from the project.  While there are many reasons for this; the negative impact to development and test velocity is very clear.

When a sun goes nova, it blows off excess mass and is left with a dense energetic core. That would be better than going supernova in which the star burns intensely and then dies.

For OpenStack, a similar process would involve clearly redirecting technical efforts to the integrated Core from an increasingly frothy list of “big tent” extensions. This would both help focus resources and improve ecosystem collaboration.  I believe OpenStack is facing a choice between going nova (core focus) and supernova (burning out).

I am highly in favor of a strong and diverse ecosystem around OpenStack as demonstrated by my personal investments in OpenStack Interoperability (aka DefCore). However, when I moved out of the OpenStack echo chamber; I heard clearly that users have a much broader desire for interoperability. They need tools that are both hybrid and multi-cloud because their businesses are not limited to single infrastructures.

The community needs to embrace multi-cloud tools because that is the reality for its users.

Building an OpenStack specific ecosystem (as per “big tent”) undermines an essential need for OpenStack users. Now is the time for OpenStack for course correct to a narrower mission that focuses on the integrated functional platform that is already widely adopted. Now is the time for OpenStack live up to its original name and go “Nova.”

Czan we consider Ansible Inventory as simple service registry?

... "docker exec configure file" is a sad but common pattern ...

np2utaoe_400x400Interesting discussions happen when you hang out with straight-talking Paul Czarkowski. There’s a long chain of circumstance that lead us from an Interop panel together at Barcelona (video) to bemoaning Ansible and Docker integration early one Sunday morning outside a gate in IAD.

What started as a rant about czray ways people find of injecting configuration into containers (we seemed to think file mounting configs was “least horrific”) turned into an discussion about how to retro-fit application registry features (like consul or etcd) into legacy applications.

Ansible Inventory is basically a static registry service.

While we both acknowledge that Ansible inventory is distinctly not a registry service, the idea is a useful way to help explain the interaction between registry and configuration.  The most basic goal of a registry (there are others!) is to have system components be able to find and integrate with other system components.  In that sense, the inventory creates allows operators to pre-wire this information in advance in a functional way.

The utility quickly falls apart because it’s difficult to create re-runable Ansible (people can barely pronounce idempotent as it is) that could handle incremental updates.  Also, a registry provides many other important functions like service health and basic cross node storage that are import.

It may not be perfect, but I thought it was @pczarkowski insight worth passing on.  What do you think?

Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

In part pt 1, we reviewed the RackN team’s hard won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make it more expensive to maintain operations.

aid1165255-728px-install-pergo-flooring-step-5-version-2If it’s more expensive isolate then why have we failed to create shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are a three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).  While they all come back to thinking of the whole system as a cluster instead of individual nodes. let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system, is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configure until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues.  Taking this to the next logical steps, doing upgrades, require sequence control with circuit breakers – that’s exactly what Digital Rebar was built to provide.

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are intertwined mix of service, node and cross-node configuration.  Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates.  Once the platform is installed, then we use the platform as a services.  On physical, there are even more including DNS, IPAM and Provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance.  Using a load balancer?  You can’t configure it until you’ve got the node addresses allocated.  And then it needs to be updated as you manage your cluster.  This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration.  The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence.  For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries.  The same applies for PKI, Load Balancer and Networking.

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us to into an unmaintainable lock step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing boot-strap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.
In summary, we want composition because we are totally against ops magic.  No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic why and see how it can help create a more portable and sustainable IT foundation.