DevOps vs Cloud Native: Damn, where did all this platform complexity come from?

Complexity has always been part of IT, and it’s increasing as we embrace microservices and highly abstracted platforms. Expecting everyone to cope with this challenge is unsustainable.

We’re just more aware of infrastructure complexity now that DevOps is exposing cluster configuration to developers and automation tooling. We are also building platforms from more loosely connected open components. The benefits of customization and rapid development come with the unfortunate side-effect of adding integration points. Even worse, those integrations generally require operations to run in a specific sequence.

The result is a developer rebellion against DevOps on low-level (IaaS) platforms in favor of ones with higher-level abstractions (PaaS) like Kubernetes.
This rebellion is taking the form of “cloud native” being positioned in opposition to “DevOps” processes. I discussed exactly that point with John Furrier on theCUBE at Kubecon and again in my Messy Underlay presentation at Defrag Conf.

It is very clear that the DevOps mission to share ownership of messy production operations requirements is not widely welcomed. Unfortunately, there is no magic cure for production complexity because systems are inherently complex.

There is a (re)growing expectation that operations will remain operations instead of becoming a shared team responsibility.  While this thinking apparently rolls back core principles of the DevOps movement, we must respect the perceived productivity impact of making operations responsibility overly broad.

What is the right way to share production responsibility between teams? We can start to leverage platforms like Kubernetes to hide underlay complexity and allow DevOps shared ownership in the right places. That means that operations still owns the complex underlay and platform jobs. Overall, I think that’s a workable division.

Provisioned Secure By Default with Integrated PKI & TLS Automation

Today, I’m presenting this topic (PKI automation & rotation) at Defragcon, so I wanted to share this background more broadly as a companion for that presentation. I know this is a long post – hang with me, PKI is complex.

Building automation that creates a secure infrastructure is as critical as it is hard to accomplish. For all that we talk about repeatable automation, actually doing it securely is a challenge. Why? Because we cannot simply encode passwords, security tokens or trust into our scripts. Even more critically, secure configuration is antithetical to general immutable automation: it requires that each unit is different and unique.

Over the summer, the RackN team expanded open source Digital Rebar to include roles that build a service-by-service internal public key infrastructure (PKI).

This is a significant advance in provisioning infrastructure because it allows bootstrapping transport layer security (TLS) encryption without having to assume trust at the edges. This is not general PKI: the goal is internal trust zones that have no external trust anchors.

Before I explain the details, it’s important to understand that RackN did not build a new encryption model!  We leveraged the ones that exist and automated them.  The challenge has been automating PKI local certificate authorities (CA) and tightly scoped certificates with standard configuration tools.  Digital Rebar solves this by merging service management, node configuration and orchestration.

I’ll try to break this down into the key elements of encryption, keys and trust.

The goal is simple: we want to be able to create secure communications (that’s TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog (that’s PKI). These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

To stand up a secure REST API service, we need to create a private key held by the server and a public key that is given to each client that wants secure communication with the server.

Unfortunately, point-to-point key exchange is not enough to establish secure communications. It is too easy to impersonate a service or intercept traffic.

Part of the solution is to embed holder identity information in the certificate itself, such as the name or IP address of the server. The more specific the information, the harder it is to break the trust. Unfortunately, many automation patterns simply use wildcard (or unspecific) identities because it is very difficult for them to predict the IP address or name of a server in advance. To address that problem, we only generate certificates once the system details are known. Even better, it’s then possible to regenerate certificates (known as key rotation) after the initial deployment.

While identity improves things, it’s still not sufficient. To make the system truly robust, we need a trusted third party who can validate that the keys are legitimate. In this case, the certificate authority (CA) that issues the keys signs them so that both parties are able to trust each other. Without signed keys from the central CA, there is no practical way to intercept communications between the trusted end points. The system requires that we can build and maintain these three-way relationships. For public websites, we can rely on root certificates; however, that’s not practical or desirable for dynamic internal encryption needs.

So what did we do with Digital Rebar?  We’ve embedded a certificate authority (CA) service into the core orchestration engine (called “trust me”).  

The Digital Rebar CA can be told to generate a root certificate on a per service basis.  When we add a server for that service, the CA issues a unique signed certificate matching the server identity.  When we add a client for that service, the CA issues a unique signed public key for the client matching the client’s identity.  The server will reject communication from unknown public keys.  In this way, each service is able to ensure that it is only communicating with trusted end points.

Wow, that’s a lot of information!  Getting security right is complex and often neglected.  Our focus is provisioning automation, so these efforts do not cover all PKI lifecycle issues or challenges.  We’ve got a long list of integrations, tools and next steps that we’d like to accomplish.

Our goal was to automate building secure communication as a default.  We think these enhancements to Digital Rebar are a step in that direction.  Please let us know if you think this approach is helpful.

Why RackN Is Joining Infrastructure Masons

A few months ago, Dean Nelson, who for many years ran eBay’s data center strategy, shared his vision of Infrastructure Masons ( http://www.infrastructuremasons.org ) with me and asked if we were interested in participating. He invited us to attend the very first Infrastructure Masons Leadership Summit, being held November 16th at Google HQ in Sunnyvale, CA. We were honored and are looking forward to it.

In short, the Infrastructure Masons organization is comprised of technologists, executives and partners entrusted with building and managing the physical and logical structures of the Digital Age. They are dedicated to the advancement of the industry, the development of their fellow masons, and empowering business and personal use of infrastructure to better the economy, the environment, and society.

During our conversation, Dean explained his belief that, like water, electricity, or transportation, connectivity is taken for granted: the majority of internet users expect instant access to their online services with little awareness of the complexity and scale of the physical infrastructure that makes those services possible (I try to explain this to my children and they don’t care). Dean wants the people and organizations that enable connectivity to receive more recognition for their contributions, share their ideas, collaborate, and advance infrastructure technology to the next level. With leaders from Facebook, Google and Microsoft (RackN too!) participating, he has the right players at the table committed to delivering on his vision.

Managing multiple clouds, data centers, services, vendors and operational models should be an advantage for CIOs, CTOs, cloud operators and IT administrators, not an impediment to progress. Melding networking, containers and security is overwhelmingly complex today; it should be simple. In software-defined infrastructure, utility-based computing needs to be as easy as turning on a light or running a water faucet. RackN believes that managing the ongoing lifecycle of automating, provisioning and managing hybrid clouds and data center infrastructure together under one operational control plane is possible. At RackN, we work hard to make that vision a reality.

We share Dean’s vision and look forward to helping drive the Infrastructure Masons mission.

Want to hear more?  Read an open ops take by RackN CEO, Rob Hirschfeld.

Author: Dan Choquette, Co-Founder/COO of RackN

Infrastructure Masons is building a community around data center practice

IT is subject to seismic shifts right now. Here’s how we cope together.

For a long time, I’ve advocated for open operations (“OpenOps”) as a way to share best practices about running data centers. I’ve worked hard in OpenStack and, recently, Kubernetes communities to have operators collaborate around common architectures and automation tools. I believe the first step in these efforts starts with forming a community forum.

I’m very excited to have the RackN team and technology be part of the newly formed Infrastructure Masons effort because we are taking this exact community first approach.

Here’s how Dean Nelson, IM organizer and head of Uber Compute, describes the initiative:

An Infrastructure Masons Partner is a professional who develops products, builds or supports infrastructure projects, or operates infrastructure on behalf of end users. Like their end-user peers, they are dedicated to the advancement of the industry, the development of their fellow masons, and empowering business and personal use of infrastructure to better the economy, the environment, and society.

We’re in the midst of tremendous movement in IT infrastructure.  The change to highly automated and scale-out design was enabled by cloud but is not cloud specific.  This requirement is reshaping how IT is practiced at the most fundamental levels.

We (IT Ops) are feeling immense pressure on operations and operators to accelerate workflow processes and innovate around very complex challenges.

Open operations loses if we respond by creating thousands of isolated silos or moving everything to a vendor-specific island like AWS. The right answer is to fund ways to share practices and tooling that are tolerant of real operational complexity and the legitimate need for heterogeneity.

Interested in more?  Get involved with the group!  I’ll be sharing more details here too.

 

Will OpenStack Go Supernova? It’s Time to Refocus on Core.

There’s no gentle way to put this but everyone (and I mean everyone) I’ve talked with thinks that this position should be heard.

OpenStack is bleeding off development resources (Networkworld) and that may be a good thing if the community responds by refocusing.

#AfterStack Crowd

I spent a fantastic week in Barcelona catching-up with many old and new friends at the OpenStack summit. The community continues to grow and welcome new participants. As one of the “project elders,” I was on the hallway track checking-in on both public and private plans around the project.

One trend was common: companies are scaling back or redirecting resources away from the project. While there are many reasons for this, the negative impact on development and test velocity is very clear.

When a sun goes nova, it blows off excess mass and is left with a dense energetic core. That would be better than going supernova in which the star burns intensely and then dies.

For OpenStack, a similar process would involve clearly redirecting technical efforts to the integrated Core from an increasingly frothy list of “big tent” extensions. This would both help focus resources and improve ecosystem collaboration.  I believe OpenStack is facing a choice between going nova (core focus) and supernova (burning out).

I am highly in favor of a strong and diverse ecosystem around OpenStack, as demonstrated by my personal investments in OpenStack Interoperability (aka DefCore). However, when I moved out of the OpenStack echo chamber, I heard clearly that users have a much broader desire for interoperability. They need tools that are both hybrid and multi-cloud because their businesses are not limited to single infrastructures.

The community needs to embrace multi-cloud tools because that is the reality for its users.

Building an OpenStack-specific ecosystem (as per “big tent”) undermines an essential need for OpenStack users. Now is the time for OpenStack to course correct to a narrower mission that focuses on the integrated functional platform that is already widely adopted. Now is the time for OpenStack to live up to its original name and go “Nova.”

Czan we consider Ansible Inventory as simple service registry?

... "docker exec configure file" is a sad but common pattern ...

Interesting discussions happen when you hang out with straight-talking Paul Czarkowski. There’s a long chain of circumstance that led us from an Interop panel together at Barcelona (video) to bemoaning Ansible and Docker integration early one Sunday morning outside a gate in IAD.

What started as a rant about the czray ways people find of injecting configuration into containers (we seemed to think file-mounting configs was “least horrific”) turned into a discussion about how to retro-fit application registry features (like consul or etcd) into legacy applications.

Ansible Inventory is basically a static registry service.

While we both acknowledge that Ansible inventory is distinctly not a registry service, the idea is a useful way to help explain the interaction between registry and configuration. The most basic goal of a registry (there are others!) is to let system components find and integrate with other system components. In that sense, the inventory allows operators to pre-wire this information in advance in a functional way.

The utility quickly falls apart because it’s difficult to create re-runnable Ansible (people can barely pronounce idempotent as it is) that could handle incremental updates. Also, a registry provides many other important functions, like service health and basic cross-node storage.
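As an illustration of the pre-wiring idea (hostnames and addresses here are hypothetical), a static inventory can carry the discovery information that a real registry would serve dynamically:

```ini
# Hypothetical inventory: the [db] group acts as a static "registry"
# entry that app plays can read via groups['db'] in templates.
[db]
db1.internal ansible_host=10.0.0.10

[app]
app1.internal ansible_host=10.0.0.20
app2.internal ansible_host=10.0.0.21

[app:vars]
# Pre-wired in advance; a real registry would discover and update this.
db_host=10.0.0.10
```

The catch is that every change means editing this file and re-running plays, while a real registry updates itself and tracks service health.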

It may not be perfect, but I thought @pczarkowski’s insight was worth passing on. What do you think?

Why we can’t move past installers to talk about operations – the underlay gap

20 minutes.  That’s the amount of time most developers are willing to spend installing a tool or platform that could become the foundation for their software.  I’ve watched our industry obsess on the “out of box” experience which usually translates into a single CLI command to get started (and then fails to scale up).

Secure, scalable and robust production operations are complex. In fact, most of these platforms are specifically designed to hide that fact from developers.

That means that these platforms intentionally hide the very complexity that they themselves need to run effectively. Taking on that complexity, at best, undermines the utility of the platform and, at worst, causes distractions that keep us forever looping on “day 1” installation issues.

I believe that systems designed to manage ops process and underlay are different than the platforms designed to manage developer life-cycle.  This is different than the fidelity gap which is about portability. Accepting that allows us to focus on delivering secure, scalable and robust infrastructure for both users.

In a pair of DevOps.com posts, I lay out my arguments about the harm being caused by trying to blend these concepts in much more detail:

  1. It’s Time to Slay the Universal Installer Unicorn
  2. How the Lure of an ‘Easy Button’ Installer Traps Projects

Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

In pt 1, we reviewed the RackN team’s hard-won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed by building just for their own needs; however, these divergences limit future collaboration and ultimately make operations more expensive to maintain.

If it’s more expensive to isolate, then why have we failed to create a shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration). They all come back to thinking of the whole system as a cluster instead of individual nodes. Let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster-building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configuration until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues. Taking this to the next logical step, upgrades require sequence control with circuit breakers; that’s exactly what Digital Rebar was built to provide.

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are an intertwined mix of service, node and cross-node configuration. Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates. Once the platform is installed, we then use the platform as a service. On physical infrastructure, there are even more services involved, including DNS, IPAM and provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance.  Using a load balancer?  You can’t configure it until you’ve got the node addresses allocated.  And then it needs to be updated as you manage your cluster.  This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration.  The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence.  For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries.  The same applies for PKI, Load Balancer and Networking.

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site-to-site and version-to-version on a component basis. Any other pattern forces us into an unmaintainable lock-step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing boot-strap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.

In summary, we want composition because we are totally against ops magic.  No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic way and see how it can help create a more portable and sustainable IT foundation.

Breaking Up is Hard To Do – Why I Believe in Ops Decomposition (pt 1)

Over the summer, the RackN team took a radical step with our previous Ansible Kubernetes workload install: we broke it into pieces.  Why?  We wanted to eliminate all “magic happens here” steps in the deployment.

The result, DR Kompos8, is a faster, leaner, transparent and parallelized installation that allows for pluggable extensions and upgrades (video tour). We also chose the operationally simplest configuration: Golang binaries managed by SystemD.

Why decompose and simplify? Let’s talk about the hard-earned ops automation battle scars that led to composability as a core value:

Back in the early OpenStack days, when the project was actually much simpler, we were part of a community writing Chef Cookbooks to install it. These scripts are just a sequence of programmable steps (roles in Ops-speak) that drive the configuration of services on each node in the cluster. There is an ability to find cross-cluster information and lookup local inventory so we were able to inject specific details before the process began. However, once the process started, it was pretty much like starting a dominoes chain. If anything went wrong anywhere in the installation, we had to reset all the dominoes and start over.

Like a dominoes train, it is really fun to watch when it works. Also, like dominoes, it is frustrating to set up and fix. Often we were literally holding our breath during installation, hoping that we’d anticipated every variation in the software, hardware and environment. It is no surprise that the first and most critical feature we created was a redeploy command.

It turned out that the ability to successfully redeploy was the critical measure of success. We would not consider a deployment complete until we could wipe the systems and rebuild them automatically at least twice.

What made cluster construction so hard? There were three key things: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).

We’ll explore these three reasons in detail for part 2 of this post tomorrow.

Even without the details, it’s easy to understand that we want to avoid all magic in a deployment.

For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Likewise, we need to eliminate “it worked from my desktop” automation too.  Those systems are impossible to maintain, share and scale. Composed cluster operations addresses this problem by making work modular, predictable and transparent.

Kubernetes the NOT-so-hard way (7 RackN additions: keeping transparency, adding security)

At RackN, we take the KISS principle to heart; here are the seven ways we worked to make Kubernetes easier to install and manage.

Container community crooner, Kelsey Hightower, created a definitive installation guide that he dubbed “Kubernetes the Hard Way” or KTHW. In that document, he laid out a manual sequence of steps needed to bring up a working Kubernetes cluster. For some, the lengthy sequence served as a rallying cry to simplify and streamline the “boot to kube” process with additional configuration harnesses, more bells and some new whistles.

For the RackN team, Kelsey’s process looked like a reliable and elegant basis for automation. So, we automated it and eliminated the hard parts (see video).

 

Seven improvements for KTHW

Our operational approach to distributed systems (encoded in Digital Rebar) drives towards keeping things simple and transparent in operation.  When creating automation, it’s way too easy to add complexity that works on a desktop for a developer, but fails as we scale or move into sustaining operations.

The benefit of Kelsey’s approach was that it had to be simple enough to reproduce and troubleshoot manually; however, there were several KTHW challenges that we wanted to streamline while we automated.

  1. Respect the manual steps: Just automating is not enough. We wanted to be true to the steps so that users of the automation could look back at the process and understand it. The beauty of KTHW is that operators can read it and understand the inner workings of Kubernetes.
  2. Node Inventory: Manual node allocation is time consuming and error prone. We believe that the process should be able to (but not be required to) start from zero with just raw hardware or cloud credentials. Anything else opens up a lot of potential configuration errors.
  3. Automatic Iteration: Going back to make adjustments to previous nodes is normal in cluster building and really annoying for users. This is especially true when clusters are expanded or contracted.
  4. PKI Security: We love that Kubernetes requires TLS communication; however, we’re generally horrified about sharing around private keys and wild card certificates even for development and test clusters.
  5. Go & SystemD: We use containers for everything in Digital Rebar, and our design has a lot of RESTful services behind a reverse proxy; however, that’s simply not needed for Kubernetes. Kubernetes binaries are portable Golang programs, and only the API service is a web service. We feel strongly that the simplest and most robust deployment just runs these programs under SystemD. It is just as easy to curl a single file and restart a service as it is to do a docker pull. In fact, it’s measurably simpler, more secure and reliable.
  6. Pluggability: It’s hard to allow variation in a manual process. With Kubernetes’ open ecosystem, we see a need for operators to make practical configuration choices without straying dramatically from Kelsey’s basic process. Changes to the container runtime or network model should not result in radically different install steps because the fundamentals of Kubernetes are not changed by these choices.
  7. Parallel Deploys & CI/CD Deployments: When we work on cluster deploys, we spin up lots and lots of independent installs to test variations and changes like AWS and Google and OpenStack or Ubuntu and Centos.  Consequently, it is important that we can run multiple installs in parallel.  Once that works, we want to have CI driven setup, test and tear down processes.
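To make the SystemD choice in item 5 concrete, a minimal unit file might look like this (the binary path, flags and file locations are illustrative, not the exact RackN configuration):

```ini
# /etc/systemd/system/kube-apiserver.service -- illustrative sketch
[Unit]
Description=Kubernetes API Server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/kube-apiserver \
  --etcd-servers=https://10.0.0.10:2379 \
  --tls-cert-file=/var/lib/kubernetes/kube-apiserver.pem \
  --tls-private-key-file=/var/lib/kubernetes/kube-apiserver-key.pem
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

With a unit like this, `systemctl restart kube-apiserver` and `journalctl -u kube-apiserver` give operators the same transparency as the manual KTHW steps, with no container runtime in the way.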

We’re excited about the clean, fast and portable installation that came out of our efforts to automate the KTHW process. We hope that you’ll take a look at our approach and help us continue to improve and streamline Kubernetes (and other!) platform installs.