10x Faster Today but 10x Harder to Maintain Tomorrow: the Cul-De-Sac problem

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way that scales dramatically better. I’m not thinking about scale in number of nodes, but in operator efficiency. The primary way to create that efficiency is to limit site customization and to improve reuse. Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Recently, RackN asked the question “Shouldn’t we have Shared Automation for Commodity Infrastructure?” which talked about the fact that we, as an industry, keep writing custom automation for what should be commodity servers. This “snowflaking” happens because there’s enough variation at the data center system level that it’s very difficult to share and reuse automation on an ongoing basis.

Since variation enables innovation, we need to solve this problem without limiting diversity of choice.

 

Happily, platforms like Kubernetes are designed to hide these infrastructure variations for developers.  That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms.  Unfortunately, that does nothing for the platforms or infrastructure bound applications.  For this lower level software, we need to accept that operations environments are heterogeneous.

 

I realized that we’re looking at a multidimensional problem after watching communities like OpenStack struggle to evolve operations practice.

It’s multidimensional because we are building the operations practice simultaneously with the software itself. To make things even harder, the infrastructure and dependencies are also constantly changing. Since this degree of rapid multi-factor innovation is the new normal, we have to plan for our operations automation itself to be just as upgradable.

If we cannot upgrade both the software AND the related deployment automation, then each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current.  These challenges limit the virtuous feedback cycles that help communities grow.  

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

It is hard because each deployment is unique, and those differences make it hard to absorb community advances without being constantly broken. That is one of the reasons why companies opt out of the community and into vendor distributions. While vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system level concerns. Unlike project generated templates, this approach absorbs heterogeneity and brings in the external information that often complicates project deployment automation.

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

Why is RackN advancing OpenStack on Kubernetes?

Yesterday, RackN CEO, Rob Hirschfeld, described the remarkable progress in OpenStack on Kubernetes using Helm (article link).  Until now, RackN had not been willing to officially support OpenStack deployments; however, we now believe that this approach is a game changer for OpenStack operators even if they are not actively looking at Kubernetes.

We are looking for companies that want to join in this work and fast-track it into production. If this is interesting, please contact us at sre@rackn.com.

Why should you sponsor? Current OpenStack operators facing “fork-lift upgrades” should want to find a path like this one that ensures future upgrades are baked into the plan. This approach provides a fast track to a general purpose, enterprise grade, upgradable Kubernetes infrastructure.

Here is Rob’s Demo

Rob’s Original Blog Post

RackN revisits OpenStack deployments with an eye on ongoing operations. I’ve been an outspoken skeptic of a Joint OpenStack Kubernetes Environment because I felt that the technical hurdles of cloud native architecture would prove challenging.

I was wrong: I underestimated how fast these issues could be addressed.

… read the rest at Beyond Expectations: OpenStack via Kubernetes Helm (Fully Automated with Digital Rebar) — Rob Hirschfeld

Beyond Expectations: OpenStack via Kubernetes Helm (Fully Automated with Digital Rebar)

RackN revisits OpenStack deployments with an eye on ongoing operations.

I’ve been an outspoken skeptic of a Joint OpenStack Kubernetes Environment (my OpenStack BCN preso, Super User follow-up and BOS Proposal) because I felt that the technical hurdles of cloud native architecture would prove challenging. Issues like stable service positioning and persistent data are requirements for OpenStack and hard problems in Kubernetes.

I was wrong: I underestimated how fast these issues could be addressed.

The Kubernetes Helm work out of the AT&T Comm Dev lab takes on the integration with a “do it the K8s native way” approach that the RackN team finds very effective. In fact, we’ve created a fully integrated Digital Rebar deployment that lays down Kubernetes using Kargo and then adds OpenStack via Helm. The provisioning automation includes a Ceph cluster to provide stateful sets for data persistence.

This joint approach dramatically reduces operational challenges associated with running OpenStack without taking over a general purpose Kubernetes infrastructure for a single task.

Given the rise of SRE thinking, the RackN team believes that this approach changes the field for OpenStack deployments and will ultimately dominate the field (which is already mainly containerized). There is still work to be completed: some complex configuration is required to allow both Kubernetes CNI and Neutron to collaborate so that containers and VMs can cross-communicate.

We are looking for companies that want to join in this work and fast-track it into production.  If this is interesting, please contact us at sre@rackn.com.

Why should you sponsor? Current OpenStack operators facing “fork-lift upgrades” should want to find a path like this one that ensures future upgrades are baked into the plan. This approach provides a fast track to a general purpose, enterprise grade, upgradable Kubernetes infrastructure.

Closing note from my past presentations: We’re making progress on the technical aspects of this integration; however, my concerns about market positioning remain.

Shouldn’t we have Standard Automation for Commodity Infrastructure?

Our focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

An entire chapter of the Google SRE book was dedicated to the benefits of improving data center provisioning via automation; however, the description was abstract with a focus on the importance of validation testing and self-healing. That lack of detail is not surprising: Google’s infrastructure automation is highly specialized and considered a competitive advantage.

Shouldn’t everyone be able to do this?

After all, data centers are built from the same basic components with the same protocols.

Unfortunately, the stack of small (but critical) variations between these components makes it very difficult to build a universal solution. Reasonable variations like hardware configuration, vendor out-of-band management protocol, operating system, support systems and networking topologies add up quickly. Even Google, with their tremendous SRE talent and time investments, only built a solution for their specific needs.

To handle this variation, our SRE teams bake assumptions about their infrastructure directly into their automation. That’s expedient because there’s generally little operational reward for creating generic solutions for specific problems. I see this all the time in data centers that have server naming conventions and IP address schemes that are the automation glue between their tools and processes. While this may be a practical tactic for integration, it is fragile and site specific.

Hard coding your operational environment into automation has serious downsides.

First, it creates operational debt [reference] just like hard coding values in regular development. Please don’t mistake this as a call for yak shaving provisioning scripts into open ended models! There’s a happy medium where the scripts can be robust about infrastructure details like IP addresses, NIC ordering, system names and operating system behavior without compromising readability and development time.

Second, it eliminates reuse because code that works in one place must be forked (or copied) to be used again. Forking creates multiple versions of the truth and adds technical debt. Unlike a shared script, the forked scripts do not benefit from mutual improvements. This is true for both internal use and when external communities advance. I have seen many cases where a company’s decision to fork away from open source code to “adjust it for their needs” causes them to forever lose the benefits accrued in the upstream community.

Consequently, Ops debt is quickly created when these infrastructure specific items are coded into the scripts because you have to touch a lot of code to make small changes. You also end up with hidden dependencies.
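To make the difference concrete, here is a minimal sketch that contrasts a script with the site baked in against one that takes site details as input. The host names, addresses and the `SiteFacts` type are all hypothetical and chosen only for illustration; this is not Digital Rebar code.

```python
from dataclasses import dataclass


# Brittle: the site's naming convention and address scheme are baked in.
def db_address_hardcoded(hostname: str) -> str:
    # Assumes names like "db-07" and a fixed 10.20.30.0/24 database subnet.
    return "10.20.30." + str(int(hostname.split("-")[1]))


@dataclass(frozen=True)
class SiteFacts:
    """Hypothetical per-site inputs, kept outside the shared script."""
    addresses: dict  # hostname -> IP address, for example from an IPAM export


def db_address(hostname: str, site: SiteFacts) -> str:
    # Shared logic looks the address up instead of deriving it from the name.
    try:
        return site.addresses[hostname]
    except KeyError:
        raise LookupError(f"no address recorded for {hostname!r}") from None


site_a = SiteFacts(addresses={"db-07": "10.20.30.7"})
site_b = SiteFacts(addresses={"sql.rack4.example": "192.168.4.15"})
```

The first function only works at the site whose conventions it encodes. The second can be shared as-is, because everything that differs between `site_a` and `site_b` lives in the data that is passed in.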

However, until recently, we have not given SRE teams an alternative to site customization.

Of course, the alternative requires some additional investment up front.  Hard coding and forking are faster out of the gate; however, the SRE mandate is to aggressively reduce ongoing maintenance tasks wherever possible.  When core automation is site customized, Ops loses the benefits of reuse both internally and externally.

That’s why we believe SRE teams should work to reuse automation whenever possible.

Digital Rebar was built from our frustration watching the OpenStack community struggle with exactly this lesson. We felt that having a platform for sharing code was essential; however, we also observed that differences between sites made it impossible to share code. Our solution was to isolate those changes into composable units. That isolation allowed us to take a system integration view that did not break when inevitable changes were introduced.

If you are interested in breaking out of the script customization death spiral then review what the RackN team has done with Digital Rebar.

Even if you don’t use the code, the approach could save your SRE team a lot of heartburn down the road.  Of course, if you do want to use it then just contact us at sre@rackn.com.

Surgical Ansible & Script Injections before, during or after deployment

RackN CEO, Rob Hirschfeld, has been posting about our unique composable operations approach with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling.

This post shows some remarkable flexibility enabled by the approach that allows operators to take limited, secure operations against running systems.

via Surgical Ansible & Script Injections before, during or after deployment. — Rob Hirschfeld

 

Surgical Ansible & Script Injections before, during or after deployment.

I’ve been posting about the unique composable operations approach the RackN team has taken with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling.  The orchestration design (what we call annealing) allows us to dynamically add roles to the environment and execute them as single role/node interactions in operational chains.

With our latest patches (short demo videos below), you can now create single role Ansible or Bash scripts dynamically and then incorporate them into the node execution.

That makes it very easy to extend an existing deployment on-the-fly for quick changes or as part of a development process.

You can also run an ad hoc bash script against one machine or groups of machines. If that script is something unique to your environment, you can manage it without having to push it back upstream because Digital Rebar workloads are composable and designed to be safely integrated from multiple sources.

Beyond tweaking running systems, this is the fastest script development workflow that I’ve ever seen. I can make fast, surgical iterative changes to my scripts without having to rerun whole playbooks or runlists. Even better, I can build multiple operating system environments side-by-side and test changes in parallel.

For secure environments, I don’t have to hand out user SSH access to systems because the actions run in Digital Rebar context.  Digital Rebar can limit control per user or tenant.

I’m very excited about how this capability can be used for dev, test and production systems.  Check it out and let me know what you think.

 

 

 

Digital Rebar Training Videos

We’re excited to announce an updated set of Digital Rebar training videos.  In response to requests to go beyond the simple Quick Start guide, we created a dedicated training channel and have been producing 15 minute tutorials on a wide range of topics.

Want us to cover a topic?  Just ask us on Gitter!

 

In some cases, these videos contain information that has not made it into documentation yet. Our documentation is open source, and we’d love to incorporate your notes to help make the experience easier for the next user.

Thanks!

Provisioned Secure By Default with Integrated PKI & TLS Automation

Today, I’m presenting this topic (PKI automation & rotation) at Defragcon  so I wanted to share this background more broadly as a companion for that presentation.  I know this is a long post – hang with me, PKI is complex.

Building automation that creates a secure infrastructure is as critical as it is hard to accomplish. For all that we talk about repeatable automation, actually doing it securely is a challenge. Why? Because we cannot simply encode passwords, security tokens or trust into our scripts. Even more critically, secure configuration is antithetical to general immutable automation: it requires that each unit is different and unique.

Over the summer, the RackN team expanded open source Digital Rebar to include roles that build a service-by-service internal public key infrastructure (PKI).

This is a significant advance in provisioning infrastructure because it allows bootstrapping transport layer security (TLS) encryption without having to assume trust at the edges. This is not general PKI: the goal is for internal trust zones that have no external trust anchors.

Before I explain the details, it’s important to understand that RackN did not build a new encryption model!  We leveraged the ones that exist and automated them.  The challenge has been automating PKI local certificate authorities (CA) and tightly scoped certificates with standard configuration tools.  Digital Rebar solves this by merging service management, node configuration and orchestration.

I’ll try to break this down into the key elements of encryption, keys and trust.

The goal is simple: we want to be able to create secure communications (that’s TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog (that’s PKI). These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

To stand up a secure REST API service, we need to create a private key held by the server and a public key that is given to each client that wants secure communication with the server.

Now the parties can create secure communications (TLS) between networked services.
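To make the public and private pairing concrete, here is a toy sketch using textbook RSA with deliberately tiny primes. The numbers and function names are illustrative assumptions and have nothing to do with how Digital Rebar or TLS are implemented; real systems rely on vetted libraries, padding schemes and far larger keys.

```python
# Toy textbook RSA with tiny primes: for illustration only, never for real use.
p, q = 61, 53
n = p * q                    # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, chosen coprime with phi
d = pow(e, -1, phi)          # private exponent: modular inverse (Python 3.8+)

public_key = (e, n)          # handed to every client
private_key = (d, n)         # never leaves the server


def encrypt(message: int, key: tuple) -> int:
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext: int, key: tuple) -> int:
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


secret = 65                                    # must be smaller than n
ciphertext = encrypt(secret, public_key)       # anyone holding the public key can do this
recovered = decrypt(ciphertext, private_key)   # only the private key reverses it
```

The point of the sketch is the asymmetry: the client needs nothing secret to encrypt, while decryption depends on a value that only the server holds.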

Unfortunately, point-to-point key exchange is not enough to establish secure communications. It is too easy to impersonate a service or intercept traffic.

Part of the solution is to include holder identity information into the key itself such as the name or IP address of the server.  The more specific the information, the harder it is to break the trust.  Unfortunately, many automation patterns simply use wildcard (or unspecific) identity because it is very difficult for them to predict the IP address or name of a server.   To address that problem, we only generate certificates once the system details are known.  Even better, it’s then possible to regenerate certificates (known as key rotation) after initial deployment.

While identity improves things, it’s still not sufficient.  We need to have a trusted third party who can validate that the keys are legitimate to make the system truly robust.  In this case, the certificate authority (CA) that issues the keys signs them so that both parties are able to trust each other.  There’s no practical way to intercept communications between the trusted end points without signed keys from the central CA.  The system requires that we can build and maintain these three way relationships.  For public websites, we can rely on root certificates; however, that’s not practical or desirable for dynamic internal encryption needs.

So what did we do with Digital Rebar? We’ve embedded a certificate authority (CA) service, called “trust me”, into the core orchestration engine.

The Digital Rebar CA can be told to generate a root certificate on a per service basis.  When we add a server for that service, the CA issues a unique signed certificate matching the server identity.  When we add a client for that service, the CA issues a unique signed public key for the client matching the client’s identity.  The server will reject communication from unknown public keys.  In this way, each service is able to ensure that it is only communicating with trusted end points.
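The scoping rules described above can be modeled in a few lines. This is a rough sketch and not the “trust me” service: it uses HMAC from Python’s standard library as a stand-in for certificate signing, so the authority object both issues and checks credentials, whereas a real PKI verifies with the CA’s public certificate. The class name, service names and addresses are hypothetical.

```python
import hashlib
import hmac
import secrets


class ServiceCA:
    """Hypothetical per-service authority; HMAC stands in for certificate signing."""

    def __init__(self, service: str):
        self.service = service
        self._root = secrets.token_bytes(32)  # stand-in for the service's root key

    def _sign(self, identity: str) -> str:
        msg = f"{self.service}|{identity}".encode()
        return hmac.new(self._root, msg, hashlib.sha256).hexdigest()

    def issue(self, identity: str) -> dict:
        # Issued only once the node's real name or address is known.
        return {"service": self.service, "identity": identity,
                "signature": self._sign(identity)}

    def trusts(self, credential: dict, presented_identity: str) -> bool:
        # Reject other services, mismatched identities and forged signatures.
        if credential.get("service") != self.service:
            return False
        if credential.get("identity") != presented_identity:
            return False
        return hmac.compare_digest(credential.get("signature", ""),
                                   self._sign(presented_identity))


etcd_ca = ServiceCA("etcd")
api_ca = ServiceCA("kube-api")
client_cred = etcd_ca.issue("10.0.0.12")
```

In this model, `etcd_ca.trusts(client_cred, "10.0.0.12")` succeeds, while presenting the same credential from a different address or to `api_ca` fails. Creating a fresh `ServiceCA("etcd")` invalidates everything issued earlier, which is a crude analogue of rotating the root.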

Wow, that’s a lot of information!  Getting security right is complex and often neglected.  Our focus is provisioning automation, so these efforts do not cover all PKI lifecycle issues or challenges.  We’ve got a long list of integrations, tools and next steps that we’d like to accomplish.

Our goal was to automate building secure communication as a default.  We think these enhancements to Digital Rebar are a step in that direction.  Please let us know if you think this approach is helpful.

Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

In part 1, we reviewed the RackN team’s hard-won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make it more expensive to maintain operations.

If it’s more expensive to isolate, then why have we failed to create a shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration). While they all come back to thinking of the whole system as a cluster instead of individual nodes, let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configuration until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues. Taking this to the next logical step, doing upgrades requires sequence control with circuit breakers: that’s exactly what Digital Rebar was built to provide.
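As a rough illustration of sequence control with a circuit breaker, the sketch below runs steps in a fixed order, passes each step the information produced by earlier ones, and halts on the first failure instead of retrying until convergence. The step names and the `CircuitOpen` exception are made up for this example and are not Digital Rebar’s API.

```python
from typing import Callable, List, Tuple

Step = Callable[[dict], dict]


class CircuitOpen(RuntimeError):
    """Raised when a step fails and the remaining steps are skipped."""


def run_sequence(steps: List[Tuple[str, Step]], config: dict) -> dict:
    # Each step sees what earlier steps produced, so no step runs before its inputs exist.
    for name, step in steps:
        try:
            config = {**config, **step(config)}
        except Exception as err:
            # Trip the breaker: stop here rather than banging on it until it converges.
            raise CircuitOpen(f"halted at {name!r}: {err}") from err
    return config


def elect_leader(cfg: dict) -> dict:
    return {"leader": sorted(cfg["nodes"])[0]}


def join_followers(cfg: dict) -> dict:
    return {"followers": [n for n in cfg["nodes"] if n != cfg["leader"]]}


result = run_sequence(
    [("elect_leader", elect_leader), ("join_followers", join_followers)],
    {"nodes": ["node-b", "node-a", "node-c"]},
)
```

Running the two steps in the opposite order raises `CircuitOpen` at `join_followers`, because the leader it depends on has not been chosen yet. That failure is immediate and names the step, which is much easier to troubleshoot than a timing-dependent retry loop.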

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are an intertwined mix of service, node and cross-node configuration. Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates. Once the platform is installed, then we use the platform as a service. On physical infrastructure, there are even more services, including DNS, IPAM and provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance.  Using a load balancer?  You can’t configure it until you’ve got the node addresses allocated.  And then it needs to be updated as you manage your cluster.  This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration.  The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence.  For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries.  The same applies for PKI, Load Balancer and Networking.
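Here is a small sketch of what mixing service and node configuration in one workflow can look like. The `LoadBalancer` class, the address pool and the node names are hypothetical stand-ins; the only point is that the service update happens in the middle of the node workflow, using a fact that did not exist beforehand.

```python
class LoadBalancer:
    """Hypothetical stand-in for an external service (load balancer, DNS and so on)."""

    def __init__(self):
        self.backends = set()

    def register(self, address: str) -> None:
        self.backends.add(address)


def allocate_address(node: dict, pool: list) -> dict:
    # Node step: the address is unknown until this point in the workflow.
    return {**node, "address": pool.pop(0)}


def configure_node(node: dict, lb: LoadBalancer, pool: list) -> dict:
    node = allocate_address(node, pool)
    lb.register(node["address"])        # service step, using the fresh node fact
    return {**node, "state": "ready"}   # node step resumes afterwards


lb = LoadBalancer()
pool = ["10.0.0.11", "10.0.0.12"]
nodes = [configure_node({"name": n}, lb, pool) for n in ("web-1", "web-2")]
```

Nothing about the load balancer had to be predicted in advance: its backend list is built up as each node learns its own address.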

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.
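A minimal sketch of that functional style, assuming each configuration step is a plain function from one configuration dictionary to a new one. The step names and the `compose` helper are invented for illustration and do not reflect Digital Rebar’s actual role format.

```python
from functools import reduce


def docker_engine(cfg: dict) -> dict:
    return {**cfg, "engine": "docker"}


def containerd_engine(cfg: dict) -> dict:
    return {**cfg, "engine": "containerd"}


def overlay_network(cfg: dict) -> dict:
    return {**cfg, "network": "overlay"}


def node_agent(cfg: dict) -> dict:
    # Depends only on the keys it reads, not on which module produced them.
    return {**cfg, "agent": f"agent using {cfg['engine']} over {cfg['network']}"}


def compose(*steps):
    # Chain steps left to right; every step returns a new dict and mutates nothing.
    return lambda cfg: reduce(lambda acc, step: step(acc), steps, cfg)


site_a = compose(docker_engine, overlay_network, node_agent)
site_b = compose(containerd_engine, overlay_network, node_agent)
```

Swapping the container engine changes one link in the chain. The networking and agent steps are reused untouched, which is the property we want when sites or versions differ.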

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us into an unmaintainable lock-step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing bootstrap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.

In summary, we want composition because we are totally against ops magic. No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and pray” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic way and see how it can help create a more portable and sustainable IT foundation.

Kubernetes the NOT-so-hard way (7 RackN additions: keeping transparency, adding security)

At RackN, we take the KISS principle to heart. Here are the seven ways that we worked to make Kubernetes easier to install and manage.

Container community crooner, Kelsey Hightower, created a definitive installation guide that he dubbed “Kubernetes the Hard Way” or KTHW. In that document, he laid out a manual sequence of steps needed to bring up a working Kubernetes Cluster. For some, the lengthy sequence served as a rallying cry to simplify and streamline the “boot to kube” process with additional configuration harnesses, more bells and some new whistles.

For the RackN team, Kelsey’s process looked like a reliable and elegant basis for automation. So, we automated that and eliminated the hard parts (see video).

 

Seven improvements for KTHW

Our operational approach to distributed systems (encoded in Digital Rebar) drives towards keeping things simple and transparent in operation.  When creating automation, it’s way too easy to add complexity that works on a desktop for a developer, but fails as we scale or move into sustaining operations.

The benefit of Kelsey’s approach was that it had to be simple enough to reproduce and troubleshoot manually; however, there were several KTHW challenges that we wanted to streamline while we automated.

  1. Respect the manual steps: Just automating is not enough. We wanted to be true to the steps so that users of the automation could look back at the process and understand it. The beauty of KTHW is that operators can read it and understand the inner workings of Kubernetes.
  2. Node Inventory: Manual node allocation is time consuming and error prone. We believe that the process should be able to (but not be required to) start from zero with just raw hardware or cloud credentials. Anything else opens up a lot of potential configuration errors.
  3. Automatic Iteration: Going back to make adjustments to previous nodes is normal in cluster building and really annoying for users. This is especially true when clusters are expanded or contracted.
  4. PKI Security: We love that Kubernetes requires TLS communication; however, we’re generally horrified about sharing around private keys and wild card certificates even for development and test clusters.
  5. Go & SystemD: We use containers for everything in Digital Rebar and our design has a lot of RESTful services behind a reverse proxy; however, it’s simply not needed for Kubernetes. Kubernetes binaries are portable Golang programs and just the API service is a web service. We feel strongly that the simplest and most robust deployment just runs these programs under SystemD. It is just as easy to curl a single file and restart a service as doing a docker pull. In fact, it’s measurably simpler, more secure and reliable.
  6. Pluggability: It’s hard to allow variation in a manual process. With Kubernetes’ open ecosystem, we see a need for operators to make practical configuration choices without straying dramatically from Kelsey’s basic process. Changes to the container run time or network model should not result in radically different install steps because the fundamentals of Kubernetes are not changed by these choices.
  7. Parallel Deploys & CI/CD Deployments: When we work on cluster deploys, we spin up lots and lots of independent installs to test variations and changes like AWS and Google and OpenStack or Ubuntu and CentOS. Consequently, it is important that we can run multiple installs in parallel. Once that works, we want to have CI driven setup, test and tear down processes.
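The parallel deploy idea in item 7 can be sketched as a test matrix run through a worker pool. This is only an outline under stated assumptions: `deploy_and_test` is a placeholder for the real set up, test and tear down of one install, and the cluster naming is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

PROVIDERS = ["aws", "google", "openstack"]
SYSTEMS = ["ubuntu", "centos"]


def deploy_and_test(provider: str, system: str) -> dict:
    # Placeholder for set up, test and tear down of one independent install.
    cluster = f"kthw-{provider}-{system}"
    return {"cluster": cluster, "passed": True}


def run_matrix(max_parallel: int = 4) -> list:
    combos = list(product(PROVIDERS, SYSTEMS))
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() keeps results in the same order as the input combinations.
        return list(pool.map(lambda combo: deploy_and_test(*combo), combos))


results = run_matrix()
```

Because each install is independent, a CI job could call something like `run_matrix()` on every change and fail the build if any entry reports `passed` as false.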

We’re excited about the clean, fast and portable installation that came out of our efforts to automate the KTHW process. We hope that you’ll take a look at our approach and help us continue to improve and streamline Kubernetes (and other!) platform installs.