Breaking Up is Hard To Do – Why I Believe Ops Decomposition (pt 1)

Over the summer, the RackN team took a radical step with our previous Ansible Kubernetes workload install: we broke it into pieces.  Why?  We wanted to eliminate all “magic happens here” steps in the deployment.

320px-dominos_fallingThe result, DR Kompos8, is a faster, leaner, transparent and parallelized installation that allows for pluggable extensions and upgrades (video tour). We also chose the operationally simplest configuration choice: Golang binaries managed by SystemDGolang binaries managed by SystemD.

Why decompose and simplify? Let’s talk about our hard earned ops automation battle scars that let to composability as a core value:

Back in the early OpenStack days, when the project was actually much simpler, we were part of a community writing Chef Cookbooks to install it. These scripts are just a sequence of programmable steps (roles in Ops-speak) that drive the configuration of services on each node in the cluster. There is an ability to find cross-cluster information and lookup local inventory so we were able to inject specific details before the process began. However, once the process started, it was pretty much like starting a dominoes chain. If anything went wrong anywhere in the installation, we had to reset all the dominoes and start over.

Like a dominoes train, it is really fun to watch when it works. Also, like dominoes, it is frustrating to set up and fix. Often we literally were holding our breath during installation hoping that we’d anticipated every variation in the software, hardware and environment. It is no surprise that the first and must critical feature we’d created was a redeploy command.

It turned out the the ability to successfully redeploy was the critical measure for success. We would not consider a deployment complete until we could wipe the systems and rebuild it automatically at least twice.

What made cluster construction so hard? There were a three key things: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).

We’ll explore these three reasons in detail for part 2 of this post tomorrow.

Even without the details, it easy to understand that we want to avoid all magic in a deployment.

For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Likewise, we need to eliminate “it worked from my desktop” automation too.  Those systems are impossible to maintain, share and scale. Composed cluster operations addresses this problem by making work modular, predictable and transparent.

Kubernetes the NOT-so-hard way (7 RackN additions: keeping transparency, adding security)

At RackN, we take the KISS principle to heart, here are the seven ways that we worked to make Kubernetes easier to install and manage.

Container community crooner, Kelsey Hightower, created a definitive installation guide that he dubbed “Kubernetes the Hard Way” or KTHW.  In that document, he laid out a manual sequence of steps needed to bring up a working Kubernetes Cluster.  For some, the lengthy sequence served as a rally cry to simplify and streamline the “boot to kube” process with additional configuration harnesses, more bells and and some new whistles.

For the RackN team, Kelsey’s process looked like a reliable and elegant basis for automation.  So, we automated that and eliminated the hard parts (see video)


Seven improvements for KTHW

Our operational approach to distributed systems (encoded in Digital Rebar) drives towards keeping things simple and transparent in operation.  When creating automation, it’s way too easy to add complexity that works on a desktop for a developer, but fails as we scale or move into sustaining operations.

The benefit of Kelsey’s approach was that it had to be simple enough to reproduce and troubleshoot manually; however, there were several KTHW challenges that we wanted to streamline while we automated.

  1. Respect the manual steps: Just automating is not enough. We wanted to be true to the steps so that users of the automation could look back that the process and understand it. The beauty of KTHW is that operators can read it and understand the inner workings of Kubernetes.
  2. Node Inventory: Manual node allocation is time consuming and error prone. We believe that the process should be able to (but not require a) start from zero with just raw hardware or cloud credentials. Anything else opens up a lot of potential configuration errors.
  3. Automatic Iteration: Going back to make adjustments to previous nodes is normal in cluster building and really annoying for users. This is especially true when clusters are expanded or contracted.
  4. PKI Security: We love that Kubernetes requires TLS communication; however, we’re generally horrified about sharing around private keys and wild card certificates even for development and test clusters.
  5. Go & SystemD: We use containers for a everything in Digital Rebar and our design has a lot of RESTful services behind a reverse proxy; however, it’s simply not needed for Kubernetes. Kubernetes binary are portable Golang programs and just the API service is a web service. We feel strongly that the simplest and most robust deployment just runs these programs under SystemD. It is just as easy to curl a single file and restart a service as the doing a docker pull. In fact, it’s measurably simpler, more secure and reliable.
  6. Pluggability: It’s hard to allow variation in a manual process. With Kubernetes open ecosystem, we see a need to operators to make practical configuration choices without straying dramatically from Kelsey’s basic process. Changes to the container run time or network model should not result in radically different install steps because the fundamentals of Kubernetes are not changed by these choices.
  7. Parallel Deploys & CI/CD Deployments: When we work on cluster deploys, we spin up lots and lots of independent installs to test variations and changes like AWS and Google and OpenStack or Ubuntu and Centos.  Consequently, it is important that we can run multiple installs in parallel.  Once that works, we want to have CI driven setup, test and tear down processes.

We’re excited about the clean, fast and portable installation the came out of our efforts to automation the KTHW process. We hope that you’ll take a look at our approach and help us continue to improve and streamline Kubernetes (and other!) platform installs.