What does it take to Operate Open Platforms? Answers in Datanaughts 72

Did I just let OpenStack ops off the hook….?  Kubernetes production challenges…?  

ix34grhy_400x400I had a lot of fun in this Datanaughts wide ranging discussion with unicorn herders Chris Wahl and Ethan Banks.  I like the three section format because it gives us a chance to deep dive into distinct topics and includes some out-of-band analysis by the hosts; however, that means you need to keep listening through the commercial breaks to hear the full podcast.

Three parts?  Yes, Chris and Ethan like to save the best questions for last.

In Part 1, we went deep into the industry operational and business challenges uncovered by the OpenStack project. Particularly, Chris and I go into “platform underlay” issues which I laid out in my “please stop the turtles” post. This was part of the build-up to my SRE series.

In Part 2, we explore my operations-focused view of the latest developments in container schedulers with a focus on Kubernetes. Part of the operational discussion goes into architecture “conceits” (or compromises) that allow developers to get the most from cloud native design patterns. I also make a pitch for using proven tools to run the underlay.

In Part 3, we go deep into DevOps automation topics of configuration and orchestration. We talk about the design principles that help drive “day 2” automation and why getting in-place upgrades should be an industry priority.  Of course, we do cover some Digital Rebar design too.

Take a listen and let me know what you think!

On Twitter, we’ve already started a discussion about how much developers should care about infrastructure. My opinion (posted here) is that one DevOps idea where developers “own” infrastructure caused a partial rebellion towards containers.

Surgical Ansible & Script Injections before, during or after deployment.

I’ve been posting about the unique composable operations approach the RackN team has taken with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling.  The orchestration design (what we call annealing) allows us to dynamically add roles to the environment and execute them as single role/node interactions in operational chains.

ansiblemtaWith our latest patches (short demo videos below), you can now create single role Ansible or Bash scripts dynamically and then incorporate them into the node execution.

That makes it very easy to extend an existing deployment on-the-fly for quick changes or as part of a development process.

You can also run an ad hoc bash script against one or groups of machines.  If that script is something unique to your environment, you can manage it without having to push it back upsteam because Digital Rebar workloads are composable and designed to be safely integrated from multiple sources.

Beyond tweaking running systems, this is fastest script development workflow that I’ve ever seen.  I can make fast, surgical iterative changes to my scripts without having to rerun whole playbooks or runlists.  Even better, I can build multiple operating system environments side-by-side and test changes in parallel.

For secure environments, I don’t have to hand out user SSH access to systems because the actions run in Digital Rebar context.  Digital Rebar can limit control per user or tenant.

I’m very excited about how this capability can be used for dev, test and production systems.  Check it out and let me know what you think.




Provisioned Secure By Default with Integrated PKI & TLS Automation

Today, I’m presenting this topic (PKI automation & rotation) at Defragcon  so I wanted to share this background more broadly as a companion for that presentation.  I know this is a long post – hang with me, PKI is complex.

Building automation that creates a secure infrastructure is as critical as it is hard to accomplish. For all the we talk about repeatable automation, actually doing it securely is a challenge. Why? Because we cannot simply encode passwords, security tokens or trust into our scripts. Even more critically, secure configuration is antithetical to general immutable automation: it requires that each unit is different and unique.

Over the summer, the RackN team expanded open source Digital Rebar to include roles that build a service-by-service internal public key infrastructure (PKI).

untitled-drawingThis is a significant advance in provisioning infrastructure because it allows bootstrapping transport layer security (TLS) encryption without having to assume trust at the edges.  This is not general PKI: the goal is for internal trust zones that have no external trust anchors.

Before I explain the details, it’s important to understand that RackN did not build a new encryption model!  We leveraged the ones that exist and automated them.  The challenge has been automating PKI local certificate authorities (CA) and tightly scoped certificates with standard configuration tools.  Digital Rebar solves this by merging service management, node configuration and orchestration.

I’ll try and break this down into the key elements of encryption, keys and trust.

The goal is simple: we want to be able to create secure communications (that’s TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog (that’s PKI). These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

To stand up a secure REST API service, we need to create a private key held by the server and a public key that is given to each client that wants secure communication with the server.

Now the parties can create secure communications (TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog. These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

Unfortunately, point-to-point key exchange is not enough to establish secure communications.  It too easy to impersonate a service or intercept traffic.  

Part of the solution is to include holder identity information into the key itself such as the name or IP address of the server.  The more specific the information, the harder it is to break the trust.  Unfortunately, many automation patterns simply use wildcard (or unspecific) identity because it is very difficult for them to predict the IP address or name of a server.   To address that problem, we only generate certificates once the system details are known.  Even better, it’s then possible to regenerate certificates (known as key rotation) after initial deployment.

While identity improves things, it’s still not sufficient.  We need to have a trusted third party who can validate that the keys are legitimate to make the system truly robust.  In this case, the certificate authority (CA) that issues the keys signs them so that both parties are able to trust each other.  There’s no practical way to intercept communications between the trusted end points without signed keys from the central CA.  The system requires that we can build and maintain these three way relationships.  For public websites, we can rely on root certificates; however, that’s not practical or desirable for dynamic internal encryption needs.

So what did we do with Digital Rebar?  We’ve embedded a certificate authority (CA) service into the core orchestration engine (called “trust me”).  

The Digital Rebar CA can be told to generate a root certificate on a per service basis.  When we add a server for that service, the CA issues a unique signed certificate matching the server identity.  When we add a client for that service, the CA issues a unique signed public key for the client matching the client’s identity.  The server will reject communication from unknown public keys.  In this way, each service is able to ensure that it is only communicating with trusted end points.

Wow, that’s a lot of information!  Getting security right is complex and often neglected.  Our focus is provisioning automation, so these efforts do not cover all PKI lifecycle issues or challenges.  We’ve got a long list of integrations, tools and next steps that we’d like to accomplish.

Our goal was to automate building secure communication as a default.  We think these enhancements to Digital Rebar are a step in that direction.  Please let us know if you think this approach is helpful.

Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

In part pt 1, we reviewed the RackN team’s hard won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make it more expensive to maintain operations.

aid1165255-728px-install-pergo-flooring-step-5-version-2If it’s more expensive isolate then why have we failed to create shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are a three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).  While they all come back to thinking of the whole system as a cluster instead of individual nodes. let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system, is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configure until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues.  Taking this to the next logical steps, doing upgrades, require sequence control with circuit breakers – that’s exactly what Digital Rebar was built to provide.

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are intertwined mix of service, node and cross-node configuration.  Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates.  Once the platform is installed, then we use the platform as a services.  On physical, there are even more including DNS, IPAM and Provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance.  Using a load balancer?  You can’t configure it until you’ve got the node addresses allocated.  And then it needs to be updated as you manage your cluster.  This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration.  The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence.  For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries.  The same applies for PKI, Load Balancer and Networking.

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us to into an unmaintainable lock step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing boot-strap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.
In summary, we want composition because we are totally against ops magic.  No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic why and see how it can help create a more portable and sustainable IT foundation.

Breaking Up is Hard To Do – Why I Believe Ops Decomposition (pt 1)

Over the summer, the RackN team took a radical step with our previous Ansible Kubernetes workload install: we broke it into pieces.  Why?  We wanted to eliminate all “magic happens here” steps in the deployment.

320px-dominos_fallingThe result, DR Kompos8, is a faster, leaner, transparent and parallelized installation that allows for pluggable extensions and upgrades (video tour). We also chose the operationally simplest configuration choice: Golang binaries managed by SystemDGolang binaries managed by SystemD.

Why decompose and simplify? Let’s talk about our hard earned ops automation battle scars that let to composability as a core value:

Back in the early OpenStack days, when the project was actually much simpler, we were part of a community writing Chef Cookbooks to install it. These scripts are just a sequence of programmable steps (roles in Ops-speak) that drive the configuration of services on each node in the cluster. There is an ability to find cross-cluster information and lookup local inventory so we were able to inject specific details before the process began. However, once the process started, it was pretty much like starting a dominoes chain. If anything went wrong anywhere in the installation, we had to reset all the dominoes and start over.

Like a dominoes train, it is really fun to watch when it works. Also, like dominoes, it is frustrating to set up and fix. Often we literally were holding our breath during installation hoping that we’d anticipated every variation in the software, hardware and environment. It is no surprise that the first and must critical feature we’d created was a redeploy command.

It turned out the the ability to successfully redeploy was the critical measure for success. We would not consider a deployment complete until we could wipe the systems and rebuild it automatically at least twice.

What made cluster construction so hard? There were a three key things: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).

We’ll explore these three reasons in detail for part 2 of this post tomorrow.

Even without the details, it easy to understand that we want to avoid all magic in a deployment.

For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Likewise, we need to eliminate “it worked from my desktop” automation too.  Those systems are impossible to maintain, share and scale. Composed cluster operations addresses this problem by making work modular, predictable and transparent.

Kubernetes the NOT-so-hard way (7 RackN additions: keeping transparency, adding security)

At RackN, we take the KISS principle to heart, here are the seven ways that we worked to make Kubernetes easier to install and manage.

Container community crooner, Kelsey Hightower, created a definitive installation guide that he dubbed “Kubernetes the Hard Way” or KTHW.  In that document, he laid out a manual sequence of steps needed to bring up a working Kubernetes Cluster.  For some, the lengthy sequence served as a rally cry to simplify and streamline the “boot to kube” process with additional configuration harnesses, more bells and and some new whistles.

For the RackN team, Kelsey’s process looked like a reliable and elegant basis for automation.  So, we automated that and eliminated the hard parts (see video)


Seven improvements for KTHW

Our operational approach to distributed systems (encoded in Digital Rebar) drives towards keeping things simple and transparent in operation.  When creating automation, it’s way too easy to add complexity that works on a desktop for a developer, but fails as we scale or move into sustaining operations.

The benefit of Kelsey’s approach was that it had to be simple enough to reproduce and troubleshoot manually; however, there were several KTHW challenges that we wanted to streamline while we automated.

  1. Respect the manual steps: Just automating is not enough. We wanted to be true to the steps so that users of the automation could look back that the process and understand it. The beauty of KTHW is that operators can read it and understand the inner workings of Kubernetes.
  2. Node Inventory: Manual node allocation is time consuming and error prone. We believe that the process should be able to (but not require a) start from zero with just raw hardware or cloud credentials. Anything else opens up a lot of potential configuration errors.
  3. Automatic Iteration: Going back to make adjustments to previous nodes is normal in cluster building and really annoying for users. This is especially true when clusters are expanded or contracted.
  4. PKI Security: We love that Kubernetes requires TLS communication; however, we’re generally horrified about sharing around private keys and wild card certificates even for development and test clusters.
  5. Go & SystemD: We use containers for a everything in Digital Rebar and our design has a lot of RESTful services behind a reverse proxy; however, it’s simply not needed for Kubernetes. Kubernetes binary are portable Golang programs and just the API service is a web service. We feel strongly that the simplest and most robust deployment just runs these programs under SystemD. It is just as easy to curl a single file and restart a service as the doing a docker pull. In fact, it’s measurably simpler, more secure and reliable.
  6. Pluggability: It’s hard to allow variation in a manual process. With Kubernetes open ecosystem, we see a need to operators to make practical configuration choices without straying dramatically from Kelsey’s basic process. Changes to the container run time or network model should not result in radically different install steps because the fundamentals of Kubernetes are not changed by these choices.
  7. Parallel Deploys & CI/CD Deployments: When we work on cluster deploys, we spin up lots and lots of independent installs to test variations and changes like AWS and Google and OpenStack or Ubuntu and Centos.  Consequently, it is important that we can run multiple installs in parallel.  Once that works, we want to have CI driven setup, test and tear down processes.

We’re excited about the clean, fast and portable installation the came out of our efforts to automation the KTHW process. We hope that you’ll take a look at our approach and help us continue to improve and streamline Kubernetes (and other!) platform installs.

Open Source as Reality TV and Burning Data Centers [gcOnDemand podcast notes]

During the OpenStack summit, Eric Wright (@discoposse) and I talked about a wide range of topics from scoring success of OpenStack early goals to burning down traditional data centers.

Why burn down your data center (and move to public cloud)? Because your ops process are too hard to change. Rob talks about how hybrid provides a path if we can made ops more composable.

Here are my notes from the audio podcast (source):

1:30 Why “zehicle” as a handle? Portmanteau from electrics cars… zero + vehicle

Let’s talk about OpenStack & Cloud…

  • OpenStack History
    • 2:15 Rob’s OpenStack history from Dell and Hyperscale
    • 3:20 Early thoughts of a Cloud API that could be reused
    • 3:40 The practical danger of Vendor lock-in
    • 4:30 How we implemented “no main corporate owner” by choice
  • About the Open in OpenStack
    • 5:20 Rob decomposes what “open” means because there are multiple meanings
    • 6:10 Price of having all open tools for “always open” choice and process
    • 7:10 Observation that OpenStack values having open over delivering product
    • 8:15 Community is great but a trade off. We prioritize it over implementation.
  • Q: 9:10 What if we started later? Would Docker make an impact?
    • Part of challenge for OpenStack was teaching vendors & corporate consumers “how to open source”
  • Q: 10:40 Did we accomplish what we wanted from the first summit?
    • Mixed results – some things we exceeded (like growing community) while some are behind (product adoption & interoperability).
  • 13:30 Interop, Refstack and Defcore Challenges. Rob is disappointed on interop based on implementations.
  • Q: 15:00 Who completes with OpenStack?
    • There are real alternatives. APIs do not matter as much as we thought.
    • 15:50 OpenStack vendor support is powerful
  • Q: 16:20 What makes OpenStack successful?
    • Big tent confuses the ecosystem & push the goal posts out
    • “Big community” is not a good definition of success for the project.
  • 18:10 Reality TV of open source – people like watching train wrecks
  • 18:45 Hybrid is the reality for IT users
  • 20:10 We have a need to define core and focus on composability. Rob has been focused on the link between hybrid and composability.
  • 22:10 Rob’s preference is that OpenStack would be smaller. Big tent is really ecosystem projects and we want that ecosystem to be multi-cloud.

Now, about RackN, bare metal, Crowbar and Digital Rebar….

  • 23:30 (re)Intro
  • 24:30 VC market is not metal friendly even though everything runs on metal!
  • 25:00 Lack of consistency translates into lack of shared ops
  • 25:30 Crowbar was an MVP – the key is to understand what we learned from it
  • 26:00 Digital Rebar started with composability and focus on operations
  • 27:00 What is hybrid now? Not just private to public.
  • 30:00 How do we make infrastructure not matter? Multi-dimensional hybrid.
  • 31:00 Digital Rebar is orchestration for composable infrastructure.
  • Q: 31:40 Do people get it?
    • Yes. Automation is moving to hybrid devops – “ops is ops” and it should not matter if it’s cloud or metal.
  • 32:15 “I don’t want to burn down my data center” – can you bring cloud ops to my private data center?