Beyond Expectations: OpenStack via Kubernetes Helm (Fully Automated with Digital Rebar)

Posted on February 20, 2017 by Rob H

RackN revisits OpenStack deployments with an eye on ongoing operations.

I’ve been an outspoken skeptic of a Joint OpenStack Kubernetes Environment (my OpenStack BCN preso, Super User follow-up and BOS Proposal) because I felt that the technical hurdles of cloud native architecture would prove challenging. Issues like stable service positioning and persistent data are requirements for OpenStack and hard problems in Kubernetes.

I was wrong: I underestimated how fast these issues could be addressed.

The Kubernetes Helm work out of the AT&T Comm Dev lab takes on the integration with a “do it the K8s native way” approach that the RackN team finds very effective. In fact, we’ve created a fully integrated Digital Rebar deployment that lays down Kubernetes using Kargo and then adds OpenStack via Helm. The provisioning automation includes a Ceph cluster to provide stateful sets for data persistence.

This joint approach dramatically reduces operational challenges associated with running OpenStack without taking over a general purpose Kubernetes infrastructure for a single task.

Given the rise of SRE thinking, the RackN team believes that this approach changes the game for OpenStack deployments and will ultimately dominate the field (which is already mainly containerized). There is still work to be completed: some complex configuration is required to allow both Kubernetes CNI and Neutron to collaborate so that containers and VMs can cross-communicate.

We are looking for companies that want to join in this work and fast-track it into production. If this is interesting, please contact us at sre@rackn.com.

Why should you sponsor? Current OpenStack operators facing “fork-lift upgrades” should want to find a path like this one that ensures future upgrades are baked into the plan. This approach provides a fast track to a general purpose, enterprise-grade, upgradable Kubernetes infrastructure.

Closing note from my past presentations: We’re making progress on the technical aspects of this integration; however, my concerns about market positioning remain.

“Why SRE?” Discussion with Eric @Discoposse Wright

Posted on February 17, 2017 by Rob H

My focus on the SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

I was a guest of Eric “@discoposse” Wright on the Green Circle Community podcast, episode #42 (my previous appearance).

LISTEN NOW: Podcast #42 (transcript)

In this action-packed 30-minute conversation, we discuss the industry forces putting pressure on operations teams. These pressures require operators to invest much more heavily in reusable automation.

That leads us towards why Kubernetes is interesting and what went wrong with OpenStack (I actually use the phrase “dumpster fire”). We ultimately talk about how those lessons are embedded in the Digital Rebar architecture.

Spiraling Ops Debt & the SRE coding imperative

Posted on January 17, 2017 by Rob H

This post is part of an SRE series grounded in the ideas inspired by the Google SRE book.

2/13 Update: You can hear an INTERACTIVE DISCUSSION based on this post with Eric Wright on his podcast, GC Online.

Every Ops team I know is underwater and doesn’t have the time to catch their breath.

Why does the load increase and leave Ops behind? It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors. Many teams simply step around their struggling compatriots and spin up yet more Ops work adding to the backlog. Dashing off yet another Ansible playbook to install on AWS is empowering but ultimately adds to the Ops sustaining backlog.

Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms and those platforms are built on rapidly evolving components. For example, Kubernetes has a 3-month release cycle. That’s really fast; however, it’s built on other components like Docker, SDN and operating systems that also have fast release cycles. That means that even your single Kubernetes infrastructure has many moving parts that may not be consistent in your own organization. For example, cloud deploys may use CoreOS while internal ones use a corporate-approved CentOS.
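
To make the variation problem concrete, here is a minimal sketch that counts how many distinct stacks a single "Kubernetes" deployment can become once each layer has a few choices. The component names and version lists are illustrative placeholders I picked for the example, not a real support matrix.

```python
from itertools import product

# Hypothetical inventory: every value below is an illustrative placeholder.
components = {
    "kubernetes": ["1.4", "1.5"],
    "container_runtime": ["docker-1.11", "docker-1.12"],
    "sdn": ["flannel", "calico", "weave"],
    "operating_system": ["coreos", "centos", "ubuntu"],
}


def stack_variants(inventory):
    """Return every combination of one choice per component."""
    names = list(inventory)
    return [dict(zip(names, combo)) for combo in product(*inventory.values())]


variants = stack_variants(components)
print(len(variants))  # 2 * 2 * 3 * 3 combinations
```

Every extra choice in any layer multiplies the count, which is why the variation grows much faster than the number of platforms.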

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen a container-fueled explosion in developer productivity and an AI-driven rise in new hardware-flavored instances. Both are powerful drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why the Google SRE teams require a 50% automation vs Ops ratio.

If Ops work takes more than 50% of the team’s time then the team slowly sinks under growing operational load. If you are not actively decreasing the load via automation then your teams get underwater and basic ops hygiene fails.
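
The sinking effect can be illustrated with a toy model. This is only a sketch under assumptions I made up for the example: ops work is always served first, leftover hours go to automation, and each automation hour removes a fixed amount of recurring ops work. The parameters are chosen so the tipping point lands at 50% of capacity; it is not Google's model or a measurement.

```python
def simulate_ops_load(start_load, weeks, capacity=100.0, growth=5.0, payoff=0.1):
    """Toy model of weekly team hours; every parameter is an assumption.

    start_load: hours of operational work demanded in week one
    capacity:   total team hours available per week
    growth:     new operational hours added each week by new demand
    payoff:     recurring ops hours removed per hour spent on automation
    """
    load = start_load
    history = []
    for _ in range(weeks):
        toil = min(load, capacity)      # ops work is served first
        automation = capacity - toil    # whatever time is left goes to automation
        load = max(0.0, load + growth - payoff * automation)
        history.append(load)
    return history


# Compare a team starting below the tipping point with one starting above it.
print(simulate_ops_load(40, 52)[-1])
print(simulate_ops_load(60, 52)[-1])
```

In this model the team that starts with less Ops than automation time drives its load down, while the team that starts above the line loses automation time every week until it has none left.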

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help. Stop writing automation that you can buy or re-use. Get help running it. Don’t waste time solving problems that other people have solved. That may mean some upfront learning and investment but if you aren’t getting out of your own way then you’ll be run over.

Can we control Hype & Over-Vendoring?

Posted on December 12, 2016 by Rob H

Q: Is over-vendoring when you’ve had too much to drink?
A: Yes, too much Kool Aid.

There’s a lot of information here – skip to the bottom if you want to see my recommendation.

Last week on TheNewStack, I offered eight ways to keep Kubernetes on the right track (abridged list here) and felt that item #6 needed more explanation and some concrete solutions.

DO: Focus on a Tight Core
DO: Build a Diverse Community
DO: Multi-cloud and Hybrid
DO: Be Humble and Honest
AVOID: “The One Ring” Universal Solution Hubris
AVOID: Over-Vendoring (discussed here)
AVOID: Coupling Installers, Brokers and Providers to the core
AVOID: Fast Release Cycles without LTS Releases

What is Over-Vendoring? It’s when vendors drive their companies’ brands ahead of the health of the project, generally by feeding an aggressive hype cycle in which every vendor is trying to jump on the bandwagon.

Hype can be very dangerous for projects (David Cassel’s TNS article) because it is easy to bypass the user needs and boring scale/stabilization processes to focus on vendor differentiation. Unfortunately, common use-cases do not drive differentiation and are invisible when it comes to company marketing budgets. That boring common core has the effect of creating a tragedy of the commons which undermines collaboration on shared code bases.

The solution is to aggressively keep the project core small so that vendors have specific and limited areas of coopetition.

A small core means we do not compel collaboration in many areas of the project. This drives competition and diversity that can be confusing. The temptation to endorse or nominate companion projects is risky due to the hype cycle. Endorsements can create a bias that actually hurts innovation because early or loud vendors do not generally create the best long term approaches. I’ve heard this described as “people doing the real work don’t necessarily have time to brag about it.”

Keeping a “small core” mantra drives a healthy plug-in model where vendors can differentiate. It also ensures that projects can succeed with a bounded set of core contributors and support infrastructure. That means that we should not measure success by commits, committers or lines of code because these will drop as projects successfully modularize. My recommendation for a key success metric is the ratio of committers to ecosystem members and users.

An improving ratio of core to ecosystem shows improving efficiency of investment. That’s a better sign of health than project growth.

It’s important to note that there is a serious risk of under-vendoring too!

We must recognize and support vendors in open source communities because they sustain the project via direct contributions and bringing users. For a healthy ecosystem, we need to ensure that vendors can fairly profit. That means they must be able to use their brand in combination with the project’s brand. Apache Project is the anti-pattern because they have very strict “no vendor” trademark marketing guidelines that can strand projects without good corporate support.

I’ve come to believe that it’s important to allow vendors to market open source project brands; however, they also need to have some limits on how they position the project.

How should this co-branding work? My thinking is that vendor claims about a project should be managed in a consistent and common way. Since we’re keeping the project core small, that should help limit the scope of the claims. Vendors that want to make ecosystem claims should be given clear spaces for marketing their own brand in participation with the project brand.

I don’t pretend that this is easy! Vendor marketing is planned quarters ahead of when open source projects are ready for them: that’s part of what feeds the hype cycle. That means that projects will be saying no to some free marketing from their ecosystem. Ideally, we’re saying yes to the right parts at the same time.

Ultimately, hype control means saying no to free marketing. For an open source project, that’s a hard but essential decision.

Cloudcast.net gem about Cluster Ops Gap

Posted on November 29, 2016 by Rob H

Podcast juxtaposition can be magical. In this case, I heard back-to-back sessions: first a pragmatic case for cluster operations and then how developers are rebelling against infrastructure.

Last week, I was listening to Brian Gracely’s “Automatic DevOps” discussion with John Troyer (CEO at TechReckoning, a community for IT pros) followed by his confusingly titled “operators” talk with Brandon Philips (CTO at CoreOS).

John’s mid-recording comments really resonated with me:

At 16 minutes: “IT is going to be the master of many environments… If you have an environment is hybrid & multi-cloud, then you still need to care about infrastructure… we are going to be living with that for at least 10 years.”

At 18 minutes: “We need a layer that is cloud-like, devops-like and agile-like that can still be deployed in multiple places. This middle layer, Cluster Ops, is really important because it’s the layer between the infrastructure and the app.”

The conversation with Brandon felt very different: the goal was to package everything “operator” into Kubernetes semantics, including Kubernetes running itself. This inception approach to running the cluster is irresistible within the community because the goal of the community is to stop having to worry about infrastructure. [Brian – call me if you want to do a podcast on the counterpoint to self-hosted].

Infrastructure is hard and complex. There’s good reason to limit how many people have to deal with that, but someone still has to deal with it.

I’m a big fan of container workloads generally and Kubernetes specifically as a way to help isolate application developers from infrastructure; consequently, it’s not designed to handle the messy infrastructure requirements that make Cluster Ops a challenge. This is a good thing because complexity explodes when platforms expose infrastructure details.

For Kubernetes and similar, I believe that injecting too much infrastructure mess undermines the simplicity of the platform.

There’s a different type of platform needed for infrastructure-aware cluster operations where automation needs to address complexity via composability. That’s what RackN is building with open Digital Rebar: a hybrid management layer that can consistently automate around infrastructure variation.

If you want to work with us to create system focused, infrastructure agnostic automation then take a look at the work we’ve been doing on underlay and cluster operations.

DevOps vs Cloud Native: Damn, where did all this platform complexity come from?

Posted on November 17, 2016 by Rob H

Complexity has always been part of IT and it’s increasing as we embrace microservices and highly abstracted platforms. Making everyone cope with this challenge is unsustainable.

We’re just more aware of infrastructure complexity now that DevOps is exposing this cluster configuration to developers and automation tooling. We are also building platforms from more loosely connected open components. The benefit of customization and rapid development has the unfortunate side-effect of adding integration points. Even worse, those integrations generally require operations in a specific sequence.

The result is a developer rebellion against DevOps on low level (IaaS) platforms towards ones with higher level abstractions (PaaS) like Kubernetes.
This rebellion is taking the form of “cloud native” being in opposition to “devops” processes. I discussed exactly that point with John Furrier on theCUBE at Kubecon and again in my Messy Underlay presentation at Defrag Conf.

It is very clear that the DevOps mission to share ownership of messy production operations requirements is not widely welcomed. Unfortunately, there is no magic cure for production complexity because systems are inherently complex.

There is a (re)growing expectation that operations will remain operations instead of becoming a shared team responsibility. While this thinking apparently rolls back core principles of the DevOps movement, we must respect the perceived productivity impact of making operations responsibility overly broad.

What is the right way to share production responsibility between teams? We can start to leverage platforms like Kubernetes to hide underlay complexity and allow DevOps shared ownership in the right places. That means that operations still owns the complex underlay and platform jobs. Overall, I think that’s a workable division.

Provisioned Secure By Default with Integrated PKI & TLS Automation

Posted on November 16, 2016 by Rob H

Today, I’m presenting this topic (PKI automation & rotation) at Defragcon so I wanted to share this background more broadly as a companion for that presentation. I know this is a long post – hang with me, PKI is complex.

Building automation that creates a secure infrastructure is as critical as it is hard to accomplish. For all that we talk about repeatable automation, actually doing it securely is a challenge. Why? Because we cannot simply encode passwords, security tokens or trust into our scripts. Even more critically, secure configuration is antithetical to general immutable automation: it requires that each unit is different and unique.

Over the summer, the RackN team expanded open source Digital Rebar to include roles that build a service-by-service internal public key infrastructure (PKI).

This is a significant advance in provisioning infrastructure because it allows bootstrapping transport layer security (TLS) encryption without having to assume trust at the edges. This is not general PKI: the goal is for internal trust zones that have no external trust anchors.

Before I explain the details, it’s important to understand that RackN did not build a new encryption model! We leveraged the ones that exist and automated them. The challenge has been automating PKI local certificate authorities (CA) and tightly scoped certificates with standard configuration tools. Digital Rebar solves this by merging service management, node configuration and orchestration.

I’ll try and break this down into the key elements of encryption, keys and trust.

The goal is simple: we want to be able to create secure communications (that’s TLS) between networked services. To do that, they need to be able to agree on encryption keys for dialog (that’s PKI). These keys are managed in public and private pairs: one side uses the public key to encrypt a message that can only be decoded with the receiver’s private key.

To stand up a secure REST API service, we need to create a private key held by the server and a public key that is given to each client that wants secure communication with the server.
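
As a minimal sketch of the public/private pairing described above, here is textbook RSA with tiny primes. It is for illustration only: the numbers are far too small to be secure, real TLS negotiates session keys rather than encrypting each message this way, and production systems should rely on a vetted TLS library. The function names are my own.

```python
# Textbook RSA with tiny primes: for illustration only, never for real use.
p, q = 61, 53
n = p * q                    # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, given to every client
d = pow(e, -1, phi)          # private exponent, kept on the server (Python 3.8+)


def encrypt(message, public_key):
    """Client side: anyone holding the public key can encrypt."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, private_key):
    """Server side: only the private key holder can decode."""
    exponent, modulus = private_key
    return pow(ciphertext, exponent, modulus)


public_key, private_key = (e, n), (d, n)
secret = 65                              # any integer smaller than n
ciphertext = encrypt(secret, public_key)
assert decrypt(ciphertext, private_key) == secret
```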

Now the parties can create secure communications (TLS) between networked services.

Unfortunately, point-to-point key exchange is not enough to establish secure communications. It is too easy to impersonate a service or intercept traffic.

Part of the solution is to include holder identity information into the key itself such as the name or IP address of the server. The more specific the information, the harder it is to break the trust. Unfortunately, many automation patterns simply use wildcard (or unspecific) identity because it is very difficult for them to predict the IP address or name of a server. To address that problem, we only generate certificates once the system details are known. Even better, it’s then possible to regenerate certificates (known as key rotation) after initial deployment.
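
To show why a wildcard identity is weaker than a specific one, here is a simplified name check. It is a sketch of the usual "a leading wildcard matches one DNS label" rule, not a complete implementation of TLS hostname verification, and the host names are hypothetical.

```python
def name_matches(cert_name, hostname):
    """Simplified check: a leading "*" matches exactly one DNS label."""
    cert_labels = cert_name.lower().split(".")
    host_labels = hostname.lower().split(".")
    if len(cert_labels) != len(host_labels):
        return False
    if cert_labels[0] == "*":
        return cert_labels[1:] == host_labels[1:]
    return cert_labels == host_labels


# Hypothetical host names for illustration.
hosts = ["api.cluster.example", "etcd.cluster.example", "rogue.cluster.example"]

wildcard = [h for h in hosts if name_matches("*.cluster.example", h)]
specific = [h for h in hosts if name_matches("api.cluster.example", h)]
print(wildcard)   # every host in the zone, including the rogue one
print(specific)   # only the single server the certificate was issued for
```

A stolen wildcard certificate lets any machine in the zone pose as any service, whereas a certificate issued after the name is known is only useful for that one name.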

While identity improves things, it’s still not sufficient. We need to have a trusted third party who can validate that the keys are legitimate to make the system truly robust. In this case, the certificate authority (CA) that issues the keys signs them so that both parties are able to trust each other. There’s no practical way to intercept communications between the trusted end points without signed keys from the central CA. The system requires that we can build and maintain these three way relationships. For public websites, we can rely on root certificates; however, that’s not practical or desirable for dynamic internal encryption needs.

So what did we do with Digital Rebar? We’ve embedded a certificate authority (CA) service into the core orchestration engine (called “trust me”).

The Digital Rebar CA can be told to generate a root certificate on a per service basis. When we add a server for that service, the CA issues a unique signed certificate matching the server identity. When we add a client for that service, the CA issues a unique signed public key for the client matching the client’s identity. The server will reject communication from unknown public keys. In this way, each service is able to ensure that it is only communicating with trusted end points.
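
The per-service scoping can be sketched as a small bookkeeping model. This is not the Digital Rebar API and it performs no real cryptography: the class and field names are hypothetical, and a random issuer id stands in for the CA signature purely to show how one root per service keeps trust zones apart.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    service: str     # trust zone the certificate belongs to
    identity: str    # name or IP of the holder, known at issue time
    role: str        # "server" or "client"
    issuer_id: str   # stands in for the CA signature


class ToyTrustService:
    """Bookkeeping-only model: no real keys or signatures are created."""

    def __init__(self):
        self._roots = {}

    def create_root(self, service):
        # One root per service keeps each trust zone separate.
        self._roots[service] = secrets.token_hex(8)

    def issue(self, service, identity, role):
        return Certificate(service, identity, role, self._roots[service])

    def is_trusted(self, service, cert):
        # Accept only certificates issued under this service's own root.
        return cert.service == service and cert.issuer_id == self._roots.get(service)


ca = ToyTrustService()
ca.create_root("etcd")
ca.create_root("api")
etcd_client = ca.issue("etcd", "10.0.0.5", "client")
print(ca.is_trusted("etcd", etcd_client))  # issued under the etcd root
print(ca.is_trusted("api", etcd_client))   # a different trust zone
```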

Wow, that’s a lot of information! Getting security right is complex and often neglected. Our focus is provisioning automation, so these efforts do not cover all PKI lifecycle issues or challenges. We’ve got a long list of integrations, tools and next steps that we’d like to accomplish.

Our goal was to automate building secure communication as a default. We think these enhancements to Digital Rebar are a step in that direction. Please let us know if you think this approach is helpful.

Kubernetes the NOT-so-hard way (7 RackN additions: keeping transparency, adding security)

Posted on October 3, 2016 by Rob H

At RackN, we take the KISS principle to heart. Here are the seven ways that we worked to make Kubernetes easier to install and manage.

Container community crooner, Kelsey Hightower, created a definitive installation guide that he dubbed “Kubernetes the Hard Way” or KTHW. In that document, he laid out a manual sequence of steps needed to bring up a working Kubernetes Cluster. For some, the lengthy sequence served as a rally cry to simplify and streamline the “boot to kube” process with additional configuration harnesses, more bells and some new whistles.

For the RackN team, Kelsey’s process looked like a reliable and elegant basis for automation. So, we automated that and eliminated the hard parts (see video).

Seven improvements for KTHW

Our operational approach to distributed systems (encoded in Digital Rebar) drives towards keeping things simple and transparent in operation. When creating automation, it’s way too easy to add complexity that works on a desktop for a developer, but fails as we scale or move into sustaining operations.

The benefit of Kelsey’s approach was that it had to be simple enough to reproduce and troubleshoot manually; however, there were several KTHW challenges that we wanted to streamline while we automated.

Respect the manual steps: Just automating is not enough. We wanted to be true to the steps so that users of the automation could look back at the process and understand it. The beauty of KTHW is that operators can read it and understand the inner workings of Kubernetes.
Node Inventory: Manual node allocation is time consuming and error prone. We believe that the process should be able to (but not be required to) start from zero with just raw hardware or cloud credentials. Anything else opens up a lot of potential configuration errors.
Automatic Iteration: Going back to make adjustments to previous nodes is normal in cluster building and really annoying for users. This is especially true when clusters are expanded or contracted.
PKI Security: We love that Kubernetes requires TLS communication; however, we’re generally horrified about sharing around private keys and wild card certificates even for development and test clusters.
Go & SystemD: We use containers for everything in Digital Rebar and our design has a lot of RESTful services behind a reverse proxy; however, it’s simply not needed for Kubernetes. Kubernetes binaries are portable Golang programs and just the API service is a web service. We feel strongly that the simplest and most robust deployment just runs these programs under SystemD. It is just as easy to curl a single file and restart a service as doing a docker pull. In fact, it’s measurably simpler, more secure and reliable.
Pluggability: It’s hard to allow variation in a manual process. With Kubernetes’ open ecosystem, we see a need for operators to make practical configuration choices without straying dramatically from Kelsey’s basic process. Changes to the container run time or network model should not result in radically different install steps because the fundamentals of Kubernetes are not changed by these choices.
Parallel Deploys & CI/CD Deployments: When we work on cluster deploys, we spin up lots and lots of independent installs to test variations and changes like AWS and Google and OpenStack or Ubuntu and CentOS. Consequently, it is important that we can run multiple installs in parallel. Once that works, we want to have CI driven setup, test and tear down processes.
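
The parallel deploy idea in the last item can be sketched as a test matrix run concurrently. This is a minimal illustration, not our actual tooling: `deploy_and_test` is a hypothetical placeholder where a real job would call the provisioning system, and the provider and operating system lists simply echo the examples named above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

PROVIDERS = ["aws", "google", "openstack"]
OPERATING_SYSTEMS = ["ubuntu", "centos"]


def deploy_and_test(provider, operating_system):
    """Placeholder for setup, test and tear down of one cluster."""
    # A real job would call the provisioning system here.
    return {"provider": provider, "os": operating_system, "passed": True}


def run_matrix(max_workers=6):
    combos = list(product(PROVIDERS, OPERATING_SYSTEMS))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each install is independent, so they can all run at once.
        return list(pool.map(lambda combo: deploy_and_test(*combo), combos))


results = run_matrix()
print(len(results), all(r["passed"] for r in results))
```

Because no install shares state with another, a CI system can run the whole matrix on every change and tear each cluster down afterwards.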

We’re excited about the clean, fast and portable installation that came out of our efforts to automate the KTHW process. We hope that you’ll take a look at our approach and help us continue to improve and streamline Kubernetes (and other!) platform installs.

Container Migration 101: Cloudcast.net & Lachlan Evenson

Posted on September 11, 2016 by Rob H

Last week, CloudCast.net interviewed Lachlan Evenson (now at Deis!). I highly recommend listening to the interview because he has unique and deep experience with OpenStack, Kubernetes and container migration.

I had the good fortune of lunching with Lachie just before the interview aired. We got to compare notes about changes going on in the container space. Some of those insights will end up in my OpenStack Barcelona talk “Will it Blend? The Joint OpenStack Kubernetes Environment.”

There’s no practical way to rehash our whole lunch discussion as a post; however, I can point you to some key points [with time stamps] in his interview that I found highly insightful:

[7:20] In their pre-containers cloud pass, they’d actually made it clunky for the developers and it hurt their devops attempts.
[17:30] Developers advocating for their own use and value is a key to acceptance. A good story follows…
[29:50] We’d work with the app dev teams and if it didn’t fit then we did not try to make it fit.

Overall, I think Lachie does a good job reinforcing that containers create real value to development when there’s a fit between the need and the technology.

Also, thanks Brian and Aaron for keeping such a great podcast going!

Why Fork Docker? Complexity Wack-a-Mole and Commercial Open Source

Posted on August 31, 2016 by Rob H

Update 12/14/16: Docker announced that they would create a container-engine-only project, containerd, to decouple the engine from management layers above. Hopefully this addresses the issues outlined in the post below.

Monday, The New Stack broke news about a possible fork of the Docker Engine and prominently quoted me saying “Docker consistently breaks backend compatibility.” The technical instability alone is not what’s prompting industry leaders like Google, Red Hat and Huawei to take drastic and potentially risky community action in a central project.

So what’s driving a fork? It’s the intersection of Cash, Complexity and Community.

In fact, I’d warned about this risk over a year ago: Docker is both a core infrastructure technology (the docker container runner, aka Docker Engine) and a commercial company that manages the Docker brand. The community formed a standard, runC, to try and standardize; however, Docker continues to deviate from (or innovate faster than) that base.

It’s important for me to note that we use Docker tools and technologies heavily. So far, I’ve been a long-time advocate and user of Docker’s innovative technology. As such, we’ve also had to ride the rapid release roller coaster.

Let’s look at what’s going on here in three key areas:

1. Cash

The expected monetization of containers is the multi-system orchestration and support infrastructure. Since many companies look to containers as leading the disruptive next innovation wave, the idea that Docker is holding part of their plans hostage is simply unacceptable.

So far, the open source Docker Engine has been simply included without payment into these products. That changed in version 1.12 when Docker co-mingled their competitive Swarm product into the Docker Engine. That effectively forces these other parties to advocate and distribute their competitor’s product.

2. Complexity

When Docker added cool Swarm Orchestration features into the v1.12 runtime, it added a lot of complexity too. That may be simple from a “how many things do I have to download and type” perspective; however, that single unit is now dragging around a lot more code.

In one of the recent comments about this issue, Bob Wise bemoaned the need for infrastructure to be boring. Even as we look to complex orchestration like Swarm, Kubernetes, Mesos, Rancher and others to perform application automation magic, we also need to reduce complexity in our infrastructure layers.

Along those lines, operators want key abstractions like containers to be as simple and focused as possible. We’ve seen similar paths for virtualization runtimes like KVM, Xen and VMware that focus on delivering a very narrow band of functionality very well. There is a lot of pressure from people building with containers to have a similar experience from the container runtime.

This approach both helps operators manage infrastructure and creates a healthy ecosystem of companies that leverage the runtimes.

Note: My company, RackN, believes strongly in this need and it’s a core part of our composable approach to automation with Digital Rebar.

3. Community

Multi-vendor open source is a very challenging and specialized type of community. In these communities, most of the contributors are paid by companies with a vested (not necessarily transparent) interest in the project components. If the participants of the community feel that they are not being supported by the leadership then they are likely to revolt.

Ultimately, the primary difference between Docker and a fork of Docker is the brand and the community. If the companies paying the contributors have the will then it’s possible to move a whole community. It’s not cheap, but it’s possible.

Developers vs Operators

One overlooked aspect of this discussion is the apparent lock that Docker enjoys on the container developer community. The three Cs above really focus on the people with budgets (the operators) over the developers. For a fork to succeed, there needs to be a non-Docker set of tooling that feeds the platform pipeline with portable application packages.

In Conclusion…

The world continues to get more and more heterogeneous. We already had multiple container runtimes before Docker and the idea of a new one really is not that crazy right now. We’ve already got an explosion of container orchestration and this is a reflection of that.

My advice? Worry less about the container format for now and focus on automation and abstractions.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture
