Kubernetes 18+ ways – yes, you can have it your way

Lately, I’ve been talking about the general concept of hybrid DevOps adding composability, orchestration and services to traditional configuration.  It’s time to add a concrete example: the RackN team delivering Kubernetes via Digital Rebar with Hybrid DevOps, using community tools and scripts.

So far, we provision over 18 different configurations of Kubernetes simply by changing command line flags [videos below].  That does not include optional post-install steps like tests and applications.

By taking advantage of the Digital Rebar underlay abstractions and orchestration, we are able to use open community installation playbooks for a wide range of configurations.

So far, we’re testing against:

  • Three different clouds (AWS, Google and Packet.net), not including the option of using bare metal.
  • Two different operating systems (Ubuntu and CentOS)
  • Three different software-defined networking systems (Flannel, Calico and OpenContrail)

Those 18 configurations are just the tip of the iceberg of what we are actively testing.  The actual matrix is much deeper.
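
To make the flag-driven matrix concrete, here is a minimal sketch in Python of how those three choices multiply out to 18 deployments.  The `deploy-k8s.sh` wrapper and its flags are hypothetical stand-ins for the workload CLI script shown in the videos, not the actual Digital Rebar commands.

```python
import itertools
import subprocess

# The three dimensions we vary today; each one is just a command line flag.
CLOUDS = ["aws", "google", "packet"]        # bare metal is a fourth option
OPERATING_SYSTEMS = ["ubuntu", "centos"]
SDN_PROVIDERS = ["flannel", "calico", "opencontrail"]

def deploy(cloud: str, os_name: str, sdn: str) -> None:
    """Invoke a hypothetical wrapper script with one combination of flags."""
    cmd = [
        "./deploy-k8s.sh",      # stand-in for the real workload CLI script
        "--provider", cloud,
        "--os", os_name,
        "--sdn", sdn,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # 3 clouds x 2 operating systems x 3 SDNs = 18 configurations.
    for cloud, os_name, sdn in itertools.product(
            CLOUDS, OPERATING_SYSTEMS, SDN_PROVIDERS):
        deploy(cloud, os_name, sdn)
```

Adding a fourth cloud or another SDN extends one list; it does not touch the other dimensions or any existing step.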

BUT THAT’S AN EXPLODING TEST MATRIX!?!  No.  It’s not.

The composable architecture of Digital Rebar means that all of these variations are isolated.  We are not creating 18 distinct variations; instead, the system chains options together and abstracts the differences between steps.

That means that we could add different logging options, test sequences or configuration choices into the deployment with minimal coupling to previous steps.  This enables operator choice and vendor injection in a way that allows collaboration around common components.  By design, we’ve eliminated fragile installation monoliths.
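
As an illustration of that decoupling (a sketch only, not the actual Digital Rebar role model), picture each option as a self-contained step that reads and writes a shared deployment context.  Injecting a new step changes the chain, not the neighboring steps:

```python
from typing import Callable, Dict, List

# Illustrative steps, not actual Digital Rebar roles: each one only reads and
# writes the shared context, so neighbors never need to know about each other.
Step = Callable[[Dict], None]

def provision_nodes(ctx: Dict) -> None:
    ctx["nodes"] = [f"node-{i}" for i in range(ctx.get("node_count", 3))]

def install_sdn(ctx: Dict) -> None:
    ctx["network"] = {"sdn": ctx.get("sdn", "flannel"), "nodes": ctx["nodes"]}

def install_kubernetes(ctx: Dict) -> None:
    ctx["cluster"] = {"masters": ctx["nodes"][:1], "workers": ctx["nodes"][1:]}

def smoke_test(ctx: Dict) -> None:
    # An optional step the operator (or a vendor) can inject without touching the rest.
    assert ctx["cluster"]["workers"], "cluster has no workers"

def run_pipeline(steps: List[Step], ctx: Dict) -> Dict:
    for step in steps:
        step(ctx)            # sequence matters; coupling between steps does not
    return ctx

# Swapping calico for flannel, or adding smoke_test, only touches this list.
pipeline = [provision_nodes, install_sdn, install_kubernetes, smoke_test]
print(run_pipeline(pipeline, {"node_count": 5, "sdn": "calico"}))
```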

All it takes is a Packet, AWS or Google account to try this out for yourself!

Using the CLI script to install Kubernetes:

Deep Dive into adding OpenContrail SDN:

Docker Swarm Cluster Ops – focus on using, not building, with standard automation

At RackN, we’re huge fans of Docker.  We’ve been using the engine for years (since v0.8!) and you can read about our lessons from when we re-architected around Docker Compose.  Now we’ve built “one-click” hybrid cluster automation for Kubernetes, Docker Swarm and others.

However, I’m concerned that Docker installs reveal a lack of cluster operations focus.  These platforms are evolving very rapidly, exposing users to both breaking upgrades and security risks.  This drives a requirement for cluster automation.

What are cluster operations?  They are the system-level activity of creating an integrated platform that is repeatable, secure, networked and sustainable.  As use of Docker transitions from single-node activity into multi-node and hybrid clusters, we need to approach the install and configuration as a system activity.

Cluster configuration requires system-level activity because there are so many moving pieces and necessary pre-configurations of networking, security, storage and roles.  These choices need to be implemented before the actual cluster software is installed because they drive how the cluster is configured and managed.  They continue to be a major factor as we grow, shrink and upgrade the cluster.
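
A hedged sketch of what treating the install as a system activity can look like: the cross-cutting choices are captured and validated as data before any cluster software runs.  The class and field names below are illustrative, not a Digital Rebar schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ClusterPlan:
    """Up-front system decisions that drive how the cluster is installed and managed.

    Field names are illustrative, not a Digital Rebar schema."""
    network_cidr: str
    sdn: str                               # flannel / calico / opencontrail
    storage_backend: str                   # e.g. local disks or ceph
    node_roles: Dict[str, List[str]] = field(default_factory=dict)  # hostname -> roles
    ssh_keys: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # Catch the choices that are painful to change after the install starts.
        if not any("master" in roles for roles in self.node_roles.values()):
            raise ValueError("at least one node must carry the master role")
        if not self.ssh_keys:
            raise ValueError("refusing to build a cluster without access keys")

plan = ClusterPlan(
    network_cidr="10.1.0.0/16",
    sdn="calico",
    storage_backend="ceph",
    node_roles={"node-0": ["master"], "node-1": ["worker"], "node-2": ["worker"]},
    ssh_keys=["ssh-ed25519 AAAA... ops@example"],
)
plan.validate()   # fail fast, before any cluster software runs
```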

Why don’t people do this already?  Because cluster configuration requires additional setup and planning.  Operators are struggling just to keep up with API changes between quarterly updates.

Our mission is to eliminate the overhead of cluster operations so you can focus on using the cluster, not building it.

The RackN team has been working on deployment of Docker Swarm (and container orchestration more generally) to make sure Cluster Operations and underlay are robust and automated on every platform from cloud to metal.

The video below (and others in my channel) show how we’ve made it “one click” easy to create container clusters in nearly any environment.  While this is an evolving process, we believe that it is critical to start with cluster automation.

Let us show you how we’ve made that both fast and painless.

 

Cloudcast Notes: “doing real work” in containers and cloud

Last week, I made my second appearance on Brian Gracely & Aaron Delp’s excellent podcast, The Cloudcast.   A lot has changed since my first appearance in 2011, but we’re still struggling to create consistent operations around these new platforms.  Then it was OpenStack; today it is container orchestration.

I loved this closing comment from Brian, “[the Cloudcast] loves people who are down in the dirt… you are living it and it’s going into the product.”

Total time, 38 minutes

  • 02:15: Interview Starts
  • 03:15: RackN path to Digital Rebar.  History of team going back to Crowbar
  • 06:05: Why we moved to containers for Digital Rebar (blog details here)
  • 07:20: The process to transform from monolith to services
  • 07:50: As background, what is Digital Rebar?  Configuration & Services in Sequence.
  • 09:30: How/Why to use Consul for services
  • 10:30: Why Immutability is Hard  (technical use of the word “cheese”)
  • 12:20: Challenge of restarting & state in Microservices
  • 14:30: Need for Iterative design process to improve as you learn the pattern.
  • 15:10: “If you are not using containers for at least packaging, you are crazy”
  • 15:40: We choose not to talk about OpenStack!
  • 16:15: Fidelity Gap and cloud portability
  • 17:10: Rob does funny voice about idea that with containers “devs don’t have to do ops”
  • 18:00: Why adding some overhead for developers is a good investment.
  • 18:40: Rob throws OpenStack under the bus for Devstack and “it worked in Devstack” mentality
  • 21:20: Containers do not solve all problems; in some ways they make things harder (especially on networking).
  • 21:55: “we are about to put a serious hurt on networking management”
  • 21:50: Networking configuration is hard to build in a consistent way.   You have to automate it – there is no other choice.
  • 24:20: Hybrid Cloud priorities with RackN
  • 25:00: We “declared default” on trying to create a mono-cloud and accepted that infrastructure is hybrid.
  • 25:40: Openness comes from having multiple providers.  Composable ops allows you to cope with heterogeneous APIs
  • 28:55: Businesses want choice and control about infrastructure.  They do not want deployments hardcoded to platforms or tooling.
  • 29:30: “I have not met anyone who is just using one cloud, tool or platform”
  • 30:30: Brian asks Rob to pick winners and trends.  “we like to let people pick and choose.”
  • 31:00: Container orchestration with networking and storage are going to be huge.
  • 31:30: Rob compares Kubernetes, Docker, Mesos, Rancher and Cloudsoft.
  • 32:20: The importance of adjacencies.  Things you need to make the core stuff work.
  • 34:20: “Watch out for the adjacencies because they will slow you down.”
  • 36:10: “We love guests who live in the dirt” and “built the technology that they wanted to get their jobs done”

DevOps workers, your mother was right: always bring a clean Underlay.

Why did your mom care about underwear? She wanted you to have good hygiene. What is good Ops hygiene? It’s not as simple as keeping up with the laundry, but the idea is similar. It means that we’re not going to get surprised by something in our environment that we’d taken for granted. It means that we have a fundamental level of control to keep clean. Let’s explore this in context.

I’ve struggled with the term “underlay” for infrastructure for a long time. At RackN, we generally prefer the term “ready state” to describe getting systems prepared for install; however, underlay fits very well when we consider it as the foundation for building up a platform like Kubernetes, Docker Swarm, Ceph or OpenStack. Even more than single-operator applications, these community-built platforms require carefully tuned and configured environments. In my experience, getting the underlay right dramatically reduces the installation challenges of the platform.

What goes into a clean underlay? All your infrastructure and most of your configuration.

Just buying servers (or cloud instances) does not make a platform. Cloud underlay is nearly as complex, but let’s assume metal here. To turn nodes into a cluster, you need to set up their RAID and BIOS. Generally, you’ll also need to configure out-of-band management IPs and security. Those RAID and BIOS settings are specific to the function of each node, so you’d better get them right. Then install the operating system. That will need access keys, IP addresses, names, NTP, DNS and proxy configuration just as a start. Before you connect to the wider network, make sure updates point to your local mirror and meet site-specific requirements. Installing Docker or an SDN layer? You may have to patch your kernel. It’s already overwhelming and we have not even gotten to the platform-specific details!
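
To show how long that “just the basics” sequence already is, here is a sketch that treats ready state as an explicit, ordered gate that must pass before the platform install is allowed to start. The step names and checks are illustrative placeholders, not Digital Rebar roles.

```python
from typing import Callable, Dict, List, Tuple

# The underlay sequence in the order it has to happen on metal; the step names
# and checks are illustrative placeholders, not Digital Rebar roles.
UNDERLAY_STEPS: List[Tuple[str, Callable[[Dict], bool]]] = [
    ("raid-and-bios",  lambda n: n.get("raid") is not None and n["raid"] == n.get("raid_wanted")),
    ("oob-management", lambda n: "ipmi_ip" in n),
    ("os-install",     lambda n: n.get("os") in ("ubuntu", "centos")),
    ("access-keys",    lambda n: bool(n.get("ssh_keys"))),
    ("ntp-dns-proxy",  lambda n: all(k in n for k in ("ntp", "dns", "proxy"))),
    ("local-mirror",   lambda n: n.get("mirror", "").startswith("http")),
    ("kernel-patch",   lambda n: n.get("kernel_ok", False)),
]

def ready_state(node: Dict) -> List[str]:
    """Return the underlay steps that are still missing for this node."""
    return [name for name, check in UNDERLAY_STEPS if not check(node)]

node = {"raid": "raid10", "raid_wanted": "raid10", "ipmi_ip": "10.0.0.5",
        "os": "ubuntu", "ssh_keys": ["..."], "ntp": "pool", "dns": "10.0.0.2",
        "proxy": "none", "mirror": "http://mirror.local", "kernel_ok": True}

missing = ready_state(node)
if missing:
    raise SystemExit(f"underlay not ready, fix these first: {missing}")
print("ready state reached; safe to start the platform install")
```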

Buried in this long sequence of configurations are critical details about your network, storage and environment.

Any mistake here and your install goes off the rails. Imagine that you’re building a house: it’s very expensive to change the plumbing lines once the foundation is poured. Thankfully, software configuration is not concrete, but the cost of dealing with a bad setup is just as frustrating.

The underlay is the foundation of your install. It needs to be automated and robust.

The challenge compounds once an installation is already in progress because adding the application changes the underlay. When (not if) you make a deploy mistake, you’ll have to either reset the environment or make your deployment idempotent (meaning, able to run the same script multiple times safely). Really, you need to do both.

Why do you need both fast resets and component idempotency? They each help you troubleshoot issues but in different ways. Fast resets ensure that you understand the environment your application requires. Post install tweaks can mask systemic problems that will only be exposed under load. Idempotent action allows you to quickly iterate over individual steps to optimize and isolate components. Together they create resilient automation and good hygiene.
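
Here is a minimal sketch of the two habits together: each step checks before it acts, so re-running the whole script is safe, and a destructive reset tears the environment back to bare so you can prove the automation from scratch. The paths and helpers are hypothetical placeholders, not part of any real tool.

```python
import os
import shutil

STATE_DIR = "/tmp/demo-cluster-state"   # stand-in for whatever your deployment touches

def ensure_config_dir() -> None:
    """Idempotent step: safe to run any number of times."""
    os.makedirs(STATE_DIR, exist_ok=True)          # no error if it already exists

def ensure_line_in_file(path: str, line: str) -> None:
    """Idempotent step: only append the line if it is not already present."""
    existing = ""
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read()
    if line not in existing:
        with open(path, "a") as f:
            f.write(line + "\n")

def fast_reset() -> None:
    """Destructive reset: wipe the environment so the next run starts clean."""
    shutil.rmtree(STATE_DIR, ignore_errors=True)

if __name__ == "__main__":
    fast_reset()                         # prove the automation can rebuild from nothing
    for _ in range(2):                   # and prove every step re-runs safely
        ensure_config_dir()
        ensure_line_in_file(os.path.join(STATE_DIR, "hosts"), "10.0.0.5 node-0")
```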

In my experience, the best deployments involved a non-recoverable/destructive performance test followed by a completely fresh install to reset the environment. The Ops equivalent of a full dress rehearsal to flush out issues. I’ve seen similar concepts promoted around the Netflix Chaos Monkey pattern.

If your deployment is too fragile to risk breaking in development and test, then you’re signing up for an ongoing life of fire fighting. In that case, you’ll definitely need all the “clean underwear” you can find.

We need DevOps without Borders! Is that “Hybrid DevOps?”

The RackN team has been working on making DevOps more portable for over five years.  Portable between vendors, sites, tools and operating systems means that our automation needs to be hybrid in multiple dimensions by design.

Why drive for hybrid?  It’s about giving users control.

I believe that applications should drive the infrastructure, not the reverse.  I’ve heard many times that the “infrastructure should be invisible to the user.”  Unfortunately, lack of abstraction and composability makes it difficult to code across platforms.  I like the term “fidelity gap” to describe the cost of these differences.

What keeps DevOps from going hybrid?  Shortcuts related to platform-entangled configuration management.

Everyone wants to get stuff done quickly; however, we make the same hard-coded ops choices over and over again.  Big bang configuration automation that embeds sequence assumptions into the script is not just technical debt, it’s fragile and difficult to upgrade or maintain.  The problem is not configuration management (that’s a critical component!), it’s the lack of system level tooling that forces us to overload the configuration tools.

What is system level tooling?  It’s integrating automation that expands beyond configuration into managing sequence (aka orchestration), service orientation, script modularity (aka composability) and multi-platform abstraction (aka hybrid).

My ops automation experience says that these four factors must be solved together because they are interconnected.
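
As a rough sketch of those four factors working together (hypothetical names, not the Digital Rebar object model): composable modules are sequenced by a small orchestrator, each module can target more than one platform, and results are published as services that other modules can discover.

```python
from typing import Callable, Dict, List

# Hypothetical module and service names, not the Digital Rebar object model.
services: Dict[str, str] = {}      # service orientation: publish endpoints, not files

def register(name: str, endpoint: str) -> None:
    services[name] = endpoint

# Composability: each module is small, replaceable and platform-aware (hybrid).
def provision(platform: str) -> None:
    register("nodes", f"{platform}://pool/3")

def networking(platform: str) -> None:
    register("network", f"{platform}://overlay/flannel")

def kubernetes(platform: str) -> None:
    register("kube-api", f"https://{platform}-master:6443")

# Orchestration: sequence matters, so the order is explicit and controllable.
PIPELINE: List[Callable[[str], None]] = [provision, networking, kubernetes]

def run(platform: str) -> Dict[str, str]:
    for module in PIPELINE:
        module(platform)
    return dict(services)

print(run("aws"))     # the same pipeline can target "google", "packet" or "metal"
```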

What would a platform that embraced all these ideas look like?  Here is what we’ve been working towards with Digital Rebar at RackN:

Mono-Infrastructure IT vs. “Hybrid DevOps”

  • Locked into a single platform → Portable between sites and infrastructures with layered ops abstractions.
  • Limited interop between tools → Adaptive to mix and match best-for-job tools.  Use the right scripting for the job at hand and never force-migrate working automation.
  • Ad hoc security based on site specifics → Secure using repeatable, automated processes.  We fail at security when things get too complex to change and adapt.
  • Difficult to reuse ops tools → Composable modules enable ops pipelines.  We have to be able to interchange parts of our deployments for collaboration and upgrades.
  • Fragile configuration management → Service orientation simplifies API integration.  The number of APIs and services is increasing; configuration management alone is not sufficient.
  • Big bang “configure then deploy” scripting → Orchestrated action is critical because sequence matters.  Building a cluster requires sequential (often iterative) operations between nodes in the system.  We cannot build robust deployments without ongoing control over the order of operations.

Should we call this “Hybrid DevOps?”  That sounds so buzz-wordy!

I’ve come to believe that Hybrid DevOps is the right name.  More technical descriptions like “composable ops” or “service oriented devops” or “cross-platform orchestration” just don’t capture the real value.  All these names fail to capture the portability and multi-system flavor that drives the need for user control of hybrid in multiple dimensions.

Simply put, we need devops without borders!

What do you think?  Do you have a better term?

Full Metal DevOps: 12 things we needed beyond Cobbler

The RackN team did not plan to replace Cobbler, we just needed something that responded to our need for full-cycle cross-platform DevOps automation.

Provisioning an O/S is never enough!  You need to coordinate a lot of operational activity to deploy a multi-node system, like OpenStack, Kubernetes, Docker Swarm or Ceph.  Since we believe an automated upgrade path is also required, there is a huge gap in provisioning.

So what was needed?  Here’s our (rather long!) list of gaps to fill for full Metal DevOps provisioning:

  1. Needs to work with Cobbler!  Improve?  Yes.  Disrupt?  Hell no!  It has to be OK to leave Cobbler in place while we do something better.  I’d be OK with tweaking my Cobbler to point it at the new stuff.
  2. REST API & JSON CLI.  Beyond the obvious API, we really want a way to write scripts that drive deployment proactively.
  3. Modular components.  If I’ve got my own DNS, DHCP, NTP, etc., then let me use those instead (see #1 above).
  4. Control over the discovery image.  RAM discovery images are awesome, BUT please let me mess with them too!  Inject my keys and let me control when they exit.
  5. Configure heterogeneous RAID, BIOS & IPMI.  Servers are a mix of in-band (in the O/S) and out-of-band (BMC) configs.  Don’t make me pick; I can’t.
  6. Inject DevOps scripts dynamically based on system inventory or state (see the sketch after this list).  Depending on the node’s role, I want to run a set of scripts AFTER the O/S is installed.  And, please let me mix Chef, Puppet, Ansible and Bash.  Bash?  Especially Bash.
  7. Portable scripts between cloud and metal.  I’m going to practice on VMs and AWS.  In fact, my devs only work there.  I need high fidelity between my cloud and metal deploys.
  8. One click to reset and start over.  I don’t care if you want to call this “Metal as a Service.”  Deployments are iterative and we need to go faster.
  9. Don’t require PXE or IP control to add nodes to the system.  Beyond #2, I want to get control of servers that don’t PXE or are already provisioned.
  10. System inventory including network topology, then push it.  No surprise that we need inventory to make provisioning decisions.  Can we make that API available?  Maybe push it into a CMDB?
  11. Control SSH keys per system, group and deployment.  Darn, security is near the bottom again!  Can we please control keys and access from first boot?  It should be table stakes.
  0. AND NEVER HAVE TO TOUCH KICKSTART or PRESEED TEMPLATES.  Well, there are times I have to do it (like soft RAID for O/S drives), so at least create a template system, because Cobbler’s was pretty good.
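
To illustrate gap #6, here is a minimal sketch of picking post-install scripts from a node’s roles and mixing tool types.  The role names and mapping are hypothetical, not how Digital Rebar actually models roles.

```python
from typing import Dict, List, Tuple

# Post-install work keyed by node role; role names and scripts are hypothetical.
POST_INSTALL: Dict[str, List[Tuple[str, str]]] = {
    "kube-master": [("ansible", "playbooks/kube-master.yml"), ("bash", "scripts/harden.sh")],
    "kube-worker": [("bash", "scripts/join-cluster.sh")],
    "ceph-osd":    [("chef", "cookbooks/ceph-osd")],
}

def scripts_for(node: Dict) -> List[Tuple[str, str]]:
    """Pick post-install tasks from the node's inventory/state, not a static plan."""
    return [task for role in node.get("roles", []) for task in POST_INSTALL.get(role, [])]

node = {"name": "rack1-u3", "roles": ["kube-worker", "ceph-osd"]}
for tool, script in scripts_for(node):
    print(f"{node['name']}: run {script} with {tool}")
```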

We built Digital Rebar to close these gaps and many others (like being transparent in operation, working in containers, and failing fast).  We think it’s time to bring cloud operational practices to metal.  With this type of automation, we can make it happen!

What are your biggest challenges with Metal Ops?   Does it match this list?  I’d love to hear your opinion.

Post-OpenStack DefCore, I’m Chasing “open infrastructure” via cross-platform Interop

Like my previous DefCore interop windmill tilting, this is not something that can be done alone. Open infrastructure is a collaborative effort and I’m looking for your help and support. I believe solving this problem benefits us as an industry and individually as IT professionals.

So, what is open infrastructure? It’s not about running on open source software. It’s about creating platform choice and control. In my experience, that’s what defines open for users (and developers are not users).

I’ve spent several years helping lead OpenStack interoperability (aka DefCore) efforts to ensure that OpenStack cloud APIs are consistent between vendors. I strongly believe that effort is essential to build an ecosystem around the project; however, in talking to enterprise users, I’ve learned that their real interoperability gap is between the many platforms (AWS, Google, VMware, OpenStack and metal) that they use every day.

Instead of focusing inward to one platform, I believe the bigger enterprise need is to address automation across platforms. It is something I’m starting to call hybrid DevOps because it allows users to mix platforms, service APIs and tools.

Open infrastructure in that context is being able to work across platforms without being tied into one platform choice even when that platform is based on open source software. API duplication is not sufficient: the operational characteristics of each platform are different enough that we need a different abstraction approach.

We have to be able to compose automation in a way that tolerates substitution based on infrastructure characteristics. This is required for metal because of variation between hardware vendors and data center networking and services. It is equally essential for cloud because of variation between IaaS capabilities and service delivery models. Basically, those “minor” differences between clouds create significant challenges in interoperability at the operational level.
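
Here is a hedged sketch of what “tolerates substitution” can look like: one operational contract, with the metal and cloud implementations free to differ wherever their characteristics differ. The class and method names are illustrative, not part of any real API.

```python
from abc import ABC, abstractmethod
from typing import List

class NodeProvider(ABC):
    """One operational contract; each infrastructure fills it in its own way."""
    @abstractmethod
    def allocate(self, count: int) -> List[str]: ...
    @abstractmethod
    def reset(self, node: str) -> None: ...

class CloudProvider(NodeProvider):
    def allocate(self, count: int) -> List[str]:
        # Clouds hand out instances through an API; networking is largely a given.
        return [f"i-{n:04d}" for n in range(count)]
    def reset(self, node: str) -> None:
        print(f"terminate and re-create {node}")

class MetalProvider(NodeProvider):
    def allocate(self, count: int) -> List[str]:
        # Metal means picking real machines, then BMC power-on and PXE discovery.
        return [f"rack1-u{n + 1}" for n in range(count)]
    def reset(self, node: str) -> None:
        print(f"power cycle {node} via the BMC and re-image from the discovery image")

def build_cluster(provider: NodeProvider) -> List[str]:
    # The workload automation calls the same contract regardless of substrate.
    return provider.allocate(3)

print(build_cluster(CloudProvider()))
print(build_cluster(MetalProvider()))
```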

Rationalizing APIs does little to address these more structural differences.

The problem is compounded because the differences are not nicely segmented behind abstraction layers. If you work to build and sustain a fully integrated application, you must account for site-specific needs throughout your application stack, including networking, storage, access and security. I’ve described it this way: all deployments have 80% of the work in common, but the remaining 20% is mixed in with the 80% instead of being nicely layered. So, ops is cookie dough, not vinaigrette.

Getting past this problem for initial provisioning on a single platform is a false victory. The real need is portable and upgrade-ready automation that can be reused and shared. Critically, we also need to build upon the existing foundations instead of requiring a blank slate. There is openness value in heterogeneous infrastructure so we need to embrace variation and design accordingly.

This is the vision the RackN team has been working towards with the open source Digital Rebar project. We are now able to showcase workload deployments (Docker, Kubernetes, Ceph, etc.) on multiple cloud platforms that also translate to full bare metal deployments. Unlike previous generations of this tooling (some will remember Crowbar), we’ve been careful to avoid injecting external dependencies into the DevOps scripts.

While we’re able to demonstrate a high degree of portability (or fidelity) across multiple platforms, this is just the beginning. We are looking for users and collaborators who want to build open infrastructure from an operational perspective.

You are invited to join us in making open cross-platform operations a reality.