DevOps workers, you mother was right: always bring a clean Underlay.

Why did your mom care about underwear? She wanted you to have good hygiene. What is good Ops hygiene? It’s not as simple as keeping up with the laundry, but the idea is similar. It means that we’re not going to get surprised by something in our environment that we’d taken for granted. It means that we have a fundamental level of control to keep clean. Let’s explore this in context.

l_1600_1200_9847591C-0837-4A7D-A69D-54041685E1C6.jpegI’ve struggled with the term “underlay” for infrastructure of a long time. At RackN, we generally prefer the term “ready state” to describe getting systems prepared for install; however, underlay fits very well when we consider it as the foundation for a more building up a platform like Kubernetes, Docker Swarm, Ceph and OpenStack. Even more than single operator applications, these community built platforms require carefully tuned and configured environments. In my experience, getting the underlay right dramatically reduces installation challenges of the platform.

What goes into a clean underlay? All your infrastructure and most of your configuration.

Just buying servers (or cloud instances) does not make a platform. Cloud underlay is nearly as complex, but let’s assume metal here. To turn nodes into a cluster, you need setup their RAID and BIOS. Generally, you’ll also need to configure out-of-band management IPs and security. Those RAID and BIOS settings specific to the function of each node, so you’d better get that right. Then install the operating system. That will need access keys, IP addresses, names, NTP, DNS and proxy configuration just as a start. Before you connect to the wide, make sure to update to your a local mirror and site specific requirements. Installing Docker or a SDN layer? You may have to patch your kernel. It’s already overwhelming and we have not even gotten to the platform specific details!

Buried in this long sequence of configurations are critical details about your network, storage and environment.

Any mistake here and your install goes off the rails. Imagine that your building a house: it’s very expensive to change the plumbing lines once the foundation is poured. Thankfully, software configuration is not concrete but the costs of dealing with bad setup is just as frustrating.

The underlay is the foundation of your install. It needs to be automated and robust.

The challenge compounds once an installation is already in progress because adding the application changes the underlay. When (not if) you make a deploy mistake, you’ll have to either reset the environment or make your deployment idempotent (meaning, able to run the same script multiple times safely). Really, you need to do both.

Why do you need both fast resets and component idempotency? They each help you troubleshoot issues but in different ways. Fast resets ensure that you understand the environment your application requires. Post install tweaks can mask systemic problems that will only be exposed under load. Idempotent action allows you to quickly iterate over individual steps to optimize and isolate components. Together they create resilient automation and good hygiene.

In my experience, the best deployments involved a non-recoverable/destructive performance test followed by a completely fresh install to reset the environment. The Ops equivalent of a full dress rehearsal to flush out issues. I’ve seen similar concepts promoted around the Netflix Chaos Monkey pattern.

If your deployment is too fragile to risk breaking in development and test then you’re signing up for an on-going life of fire fighting. In that case, you’ll definitely need all the “clean underware” you can find.

DNS is critical – getting physical ops integrations right matters

Why DNS? Maintaining DNS is essential to scale ops.  It’s not as simple as naming servers because each server will have multiple addresses (IPv4, IPv6, teams, bridges, etc) on multiple NICs depending on the systems function and applications. Plus, Errors in DNS are hard to diagnose.

Names MatterI love talking about the small Ops things that make a huge impact in quality of automation.  Things like automatically building a squid proxy cache infrastructure.

Today, I get to rave about the DNS integration that just surfaced in the OpenCrowbar code base. RackN CTO, Greg Althaus, just completed work that incrementally updates DNS entries as new IPs are added into the system.

Why is that a big deal?  There are a lot of names & IPs to manage.

In physical ops, every time you bring up a physical or virtual network interface, you are assigning at least one IP to that interface. For OpenCrowbar, we are assigning two addresses: IPv4 and IPv6.  Servers generally have 3 or more active interfaces (e.g.: BMC, admin, internal, public and storage) so that’s a lot of references.  It gets even more complex when you factor in DNS round robin or other common practices.

Plus mistakes are expensive.  Name resolution is an essential service for operations.

I know we all love memorizing IPv4 addresses (just wait for IPv6!) so accurate naming is essential.  OpenCrowbar already aligns the address 4th octet (Admin .106 goes to the same server as BMC .106) but that’s not always practical or useful.  This is not just a Day 1 problem – DNS drift or staleness becomes an increasing challenging problem when you have to reallocate IP addresses.  The simple fact is that registering IPs is not the hard part of this integration – it’s the flexible and dynamic updates.

What DNS automation did we enable in OpenCrowbar?  Here’s a partial list:

  1. recovery of names and IPs when interfaces and systems are decommissioned
  2. use of flexible naming patterns so that you can control how the systems are registered
  3. ability to register names in multiple DNS infrastructures
  4. ability to understand sub-domains so that you can map DNS by region
  5. ability to register the same system under multiple names
  6. wild card support for C-Names
  7. ability to create a DNS round-robin group and keep it updated

But there’s more! The integration includes both BIND and PowerDNS integrations. Since BIND does not have an API that allows incremental additions, Greg added a Golang service to wrap BIND and provide incremental updates and deletes.

When we talk about infrastructure ops automation and ready state, this is the type of deep integration that makes a difference and is the hallmark of the RackN team’s ops focus with RackN Enterprise and OpenCrowbar.

Manage Hardware like a BOSS – latest OpenCrowbar brings API to Physical Gear

A few weeks ago, I posted about VMs being squeezed between containers and metal.   That observation comes from our experience fielding the latest metal provisioning feature sets for OpenCrowbar; consequently, so it’s exciting to see the team has cut the next quarterly release:  OpenCrowbar v2.2 (aka Camshaft).  Even better, you can top it off with official software support.

Camshaft coordinates activity

Dual overhead camshaft housing by Neodarkshadow from Wikimedia Commons

The Camshaft release had two primary objectives: Integrations and Services.  Both build on the unique functional operations and ready state approach in Crowbar v2.

1) For Integrations, we’ve been busy leveraging our ready state API to make physical servers work like a cloud.  It gets especially interesting with the RackN burn-in/tear-down workflows added in.  Our prototype Chef Provisioning driver showed how you can use the Crowbar API to spin servers up and down.  We’re now expanding this cloud-like capability for Saltstack, Docker Machine and Pivotal BOSH.

2) For Services, we’ve taken ops decomposition to a new level.  The “secret sauce” for Crowbar is our ability to interweave ops activity between components in the system.  For example, building a cluster requires setting up pieces on different systems in a very specific sequence.  In Camshaft, we’ve added externally registered services (using Consul) into the orchestration.  That means that Crowbar will either use existing DNS, Database, or NTP services or set it’s own.  Basically, Crowbar can now work FIT YOUR EXISTING OPS ENVIRONMENT without forcing a dedicated Crowbar only services like DHCP or DNS.

In addition to all these features, you can now purchase support for OpenCrowbar from RackN (my company).  The Enterprise version includes additional server life-cycle workflow elements and features like HA and Upgrade as they are available.

There are AMAZING features coming in the next release (“Drill”) including a message bus to broadcast events from the system, more operating systems (ESXi, Xenserver, Debian and Mirantis’ Fuel) and increased integration/flexibility with existing operational environments.  Several of these have already been added to the develop branch.

It’s easy to setup and test OpenCrowbar using containers, VMs or metal.  Want to learn more?  Join our community in Gitteremail list or weekly interactive community meetings (Wednesdays @ 9am PT).

Showing to how others explain Ready State & OpenCrowbar

I’m working on a series for to explain Functional Ops (expect it to start early next week!) and it’s very hard to convey it’s east-west API nature.  So I’m always excited to see how other people explain how OpenCrowbar does ops and ready state.

Ready State PictureThis week I was blown away by the drawing that I’ve recreated for this blog post.  It’s very clear graphic showing the operational complexity of heterogeneous infrastructure AND how OpenCrowbar normalizes it into a ready state.

It’s critical to realize that the height of each component tower varies by vendor and also by location with in the data center topology.  Ready state is not just about normalizing different vendors gear; it’s really about dealing with the complexity that’s inherent in building a functional data center.  It’s “little” things liking knowing how to to enumerate the networking interfaces and uplinks to build the correct teams.

If you think this graphic helps, please let me know.

Online Meetup Today (1/13): Build a rock-solid foundation under your OpenStack cloud

Reminder: Online meetup w/ Crowbar + OpenStack DEMO TODAY

HFoundation Rawere’s the notice from the site (with my added Picture)

Building cloud infrastructure requires a rock-solid foundation. 

In this hour, Rob Hirschfeld will demo automated tooling, specifically OpenCrowbar, to prepare and integrate physical infrastructure to ready state and then use PackStack to install OpenStack.


The OpenCrowbar project started in 2011 as an OpenStack installer and had grown into a general purpose provisioning and infrastructure orchestration framework that works in parallel with multiple hardware vendors, operating systems and devops tools.  These tools create a fast, durable and repeatable environment to install OpenStack, Ceph, Kubernetes, Hadoop or other scale platforms.


Rob will show off the latest features and discuss key concepts from the Crowbar operational model including Ready State, Functional Operations and Late Binding. These concepts, built into Crowbar, can be applied generally to make your operations more robust and scalable.

OpenCrowbar v2.1 Video Tour from Metal to OpenStack and beyond

With the OpenCrowbar v2.1 out, I’ve been asked to update the video library of Crowbar demos.  Since a complete tour is about 3 hours, I decided to cut it down into focused demos that would allow you to start at an area of interest and work backwards.

I’ve linked all the videos below by title.  Here’s a visual table on contents:

Video Progression

Crowbar v2.1 demo: Visual Table of Contents [click for playlist]

The heart of the demo series is the Annealer and Ready State (video #3).

  1. Prepare Environment
  2. Bootstrap Crowbar
  3. Add Nodes ♥ Ready State (good starting point)
  4. Boot Hardware
  5. Install OpenStack (Juno using PackStack on CentOS 7)
  6. Integrate with Chef & Chef Provisioning
  7. Integrate with SaltStack

I’ve tried to do some post-production so limit dead air and focus on key areas.  As always, I value content over production values so feedback is very welcome!

Delicious 7 Layer DIP (DevOps Infrastructure Provisioning) model with graphic!

Applying architecture and computer science principles to infrastructure automation helps us build better controls.  In this post, we create an OSI-like model that helps decompose the ops environment.

The RackN team discussions about “what is Ready State” have led to some interesting realizations about physical ops.  One of the most critical has been splitting the operational configuration (DNS, NTP, SSH Keys, Monitoring, Security, etc) from the application configuration.

Interactions between these layers is much more dynamic than developers and operators expect.  

In cloud deployments, you can use ask for the virtual infrastructure to be configured in advance via the IaaS and/or golden base images.  In hardware, the environment build up needs to be more incremental because that variations in physical infrastructure and operations have to be accommodated.

Greg Althaus, Crowbar co-founder, and I put together this 7 layer model (it started as 3 and grew) because we needed to be more specific in discussion about provisioning and upgrade activity.  The system view helps explain how layer 5 and 6 operate at the system layer.

7 Layer DIP

The Seven Layers of our DIP:

  1. shared infrastructure – the base layer is about the interconnects between the nodes.  In this model, we care about the specific linkage to the node: VLAN tags on the switch port, which switch is connected, which PDU ID controls turns it on.
  2. firmware and management – nodes have substantial driver (RAID/BIOS/IPMI) software below the operating system that must be configured correctly.   In some cases, these configurations have external interfaces (BMC) that require out-of-band access while others can only be configured in pre-install environments (I call that side-band).
  3. operating system – while the operating system is critical, operators are striving to keep this layer as thin to avoid overhead.  Even so, there are critical security, networking and device mapping functions that must be configured.  Critical local resource management items like mapping media or building network teams and bridges are level 2 functions.
  4. operations clients – this layer connects the node to the logical data center infrastructure is basic ways like time synch (NTP) and name resolution (DNS).  It’s also where more sophisticated operators configure things like distributed cache, centralized logging and system health monitoring.  CMDB agents like Chef, Puppet or Saltstack are installed at the “top” of this layer to complete ready state.
  5. applications – once all the baseline is setup, this is the unique workload.  It can range from platforms for other applications (like OpenStack or Kubernetes) or the software itself like Ceph, Hadoop or anything.
  6. operations management – the external system references for layer 3 must be factored into the operations model because they often require synchronized configuration.  For example, registering a server name and IP addresses in a DNS, updating an inventory database or adding it’s thresholds to a monitoring infrastructure.  For scale and security, it is critical to keep the node configuration (layer 3) constantly synchronized with the central management systems.
  7. cluster coordination – no application stands alone; consequently, actions from layer 4 nodes must be coordinated with other nodes.  This ranges from database registration and load balancing to complex upgrades with live data migration. Working in layer 4 without layer 6 coordination creates unmanageable infrastructure.

This seven layer operations model helps us discuss which actions are required when provisioning a scale infrastructure.  In my experience, many developers want to work exclusively in layer 4 and overlook the need to have a consistent and managed infrastructure in all the other layers.  We enable this thinking in cloud and platform as a service (PaaS) and that helps improve developer productivity.

We cannot overlook the other layers in physical ops; however, working to ready state helps us create more cloud-like boundaries.  Those boundaries are a natural segue my upcoming post about functional operations (older efforts here).