DevOps workers, you mother was right: always bring a clean Underlay.

Why did your mom care about underwear? She wanted you to have good hygiene. What is good Ops hygiene? It’s not as simple as keeping up with the laundry, but the idea is similar. It means that we’re not going to get surprised by something in our environment that we’d taken for granted. It means that we have a fundamental level of control to keep clean. Let’s explore this in context.

l_1600_1200_9847591C-0837-4A7D-A69D-54041685E1C6.jpegI’ve struggled with the term “underlay” for infrastructure of a long time. At RackN, we generally prefer the term “ready state” to describe getting systems prepared for install; however, underlay fits very well when we consider it as the foundation for a more building up a platform like Kubernetes, Docker Swarm, Ceph and OpenStack. Even more than single operator applications, these community built platforms require carefully tuned and configured environments. In my experience, getting the underlay right dramatically reduces installation challenges of the platform.

What goes into a clean underlay? All your infrastructure and most of your configuration.

Just buying servers (or cloud instances) does not make a platform. Cloud underlay is nearly as complex, but let’s assume metal here. To turn nodes into a cluster, you need setup their RAID and BIOS. Generally, you’ll also need to configure out-of-band management IPs and security. Those RAID and BIOS settings specific to the function of each node, so you’d better get that right. Then install the operating system. That will need access keys, IP addresses, names, NTP, DNS and proxy configuration just as a start. Before you connect to the wide, make sure to update to your a local mirror and site specific requirements. Installing Docker or a SDN layer? You may have to patch your kernel. It’s already overwhelming and we have not even gotten to the platform specific details!

Buried in this long sequence of configurations are critical details about your network, storage and environment.

Any mistake here and your install goes off the rails. Imagine that your building a house: it’s very expensive to change the plumbing lines once the foundation is poured. Thankfully, software configuration is not concrete but the costs of dealing with bad setup is just as frustrating.

The underlay is the foundation of your install. It needs to be automated and robust.

The challenge compounds once an installation is already in progress because adding the application changes the underlay. When (not if) you make a deploy mistake, you’ll have to either reset the environment or make your deployment idempotent (meaning, able to run the same script multiple times safely). Really, you need to do both.

Why do you need both fast resets and component idempotency? They each help you troubleshoot issues but in different ways. Fast resets ensure that you understand the environment your application requires. Post install tweaks can mask systemic problems that will only be exposed under load. Idempotent action allows you to quickly iterate over individual steps to optimize and isolate components. Together they create resilient automation and good hygiene.

In my experience, the best deployments involved a non-recoverable/destructive performance test followed by a completely fresh install to reset the environment. The Ops equivalent of a full dress rehearsal to flush out issues. I’ve seen similar concepts promoted around the Netflix Chaos Monkey pattern.

If your deployment is too fragile to risk breaking in development and test then you’re signing up for an on-going life of fire fighting. In that case, you’ll definitely need all the “clean underware” you can find.

DNS is critical – getting physical ops integrations right matters

Why DNS? Maintaining DNS is essential to scale ops.  It’s not as simple as naming servers because each server will have multiple addresses (IPv4, IPv6, teams, bridges, etc) on multiple NICs depending on the systems function and applications. Plus, Errors in DNS are hard to diagnose.

Names MatterI love talking about the small Ops things that make a huge impact in quality of automation.  Things like automatically building a squid proxy cache infrastructure.

Today, I get to rave about the DNS integration that just surfaced in the OpenCrowbar code base. RackN CTO, Greg Althaus, just completed work that incrementally updates DNS entries as new IPs are added into the system.

Why is that a big deal?  There are a lot of names & IPs to manage.

In physical ops, every time you bring up a physical or virtual network interface, you are assigning at least one IP to that interface. For OpenCrowbar, we are assigning two addresses: IPv4 and IPv6.  Servers generally have 3 or more active interfaces (e.g.: BMC, admin, internal, public and storage) so that’s a lot of references.  It gets even more complex when you factor in DNS round robin or other common practices.

Plus mistakes are expensive.  Name resolution is an essential service for operations.

I know we all love memorizing IPv4 addresses (just wait for IPv6!) so accurate naming is essential.  OpenCrowbar already aligns the address 4th octet (Admin .106 goes to the same server as BMC .106) but that’s not always practical or useful.  This is not just a Day 1 problem – DNS drift or staleness becomes an increasing challenging problem when you have to reallocate IP addresses.  The simple fact is that registering IPs is not the hard part of this integration – it’s the flexible and dynamic updates.

What DNS automation did we enable in OpenCrowbar?  Here’s a partial list:

  1. recovery of names and IPs when interfaces and systems are decommissioned
  2. use of flexible naming patterns so that you can control how the systems are registered
  3. ability to register names in multiple DNS infrastructures
  4. ability to understand sub-domains so that you can map DNS by region
  5. ability to register the same system under multiple names
  6. wild card support for C-Names
  7. ability to create a DNS round-robin group and keep it updated

But there’s more! The integration includes both BIND and PowerDNS integrations. Since BIND does not have an API that allows incremental additions, Greg added a Golang service to wrap BIND and provide incremental updates and deletes.

When we talk about infrastructure ops automation and ready state, this is the type of deep integration that makes a difference and is the hallmark of the RackN team’s ops focus with RackN Enterprise and OpenCrowbar.

Manage Hardware like a BOSS – latest OpenCrowbar brings API to Physical Gear

A few weeks ago, I posted about VMs being squeezed between containers and metal.   That observation comes from our experience fielding the latest metal provisioning feature sets for OpenCrowbar; consequently, so it’s exciting to see the team has cut the next quarterly release:  OpenCrowbar v2.2 (aka Camshaft).  Even better, you can top it off with official software support.

Camshaft coordinates activity

Dual overhead camshaft housing by Neodarkshadow from Wikimedia Commons

The Camshaft release had two primary objectives: Integrations and Services.  Both build on the unique functional operations and ready state approach in Crowbar v2.

1) For Integrations, we’ve been busy leveraging our ready state API to make physical servers work like a cloud.  It gets especially interesting with the RackN burn-in/tear-down workflows added in.  Our prototype Chef Provisioning driver showed how you can use the Crowbar API to spin servers up and down.  We’re now expanding this cloud-like capability for Saltstack, Docker Machine and Pivotal BOSH.

2) For Services, we’ve taken ops decomposition to a new level.  The “secret sauce” for Crowbar is our ability to interweave ops activity between components in the system.  For example, building a cluster requires setting up pieces on different systems in a very specific sequence.  In Camshaft, we’ve added externally registered services (using Consul) into the orchestration.  That means that Crowbar will either use existing DNS, Database, or NTP services or set it’s own.  Basically, Crowbar can now work FIT YOUR EXISTING OPS ENVIRONMENT without forcing a dedicated Crowbar only services like DHCP or DNS.

In addition to all these features, you can now purchase support for OpenCrowbar from RackN (my company).  The Enterprise version includes additional server life-cycle workflow elements and features like HA and Upgrade as they are available.

There are AMAZING features coming in the next release (“Drill”) including a message bus to broadcast events from the system, more operating systems (ESXi, Xenserver, Debian and Mirantis’ Fuel) and increased integration/flexibility with existing operational environments.  Several of these have already been added to the develop branch.

It’s easy to setup and test OpenCrowbar using containers, VMs or metal.  Want to learn more?  Join our community in Gitteremail list or weekly interactive community meetings (Wednesdays @ 9am PT).

Showing to how others explain Ready State & OpenCrowbar

I’m working on a series for DevOps.com to explain Functional Ops (expect it to start early next week!) and it’s very hard to convey it’s east-west API nature.  So I’m always excited to see how other people explain how OpenCrowbar does ops and ready state.

Ready State PictureThis week I was blown away by the drawing that I’ve recreated for this blog post.  It’s very clear graphic showing the operational complexity of heterogeneous infrastructure AND how OpenCrowbar normalizes it into a ready state.

It’s critical to realize that the height of each component tower varies by vendor and also by location with in the data center topology.  Ready state is not just about normalizing different vendors gear; it’s really about dealing with the complexity that’s inherent in building a functional data center.  It’s “little” things liking knowing how to to enumerate the networking interfaces and uplinks to build the correct teams.

If you think this graphic helps, please let me know.

Online Meetup Today (1/13): Build a rock-solid foundation under your OpenStack cloud

Reminder: Online meetup w/ Crowbar + OpenStack DEMO TODAY

HFoundation Rawere’s the notice from the site (with my added Picture)

Building cloud infrastructure requires a rock-solid foundation. 

In this hour, Rob Hirschfeld will demo automated tooling, specifically OpenCrowbar, to prepare and integrate physical infrastructure to ready state and then use PackStack to install OpenStack.

 

The OpenCrowbar project started in 2011 as an OpenStack installer and had grown into a general purpose provisioning and infrastructure orchestration framework that works in parallel with multiple hardware vendors, operating systems and devops tools.  These tools create a fast, durable and repeatable environment to install OpenStack, Ceph, Kubernetes, Hadoop or other scale platforms.

 

Rob will show off the latest features and discuss key concepts from the Crowbar operational model including Ready State, Functional Operations and Late Binding. These concepts, built into Crowbar, can be applied generally to make your operations more robust and scalable.

OpenCrowbar v2.1 Video Tour from Metal to OpenStack and beyond

With the OpenCrowbar v2.1 out, I’ve been asked to update the video library of Crowbar demos.  Since a complete tour is about 3 hours, I decided to cut it down into focused demos that would allow you to start at an area of interest and work backwards.

I’ve linked all the videos below by title.  Here’s a visual table on contents:

Video Progression

Crowbar v2.1 demo: Visual Table of Contents [click for playlist]

The heart of the demo series is the Annealer and Ready State (video #3).

  1. Prepare Environment
  2. Bootstrap Crowbar
  3. Add Nodes ♥ Ready State (good starting point)
  4. Boot Hardware
  5. Install OpenStack (Juno using PackStack on CentOS 7)
  6. Integrate with Chef & Chef Provisioning
  7. Integrate with SaltStack

I’ve tried to do some post-production so limit dead air and focus on key areas.  As always, I value content over production values so feedback is very welcome!

Delicious 7 Layer DIP (DevOps Infrastructure Provisioning) model with graphic!

Applying architecture and computer science principles to infrastructure automation helps us build better controls.  In this post, we create an OSI-like model that helps decompose the ops environment.

The RackN team discussions about “what is Ready State” have led to some interesting realizations about physical ops.  One of the most critical has been splitting the operational configuration (DNS, NTP, SSH Keys, Monitoring, Security, etc) from the application configuration.

Interactions between these layers is much more dynamic than developers and operators expect.  

In cloud deployments, you can use ask for the virtual infrastructure to be configured in advance via the IaaS and/or golden base images.  In hardware, the environment build up needs to be more incremental because that variations in physical infrastructure and operations have to be accommodated.

Greg Althaus, Crowbar co-founder, and I put together this 7 layer model (it started as 3 and grew) because we needed to be more specific in discussion about provisioning and upgrade activity.  The system view helps explain how layer 5 and 6 operate at the system layer.

7 Layer DIP

The Seven Layers of our DIP:

  1. shared infrastructure – the base layer is about the interconnects between the nodes.  In this model, we care about the specific linkage to the node: VLAN tags on the switch port, which switch is connected, which PDU ID controls turns it on.
  2. firmware and management – nodes have substantial driver (RAID/BIOS/IPMI) software below the operating system that must be configured correctly.   In some cases, these configurations have external interfaces (BMC) that require out-of-band access while others can only be configured in pre-install environments (I call that side-band).
  3. operating system – while the operating system is critical, operators are striving to keep this layer as thin to avoid overhead.  Even so, there are critical security, networking and device mapping functions that must be configured.  Critical local resource management items like mapping media or building network teams and bridges are level 2 functions.
  4. operations clients – this layer connects the node to the logical data center infrastructure is basic ways like time synch (NTP) and name resolution (DNS).  It’s also where more sophisticated operators configure things like distributed cache, centralized logging and system health monitoring.  CMDB agents like Chef, Puppet or Saltstack are installed at the “top” of this layer to complete ready state.
  5. applications – once all the baseline is setup, this is the unique workload.  It can range from platforms for other applications (like OpenStack or Kubernetes) or the software itself like Ceph, Hadoop or anything.
  6. operations management – the external system references for layer 3 must be factored into the operations model because they often require synchronized configuration.  For example, registering a server name and IP addresses in a DNS, updating an inventory database or adding it’s thresholds to a monitoring infrastructure.  For scale and security, it is critical to keep the node configuration (layer 3) constantly synchronized with the central management systems.
  7. cluster coordination – no application stands alone; consequently, actions from layer 4 nodes must be coordinated with other nodes.  This ranges from database registration and load balancing to complex upgrades with live data migration. Working in layer 4 without layer 6 coordination creates unmanageable infrastructure.

This seven layer operations model helps us discuss which actions are required when provisioning a scale infrastructure.  In my experience, many developers want to work exclusively in layer 4 and overlook the need to have a consistent and managed infrastructure in all the other layers.  We enable this thinking in cloud and platform as a service (PaaS) and that helps improve developer productivity.

We cannot overlook the other layers in physical ops; however, working to ready state helps us create more cloud-like boundaries.  Those boundaries are a natural segue my upcoming post about functional operations (older efforts here).

Ironic + Crowbar: United in Vision, Complementary in Approach

This post is co-authored by Devanda van der Veen, OpenStack Ironic PTL, and Rob Hirschfeld, OpenCrowbar Founder.  We discuss how Ironic and Crowbar work together today and into the future.

Normalizing the APIs for hardware configuration is a noble and long-term goal.  While the end result, a configured server, is very easy to describe; the differences between vendors’ hardware configuration tools are substantial.  These differences make it impossible challenging to create repeatable operations automation (DevOps) on heterogeneous infrastructure.

Illustration to show potential changes in provisioning control flow over time.

Illustration to show potential changes in provisioning control flow over time.

The OpenStack Ironic project is a multi-vendor community solution to this problem at the server level.  By providing a common API for server provisioning, Ironic encourages vendors to write drivers for their individual tooling such as iDRAC for Dell or iLO for HP.

Ironic abstracts configuration and expects to be driven by an orchestration system that makes the decisions of how to configure each server. That type of orchestration is the heart of Crowbar physical ops magic [side node: 5 ways that physical ops is different from cloud]

The OpenCrowbar project created extensible orchestration to solve this problem at the system level.  By decomposing system configuration into isolated functional actions, Crowbar can coordinate disparate configuration actions for servers, switches and between systems.

Today, the Provisioner component of Crowbar performs similar functions as Ironic for operating system installation and image lay down.  Since configuration activity is tightly coupled with other Crowbar configuration, discovery and networking setup, it is difficult to isolate in the current code base.  As Ironic progresses, it should be possible to shift these activities from the Provisioner to Ironic and take advantage of the community-based configuration drivers.

The immediate synergy between Crowbar and Ironic comes from accepting two modes of operation for OpenStack: bootstrapping infrastructure and multi-tenant server allocation.

Crowbar was designed as an operational platform that seeds an OpenStack ready environment.  Once that environment is configured, OpenStack can take over ownership of the resources and allow Ironic to manage and deliver “hypervisor-free” servers for each tenant.  In that way, we can accelerate the adoption of OpenStack for self-service metal.

Physical operations is messy and challenging, but we’re committed to working together to make it suck less.  Operators of the world unite!

why is hardware hard? Ready State Physical Ops Meetup on Tuesday 12/2 9am PT

meh.  Compared to cloud, Ops on physical infrastructure sinks.physical outlet

Unfortunately, the cloud and scale platforms need to run someone so someone’s got to deal with it.  In fact, we’ve got to deal with crates of cranky servers and flocks of finicky platforms.  It’s enough to keep a good operator down.

If that’s you, or someone you care about, join us for a physical ops support group meetup on Tuesday 9am PT (11 central).

There is a light at the end of the tunnel!  We can make it repeatable to provision OpenStack, Hadoop and other platforms.

As a community, we’re steadily bringing best practices and proven automation from cloud ops down into the physical space.   On the OpenCrowbar project, we’re accelerating this effort using the ready state concept as a hand off point for “physical-cloud equivalency” and exploring the concept of “functional operations” to make DevOps scripts more portable.

Join me and Rafael Knuth for a discussion about how operators can work together to break the cycle.

API Driven Metal = OpenCrowbar + Chef Provisioning

The OpenCrowbar community created a Chef-Provisioning driver that allows you to quickly build hardware clusters using Chef cookbooks.

2012-08-05_14-13-18_310When we started using Chef in 2011, there was a distinct gap around bootstrapping systems.  The platform did a great job of automation and even connecting services together (via the Search anti-pattern, see below) but lacked a way to build the initial clusters automatically.

The current answer to this problem from Chef is refreshingly simply: a cookbook API extension called Chef Provisioning.  This approach uses the regular Chef DSL in recipes to create request and bind a cluster into Chef.  Basically, the code simply builds an array of nodes using an API that creates the nodes if they are missing from the array in the code.  Specifically, when a node is missing from the array, Chef calls out to create the node in an external system.

For clouds, that means using the API to request a server and then inject credentials for Chef management.  It’s trickier for physical gear because you cannot just make a server in the configuration you need it in.  Physical systems must first be discovered and profiled to ready state: the system must know how many NICs and disk drives are available to correctly configure the hardware prior to laying down the Operating System.

Consequently, Chef Provisioning automation is more about reallocation of existing discovered physical assets to Chef.  That’s exactly the approach the OpenCrowbar team took for our Chef Provisioning driver.

OpenCrowbar interacts with Chef Provisioning by pulling nodes from the System deployment into a Chef Provisioning deployment.  That action then allows the API client to request specific configurations like Operating System or network that need to be setup for Chef to execute.  Once these requests are made, Crowbar will simply run its normal annealing processes to ready state and then injects the Chef credentials.  Chef waits until the work queue is empty and then takes over management of the asset.  When Chef is finished, Crowbar can be instructed to reconfigure the node back to a base state.

Does that sound simple?  It is simple because the Crowbar APIs match the Chef needs very cleanly.

It’s worth noting that this integration is a great test of the OpenCrowbar API design.  Over the last two years, we’ve evolved the API to make it more final result focused.  Late binding is a critical concept for the project and the APIs reflect that objective.  For Chef Provisioning, we allow the integration to focus on simple requests like “give me a node then put this O/S on the node and go.”  Crowbar has the logic needed to figure out how to accomplish those objectives without much additional instruction.

Bonus Side Note: Why Search can become an anti-pattern?  

Search is an incredibly powerful feature in Chef that allows cross-role and cross-node integration; unfortunately, it’s also very difficult to maintain as complexity and contributor counts grow.  The reason is that search creates “forward dependencies” in the scripts that require operators creating data to be aware of downstream, hidden consumers.  High Availability (HA) is a clear example.  If I add a new “cluster database” role to the system then it is very likely to return multiple results for database searches.  That’s excellent until I learn that my scripts have coded search to assume that we only return one result for database lookups.  It’s very hard to find these errors since the searches are decoupled and downstream of the database cookbook.  Ultimately, the community had to advise against embedded search for shared cookbooks