Apparently IT death smells like kickstart files. Six Reasons why.

Today, I’m sharing a parable about always being focused on adding value.

Recently, I was on a call with an IT Ops manager who insisted that his team had their on-premises operations under control with “python scripts and manual kickstart files” because they “really don’t change their infrastructure setup.” He explained that he and his team were comfortable with this because it was something they understood and it did not require learning new systems. While I understand his position, I was sort of sad for him and his employer because…

No value is created for his company by maintaining custom kickstart, preseeds or boot files.

Maintaining kickstarts is fatal for many reasons. Is there a way to make it less fatal? Yes, and it involves investing in learning tools that let you move up stack.

Contrary to popular IT mythology, managing physical infrastructure is still a reality for many IT teams and will remain a part of best practices until every workload simply runs on Amazon and it becomes their problem.  Since that “Utopian” future is unlikely, let’s deal with some practical realities of hybrid IT.

Here are my six reasons why custom kickstarts (and other site-specific boot provisioning scripts) are dangerous:

1. Creating Site Unique Processes

Every infrastructure is unique, and that’s a practical reality we have to accept because otherwise we would never be able to make improvements and corrections without touching everything that is already deployed. However, we really want to work hard to minimize places where we inject variation into the environment. Server- and site-specific kickstarts with lots of post-provisioning steps force operators to maintain additional information about each server.

2. Building Server Specific Configurations

When we create server specific templates, it becomes nearly impossible to recreate server builds. That directly leads to fragile infrastructure because teams cannot quickly redeploy or automate refreshes. Static IT infrastructure is a known fail pattern and makes enterprises vulnerable to staff changes, hacking and inability to manage and patch.

3. Having Opaque Configurations

Kickstart is hard to understand (and even harder to troubleshoot). When teams take actions during the provisioning process, those actions are often not tracked or managed like other operational scripts. Failures or injections can easily go undetected. Even if they are tracked, the number of operators who can read and manage these scripts is limited. That means that critical aspects of your operational environment happen outside of your awareness.

4. Being Less Secure

Kickstart processes generally include injecting SSH keys, certificates and other authentication credentials. These embedded credentials are often hard coded into the process with minimal awareness on the part of the operations team, leaving you vulnerable at the most foundational level. This is not an acceptable security process; however, teams who hack kickstarts often don’t want to consider the implications.
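To make the risk concrete, here is a minimal Python sketch of an audit that flags credentials embedded in a kickstart file. The pattern list and the sample file are my own illustrative assumptions, not a complete or authoritative scanner; a real audit would need site-specific patterns.

```python
import re

# Hypothetical patterns for illustration only; extend these for a real audit.
SUSPECT_PATTERNS = {
    "ssh public key": re.compile(r"ssh-(rsa|ed25519|dss) [A-Za-z0-9+/=]+"),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # `rootpw` is the kickstart command that sets the root password;
    # `rootpw --lock` sets no password, so it is skipped.
    "root password": re.compile(r"^[ \t]*rootpw[ \t]+(?!--lock\b)\S+", re.MULTILINE),
}


def find_embedded_credentials(kickstart_text):
    """Return a sorted list of (line_number, label) for each suspect match."""
    findings = []
    for label, pattern in SUSPECT_PATTERNS.items():
        for match in pattern.finditer(kickstart_text):
            # Count the newlines before the match to get a 1-based line number.
            line_number = kickstart_text.count("\n", 0, match.start()) + 1
            findings.append((line_number, label))
    return sorted(findings)


# Made-up sample kickstart with two embedded credentials.
SAMPLE = """\
install
rootpw --iscrypted $6$abc$def
%post
echo "ssh-rsa AAAAB3NzaC1yc2E ops@example" >> /root/.ssh/authorized_keys
%end
"""

if __name__ == "__main__":
    for line_number, label in find_embedded_credentials(SAMPLE):
        print(f"line {line_number}: {label}")
```

Even a crude check like this tends to surprise teams who assumed their boot files were credential free.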

Security side note: most teams don’t have the expertise to integrate TPM or HSM into their kickstart processes; consequently, these key security technologies are generally unused and ignored. If you want to talk about this, please contact me!

5. Diverging Provisioning Patterns

Cloud does not use kickstarts. Provisioning variation increases when teams keep/add logic and configuration into server provisioning instead of doing it as post-provision automation. If your physical provisioning team is not rehearsing on cloud then you’re in a serious IT hole because all workloads should be managed as hybrid-ready. Deployment fidelity helps accelerate teams and reduces cost.

6. Reusing Community Practice

Finally, managing your own kickstarts makes it impossible to leverage community patterns and practices. Kickstarts are not exactly a hive of innovation so you are not creating any competitive advantage by adding variation there. In cases like that, reusing community tooling is a net benefit to your organization. Why have we not done this already? Until recently, provisioning tools were not API driven or focused on reusable shared practice.

While Kickstart or similar is pretty much required for physical, we have a solution for these issues.

One of the key design elements of Digital Rebar is a templated, API driven boot provisioner. Our approach uses kickstarts, preseeds and other tools; however, we’ve worked hard to minimize their span and decompose them into reusable components. That allows users to inject site specific code as snippets that are centrally managed and hardware neutral.
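As a rough illustration of the idea (not Digital Rebar’s actual template syntax or API), here is a small Python sketch that composes a kickstart from a generic base template plus centrally managed snippets. The snippet names, parameters and base template are all hypothetical.

```python
from string import Template

# Hypothetical shared snippets, managed in one place and reused across sites.
SNIPPETS = {
    "ntp": "echo 'server $ntp_server' >> /etc/ntp.conf",
    "proxy": "echo 'proxy=$proxy_url' >> /etc/yum.conf",
}

# Hypothetical base template: the kickstart skeleton stays small and generic.
BASE_TEMPLATE = """\
install
lang en_US.UTF-8
%post
$post_snippets
%end
"""


def render_kickstart(snippet_names, params):
    """Render the base template with the named snippets and site parameters.

    Raises KeyError if a snippet name is unknown or a parameter is missing,
    so a bad configuration fails before anything is provisioned.
    """
    rendered_snippets = [
        Template(SNIPPETS[name]).substitute(params) for name in snippet_names
    ]
    return Template(BASE_TEMPLATE).substitute(
        post_snippets="\n".join(rendered_snippets)
    )


if __name__ == "__main__":
    print(render_kickstart(["ntp"], {"ntp_server": "10.0.0.1"}))
```

The site-specific part shrinks to a list of snippet names and a dictionary of parameters, which is much easier to review and track than a hand-edited boot file per server.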

Critically, our approach allows SRE and Ops teams to get out of the kickstart business and focus on provisioning workflow and automation. Yes, there’s some learning curve but there are a lot of benefits to moving up stack.

It’s not too late to “:q!” those kickstart edits and accelerate your infrastructure.

Rocking Docker – OpenCrowbar builds solid foundation & life-cycle [VIDEOS]

Docker has been gathering a substantial amount of interest as an additional way to solve application portability and dependency hell.  We’ve been enthusiastic participants in this fledgling community (Docker in OpenStack), and I’ve used it in my work on DefCore’s Tempest in a Container (TCUP).

In OpenCrowbar, we’ve embedded Docker much deeper to solve a few difficult & critical problems: speeding up developing multi-node deployments and building the environment for the containers.  Check out my OpenCrowbar does Docker video or the community demo!

Bootstrapping Docker into a DevOps management framework turns out to be non-trivial because integrating new nodes into a functioning operating environment is very different on Docker than on physical servers or VMs.  Containers don’t PXE boot and have more limited configuration options.

How did we do this?  Unlike other bare metal provisioning frameworks, we made sure that Crowbar did not require DHCP+PXE as the only node discovery process.  While we default to and fully support PXE with our sledgehammer discovery image, we also allow operators to pre-populate the Crowbar database using our API and make configuration adjustments before the node is discovered/created.

We even went a step farther and enabled the Crowbar dependency graph to take alternate routes (we call it the “provides” role).  This enhancement is essential for dealing with “alike but different” infrastructure like Docker.
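To show what an alternate route through a dependency graph can look like, here is a minimal Python sketch. The role names, the capability names and the resolver are my own assumptions for illustration; this is not Crowbar’s actual data model or code.

```python
# Hypothetical role catalog: each role lists what it requires and provides.
ROLES = {
    "pxe-os-install": {"requires": [], "provides": ["os-ready"]},
    "docker-container": {"requires": [], "provides": ["os-ready"]},
    "chef-client": {"requires": ["os-ready"], "provides": ["managed-node"]},
}


def plan_roles(target, available_roles):
    """Return an ordered list of roles that satisfies the `target` capability.

    Only roles in `available_roles` are considered, so the same target can be
    reached by different routes depending on the kind of node.
    Note: this sketch does not detect dependency cycles.
    """
    plan = []

    def satisfy(capability):
        for name in available_roles:
            role = ROLES[name]
            if capability in role["provides"]:
                # Satisfy this role's own requirements before adding it.
                for needed in role["requires"]:
                    satisfy(needed)
                if name not in plan:
                    plan.append(name)
                return
        raise LookupError(f"no available role provides {capability!r}")

    satisfy(target)
    return plan


if __name__ == "__main__":
    print(plan_roles("managed-node", ["pxe-os-install", "chef-client"]))
    print(plan_roles("managed-node", ["docker-container", "chef-client"]))
```

The downstream role only asks for a capability, so a physical server and a container can each reach it by a different path.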

The result is that you can request Docker nodes in OpenCrowbar (using the API only for now) and it will automatically create the containers and attach them into Crowbar management.  It’s important to stress that we are not adding existing containers to Crowbar by adding an agent; instead, Crowbar manages the container’s life-cycle and then the work inside the container.

Getting around the PXE cycle using containers as part of Crowbar substantially improves Ops development cycle time because we don’t have to wait for boot > discovery > reboot > install to create a clean environment.  Bringing fresh Docker containers into a dev system takes seconds instead.

The next step is equally powerful: Crowbar should be able to configure the Docker host environment on host nodes (not just the Admin node as we are now demonstrating).  Setting up the host can be very complex: you need to have the correct RAID, BIOS, Operating System and multi-NIC networking configuration.  All of these factors must be handled with a system perspective that matches your Ops environment.  Luckily, this is exactly Crowbar’s sweet spot!

Until we’ve got that pulled together, OpenCrowbar’s ability to use upstream cookbooks and this latest Dev/Test focused step provide remarkable out of the gate advantages for everyone building multi-node DevOps tools.


PS: It’s worth noting that we’ve already been using Docker to run & develop the Crowbar Admin server.  This extra step makes Crowbar even more Dockeriffic.

How OpenStack installer (crowbar + chefops) works (video from 3/14 demo)

July 24th 2012 Update:

This page is very, very old and Crowbar has progressed significantly since this was posted.  For better information, please visit the Crowbar wiki and review my Crowbar 2 writeups.

August 5th 2011 Update:

While still relevant and accurate, the information on this page does not reflect the latest information about the Crowbar code, which has now been released under Apache 2.  In the 4+ months following this post, we substantially refactored the code to make it more modular (see Barclamps), better looking, and multi-vendor/multi-application (Hadoop & RHEL).  If you want more information, I recommend that you try Crowbar for yourself.

Original March 14th 2011 Text:

I’ve been getting some “how does Crowbar work” inquiries and wanted to take a shot at adding some technical detail.   Before I launch into technical babble, there are some important things to note:

  1. Dell has committed to releasing the code for Crowbar as open source (Apache 2).
  2. Crowbar is an extension of Chef Server – it does not function stand alone and uses Chef’s APIs to store all its data.
  3. The OpenStack components install is managed by Chef cookbooks & recipes jointly developed by Dell, Opscode and Rackspace.
  4. Crowbar can be used to simply bootstrap your data center; however, we believe it is the start of a cloud operational model that I described in the hyperscale cloud white paper.

LIVE DEMO (video via Barton George): If you’re at SXSW on 3/14 @ 2pm in Kung Fu Salon, you can ask Greg Althaus to explain it – he does a better job than I do.

Here’s what you need to know to understand Crowbar:

Crowbar is a PXE state machine.

The primary function of Crowbar is to get new hardware into a state where it can be managed by Chef.   To get hardware into a “Chef Ready” state, there are several steps that must be performed.  We need to setup the BIOS, RAID, figure out where the server is racked, install an operating system, assign IP networking and names, synchronize clocks (NTP) and setup a chef client linked to our server.  That’s a lot of steps!

In order to do these steps, we need to boot the server through a series of controlled images (stages) and track the progress through each state.  That means that each state corresponds to a PXE boot image.  Each image has a simple script that uses WGET to update the Crowbar server (which stores its data in Chef) when the script completes.  When a state is finished, Crowbar will change the PXE server to provide the next image in the sequence.
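The loop described above can be sketched in a few lines of Python. This is only a model of the idea under my own assumptions: the stage names and the class are hypothetical, and the real system stores its state in Chef and receives its updates over HTTP.

```python
# Hypothetical stage names; the post describes the kinds of steps,
# not these identifiers.
STAGES = ["discovery", "bios-config", "raid-config", "os-install", "chef-ready"]


class NodeStateMachine:
    """Tracks which boot image each node should receive next."""

    def __init__(self, stages):
        self.stages = list(stages)
        self.position = {}  # node name -> index into self.stages

    def image_for(self, node):
        """Return the stage whose image the PXE server should offer the node."""
        index = self.position.setdefault(node, 0)
        return self.stages[index]

    def report_complete(self, node, stage):
        """Handle a node's 'stage finished' callback and advance the node.

        Returns the stage the node should boot next. A node in the final
        stage stays there.
        """
        current = self.image_for(node)
        if stage != current:
            raise ValueError(f"{node} reported {stage!r} but is in {current!r}")
        if self.position[node] < len(self.stages) - 1:
            self.position[node] += 1
        return self.image_for(node)


if __name__ == "__main__":
    machine = NodeStateMachine(STAGES)
    print(machine.image_for("node-1"))
    print(machine.report_complete("node-1", "discovery"))
```

Rejecting out-of-order reports is a design choice in this sketch; it keeps a misbehaving node from skipping a hardware configuration step.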

During the Crowbar managed part of the install, the servers will reboot several times.  Once all of the hardware configuration is complete, Crowbar will use an operating system install image to create the base configuration.  For the first release, we are only planning to have a single Operating System (Ubuntu 10.10); however, we expect to be adding more operating system options.

The current architecture of Crowbar (and the Chef Server that it extends) is to use a dedicated server in the system for administration.  Our default install adds PXE, DHCP, NTP, DNS, Nagios, & Ganglia to the admin server.  For small systems, you can use Chef to add other infrastructure capabilities to the admin server; unfortunately, adding components makes it harder to redeploy the components.  For dynamic configurations where you may want to rehearse deployments while building Chef recipes, we recommend installing other infrastructure services on the admin server.

Of course, the hardware configuration steps are vendor specific.  We had to make the state machine (stored in Chef data bags) configurable so that you can add or omit steps.  Since hardware config is slow, error prone and painful, we see this as a big value add.  Making it work for open source will depend on community participation.
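A configurable sequence can be as simple as data that lists the steps and marks which ones apply. Here is a minimal Python sketch; the JSON layout and the step names are hypothetical and are not the actual data bag schema.

```python
import json

# Hypothetical configuration document. In Crowbar the real data lives in
# Chef data bags; this layout is invented for illustration.
CONFIG_JSON = """
{
  "stages": [
    {"name": "discovery", "enabled": true},
    {"name": "bios-config", "enabled": false},
    {"name": "raid-config", "enabled": true},
    {"name": "os-install", "enabled": true}
  ]
}
"""


def enabled_stages(config_text):
    """Return the ordered stage names, omitting any stage that is disabled.

    A stage with no "enabled" key is treated as enabled.
    """
    config = json.loads(config_text)
    return [
        stage["name"]
        for stage in config["stages"]
        if stage.get("enabled", True)
    ]


if __name__ == "__main__":
    print(enabled_stages(CONFIG_JSON))
```

Hardware that needs no BIOS step just turns it off in the data, and nobody has to edit the scripts that walk the sequence.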

Once Chef has control of the servers, you can use Chef (on the local Chef Server) to complete the OpenStack installation.  From there, you can continue to use Chef to deploy VMs into the environment.  Because Chef encourages a DevOps automation mindset, I believe there is a significant ROI to your investment in learning how this tool operates if you want to manage hyperscale clouds.

Crowbar effectively extends the reach of Chef earlier into the cloud management life cycle.

3/21 Note: Updated graphic to show WGET.