Bad Premise: Cloud Outages are *not* driving IT back to premises

I wrote this in response to Lauren Carlson’s (Software Advice) blog post.  Lauren – I’d be more likely to agree with the statement that “SLAs are dead.”  Here’s why…

<soapbox>

Recent industry buzz about cloud service level agreements (SLAs) and reliability misses the core point about cloud.  Cloud is about agility, business models, consumerization of software and merciless pursuit of efficiency.

The fact that Amazon EC2 built its base without an “enterprise” SLA is exhibit #1 that the IT world changed and it’s not going back.

Here are my reasons why IT pandoras can’t get cloud back into the box.

#1. Cloud has vastly superior network connectivity

The concept of your users accessing your applications from inside your firewall is so 2005.  Today’s reality is that a significant amount of network access is externally routed, which means that applications need to live where they have excellent bandwidth to their users and to other applications.

#2. Cloud has elastic consumption of resources

Cloud is not less expensive infrastructure; it is mainly more flexible.  If you’re worried about an outage, then cloud is exactly the investment for you because you can position a backup site at another location without having to pay for online resources.  It’s much harder to take down a site whose owners invest the time to design a system that dynamically reallocates load between sites.

#3. Cloud drives more robust architecture

The fact that cloud delivery is more opaque and modular without a five 9s SLA has driven a cloud application architecture revolution (see CAP).  We have shifted the app paradigm from robust scale up hardware to robust scale out software.  Also significant, DevOps innovations have made deployments repeatable and adaptable.

The only “logical” argument for pulling applications back from the cloud is to assert control over more of the delivery chain for your application.  It’s the same reason that we think that driving is safer than flying – we’re the ones sitting behind the wheel when we drive.  News flash – driving is NOT safer than flying.

Cloud applications are not about hardware infrastructure, they are about SOFTWARE.  Perhaps one of the greatest disservices foisted on the market was saying cloud is synonymous with “Infrastructure as a Service” and “Virtualization.”  Cloud applications are powerful because we created ways that circumvent the limitations of IaaS and VMs!

</soapbox>

Not all APIs are equal: the power of API + implementation (OpenStack vs LibCloud vs DeltaCloud)

I’ve been getting a lot of questions about Apache LibCloud and RedHat’s DeltaCloud vs. OpenStack.  While all of these projects offer APIs, only OpenStack is based on an implementation.

Having an implementation means that the API is reflected by code that delivers the functionality of the API.  This means that the implementation-based API more closely reflects the actual workings of the system, while the “pure” API must abstract the workings of multiple systems.  The API-only approach ends up becoming a lowest common denominator instead of a vision of the pure use cases.

LibCloud and DeltaCloud are important and useful.  They provide abstractions that help developers write applications without being tied to a specific cloud vendor.  While lack of lock-in is a concrete benefit, it comes at a price.  The price is that the API shim cannot expose features that differentiate the platforms.  This may represent a significant loss of functionality or performance.
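
To make that price concrete, here is a minimal Python sketch.  The provider names and feature sets are invented for illustration; they do not describe Libcloud, Deltacloud or any real cloud.  The point is only that a vendor-neutral shim can safely expose the intersection of what every target supports.

```python
# ASSUMPTION: hypothetical providers and features, for illustration only.
PROVIDER_FEATURES = {
    "cloud_a": {"create_server", "delete_server", "snapshot", "live_migration"},
    "cloud_b": {"create_server", "delete_server", "snapshot", "spot_pricing"},
    "cloud_c": {"create_server", "delete_server", "floating_ip"},
}


def portable_features(providers):
    """Features a vendor-neutral shim can expose: those every provider supports."""
    feature_sets = [PROVIDER_FEATURES[name] for name in providers]
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


def hidden_features(provider, providers):
    """Features of one provider that the shim cannot expose."""
    return PROVIDER_FEATURES[provider] - portable_features(providers)
```

With these made-up sets, every cloud added to the portability list can only shrink what the shim exposes, which is the trade-off described above.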

When developers implement directly against an implemented API, they can take advantage of the full feature set of their target cloud.  They can also test and verify more directly.  These are significant benefits that result in richer, more robust and faster to market products.

Both approaches have their place and are needed in the market.  If I needed to write against multiple clouds for portability then Libcloud is a slam dunk.  If I needed rich features and an ecosystem then OpenStack or Amazon are better choices.

Hungry for Nova Cuisine? Adding Chef recipes for OpenStack Nova

As promised, here’s the other drop in advance of our OpenStack team’s Crowbar release. 

This is the second part of the Swift and Nova recipes that we are intentionally leaking out to the community.

USAGE NOTE: These recipes are designed to work with Crowbar!  They are not intended to stand alone.

As part of our collaboration with Opscode, Matt Ray has been merging our recipes into his most excellent OpenStack cookbook tree.  If you want to see our unmerged recipes, we’re also posting those to our GitHub.

In addition to our Swift recipes, you can now check out the Nova recipes.

ADDITIONAL USAGE NOTE: Matt’s tree is more complete – these are released for reference only.  They will ultimately be maintained as part of Crowbar.

Cooking up OpenStack Chef recipes with Opscode

Our OpenStack team here at Dell has been busy getting Crowbar ready to open source and that does not leave much time for blog posts.  We’re putting on a new UI, modularizing with barclamps and creating network options for Nova Cactus.

However, I wanted to take a minute to update the community about Swift and Nova recipes that we are intentionally leaking out to the community in advance of the larger Crowbar code drop.

As part of our collaboration with Opscode, Matt Ray has been merging our recipes into his most excellent OpenStack cookbook tree.  If you want to see our unmerged recipes, we’re also posting those to our GitHub.  So far, we have the Swift recipes available (thanks to Andi Abes!) with Nova to follow soon.

5/31 Update: These are now online.

#OpenStack Blueprint for Cloud Installer (#crowbar, #apache2)

Tonight I submitted a formal OpenStack Common blueprint for Crowbar as a cloud installer.  My team at Dell considers this to be our first step towards delivering the code as open source (next few weeks), and we want to show the community the design thinking behind the project.  Crowbar currently only embodies a fraction of this scope but we have designed it looking forward.

I’ve copied the text of our initial blueprint here until it is approved.  The living document will be maintained on the OpenStack Launchpad and I will update links appropriately.

Here’s what I submitted:

Note: Installer is used here because of convention. The scope of this blueprint is intended to include expansion and maintenance of the OpenStack infrastructure.

Summary

This blueprint creates a common installation system for OpenStack infrastructure and components. The installer should be able to discover and configure physical equipment (servers, switches, etc.) and then deploy the OpenStack software components in an optimum way for the discovered infrastructure. Minimal manual steps should be needed for setup and maintenance of the system.

Users should be able to leverage and contribute to components of the system without deploying 100% of the system. This encourages community collaboration. For example, installation scripts that deploy and configure OpenStack components should be usable without using bare metal configuration and vice-versa.

The expected result will be installations that are 100% automated after racking gear with no individual touch of any components.

This means that the installer will be able to

  • expand physical capacity
  • update software components
  • add new software components
  • cope with heterogeneous environments (hardware, OpenStack components, hyper-visors, operating systems, etc)
  • handle rolling upgrades (due to the scale of OpenStack target deployments)

 

Release Note

Not currently released. Reference code (“Crowbar”) to be delivered by Dell via GitHub.

Rationale / Problem Statement

While a complete deployment system is an essential component to ensure adoption, it also fosters sharing and encoding of operational methods by the community. This follows an “Open Ops” strategy that encourages OpenStack users to create and share best practices.

The installer addresses the following needs

  • Community collaboration on deployment scripts and architecture.
  • Bare metal installation – this is different, but possibly related to Nova bare metal provisioning
  • OpenStack is evolving (Ops Model, CloudOps)
  • Provide a common installation platform to facilitate consistent deployments

It is important that the installer does NOT

  • constrain architecture to limit scale
  • create extra effort to re-balance as system capacity grows

This design includes an “Ops Infrastructure API” for use by other components and services. This REST API will allow trusted applications to discover and inspect the operational infrastructure to provide additional services. The API should expose

  • Managed selection of components & requests
  • Expose internal infrastructure (not for customer use, but to enable Ops tools)
    • networks
    • nodes
    • capacity
    • configuration

 

Assumptions

 

  • OpenStack code base will not limit development based on current architecture practices. Cloud architectures will need to adopt
  • Expectation to use IP-based system management tools to provide out of band reboot and power controls.

 

Design

The installation process has multiple operations phases: 1) bare metal provisioning, 2) component deployment, and 3) upgrade/redeployment. While each phase is distinct, they must act in a coordinated way.

A provisioning state machine (PSM) is a core concept for this overall installation architecture. The PSM must be extensible so that new capabilities and sequences can be added.
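
One way to picture an extensible PSM is as a transition table that new capabilities can add rows to.  The Python sketch below is an illustration only, not Crowbar’s implementation; the state and event names are assumptions chosen to resemble the phases described in this blueprint.

```python
class ProvisioningStateMachine:
    """Table-driven state machine: (state, event) -> next state."""

    def __init__(self):
        self._transitions = {}

    def add_transition(self, state, event, next_state):
        # Registering the same (state, event) pair again overrides the old
        # target, which is how a new capability can re-route a sequence.
        self._transitions[(state, event)] = next_state

    def advance(self, state, event):
        key = (state, event)
        if key not in self._transitions:
            raise ValueError(f"no transition from {state!r} on {event!r}")
        return self._transitions[key]


def build_default_psm():
    # ASSUMPTION: hypothetical state and event names.
    psm = ProvisioningStateMachine()
    psm.add_transition("unconfigured", "pxe_boot", "discovered")
    psm.add_transition("discovered", "hardware_configured", "installing")
    psm.add_transition("installing", "os_installed", "ready")
    return psm


def add_burn_in(psm):
    """Extension example: insert a burn-in step before OS installation."""
    psm.add_transition("discovered", "hardware_configured", "burn_in")
    psm.add_transition("burn_in", "burn_in_passed", "installing")
    return psm
```

Keeping the sequence in data rather than in code is what lets a new step such as burn-in be spliced in without touching the existing steps.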

It is important that the installer support IPv6 as an end state. It is not required that the entire process be IPv4 or IPv6 since changing address schema may be desirable depending on the task to be performed.

Modular Design Objective

  • should have a narrow focus for installation – a single product or capability.
  • may have pre-requisites or dependencies but as limited as possible
  • should have system, zone, and node specific configuration capabilities
  • should not interfere with operation of other modules

 

Phase 1: Bare Metal Provisioning

  • For each node:
    • Entry State: unconfigured hardware with network connectivity and PXE boot enabled.
    • Exit State: minimal node config (correct operating system installed, system named and registered, checked into OpenStack install manager)

The core element for Phase 1 is a “PXE State Machine” (a subset of the PSM) that orchestrates node provisioning through multiple installation points. This allows different installation environments to be used while the system is prepared for its final state. These environments may include BIOS & RAID configuration, diagnostics, burn-in, and security validation.

It is anticipated that nodes will pass through phase 1 provisioning FOR EACH boot cycle. This allows the Installation Manager to perform any steps that may be dictated based on the PSM. This could include diagnostic and security checks of the physical infrastructure.

Considerations:

  • REST API for updating to new states from nodes
  • PSM changes PXE image based on state updates
  • PSM can use IPMI to force power changes
  • DHCP reservations assigned by MAC after discovery so nodes have a predictable IP
  • Phase 1 images may change IP addresses during this phase.
  • Discovery phase would use short term DHCP addresses. The size of the DHCP lease pool may be restricted but should allow for provisioning a rack of nodes at a time.
  • Configuration parameters for Phase 1 images can be passed
    • via DHCP properties (preferred)
    • REST data
  • Discovery phase is expected to set the FQDN for the node and register it with DNS
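
Two of the considerations above (the PSM choosing a PXE image from the node’s state, and DHCP reservations keyed by MAC address) can be sketched in a few lines of Python.  The image names, state names and address range are assumptions made for the example; a real installer would drive an actual DHCP/PXE server instead of an in-memory table.

```python
import ipaddress

# ASSUMPTION: hypothetical mapping of node state to the image served over PXE.
PXE_IMAGE_BY_STATE = {
    "discovering": "discovery-image",
    "hardware_setup": "bios-raid-image",
    "installing": "os-installer-image",
    "ready": "local-boot",
}


def pxe_image_for(state):
    """Pick the boot image for a node; unknown states fall back to discovery."""
    return PXE_IMAGE_BY_STATE.get(state, "discovery-image")


class ReservationTable:
    """Hands each MAC address a stable IP so nodes get a predictable address."""

    def __init__(self, cidr):
        self._free = iter(ipaddress.ip_network(cidr).hosts())
        self._by_mac = {}

    def reserve(self, mac):
        key = mac.lower()  # MAC addresses are case-insensitive
        if key not in self._by_mac:
            try:
                self._by_mac[key] = next(self._free)
            except StopIteration:
                raise RuntimeError("reservation pool exhausted") from None
        return self._by_mac[key]
```

Asking for the same MAC twice returns the same address, which is the property the reservation bullet is after.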

 

Phase 2: Component Deployment

  • Entry State: set of nodes in minimal configuration (number required depends on components to deploy, generally >=5)
  • Requirements:
  • Exit State: one or more

During Phase 2, the installer must act on the system as a whole. The focus shifts from single node provisioning, to system level deployment and configuration.

Phase 2 extends the PSM to comprehend the dependencies between system components. The use of a state machine is essential because system configuration may require that individual nodes return to Phase 1 in order to change their physical configuration. For example, a node identified for use by Swift may need to be setup as a JBOD while the same node could be configured as RAID 10 for Nova. The PSM would also be used to handle inter-dependencies between components that are difficult to script in stages such as rebalancing a Swift ring.
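
As a rough illustration of what “comprehending dependencies” can mean, the Python sketch below derives a deployment order from a dependency map using the standard library’s `graphlib` (Python 3.9 or later).  The dependency map itself is an assumption invented for the example, not a statement of OpenStack’s real requirements.

```python
from graphlib import TopologicalSorter

# ASSUMPTION: hypothetical map of component -> components that must be
# deployed before it.
DEPENDS_ON = {
    "dns": set(),
    "monitoring": {"dns"},
    "swift": {"dns"},
    "glance": {"swift"},
    "nova": {"glance", "dns"},
}


def deployment_order(depends_on):
    """Return the components ordered so each one follows its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

A static ordering like this only covers the simple cases; the state machine is still needed for the multi-step interactions mentioned above, such as sending a node back to Phase 1.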

Considerations:

  • Deployments must be infrastructure aware so they can take network topology, disk capacity, fault zones, and proximity into account.
  • System must generate a reviewable proposal for roles nodes will perform.
  • Roles (nodes may have >1 role) define OS & prerequisite components that execute on nodes
  • Operations on nodes should be idempotent for individual actions (multiple state operations will violate this principle by definition)
  • System wide configuration information must be available to individual configuration nodes (e.g.: Scheduler must be able to retrieve a list of all nodes and that list must be automatically updated when new nodes are added).
  • Administrators must be able to centrally override global configuration on an individual, rack and zone basis.
  • Scripts must be able to identify other nodes and find which roles they were executing
  • Must be able to handle non-OS components such as networking, VLANs, load balancers, and firewalls.
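
The override requirement in the list above can be sketched as a layered merge.  In this Python example the precedence (global, then zone, then rack, then individual node, with the most specific layer winning) is an assumption, as are the setting names.

```python
def effective_config(global_cfg, zone_cfg=None, rack_cfg=None, node_cfg=None):
    """Merge configuration layers; later (more specific) layers win.

    ASSUMPTION: precedence is global < zone < rack < node. This is a
    shallow merge, so nested dictionaries are replaced rather than merged.
    """
    merged = dict(global_cfg)
    for layer in (zone_cfg, rack_cfg, node_cfg):
        if layer:
            merged.update(layer)
    return merged
```

For example, `effective_config({"ntp": "ntp-global", "disk": "raid10"}, zone_cfg={"ntp": "ntp-zone1"}, node_cfg={"disk": "jbod"})` keeps the zone’s NTP server and the node’s disk layout while inheriting everything else from the global settings.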

 

Phase 3: Upgrade / Redeployment

The ultimate objective for Phase 3 is to foster a continuous deployment capability in which updates from OpenStack can be frequently and easily implemented in a production environment with minimal risk. This requires a substantial amount of self-testing and automation.

Phase 3 maintains the system when new components arrive. Phase 3 includes the added requirements:

  • rolling upgrades so that system operation is not compromised during a deployment
  • upgrade/patch of modules
  • new modules must be aware of current deployments
  • configuration and data must be preserved
  • deployments may extend the PSM to pre-stage operations (move data and VMs) before taking action.
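
A minimal Python sketch of the rolling upgrade idea: split the nodes into batches so that only a bounded share of capacity is offline at any time.  The 25% default and the function name are assumptions for illustration; a real upgrade would also drain workloads and check health between batches.

```python
import math


def rolling_batches(nodes, max_unavailable_fraction=0.25):
    """Split nodes into upgrade batches, each at most the given share of the fleet.

    At least one node is upgraded per batch, so very small fleets still
    make progress.
    """
    if not 0 < max_unavailable_fraction <= 1:
        raise ValueError("max_unavailable_fraction must be in (0, 1]")
    batch_size = max(1, math.floor(len(nodes) * max_unavailable_fraction))
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
```

Upgrading one batch at a time, and only moving on when that batch is healthy again, is what keeps system operation from being compromised during a deployment.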

 

Ops API

This needs additional requirements.

The objective of the Ops API is to provide a standard way for operations tools to map the internal cloud infrastructure without duplicating discovery effort. This will allow tools that can:

  • create billing data
  • audit security
  • rebalance physical capacity
  • manage power
  • audit & enforce physical partitions between tenants
  • generate ROI analysis
  • IP Address Management (possibly integration/bootstrap with the OpenStack network services)
  • Capacity Planning

 

User Stories

 

Personas:

  • Oscar: Operations Chief
    • Knows of Chef or Puppet. Likely has some experience
    • Comfortable and likes Linux. Probably prefers CentOS
    • Can work with network configuration, but does not own network
    • Has used VMware
  • Charlie: CIO
    • Concerned about time to market and ROI
    • Is working on commercial offering based on OpenStack
  • Denise: Cloud Developer
    • Working on adding features to OpenStack
    • Working on services to pair w/ OpenStack
    • Comfortable with Ruby code
  • Quick: Data Center Worker
    • Can operate systems
    • In charge of rack and replacement of gear
    • Can supervise, but not create automation

 

Proof of Concept (PoC) use cases

 

Agrees to POC

  • Charlie agrees to be in POC by signing agreements
  • Dell gathers information about shipping and PO delivery
  • Quick provides shipping information to Dell
  • Oscar downloads ISO and VMPlayer image from Dell provided site.

 

Get Equipment Setup to base

Event: The Dell equipment has just arrived.

  • Quick checks the manifest to make sure that the equipment arrived.
  • Quick racks the servers and switch following the wiring chart provided by Oscar
  • Quick follows the installation guide’s BIOS and RAID configuration parameters for the Admin Node
  • Quick powers up the servers to make sure all the lights blink then turns them back off
  • Oscar arrives with his laptop and the crowbar ISO
  • As per instructions, Oscar wires his laptop to the admin server and uses VMplayer to bootstrap the ISO image
  • Oscar logs into the VMPlayer image and configures base admin parameters
    • Hostname
    • networks (admin and public required)
      • admin ips
      • routers
      • masks
      • subnets
      • usable ranges (mostly for public).
    • Optional: ntp server(s)
    • Optional: forwarding nameserver(s)
    • passwords and accounts
    • Manually edits files that get downloaded.
  • System validates configuration for syntax and obvious semantic issues.
  • System clears switch config and sets port fast and lldp med configuration.
  • Oscar powers system and selects network boot (system may automatically do this out of the “box”, but can reset if need be).
  • Once the bootstrap and installation of the Ubuntu-based image is completed, Oscar disconnects his laptop from the Admin server and connects into the switch.
  • Oscar configures his laptop for DHCP to join the admin network.
  • Oscar looks at the Chef UI and verifies that it is running and he can see the Admin node in the list.
    • The Install guide will describe this first step and initial passwords.
    • The install guide will have a page describing a valid visualization of the environment.
  • Oscar powers on the next node in the system and monitors its progress in Chef.
    • The install guide will have a page describing this process.
    • The node will arrive on the Chef status page and can be monitored from there. Completion occurs when the node is “checked in”. Intermediate states can be viewed by checking the node’s state attribute.
    • Node transitions through defined flow process for discovery, bios update, bios setting, and installation of base image.
  • Once Oscar sees the node report into Chef, Oscar shows Quick how to check the system status and tells him to turn on the rest of the nodes and monitor them.
  • Quick monitors the nodes while they install. He calls Oscar back when they are all in the “ready” state.
  • Oscar checks their health in Nagios and Ganglia.
  • If there are any red warnings, Oscar works to fix them.

 

Install OpenStack Swift

Event: System checked out healthy from base configuration

  • Oscar logs into the Crowbar portal
  • Oscar selects swift role from role list
  • Oscar is presented with a current view of the swift deployment.
    • Which starts empty
  • Oscar asks for a proposal of swift layout
    • The UI returns a list of storage, auth, proxy, and options.
  • Oscar may take the following actions:
    • He may tweak attributes to better set deployment
      • Use admin node in swift
      • Networking options …
    • He may force a node out or into a sub-role
    • He may re-generate proposal
    • He may commit proposal
  • Oscar finishes configuration proposal and commits proposal.
  • Oscar may validate progress by watching:
    • Crowbar main screen to see that configuration has been updated.
    • Nagios to validate that services have started
    • Chef UI to see raw data.
  • Oscar checks the swift status page to validate that the swift validation tests have completed successfully.
  • If Swift validation tests fail, Oscar uses troubleshooting guide to correct problems or calls support.
    • Oscar uses re-run validation test button to see if corrective action worked.
  • Oscar is directed to Swift On-line documentation for using a swift cloud from the install guide.

 

Install OpenStack Nova

Event: System checked out healthy from base configuration

  • Oscar logs into the Crowbar portal
  • Oscar selects nova role from role list
  • Oscar is presented with a current view of the nova deployment.
    • Which starts empty
  • Oscar asks for a proposal of nova layout
    • The UI returns a list of options, and current sub-role usage (6 or 7 roles).
    • If Oscar has already configured swift, the system will automatically configure glance to use swift.
  • Oscar may take the following actions:
    • He may tweak attributes to better set deployment
      • Use admin node in nova
      • Networking options …
    • He may force a node out or into a sub-role
    • He may re-generate proposal
    • He may commit proposal
  • Oscar finishes configuration proposal and commits proposal.
  • Oscar may validate progress by watching:
    • Crowbar main screen to see that configuration has been updated.
    • Nagios to validate that services have started
    • Chef UI to see raw data.
  • Oscar checks the nova status page to validate that the nova validation tests have completed successfully.
  • If nova validation tests fail, Oscar uses troubleshooting guide to correct problems or calls support.
    • Oscar uses re-run validation test button to see if corrective action worked.
  • Oscar is directed to Nova On-line documentation for using a nova cloud from the install guide.

 

Pilot and Beyond Use Cases

 

Unattended refresh of system

This is a special case, for Denise.

  • Denise is making daily changes to OpenStack’s code base and needs to test them. She has committed changes to their git code repository and started the automated build process.
  • The system automatically receives the latest code and copies it to the admin server.
  • A job on the admin server sees there is new code, resets all the work nodes to “uninstalled” and reboots them.
  • Crowbar reimages and reinstalls the images based on its cookbooks
  • Crowbar executes the test suites against OpenStack when the install completes
  • Denise reviews the test suite report in the morning.

 

Integrate into existing management

Event: System has passed lab inspection, is about to be connected into the corporate network (or hosting data center)

  • Charlie calls Oscar to find out when PoC will start moving into production
  • Oscar realizes that he must change from Nagios to BMC on all the nodes or they will be black listed on the network.
  • Oscar realizes that he needs to update the SSH certificates on the nodes so they can be accessed remotely. He also has to change the accounts that have root access.
  • Option 1: Reinstall.
    • Oscar updates the Chef recipes to remove Nagios and add BMC, copy the cert and configure the accounts.
    • Oscar sets all the nodes to “uninstalled” and reimages the system.
    • Repeat above step until system is configured correctly
  • Option 2: Update Recipes
    • Oscar updates the Chef recipes to remove Nagios and add BMC, copy the cert and configure the accounts.
    • Oscar runs the Chef scripts and inspects one of the nodes to see if the changes were made

 

Implementation

We are offering Crowbar as a starting point. It is an extension of Opscode Chef Server that provides the state machine for phases 1 and 2. Both code bases are licensed under Apache 2.

Test/Demo Plan

TBD

OpenStack Design Conference Observations (plus IPv6 thread)

I’m not going to post a full OpenStack conference summary because I spent more time talking 1 on 1 with partners and customers than participating in sessions.  Other members of the Dell team (@galthaus) did spend more time in sessions (I’ll see if he’ll post his notes).

I did lead an IPv6 unconference and those notes are below.

Overall, my observations from the conference are:

  • A constant level of healthy debate.  For OpenStack to thrive, the community must be able to disagree, discuss and reach consensus.  I saw that going on in nearly every session and hallway.  There were some pitched battles with forks and branches but no injuries.
  • Lots of adopters.  For a project that’s months old, there were lots of companies that were making plans to use OpenStack in some way.
  • Everyone was in a rush.  There’s been something of a log jam for decision making because the market is changing so fast companies seem to delay committing waiting for the “next big thing.”
  • Service Providers and implementers were out in force.
  • IPv6 is interesting to a limited audience, but consistently injected.

While IPv6 deserves more coverage here, I thought it would be worthwhile to at least preserve my notes/tweets from the IPv6 unconference discussion (To IP or not to IPv6? That will be the question.) at the OpenStack Design Summit.

NOTE: My tweets for this topic are notes, not my own experience/opinions

  • RT @opnstk_com_mgr #openstack unconference in camino real today < #IPv6 session going now – good size crowd
  • #NTT has IPv6 for VMs and tests for IPv6. If you set the mac, then you will know what the address will be.
  • it will be helpful to break out VMs to multiple networks – could have a VM on both IPv6 & IPv4
  • @zehicle @sjensen1850 (Dell) if IPv6 100% then may break infrastructure products – inside, easier to stay v4
    • you don’t want to paint yourself into a corner – IPv6 should not become your major feature requirement
  • typing IPv6 address not that hard to remember. DNS helps, but not required if you want to get to machines.
  • using IPv6 not hard – issue is the policy to do it. Until it’s forced. We need to find a path for DUAL operation.
  • chicken/egg problem. Our primary job is to make sure it works and is easy to adopt.
    • we are missing information on what options we have for transforms
  • where is the responsibility to do the translation? floating IP scheme needs to be worked out. IPv6 can make this easier.
  • idea, IPv6 should be the default. Fill gap with IPv4 as a Service? Floating needs NAT – v4aaS is LB/Proxy
  • unconference session was great! Good participation and ideas. Lots of opinions.

We had a hallway conversation after the unconference about what would force the switch.  In a character, it’s $.

Votes for IPv6 during the keynote (tweet: I’d like to hear from audience here if that’s important to them. RT to vote).  Retweeters:

Modularizing Crowbar via Barclamps – Dell prepares to open source our #OpenStack installer

My team at Dell is working diligently to release Crowbar (Apache 2) to the community.

  • We have ramped up our team size (Andi Abes was spotted recently posting on the Swift list).
  • We are collaborating with partners like Rackspace, Opscode and Citrix
  • We brought in UI expertise (Jon Roberts) to improve usability and polish.
  • We are making sure that the code is integrated with our Dell OpenStack Solution (DOSS).
  • We are lining up customers for real field trials.

The single most critical aspect of Crowbar involves a recent architectural change by Greg Althaus to make Crowbar much more modular.  He dubbed the modules “barclamps” because they are used to attach new capabilities into the system.  For example, we include barclamps for DNS, discovery, Nova, Swift, Nagios, Ganglia, and BIOS config.  Users select which combination to use based on their deployment objectives.

In the Crowbar architecture, nearly every capability of the system is expressed as a barclamp.  This means that the code base can be expanded and updated modularly.  We feel that this pattern is essential to community involvement.

For example, another hardware vendor can add a barclamp that does the BIOS configuration for their specific equipment (yes! that is our intent).  While many barclamps will be included with the open source release to install open source components, we anticipate that other barclamps will be only available with licensed products or in limited distribution.

A barclamp is like a cloud menu planner: it evaluates the whole environment and proposes configurations, roles, and recipes that fit your infrastructure.  If you like the menu, then it tells Chef to start cooking.

Barclamps complement the “PXE state machine” aspect of Crowbar by providing logic Crowbar evaluates as the servers reach deployment states.  These states are completely configurable via the provisioner barclamp; consequently, Crowbar users can choose to change order of operations.  They can also add barclamps and easily incorporate them into their workflow where needed.

Barclamps take the form of a Rails controller that inherits from the barclamp superclass.  The superclass provides the basic REST verbs that each barclamp must service while the child class implements the logic to create a “proposal” leveraging the wealth of information in Chef.  Proposals are JSON collections that include configuration data needed for the deployment recipes and a mapping of nodes into roles.

Users are able to review and edit proposals (which are versioned) before asking Crowbar to implement the proposal in Chef.  The proposal is implemented by assigning the nodes into the proposed roles and allowing Chef to work its magic.
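
To illustrate the shape of this pattern (a base class that supplies the common plumbing, a child class that builds the proposal, and a proposal that maps nodes into roles), here is a small sketch.  It is written in Python for brevity rather than as a Rails controller, and every class, field and role name in it is hypothetical; it is not Crowbar’s actual code or proposal format.

```python
import json
from abc import ABC, abstractmethod


class Barclamp(ABC):
    """Base class: shared proposal plumbing (hypothetical, not Crowbar's API)."""

    name = "base"

    @abstractmethod
    def assign_roles(self, nodes):
        """Return a dict mapping role name -> list of node names."""

    def default_attributes(self):
        return {}

    def create_proposal(self, nodes, revision=1):
        # A proposal bundles configuration data with a node-to-role mapping.
        return {
            "barclamp": self.name,
            "revision": revision,
            "attributes": self.default_attributes(),
            "roles": self.assign_roles(nodes),
        }


class ExampleStorageBarclamp(Barclamp):
    """Child class: only the placement logic is specific to this module."""

    name = "example-storage"

    def default_attributes(self):
        return {"replicas": 3}

    def assign_roles(self, nodes):
        ordered = sorted(nodes)
        return {"proxy": ordered[:1], "storage": ordered[1:]}


def to_json(proposal):
    """Serialize a proposal so it can be reviewed, edited and versioned."""
    return json.dumps(proposal, indent=2, sort_keys=True)


def roles_by_node(proposal):
    """Invert the proposal into node -> roles, the form a commit step would apply."""
    result = {}
    for role, members in proposal["roles"].items():
        for node in members:
            result.setdefault(node, []).append(role)
    return {node: sorted(roles) for node, roles in result.items()}
```

Because the proposal is plain data, it can be edited before it is committed, and committing amounts to applying the node-to-role mapping.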

Users can operate barclamps in parallel.  In fact, most of our barclamps are designed to operate in conjunction.

Reminder: It is vital to understand that Crowbar is not a stand-alone utility.  It is coupled to Chef Server for deployment and data storage.  Our objective was to leverage the outstanding capabilities and community support for Chef as much as possible.

We’re excited about this architecture addition to Crowbar and encourage you to think about barclamps that would be helpful to your cloud deployment.

Substituting Action for Knowledge – adopting “ready, fire, aim” as a strategy (and when to run like hell)

Today my mother-in-law (a practicing psychiatrist) was bemoaning the current medical practice of substituting action for knowledge. In her world, many doctors will make rapid changes to their patients’ therapy. Their goal is to address the issues immediately presented (patient feels sad so Dr prescribes antidepressants) rather than taking time to understand the patients’ history or make changes incrementally and measure impacts. It feels like another example of our cultural compulsion to fix problems as quickly as possible.

Her comments made me question the core way that I evangelize!

Do Lean and Agile substitute action for knowledge? No. We use action to acquire knowledge.

The fundamental assumption that drives poor decision-making is that we have enough information to make a design, solve a problem or define a market. Lean and Agile’s core tenet is that we must attack this assumption. We must assume that we can’t gather enough information to fully define our objective. The good news is that even without much analysis we know a lot! We know:

  • roughly what we want to do (road map)
  • the first steps we should take (tactics)
  • who will be working on the problem (team members)
  • generally how much effort it will take (time & team size)
  • who has the problem that we are trying to solve (market)

We also know that we’ll learn a lot more as we get closer to our target. Every delay in starting effectively pushes our “day of clarity” further into the future. For that reason, it is essential that we build a process that constantly reviews and adjusts its targets.

We need to build a process that acquires knowledge as progress is made and makes rapid progress.

In Agile, we translate this need into the decorations of our process: reviews for learning, retrospectives for adjustments, planning for taking action and short iterations to drive the feedback loop.  Agile’s mantra is “ready, fire, aim, fire, aim, fire, aim, …” which is very different from simply jumping out of a plane without a parachute and hoping you’ll find a haystack to land in.

For cloud deployments, this means building operational knowledge in stages.  Technology is simply evolving too quickly and best practices too slowly for anyone to wait for a packaged solution to solve all their cloud infrastructure problems.  We tried this and it does not work: clouds are a mixture of hardware, software and operations.  More accurately, clouds are an operational model supported by hardware and software.

Currently, 80% of cloud deployment effort is operations (or “DevOps”).

When I listen to people’s plans about building product or deploying cloud, I get very skeptical when they take a lot of time to aim at objects far off on the horizon.  Perhaps they are worried that they will substitute action for knowledge; however, I think they would be better served to test their knowledge with a little action.

My MIL agrees – she sees her patients frequently and makes small adjustments to their treatment as needed.  Wow, that’s an Rx for Agile!

How OpenStack installer (crowbar + chefops) works (video from 3/14 demo)

July 24th 2012 Update:

This page is very very old and Crowbar has progressed significantly since this was posted.  For better information, please visit the Crowbar wiki and  review my Crowbar 2 writeups.

August 5th 2011 Update:

While still relevant and accurate, the information on this page does not reflect the latest information about the now Apache 2 released Crowbar code.  In the 4+ months following this post, we substantially refactored the code to make it more modular (see Barclamps), better looking, and multi-vendor/multi-application (Hadoop & RHEL).  If you want more information, I recommend that you try Crowbar for yourself.

Original March 14th 2011 Text:

I’ve been getting some “how does Crowbar work” inquiries and wanted to take a shot at adding some technical detail.   Before I launch into technical babble, there are some important things to note:

  1. Dell has committed to an open source release of the code for Crowbar (Apache 2)
  2. Crowbar is an extension of Chef Server – it does not function standalone and uses Chef’s APIs to store all its data.
  3. The OpenStack components install is managed by Chef cookbooks & recipes jointly developed by Dell, Opscode and Rackspace.
  4. Crowbar can be used to simply bootstrap your data center; however, we believe it is the start of a cloud operational model that I described in the hyperscale cloud white paper.

LIVE DEMO (video via Barton George): If you’re at SXSW on 3/14 @ 2pm in Kung Fu Salon, you can ask Greg Althaus to explain it – he does a better job than I do.

Here’s what you need to know to understand Crowbar:

Crowbar is a PXE state machine.

The primary function of Crowbar is to get new hardware into a state where it can be managed by Chef.   To get hardware into a “Chef Ready” state, there are several steps that must be performed.  We need to set up the BIOS and RAID, figure out where the server is racked, install an operating system, assign IP networking and names, synchronize clocks (NTP) and set up a Chef client linked to our server.  That’s a lot of steps!

In order to do these steps, we need to boot the server through a series of controlled images (stages) and track the progress through each state.  That means that each state corresponds to a PXE boot image.  Each image has a simple script that uses WGET to update the Crowbar server (which stores its data in Chef) when the script completes.  When a state is finished, Crowbar will change the PXE server to provide the next image in the sequence.
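To make the "PXE state machine" idea concrete, here is a minimal Python sketch of the bookkeeping described above.  It is not Crowbar's code: the stage names and the `PxeStateMachine` class are hypothetical, and a real system would persist the state (Crowbar keeps it in Chef) instead of in a dictionary.

```python
# Hypothetical stage names, loosely based on the steps listed above;
# they are not Crowbar's actual state names.
STAGES = ["discovery", "bios", "raid", "os_install", "chef_ready"]


class PxeStateMachine:
    """Tracks which boot image each node should receive next."""

    def __init__(self, stages):
        self.stages = list(stages)
        self.node_stage = {}  # node name -> index into self.stages

    def image_for(self, node):
        """Return the stage whose image the PXE server should hand out."""
        index = self.node_stage.setdefault(node, 0)
        return self.stages[index]

    def report_done(self, node, stage):
        """Called when a node's script phones home (the WGET step)."""
        if stage != self.image_for(node):
            raise ValueError(f"{node} reported {stage!r} out of order")
        # Advance, but stay on the final stage once it is reached.
        last = len(self.stages) - 1
        self.node_stage[node] = min(self.node_stage[node] + 1, last)
        return self.image_for(node)
```

Each reboot asks `image_for` which image to serve, and each completed script calls `report_done` to move that node one stage forward.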

During the Crowbar managed part of the install, the servers will reboot several times.  Once all of the hardware configuration is complete, Crowbar will use an operating system install image to create the base configuration.  For the first release, we are only planning to have a single Operating System (Ubuntu 10.10); however, we expect to be adding more operating system options.

The current architecture of Crowbar (and the Chef Server that it extends) is to use a dedicated server in the system for administration.  Our default install adds PXE, DHCP, NTP, DNS, Nagios, & Ganglia to the admin server.  For small systems, you can use Chef to add other infrastructure capabilities to the admin server; unfortunately, adding components makes it harder to redeploy the components.  For dynamic configurations where you may want to rehearse deployments while building Chef recipes, we recommend installing other infrastructure services on the admin server.

Of course, the hardware configuration steps are vendor specific.  We had to make the state machine (stored in Chef data bags) configurable so that you can add or omit steps.  Since hardware config is slow, error prone and painful, we see this as a big value add.  Making it work for open source will depend on community participation.
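As a rough illustration of what "configurable so that you can add or omit steps" could look like, here is a small Python sketch that reads a step list from a JSON document, which is the general shape of a Chef data bag item.  The schema, the `enabled` flag and the step names are all assumptions made for this example, not Crowbar's real format.

```python
import json

# Hypothetical data-bag-style document; the keys and step names are
# illustrative, not Crowbar's real schema.
STATE_MACHINE_JSON = """
{
  "steps": [
    {"name": "discovery", "enabled": true},
    {"name": "bios", "enabled": true},
    {"name": "raid", "enabled": false},
    {"name": "os_install", "enabled": true}
  ]
}
"""


def active_steps(document):
    """Return the names of enabled steps, preserving their order."""
    config = json.loads(document)
    return [s["name"] for s in config["steps"] if s.get("enabled", True)]


def next_step(document, current):
    """Return the step after `current`, or None when the sequence is done."""
    steps = active_steps(document)
    position = steps.index(current)
    return steps[position + 1] if position + 1 < len(steps) else None
```

With the RAID step disabled in the document, a node finishing the BIOS step would go straight to the operating system install; a vendor-specific step could be added by appending another entry to the list.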

Once Chef has control of the servers, you can use Chef (on the local Chef Server) to complete the OpenStack installation.  From there, you can continue to use Chef to deploy VMs into the environment.  Because Chef encourages a DevOps automation mindset, I believe there is a significant ROI to your investment in learning how this tool operates if you want to manage hyperscale clouds.

Crowbar effectively extends the reach of Chef earlier into the cloud management life cycle.

3/21 Note: Updated graphic to show WGET.

Notes from 2011 Cloud Connect Event Day 2 (#ccevent)

With the OpenStack launch behind me, I have some time to attend the Cloud Connect Event.  I missed all the DevOps sessions, but did get to geek out on the NoSQL & Big Data sessions.   I jumped to the private cloud track (based on Twitter traffic) and was rewarded for the shift.

I’m surprised at how much of this cloud conference is dedicated to private cloud.  At other cloud conferences I’ve attended, the focus has been on learning how to use the cloud (specifically the public cloud).  This is the first cloud show I’ve attended that has so much emphasis, dialog and vendor feeding around private.  This was a suits & slacks show with few jeans, t-shirts, and pony tails.  Perhaps private cloud is where the $$$ is being spent now?

It definitely feels like using cloud has become assumed, but the best practices and tools are just emerging.

The Twitter #ccevent stream is interesting but ephemeral.  I’m posting my raw (spelling optional) notes (below the more tag) because there is a lot of great content from the show to support and extend the Twitter stream.  I’ll try to italicize some of the better lines.
