OpenStack discussion at 5/19 Central Texas Linux Users Group (CTLUG ATX)

image

Greg Althaus (@glathaus) and I will be leading a discussion about OpenStack at the May CTLUG  on 5/19 at 7pm.  The location is Mangia Pizza on Burnet and Duval (In the strip mall where Taco Deli is).

We’ll talk about how OpenStack works, where we see it going, and what Dell is doing to participate in the community.

OpenStack should be very interesting to the CTLUG because of the technologies being used AND way that the community is engaged in helping craft the software.

#OpenStack Blueprint for Cloud Installer (#crowbar, #apache2)

Tonight I submitted a formal OpenStack Common blue print for Crowbar as a cloud installerMy team at Dell considers this to be our first step towards delivering the code as open source (next few weeks) and want to show the community the design thinking behind the project.  Crowbar currently only embodies a fraction of this scope but we have designed it looking forward.

I’ve copied the text of our inital blueprint here until it is approved.  The living document will be maintained at the OpenStack launch pad and I will update links appropriately.

Here’s what I submitted:

Note: Installer is used here because of convention. The scope of this blue print is intended to include expansion and maintenance of the OpenStack infrastructure. 

Summary

This blueprint creates a common installation system for OpenStack infrastructure and components. The installer should be able to discover and configure physical equipment (servers,switches, etc) and then deploy the OpenStack software components in an optimum way for the discovered infrastructure. Minimum manual steps should be needed for setup and maintenance of the system.

Users should be able to leverage and contribute to components of the system without deploying 100% of the system. This encourages community collaboration. For example, installation scripts that deploy and configure OpenStack components should be usable without using bare metal configuration and vice-versa.

The expected result will be installations that are 100% automated after racking gear with no individual touch of any components.

This means that the installer will be able to

  • expand physical capacity
  • update of software components
  • addition of new software components
  • cope with heterogeneous environments (hardware, OpenStack components, hyper-visors, operating systems, etc)
  • handle rolling upgrades (due to the scale of OpenStack target deployments)

 

Release Note

Not currently released. Reference code (“Crowbar”) to be delivered by Dell via GitHub .

Rationale / Problem Statement

While a complete deployment system is an essential component to ensure adoption, it also fosters sharing and encoding of operational methods by the community. This follows and “Open Ops” strategy that encourages OpenStack users to create and share best practices.

The installer addresses the following needs

  • Community collaboration on deployment scripts and architecture.
  • Bare metal installation – this is different, but possibly related to Nova bare metal provisioning
  • OpenStack is evolving (Ops Model, CloudOps )
  • Provide a common installation platform to facilitate consistent deployments

It is important that the installer does NOT

  • constrain architecture to limit scale
  • create extra effort to re-balance as system capacity grows

This design includes an “Ops Infrastructure API” for use by other components and services. This REST API will allow trusted applications to discover and inspect the operational infrastructure to provide additional services. The API should expose

  • Managed selection of components & requests
  • Expose internal infrastructure (not for customer use, but to enable Ops tools)
    • networks
    • nodes
    • capacity
    • configuration

 

Assumptions

 

  • OpenStack code base will not limit development based on current architecture practices. Cloud architectures will need to adopt
  • Expectation to use IP-based system management tools to provide out of band reboot and power controls.

 

Design

The installation process has multiple operations phases: 1) bare metal provisioning, 2) component deployment, and 3) upgrade/redeployment. While each phase is distinct, they must act in a coordinated way.

A provisioning state machine (PSM) is a core concept for this overall installation architecture. The PSM must be extensible so that new capabilities and sequences can be added.

It is important that installer support IPv6 as an end state. It is not required that the entire process be IPv4 or IPv6 since changing address schema may be desirable depending on the task to be performed.

Modular Design Objective

  • should have a narrow focus for installation – a single product or capability.
  • may have pre-requisites or dependencies but as limited as possible
  • should have system, zone, and node specific configuration capabilities
  • should not interfere with operation of other modules

 

Phase 1: Bare Metal Provisioning

  • For each node:
    • Entry State: unconfigured hardware with network connectivity and PXE boot enabled.
    • Exit State: minimal node config (correct operating system installed, system named and registered, checked into OpenStack install manager)

The core element for Phase 1 is a “PXE State Machine” (a subset of the PSM) that orchestrates node provisioning through multiple installation points. This allows different installation environments to be used while the system is prepared for it’s final state. These environments may include BIOS & RAID configuration, diagnostics, burn-in, and security validation.

It is anticipated that nodes will pass through phase 1 provisioning FOR EACH boot cycle. This allows the Installation Manager to perform any steps that may be dictated based on the PSM. This could include diagnostic and security checks of the physical infrastructure.

Considerations:

  • REST API for updating to new states from nodes
  • PSM changes PXE image based on state updates
  • PSM can use IPMI to force power changes
  • DHCP reservations assigned by MAC after discovery so nodes have a predictable IP
  • Phase 1 images may change IP addresses during this phase.
  • Discovery phase would use short term DHCP addresses. The size of the DHCP lease pool may be restricted but should allow for provisioning a rack of nodes at a time.
  • Configuration parameters for Phase 1 images can be passed
    • via DHCP properties (preferred)
    • REST data
  • Discovery phase is expected to set the FQDN for the node and register it with DNS

 

Phase 2: Component Deployment

  • Entry State: set of nodes in minimal configuration (number required depends on components to deploy, generally >=5)
  • Requirements:
  • Exit State: one or more

During Phase 2, the installer must act on the system as a whole. The focus shifts from single node provisioning, to system level deployment and configuration.

Phase 2 extends the PSM to comprehend the dependencies between system components. The use of a state machine is essential because system configuration may require that individual nodes return to Phase 1 in order to change their physical configuration. For example, a node identified for use by Swift may need to be setup as a JBOD while the same node could be configured as RAID 10 for Nova. The PSM would also be used to handle inter-dependencies between components that are difficult to script in stages such as rebalancing a Swift ring.

Considerations:

  • Deployments must be infrastructure aware so they can take network topology, disk capacity, fault zones, and proximity into account.
  • System must generate a reviewable proposal for roles nodes will perform.
  • Roles (nodes may have >1 role) define OS & prerequisite components that execute on on nodes
  • Operations on nodes should be omnipotent for individual actions (multiple state operations will violate this principle by definition)
  • System wide configuration information must be available to individual configuration nodes (e.g.: Scheduler must be able to retrieve a list of all nodes and that list must be automatically updated when new nodes are added).
  • Administrators must be able to centrally override global configuration on a individual, rack and zone basis.
  • Scripts must be able to identify other nodes and find which roles they were executing
  • Must be able to handle non-OS components such as networking, VLANs, load balancers, and firewalls.

 

Phase 3: Upgrade / Redeployment

The ultimate objective for Phase 3 is to foster a continuous deployment capability in which updates from OpenStack can be frequently and easily implemented in a production environment with minimal risk. This requires a substantial amount of self-testing and automation.

Phase 3 maintains the system when new components arrive. Phase 3 includes the added requirements:

  • rolling upgrades so that system operation is not compromised during a deployment
  • upgrade/patch of modules
  • new modules must be aware of current deployments
  • configuration and data must be preserved
  • deployments may extend the PSM to to pre-stage operations (move data and vms) before taking action.

 

Ops API

This needs additional requirements.

The objective of the Ops API is to provide a standard way for operations tools to map the internal cloud infrastructure without duplicating discovery effort. This will allow tools that can:

  • create billing data
  • audit security
  • rebalance physical capacity
  • manage power
  • audit & enforce physical partitions between tenants
  • generate ROI analysis
  • IP Address Management (possibly integration/bootstrap with the OpenStack network services)
  • Capacity Planning

 

User Stories

 

Personas:

  • Oscar: Operations Chief
    • Knows of Chef or Puppet. Likely has some experience
    • Comfortable and likes Linux. Probably prefers CentOS
    • Can work with network configuration, but does not own network
    • Has used VMware
  • Charlie: CIO
    • Concerned about time to market and ROI
    • Is working on commercial offering based on OpenStack
  • Denise: Cloud Developer
    • Working on adding features to OpenStack
    • Working on services to pair w/ OpenStack
    • Comfortable with Ruby code
  • Quick: Data Center Worker
    • Can operate systems
    • In charge of rack and replacement of gear
    • Can supervise, but not create automation

 

Proof of Concept (PoC ) use cases

 

Agrees to POC

  • Charlie agrees to be in POC by signing agreements
  • Dell gathers information about shipping and PO delivery
  • Quick provides shipping information to Dell
  • Oscar downloads ISO and VMPlayer image from Dell provided site.

 

Get Equipment Setup to base

Event: The Dell equipment has just arrived.

  • Quick checks the manifest to make sure that the equipment arrived.
  • Quick racks the servers and switch following the wiring chart provided by Oscar
  • Quick follows the installation guides BIOS and Raid configuration parameters for the Admin Node
  • Quick powers up the servers to make sure all the lights blink then turns them back off
  • Oscar arrives with his laptop and the crowbar ISO
  • As per instructions, Oscar wires his laptop to the admin server and uses VMplayer to bootstrap the ISO image
  • Oscar logs into the VMPlayer image and configures base admin parameters
    • Hostname
    • networks (admin and public required)
      • admin ips
      • routers
      • masks
      • subnets
      • usable ranges (mostly for public).
    • Optional: ntp server(s)
    • Optional: forwarding nameserver(s)
    • passwords and accounts
    • Manually edits files that get downloaded.
  • System validates configuration for syntax and obvious semantic issues.
  • System clears switch config and sets port fast and lldp med configuration.
  • Oscar powers system and selects network boot (system may automatically do this out of the “box”, but can reset if need be).
  • Once the bootstrap and installation of the Ubuntu-based image is completed, Oscar disconnects his laptop from the Admin server and connects into the switch.
  • Oscar configures his laptop for DHCP to join the admin network.
  • Oscar looks at the Chef UI and verifies that it is running and he can see the Admin node in the list.
    • The Install guide will describe this first step and initial passwords.
    • The install guide will have a page describing a valid visualization of the environment.
  • Oscar powers on the next node in the system and monitors its progress in Chef.
    • The install guide will have a page describing this process.
    • The Chef status page will have the node arrive and can be monitored from there. Completion occurs when the node is “checked in”. Intermediate states can be viewed by checking the nodes state attribute.
    • Node transitions through defined flow process for discovery, bios update, bios setting, and installation of base image.
  • Once Oscar sees the node report into Chef, Oscar shows Quick how to check the system status and tells him to turn on the rest of the nodes and monitor them.
  • Quick monitors the nodes while they install. He calls Oscar when they are all in the “ready” state. Then he calls Oscar back.
  • Oscar checks their health in Nagios and Ganglia.
  • If there are any red warnings, Oscar works to fix them.

 

Install OpenStack Swift

Event: System checked out healthy from base configuration

  • Oscar logs into the Crowbar portal
  • Oscar selects swift role from role list
  • Oscar is presented with a current view of the swift deployment.
    • Which starts empty
  • Oscar asks for a proposal of swift layout
    • The UI returns a list of storage, auth, proxy, and options.
  • Oscar may take the following actions:
    • He may tweak attributes to better set deployment
      • Use admin node in swift
      • Networking options …
    • He may force a node out or into a sub-role
    • He may re-generate proposal
    • He may commit proposal
  • Oscar finishes configuration proposal and commits proposal.
  • Oscar may validate progress by watching:
    • Crowbar main screen to see that configuration has been updated.
    • Nagios to validate that services have started
    • Chef UI to see raw data..
  • Oscar checks the swift status page to validate that the swift validation tests have completed successfully.
  • If Swift validation tests fail, Oscar uses troubleshooting guide to correct problems or calls support.
    • Oscar uses re-run validation test button to see if corrective action worked.
  • Oscar is directed to Swift On-line documentation for using a swift cloud from the install guide.

 

Install OpenStack Nova

Event: System checked out healthy from base configuration

  • Oscar logs into the Crowbar portal
  • Oscar selects nova role from role list
  • Oscar is presented with a current view of the nova deployment.
    • Which starts empty
  • Oscar asks for a proposal of nova layout
    • The UI returns a list of options, and current sub-role usage (6 or 7 roles).
    • If Oscar has already configured swift, the system will automatically configure glance to use swift.
  • Oscar may take the following actions:
    • He may tweak attributes to better set deployment
      • Use admin node in nova
      • Networking options …
    • He may force a node out or into a sub-role
    • He may re-generate proposal
    • He may commit proposal
  • Oscar finishes configuration proposal and commits proposal.
  • Oscar may validate progress by watching:
    • Crowbar main screen to see that configuration has been updated.
    • Nagios to validate that services have started
    • Chef UI to see raw data..
  • Oscar checks the nova status page to validate that the nova validation tests have completed successfully.
  • If nova validation tests fail, Oscar uses troubleshooting guide to correct problems or calls support.
    • Oscar uses re-run validation test button to see if corrective action worked.
  • Oscar is directed to Nova On-line documentation for using a nova cloud from the install guide.

 

Pilot and Beyond Use Cases

 

Unattended refresh of system

This is a special case, for Denise.

  • Denise is making daily changes to OpenStack’s code base and needed to test it. She has committed changes to their git code repository and started the automated build process
  • The system automatically receives that latest code and copies it to the admin server
  • A job on admin server sees there is new code resets all the work nodes to “uninstalled” and reboots them.
  • Crowbar reimages and reinstalls the images based on its cookbooks
  • Crowbar executes the test suites against OpenStack when the install completes
  • Denise reviews the test suite report in the morning.

 

Integrate into existing management

Event: System has passed lab inspection, is about to be connected into the corporate network (or hosting data center)

  • Charlie calls Oscar to find out when PoC will start moving into production
  • Oscar realizes that he must change from Nagios to BMC on all the nodes or they will be black listed on the network.
  • Oscar realizes that he needs to update the SSH certificates on the nodes so they can be access via remote. He also has to change the accounts that have root access.
  • Option 1: Reinstall.
    • Oscar updates the Chef recipes to remove Nagios and add BMC, copy the cert and configure the accounts.
    • Oscar sets all the nodes to “uninstalled” and reimages the system.
    • Repeat above step until system is configured correctly
  • Option 2: Update Recipes
    • Oscar updates the Chef recipes to remove Nagios and add BMC, copy the cert and configure the accounts.
    • Oscar runs the Chef scripts and inspects one of the nodes to see if the changes were made

 

Implementation

We are offering Crowbar as a starting point. It is an extension of Opscode Chef Server that provides the state machine for phases 1 and 2. Both code bases are Apache 2

Test/Demo Plan

TBD

OpenStack Design Conference Observations (plus IPv6 thread)

I’m not going to post OpenStack full conference summary because I spent more time talking 1 on 1 with partners and customers than participating in sessions.  Other members of the Dell team (@galthaus) did spend more time (I’ll see if he’ll post his notes).

I did lead an IPv6 unconference and those notes are below.

Overall, my observations from the conference are:

  • A constantly level of healthy debate.  For OpenStack to thrive, the community must be able to disagree, discuss and reach consensus.   I saw that going in nearly every session and hallway.  There were some pitched battles with forks and branches but no injuries.
  • Lots of adopters.  For a project that’s months old, there were lots of companies that were making plans to use OpenStack in some way.
  • Everyone was in a rush.  There’s been something of a log jam for decision making because the market is changing so fast companies seem to delay committing waiting for the “next big thing.”
  • Service Providers and implementers were out in force.
  • IPv6 is interesting to a limited audience, but consistently injected.

While IPv6 deserves more coverage here, I thought it would be worthwhile to at least preserve my notes/tweets from the IPv6 unconference discussion (To IP or not to IPv6? That will be the question.) at the OpenStack Design Summit.

NOTE: My tweets for this topic are notes, not my own experience/opinions

  • RT @opnstk_com_mgr #openstack unconference in camino real today < #IPv6 session going now – good size crowd
  • #NTT has IPv6 for VMs and tests for IPv6. If you set the mac, then you will know what the address will be.
  • it will be helpful to break out VMs to multiple networks – could have a VM on both IPv6 & IPv4
  • @zehicle @sjensen1850 (Dell) if IPv6 100% then may break infrastructure products – inside, easier to stay v4
    • you don’t want to paint yourself into a corner – IPv6 should not become your major feature requirement
  • typing IPv6 address not that hard to remember. DNS helps, but not required if you want to get to machines.
  • using IPv6 not hard – issue is the policy to do it. Until it’s forced. We need to find a path for DUAL operation.
  • chicken/egg problem. Our primary job is to make sure it works and is easy to adopt.
    • we are missing information on what options we have for transforms
  • where is the responsibility to do the translation? floating IP scheme needs to be worked out. IPv6 can make this easier.
  • idea, IPv6 should be the default. Fill gap with IPv4 as a Service? Floating needs NAT – v4aaS is LB/Proxy
  • unconference session was great! Good participation and ideas. Lots of opinions.

We had a hallway conversation after the unconference about what would force the switch.  In a character, it’s $.

Votes for IPv6 during the keynote (tweet: I’d like to hear from audience here if that’s important to them. RT to vote).  Retweeters:

Modularizing Crowbar via Barclamps – Dell prepares to open source our #OpenStack installer

My team at Dell is working diligently to release Crowbar (Apache 2) to the community.

  • We have ramped up our team size (Andi Abes was spotted recently posting on the Swift list).
  • We are collaborating with partners like Rackspace, Opscode and Citrix
  • We brought in UI expertise (Jon Roberts) to improve usability and polish.
  • We are making sure that the code is integrated with our Dell OpenStack Solution (DOSS).
  • We are lining up customers for real field trials.

The single most critical aspect of Crowbar involves a recent architectural change by Greg Althaus to make Crowbar much more modular.  He dubbed the modules “barclamps” because they are used to attach new capabilities into the system.  For example, we include barclamps for DNS, discovery, Nova, Swift, Nagios, Gangalia, and BIOS config.  Users select which combination to use based on their deployment objectives.

In the Crowbar architecture, nearly every capability of the system is expressed as a barclamp.  This means that the code base can be expanded and updated modularly.  We feel that this pattern is essential to community involvement.

For example, another hardware vendor can add a barclamp that does the BIOS configuration for their specific equipment (yes! that is our intent).  While many barclamps will be included with the open source release to install open source components, we anticipate that other barclamps will be only available with licensed products or in limited distribution.

A barclamp is like a cloud menu planner: it evaluates the whole environment and proposes configurations, roles, and recipes that fit your infrastructure.  If you like the menu, then it tells Chef to start cooking.

Barclamps complement the “PXE state machine” aspect of Crowbar by providing logic Crowbar evaluates as the servers reach deployment states.  These states are completely configurable via the provisioner barclamp; consequently, Crowbar users can choose to change order of operations.  They can also add barclamps and easily incorporate them into their workflow where needed.

Barclamps take the form of a Rails controller that inherits from the barclamp superclass.  The superclass provides the basic REST verbs that each barclamp must service while the child class implements the logic to create a “proposal” leveraging the wealth of information in Chef.  Proposals are JSON collections that include configuration data needed for the deployment recipes and a mapping of nodes into roles.

Users are able to review and edit proposals (which are versioned) before asking Crowbar to implement the proposal in Chef.  The proposal is implemented by assigning the nodes into the proposed roles and allowing Chef to work it’s magic.

Users can operate barcamps in parallel.  In fact, most of our barclamps are designed to operate in conjunction.

Reminder: It is vital to understand that Crowbar is not a stand-alone utility.  It is coupled to Chef Server for deployment and data storage.  Our objective was to leverage the outstanding capabilities and community support for Chef as much as possible.

We’re excited about this architecture addition to Crowbar and encourage you to think about barclamps that would be helpful to your cloud deployment.

The unexpected openness of OpenStack: why it’s important to learn from others’ operations experience.

During the OpenStack Design Conference, Forrester’s James Staten (@Staten7) raved about OpenStack’s transparency compared to AWS.  Within the enclave of OpenStack fan boys supports (Dell alone sent >14 people to the summit), his post drew a considerable attention but did little to really further the value proposition.

“Open deployments” are a much more significant value to implementors than transparency from open source code.

For any technology solution, there are significant challenges that will only be understood when the system is under stress.  In some cases, these challenges are code defects; however, many will be related to configuration and deployment choices that are site specific.  It is correcting these issues that result in design patterns and practices that create a robust infrastructure; consequently, the process of hardening a solution is critical to its ultimate stability and success.

When a solution, like AWS, is deployed and managed by a single entity, it is extremely rare for operational lessons learned and best practices to make it to the larger community.  Amazon’s recent post mortem is a welcome exception.   This is not a bad thing (Roman Stanek’s contrasting point), it is just the reality of a proprietary cloud.  AWS operates as a black box and I don’t believe that Amazon’s operational experience would be relevant to others unless they were also operationally transparent.

While it makes business sense to remain operationally opaque, service providers lose the benefit of external lessons learned when there is no community working in parallel with them.

OpenStack’s community has an opportunity to iterate on CloudOps patterns and practices at a dramatically faster rate than any single provider.  This creates distinct value for OpenStack adopters because they can shorten or eliminate their own challenges because other adopters will have the same pains and benefit from the same fixes.

It is critical to understand that the benefit is conferred to both the party sharing the problem (they get advice and support) and the party lending assistance (they avoid the problem).  This is distinctly different from proprietary clouds where sharing is likely to cause embarrassment  unlikely to create helpful outcomes.

I am not advocating that all OpenStack deployments be the same or follow a prescriptive patterns. 

I believe that each installation will be unique in some way; however, there will  be enough commonalities and shared code to make sharing worthwhile.  This is especially true for adopters who start with tools like Crowbar that leverage community based Chef Recipes and automating scripts.  Tools that encourage automation and shared scripts help accelerate the establishment of robust deployment patterns and practices.

Ultimately, the ability to collaborate on cloud operation practice does more to strengthen OpenStack than developers, code reviews or corporate endorsements.

OpenStack “Operating Hyperscale Clouds” preso given at Design Summit on 4/27.

Today, I gave the a presentation that is the sequel to the “bootstrapping the hyperscale clouds.”  The topic addresses concerns raised by the BlackOps concept but brings up history and perspectives that drive Dell’s Solution focus for OpenStack.

You can view in slide share or here’s a PDF of the preso: Operating the Hyperscale Cloud 4-27

I tend to favor pictures over text, so you may get more when OpenStack posts the video [link pending].

The talk is about our journey to build an OpenStack solution starting from our drive to build a cloud installer (Crowbar).  The critical learning is that success for clouds is the adoption of an “operational model” (aka Cloud Ops) that creates cloud agility.

Ultimately, the presentation is the basis for another white paper.

Appending:

Rethinking Play and Work: gaming is good for us (discuss in ATX, Dell, Twitter?)

Brad Szollose’s Liquid Leadership piqued my interest in the idea that gaming teaches critical job skills for the information age.  This is a significant paradigm shift in how we learn, share and collaborate to solve problems together.

At first, I thought “games = skillz” was nonsense until I looked more carefully at what my kids are doing in games.

When my kids are gaming they are doing things that adults would classify as serious work:

  • Designing buildings and creating machines that work within their environment
  • Hosting communities and enforcing discipline within the group
  • Recruiting talent to collaborate on shared projects
  • Writing programs that improve their productivity
  • Solving challenging problems under demanding time pressures
  • Learning to perseverance through multiple trials and iterative learning
  • Memorizing complex sequences and facts

They seek out these challenges because they are interesting, social and fun.

Is playing collaborative Portal 2 (which totally rocks) with my 13 year old worse than a nice game of chess?  I think it may be better!  We worked side-by-side on puzzles, enjoyed victories together, and left with a shared experience.

On the flip side, I’ve observed that it takes my kids a while to “come back down” after they play games.  They seem more likely to be impatient, rude and argumentative immediately after playing.  This effect definitely varies depending on the game.

I don’t pretend that all games and gaming has medicinal benefits; rather that we need to rethink how we look at games.  This is the main theme from McGonigal’s Reality is Broken (link below).  I’m just at the beginning and my virtual high lighter is running out of ink!  Here are some of her observations that she supports with research and data:

  • Gaming provides an evolutionary advantage
  • The majority of US citizens are gamers
  • Gaming teaches flow (state of heightened awareness that is essential to creativity and health)
  • Gaming drives UI innovation (really from Szollose) (yeah, and so does the porn industry)

If you’re interested in discussing this more, then please read one of the books listed below and choose another in the field.

Please feel free to post additional suggestions for titles as comments!

If you’re interested, let me know by commenting to tweeting – I’ll post our meeting times here in the future.

Note: I do not consider myself to be a “gamer.”  Although I greatly enjoy games, my play is irregular.  I suspect this is because I can achieve flow from my normal daily activities (programming, writing, running).

BlackOps: 7 tenants for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale

Traditional IT Ops

In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list just behind world dominance peace.  Basically, a cloud taxonomy is layer cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch.  Unfortunately, our clock-like layer cake has evolved to into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.

The key to de-spaghettifying our cloud taxomony was to realize that clouds have two distinct sides: an external well-known API and internal “black box” operations.  Each side has different objectives that together create an elastic, scalable cloud.

The objective of the API side is to provide the smallest usable “surface area” for cloud consumers.  Surface area describes the scope of the interface that is published to the users.  The smaller the area, the easier it is for users to comprehend and harder it is for them to break.  Amazon’s EC2 & S3 APIs set the standards for small surface area design and spawned a huge cloud ecosystem.

Hyperscale Cloud (APIs!)

To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem.  The cloud ecosystem exists primarily beyond the API of the cloud.  It provides users with flexible options that address their specific use cases.  Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.

The objective of the internal side of clouds is to create a prefect black box to give API users the illusion of a perfectly performing, strictly partitioned and totally elastic resource pool.  To the consumer, it does should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it does matter a great deal to the cloud operator. 

Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps). 

Cloud APIs spawn Ecosystems

The BlackOps challenge is that clouds cannot wait until all of the answers are known because issues (or solutions) to scale architecture are difficult to predict in advance.  Even worse, taking the time to solve them in advance likely means that you will miss the market.

Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale.  To cope with constant changes, BlackOps live by seven tenants that help manage their infrastructure efficiently in a dynamic environment.

  1. Operational ownership – don’t wait for all the king’s horses and consultants to put your back together again (but asking for help is OK).
  2. Simple APIs – reduce the ways that consumers can stress the system making the scale challenges more predictable.
  3. Efficiency based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
  4. Automated processes & verification – ensures that changes and fixes can propagate at scale while errors are self-correcting.
  5. Frequent incremental rolling adjustments – prevents the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”)
  6. Passion for operational simplicity – at hyperscale, technical debt compounds very quickly.  Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
  7. Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work.  You must catch problems early before they infect your scale infrastructure.  The only thing more frustrating than fixing a problem at scale, if fixing the same problem multiple times.

It’s no surprise that these are exactly the Agile & Lean principles.  The pace of change of cloud is so fast and fluid, that BlackOps must use an operational model that embraces iterative and rolling deployment.

Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.

I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies.  BlackOps is a highly disciplined process based on well understood principles from just-in-time (JIT) and lean manufacturing.  Best of all, they are fast to market, able to deliver high quality and capable of responding to change.

Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project.  I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.

OpenStack is ready, but are you? Get some operational cloud mojo and get started!

NOTE: This post is not intended as an endorsement of the company “CloudOps.”

This week, I’ve working to describe the “cloud operation model” or “cloud ops” to Dell internal and external customers.  CloudOps is really just DevOps but packaged more broadly to help explain how hardware, software, and operations interact.  The critical concept I’m trying to convey is that we’re not spending enough time working with customers on operations.

Running a cloud is driven by operational processes and choices.

Back in 2001 when virtualization was a shiny new thing, no one had any idea on how to operate a virtualized data center.  My company (now owned by Quest) struggled to win deals outside of our own data center because our customers did not know how to operate virtualized hardware.  Ultimately, VMware created the SAN based data center consolidation pattern and sales exploded.  That solution is much more about operations than hardware (SANs) or software (ESX).

So here in 2011, we have the same challenge with cloud.  (The majority of) Dell’s customers do not know how to operate a hyperscale data center because there is no commonly accepted pattern.  That’s where the cloud operation model comes into play – we have cloud proven hardware and cloud proven software, but we had been missing a description of the operational cloud mojo.

My team’s first OpenStack project started as a cloud installer (aka Crowbar), but we’ve learned that it is more fundamental than that.  To achieve “4 hours to cloud,” our approach embraced the DevOps philosophy that deployments should be automated, dynamic and repeatable.  Our choice to extend Opscode’s Chef Server allowed us to bring in more than just a software capability: it delivers a core operational foundation that enables customers to manage their data center at significant scale.

We had to deliver a CloudOps Foundation because Cloud is not a static configuration that can be distilled in a 10 page white paper!

Cloud scale requires an Operations Foundation that can respond and react because deployed software and infrastructure is constantly evolving and adapting.  I do not mean moving around assets like VMs.  I am talking about something that closer to refactoring code and writing software features.   Like the applications that run on the cloud, we need to recognize that cloud is a moving target and build systems that can handle that.

We’re delivering OpenStack using an operational platform that can respond to the code as it changes and expands.  There is more than enough stable code and proven capability in OpenStack for our customers with CloudOps mojo to start building their operational foundation and to create commercial public clouds.  These first providers are not waiting for a “final release” of OpenStack where it’s suddenly “production ready.”

The beauty of an open source cloud with an active community is that it will be constantly improving.

Some may be hoping that in 5 years we will have established patterns for hyperscale; however, I think those days are past.  Instead, we’ll see tools that accelerate infrastructure agility.  We already have those for public cloud deployments and now it’s time to bring those into the data center itself.  But that is the subject for another post (BlackOps).