Open Source Cloud Bootstrapping Revisited

At the last OpenStack design conference, Greg Althaus and I presented updates (presentation here) that we were making to a November 2010 cloud architecture white paper.

The revised “Bootstrapping Open Source Clouds” white paper has been out for a few months so I thought it was past time to throw out a link.

I’m really pleased about this update because it reflects the real-world experience my team has gained working with customers and partners on OpenStack (and Hadoop) deployments.

Executive Summary

Bringing a cloud infrastructure online can be a daunting bootstrapping challenge. Before hanging out a shingle as a private or public cloud service provider, you must select a platform, acquire hardware, configure your network, set up operations services, and integrate it all to work well together. That is a lot of moving parts before you have even installed a sellable application. This white paper walks you through the decision process to get started with an open source cloud infrastructure based on OpenStack™ and Dell™ PowerEdge™ C servers. At the end, you’ll be ready to design your own trial system that will serve as the foundation of your hyperscale cloud.

2011 Revision Notes

In the year since the original publication of this white paper, we worked with many customers building OpenStack clouds. These clouds range in size from small six-node lab systems to larger production deployments. Based on these experiences, we updated this white paper to reflect lessons learned.

OpenStack Essex Deploy Day: First Steps to Production

On March 8th, 70 people from around the world gathered on the Crowbar IM channel to begin building a production-grade OpenStack Essex deployment. The event was coordinated as meet-ups by the Dell OpenStack/Crowbar team (my team) in two physical locations: the Nokia offices in Boston and the TechRanch in Austin.

My objective was to enable the community to begin collaborating on Essex deployment. On that goal, we succeeded beyond my expectations.

IMHO, the top challenge for OpenStack Essex is to build a community of deployment advocates. We have a strong and dynamic development community adding features to the project. Now it is time for us to build a comparable community of deployers. By providing a repeatable, shared and open foundation for OpenStack deployments, we create a baseline that allows collaboration and co-development. Not only must we make deployments easy and predictable, we must also ensure they are scalable and production ready.

Having solid open production deployment infrastructure drives OpenStack adoption.

Our goal on the 8th was not to deliver finished deployments; it was to start community collaboration on Essex deployment. To ensure that we could focus on getting to an Essex baseline, our team invested substantial time before the event to make sure that participants had a working Essex reference deployment.

By the nature of my team’s event leadership and our approach to OpenStack, the event was decidedly Crowbar focused. I feel like this is an acceptable compromise because Crowbar is open and provides a repeatable foundation. If everyone has the same foundation then we can focus on the truly critical challenges of ensuring consistent OpenStack deployments. Even using Crowbar, we waste a lot of time trying to figure out the differences between configurations. Lack of baseline consistency seriously impedes collaboration.

The fastest way to collaborate on OpenStack deployment is to have a reference deployment as a foundation.

Success By The Numbers

This was a truly international community collaborative event. Here are some of the companies that participated:

Dell (sponsor), Nokia (sponsor), Rackspace, Opscode, Canonical, Fedora, Mirantis, Morphlabs, Nicira, Enstratus, Deutsche Telekom Innovation Laboratories, Purdue University, Orbital Software Solutions, XepCloud and others.

PLEASE COMMENT here if I missed your company and I will add it to the list.

On the day of the event, we collected the following statistics:

  • 70 people on Skype IM channel (it’s not too late to join by pinging DellCrowbar with “Essex barclamps”).
  • 14+ companies
  • 2 physical sites with 10-15 people at each
  • 4-fold increase in traffic on the Crowbar GitHub repo, to 813 hits.
  • 66 downloads of the Deploy day ISO
  • 8 videos captured from deploy day sessions.
  • World-wide participation

For over 70 people to spend a day together at this early stage in deployment is a truly impressive indication of the excitement that is building around OpenStack.

Improvements for Next Deploy Day

This was the first time that Andi Abes (Boston event lead), Rob Hirschfeld (Austin event lead), or Jean-Marie Martini (Dell event lead) had coordinated an event like this. We owe much of the success to efforts by Greg Althaus, Victor Lowther and the Canonical 12.04/Essex team before the event. Also, having physical sites was very helpful.

We are planning to do another event, so we are carefully tracking ways to improve.

Here are some issues we are tracking.

  • Issues with setting up a screen and voice share that could handle 70 people.
  • Lack of testing & documentation on Crowbar meant too much time was focused on Crowbar itself.
  • Connectivity issues with distributed voice.
  • Should have started with DevStack as a baseline.
  • More suggestions are welcome in the comments!

Thank you!

I want to thank everyone who participated in making this event a huge success!

OpenStack Essex Deploy Day 3/8 – Get involved and install with us

My team at Dell has been avidly tracking the ups, downs, and breakthroughs of the OpenStack Essex release.  While we still have a few milestones before the release is cut, we felt like the E4 release was a good time to begin the work on Essex deployment.  Of course, the final deployment scripts will need substantial baking time after the final release on April 5th; however, getting deployments working will help influence the quality efforts and expand the base of possible testers.

To rally behind Essex Deployments, we are hosting a public work day on Thursday March 8th.

For this work day, we’ll be hosting all-day community events online and physically in Austin and Boston.  We are getting commitments from other Dell teams, partners and customers around the world to collaborate.  The day is promising to deliver some real Essex excitement.

The purpose of these events is to deliver the core of a working OpenStack Essex deployment.  While my team is primarily focused on deploys via Crowbar/Chef, we are encouraging anyone interested in laying down OpenStack Essex to participate.  We will be actively engaged on the OpenStack IRC and mailing lists too.

We have experts in OpenStack, Chef, Crowbar and Operating Systems (Canonical, SUSE, and RHEL) engaged in these activities.

This is a great time to start learning about OpenStack (or Crowbar) with hands-on work.  We are investing substantial upfront time (check out the Crowbar wiki for details) to ensure that there is a working base OpenStack Essex deploy on Ubuntu 12.04 beta.  This deploy includes the Crowbar 1.3 beta with some new features specifically designed to make testing faster and easier than ever before.

In the next few days, I’ll cut a 12.04 ISO and OpenStack Barclamp TARs as the basis for the deploy day event.  I’ll also be creating videos that help you quickly get a test lab up and running.  Visit the wiki or meetup sites to register and stay tuned for details!

Austin OpenStack Meetup: Keystone & Knife (2/20 notes via Greg Althaus)

I could not make it to the recent Austin OpenStack Meetup, but Greg Althaus generously let me post his notes from the event.

Background

Matt Ray talks about Chef

Matt Ray from Opscode presented some of the work with Chef and OpenStack. He talked about the three main Chef cookbook repos floating around: Anso’s original cookbook set, which is the basis for the Crowbar cookbooks (the second set), and finally the emerging set of cookbooks in OpenStack proper. The third one is the most interesting and is what he plans to keep working on to make into the public OpenStack cookbooks. These are an amalgamation of Smokestack, RCB, Anso improvements, and Crowbar’s.

He then demoed his knife plugin (slideshare) to build OpenStack virtual servers using the OpenStack API. This is nice and works against TryStack.org (previously “Free Cloud”) and RCB’s demo cloud. All of that is on his GitHub repo with instructions on how to build and use it. Matt and I talked about trying to get that into our Crowbar distro.
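
For readers who want to see what such a plugin does under the hood, here is a minimal Python sketch (not Matt’s knife plugin code) of a “create server” call against the OpenStack Compute API. The endpoint URL, token, and image/flavor IDs are placeholders you would swap for values from your own cloud (TryStack, an RCB demo cloud, or a Crowbar-built lab).

```python
# Minimal sketch only: boot a server by POSTing to the OpenStack Compute API.
# The endpoint, token, and image/flavor references below are placeholders.
import json
import requests

COMPUTE_ENDPOINT = "http://cloud.example.com:8774/v1.1/my-tenant"  # placeholder URL
AUTH_TOKEN = "replace-with-a-valid-token"                          # placeholder token


def boot_server(name, image_ref, flavor_ref):
    """Request a new instance, roughly what 'knife openstack server create' wraps."""
    body = {"server": {"name": name, "imageRef": image_ref, "flavorRef": flavor_ref}}
    resp = requests.post(
        COMPUTE_ENDPOINT + "/servers",
        headers={"X-Auth-Token": AUTH_TOKEN, "Content-Type": "application/json"},
        data=json.dumps(body),
    )
    resp.raise_for_status()
    return resp.json()["server"]["id"]


if __name__ == "__main__":
    print(boot_server("test-node-1", "image-uuid-placeholder", "1"))
```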

There were some questions about flow and the choice of the OpenStack API versus the Amazon EC2 API, because there was already an EC2 knife plugin set.

Ziad Sawalha talks about Keystone

Ziad Sawalha is the PTL (Project Technical Lead) for Keystone. He works for Rackspace out of San Antonio. He drove up for the meeting.

He split his talk into two pieces, Incubation Process and Keystone Overview. He asked who was interested in what and focused his talk more towards overview than incubation.

Some key take-aways:

  • Keystone comes from Rackspace’s strong, flexible, and scalable API. It started as a known quantity from his perspective.
  • The community trusted nothing his team produced from an API perspective.
  • The community is Python or nothing.
    • His team was ignored until they had a Python prototype implementing the API.
    • At that point, comments on the API came in.
  • Churn in the API caused problems with implementation and expectations around the close of Diablo.
    • Because comments were late, changes occurred.
    • The official implementation lagged and was slow to arrive.
  • The API has been stable since Diablo final, but the code is changing. That is good and shows the strength of the API.
  • Side note from Greg: Keystone represents to me the power of API over code. You can have innovation around the implementation as long as all the implementations share a common foundation to build on, which is the API specification. The replacement of Keystone with the Keystone Light code base is an example of this. The only reason this was possible is that the API was sound and documented.  (Rob’s post on this)

Ziad spent the rest of his time talking about the workflow of Keystone and covering its API points:

  • Client to Keystone, Keystone to Client for initial auth token
  • Client to middleware API, which provides a front for the services.
  • Middleware to Keystone to verify and establish identity.
  • Middleware to Service to pass identity

There were not many details other than flow and flexibility. He stressed that the API design separates protocol from actions and data at all the layers. This allows for future variations and innovations while maintaining the APIs.
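
To make that flow concrete, here is a minimal Python sketch of the client side of it, assuming a Keystone v2.0-style token endpoint. The URLs, credentials, and tenant name are placeholders, and the middleware-to-Keystone validation happens on the service side rather than in this snippet.

```python
# Minimal client-side sketch of the auth flow described above.
# All URLs and credentials are placeholders for your own deployment.
import json
import requests

KEYSTONE_URL = "http://keystone.example.com:5000/v2.0"       # placeholder
COMPUTE_URL = "http://nova.example.com:8774/v1.1/my-tenant"  # placeholder


def get_token(username, password, tenant):
    """Client -> Keystone: trade credentials for an auth token."""
    body = {"auth": {"passwordCredentials": {"username": username, "password": password},
                     "tenantName": tenant}}
    resp = requests.post(KEYSTONE_URL + "/tokens",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(body))
    resp.raise_for_status()
    return resp.json()["access"]["token"]["id"]


def list_servers(token):
    """Client -> service middleware: the middleware validates the token with
    Keystone and passes the established identity on to the service."""
    resp = requests.get(COMPUTE_URL + "/servers", headers={"X-Auth-Token": token})
    resp.raise_for_status()
    return resp.json()["servers"]


token = get_token("demo", "secret", "demo-tenant")
print(list_servers(token))
```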

Ziad talked about the state of Essex.

  • Planned
    • RBAC (aka Role Based Access Control)
    • Stability
    • Many backends
  • Actual
    • Code replacement Keystone Light
    • Stability
    • LDAP backend
    • SQL backend

Folsom work:

  • RBAC
  • Stability
  • AD backend
  • Another backend
  • Federation was planned but will most likely be pushed to G
    • Federation is the ability for multiple independent Keystones to operate (bursting use case)
    • Dependent upon two other federation components (networking and billing/metering)

CloudOps white paper explains “cloud is always ready, never finished”

I don’t usually call out my credentials, but knowing that I have a Master’s in Industrial Engineering helps (partially) explain my passion for process as being essential to successful software delivery. One of my favorite authors, Mary Poppendieck, describes undeployed code as perishable inventory that you need to get to market before it loses value. The big lessons (low inventory, high quality, system perspective) from Lean manufacturing translate directly into software and, lately, into operations as DevOps.

What we have observed from delivering our own cloud products, and working with customers on theirs, is that the operations process for deployment is as important as the software and hardware. It is simply not acceptable for us to market clouds without a compelling model for maintaining the solution into the future. Clouds are simply moving too fast to be delivered without a continuous delivery story.

This white paper [link here!] has been available since the OpenStack conference, but not linked to the rest of our OpenStack or Crowbar content.

Austin OpenStack Meetup (January Minutes) + OpenStack Foundation Web Cast!

Sorry for the brevity… At the last Austin OpenStack meetup, we had >60 stackers!  Some from as far away as Portland and Boston (as in Oregon and Massachusetts).

Notes:

  • SUSE introduced their OpenStack beta and talked about SUSE Studio, which can deploy images against the OpenStack APIs.
  • I showed off DevStack.org code that can set up the trunk of OpenStack (now Essex) in about 10 minutes on a single node.  Great for developers!
  • I showed an OpenStack Diablo Final deployment from Crowbar.  I focused mainly on Dashboard and used our reference architecture (see below) as an illustration of the many parts.
  • Matt Ray suggested everyone watch the webcasts about the OpenStack Foundation (Thurs 6pm central  & Friday 9am central)
  • We planned the next few meetups.
    • For February, we’ll talk about Swift and Dashboard.
    • For March, we’ll talk about Essex and DevStack to prep for the next design summit (in SF).
    • For April, we’ll debrief the conference

Thank you SUSE and Dell (my employer) for sponsoring!  The next meetup is sponsored by Canonical.

OpenStack Deployments Abound at Austin Meetup (12/9)

I was very impressed by the quality of discussion at the Deployment topic meeting for the Austin OpenStack Meetup (#OSATX). Of the 45ish people attending, we had representatives from at least 6 different OpenStack deployments (Dell, HP, AT&T, Rackspace Internal, Rackspace Cloud Builders, Opscode Chef)!  Considering the scope of those deployments (several are aiming at 1000+ nodes), that’s a truly impressive accomplishment for such a young project.

Even with the depth of the discussion (notes below), we did not go into details on how individual OpenStack components are connected together.  The image my team at Dell uses is included below.  I also recommend reviewing Rackspace’s published reference architecture.

Figure 1 Diablo Software Architecture. Source Dell/OpenStack (cc w/ attribution)

Notes

Our deployment discussion was a round table so it is difficult to link statements back to individuals, but I was able to track companies (mostly).

  • HP
    • picked Ubuntu & KVM because they were the most vetted. They are also using Chef for deployment.
    • running Diablo 2, moving to Diablo Final & a flat network model. The network controller is a bottleneck. Their biggest scale issue is RabbitMQ.
    • is creating their own Nova Volume plugin for their block storage.
    • At this point, scale limits are due to simultaneous loading rather than total number of nodes.
    • The Nova node image cache can get corrupted without any notification or way to force a refresh – this defect is being addressed in Essex.
    • has set up availability zones as completely independent (500 node) systems. They expect to converge them in the future.
  • Rackspace
    • is using the latest Ubuntu. Always stays current.
    • using Puppet to set up their cloud.
    • They are expecting to go live on Essex and are keeping their deployment on the Essex trunk. This is causing some extra work but they expect it to pay back by allowing them to get to production on Essex faster.
    • Deploying on XenServer
    • “Devs move fast, Ops not so much.”  Trying to not get behind.
  • Rackspace Cloud Builders (RCB) runs major releases through an automated test suite. The verified releases are published to https://github.com/cloudbuilders (note: Crowbar is pulling our OpenStack bits from this repo).
  • Dell commented that our customers are using Crowbar primarily for pilots – they are learning how to use OpenStack
    • Said they have >10 customer deployments pending
    • AT&T is using the open source version of Crowbar
    • Keystone and Dashboard were considered essential additions to Diablo
  • Hypervisors
    • KVM is considered the top one for now
    • Libvirt (which is used with KVM) also supports LXC, which people found to be interesting
    • XenServer via XAPI is also popular
    • Not so much activity on ESX & Hyper-V
    • We talked about why some hypervisors are more popular – it’s about the node agent architecture of OpenStack.
  • Storage
    • NetApp via Nova Volume appears to be a popular block storage option
  • Keystone / Dashboard
    • Customers want both together
    • Including Keystone/Dashboard was considered essential in Diablo. It was part of the reason why Diablo Final was delayed.
    • HP is not using dashboard
  • OpenStack API
    • Members of the audience commented that we need to deprecate the EC2 APIs (because it does not help OpenStack long-term to maintain the EC2 APIs over its own).  [1/5 Note: THIS IS NOT OFFICIAL POLICY, it is a reflection of what was discussed]
    • HP started on the EC2 API but is moving to the OpenStack API

Meetup Housekeeping

  • Next meeting is Tuesday 1/10 and sponsored by SUSE (note: Tuesday is just for this January).  Topic TBD.
  • We’ve got sponsors for the next SIX meetups! Thanks to Dell (my employer), Rackspace, HP, SUSE, Canonical and PuppetLabs for sponsoring.
  • We discussed topics for the next meetings (see the post image). We’re going to throw it to a vote for guidance.
  • The OSATX tag is also being used by Occupy San Antonio.  Enjoy the cross chatter!

Extending Chef’s reach: “Managed Nodes” for External Entities.

Note: this post is very technical and relates to detailed Chef design patterns used by Crowbar. I apologize in advance for the post’s opacity. Just unleash your inner DevOps geek and read on. I promise you’ll find some gems.

At the Opscode Community Summit, Dell’s primary focus was creating an “External Entity” or “Managed Node” model. Matt Ray prefers the term “managed node” so I’ll defer to that name for now. This model is needed for Crowbar to manage system components that cannot run an agent such as a network switch, blade chassis, IP power distribution unit (PDU), and a SAN array. The concept for a managed node is that there is an instance of the chef-client agent that can act as a delegate for the external entity. We’ve been reluctant to call it a “proxy” because that term is so overloaded.

My Crowbar vision is to manage an end-to-end cloud application life-cycle. This starts from power and network connections to hardware RAID and BIOS then up to the services that are installed on the node and ultimately reaches up to applications installed in VMs on those nodes.

Our design goal is that you can control a managed node with the same Chef semantics that we already use. For example, adding a Network proposal role to the Switch managed node will force the agent to update its configuration during the next chef-client run. During the run, the managed node will see that the network proposal has several VLANs configured in its attributes. The node will then update the actual switch entity to match the attributes.
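
As an illustration only (the real delegate is a chef-client run driven by Crowbar roles and barclamps, not a standalone script), here is a Python sketch of that reconciliation idea: the managed node reads the VLANs declared in its attributes and converges the switch toward them. The SwitchDriver class and the attribute layout are hypothetical stand-ins.

```python
# Illustrative sketch: a managed-node delegate converging a switch toward the
# VLANs declared in its Chef attributes. SwitchDriver and the attribute keys
# are hypothetical; a real delegate would talk to actual switch hardware.
class SwitchDriver:
    """Hypothetical driver for the physical switch (SSH, serial, or SNMP)."""

    def __init__(self):
        self._vlans = {10}  # pretend the switch already has VLAN 10

    def current_vlans(self):
        return set(self._vlans)

    def create_vlan(self, vlan_id):
        print("creating VLAN %d on switch" % vlan_id)
        self._vlans.add(vlan_id)

    def delete_vlan(self, vlan_id):
        print("removing VLAN %d from switch" % vlan_id)
        self._vlans.discard(vlan_id)


def reconcile(node_attributes, switch):
    """Make the external entity match the configuration the node 'is'."""
    desired = set(node_attributes["network"]["vlans"])
    actual = switch.current_vlans()
    for vlan in sorted(desired - actual):
        switch.create_vlan(vlan)
    for vlan in sorted(actual - desired):
        switch.delete_vlan(vlan)


# Attributes as they might arrive from a network proposal role (placeholder data).
reconcile({"network": {"vlans": [10, 200, 666]}}, SwitchDriver())
```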

Design Considerations

There are five key aspects of our managed node design. They are configuration, discovery, location, relationships, and sequence. Let’s explore each in detail.

A managed node’s configuration is different from a service or actuator pattern. The core concept of a node in Chef is that the node owns its configuration. You make changes to the node’s configuration and it’s the node’s job to manage its state to maintain that configuration. In a service pattern, the consumer manages specific requests directly. At the summit (with apologies to Bill Clinton), I described Chef configuration as telling a node what it “is” while a service provides verbs that change a node. The critical difference is that a node is expected to maintain configuration as its composition changes (e.g.: the node is now connected for VLAN 666) while a service responds to specific change requests (the node adds a tag for VLAN 666). Our goal is to maintain Chef’s configuration management concept for the external entities.

Managed nodes also have a resource discovery concept that must align with the current ohai discovery model. Like a regular node, the managed node’s data attributes reflect the state of the managed entity; consequently, we’d expect a blade chassis managed node to enumerate the blades that are included. This creates an expectation that the managed node appears to be “root” for the entity that it represents. We are also assuming that the Chef server can be trusted with the sharable discovered data. There may be cases where these assumptions do not have to be true, but we are making them for now.
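
For example, a blade chassis managed node might report discovery data shaped something like the following; every key here is invented for illustration and is not the output of any real ohai plugin.

```python
# Hypothetical discovery attributes for a blade chassis managed node, shaped
# so they could merge into normal Chef node data. All keys are invented.
chassis_attributes = {
    "chassis": {
        "management_ip": "192.168.124.30",
        "slots": {
            "1": {"occupied": True, "service_tag": "ABC123", "linked_node": "node-1"},
            "2": {"occupied": True, "service_tag": "DEF456", "linked_node": "node-2"},
            "3": {"occupied": False},
        },
    }
}
```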

Another essential element of managed nodes is that their agent location matters, because the external resource generally has restricted access. There are several examples of this requirement. Switch configuration may require a serial connection from a specific node. Blade SAN and PDU management ports are restricted to specific networks. This means that the managed node agents must run from a specific location. This location is not important to the Chef server or to the nodes’ actions against the managed node; however, it’s critical for the system when starting the managed node agent. While it’s possible for managed node agents to run on nodes that are outside the overall Chef infrastructure, our use cases make it more likely that they will run as independent processes on regular nodes. This means that we’ll have to add some relationship information for managed nodes and perhaps a barclamp to install and manage them.

All of our use cases for managed nodes have a direct physical linkage between the managed node and server nodes. For a switch, it’s the ports connected. For a chassis, it’s the blades installed. For a SAN, it’s the LUNs exposed. These links imply a hierarchical graph that is not currently modeled in Chef data – in fact, it’s completely missing and difficult to maintain. At this time, it’s not clear how we or Opscode will address this. My current expectation is that we’ll use yet more roles to capture the relationships and add some hierarchical UI elements into Crowbar to help visualize it. We’ll also need to comprehend node types because “managed nodes” are too generic in our UI context.

Finally, we have to consider the sequence of actions between managed nodes and nodes.  In all of our use cases, the steps to bring up a node require orchestration with the managed node.  Specifically, there needs to be a hand-off between the managed node and the node.  For example, installing an application that uses VLANs does not work until the switch has created the VLAN.  There are the same challenges with LUNs on a SAN and with blades in a chassis.  Crowbar provides orchestration that we can leverage, assuming we can declare the linkages.

For now, a hack to get started…

For now, we’ve started on a workable hack for managed nodes. This involves running multiple chef-clients on the admin server in their own paths & processes. We’ll also have to add yet more roles to comprehend the relationships between the managed nodes and the things that are connected to them. Watch the crowbar listserv for details!
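
A rough sketch of what that hack could look like is below; the paths, node names, and interval are invented for illustration, and the actual Crowbar mechanism may differ. Each external entity gets its own chef-client process with its own config file and node name.

```python
# Sketch: run one chef-client process per managed entity on the admin node.
# Paths and node names are invented; chef-client's --config, --node-name,
# and --interval options are standard flags.
import subprocess

MANAGED_NODES = ["switch-rack1", "chassis-rack1", "pdu-rack1"]  # hypothetical names


def start_agents():
    procs = []
    for name in MANAGED_NODES:
        cmd = [
            "chef-client",
            "--config", "/etc/chef/managed/%s/client.rb" % name,  # per-entity config
            "--node-name", name,                                   # delegate identity
            "--interval", "300",                                   # re-converge every 5 min
        ]
        procs.append(subprocess.Popen(cmd))
    return procs


if __name__ == "__main__":
    for proc in start_agents():
        proc.wait()
```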

Extra Credit

Notes on the Opscode wiki from the Crowbar & Managed Node sessions

OpenStack Seattle Meetup 11/30 Notes

We had an informal OpenStack meetup after the Opscode Summit in Seattle.

This turned out to be a major open cloud gab fest! In addition to the Dell OpenStack leads (Greg and me), we had the Nova Project Technical Lead (PTL, Vish Ishaya, @vish), HP’s Cloud Architect (Alex Howells, @nixgeek), and Opscode OpenStack cookbook master Matt Ray (@mattray). We were joined by several other Chef Summit attendees with OpenStack interest, including a pair of engineers from Spain.

We’d planned to demo using knife-openstack against the Crowbar Diablo build.  Unfortunately, knife-openstack is out of date (August 15th?!).  We need Keystone support.  Anyone up for that?

Highlights

There’s no way I can recapture everything that was said, but here are some highlights I jotted down on the way home.

  • After the miss with Keystone and the Diablo release, solving the project dependency problem is an important priority. Vish talked at length about the ambiguity challenge of Keystone being required and also incubated. He said we were not formal enough around new projects even though we had dependencies on them. In future releases, new projects (specifically, Quantum) will not be allowed to be dependencies.
  • The focus for Essex is on quality and stability. The plan is for Essex to be a long-term supported (LTS) release tied to the Ubuntu LTS. That’s putting pressure on all the projects to ensure quality, lock features early, and avoid unproven dependencies.
  • There is a lot of activity around storage and companies are creating volume plug-ins for Nova. Vish said he knew of at least four.
  • Networking has a lot of activity. Quantum is very active, but may not emerge as a core project in time for Essex. There was general agreement that Quantum is “the killer app” for OpenStack and will take cloud to the next level.  The Quantum Open vSwitch implementation is completely open source and free. Some other plugins may require proprietary hardware and/or software, but there is definitely a (very) viable and completely open source option for Quantum networking.
  • HP has some serious cloud mojo going on. Alex talked about defects they have found and submitted fixes back to core. He also hinted about some interesting storage and networking IP that’s going into their OpenStack deployment. Based on his comments, I don’t expect those to become public so I’m going to limit my observations about them here.
  • We talked about hypervisors for a while. KVM and XenServer (via XAPI) were the primary topics. We did talk about LXC & OpenVZ as popular approaches too. Vish said that some of the XenServer work is using Xen Storage Manager to manage SAN images.
  • Vish is seeing a constant rise in committers. It’s hard to judge because some committers appear to be individuals acting on behalf of teams (10 to 20 people).

Note: cross posted on the OpenStack Blog.

Reminder: 12/8 Meetup @ Austin!

Missed us in Seattle? Join us at the 12/8 OpenStack meetup in Austin, co-hosted by Dell and Rackspace.

Based on our last meetup, it appears deployment is a hot topic, so we’ll kick off with that – bring your experiences, opinions, and thoughts! We’ll also open the floor to other OpenStack topics – open technical and business discussions – no commercials please!

We’ll also talk about organizing future OpenStack meet ups! If your company is interested in sponsoring a future meetup, find Joseph George at the meetup and he can work with you on details.

Opscode Summit Recap – taking Chef & DevOps to a whole new level

Opscode Summit Agenda created by open space

I have to say that last week’s Opscode Community Summit was one of the most productive summits that I have attended. Their use of the open-space meeting format proved to be highly effective for a team of motivated people to self-organize and talk about critical topics. I especially liked the agenda negotiations (see picture for an agenda snapshot) because everyone worked to adjust session times and locations based on what other sessions were being offered. Of course, it also helped to have an unbelievable level of Chef expertise on tap.

Overall

Overall, I found the summit to be a very valuable two days; consequently, I feel some need to pay it forward with a good summary. Part of the goal was for the community to document their sessions on the event wiki (which I have done).

The roadmap sessions were of particular interest to me. In short, Opscode is converging the code bases of their three Chef products (hosted, private and open). The primary changes will be moving from CouchDB to a SQL-based DB and moving the API calls from Merb/Ruby to Erlang. They are also improving search so that we can make more fine-tuned requests that perform better and return less extraneous data.

I had a lot of great conversations. Some of the companies represented included: Monster, Oracle, HP, DTO, Opscode (of course), InfoChimps, Reactor8, and Rackspace. There were many others – overall >100 people attended!

Crowbar & Chef

Greg Althaus and I attended for Dell with a Crowbar specific agenda so my notes reflect the fact that I spent 80% of my time on sessions related to features we need and explaining what we have done with Chef.

Observations related to Crowbar’s use of Chef

  1. There is a class of “orchestration” products that have objectives similar to Crowbar’s. Ones that I remember are Cluster Chef, RunDeck, and Domino.
  2. Crowbar uses Chef in a way that is different from users who have a single application to deploy. We use roles and databags to store configuration that other users inject into their recipes. This is due to the fact that we are trying to create generic recipes that can be applied to many installations.
  3. Our heavy use of roles enables something of a cookbook service pattern. We found that this was confusing to many Chef users who rely on the UI and knife. It works for us because all of these interactions are automated by Crowbar.
  4. We picked up some smart security ideas that we’ll incorporate into future versions.

Managed Nodes / External Entities

Our primary focus was creating an “External Entity” or “Managed Node” model. Matt Ray prefers the term “managed node” so I’ll defer to that name for now. This model is needed for Crowbar to manage system components that cannot run an agent such as a network switch, blade chassis, IP power distribution unit (PDU), and a SAN array. The concept for a managed node is that there is an instance of the chef-client agent that can act as a delegate for the external entity. I had so much to say about that part of the session, I’m posting it as its own topic shortly.