Seven Cloud Success Criteria to consider before you pick a platform

From my desk at Dell, I have a unique perspective.   In addition to a constant stream of deep customer interactions about our many cloud solutions (even going back pre-OpenStack to Joyent & Eucalyptus), I have been an active advocate for OpenStack, involved in many discussions with and about CloudStack and regularly talk shop with Dell’s VIS Creator (our enterprise focused virtualization products) teams.  And, if you go back ten years to 2002, patented the concept of hybrid clouds with Dave McCrory.

Rather than offering opinions in the Cloud v. Cloud fray, I’m suggesting that cloud success means taking a system view.

Platform choice is only part of the decision: operational readiness, application types and organization culture are critical foundations before platform.

Over the last two years at Dell, I found seven points outweigh customers’ choice of platform.

  1. Running clouds requires building operational expertise both at the application and infrastructure layers.  CloudOps is real.
  2. Application architectures matter for cloud deployment because they can redefine the SLA requirements and API expectations
  3. Development community and collaboration is a significant value because sharing around open operations offers significant returns.
  4. We need to build an accelerating pace of innovation into our core operating principles
  5. There are still significant technology gaps to fill (networking & storage) and we will discover new gaps as we go
  6. We can no longer discuss public and private clouds as distinct concepts.   True hybrid clouds are not here yet, but everyone can already see their massive shadow.
  7. There is always more than one right technological answer.  Avoid analysis paralysis by making incrementally correct decisions (committing, moving forward, learning and then re-evaluating).

Open Source Cloud Bootstrapping Revisised

At the OpenStack last design conference, Greg Althaus and I presented about updates (presentation here) we were making to a Nov 2010 cloud architecture white paper.

The revised “Bootstrapping Open Source Clouds” white paper has been out for a few months so I thought it was past time to throw out a link.

I’m really pleased about this update because it reflects real world experience my team has working with customers and partners on OpenStack (and Hadoop) deployments.

Executive Summary
Bringing a cloud infrastructure online can be a daunting bootstrapping challenge. Before
hanging out a shingle as a private or public cloud service provider, you must select a platform,
acquire hardware, configure your network, set up operations services, and integrate it to work
well together. That is a lot of moving parts before you have even installed a sellable application.
This white paper walks you through the decision process to get started with an open source
cloud infrastructure based on OpenStack™ and Dell™ PowerEdge™ C servers. At the end, you’ll
be ready to design your own trial system that will serve as the foundation of your hyperscale
cloud.
2011 Revision Notes
In the year since the the original publication of this white paper, we worked with many
customers building OpenStack clouds. These clouds range in size from small six-node lab
systems to larger production deployments. Based on these experiences, we updated this white
paper to reflect lessons learned.

OpenStack Austin: What we’d like to see at the Design Summit

Last week, the OpenStack Austin user group discussed what we’d like to see at the upcoming OpenStack Design Summit. We had a strong turnout (48?!).

  1. To get the meeting started, Marc Padovani from HP (this month’s sponsor) provided some lessons learned from the HP OpenStack-Powered Cloud. While Marc noted that HP has not been able to share much of their development work on OpenStack; he was able to show performance metrics relating to a fix that HP contributed back to the OpenStack community. The defect related to the scheduler’s ability to handle load. The pre-fix data showed a climb and then a gap where the scheduler simply stopped responding. Post-fix, the performance curve is flat without any “dead zones.” (sharing data like this is what I call “open operations“)
  2. Next, I (Rob Hirschfeld) gave a brief overview of the OpenStack Essex Deploy Day (my summary) that Dell coordinated with world-wide participation. The Austin deploy day location was in the same room as the meetup so several of the OSEDD participants were still around.
  3. The meat of the meetup was a freeform discussion about what the group would like to see discussed at the Design Summit. My objective for the discussion was that the Austin OpenStack community could have a broader voice is we showed consensus for certain topics in advance of the meeting.

At Jim Plamondon‘s suggestion, we captured our brain storming on the OpenStack etherpad. The Etherpad is super cool – it allows simultaneous editing by multiple parties, so the notes below were crowd sourced during the meeting as we discussed topics that we’d like to see highlighted at the conference. The etherpad preserves editors, but I removed the highlights for clarity.

The next step is for me to consolidate the list into a voting page and ask the membership to rank the items (poll online!) below.

Brain storm results (unedited)

Stablity vs. Features

API vs. Code

  • What is the measurable feature set?
  • Is it an API, or an implementation?
  • Is the Foundation a formal-ish standards body?
  • Imagine the late end-game: can Azure/VMWare adopt OPenStack’s APIs and data formats to deliver interop, without running OpenStack’s code? Is this good? Are there conversations on displacing incumbents and spurring new adoption?
  • Logo issues

Documentation Standards

  • Dev docs vs user docs
  • Lag of update/fragmentation (10 blogs, 10 different methods, 2 “work”)
  • Per release getting started guide validated and available prior or at release.

Operations Focus

  • Error messages and codes vs python stack traces
  • Alternatively put, “how can we make error messages more ops-friendly, without making them less developer-friendly?”
  • Upgrade and operations of rolling updates and upgrades. Hot migrations?

If OpenStack was installable on Windows/Hyper-V as a simple MSI/Service installer – would you try it as a node?

  • Yes.

Is Nova too big?  How does it get fixed?

  • libraries?
  • sections?
  • make it smaller sub-projects
  • shorter release cycles?

nova-volume

  • volume split out?
  • volume expansion of backend storage systems
  • Is nova-volume the canonical control plane for storage provisioning?  Regardless of transport? It presently deals in block devices only… is the following blueprint correctly targeted to nova-volume?

https://blueprints.launchpad.net/nova/+spec/filedriver

Orchestration

  • Is the Donabe project dead?

Discussion about invitations to Summit

  • What is a contribution that warrants an invitation
  • Look at Launchpad’s Karma system, which confers karma for many different “contributory” acts, including bug fixes and doc fixes, in addition to code commitments

Summit Discussions

  • Is there a time for an operations summit?
  • How about an operators’ track?
  • Just a note: forums.openstack.org for users/operators to drive/show need and participation.

How can we capture the implicit knowledge (of mailing list and IRC content) in explicit content (documentation, forums, wiki, stackexchange, etc.)?

Hypervisors: room for discussion?

  • Do we want hypervisor featrure parity?
  • From the cloud-app developer’s perspective, I want to “write once, run anywhere,” and if hypervisor features preclude that (by having incompatible VM images, foe example)
  • (RobH: But “write once, run anywhere” [WORA] didn’t work for Java, right?)
  • (JimP: Yeah, but I was one of Microsoft’s anti-Java evangelists, when we were actively preventing it from working — so I know the dirty tricks vendors can use to hurt WORA in OpenStack, and how to prevent those trick from working.)

CDMI

Swift API is an evolving de facto open alternative to S3… CDMI is SNIA standards track.  Should Swift API become CDMI compliant?  Should CDMI exist as a shim… a la the S3 stuff.

OpenStack Essex Deploy Day: First Steps to Production

One March 8th, 70 people from around the world gathered on the Crowbar IM channel to begin building a production grade OpenStack Essex deployment. The event was coordinated as meet-ups by the Dell OpenStack/Crowbar team (my team) in two physical locations: the Nokia offices in Boston and the TechRanch in Austin.

My objective was to enable the community to begin collaboration on Essex Deployment. At that goal, we succeeded beyond my expectations.

IMHO, the top challenge for OpenStack Essex is to build a community of deploying advocates. We have a strong and dynamic development community adding features to the project. Now it is time for us to build a comparable community of deployers. By providing a repeatable, shared and open foundation for OpenStack deployments, we create a baseline that allows collaboration and co-development. Not only must we make deployments easy and predictable, we must also ensure they are scalable and production ready.

Having solid open production deployment infrastructure drives OpenStack adoption.

Our goal on the 8th was not to deliver finished deployments; it was to the start of Essex deployment community collaboration. To ensure that we could focus on getting to an Essex baseline, our team invested substantial time before the event to make sure that participants had a working Essex reference deployment.

By the nature of my team’s event leadership and our approach to OpenStack, the event was decidedly Crowbar focused. I feel like this is an acceptable compromise because Crowbar is open and provides a repeatable foundation. If everyone has the same foundation then we can focus on the truly critical challenges of ensuring consistent OpenStack deployments. Even using Crowbar, we waste a lot of time trying to figure out the differences between configurations. Lack of baseline consistency seriously impedes collaboration.

The fastest way to collaborate on OpenStack deployment is to have a reference deployment as a foundation.

Success By The Numbers

This was a truly international community collaborative event. Here are some of the companies that participated:

Dell (sponsor), Nokia (sponsor), Rackspace, Opscode, Canonical, Fedora, Mirantis, Morphlabs, Nicira, Enstratus, Deutsche Telekom Innovation Laboratories, Purdue University, Orbital Software Solutions, XepCloud and others.

PLEASE COMMENT here if I missed your company and I will add it to the list.

On the day of the event, we collected the following statistics:

  • 70 people on Skype IM channel (it’s not too late to join by pinging DellCrowbar with “Essex barclamps”).
  • 14+ companies
  • 2 physical sites with 10-15 people at each
  • 4 fold increase in traffic on the Crowbar Github to 813 hits.
  • 66 downloads of the Deploy day ISO
  • 8 videos capture from deploy day sessions.
  • World-wide participation

For over 70 people to spend a day together at this early stage in deployment is a truly impressive indication of the excitement that is building around OpenStack.

Improvements for Next Deploy Day

This was a first time that Andi Abes (Boston event lead), Rob Hirschfeld (Austin event lead) or Jean-Marie Martini (Dell event lead) had ever coordinated an event like this. We owe much of the success to efforts by Greg Althaus, Victor Lowther and the Canonical 12.04/Essex team before the event. Also, having physical sites was very helpful.

We are planning to do another event, so we are carefully tracking ways to improve.

Here are some issues we are tracking.

  • Issues with setting up a screen and voice share that could handle 70 people.
  • Lack of test & documentation on Crowbar meant too much time focused on Crowbar
  • Connectivity issues distributed voice
  • Should have started with DevStack as a baseline
  • more welcome in the comments!

Thank you!

I want to thank everyone who participated in making this event a huge success!

OpenStack Essex Deploy Day 3/8 – Get involved and install with us

My team at Dell has been avidly tracking the upsdowns, and breakthroughs of the OpenStack Essex release.  While we still have a few milestones before the release is cut, we felt like the E4 release was a good time to begin the work on Essex deployment.  Of course, the final deployment scripts will need substantial baking time after the final release on April 5th; however, getting deployments working will help influence the quality efforts and expand the base of possible testers.

To rally behind Essex Deployments, we are hosting a public work day on Thursday March 8th.

For this work day, we’ll be hosting all-day community events online and physically in Austin and Boston.  We are getting commitments from other Dell teams, partners and customers around the world to collaborate.  The day is promising to deliver some real Essex excitement.

The purpose of these events is to deliver the core of a working OpenStack Essex deployment.  While my team is primarily focused on deploys via Crowbar/Chef, we are encouraging anyone interested in laying down OpenStack Essex to participate.  We will be actively engaged on the OpenStack IRC and mailing lists too.

We have experts in OpenStack, Chef, Crowbar and Operating Systems (Canonical, SUSE, and RHEL) engaged in these activities.

This is a great time to start learning about OpenStack (or Crowbar) with hands-on work.  We are investing substantial upfront time (checkout out the Crowbar wiki for details) to ensure that there is a working base OpenStack Essex deploy on Ubuntu 12.04 beta.  This deploy includes the Crowbar 1.3 beta with some new features specifically designed to make testing faster and easier than ever before.

In the next few days, I’ll cut a 12.04 ISO and OpenStack Barclamp TARs as the basis for the deploy day event.  I’ll also be creating videos that help you quickly get a test lab up and running.  Visit the wiki or meetup sites to register and stay tuned for details!

Austin OpenStack Meetup: Keystone & Knife (2/20 notes via Greg Althaus)

I could not make it to the recent Austin OpenStack Meetup, but Greg Althaus generously let me post his notes from the event.

Background

Matt Ray talks about Chef

Matt Ray from Opscode presented some of the work with Chef and OpenStack. He talked about the three main chef repos floating around. He called out Anso’s original cookbook set that is the basis for the Crowbar cookbooks (his second set), and his final set is the emerging set of cookbooks in OpenStack proper. The third one is interesting and what he plans to continue working on to make into his public openstack cookbooks. These are an amalgamation of smokestack, RCB, Anso improvements, and his (Crowbar’s).

He then demoed his knife plugin (slideshare) to build openstack virtual servers using the Openstack API. This is nice and works against TryStack.org (previously “Free Cloud”) and RCB’s demo cloud. All of that is on his github repo with instructions how to build and use. Matt and I talked about trying to get that into our Crowbar distro.

There were some questions about flow and choice of OpenStack API versus Amazon EC2 API because there was already an EC2 knife set of plugins.

Ziad Sawalha talks about Keystone

Ziad Sawalha is the PLT (Project Technical Lead) for Keystone. He works for Rackspace out of San Antonio. He drove up for the meeting.

He split his talk into two pieces, Incubation Process and Keystone Overview. He asked who was interested in what and focused his talk more towards overview than incubation.

Some key take-aways:

  • Keystone comes from Rackspace’s strong, flexible, and scalable API. It started as a known quantity from his perspective.
  • Community trusted nothing his team produced from an API perspective
  • Community is python or nothing
    • His team was ignored until they had a python prototype implementing the API
    • At this point, comments on API came in.
  • Churn in API caused problems with implementation and expectations around the close of Diablo.
    • Because comments were late, changes occurred.
    • Official implementation lagged and stalled into arriving.
  • API has been stable since Diablo final, but code is changing. that is good and shows strength of API.
  • Side note from Greg, Keystone represents to me the power of API over Code. You can have innovation around the implementation as long all the implementations have a fair ground work to plan under which is an API specification. The replacement of Keystone with the Keystone Light code base is an example of this. The only reason this is possible is that the API was sound and documented.  (Rob’s post on this)

Ziad spent the rest of his time talking about the work flow of Keystone and the API points. He covered the API points.

  • Client to Keystone, Keystone to Client for initial auth token
  • Client to Middleware API for the services to have a front.
  • Middleware to Keystone to verify and establish identity.
  • Middleware to Service to pass identity

Not many details other then flow and flexibility. He stressed the API design separated protocol from actions and data at all the layers. This allows for future variations and innovations while maintaining the APIs.

Ziad talked about the state of Essex.

  • Planned
    • RBAC (aka Role Based Access Control)
    • Stability
    • Many backends
  • Actual
    • Code replacement Keystone Light
    • Stability
    • LDAP backend
    • SQL backend

Folsum work:

  • RBAC
  • Stability
  • AD backend
  • Another backend
  • Federation was planned but will most likely be pushed to G
    • Federation is the ability for multiple independent Keystones to operate (bursting use case)
    • Dependent upon two other federation components (networking and billing/metering)

OpenStack Boston Meetup 2/1 covers Quantum & Foundation

My team at Dell was in Beantown (several of us are Nashua based) for an annual team meeting so the timing for this Boston meetup.  Special thanks to Andi Abes for organizing and Suse for Sponsoring!!

We covered two primary topics: Quantum and the OpenStack Foundation.

In typing up my notes from the sessions, I ended up with so much information that it made more sense to break them into independent blog posts. Wow – that’s a lot of value from a free meetup!eetup was ideal for us. While we showed up in force, so did many other Stackers including people from HP, Nicira, Suse, Havard, Voxel, RedHat, ESPN and many more! The turnout for the event was great and I’m taking notes that Austin may need to upgrade our pizza and Boston may need to upgrade their cookies (just sayin’).

The Quantum session by David Lapsley from Nicira talked about the architecture and applications of Quantum. I think that Quantum is an exciting incubated project for OpenStack; however, it is important to remember that Essex stands alone without it. I believe this fact gets forgotten in enthusiasm over Quantum’s shiny potential.

The OpenStack session by Rob Hirschfeld from Dell (me!) talked about the importance of governance for OpenStack and how the Foundation will play a key role in transitioning it from Rackspace to a neutral party. There are many feel-good community benefits that the Foundation brings; however, the collaborators’ ROI is driver for creating a strong foundation. There is nothing wrong with acknowledging that fact and using it to create a more sustainable OpenStack.

Quantum: Network Virtualization in the OpenStack Essex Release

This post is part of my notes from the 2/1 Boston OpenStack meetup.

Quantum

David Lapsley from Nicira gave the Quantum presentation (his slides). My notes include additional explication and interpretation so he is not to blame for errors (but I’ll share credit for clarity).

The objective for Quantum is to replace the current networking modes (flat, vlan, dhcp, dhcp ha) with a programmatic networking API. The idea is that cloud users would use the API to request the network topology they wanted to implement rather than have it imposed by the infrastructure’s network mode. To accomplish this, the API must allow users to create complex & hierarchical network topologies without being aware of the underlying network infrastructure (aka “an abstraction layer”).

In simpler terms: Quantum allows users to design their own isolated networks without knowing how the network is actually deployed.

Quantum is a stand-alone service with its own API. It is not simply an extension of the Nova API. The Quantum API an extensibility model similar to Nova and it also has a plug-in architecture so that it can be implementation agnostic. The plug-ins are needed to map the user’s API abstraction into actual networking. For example, if the user requests a network tunnel between two VMs then the plug in may choose to implement a tagged VLAN, OpenFlow connections, IPtable filters, or encapsulated tunnels. The goal is that the implementation of the API should not matter to the user of the API!

For (hopefully) obvious reasons, the use cases the Quantum are similar to the Amazon EC2 VPC. The notable exception is service injection. Quantum wants to allow vendors/providers to innovate around value-added services. This should result in a diversity of choices as vendors offer additional network services such as load balancers, IPS, IDS, etc. While this is a great concept, it’s important to note that Quantum is currently limited to a single plug-in!  [see note in comments by Quantum PTL Dan Wendlandt (@danwendlandt)]

The expectation is that cloud users will want to create traditional application topologies with different tiers of access. For example, applications may require a dedicated network between web and database tiers or a DMZ between web and load balancer. The challenge is that these are patterns not rigid requirements. Ultimately, the simplest solution for the feature is to allow users to create “virtual VLANs.”

Essentially, the current Quantum API is creating virtual VLANs.

The Quantum API has four basic abstractions: interface, network, port and attachment. These primitives are used to build up a virtual network just as they are in physical networks.

  • Interface: cloud / tenant / server / GUID / eth0
  • Network: cloud / tenant / network / GUID
  • Port: cloud ID / tenant / network / GUID / port / GUID
  • Attachement : interface & network & port

To use the Quantum API, you must create a network, add ports (to network) and interfaces (to vms) then attach the network, interface, and port together. This gives users very fine grained control over their network topology. It is up to the plug-in to translate these primitives into a working physical topology.

According to my teammate and OSBOS organizer, Andi Abes, the Quantum API reached consensus in the community quickly because these it started with this basic but extensible API. In the meeting, I added that this approach is typical for OpenStack where it is considered better to demonstrate working core functionality than build extra complexity into the initial delivery. This approach links back to the API vs. Implementation debate I’ve discussed before. This simple API also provides room for innovation – while providing the basic constructs it is light, and does not encumber mappings of this API to different underlying technologies with lots of extras. OEM Vendors and service providers this have an easier time differentiating their offerings be it equipment or services.

In my experience, people often link OpenFlow and Quantum into a single technology base. I have certainly been guilty making that generalization. Quantum does not require OpenFlow or vice versa; however, they are highly complementary. OpenFlow takes over the switches’ “flow table” and allows administrators to control how every packet that touches the switch is routed. The potential for OpenFlow is to create highly dynamic and controlled network conduits. Quantum needs exactly that functionality to most directly map the virtual network requests into a physical fabric. In this way, OpenFlow is the most direct approach to building a fully enabled Quantum plug-in.

In the Essex release, progress has been made (and still is being made) towards integrating Nova and Quantum. The workflow of attaching a VIF (virtual interface) to the right network, and assigning it an appropriate IP (using Melange – the OpenStack IP address management project) are making headway. That said, the dashboard integration still lags and more progress is required.

Overall, my impression is that Quantum has great potential; however, I think that Nova in Essex will be sufficient for real applications without Quantum. As my freshman roommate used to say, “potential means you’ve got to keep working on it.”

CloudOps white paper explains “cloud is always ready, never finished”

I don’t usually call out my credentials, but knowing the I have a Masters in Industrial Engineering helps (partially) explain my passion for process as being essential to successful software delivery. One of my favorite authors, Mary Poppendiek, explains undeployed code as perishable inventory that you need to get to market before it loses value. The big lessons (low inventory, high quality, system perspective) from Lean manufacturing translate directly into software and, lately, into operation as DevOps.

What we have observed from delivering our own cloud products, and working with customers on thier’s, is that the operations process for deployment is as important as the software and hardware. It is simply not acceptable for us to market clouds without a compelling model for maintaining the solution into the future. Clouds are simply moving too fast to be delivered without a continuous delivery story.

This white paper [link here!] has been available since the OpenStack conference, but not linked to the rest of our OpenStack or Crowbar content.

Austin OpenStack Meetup (January Minutes) + OpenStack Foundation Web Cast!

Sorry for the brevity… At the last Austin OpenStack meetup, we had >60 stackers!  Some from as far away as Portland and Boston (as in Oregon and Massachusetts).

Notes:

  • Suse introduced their OpenStack beta and talked about their Suse Studio that can deploy images against the OpenStack APIs
  • I showed off DevStack.org code that can setup the truck of OpenStack (now Essex) in about 10 minutes on a single node.  Great for developers!
  • I showed an OpenStack Diablo Final deployment from Crowbar.  I focused mainly on Dashboard and used our reference architecture (see below) as illustration of the many parts.
  • Matt Ray suggested everyone watch the webcasts about the OpenStack Foundation (Thurs 6pm central  & Friday 9am central)
  • We planned the next few meetups.
    • For February, we’ll talk about Swift and Dashboard.
    • For March, we’ll talk about Essex and DevStack to prep for the next design summit (in SF).
    • For April, we’ll debrief the conference

Thank you Suse and Dell (my employer) for sponsoring!   The next meetup is sponsored by Canonical.