OpenStack’s next hurdle: Interoperability. Why should you care?

SXSW life size Newton’s Cradle

The OpenStack Board spent several hours (yes, hours) discussing interoperability-related topics at the last board meeting.  Fundamentally, the community benefits when users can operate easily across multiple OpenStack deployments (their own and/or public clouds).

Cloud interoperability: the ability to transfer workloads between systems without changes to the deployment or operations management infrastructure.

This is NOT hybrid (which I define as a workload transparently operating in multiple systems); however, it is a prerequisite for achieving scalable hybrid operation.
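To make the definition concrete, here is a minimal sketch (in Python, my own illustration rather than anything from the board discussion): the workload description stays identical across clouds and only the endpoint details vary, so the deployment tooling never changes. The deploy() helper and the endpoint values are hypothetical.

    # Hypothetical illustration of workload interoperability: one workload spec,
    # two OpenStack clouds, identical deployment code; only endpoints/credentials vary.
    WORKLOAD = {                     # the portable part: no cloud-specific details
        "image": "ubuntu-12.04",
        "flavor": "m1.medium",
        "count": 3,
    }

    CLOUDS = {                       # the non-portable part, isolated in one place
        "private": {"auth_url": "http://10.0.0.10:5000/v2.0", "user": "ops"},
        "public":  {"auth_url": "https://cloud.example.com:5000/v2.0", "user": "ops"},
    }

    def deploy(cloud, workload):
        """Stand-in for real client calls; interoperability means this function
        does not change when the target cloud changes."""
        return f"deploying {workload['count']}x {workload['flavor']} to {cloud['auth_url']}"

    for name, cloud in CLOUDS.items():
        print(name, "->", deploy(cloud, WORKLOAD))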

Interoperability matters because the OpenStack value proposition is all about creating a common platform.  IT World does a good job laying out the problem (note, I work for Dell).  To create sites that can interoperate, we have to do some serious lifting:

At the OpenStack Summit, there are multiple chances to engage on this.   I’m moderating a panel about Interop and also sharing a session about the highly related topic of Reference Architectures with Monty Taylor.

The Interop Panel (topic description here) is Tuesday @ 5:20pm.  If you join, you’ll get to see me try to stump our awesome panelists:

  • Jonathan LaCour, DreamHost
  • Troy Toman, Rackspace
  • Bernard Golden,  Enstratius
  • Monty Taylor, OpenStack Board (and HP)
  • Peter Pouliot, Microsoft

PS: Oh, and I’m also talking about DevOps Upgrades Patterns during the very first session (see a preview).

DevOps approaches to upgrade: Cube Visualization

I’m working on my OpenStack summit talk about DevOps upgrade patterns and got to a point where there are three major vectors to consider:

  1. Step Size (shown as X axis): do we make upgrades in small frequent steps or queue up changes into larger bundles? Larger steps mean that there are more changes to be accommodated simultaneously.
  2. Change Leader (shown as Y axis): do we upgrade the server or the client first? Regardless of the choice, the followers should be able to handle multiple protocol versions if we are going to have any hope of a reasonable upgrade.
  3. Safeness (shown as Z axis): do the changes preserve the data and productivity of the entity being upgraded? It is simpler to assume that we simply add new components and remove old ones; however, this approach carries significant risks or redundancy requirements.

I’m strongly biased towards continuous deployment because I think it reduces risk and increases agility; however, laying out all the vertices of the upgrade cube helps to visualize where costs and risks are added in the traditional upgrade models.

Breaking down each vertex:

  1. Continuous Deploy – core infrastructure is updated on a regular (usually daily or faster) basis
  2. Protocol Driven – like changing to HTML5, the clients are tolerant to multiple protocols and changes take a long time to roll out
  3. Staged Upgrade – a tightly coordinated migration between major versions over a short period of time, in which all of the components in the system step from one version to the next together.
  4. Rolling Upgrade – the system operates a small band of versions simultaneously, where the components with the oldest versions are in the process of being removed and their capacity replaced with new nodes running the latest versions.
  5. Parallel Operation – two server systems operate and clients choose when to migrate to the latest version.
  6. Protocol Stepping – roll out clients that support multiple versions, and then upgrade the server infrastructure only after all clients can support both versions.
  7. Forced Client Migration – change the server infrastructure and then force the clients to upgrade before they can reconnect.
  8. Big Bang – you have to shut down all components of the system to upgrade it
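To make the cube concrete, here is a minimal sketch (in Python, my own illustration rather than part of the talk) that enumerates the eight vertices from the three binary axes; the assignment of pattern names to corners reflects my reading of the list above and is an assumption.

    from itertools import product

    # Sketch of the upgrade cube: three binary axes produce eight vertices.
    AXES = {
        "step_size":     ("small", "large"),          # X: continuous vs. bundled changes
        "change_leader": ("server", "client"),        # Y: which side upgrades first
        "safeness":      ("preserving", "replacing"), # Z: keep state vs. swap components
    }

    # Assumed mapping of vertices to the named patterns above.
    PATTERNS = {
        ("small", "server", "preserving"): "Continuous Deploy",
        ("small", "client", "preserving"): "Protocol Driven",
        ("large", "server", "preserving"): "Staged Upgrade",
        ("small", "server", "replacing"):  "Rolling Upgrade",
        ("large", "client", "preserving"): "Parallel Operation",
        ("small", "client", "replacing"):  "Protocol Stepping",
        ("large", "server", "replacing"):  "Forced Client Migration",
        ("large", "client", "replacing"):  "Big Bang",
    }

    for vertex in product(*AXES.values()):
        print(vertex, "->", PATTERNS[vertex])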

This type of visualization helps me identify costs and options. It’s not likely to get much time in the final presentation so I’m hoping to hear in advance if it resonates with others.

PS: like this visualization? check out my “magic 8 cube” for cloud hosting options.

5 things keeping DevOps from playing well with others (Chef, Crowbar and Upstream Patterns)

Sharing can be hard

Since my earliest days on the OpenStack project, I’ve wanted to break the cycle of black box operations with open ops. With the rise of community-driven DevOps platforms like Opscode Chef and Puppetlabs, we’ve reached a point where it’s both practical and imperative to share operational practices in the form of code and tooling.

Being open and collaborating are not the same thing.

It’s a huge win that we can compare OpenStack cookbooks. The real victory comes when multiple deployments use the same trunk instead of forking.

This has been an objective I’ve helped drive for OpenStack (with Matt Ray); it has been the Crowbar objective from the start and is the keystone of our Crowbar 2 work.

This has proven to be a formidable challenge for several reasons:

  1. Diverging DevOps patterns across private, public, large, small, and other deployments -> solution: the attribute injection pattern is promising.
  2. Tooling gaps prevent operators from leveraging shared deployments -> solution: this is part of Crowbar’s mission.
  3. Under-investing in community-supporting features because they are seen as taking away from getting into production -> solution: we need leadership and others to join in.
  4. Drift between target versions creates the need for forking even if the cookbooks are fundamentally the same -> solution: pull-from-source approaches help create distro-independent baselines.
  5. Missing reference architectures interfere with having a stable baseline to deploy against -> solution: agree to a standard, machine-consumable RA format like OpenStack Heat.

Unfortunately, these five challenges are tightly coupled and we have to progress on them simultaneously. The tooling and community require patterns and RAs.

The good news is that we are making real progress.

Judd Maltin (@newgoliath), a Crowbar team member, has documented the emerging Attribute Injection practice that Crowbar has been leading. That practice has been refined in the open by AT&T and Rackspace. It is forming the foundation of the OpenStack cookbooks.
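For readers new to the pattern, here is a minimal, language-agnostic sketch of the attribute injection idea, written in Python with hypothetical names (real Crowbar and Chef code is Ruby and differs in detail): the orchestrator resolves the topology and hands attributes to the deployment logic, instead of each recipe searching for other nodes itself.

    # Hypothetical sketch of attribute injection (not actual Crowbar/Chef code).

    def render_glance_config(attrs):
        """Deployment logic only reads attributes that were handed to it."""
        return (
            "[keystone_authtoken]\n"
            f"auth_host = {attrs['keystone']['host']}\n"
            f"auth_port = {attrs['keystone']['port']}\n"
        )

    # Anti-pattern: the recipe searches the server for other nodes, tying it to
    # one site's topology and discovery mechanism.
    #
    # Injection pattern: the orchestrator resolves the topology and *injects*
    # the attributes, so the same recipe works for private, public, large, or
    # small deployments.
    def orchestrate(topology):
        attrs = {"keystone": {"host": topology["keystone_node"], "port": 35357}}
        return render_glance_config(attrs)

    print(orchestrate({"keystone_node": "10.0.0.5"}))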

Understanding, discussing and supporting that pattern is an important step toward accelerating open operations. Please engage with us as we make the investments for open operations and help us implement the pattern.

The Atlantic magazine explains why Lean process rocks (and saves companies $$)

I’m certain that the Atlantic‘s Charles Fishman was not thinking about software and DevOps when he wrote his excellent article “The Insourcing Boom.”  However, I strongly recommend this report to anyone interested in a practical example of the inefficiencies that Lean process addresses (if you are impatient, jump to page 2 and search for “toaster”).

It’s important to realize that this article is not about software! It’s an article about industrial manufacturing and the impact that lean process has when you are making stuff.  It’s about how US companies are using Lean to make domestic plants more profitable than Asian ones.  It turns out that how you make something really matters – you can’t really optimize the system if you treat major parts like a black box.

When I talk about Agile and Lean, I am talking about proven processes being applied broadly by companies that want to make a profit selling stuff. That’s what this article is about.

If you are making software then you are making stuff! Your install and deploy process is your assembly line. Your unreleased code is your inventory.

This article does a good job explaining the benefits of being close to your manufacturing (DevOps) and being flexible in deployment (Agile) and being connected to customers (Lean).  The software industry often acts like it’s inventing everything from scratch. When it comes to manufacturing processes, we can learn a lot from industry.

Unlike software, industry has real costs for scrap and lost inventory. Instead of thinking of manufacturing as “old school,” perhaps we should think of it as the school of hard knocks.

My Dilemma with Folsom – why I want to jump to G

When your ship sails

These views are my own.  Based on 1×1 discussions I’ve had in the OpenStack community, I am not alone.

If you’ve read my blog then you know I am a vocal and active supporter of OpenStack who is seeking re-election to the OpenStack Board.  I’m personally and professionally committed to the project’s success. And, I’m confident that OpenStack’s collaborative community approach is out-innovating other clouds.

A vibrant project requires that we reflect honestly: we face an equal measure of challenges, including shadow development during feature-freeze free fall, API vs. implementation, forking risk, and others.  As someone helping users deploy OpenStack today, I find myself straddling a solid release (Essex) and an innovative one (Grizzly). Frankly, I’m finding it very difficult to focus on Folsom.

Grizzly excites me, and clearly I’m not alone.  Based on the pace of development, I believe we saw a significant developer migration during the feature-freeze free fall.

In Grizzly, both Cinder and Quantum will have progressed to a point where they are ready for mainstream consumption. That means that OpenStack will have achieved the cloud API trifecta of compute-store-network.

  • Cinder will get beyond the “replace Nova Volume” feature set and expand the list of connectors.
  • Quantum will get to parity with Nova Network, address overlapping VM IPs, and go beyond L2 with L3 feature enablement like Load Balancing as a Service.
  • We are having a real dialog about upgrades while the code is still in progress.
  • And new projects like Ceilometer and Heat are poised to address real user problems in billing and application formation.

Everything I hear about Folsom deployments is positive: stable code and significant improvements. However, we’re too late to really influence operability at the code level because the Folsom release is done.  This is not a new dilemma.  As operators, we seem to be forever chasing the tail of the release.

The perpetual cycle of implementing deployment after release is futile, exhausting and demoralizing because we finish just in time for the spotlight to shift to the next release.

I don’t want to slow the pace of releases.  In Agile/Lean, we believe that if something is hard then we should do it more.  Instead, I am looking at Grizzly and seeing an opportunity to break the cycle.  I am looking at Folsom and thinking that most people will be OK with Essex for a little longer.

Maybe I’m a dreamer, but if we can close the deployment time gap then we accelerate adoption, innovation and happy hour.  If that means jilting Folsom at the release altar to elope with Grizzly then I can live with that.

Ops “Late Binding” is a critical best practice and key to Crowbar differentiation

From OSCON 2012: Portland’s light rail understands good dev practice.

Late binding is a programming term that I’ve commandeered for Crowbar’s DevOps design objectives.

We believe that late binding is a best practice for CloudOps.

Understanding this concept is turning out to be an important but confusing differentiation for Crowbar. We’ve effectively inverted the typical deploy pattern of building up a cloud from bare metal; instead, Crowbar allows you to build a cloud from the top down.  The difference is critical – we delay hardware decisions until we have the information needed to do the correct configuration.

If Late Binding is still confusing, the concept is really very simple: “we hold off all work until you’ve decided how you want to set up your cloud.”

Late binding arose directly from the few critical operational design objectives we set at the start of the project:

  1. Treat the nodes and application layers as an interconnected system
  2. Realize that application choices should drive configuration down the entire stack, including BIOS, RAID and networking
  3. Expect the entire system to be constantly changing, so we must track state and avoid locked configurations.

We’d seen these objectives as core tenets among hyperscale operators, who consider bare metal and network configuration to be an integral part of their application deployment. We know it is possible to build the system in layers that only (re)deploy once the application configuration is defined.

We have all this great interconnected automation! Why waste it by having to pre-stage the hardware or networking?

In cloud, late binding is known as “elastic computing” because you wait until you need resources to deploy. But running apps on cloud virtual machines is simple compared to operating physical infrastructure. In physical operations, RAID, BIOS and networking matter a lot because there are important and substantial variations between nodes. These differences are what drive late binding as one of Crowbar’s core design principles.
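As a thought experiment, here is a minimal sketch of the idea (hypothetical names and role profiles, not actual Crowbar code): the hardware configuration is computed from the application role at deploy time rather than pre-staged when the node is racked.

    # Hypothetical sketch of late binding: BIOS/RAID/network settings are bound
    # only when the cloud design assigns the node a role.

    ROLE_PROFILES = {
        # Assumed example profiles; real values depend on the workload.
        "swift-storage": {"raid": "jbod",  "bios": "power-save",  "nics": ["storage"]},
        "nova-compute":  {"raid": "raid1", "bios": "performance", "nics": ["tenant", "mgmt"]},
    }

    class Node:
        def __init__(self, name):
            self.name = name
            self.hardware_config = None      # intentionally unset: bound late

        def bind(self, role):
            # The binding happens only once the application design assigns a role.
            self.hardware_config = ROLE_PROFILES[role]
            return f"{self.name}: {role} -> {self.hardware_config}"

    # Nodes are discovered first; hardware decisions wait for the cloud design.
    pool = [Node("d00-01"), Node("d00-02")]
    print(pool[0].bind("nova-compute"))
    print(pool[1].bind("swift-storage"))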

Late Binding Visualized

The real workloads begin: Crowbar’s Sophomore Year

Given Crowbar’s frenetic Freshman year, it’s impossible to predict everything that Crowbar could become. I certainly aspire to see the project gain a stronger developer community, and the seeds of this transformation are sprouting. I also see that community-driven work is positioning Crowbar to break beyond being a platform for the OpenStack and Apache Hadoop solutions that pay the bills for my team at Dell to invest in Crowbar development.

I don’t have to look beyond the summer to see important development for Crowbar because of the substantial goals of the Crowbar 2.0 refactor.

Crowbar 2.0 is really just around the corner so I’d like to set some longer range goals for our next year.

  • Growing acceptance of Crowbar as an in data center extension for DevOps tools (what I call CloudOps)
  • Deeper integration into more operating environments beyond the core Linux flavors (like virtualization hosts, closed, and special-purpose operating systems)
  • Improvements in dynamic networking configuration
  • Enabling more online network connected operating modes
  • Taking on production ops challenges of scale, high availability and migration
  • Formalization of our community engagement with summits, user groups, and broader developer contributions.

For example, Crowbar 2.0 will be able to handle downloading packages and applications from the internet. Online content is not a major benefit without being able to stage and control how those new packages are deployed; consequently, our goals remain tightly focused on improvements in orchestration.

These changes create a foundation that enables a more dynamic operating environment. Ultimately, I see Crowbar driving towards a vision of fully integrated continuous operations; however, Greg & Rob’s Crowbar vision is the topic for tomorrow’s post.

A SuPEr New Linux for Crowbar! SuSE shows off port and OpenStack deploy

During last week’s OpenStack Essex Deploy Day, we featured several OpenStack ecosystem presentations including SuSE, Morphlabs, enStratus, Opscode, and Inktank (Ceph).

SuSE’s presentation (video) was deploying OpenStack using a SuSE port of Crowbar (including a reskinned UI)!

This is significant for both SuSE and Crowbar:

  1. SuSE, a platinum member of the OpenStack foundation, now has an OpenStack Essex distribution. They are offering this deployment as an on-request beta.
  2. Crowbar has now been demonstrated operating on the three top Linux distributions.

SuSE is advancing some key architectural proposals for Crowbar because their implementation downloads Crowbar as a package rather than bundling everything into an ISO.

With the Hadoop 4 & OpenStack Essex releases nearly put to bed, it’s time to bring some of this great innovation into the Crowbar trunk.

OSED OMG: OpenStack Essex Deploy Day!! A day-long four-session two-track International Online Conference

Curious about OpenStack? Know it, but want to tune your Ops chops? JOIN US on Thursday 5/31 (or Friday 6/1 if you are in Asia)!

Already know the event logistics? Skip back to my OSED observations post.

Some important general notes:

  1. We are RECORDING everything and will link posts from the event page.
  2. There is HOMEWORK if you want to get ahead by installing OpenStack yourself.
  3. For last minute updates about the event, I recommend that you join the Crowbar Listserver.

Content Logistics work like this:

  1. Everything will be available ONLINE. We are also coordinating many physical sites as rally points.
  2. Introductory: FOUR 3-hour sessions for people who do not have OpenStack or Crowbar experience. These sessions will show how to install OpenStack using Crowbar, discuss DevOps and showcase companies that are in the OpenStack ecosystem. They are planned to have 2 European slots (afternoon & evening), 3 US slots (morning, afternoon & evening), and 1 Asian slot (morning).
  3. Expert: ON-GOING deep technical sessions for engineers who have OpenStack and/or Crowbar experience. There will be one main screen and voice channel in which we are planning to highlight and discuss these topics in blocks throughout the day. We have a long list of topics to discuss and will maintain an ongoing Google Hangout for each topic. Depending on interest, we will jump back and forth to different hangouts.

Intro/Overview Session Logistics work like this:

We’re planning FOUR introductory sessions throughout the day (read ahead?). Each session should be approximately 3 hours. The first hour of the sessions will be about OpenStack Essex and installing it using Crowbar. After some Q&A, we’re going to highlight the OpenStack ecosystem. The schedule for the ecosystem is in flux and will likely shift even during the event.

The session start times for Overview & Ecosystem content:

  Region              Offset to EDT   Session 1   Session 2   Session 3   Session 4
  Europe              -5              3pm         6pm         *           *
  Americas Eastern    0               10am        1pm         4pm         *
  Americas Central    +1              9am         Noon        3pm         *
  Americas Mountain   +2              *           11am        2pm         7pm
  Americas West       +3              *           10am        1pm         6pm
  Asia (Tokyo)        +10             *           *           *           6/1 10am

* There are no planned live venues at this time/region. You are always welcome to join online!

Experts Track Logistics

Note: we expect experts to have already installed OpenStack (see homework page). Ideally, an expert has already setup a build environment.

We have a list of topics (Essex, Quantum, Networking, Pull from Source, Documentation, etc) that we plan to cover on a 30-60 minute rotation.

We will cover the OpenStack Essex deploy at the start of each planned session (9am, Noon, 3pm & 8pm EDT). Before we cover the OpenStack deploy, we’ll spend 10 minutes setting (and posting) the agenda for the next three hours based on attendee input.

Even if we are not talking about a topic on the main channel, we will keep a dialog going on topic specific Google hangouts. The links to the hangouts will be posted with the Expert track agenda.

We need an OpenStack Reference Deployment (My objectives for Deploy Day)

I’m overwhelmed and humbled by the enthusiasm my team at Dell is seeing for the OpenStack Essex Deploy day on 5/31 (or 6/1 for Asia). What started as a day for our engineers to hack on Essex Cookbooks with a few fellow Crowbarians has morphed into an international OpenStack event spanning Europe, Americas & Asia.

If you want to read more about the event, check out my event logistics post (link pending).

I do not apologize for my promotion of the Dell-led open source Crowbar as the deployment tool for the OpenStack Essex Deploy. For a community to focus on improving deployment tooling, there must be a stable reference infrastructure. Crowbar provides a fast and repeatable multi-node environment with scriptable networking and packaging.

I believe that OpenStack benefits from a repeatable multi-node reference deployment. I’ll go further and state that this requires DevOps tooling to ensure consistency both within and between deployments.

DevStack makes trunk development more canonical between different developers. I hope that Crowbar will provide a similar experience for operators so that we can truly share deployment experience and troubleshooting. I think Crowbar deployments are already repeatable enough to provide a reference for defect documentation and reproduction.

Said more plainly, it’s a good thing if a lot of us use OpenStack in the same way so that we can help each out.

My team’s choice to accelerate releasing the Crowbar barclamps for OpenStack Essex makes perfect sense if you accept our rationale for creating a community baseline deployment.

Crowbar is Dell-led, not Dell-specific.

One of the reasons that Crowbar is open source and that we do our work in the open (yes, you can see our daily development on GitHub) is to make it safe for everyone to invest in a shared deployment strategy. We encourage and welcome community participation.

PS: I believe the same is true for any large scale software project. Watch out for similar activity around Apache Hadoop as part of our collaboration with Cloudera!