Our Vision for Crowbar – taking steps towards closed loop operations

When Greg Althaus and I first proposed the project that would become Dell’s Crowbar, we had already learned first-hand that there was a significant gap in both the technologies and the processes for scale operations. Our team at Dell saw that the successful cloud data centers were treating their deployments as integrated systems (now called DevOps) in which configuration of many components was coordinated and orchestrated; however, these approaches fell short of the mark in our opinion. We wanted to create a truly integrated operational environment from the bare metal through the networking up to the applications and out to the operations tooling.

Our ultimate technical nirvana is to achieve closed-loop continuous deployments. We want to see applications that constantly optimize new code, deployment changes, quality, revenue and cost of operations. We could find parts of this vision, but not a complete and adequate foundation for it.

The business driver for Crowbar is system thinking around improved time to value and flexibility. While our technical vision is a long-term objective, we see very real short-term ROI. It does not matter if you are writing your own software or deploying applications; the faster you can move that code into production, the sooner you get value from innovation. It is clear to us that the most successful technology companies have reorganized around speed to market and adapting to the pace of change.

System flexibility & acceleration were key values when the lean manufacturing revolution gave Dell a competitive advantage, and they have proven even more critical in today’s dynamic technology innovation climate.

We hope that this post helps define a vision for Crowbar beyond the upcoming refactoring. We started the project with the idea that new tools meant we could take operations to a new level.

While that’s a great objective, we’re too pragmatic in delivery to rest on a broad objective. Let’s take a look at Crowbar’s concrete strengths and growth areas.

Key strength areas for Crowbar

  1. Late binding – hardware and network configuration is held until software configuration is known.  This is a huge system concept.
  2. Dynamic and Integrated Networking – means that we treat networking as a 1st class citizen for ops (sort of like software defined networking but integrated into the application)
  3. System Perspective – no Application is an island.  You can’t optimize just the deployment, you need to consider hardware, software, networking and operations all together.
  4. Bootstrapping (bare metal) – while not “rocket science” it takes a lot of careful effort to get this right in a way that is meaningful in a continuous operations environment.
  5. Open Source / Open Development / Modular Design – this problem is simply too complex to solve alone.  We need to get a much broader net of environments and thinking involved.
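The late binding idea in item 1 can be sketched in a few lines. This is a minimal illustration, not Crowbar's implementation: the `Node` class, the `discover` and `bind_role` helpers, and the role-to-hardware profiles are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical role-to-hardware profiles; the names and values are illustrative only.
HARDWARE_PROFILES: Dict[str, Dict[str, str]] = {
    "storage": {"raid": "JBOD", "bios": "storage-optimized"},
    "compute": {"raid": "RAID10", "bios": "virtualization"},
}


@dataclass
class Node:
    name: str
    role: Optional[str] = None  # software role, unknown at discovery time
    hardware: Dict[str, str] = field(default_factory=dict)  # empty until a role is bound


def discover(name: str) -> Node:
    """Register a bare-metal node without committing to any hardware setup."""
    return Node(name=name)


def bind_role(node: Node, role: str) -> Node:
    """Late binding: derive the hardware configuration from the software role."""
    node.role = role
    node.hardware = dict(HARDWARE_PROFILES[role])
    return node


if __name__ == "__main__":
    node = discover("node-01")
    print(node.hardware)               # nothing decided yet
    print(bind_role(node, "storage"))  # hardware chosen once the role is known
```

The point of the sketch is the ordering: discovery records the node, and the RAID and BIOS choices are made only when the software role is assigned.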

Continuing Areas of Leadership

  1. Open / Lean / Incremental Architecture – these are core aspects of our approach.  While we have a vision, we also are very open to ways that solve problems faster and more elegantly than we’d expected.
  2. Continuous deployment – we think the release cycles are getting faster and the only way to survive is to build change into the foundation of operations.
  3. Integrated networking – software defined networking is cool, but not enough.  We need to have semantics that link applications, networks and infrastructure together.
  4. Equivalent physical / virtual – we’re not saying that you won’t care if it’s physical or virtual (you should); we think that it should not impact your operations.
  5. Scale / Hybrid – the key element to hybrid is scale and to scale is hybrid.  The missing connection is being able to close the loop.
  6. Closed loop deployment – seeking load management, code quality, profit, and cost of operations as factors in managed operations.

Crowbar 2.0 Objectives: Scalable, Heterogeneous, Flexible and Connected

The seeds for Crowbar 2.0 have been in the 1.x code base for a while and were recently accelerated by SuSE.  With the Dell | Cloudera 4 Hadoop and Essex OpenStack-powered releases behind us, we will now be totally focused on bringing these seeds to fruition in the next two months.

Getting the core Crowbar 2.0 changes working is not a major refactoring effort in calendar time; however, it will impact current Crowbar developers by changing (and improving) the programming APIs. The Dell Crowbar team decided to treat this as a focused refactoring effort because several important changes are tightly coupled. We cannot solve them independently without causing a larger disruption.

All of the Crowbar 2.0 changes address issues and concerns raised in the community and are needed to support expansion of our OpenStack and Hadoop application deployments.

Our technical objective for Crowbar 2.0 is to simplify and streamline development efforts as the development and user community grows. We are seeking to:

  1. simplify our use of Chef and eliminate Crowbar requirements in our Opscode Chef recipes.
    1. reduce the initial effort required to leverage Crowbar
    2. open Crowbar to a broader audience (see Upstreaming)
  2. provide heterogeneous / multiple operating system deployments. This enables:
    1. multiple versions of the same OS running for upgrades
    2. different operating systems operating simultaneously (and deal with heterogeneous packaging issues)
    3. accommodation of no-agent systems like locked systems (e.g.: virtualization hosts) and switches (aka external entities)
    4. UEFI booting in Sledgehammer
  3. strengthen networking abstractions
    1. allow networking configurations to be created dynamically (so that users are not locked into choices made before Crowbar deployment)
    2. better manage connected operations
    3. enable pull-from-source deployments that are ahead of (or forked from) available packages.
  4. improvements in Crowbar’s core database and state machine to enable
    1. larger scale concerns
    2. controlled production migrations and upgrades
  5. other important items
    1. make documentation more coupled to current features and easier to maintain
    2. upgrade to Rails 3 to simplify code base, security and performance
    3. deepen automated test coverage and capabilities

Beyond these great technical targets, we want Crowbar 2.0 to address barriers to adoption that have been raised by our community, customers and partners. We have been tracking concerns about the learning curve for adding barclamps, complexity of networking configuration and packaging into a single ISO.

We will kick off the community part of this effort with an online review on 7/16 (details).

PS: why a refactoring?

My team at Dell does not take on any refactoring changes lightly because they are disruptive to our community; however, a convergence of requirements has made it necessary to update several core components simultaneously. Specifically, we found that desired changes in networking, operating systems, packaging, configuration management, scale and hardware support all required interlocked changes. We have been bringing many of these changes into the code base in preparation and have reached a point where the next steps require changing Crowbar 1.0 semantics.

We are first and foremost an incremental architecture & lean development team – Crowbar 2.0 will have the smallest footprint needed to begin the transformations that are currently blocking us. There is significant room during and after the refactor for the community to shape Crowbar.

Stop the Presses! Austin OpenStack Meetup 7/12 features docs, bugs & cinder

Don’t miss the 7/12 OpenStack Austin meetup!  We’ve got a great agenda lined up.

This meetup is sponsored by HP (Mark Padovani will give the intro).

Topics will include

  1. 6:30 pre-meeting OpenStack intro & overview for N00bs.
  2. Anne Gentle, OpenStack Technical Writer at Rackspace Hosting, talking about How to contribute to docs & the areas needed. *
  3. Report on the Folsom.3 bug squash day (http://wiki.openstack.org/BugDays/20120712BugSquashing)
  4. (tentative) Greg Althaus, Dell, talking about the “Cinder” Block Storage project
  5. White Board – Next Meeting Topics

* if you contribute to docs then you’ll get an invite to the next design summit!   It’s a great way to support OpenStack even if you don’t write code.

Crowbar Celebrates 1st Anniversary

Nearly a year ago at OSCON 2011, my team at Dell open sourced “Crowbar, an OpenStack installer.” That first Github commit was a much more limited project than Crowbar today: there was no separation into barclamps, no distinct network configuration, one operating system option and the default passwords were all “openstack.” We simply did not know if our effort would create any interest.

The response to Crowbar has been exciting and humbling. I most appreciate those who looked at Crowbar and saw more than a bare metal installer. They are the ones who recognized that we are trying to solve a bigger problem: it has been too difficult to cope with change in IT operations.

During this year, we have made many changes. Many have been driven by customer, user and partner feedback while others support Dell product delivery needs. Happily, these inputs are well aligned in intent if not always in timing.

  • Introduction of barclamps as modular components
  • Expansion into multiple applications (most notably OpenStack and Apache Hadoop)
  • Multi-Operating System
  • Working in the open (with public commits)
  • Collaborative License Agreements

Dell’s understanding of open source and open development has made a similar transformation. Crowbar was originally Apache 2 open sourced because we imagined it becoming part of the OpenStack project. While that ambition has faded, the practical benefits of open collaboration have proven to be substantial.

The results from this first year are compelling:

  • For OpenStack Diablo, coordination with the Rackspace Cloud Builder team enabled Crowbar to include the Keystone and Dashboard projects into Dell’s solution
  • For OpenStack Essex, the community focused work we did for the March Essex Hackday is directly linked to our ability to deliver Dell’s OpenStack-Powered Essex solution over two months earlier than originally planned.
  • For Apache Hadoop, distributions for 3.x and 4.x with implementation of Cloudera Manager and ecosystem components.
  • We’ve amassed hundreds of mail subscribers and Github followers
  • Support for multiple releases of RHEL, Centos & Ubuntu including Ubuntu 12.04 while it was still in beta.
  • SuSE did their own port of Crowbar to SuSE with important advances in Crowbar’s install model (from ISO to package).

We stand on the edge of many exciting transformations for Crowbar’s second year. Based on the amount of change from this year, I’m hesitant to make long term predictions. Yet, just within the next few months there are significant plans based on the Crowbar 2.0 refactor. We have line of sight to changes that expand our tool choices, improve networking, add operating systems and make Crowbar even more production ops capable.

That’s quite a busy year!

What does “enable upstream recipes” mean? Not just fishing for community goodness!

One of the major Crowbar 2.0 design targets is to allow you to “upstream” operations scripts more easily.  “Upstream code” means that parts of Crowbar’s source code could be maintained in other open source repositories.  This is beyond a simple dependency (like Rails, Curl, Java or Apache): upstreaming allows Crowbar to use code managed in other open source repositories for more general application.  This is important because Crowbar users can leverage DevOps logic that is more broadly targeted than just Crowbar.  Even more importantly, upstreaming means that we can contribute and take advantage of community efforts to improve the upstream source.

Specifically, Crowbar maintains a set of OpenStack cookbooks that make up the core of our OpenStack deployment.  These scripts have been widely cloned (not forked) and deCrowbarized for other deployments.  Unfortunately, that means that we do not benefit from downstream improvements and the cloners cannot easily track our updates.  This happened because our deployment scripts required Crowbar, so Crowbar was not considered a valid upstream OpenStack repository.  The consequence of this cloning is that incompatible OpenStack recipes have propagated like cracks in a windshield.

While there are concrete benefits to upstreaming, there are risks too.  We have to evaluate if the upstream code has been adequately tested, operates effectively, implements best practices and leverages Crowbar capabilities.  I believe strongly that untested deployment code is worse than useless; consequently, the Dell Crowbar team provides significant value by validating that our deployments work as an integrated system.  Even more importantly, we will not upstream from unmoderated sources where changes are accepted without regard for downstream impacts.  There is a significant amount of trust required for upstreaming to work.

If upstreaming is so good, why did we not start out with upstream code?  It was simply not an option at the time – Crowbar was the first (and is still the most complete!) set of DevOps deployment scripts for OpenStack in a public repository.

By design, Crowbar 1.0 was tightly coupled to Opscode Chef and required users to inject Crowbar dependencies into their Chef Recipes.  This approach allowed us to more quickly integrate capabilities between recipes and with nascent Crowbar features.  Our top design requirement was that our deployment was tightly integrated between hardware, networking, operating system, operations infrastructure and the application.  Figuring out the correct place to separate concerns was impractical; consequently, we injected dependencies into our Chef code.

We have reached a point with Crowbar development that we can correctly decouple Crowbar and Chef.

The benefits to upstreaming go far beyond enabling more collaboration on OpenStack deployments.  These same changes make it easier for Crowbar to leverage community deployment scripts without one-way modifications.  If you have a working Chef Recipe then making it work with Crowbar will no longer require changes that break it outside of Crowbar; therefore, you can leverage Crowbar capabilities without losing community input and without being locked into Crowbar.
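One way to picture the decoupling described above is an adapter that sits outside the recipe. The sketch below is written in Python purely for illustration (real Chef recipes are Ruby), and every name in it is hypothetical: `upstream_recipe` stands in for a community recipe that reads only plain attributes, and `crowbar_attributes` stands in for a Crowbar-side translation layer. It is not how Crowbar 2.0 is implemented.

```python
from typing import Dict, List


def upstream_recipe(attrs: Dict[str, str]) -> List[str]:
    """Stand-in for a community recipe: it reads plain attributes only."""
    return [
        "install " + attrs["package"],
        "bind " + attrs["service"] + " to " + attrs["bind_address"],
    ]


def crowbar_attributes(node: dict) -> Dict[str, str]:
    """Hypothetical adapter: map Crowbar-style node data to the plain
    attributes the recipe expects, so the recipe itself stays unmodified."""
    return {
        "package": "example-api",
        "service": "example-api",
        "bind_address": node["networks"]["admin"]["address"],
    }


if __name__ == "__main__":
    # Outside Crowbar: the recipe runs from hand-written attributes.
    print(upstream_recipe({
        "package": "example-api",
        "service": "example-api",
        "bind_address": "192.168.124.81",
    }))
    # Inside Crowbar: the same recipe, fed by the adapter.
    node = {"networks": {"admin": {"address": "192.168.124.81"}}}
    print(upstream_recipe(crowbar_attributes(node)))
```

Because the Crowbar-specific knowledge lives in the adapter, the same recipe produces the same steps with or without Crowbar, which is the property that lets it stay in an upstream repository.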

OSCON preso graphic about Upstreaming added 7/23:

A SuPEr New Linux for Crowbar! SuSE shows off port and OpenStack deploy

During last week’s OpenStack Essex Deploy Day, we featured several OpenStack ecosystem presentations including SuSE, Morphlabs, enStratus, Opscode, and Inktank (Ceph).

SuSE’s presentation (video) was deploying OpenStack using a SuSE port of Crowbar (including a reskinned UI)!

This is significant for SuSE and Crowbar:

  1. SuSE, a platinum member of the OpenStack foundation, now has an OpenStack Essex distribution. They are offering this deployment as an on-request beta.
  2. Crowbar is now demonstrably operating on the three top Linux distributions.

SuSE is advancing some key architectural proposals for Crowbar because their implementation downloads Crowbar as a package rather than bundling everything into an ISO.

With the Hadoop 4 & OpenStack Essex releases nearly put to bed, it’s time to bring some of this great innovation into the Crowbar trunk.

OpenStack Deploy Day generates lots of interest, less coding

Last week, my team at Dell led a world-wide OpenStack Essex Deploy event. Kamesh Pemmaraju, our OpenStack-powered solution product manager, did a great summary of the event results (200+ attendees!). What started as a hack-a-thon for deploy scripts morphed into a stunning 14+ hour event with rotating intro content and an ecosystem showcase (videos).  Special kudos to Kamesh, Andi Abes, Judd Maltin, Randy Perryman & Mike Pittaro for leadership at our regional sites.

Clearly, OpenStack is attracting a lot of interest. We’ve been investing time in content to help people who are curious about OpenStack to get started.

While I’m happy to be fueling the OpenStack fervor with an easy on-ramp, our primary objective for the Deploy Day was to collaborate on OpenStack deployments.

On that measure, we have room for improvement. We had some great discussions about how to handle upgrades and market drivers for OpenStack; however, we did not spend the time improving Essex deployments that I was hoping to achieve. I know it’s possible – I’ve talked with developers in the Crowbar community who want this.

If you wanted more expert interaction, here are some of my thoughts for future events.

  • Expert track did not get to deploy coding. I think that we need to simply focus even more tightly on Crowbar deployments. That means having a Crowbar Hack with an OpenStack focus instead of vice versa.
  • Efforts to serve OpenStack n00bs did not protect time for experts. If we offer expert sessions then we won’t try to have parallel intro sessions. We’ll simply have to direct novices to the homework pages and videos.
  • Combining on-site and on-line is too confusing. As much as I enjoy meeting people face-to-face, I think we’d have a more skilled audience if we kept it online only.
  • Connectivity! Dropped connections, sigh.
  • Better planning for videos (not by the presenters) to make sure that we have good results on the expert track.
  • This event was too long. It’s just not practical to serve Europe, US and Asia in a single event. I think that 2-3 hours is a much more practical maximum. 10am-noon Eastern or 6-8pm Pacific would be much more manageable.

Do you have other comments and suggestions? Please let me know!

OSED OMG: OpenStack Essex Deploy Day!! A day-long four-session two-track International Online Conference

Curious about OpenStack? Know it, but want to tune your Ops chops? JOIN US on Thursday 5/31 (or Friday 6/1 if you are in Asia)!

Already know the event logistics? Skip back to my OSED observations post.

Some important general notes:

  1. We are RECORDING everything and will link posts from the event page.
  2. There is HOMEWORK if you want to get ahead by installing OpenStack yourself.
  3. For last minute updates about the event, I recommend that you join the Crowbar Listserver.

Content Logistics work like this.

  1. Everything will be available ONLINE. We are also coordinating many physical sites as rally points.
  2. Introductory: FOUR 3-hour sessions for people who do not have OpenStack or Crowbar experience. These sessions will show how to install OpenStack using Crowbar, discuss DevOps and showcase companies that are in the OpenStack ecosystem. They are planned to have 2 European slots (afternoon & evening), 3 US slots (morning, afternoon & evening), and 1 Asian slot (morning).
  3. Expert: ON-GOING deep technical sessions for engineers who have OpenStack and/or Crowbar experience. There will be one main screen and voice channel in which we are planning to highlight and discuss these topics in blocks throughout the day. We have a long list of topics to discuss and will maintain an ongoing Google Hangout for each topic. Depending on interest, we will jump back and forth to different hangouts.

Intro/Overview Session Logistics work like this

We’re planning FOUR introductory sessions throughout the day (read ahead?). Each session should be approximately 3 hours. The first hour of the sessions will be about OpenStack Essex and installing it using Crowbar. After some Q&A, we’re going to highlight the OpenStack ecosystem. The schedule for the ecosystem is in flux and will likely shift even during the event.

The Session start times for Overview & Ecosystem content

Region            EDT   Session 1   Session 2   Session 3   Session 4
Europe (-5)       -5    3pm         6pm         *           *
Americas Eastern  0     10am        1pm         4pm         *
Americas Central  +1    9am         Noon        3pm         *
Americas Mtn      +2    *           11am        2pm         7pm
Americas West     +3    *           10am        1pm         6pm
Asia (Tokyo)      +10   *           *           *           6/1 10 am

* There are no planned live venues at this time/region. You are always welcome to join online!
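Two of the start times in the table can be spot-checked with a short sketch. It assumes fixed UTC offsets for 31 May 2012 (EDT = UTC-4, PDT = UTC-7, JST = UTC+9) and reads the Europe row as UK time (BST = UTC+1); those offsets and the helper name `local_start` are assumptions for the example, not part of the event plan.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for 31 May 2012 (daylight time in effect); assumed for this sketch.
EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")
BST = timezone(timedelta(hours=1), "BST")
JST = timezone(timedelta(hours=9), "JST")


def local_start(start: datetime, zone: timezone) -> str:
    """Render a session start time as month/day and 24-hour time in another zone."""
    return start.astimezone(zone).strftime("%m/%d %H:%M")


session_1 = datetime(2012, 5, 31, 10, 0, tzinfo=EDT)  # 10am Americas Eastern
session_4 = datetime(2012, 5, 31, 18, 0, tzinfo=PDT)  # 6pm Americas West

if __name__ == "__main__":
    print(local_start(session_1, BST))  # 05/31 15:00, the 3pm Europe slot
    print(local_start(session_4, JST))  # 06/01 10:00, the 6/1 10 am Tokyo slot
```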

Experts Track Logistics

Note: we expect experts to have already installed OpenStack (see homework page). Ideally, an expert has already set up a build environment.

We have a list of topics (Essex, Quantum, Networking, Pull from Source, Documentation, etc) that we plan to cover on a 30-60 minute rotation.

We will cover the OpenStack Essex deploy at the start of each planned session (9am, Noon, 3pm & 8pm EDT). Before we cover the OpenStack deploy, we’ll spend 10 minutes setting (and posting) the agenda for the next three hours based on attendee input.

Even if we are not talking about a topic on the main channel, we will keep a dialog going on topic specific Google hangouts. The links to the hangouts will be posted with the Expert track agenda.

We need an OpenStack Reference Deployment (My objectives for Deploy Day)

I’m overwhelmed and humbled by the enthusiasm my team at Dell is seeing for the OpenStack Essex Deploy day on 5/31 (or 6/1 for Asia). What started as a day for our engineers to hack on Essex Cookbooks with a few fellow Crowbarians has morphed into an international OpenStack event spanning Europe, Americas & Asia.

If you want to read more about the event, check out my event logistics post (link pending).

I do not apologize for my promotion of the Dell-led open source Crowbar as the deployment tool for the OpenStack Essex Deploy. For a community to focus on improving deployment tooling, there must be a stable reference infrastructure. Crowbar provides a fast and repeatable multi-node environment with scriptable networking and packaging.

I believe that OpenStack benefits from a repeatable multi-node reference deployment. I’ll go further and state that this requires DevOps tooling to ensure consistency both within and between deployments.

DevStack makes trunk development more canonical between different developers. I hope that Crowbar will help provide a similar experience for operators so that we can truly share deployment experience and troubleshooting. I think it’s already realistic for Crowbar deployments to be repeatable enough that they provide a reference for defect documentation and reproduction.

Said more plainly, it’s a good thing if a lot of us use OpenStack in the same way so that we can help each other out.

My team’s choice to accelerate releasing the Crowbar barclamps for OpenStack Essex makes perfect sense if you accept our rationale for creating a community baseline deployment.

Crowbar is Dell-led, not Dell specific.

One of the reasons that Crowbar is open source and we do our work in the open (yes, you can see our daily development in github) is to make it safe for everyone to invest in a shared deployment strategy. We encourage and welcome community participation.

PS: I believe the same is true for any large scale software project. Watch out for similar activity around Apache Hadoop as part of our collaboration with Cloudera!

Quick turn OpenStack Essex on Crowbar (BOOM, now we’re at v1.4!)

Don’t blink if you’ve been watching the Crowbar release roadmap!

My team at Dell is about to turn another release of Crowbar. Version 1.3 released 5/14 (focused on Cloudera Apache Hadoop) and our original schedule showed several sprints of work on OpenStack Essex. Upon evaluation, we believe that the current community developed Essex barclamps are ready now.

The healthy state of the OpenStack Essex deployment is a reflection of 1) the quality of Essex and 2) our early community activity in creating deployments based on Essex RC1 and Ubuntu Beta1.

We are planning many improvements to our OpenStack Essex and Crowbar Framework; however, most deployments can proceed without these enhancements.  This also enables participants in the 5/31 OpenStack Essex Deploy Day.

By releasing a core stable Essex reference deployment, we are accelerating field deployments and enabling the OpenStack ecosystem. In terms of previous posts, we are eliminating release interlocks to enable more downstream development. Ultimately, we hope that we are also creating a baseline OpenStack deployment.

We are also reducing the pressure to rush more disruptive Crowbar changes (like enabling high availability, adding multiple operating systems, moving to Rails 3, reducing crowbarisms in cookbooks and streamlining networking). With this foundational Essex release behind us (we call it an MVP), we can work on more depth and breadth of capability in OpenStack.

One small challenge: some of the changes that we’d expected to drop have been postponed slightly. Specifically, these are the markdown based documentation (/docs) and some new UI pages (/network/nodes, /nodes/families). All are already in the product but not wired into the default UI (basically, a split test).

On the bright side, we did manage to expose 10g networking awareness for barclamps; however, we have not yet refactored the barclamps to leverage the change.