Stop the Presses! Austin OpenStack Meetup 7/12 features docs, bugs & cinder

Don’t miss the 7/12 OpenStack Austin meetup!  We’ve got a great agenda lined up.

This meetup is sponsored by HP (Mark Padovani will give the intro).

Topics will include:

  1. 6:30 pre-meeting OpenStack intro & overview for N00bs.
  2. Anne Gentle, OpenStack Technical Writer at Rackspace Hosting, talking about How to contribute to docs & the areas needed. *
  3. Report on the Folsom.3 bug squash day (http://wiki.openstack.org/BugDays/20120712BugSquashing)
  4. (tentative) Greg Althaus, Dell, talking about the “Cinder” Block Storage project
  5. White Board – Next Meeting Topics

* If you contribute to docs then you’ll get an invite to the next design summit! It’s a great way to support OpenStack even if you don’t write code.

Four alternatives to Process Interlock

Note: This is the third and final part of a 3 part series about the “process interlock dilemma.”

In post 1, I spelled out how evil Process Interlock causes well-intentioned managers to add schedule risk and opportunity cost even as they appear to be doing the right thing. In post 2, I offered some alternative outcomes when Process Interlock is avoided. In this post, I attempt to provide alternatives to the allure of Process Interlock. We need substitute interlock types to replace our de facto standard because there are strong behavioral and traditional reasons to keep broken processes. In other words, Process Interlock feels good because it gives you the illusion that your solution is needed and vital to other projects.

If your product is vital to another team then they should be able to leverage what you have, not what you’re planning to have.

We should focus on delivered code instead of future promises. I am not saying that roadmaps and projections are bad – I think they are essential. I am saying that roadmaps should be viewed as potential not as promises.

  1. No future commits (No interlock)

    The simplest way to operate without any process interlock is to never depend on other groups for future deliveries. This approach is best for projects that need to move quickly and have no tolerance for schedule risk. This means that your project is constrained to use the “as delivered” work product from all external groups. Depending on your needs, you may refine this further to rely only on stable, released work.

    For example, OpenStack Cactus relied on features that were available in the interim 10.10 Ubuntu version. This allowed the project to advance faster, but also limited support because this OS version was not a long term support (LTS) release.

  2. Smaller delivery steps (MVP interlock)

    Sometimes a new project really needs emerging capabilities from another project. In those cases, the best strategy is to identify a minimum viable feature set (or “product”) that needs to be delivered from the other project. The MVP needs to be a true minimum feature set – one that’s just enough to prove that the integration will work. Once the MVP has been proven, a much clearer understanding of the requirements will help determine the required amount of interlock. My objective with an MVP interlock is to find the true requirements because, IMHO, many integrations are significantly over-specified.

    For example, the OpenStack Quantum project (really, any incubated OpenStack project) focuses on delivering the core functionality first so that the ecosystem and other projects can start using it as soon as possible.

  3. Collaborative development (Shared interlock)

    A collaborative interlock is very productive when the need for integration is truly deep and complex. In this scenario, the teams share membership or code bases so that the needs of each team are represented in real time. This type of transparency exposes real requirements and schedule risk very quickly. It also allows dependent teams to contribute resources that accelerate delivery.

    For example, our Crowbar OpenStack team used this type of interlock with the Rackspace OpenStack team to ensure that we could get Diablo code delivered and deployed as fast as possible.

  4. Collaborative requirements (Fractal interlock)

    If you can’t collaborate or negotiate an MVP then you’re forced into working at the requirements level instead of development collaboration. You can think of this as a sprint-roadmap fast follow strategy because the interlocked teams are mutually evolving design requirements.

    I call this approach Fractal because you start at big concepts (road maps) and drill down to more and more detail (sprints) as the monitored project progresses. In this model, you interlock on a general capability initially and then work to refine the delivery as you learn more. The goal is to avoid starting delays or injecting false requirements that slow delivery.

    For example, if you had a product that required power from hamsters running in wheels then you’d start by saying that you needed a small, fast-running animal. Over the next few sprints, you’d likely refine that down to four-legged mammals and then to short-tailed, high-energy rodents. Issues like nocturnal behavior or biting operators could be addressed by the Hamster team or by the Wheel team as they arose. It could turn out that the right target (a Red Bull sipping gecko) surfaces during the short-tailed rodent design review. My point is that you can avoid interlocks by allowing scope to evolve.

Breaking Process Interlocks delivers significant ROI

I have been trying to untangle both the cause and solution of process interlock for a long time. My team at Dell has an interlock-averse culture and it accelerates our work delivery. I write about this topic because I have real world experience that eliminating process interlocks increases:

  1. team velocity
  2. collaboration
  3. quality
  4. return on investment

These are significant values that justify adoption of these non-interlock approaches; however, I have a more selfish motivation.

We want to work with other teams that are interlock-averse because the impacts multiply. Our team is slowed when others attempt to process interlock and accelerated when we are approached in the ways I list above.

I suspect that this topic deserves a book rather than a three part blog series and, perhaps, I will ultimately create one. Until then, I welcome your comments, suggestions and war stories.

The Process Interlock Dilemma – where Roadmaps get lost and why Waterfalls suck

Note: This is part 1 of a 3 part series. I have been working on this series for nearly six months in an attempt to make this subtle but extremely expensive problem understandable. Rather than continue to polish the posts, I will post the series for your enjoyment. I hope that it is enlightening, humorous or (ideally) both. Comments are welcome!

I’ve been struggling to explain a subtle process fail that occurs every day at my company (Dell) and also at every company I’ve ever worked with or for. I call this demon “Process Interlock” and it is the invisible bane of projects big and small. It manifests by forcing well-meaning product managers and engineering directors to make trade-offs that they know are wrong because of schedule commitments. It means that product quality consistently drops to the bottom of the list in favor of getting in that one promised feature. It shows up when customers get products late because a prospect who decided not to buy demanded a feature a year ago. These are the symptoms of the process interlock dilemma.

Process Interlock occurs when another team depends on your team for a future feature.

That sounds pretty innocuous right? It makes sense that other teams, customers and partners should be able to ask you about your roadmap and then build your delivery schedule into their plans. That is the perfectly logical request that happens inside my group every single day. Unfortunately, that exact commitment is what creates the problem because it locks your team’s velocity into the future and eliminates agility.

Note: I was reading chapter 11 in Eric Ries’ Lean Startup and was surprised to find him making very similar arguments, but from a different perspective.

To hopefully help explain, I’m inventing a hypothetical project from the car division of the G.Mordler company. GM plans to add time travel as an option for their 2016 product line. They believe that there is a big market in minivans that can solve the proverbial “are we there yet” problem by simply skipping over the boring part of the trip. The trans-dimensional mommy mobile (or Trans Ma’am) will be part of a refresh of their 2014 model. The addition of a time circuit and power generator, developed by two internal divisions, Alpha and Omega, supports a critical marketing event for the company, so timing is important.

Let’s examine four outcomes of how these two divisions turn their assumed schedules into a rigidly locked conundrum.

Scenario 0: Ideal Case.

Alpha makes the fusion power supply and Omega is making the time circuits. Based on experimental data, Omega’s design calls for 3.14 Gigawatts to operate their time capacitor; however, Alpha’s available design is limited to 0.73 Gigawatts. Alpha expects to reach 3.5 Gigawatts in 9 months when their supplier releases an updated nitrogen cooled super conductor. Based on that commitment, Omega has enough information to make an informed decision about their timeline. Since Alpha commits to deliver in 12 months (9 for the new part + 3 for development), Omega expects to deliver a working time circuit in 20 months (12 for the supply + 8 for development). In this example, there are 3 levels of Process Interlock: Alpha interlocks with the supplier and then Omega interlocks with Alpha. From a PERT schedule perspective, the world is now under control! It’s a brand new day and the birds are singing…

Scenario 1: Meet Schedule w/ Added Cost

Unfortunately, we now have a highly interlocked schedule. In the best case scenario (the one where we meet the schedule), Alpha has just signed up to meet an aggressive delivery timeframe. They have to put heavy pressure on the supplier to deliver their part, which causes the supplier to increase the price for the cooler component. When their product manager identifies available alternative markets (such as power generating pet waste incineration), they are not able to pursue the opportunities because they cannot risk the schedule impact of redirecting engineers. Meanwhile, Omega understands that a critical part is missing for 12 months and decides to reduce staffing while waiting for the needed part. In the process, they lose a key engineer who could have optimized the manufacturing process to halve the production defect rate. Overall, the project meets schedule but at added cost, reduced quality and missed opportunities. This happened because the interlocks eliminated flexibility in the schedule for upstream and downstream participants. GM meets the launch window for the Trans Ma’am but high costs for the upgrade limit sales.

Scenario 2: Meet Schedule w/ Lost Features

A more likely “on schedule” alternative is that Alpha’s supplier cuts some corners to meet the aggressive deadline; consequently, power generation for Alpha is not reliable. This issue is not revealed by load testing in Alpha’s labs or by short time travel testing at Omega. Instead, the faulty generators fail in integration field testing, accidentally sending a DOT test driver home during rush hour traffic. Fixing the problem requires a redesign of the power plant. The new design does not fit into the space allowed by the Trans Ma’am design team, causing the entire program, while delivered “on time,” to be considered a failure and not shipped. GM misses the launch window for the Trans Ma’am.

Scenario 3: Miss Schedule

In the most likely scenario, the project is late. The schedule for Alpha slips because the supplier requires an extra three months to meet Alpha’s specs. In a common turn of fate, the supplier’s standard specs would have been sufficient for Alpha to proceed; however, Alpha’s risk manager bumped up the cooling requirements by 20% in order to ensure they had wiggle room in their own design. Because the supplier contract requires delivery per spec, the supplier could not ship a workable but contractually unacceptable product. Since the part is delayed, Alpha has to slip the schedule to Omega. Compounding the problem, Alpha’s manager is optimistic that it will work out and does not alert Omega until 2 weeks before the deadline. Omega, who has been testing their circuits using liquid sodium cooled nuclear fission power plants, attempts to make up the schedule delay by imposing 20 hour Mountain Dew fueled work days. The aggressive schedule results in quality issues for the time circuits so that they can only be used during Mountain-time rebroadcasts of Seinfeld. After an unsuccessful bid to purchase the Denver cable TV station KDEV, GM misses the launch window for the Trans Ma’am.

I realize these examples are complicated, but I hope they humorously illuminate the problem.

In part 2, I’ll show an alternate approach for GM that addresses the process interlock.

Post Script

Of course, for this example, the entire project plan is a moot point since we’re talking about time machines! I’m offering two likely endings for the scenarios above:

The Pragmatists’ Ending: Once the project is finally complete, the manager simply drives the car back to the beginning of the project. Over white Russian martinis and sushi, her future self explains how the painful delivery schedule cost her the best years of her life, causing her to quit. Her replacement cannot maintain funding for the project, so it is eventually scrapped by G.Mordler six months before the working pieces can be assembled.

The Realists’ Ending: Once the project is finally complete, the manager simply drives the car back to the beginning of the project. Over lemonade vodka tonics and tapas, her future self provides a USB stick with the critical design data needed to complete the project on time and budget. When she examines the data, the resulting time paradox creates a rift in the Einstein-Jacob space-time fabric thus ending the universe.

Crowbar community support and 111111 sprint plan

The Dell Crowbar team is working to improve road map transparency. In the last few weeks, the Crowbar community has become more active on our lists, testing builds, and helping with documentation.

We love the engagement and continue to make supporting the list a priority.

Participation in Crowbar, OpenStack and Hadoop has been exceeding our expectations and we’re working to implement more community support and process. Thank you!!!

Our next steps:

  1. I’ve committed to post sprint plans and summary pages (this is the first)
  2. New Crowbar Twitter account
  3. I’m going to set up feature voting on the Crowbar Facebook page (like to vote)
  4. Continue to work the listserv and videos. We need help converting those to documentation on the crowbar wiki.
  5. Formalize collaborator agreements – we’re working with legal on this
  6. Exploring the option of a barclamp certification program and Crowbar support
  7. Moving to a gated trunk model for internal commits to improve quality
  8. Implementing a continuous integration system that includes core and barclamps. This will be part of our open source components.

We are working towards the 1.2 release (Beta 1). That release is focused on supporting OpenStack but includes enhancements for upgrades, Hadoop, and additional OS support.

Our Sprint 111111 plan.

Source: Crowbar Wiki: [[sprint 111111]]

  • Theme: OpenStack Diablo Final release candidate.
  • Core Work: Refine deployment for Nova, Glance, Nova Dashboard (Horizon), Keystone, Swift
  • New additions: MySQL barclamp, Nova HA networking, kong
  • Crowbar internals: expose error states for proposals, allow packages to be included with barclamps to make upgrades easier, barclamp group pages
  • Operating system: added CentOS
  • Documentation: we’ve split the user guides into distinct books so Crowbar, OpenStack, and Hadoop each have their own user guide.
  • Pending action: expose the Hadoop barclamps
  • OS note: OpenStack is being tested (at Dell) against Ubuntu 10.10 only. Hadoop was tested against RHEL 5.7 and we expect it to work against CentOS also.

Crowbar design: solving the multi-master update issue and adding a pause before configuration

The last few weeks for my team at Dell have been all about testing as Crowbar goes through our QA cycle and enters field testing. These activities are the run up to Dell open sourcing the bits.

The Crowbar testing cycle drove two significant architectural changes that are interesting as general challenges and important in the details for Crowbar adopters.

Challenge #1: Configuration Sequence.

Crowbar has control of every step of deployment from discovery, BIOS/RAID configuration, base image, core services and applications. That’s a great value prop but there’s a chicken and egg problem: how do you set the RAID for a system when you have not decided which applications you are going to install on it?

The urgency of solving this problem became obvious during our first full integration tests. Nova and Swift need very different hardware configurations. In our first Crowbar flows, we configured the hardware before the purpose of the node had been selected. This was an effect of “rushing” into a Chef client ready state.

We also needed a concept of collecting enough nodes to deploy a solution.  Building an OpenStack cloud requires that you have enough capacity to build the components of the system in the correct sequence.

Our solution was to inject a “pause” state just after node discovery.  In the current Crowbar state machine, nodes pause after discovery.  This allows you to assign them into the roles that you want them to play in your system.

In testing, we’ve found that the pause state helps manage the system deployment; however, it also added a new user action requirement. 

Challenge #2: Multi-Master Updates

In Chef, the owner of a node’s data in the centralized database is the node, not the server.  This is a logical (but not a typical) design pattern and has interesting side effects.  Specifically, updates from Chef Client runs on the nodes are considered authoritative and will over-write changes made on the server. 

This is correct behavior because Chef’s primary focus is updating the node (edge) and not the central system (core).  If the authority was reversed then we would miss critical changes that Chef effected on the nodes.   From this perspective, the server is a collection point for data that is owned/maintained at the nodes.

Unfortunately, Crowbar’s original design was to inject configuration into the Chef server’s node objects.  We found that Crowbar’s changes could be silently lost since the server is not the owner of the data.  This is not a locking issue – it is a data ownership issue.  Crowbar was not talking to the master of the data when it made updates!

To correct this problem, we (really Greg Althaus in a coding blitz) changed Crowbar to store data in a special role mapped to each node. This works because roles are mastered on the server. Crowbar can make reliable updates to the node’s dedicated role without worrying that the remote data will override changes.

This pattern is a better separation of concerns because Crowbar and barclamp configuration is stored in a clearly delineated location (a role named crowbar-[node]) and is not mixed with edge configuration data.

It turns out that these two design changes are tightly coupled. Simultaneous edge/server writes became very common after we added the pause state. They are infrequent for single node changes; however, the frequency increases when you are changing a system of interconnected nodes through multiple states.

More simply put: Crowbar is busy changing the node configs at exactly the same time the nodes are busy changing their own configuration.

Whew!  I hope that helped clarify some interesting design considerations behind Crowbar design.

Note: I want to repeat that Crowbar is not tied to Dell hardware! We have modules that are specifically for our BIOS/RAID, but Crowbar will happily do all the other great deployment work if those barclamps are missing.

Introducing BravoDelta: Erlang BDD based on Cucumber

I highly recommend Armstrong's Programming Erlang

I ❤ Erlang.  I learned about Erlang while simultaneously playing w/ CouchDB (written in Erlang) and reading Siebel’s excellent Coders At Work interview of Erlang creator Joe Armstrong.  Erlang takes me back to my lisp and prolog days – it’s interesting, powerful and elegant.  Even better, it’s performant, time tested and proven.

To whet my Erlang skills, I decided to port one of the most essential development tools I’ve used: Cucumber BDD. I think that using BDD is one of the most critical success criteria for a project that wants to move quickly and respond to customers. If you’d like to see Cucumber in action, check out my WhatTheBus project. A Cucumber test is written in “simple English” and looks like this:

Scenario: Web Page1
    When I go to the home page.
    Then I should see "Districts".

To run Bravo Delta, you’ll need Erlang installed on your system. You may also want to set up the WhatTheBus project because the initial drop uses that RoR web site as its target. I’ve uploaded the code to the GitHub project BravoDelta (code contributions welcome).

NOTE: This is a functional core – it is not expected to be a complete Cuke replacement at this point!

The code base consists of the following files:

  • bdd.erl (main code file; start it with bdd:test("scenario").)
  • bdd.config (config file)
  • bdd_webrat.erl (standard steps that are used by many web page tests)
  • bravodelta.erl (custom steps for this feature; the file name must match the feature file name)
  • bravodelta.feature (BDD test file)
  • bdd_utils.erl (utilities called by bdd & webrat)
  • bdd_selftest.erl (unit tests for utils – interesting pattern for selftest in this file!)
  • bdd_selftest.data (data for unit tests)

Erlang makes parsing the feature file very easy.  Unlike Cucumber, there is no RegEx craziness because Erlang has groovy pattern matching.  Basically, each step decomposes into a single line starting with Given, When, or Then.  The code is designed so that developers can easily add custom steps and there are pre-built steps for common web tasks in the “webrat” step file.  A step processor looks like this in Erlang:

% When step: match the sentence and fetch the page from the configured site
step(_Config, _Given, {step_when, _N, ["I go to the home page"]}) ->
    bdd_utils:http_get(_Config, []);
% Then step: search the result of the preceding step for the quoted text
step(_Config, _Result, {step_then, _N, ["I should see", Text]}) ->
    bdd_utils:html_search(Text, _Result).
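
Erlang’s prefix pattern matching is what makes the parsing side painless too. As a purely illustrative sketch (the real parser lives in bdd.erl and also handles quoted arguments like "Districts"), a trimmed feature-file line could decompose into one of those step tuples like this:

% Illustrative sketch only: not the actual bdd.erl parser.
% Match the keyword as a string prefix, strip the trailing period,
% and tag the step with an atom for its type.
parse_step(N, "Given " ++ Rest) -> {step_given, N, [string:strip(Rest, right, $.)]};
parse_step(N, "When " ++ Rest)  -> {step_when, N, [string:strip(Rest, right, $.)]};
parse_step(N, "Then " ++ Rest)  -> {step_then, N, [string:strip(Rest, right, $.)]}.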

The steps are called by a recursive Erlang routine in bdd.erl for each scenario in the feature file. Explaining that code in full will have to wait for a future post, but the sketch below gives the general shape.
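
Here is a minimal sketch, assuming each step’s result is threaded into the next step call (illustrative only, not BravoDelta’s actual internals; the canned step bodies stand in for the real bdd_utils calls):

-module(bdd_sketch).
-export([run/2]).

% Illustrative sketch: recursively walk one scenario's steps,
% threading each step's result into the next call so a Then step
% can inspect the output of the preceding When step.
run(Config, Steps) -> run(Config, no_result, Steps).

run(_Config, Result, []) ->
    Result;                                 % the final Then result is pass/fail
run(Config, Result, [Step | Rest]) ->
    run(Config, step(Config, Result, Step), Rest).

% Canned stand-ins so the sketch is self-contained without bdd_utils.
step(_Config, _Result, {step_when, _N, ["I go to the home page"]}) ->
    "<html><body>Districts</body></html>";
step(_Config, Page, {step_then, _N, ["I should see", Text]}) ->
    string:str(Page, Text) > 0.             % true when Text is found in the page

Calling bdd_sketch:run([], [{step_when, 1, ["I go to the home page"]}, {step_then, 2, ["I should see", "Districts"]}]) returns true, mirroring the scenario above.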

The objective for Bravo Delta is to demonstrate simple Erlang concepts. I wanted to make sure that the framework was easy to extend and could grow over time. My experience with Erlang is that my code generally gets smaller and more powerful as I iterate. For example, moving from three types of steps (given_, when_, then_) to a single step type with atoms for the type resulted in a 20% code reduction, as sketched below.
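
To illustrate that refactor, here is a hypothetical before/after (invented for this post, not the project’s actual history):

% Before (hypothetical): one function per step type means separate
% dispatch paths to maintain for given_, when_ and then_ steps.
when_step(Config, _Given, ["I go to the home page"]) ->
    bdd_utils:http_get(Config, []).

% After: a single step/3 function; the type is just an atom in the
% tuple, so one pattern match dispatches every kind of step.
step(Config, _Given, {step_when, _N, ["I go to the home page"]}) ->
    bdd_utils:http_get(Config, []).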

I hope to use it for future BDD projects and grow its capability because it is fast and simple to extend – I found Cucumber to be very slow.  It should also be possible to execute features in parallel with minimal changes to this code base.  That makes Bravo Delta very well suited to large projects with lots of tests and automated build systems.

If Apple is Disney then is the iPad Miley Cyrus?

Or is Apple’s walled garden more like Disney World?

With the iPad frenzy, I’ve been hearing a lot about Apple’s success with its walled garden approach.  I objected to their proprietary closed stance on principle for a long time.  When I finally caved in, I came to understand something fundamentally true about consumers: predictability matters to the mainstream.

This is really no surprise.   Walt Disney figured this out with his amusement parks a long time ago.

Disney World is the ultimate walled garden. They relentlessly control every mote of our experience in their parks and my family loves it. We happily, willingly pay a premium for the experience because we know that going to Disney World will be smooth and our fun is assured.

However, we less willingly pay a second price for our Disney experience: it’s homogenous and bland. It lacks the spontaneity and vibe of the Austin City Limits music festivals. At festivals, the content is raw and fresh and things can go wonderfully wrong. You may be delighted by Vampire Weekend when you’d planned to see Bob Dylan.

And so, Apple provides the quality control and censorship to Disney-ify our smart phones and tablets. They’ve created a safe place to show off their impressive innovations. They’ve created a limited market where they can control the spotlights. In this way, Apple reminds me of how Disney manipulates its media outlets to create multi-talent superstars like Miley Cyrus. They craft personas for their actors and ensure that they can sing, dance, and act. This maximizes the appeal for Disney’s platform but blocks out other talented singers, dancers, and actors.

Back when Britney Spears was a Disney property, there was room left for other (better, truer) singers like Avril Lavigne. Today, the sanitized Miley Cyrus talent trifecta effectively blots out the sun.

So far, the iPhone has been a platform for innovation. Please ignore the fact that developers had to buy Apple computers to write applications for it. Please ignore the fact that developers must pass through Apple’s QA and censors. Please ignore the fact that you must purchase an Apple device. Please ignore the fact that you can only purchase applications through the iTunes store. Apple is a platform trifecta with hardware, software, and distribution. This is the price you pay: to ride Space Mountain, you must enter Apple’s iPark.

I’m hearing about some interesting new products emerging that will challenge Apple’s technology; however, I’m not sure if consumers are ready to leave the park and go to the festival.  I hope they are.

Disclaimer: I am a Dell employee. We have products (based on Android) that compete with Apple’s smart phones and tablets.