How Good beats Great and avoids Process Interlock failure

Note: This is part 2 of a 3 part series about the “process interlock dilemma.”

This post addresses how to solve the Process Interlock dilemma I identified in part 1. It is critical to understand that Process Interlock fails because the interlocks turn assumptions into facts. We must accept that any forward-looking schedule is a guess. If your guesses are accurate, then your schedule should be accurate. That type of insight and $5 will get you a Venti Caramel Frappuccino.

Predicting the future and promising to deliver on that prediction results in one of two poor outcomes.

  1. The better poor outcome is that you are accurate and committed to a schedule.

    To keep on the schedule, you must focus on the committed deliverables. While this sounds ideal, there is an opportunity cost to staying focused. Opportunity cost means that while your team is busy delivering on schedule, it is not doing work to pursue other opportunities. In a perfect world, your team picked the most profitable option before it committed to the schedule. If you don’t live in a perfect world, then it’s likely that while you were working to deliver, you learned about another opportunity. You may make your schedule but miss a more lucrative opportunity.

  2. The worse poor outcome is that you are not accurate and committed to a schedule.

    In that case, you miss both the opportunity you thought you had and the ones that you could not pursue while staying dedicated to your planning assumptions.

Let’s go back to our G.Mordler example and look at some better outcomes:

The “we’re going to try” outcome.

The Trans Ma’am team, Alpha, Omega and the supplier all get together and realize that the current design is not shippable; however, they realize that each team’s roadmap converges within the target time. To reduce interlocks, Omega takes Alpha’s low-power form and begins integration. During integration, Omega identifies that Alpha can produce sufficient power for short periods of time travel but causes the exhaust vent of the power module to melt. Alpha determines that a change to the cooling system will address the problem. In consulting with their supplier, Alpha asks them to stop design on the new supply and adjust the current design as needed. The resulting time drive does not meet GM’s initial design for 4-hour time jumps, but it is sufficient for lead-footed mommies to retroactively avoid speeding tickets. GM decides it can still market the limited design.

The “we’re not ready” outcome.

The Trans Ma’am team, Alpha and Omega all get together and realize that their designs are not shippable in their current state. While they cannot commit to the joint schedule, each realizes that there is a different market for their product: Alpha pursues dog poop power generation for high-rise condo towers (aka brown energy) and Omega finds military applications for time-traveling nuclear submarines. With the experience gained from delivering products to these markets, Alpha improves power delivery by 20% and Omega improves efficiency by 20%. These modest mutual improvements allow Alpha to meet Omega’s requirement. While the combined product is too late for the target date, GM is able to incorporate the design into the next design cycle.

While neither outcome delivers the desired feature on the original schedule, both provide better ROI for the company. One of the most common problems with Process Interlock is that we lose sight of ROI in our desire to meet an impractical objective.

Process interlock is a classic case of point optimization driving down system-wide performance.

If you’re interested in this effect, I recommend reading Eli Goldratt’s The Goal.

In this part, I’ve discussed some ways to escape from Process Interlock. I’ll talk about four alternative approaches in part 3 (to be published 3/16).

Agile Analogy: sprints are waypoints on a road trip

Happy 111111!  I’m working on a BIG AGILE post discussing the “interlock dilemma” that challenges big companies (like my employer Dell) as we become more lean in our development approaches.  That thought exercise turned up an analogy that is worth sharing.

We use sprints like waypoints on a long road trip.  As we travel, we want to stop at regular intervals to:

  • make sure we’re still going in the right direction (check the map)
  • see if we’re going too fast (overheating the engine)
  • see if we need to go faster (storms behind us)
  • avoid traffic (market is congested)
  • linger if there’s something interesting around (customers?!)
  • abandon the whole trip (kids are fighting in the back seat)
  • change our destination (saw a cool billboard)
  • pick-up a hitch hiker (partnering)

It just does not make sense to drive forward blindly hoping everything works out.  We need to inject decision points into our journey so that we take the right path.  And we have to remember, the right path is rarely the exact one we started on!

If your product journey is predictable enough to navigate without frequent checks then your problem is not unique enough to generate much value.

Crowbar’s surprise value proposition: continuous integration (#ci) testing

As part of our Agile/Lean methodologies, our team at Dell is highly invested in automated testing and continuous integration.  We’re running Jenkins to coordinate builds, and EVERY CHECK-IN launches our full integration suite that tests our system end-to-end.  It may not be typical, but I don’t consider that to be particularly noteworthy because it’s best practice.  (Rob’s note: if you write code and don’t think you have the authority to do this, then you need to geek-up and just do it – that’s our MO at Dell)

It’s important to understand that since Crowbar is an installer, every check-in does a FULL CLEAN INSTALL of all the Cactus OpenStack components.  Our verification requires that we test OpenStack because that’s our #1 exit requirement.  Consequently, we have built an automated build system that does a continuous integration test of a full, multi-node Nova/Glance/Swift deployment.
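To make that loop concrete, here is a minimal sketch (in Python) of the kind of gate job Jenkins could run on every check-in. The names below are assumptions for illustration, not actual Crowbar artifacts: deploy_cloud.sh stands in for whatever performs the full clean install, and NOVA_URL stands in for the API endpoint of the deployed cloud.

    # Hypothetical per-check-in CI gate: full clean install, then verify the
    # deployed OpenStack API answers. Script and endpoint names are assumed.
    import subprocess
    import sys
    import time
    import urllib.error
    import urllib.request

    DEPLOY_CMD = ["./deploy_cloud.sh", "--clean"]   # stand-in for the installer run
    NOVA_URL = "http://192.168.124.81:8774/"        # stand-in for the deployed API
    TIME_BUDGET_MIN = 240                           # the "4 hours to cloud" target

    def api_answers(url, minutes):
        """Poll the endpoint until it responds or the time budget expires."""
        deadline = time.time() + minutes * 60
        while time.time() < deadline:
            try:
                urllib.request.urlopen(url, timeout=10)
                return True
            except urllib.error.HTTPError:
                return True        # any HTTP status means the service is up
            except OSError:
                time.sleep(30)     # not reachable yet; keep waiting
        return False

    if __name__ == "__main__":
        clean_install_ok = subprocess.call(DEPLOY_CMD) == 0
        if not clean_install_ok or not api_answers(NOVA_URL, TIME_BUDGET_MIN):
            sys.exit(1)            # non-zero exit fails the Jenkins build
        print("end-to-end deployment verified")

Our real suite exercises the deployed Nova/Glance/Swift services rather than just checking liveness; the point of the sketch is the shape: every check-in pays for a full install, and the build fails on any regression.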

Automated end-to-end integration tests of OpenStack are a very handy thing!

In the last few weeks, we’ve heard from Dell internal groups and partners who are contributing to OpenStack Diablo that they want to leverage our work in continuous integration.  This will allow them to make sure that their development work does not regress other functions.  It’s a significant opportunity to ensure that we can collaborate between organizations.  It also promotes early development and distribution of Diablo installation scripts.

To support this, we are already planning to incorporate more sophisticated revision control (likely based on Git) into Crowbar.

Note: YES, we consider our CI scripts to be part of our open source code.

BlackOps: 7 tenets for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale

Traditional IT Ops

In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list just behind world peace.  Basically, a cloud taxonomy is a layer-cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch.  Unfortunately, our clock-like layer cake has evolved into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.

The key to de-spaghettifying our cloud taxonomy was to realize that clouds have two distinct sides: an external well-known API and internal “black box” operations.  Each side has different objectives that together create an elastic, scalable cloud.

The objective of the API side is to provide the smallest usable “surface area” for cloud consumers.  Surface area describes the scope of the interface that is published to the users.  The smaller the area, the easier it is for users to comprehend and the harder it is for them to break.  Amazon’s EC2 & S3 APIs set the standard for small surface area design and spawned a huge cloud ecosystem.

Hyperscale Cloud (APIs!)

To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem.  The cloud ecosystem exists primarily beyond the API of the cloud.  It provides users with flexible options that address their specific use cases.  Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.

The objective of the internal side of clouds is to create a perfect black box that gives API users the illusion of a perfectly performing, strictly partitioned and totally elastic resource pool.  To the consumer, it should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it does matter a great deal to the cloud operator.

Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps). 

Cloud APIs spawn Ecosystems

The BlackOps challenge is that clouds cannot wait until all of the answers are known, because issues with (and solutions to) scale architecture are difficult to predict in advance.  Even worse, taking the time to solve them in advance likely means that you will miss the market.

Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale.  To cope with constant change, BlackOps teams live by seven tenets that help them manage their infrastructure efficiently in a dynamic environment.

  1. Operational ownership – don’t wait for all the king’s horses and consultants to put you back together again (but asking for help is OK).
  2. Simple APIs – reduce the ways that consumers can stress the system, making the scale challenges more predictable.
  3. Efficiency based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
  4. Automated processes & verification – ensures that changes and fixes can propagate at scale while errors are self-correcting.
  5. Frequent incremental rolling adjustments – prevents the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”)
  6. Passion for operational simplicity – at hyperscale, technical debt compounds very quickly.  Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
  7. Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work.  You must catch problems early before they infect your scale infrastructure.  The only thing more frustrating than fixing a problem at scale is fixing the same problem multiple times.

It’s no surprise that these are exactly the Agile & Lean principles.  The pace of change in cloud is so fast and fluid that BlackOps must use an operational model that embraces iterative and rolling deployment.

Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.

I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies.  BlackOps is a highly disciplined process based on well understood principles from just-in-time (JIT) and lean manufacturing.  Best of all, they are fast to market, able to deliver high quality and capable of responding to change.

Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project.  I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.

OpenStack is ready, but are you? Get some operational cloud mojo and get started!

NOTE: This post is not intended as an endorsement of the company “CloudOps.”

This week, I’ve been working to describe the “cloud operations model” (or “CloudOps”) to Dell internal and external customers.  CloudOps is really just DevOps but packaged more broadly to help explain how hardware, software, and operations interact.  The critical concept I’m trying to convey is that we’re not spending enough time working with customers on operations.

Running a cloud is driven by operational processes and choices.

Back in 2001 when virtualization was a shiny new thing, no one had any idea how to operate a virtualized data center.  My company (now owned by Quest) struggled to win deals outside of our own data center because our customers did not know how to operate virtualized hardware.  Ultimately, VMware created the SAN-based data center consolidation pattern and sales exploded.  That solution is much more about operations than hardware (SANs) or software (ESX).

So here in 2011, we have the same challenge with cloud.  The majority of Dell’s customers do not know how to operate a hyperscale data center because there is no commonly accepted pattern.  That’s where the cloud operations model comes into play – we have cloud-proven hardware and cloud-proven software, but we have been missing a description of the operational cloud mojo.

My team’s first OpenStack project started as a cloud installer (aka Crowbar), but we’ve learned that it is more fundamental than that.  To achieve “4 hours to cloud,” our approach embraced the DevOps philosophy that deployments should be automated, dynamic and repeatable.  Our choice to extend Opscode’s Chef Server allowed us to bring in more than just a software capability: it delivers a core operational foundation that enables customers to manage their data center at significant scale.

We had to deliver a CloudOps Foundation because Cloud is not a static configuration that can be distilled in a 10 page white paper!

Cloud scale requires an Operations Foundation that can respond and react because deployed software and infrastructure are constantly evolving and adapting.  I do not mean moving around assets like VMs.  I am talking about something closer to refactoring code and writing software features.  Like the applications that run on the cloud, we need to recognize that cloud is a moving target and build systems that can handle that.

We’re delivering OpenStack using an operational platform that can respond to the code as it changes and expands.  There is more than enough stable code and proven capability in OpenStack for our customers with CloudOps mojo to start building their operational foundation and to create commercial public clouds.  These first providers are not waiting for a “final release” of OpenStack where it’s suddenly “production ready.”

The beauty of an open source cloud with an active community is that it will be constantly improving.

Some may be hoping that in 5 years we will have established patterns for hyperscale; however, I think those days are past.  Instead, we’ll see tools that accelerate infrastructure agility.  We already have those for public cloud deployments and now it’s time to bring those into the data center itself.  But that is the subject for another post (BlackOps).

Use the 80/80 rule to crush your competition: you have to know WHICH 20% matters. #Lean #Agile

 In software, the 80/20 rule is a harsh reality.  It has two equally distressing parts:

  1. 80% of your feature set is common while 20% is unique. 
  2. 80% of your time goes into creating 20% of the features.

Part 1 should be a good thing – 80% of what you build will help all your customers.  Unfortunately, “unique” means that 20% of what you invest in will only help a fraction of your audience.  No problem, you say?

How do you know WHICH 20% is the unique part and which is the 80% common part?

Not knowing the 80 from the 20 is where Part 2 is particularly unkind.  Since you spend the majority of your investment on features for a narrow audience, you’d better pick your top features wisely.

The cold reality is that it’s not obvious which features are included in the 80% and which are in the 20%.  If you want to build a successful product, you need a way to pick the right features.

At most 50% of the features for a product are obvious in advance.

Let me explain using my last “next big thing” as an example.  I built a mobile sandwich application called sAndroidwich™.  Here are my product manager’s 10 features (in rank order):

  1. Bread (top)
  2. Bread (bottom)
  3. Bacon
  4. Romaine Lettuce
  5. Tomato
  6. Tuna
  7. Smoked Turkey
  8. Hummus
  9. Pepper Jack Cheese
  10. Cheddar Cheese (developers think Cheddar is easy if you already know Jack)

It’s pretty obvious that we’d identified BLT as our core market because everyone loves bacon, but what about the next 5 features?  Our product manager has 25 years of experience consuming sandwiches and swears that he knows this market inside and out.  Will these features put me into the top 3 social food apps?  You bet!  Call up Y Combinator, we’re going to IPO!

My potential feature list should have looked more like this:

  • Features 1-5: Bread, Bread, Bacon, Lettuce, Tomato
  • Features 6-8: Turkey Market – Turkey, Jack, Mustard
  • Features 9-11: Beef Market – Beef, Cheddar, Mayo
  • Features 12-14: Tuna Market – Tuna, Munster, Pickles
  • Features 15-16: Veggie Market – Sprouts, Hummus

That’s 16 features even though I only have time for 10!  In addition to simply listing more features, I’ve also added market segments.  It’s important to remember that the 80/20 rule also applies to features by market, so features for one market may not help (or may even hurt) sales in an adjacent market.

The challenge in picking features is that 50% of them are common to all users and their use is obvious, while another 30% are common to all users but you can’t distinguish them from the unique features.  I consider these the “nonobvious common” features.  You should take the time to list 160% of your potential features if you hope to find the real 80%.
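Here is a toy sketch (in Python) of that counting argument using the sAndroidwich lists above. The groupings come straight from the feature list; the identifier names are just for the example. The point is that the candidate list alone already overflows the 10 available slots, and a naive intersection of the segment lists cannot reveal the nonobvious common features.

    # Toy illustration of the 80/80 counting argument using the lists above.
    core = ["bread_top", "bread_bottom", "bacon", "lettuce", "tomato"]

    segments = {
        "turkey": {"turkey", "jack", "mustard"},
        "beef":   {"beef", "cheddar", "mayo"},
        "tuna":   {"tuna", "munster", "pickles"},
        "veggie": {"sprouts", "hummus"},
    }

    candidates = len(core) + sum(len(s) for s in segments.values())
    print(f"candidate features: {candidates} for 10 slots")   # 16 for 10 slots

    # Intersecting the segment lists finds nothing in common -- which is the
    # trap: the "nonobvious common" 30% is invisible on paper and only shows
    # up once real customers in each segment start using the product.
    print("nonobvious common (on paper):", set.intersection(*segments.values()))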

To figure out the 30% nonobvious common features, you must accept that your own experience and bias clouds your judgment.

If you assume that you can predict which features are in the 80% and which are in the 20%, then you will be wrong about 50% of your feature set!  If you accept that the second 50% of your features can only be discovered by customer interactions, then you’re open to discovering the hidden 30% of common features.

Discovering this hidden 30% is critical to success because they are your market differentiation!

If you can find the hidden 30% then your competitor is probably handing you the golden goose.  In most cases, they are waiting while their engineering team is building the wrong features or focusing their 80% effort on the less critical 20% features.  This behavior ultimately causes feature fan out – which will have to wait for a future post.

BTW: sAndroidwich™ never made it into the top 10 apps – my team’s bias toward tuna and hummus (omega-3s AND delicious) meant that we missed the super-hot Beef and Jack market.  If only we’d shipped the BLT features (using Lean), then market tested and added incrementally, we might have been able to adjust before iSubpad and Po’Berry got all the users.

Substituting Action for Knowledge – adopting “ready, fire, aim” as a strategy (and when to run like hell)

Today my mother-in-law (a practicing psychiatrist) was bemoaning the current medical practice of substituting action for knowledge. In her world, many doctors make rapid changes to their patients’ therapy. Their goal is to address the issues immediately presented (the patient feels sad, so the doctor prescribes antidepressants) rather than taking time to understand the patient’s history or to make changes incrementally and measure their impact. It feels like another example of our cultural compulsion to fix problems as quickly as possible.

Her comments made me question the core way that I evangelize!

Do Lean and Agile substitute action for knowledge? No. We use action to acquire knowledge.

The fundamental assumption that drives poor decision-making is that we have enough information to make a design, solve a problem or define a market. Lean and Agile’s core tenet is that we must attack this assumption. We must assume that we cannot gather enough information to fully define our objective. The good news is that even without much analysis we know a lot! We know:

  • roughly what we want to do (road map)
  • the first steps we should take (tactics)
  • who will be working on the problem (team members)
  • generally how much effort it will take (time & team size)
  • who has the problem that we are trying to solve (market)

We also know that we’ll learn a lot more as we get closer to our target. Every delay in starting effectively pushes our “day of clarity” further into the future. For that reason, it is essential that we build a process that constantly reviews and adjusts its targets.

We need to build a process that makes rapid progress and acquires knowledge as that progress is made.

In Agile, we translate this need into the decorations of our process: reviews for learning, retrospectives for adjustments, planning for taking action and short iterations to drive the feedback loop.  Agile’s mantra is “ready, fire, aim, fire, aim, fire, aim, …” which is very different from simply jumping out of a plane without a parachute and hoping you’ll find a haystack to land in.

For cloud deployments, this means building operational knowledge in stages.  Technology is simply evolving too quickly, and best practices too slowly, for anyone to wait for a packaged solution to solve all their cloud infrastructure problems.  We tried this and it does not work: clouds are a mixture of hardware, software and operations.  More accurately, clouds are an operational model supported by hardware and software.

Currently, 80% of cloud deployment effort is operations (or “DevOps”).

When I listen to people’s plans about building product or deploying cloud, I get very skeptical when they take a lot of time to aim at objects far off on the horizon.  Perhaps they are worried that they will substitute action for knowledge; however, I think they would be better served to test their knowledge with a little action.

My MIL agrees – she sees her patients frequently and makes small adjustments to their treatment as needed.  Wow, that’s an Rx for Agile!

The Go-Fasterer OpenStack Cloud Strategy

Dell’s OpenStack strategy (besides being interesting by itself) brings together Agile and Lean approaches and serves as a good illustration of the difference between the two approaches.

Before I can start the illustration, I need to explain the strategy clearly enough that the discussion makes sense.  Of course, my group is selling these systems, so the strategy starts as a sales pitch.  Bear with me, this is a long post and I promise we’ll get to the process parts as fast as possible.

Dell’s OpenStack strategy is to enter the market with the smallest possible working cloud infrastructure practical.  We have focused maniacally on eliminating all barriers and delays for customers’ evaluation processes.  Our targets are early adopters who want to invest in a real, hands-on OpenStack evaluation and understand they will have to work to figure out OpenStack.   White gloves, silver spoons and expensive licensed applications are not included in this offering.

We are delivering a cloud foundation kit: a 7U hardware setup (6 nodes + switch), a white paper, an installer, and a dollop of consulting services.  It is a very small footprint system with very little integration.  The most notable deliverable is our target of going from boxes to working cloud in less than 4 hours (I was calling this “nuts to soup before lunch” but marketing didn’t bite).

Enough background?  Let’s talk about business process!

From this point on, our product offering is just an example.  You should imagine your product or service in these descriptions.  You should think about the internal reconfiguration needed to bring your product or service to market in the way I am describing.

There are two critical elements in the go-fasterer strategy:

  1. a very limited “lean” product and
  2. a very fast “agile” installation process.

The offering challenges the de facto definition of solutions as complete packages bursting with features, prescriptive processes, licensed companion products and armies of consultants.  While Dell will eventually have a solution that meets (or exceeds) these criteria, our team did not think we should wait until we had all those components before we began engaging customers.

Our first offering is not for everyone, by design.  It is highly targeted at early adopters who have specific needs (a desire to move quickly) that outweigh all other feature requirements.  They are willing to invest in a less complete product because the core alone solves an important problem.

The concept of stripping back your product to the very core is the essence of Lean process.  Along this line of thinking, maintaining ship readiness is the primary mantra – if you can’t sell your product then your entire company’s existence is at risk.  I like the way the Poppendiecks describe it: you should consider product features as perishable inventory.  If we were selling fruit salad and you had bananas and apples but no cherries, then it makes sense to sell the apple/banana medley while you work on the cherries.

Whittling back a product to the truly smallest possible feature set is very threatening and difficult.  It forces teams to take risks and make guesses that leave you with a product that many customers will reject.  Let me repeat that: your objective is to create a product that many customers will reject.  You must do this because it:

  1. gets into the market much faster for some customers (earning $ is wonderfully clarifying)
  2. teaches you immediately what’s missing (fewer future guesses)
  3. teaches you immediately what’s important to customers (less risk)
  4. builds credibility that you are delivering something (you’re building relationships)

Ironically, while lean approaches exist to reduce risk and guesswork, they will feel very risky, like gambling, to organizations used to traditional processes.  This is not surprising: because our objective is to go faster, we will initially doubt that we have enough information to make decisions.

The best cure for lack of information is not more analysis!  The cure is interacting with customers.

Lean says that you need a product if you want to interact meaningfully with customers.  This is because customers (even those who are not buying right away) will take you more seriously if you’ve got a product.  Talking about products that you are going to release is like talking about the person you wanted to take to prom but never asked.

To deliver product early, you need to find the true minimum product set.  This is not the smallest comfortable set.  It is the set that is so small, so uncomfortable, so stripped down that it seems to barely do anything at all.

In our case, we considered it sufficient if the current OpenStack release could be reliably and quickly installed on Dell hardware.  We believe there are early adopter customers who want to evaluate OpenStack right away and whose primary concern is starting their pilot and moving toward eventual deployment.

Mixing Agile into Lean is needed to make the “skinny down” discipline practical and repeatable.

Agile brings in a few critical disciplines to enable Lean:

  1. Prioritized roadmaps keep teams focused on what’s needed first without losing sight of longer-term plans.
  2. Predictable pace of delivery allows committed interactions with customers that give timelines for fixing issues or adding capabilities.
  3. Working out of order keeps the great from being the enemy of the good so that we do not delay field testing while we solve imagined problems.
  4. Focus on quality / automation / repeatability reduces the cost of paying down technical debt internally and the time spent firefighting careless defects when a product is “in the wild” with customers.
  5. Insistence on installable “ship ready” product ensures that product gets into the field whenever the right customer is found.  Note: this does not mean any customer.  Selling to the wrong customer can be deadly too, but that’s a different topic.
  6. Feedback driven iterations ensures that Lean engagements with customers are interactive and inform development.

These disciplines are important for any organization but vital when you go Lean.  To take your product early and aggressively to market, you must have confidence that you can continue to deliver after your customers get a taste of the product.

You cannot succeed with Lean if you cannot quickly evolve your initial offering.

The enabling compromise with Lean is that you will keep the train running with incremental improvements: Lean fails if you engage customers early and then disappear back into a long delivery cycle.  That means committing to an Agile product delivery cycle if you want Lean (note: the reverse is not true).

I think of Lean and Agile as two sides of the same results driven coin: Lean faces towards the customer and market while Agile faces internally to engineering.

Please let me know how your team is trying to accelerate product delivery.

Note: of course, you’re also welcome to contact me if you’re interested in being an early adopter for our OpenStack foundation kit.