DevOps vs Cloud Native: Damn, where did all this platform complexity come from?

Complexity has always been part of IT, and it’s increasing as we embrace microservices and highly abstracted platforms. Making everyone cope with this challenge is unsustainable.

We’re just more aware of infrastructure complexity now that DevOps is exposing this cluster configuration to developers and automation tooling. We are also building platforms from more loosely connected open components. The benefit of customization and rapid development has the unfortunate side-effect of adding integration points. Even worse, those integrations generally require operations in a specific sequence.

The result is a developer rebellion against DevOps on low-level (IaaS) platforms and a move towards ones with higher-level abstractions (PaaS) like Kubernetes.
This rebellion is taking the form of “cloud native” being in opposition to “devops” processes. I discussed exactly that point with John Furrier on theCUBE at Kubecon and again in my Messy Underlay presentation at Defrag Conf.

It is very clear that the DevOps mission to share ownership of messy production operations requirements is not widely welcomed. Unfortunately, there is no magic cure for production complexity because systems are inherently complex.

There is a (re)growing expectation that operations will remain operations instead of becoming a shared team responsibility.  While this thinking apparently rolls back core principles of the DevOps movement, we must respect the perceived productivity impact of making operations responsibility overly broad.

What is the right way to share production responsibility between teams? We can start to leverage platforms like Kubernetes to hide underlay complexity and allow DevOps shared ownership in the right places. That means that operations still owns the complex underlay and platform jobs. Overall, I think that’s a workable division.

Apply, Rinse, Repeat! How do I get that DevOps conditioner out of my hair?

I’ve been trying to explain the pain Tao of physical ops in a way that’s accessible to people without scale ops experience.   It comes down to a yin-yang of two elements: exploding complexity and iterative learning.

Exploding complexity is pretty easy to grasp when we stack up the number of control elements inside a single server (OS RAID, 2 SSD cache levels, 20 disk JBOD, and UEFI oh dear), the networks that server is connected to, the multi-layer applications installed on the servers, and the change rate of those applications. Multiply that times 100s of servers and we’ve got a problem of unbounded scope even before I throw in SDN overlays.

But that’s not the real challenge!  The bigger problem is that it’s impossible to design for all those parameters in advance.

When my team started doing scale installs 5 years ago, we assumed we could ship a preconfigured system.  After a year of trying, we accepted the reality that it’s impossible to plan out a scale deployment; instead, we had to embrace a change tolerant approach that I’ve started calling “Apply, Rinse, Repeat.”

Using Crowbar to embrace the in-field nature of design, we discovered a recurring pattern of installs: we always performed at least three full cycle installs to get to ready state during every deployment.

  1. The first cycle was completely generic to provide a working baseline and validate the physical environment.
  2. The second cycle attempted to integrate to the operational environment and helped identify gaps and needed changes.
  3. The third cycle could usually interconnect with the environment and generally exposed new requirements in the external environment.
  4. The subsequent cycles represented additional tuning, patches or redesigns that could only be realized after load was applied to the system in situ.

Every time we tried to shortcut the Apply-Rinse-Repeat cycle, it actually made the total installation longer!  Ultimately, we accepted that the only defense was to focus on reducing A-R-R cycle time so that we could spend more time learning before the next cycle started.

32nd rule to measure complexity + 6 hyperscale network design rules

If you’ve studied computer science then you know there are algorithms that calculate “complexity.” Unfortunately, these have little practical use for data center operators.  My complexity rule does not require a PhD:

The 32nd rule: If it takes more than 30 seconds to pick out what would be impacted by a device failure then your design is too complex.
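One way to picture the question the rule asks is to walk a dependency map outward from the failed device. The sketch below is only an illustration: the topology, the device names, and the impacted_by helper are all made up for this example. If your real map is too tangled to write down like this, that is probably the rule telling you something.

```python
from collections import deque

# Hypothetical topology: each device maps to the devices that depend on it.
DEPENDENTS = {
    "core-switch": ["tor-switch-1", "tor-switch-2"],
    "tor-switch-1": ["server-a", "server-b"],
    "tor-switch-2": ["server-c"],
    "server-a": [],
    "server-b": [],
    "server-c": [],
}


def impacted_by(failed, dependents):
    """Return every device downstream of the failed one (breadth-first walk)."""
    seen = set()
    queue = deque(dependents.get(failed, []))
    while queue:
        device = queue.popleft()
        if device in seen:
            continue  # already counted; also guards against loops in the map
        seen.add(device)
        queue.extend(dependents.get(device, []))
    return sorted(seen)


print(impacted_by("tor-switch-1", DEPENDENTS))
```

In a flat design with clear fault zones the answer stays short enough to read off a whiteboard, which is the whole point of the rule.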

6 Hyperscale Network Design Rules

  1. Cost Matters
  2. Keep Networks Flat
  3. Filter at the Edge
  4. Design Fault Zones
  5. Plan for Local Traffic
  6. Offer load balancers (to your users)

Sorry for the teaser… I’ll be able to release more substance behind this list soon.   Until then comments are (as always) welcome!




Death by Ant Bytes

Or the Dangers of Incremental Complexity

Products are not built in big bangs: they are painfully crafted layer upon layer, decision after decision, day by day.  It’s also a team sport where each member makes countless decisions that hopefully help flow towards something customers love.

In fact, these decisions are so numerous and small that they seem to cost nothing. Our judgment and creativity build the product crumb by drop. Each and every morning we show up for work ready to bake wholesome chocolaty goodness into the product. It’s the seeming irrelevance of each atomic bit that lulls us into falsely thinking that every addition is just a harmless Pythonesque “wafer thin” bite.

That’s right, not all these changes are good.  It’s just as likely (perhaps more likely) that the team is tinkering with the recipe.  Someone asks them to add a pinch of cardamom today, pecans tomorrow, and raisins next week.  Individually, these little changes seem to be trivial.  Taken together, they can delay your schedule at best or ruin your product at worst.

Let me give you a concrete example:

In a past job, we had to build an object model for taxis. At our current stage, this was pretty simple: a taxi has a name, a home base, and an assigned driver. One of our team looked ahead and decided on his own that he should also add make, model, MPG, and other performance fields. He also decided that assignments needed a whole new model so that they could have a date range (start, end) and handle multiple drivers. Many of you are probably thinking all this was just what engineers are supposed to do: anticipate needs. Read on…

By the time he’d built the taxi model, it had taken 5x as long and resulted in 100s of lines of code. It got worse the very next week when we built the meter interface code and learned more about the system. For reporting requirements, MPG and performance fields had to be handled outside the taxi model. We also found that driver assignments were much more naturally handled by looking at fare information. Not only had we wasted a lot of time, we had to spend even more time reversing the changes we’d put in.
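For contrast, here is roughly what the model needed to be at that stage. This is a minimal sketch, not the code we wrote: the Taxi class, its field names, and the sample values are hypothetical. The make, model, MPG, and date-ranged assignment pieces are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Taxi:
    """Only the three fields the current stage called for."""
    name: str
    home_base: str
    assigned_driver: str


# Hypothetical sample record, just to show the model in use.
cab = Taxi(name="Cab 12", home_base="Downtown depot", assigned_driver="Pat")
print(cab)
```

Adding a field later, once the meter interface shows it is really needed, is a small change. Removing a speculative one that other code already touches is not.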

One of my past CEOs called this a “death by ant bites” and “death of a million cuts.”

It’s one of the most pernicious forms of feature creep because every single one of the changes can be justified.  I’m not suggesting that all little adds are bad, but they all cost something.   Generally, if someone says they are anticipating a future need, then you’re being bitten by an ant.

You need to make sure that your team is watching each other’s back and keeping everyone honest.  It’s even better to take turns playing devil’s advocate on each feature.  It’s worth an extra 10 minutes in a meeting to justify if that extra feature is required.

PS: Test Driven Design (TDD) repels ants because it exposes the true cost for those anticipatory or seemingly minor changes.  That “10 minute” feature is really a half day of work to design, test, integrate, and document.  If it’s not worth doing right, then it’s not worth adding to the product.

Time vs. Materials: $1,000 printer power button

Or why I teach my kids to solder

I just spent four hours doing tech support over a $0.01 part on an $80 inkjet printer. According to my wife, those hours were a drop in the bucket: I was the latest in a long line of comrades-in-geekdom who had been trying to get her office printer printing. All told, at least $1,000 worth of experts’ time was invested.

It really troubles me when the ratio of support cost to purchase cost exceeds 10x for a brand new device.
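The arithmetic for this printer, using the two dollar figures from the story (the $1,000 is my rough estimate of the time invested, not a measured number):

```python
purchase_cost = 80      # printer price in dollars, from the story
support_cost = 1_000    # rough estimate of expert time in dollars, from the story

ratio = support_cost / purchase_cost
print(f"support cost is {ratio:.1f}x the purchase price")
```

That works out to 12.5x, comfortably past the 10x line.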

In this case, a stuck power button cover forced the printer into a cryptic QA test mode. It was obvious that the button was stuck, but not so obvious that this effectively crippled the printer. Ultimately, my 14 year old stripped the printer down, removed the $0.01 button cover, accidentally stripped a cable, soldered it back together, and finally repaired the printer.

From a cost perspective, my wife’s office would have been far smarter to dump the whole thing into the trash and get a new one. Even returning it to the store was hardly worth the time lost dealing with the return.

This thinking really, really troubles me.

I have to wonder what it would cost our industry to create products that were field maintainable, easier to troubleshoot, and less likely to fail. The automotive industry seems to be ahead of us in some respects. They create products that are reliable, field maintainable, and conform to standards (given Toyota’s recent woes, do I need to reconsider this statement?). Unfortunately, they are slow to innovate and have become highly constrained by legislative oversight. Remember the old “If Microsoft made cars” joke?

For the high tech industry, I see systemic challenges driven by a number of market pressures:

  1. Pace of innovation: our use of silicon is just graduating from crawling to baby steps. Products from the 90s look like stone tablets compared to offerings from the 2010s. This is not just lipstick; these innovations disrupt design processes, making it expensive to maintain legacy systems.
  2. Time to market: global competitive pressures to penetrate new markets mean that designing to win new customers takes precedence.
  3. Lack of standards: standards can’t keep up with innovation and market pressures.  We’re growing to accept the consensus model for ad hoc standardization.  Personally, I like this approach, but we’re still learning how to keep it fair.
  4. System complexity: To make systems feature rich and cost effective, we make them tightly coupled.  This is great at design time, but eliminates maintainability because it’s impossible to isolate and replace individual components.
  5. Unequal wealth and labor rates:  Our good fortune and high standard of living make it impractical for us to spend time repairing or upgrading.  We save this labor by buying new products made in places where labor is cheap.  These cheap goods often lack quality and the cycle repeats.
  6. Inventory costs: Carrying low-demand, non-standard goods in inventory is expensive. I can buy a printer with thousands of resistors soldered onto a board for $89, while buying the same resistors alone would cost more than the whole printer. Can anyone afford to keep the parts needed for maintenance in stock?
  7. Disposable resources: We deplete limited resources as if they were unlimited.  Not even going to start on this rant…

Looking at these pressures makes the challenge appear overwhelming, but we need to find a way out of this trap.

That sounds like the subject for a future post!