How do platforms die? One step at a time [the Fidelity Gap]

The RackN team is working on the “Start to Scale” position for Digital Rebar that targets the IT industry-wide “fidelity gap” problem.  When we started on the Digital Rebar journey back in 2011 with Crowbar, we focused on “last mile” problems in metal and operations.  Only in the last few months did we recognize the importance of automating smaller “first mile” desktop and lab environments.

A fidelity gap is created when work done on one platform, a developer laptop, does not translate faithfully to the next platform, a QA lab.  Since there are gaps at each stage of deployment, we end up with the ops staircase of despair.

These gaps hide defects until they are expensive to fix and make it hard to share improvements.  Even worse, they keep teams from collaborating.

With everyone trying out Container Orchestration platforms like Kubernetes, Docker Swarm, Mesosphere or Cloud Foundry (all of which we deploy, btw), it’s important that we can gracefully scale operational best practices.

For companies implementing containers, it’s not just about turning their apps into microservice-enabled immutable-rock stars: they also need to figure out how to implement the underlying platforms at scale.

My example of fidelity gap harm is OpenStack’s “all in one, single node” DevStack.  There is no useful single system OpenStack deployment; however, that is the primary system for developers and automated testing.  This design hides production defects and usability issues from developers.  These are issues that would be exposed quickly if the community required multi-instance development.  Even worse, it keeps developers from dealing with operational consequences of their decisions.

What are we doing about fidelity gaps?  We’ve made it possible to run and faithfully provision multi-node systems in Digital Rebar on a relatively light system (16 GB RAM, 4 cores) using VMs or containers.  That system can then be fully automated with Ansible, Chef, Puppet and Salt.  Because of our abstractions, if a deployment works in Digital Rebar then it can scale up to 100s of physical nodes.
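To make that concrete, here is a minimal sketch of the pattern (the inventory paths and playbook name are hypothetical, and this is not the Digital Rebar API): the same automation is pointed first at a small local inventory of VMs or containers, then, unchanged, at a large physical inventory.

```python
# Minimal sketch, assuming a hypothetical project layout; not the Digital Rebar API.
# The same playbook runs against a laptop-sized inventory and a metal inventory,
# so whatever works locally gets exercised the same way at scale.
import subprocess

LOCAL_INVENTORY = "inventories/local-vms.ini"    # a few VMs/containers on a 16 GB laptop
METAL_INVENTORY = "inventories/datacenter.ini"   # hundreds of physical nodes

def deploy(inventory: str, playbook: str = "site.yml") -> None:
    """Run the identical Ansible playbook against whichever inventory is supplied."""
    subprocess.run(["ansible-playbook", "-i", inventory, playbook], check=True)

if __name__ == "__main__":
    deploy(LOCAL_INVENTORY)    # develop and test on the light system first...
    # deploy(METAL_INVENTORY)  # ...then promote the identical automation to metal
```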

My takeaway?  If you want to get to scale, start with the end in mind.

Introducing Digital Rebar. Building strong foundations for New Stack infrastructure

This week, I have the privilege to showcase the emergence of RackN’s updated approach to data center infrastructure automation that is container-ready and drives “cloud-style” DevOps on physical metal.  While it works at scale, we’ve also ensured it’s light enough to run a production-fidelity deployment on a laptop.

You grow to cloud scale with a ready-state foundation that scales up at every step.  That’s exactly what we’re providing with Digital Rebar.

Over the past two years, the RackN team has been working on microservices operations orchestration in the OpenCrowbar code base.  By embracing these new tools and architecture, Digital Rebar takes that base in a new direction.  Yet, we also get to leverage a scalable heterogeneous provisioner and integrations for all major devops tools.  We began with critical data center automation already working.

Why Digital Rebar? Traditional data center ops is being disrupted by container and service architectures, and legacy data centers are challenged with gracefully integrating this new way of managing containers at scale: we felt it was time to start a dialog about the new foundational layer of scale ops.

Both our code and vision have substantially diverged from the groundbreaking “OpenStack Installer” MVP that the RackN team members launched in 2011 from inside Dell, and which is still winning prizes for SUSE.

We have not regressed our leading vendor-neutral hardware discovery and configuration features; however, today, our discussions are about service wrappers, heterogeneous tooling, immutable container deployments and next generation platforms.

Over the next few days, I’ll be posting more about how Digital Rebar works (plus video demos).

Transitioning from a Bossy Boss into a Digital Age Leader [Series Conclusion]


We hope you’ve enjoyed our discussion about digital management over the last seven posts. This series was born of our frustration with patterns of leadership in digital organizations: overly directing leaders stifle their team while hands-off leaders fail to provide critical direction. Neither culture is leading effectively!

Digital managers have to be two things at once

We felt that our “cultural intuition” was failing us.  That drove us to describe what’s broken and how to fix it.

Digital work and workers operate in a new model where top-down management is neither appropriate nor effective. Case in point: many digital workers actively resist being given too much direction, rules or structure. No, we are not throwing out management; on the contrary, we believe management is more important than ever, but changes to both work and workers have made it much harder than before.

That’s especially true when Boomers and Millennials try to work together because of differences in leadership experience and expectation. As Brad is always pointing out in his book Liquid Leadership, “what motivates a Millennial will not motivate a Boomer,” or even a Gen Xer.

Millennials may be so uncomfortable having to set limits and enforce decisions that they avoid exerting the very leadership that digital workers need! Meanwhile, GenX and Boomers may create and expect unrealistic deadlines simply because they truly do not understand the depth of the work involved.

So who’s right and who’s wrong? As we’ve pointed out in previous posts, it’s neither! Why? Because unlike Industrial Age Models, there is no one way to get something done in The Information Age.

We desperately need a management model that works for everyone. How does a digital manager know when it’s time to be directing? If you’ve communicated a shared purpose well then you are always at liberty to 1) ask your team if this is aligned and 2) quickly stop any activity that is not aligned.

The trap we see for digital managers who have not communicated the shared goals is that they lack the team authority to take the lead.

We believe that digital leadership requires finding a middle ground using these three guidelines:

  1. Clearly express your intent and trust, don’t force, your team to follow it.
  2. Respect your team’s ability to make good decisions around the intent.
  3. Don’t be shy about exercising your authority when your team needs direction.

Digital management is hard: you don’t get the luxury of authority or the comfort of certainty.

If you are used to directing then you have to trust yourself to communicate clearly at an abstract level and then let go of the details. If you are used to being hands-off then you have to push yourself to be specific and assertive when the situation demands it.

Our frustration was that neither Boomer nor Millennial culture is providing effective management. Instead, we realized that elements of both are required. It’s up to the digital manager to learn when each mode is required.

Thank you for following along. It has been an honor.

DefCore Update – slowly taming the Interop hydra.

Last month, the OpenStack board charged the DefCore committee to tighten the specification. That means adding more required capabilities to the guidelines and reducing the number of exceptions (“flags”).  Read the official report by Chris Hoge.

“Cartography” by Dave McAlister is licensed under a Creative Commons Attribution 4.0 International License.

It turns out interoperability is really, really hard in heterogeneous environments because it’s not just about the API – implementation choices change behavior.

I see this in both the cloud and physical layers. Since OpenStack is set up as a multi-vendor and multi-implementation (private/public) ecosystem, getting us back to a shared least common denominator is a monumental challenge. I also see a similar legacy in physical ops with OpenCrowbar, where each environment is a snowflake and operators constantly reinvent the same tooling instead of sharing expertise.

Lack of commonality means the industry wastes significant effort recreating operational knowledge for marginal return. Increasing interop means reducing variations which, in turn, increases the stakes for vendors seeking differentiation.

We’ve been working on DefCore for years so that we could get to this point. Our first real Guideline, 2015.03, was an intentionally low bar with nearly half of the expected tests flagged as non-required. While the latest guidelines do not add new capabilities, they substantially reduce the number of exceptions granted. Further, we are in the process of adding networking capabilities for the planned 2016.01 guideline (ready for community review at the Tokyo summit).

Even though these changes take a long time to become fully required for vendors, we can start testing interoperability of clouds using them immediately.

While the DefCore guidelines do have teeth via Foundation licensing policy, vendors can take up to three years [1] to comply. That may sound slow, but the real authority of the program comes from customer and vendor participation, not enforcement [2].

For that reason, I’m proud that DefCore has become a truly diverse and broad initiative.

I’m further delighted by the leadership demonstrated by Egle Sigler, my co-chair, and Chris Hoge, the Foundation staff leading DefCore implementation.  Happily, their enthusiasm is also shared by many other people with long term DefCore investments including mid-cycle attendees Mark Voelker (VMware), Catherine Diep (IBM) who is also a RefStack PTL, Shamail Tahir (EMC), Carol Barrett (Intel), Rocky Grober (Huawei), Van Lindberg (Rackspace), Mark Atwood (HP), Todd Moore (IBM), and Vince Brunssen (IBM). We also had four DefCore related project PTLs join our mid-cycle: Kyle Mestery (Neutron), Nikhil Komawar (Glance), John Dickinson (Swift), and Matthew Treinish (Tempest).

Thank you all for helping keep DefCore rolling and working together to tame the interoperability hydra!

[1] On the current schedule – changes will now take one year to become required – vendors have a three-year tail! Three years? Since the last two Guidelines are active, the earliest the networking capabilities can become required is after 2016.01 is superseded in January 2017. Vendors who (re)license just before that can use the mark for 12 months (until January 2018!).

[2] How can we make this faster? Simple: consumers need to demand that their vendor pass the latest guidelines. DefCore provides Guidelines, but consumers’ checkbooks are the real power in the ecosystem.

When Two Right Decisions Make Things Wrong [Digital Management Series, 7 of 8]


The Duality Trap is one digital management danger that’s so destructive that we felt this series would be incomplete without discussing it. It’s especially problematic for Digital Native managers and often mishandled by traditionally trained ones too.

Each apple is delicious. Which would you choose?

The Duality Trap occurs when there are multiple right answers to a question. How often does this happen? Every single time. In fact, it’s a side effect of good digital management. Why?

In hierarchical management, the boss is always right so there’s no duality. Since we’ve thrown out hierarchical decision making, every team action is potentially subject to review by everyone on the team. The very loose structure that allows individual autonomy and rapid response has the natural consequence of also creating cognitive friction when individuals approach problems differently.

These different approaches are generally all valid ways to progress.

Digital natives fundamentally understand choice duality and may present alternatives just to ensure team diversity. Unfortunately, while there may be multiple valid solutions, the team can only pick one [1]. Nine times out of ten, the team will simply pick and move on. In the outlier case, they are counting on you, their digital manager, to resolve the selection.

Here’s the trap: resolving a duality does not mean “picking the winner” because having a winner implies the choices were unequal. If your team is stuck then there are at least two good choices.

If you are a traditional manager, the temptation to become Ronald “the decider” Reagan is nearly irresistible. Under the title=authority-to-decide model, you must justify your salary by making a “right” decision. You’ve been waiting for this moment to exert your authority for days. But, unbeknownst to “the decider,” this big moment will immediately undermine the team’s autonomy. On the other hand, if you are a digital native then this is the moment you’ve been dreading because you’ve got to be decisive. Despite 5 to 10 really good choices, you have to pick ONE. So, a digital native can appear to be indecisive. However, not deciding is the worst possible choice. So what should you do?

First, remember that teams are strengthened when they are clearly aligned around an intent.

Resolving the duality trap is an opportunity to emphasize your intent. The best approach is to ask your team to review the options again in light of your shared objectives. In many cases, they will be able to resolve the issue from that perspective. If not, then you should:

  1. validate that all of the options could work
  2. have the team state desired outcomes that can be measured
  3. pick the option that most aligns with your intent
  4. ask whether the option the team chooses fits the overall agenda: for example, faster delivery at the cost of some quality, or deeper work that raises quality but risks a crucial deadline (this may narrow down your choices)
  5. ask the team to monitor the results

In this case, even as you are driving a decision, you are still sharing the responsibility for the outcome with the team. It’s important for the team that you focus on the desired results and not on which course was chosen. It is very likely that any of the choices would work out and achieve positive outcomes.

So it’s OK to get out of the trap of picking “best” options when there are multiple right choices.  

In an age of ambiguity, it is easy to fall into the duality trap. Just remember, there is no one way to get it all done these days. Which means a GREAT people manager realizes two things: a) your people need more of your support than ever, in the form of training, finding solutions, and building a team that has the right chemistry; and b) they need you to get out of their way.

Get ready as we wrap up this series in post 8: Transitioning from a Bossy Boss into a Digital Age Leader.

[1] If you are in a situation where you can allow divergence for minimal cost (like which phone brand people use) then do not force your team to choose!

As Docker rises above (and disrupts) clouds, I’m thinking about their community landscape

Watching the lovefest of DockerCon last week had me digging up my April 2014 “Can’t Contain(erize) the Hype” post.  There’s no doubt that Docker (and containers more broadly) is delivering on its promise.  I was impressed with the container community navigating towards an open platform in RunC and vendor adoption of the trusted container platforms.

I’m a fan of containers and their potential; yet, remotely watching the scope and exuberance of Docker partnerships seems out of proportion with the current capabilities of the technology.

The latest update to the Docker technology, v1.7, introduces a lot of important network, security and storage features.  The price of all that progress is disruption to ongoing work and integration within the ecosystem.

There’s always two sides to the rapid innovation coin: “Sweet, new features!  Meh, breaking changes to absorb.”

Docker Ecosystem Explained

There remains confusion between Docker the company and Docker the technology.  I like how the chart (right) maps out potential areas in the Docker ecosystem.  There are clearly a lot of places for companies to monetize the technology; however, it’s not as clear if the company will be able to cede lucrative regions, like orchestration, and let them become a competitive landscape.

While Docker has clearly delivered a lot of value in just a year, they have a fair share of challenges ahead.  

If OpenStack is a leading indicator, we can expect to see vendor battlegrounds forming around networking and storage.  Docker (the company) has a chance to show leadership and build community here, yet could cause harm by giving up the arbitrator role to be a contender instead.

One thing that would help control the inevitable border skirmishes would be clear definitions of core, ecosystem and adjacencies.  I see Docker blurring these lines with some of their tools around orchestration, networking and storage.  I believe that was part of their now-suspended kerfuffle with CoreOS.

Thinking a step further, parts of the Docker technology (RunC) have moved over to Linux Foundation governance.  I wonder if the community will drive additional shared components into open governance.  Looking at Node.js, there’s clear precedent and I wonder if Joyent’s big Docker plans have them thinking along these lines.

StackEngine Docker on Metal via RackN Workload for OpenCrowbar


In our quest for fast and cost-effective container workloads, RackN and StackEngine have teamed up to jointly develop a bare metal StackEngine workload for the RackN Enterprise version of OpenCrowbar.  Want more background on StackEngine?  There’s also a recent post covering StackEngine capabilities.

While this work is early, it is complete enough for field installs.  We’d like to include potential users in our initial integration because we value your input.

Why is this important?  We believe that there are significant cost, operational and performance benefits to running containers directly on metal.  This collaboration is a tangible step towards demonstrating that value.

What did we create?  The RackN workload leverages our enterprise distribution of OpenCrowbar to create a ready state environment for StackEngine to be able to deploy and automate Docker container apps.

In this pass, that’s a pretty basic CentOS 7.1 environment with the hardware configured.  The workload takes your StackEngine customer key as the input.  From there, it will download and install StackEngine on all the nodes in the system.  When you choose which nodes also manage the cluster, the workload will automatically handle the cross-registration.
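For illustration only, here is a rough sketch of that flow; the node names and helper functions are hypothetical placeholders, not the actual RackN/StackEngine workload: take the customer key as input, install the StackEngine agent on every node, then cross-register each node with the nodes you designated as cluster managers.

```python
# Illustrative sketch only: names and helpers are hypothetical placeholders,
# not the actual RackN/StackEngine workload logic.
STACKENGINE_KEY = "YOUR-CUSTOMER-KEY"            # the workload's single input
ALL_NODES = ["node-01", "node-02", "node-03"]    # every node in the system
MANAGER_NODES = ["node-01"]                      # nodes chosen to manage the cluster

def install_agent(node: str, key: str) -> None:
    # Placeholder for downloading and installing StackEngine on one node.
    print(f"install StackEngine on {node} using key {key!r}")

def cross_register(node: str, managers: list[str]) -> None:
    # Placeholder for the cross-registration the workload automates.
    for manager in managers:
        print(f"register {node} with cluster manager {manager}")

for node in ALL_NODES:
    install_agent(node, STACKENGINE_KEY)
for node in ALL_NODES:
    cross_register(node, MANAGER_NODES)
```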

What is our objective?  We want to provide a consistent and sharable way to run directly on metal.  That accelerates the exploration of this approach to operationalizing container infrastructure.

What is the roadmap?  We want feedback on the workload to drive the roadmap.  Our first priority is to tune it to maximize performance.  Later, we expect to add additional operating systems, more complex networking and closed-loop integration with StackEngine and RackN for things like automatic resource scheduling.

How can you get involved?  If you are interested in working with a tech-preview version of the technology, you’ll need a working OpenCrowbar Drill implementation (via Github or early access available from RackN), a StackEngine registration key and access to the RackN/StackEngine workload (email or for access).