A year of RackN – 9 lessons from the front lines of evangelizing open physical ops

Let’s avoid this > “We’re heading right at the ground, sir!  Excellent, all engines full power!”

another scale? oars & motors. WWF managing small scale fisheries

RackN is refining our “start to scale” message and it’s also our one-year anniversary, so it’s a natural time for reflection. While it’s been a year since our founders made RackN a full-time obsession, the team has been working together for over 5 years now with the same vision: improve datacenter operations at scale.

As a backdrop, IT-Ops is under tremendous pressure to increase agility and reduce spending.  Even worse, there’s a building pipeline of container driven change that we are still learning how to operate.

Over the year, we learned that:

  1. no one has time to improve ops
  2. everyone thinks their uniqueness is unique
  3. most sites have much more in common than is different
  4. the differences between sites are small
  5. small differences really do break automation
  6. once it breaks, it’s much harder to fix
  7. everyone plans to simplify once they stop changing everything
  8. the pace of change is accelerating
  9. apply, rinse, repeat with lesson #1

Where does that leave us besides stressed out?  Ops is not keeping up.  The solution is not to go faster: we have to improve first and then accelerate.

What makes general purpose datacenter automation so difficult?  The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is ordering of interdependencies.  This is especially true on physical systems where you have to really grok* networking.
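To make the ordering point concrete, here is a minimal sketch of dependency ordering using Python's standard `graphlib` module. The step names and their dependencies are invented for illustration; they are not taken from Digital Rebar, which may model this very differently.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical provisioning steps, each mapped to the steps it depends on.
# These names are illustrative assumptions, not Digital Rebar roles.
DEPENDENCIES = {
    "configure_switch_ports": set(),
    "assign_ip_addresses": {"configure_switch_ports"},
    "register_dns": {"assign_ip_addresses"},
    "install_os": {"assign_ip_addresses"},
    "run_config_management": {"install_os", "register_dns"},
}


def plan(dependencies):
    """Return one valid execution order; raises CycleError on circular dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    try:
        for step in plan(DEPENDENCIES):
            print(step)
    except CycleError as err:
        print(f"circular dependency: {err.args[1]}")
```

Even in this toy graph, a site that registers DNS before addresses are assigned would need a different graph, which is the kind of small difference that breaks hard-coded automation.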

The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.

What does that heterogeneous-first automation look like? First, we’ve learned that we have to adapt to customer datacenters. That means using the DNS, DHCP and other services that you already have in place, and dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.

This was hard and tested both our patience and architecture pattern. It would be much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” about forcing uniformity and built software that accepts variation.

So what did we do with a year?  We had to spend a lot of time listening and learning what “real operations” need.   Then we had to create software that accommodated variation without breaking downstream automation.  Now we’ve made it small enough to run on a desktop or cloud for sandboxing and a new learning cycle begins.

We’d love to have you try it out: rebar.digital.

* Grok is the correct word here.  Thinking that you “understand networking” is often more dangerous when it comes to automation.

How do platforms die? One step at a time [the Fidelity Gap]

The RackN team is working on the “Start to Scale” position for Digital Rebar that targets the IT industry-wide “fidelity gap” problem.  When we started on the Digital Rebar journey back in 2011 with Crowbar, we focused on “last mile” problems in metal and operations.  Only in the last few months did we recognize the importance of automating smaller “first mile” desktop and lab environments.

A fidelity gap is created when work done on one platform, a developer laptop, does not translate faithfully to the next platform, a QA lab.   Since there are gaps at each stage of deployment, we end up with the ops staircase of despair.

These gaps hide defects until they are expensive to fix and make it hard to share improvements.  Even worse, they keep teams from collaborating.

With everyone trying out Container Orchestration platforms like Kubernetes, Docker Swarm, Mesosphere or Cloud Foundry (all of which we deploy, btw), it’s important that we can gracefully scale operational best practices.

For companies implementing containers, it’s not just about turning their apps into microservice-enabled immutable-rock stars: they also need to figure out how to implement the underlying platforms at scale.

My example of fidelity gap harm is OpenStack’s “all in one, single node” DevStack.  There is no useful single system OpenStack deployment; however, that is the primary system for developers and automated testing.  This design hides production defects and usability issues from developers.  These are issues that would be exposed quickly if the community required multi-instance development.  Even worse, it keeps developers from dealing with operational consequences of their decisions.

What are we doing about fidelity gaps?  We’ve made it possible to run and faithfully provision multi-node systems in Digital Rebar on a relatively light system (16 GB RAM, 4 cores) using VMs or containers.  That system can then be fully automated with Ansible, Chef, Puppet and Salt.  Because of our abstractions, if deployment works in Digital Rebar then it can scale up to 100s of physical nodes.

My takeaway?  If you want to get to scale, start with the end in mind.

Introducing Digital Rebar. Building strong foundations for New Stack infrastructure

This week, I have the privilege to showcase the emergence of RackN’s updated approach to data center infrastructure automation that is container-ready and drives “cloud-style” DevOps on physical metal.  While it works at scale, we’ve also ensured it’s light enough to run a production-fidelity deployment on a laptop.

You grow to cloud scale with a ready-state foundation that scales up at every step.  That’s exactly what we’re providing with Digital Rebar.

Over the past two years, the RackN team has been working on microservices operations orchestration in the OpenCrowbar code base.  By embracing these new tools and architecture, Digital Rebar takes that base in new directions.  Yet, we also get to leverage a scalable heterogeneous provisioner and integrations for all major devops tools.  We began with critical data center automation already working.

Why Digital Rebar? Traditional data center ops is being disrupted by container and service architectures, and legacy data centers are challenged with gracefully integrating this new way of managing containers at scale: we felt it was time to start a dialog about the new foundational layer of scale ops.

Both our code and vision have substantially diverged from the groundbreaking “OpenStack Installer” MVP that RackN team members launched in 2011 from inside Dell and that is still winning prizes for SUSE.

We have not regressed our leading vendor-neutral hardware discovery and configuration features; however, today, our discussions are about service wrappers, heterogeneous tooling, immutable container deployments and next generation platforms.

Over the next few days, I’ll be posting more about how Digital Rebar works (plus video demos).

RackN fills holes with Drill Release

Originally posted on RackN:

Drill Man! by BruceLowell.com [creative commons]

We’re so excited about our in-process release that we’ve been relatively quiet about the last OpenCrowbar Drill release (video tour here).  That’s not a fair reflection of the level of capability and maturity reflected in the code base; yes, Drill’s purpose was to set the stage for truly ground breaking ops automation work in the next release (“Epoxy”).

So, what’s in Drill?  Scale and Containers on Metal Workloads!  [official release notes]

The primary focus for this release was proving our functional operations architectural pattern against a wide range of workloads and that is exactly what the RackN team has been doing with Ceph, Docker Swarm, Kubernetes, CloudFoundry and StackEngine workloads.

In addition to workloads, we put the platform through its paces in real ops environments at scale.  That resulted in even richer network configurations and options plus performance…


Deploy to Metal? No sweat with RackN’s new Ansible Dynamic Inventory API

Content originally posted by Ansible & RackN so I added a video demo.  Also, see Ansible’s original post for more details about the multi-vendor “Simple OpenStack Initiative.”

The RackN team takes our already super easy Ansible integration to a new level with added SSH Key control and dynamic inventory with the recent OpenCrowbar v2.3 (Drill) release.  These two items make full metal control more accessible than ever for Ansible users.

The platform offers full key management.  You can add keys at the system, deployment (group of machines) and machine levels.  These keys are operator settable and can be added and removed after provisioning has been completed.  If you want to control access on a per-server or group-of-servers basis, OpenCrowbar provides that control via our API, CLI and UI.

We also provide an API path for Ansible dynamic inventory.  Using the simple Python client script (reference example), you can instantly get a complete, up-to-date node inventory of your system.  The inventory data includes items like number of disks, cpus and amount of RAM.  If you’ve grouped machines in OpenCrowbar, those groups are passed to Ansible.  Even better, the metadata schema includes the networking configuration and machine status.
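For readers who have not written a dynamic inventory source before, here is a rough sketch of the JSON shape Ansible expects from an inventory script: group names mapping to host lists, plus a `_meta.hostvars` section carrying per-host metadata. This is not the RackN reference script. The node records, field names and host names below are hypothetical stand-ins, since the actual OpenCrowbar API response is not shown in this post.

```python
import json

# Hypothetical node records standing in for an API response.
# Field names (group, disks, cpus, ram_gb, status) are assumptions for illustration.
SAMPLE_NODES = [
    {"name": "node1.example.com", "group": "ceph", "disks": 4, "cpus": 8, "ram_gb": 64, "status": "ready"},
    {"name": "node2.example.com", "group": "ceph", "disks": 4, "cpus": 8, "ram_gb": 64, "status": "ready"},
    {"name": "node3.example.com", "group": "kubernetes", "disks": 2, "cpus": 16, "ram_gb": 128, "status": "ready"},
]


def build_inventory(nodes):
    """Convert node records into Ansible's dynamic inventory JSON structure."""
    inventory = {"_meta": {"hostvars": {}}}
    for node in nodes:
        # Each group becomes a top-level key with a list of member hosts.
        group = inventory.setdefault(node["group"], {"hosts": []})
        group["hosts"].append(node["name"])
        # Everything except the name and group is exposed as a host variable.
        inventory["_meta"]["hostvars"][node["name"]] = {
            key: value for key, value in node.items() if key not in ("name", "group")
        }
    return inventory


if __name__ == "__main__":
    # Ansible runs inventory scripts with --list and reads JSON from stdout.
    print(json.dumps(build_inventory(SAMPLE_NODES), indent=2))
```

Saved as an executable file, a script like this can be passed to `ansible` or `ansible-playbook` with `-i`, after which the groups are addressable as host patterns.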

With no added configuration, you can immediately use Ansible as your multi-server CLI for ad hoc actions and installation using playbooks.

Of course, the OpenCrowbar tools are also available if you need remote power control or want a quick reimage of the system.

RackN respects that data centers are heterogeneous.  Our vision is that your choice of hardware, operating system and network topology should not break devops deployments!  That’s why we work hard to provide useful abstracted information.  We want to work with you to help make sure that OpenCrowbar provides the right details to create best practice installations.

For working with bare metal, there’s no simpler way to deliver consistent, repeatable results.

Transitioning from a Bossy Boss into a Digital Age Leader [Series Conclusion]


We hope you’ve enjoyed our discussion about digital management over the last seven posts. This series was born of our frustration with patterns of leadership in digital organizations: overly directing leaders stifle their team while hands-off leaders fail to provide critical direction. Neither culture is leading effectively!

Digital managers have to be two things at once

We felt that our “cultural intuition” is failing us.  That drove us to describe what’s broken and how to fix it.

Digital work and workers operate in a new model where top-down management is neither appropriate nor effective. Case in point: many digital workers actively resist being given too much direction, rules or structure. No, we are not throwing out management; on the contrary, we believe management is more important than ever, but changes to both work and workers have made it much harder than before.

That’s especially true when Boomers and Millennials try to work together because of differences in leadership experience and expectation. As Brad is always pointing out in his book Liquid Leadership, “what motivates a Millennial will not motivate a Boomer,” or even a Gen Xer.

Millennials may be so uncomfortable having to set limits and enforce decisions that they avoid exerting the very leadership that digital workers need! Meanwhile, Gen X and Boomers may be creating and expecting unrealistic deadlines simply because they truly do not understand the depth of the work involved.

So who’s right and who’s wrong? As we’ve pointed out in previous posts, it’s neither! Why? Because unlike Industrial Age Models, there is no one way to get something done in The Information Age.

We desperately need a management model that works for everyone. How does a digital manager know when it’s time to be directing? If you’ve communicated a shared purpose well then you are always at liberty to 1) ask your team if this is aligned and 2) quickly stop any activity that is not aligned.

The trap we see for digital managers who have not communicated the shared goals is that they lack the team authority to take the lead.

We believe that digital leadership requires finding a middle ground using these three guidelines:

  1. Clearly express your intent and trust, don’t force, your team to follow it.
  2. Respect your team’s ability to make good decisions around the intent.
  3. Don’t be shy to exercise your authority when your team needs direction.

Digital management is hard: you don’t get the luxury of authority or the comfort of certainty.

If you are used to directing then you have to trust yourself to communicate clearly at an abstract level and then let go of the details. If you are used to being hands-off then you have to get comfortable being specific and assertive when the situation demands it.

Our frustration was that neither Boomer nor Millennial culture is providing effective management. Instead, we realized that elements of both are required. It’s up to the digital manager to learn when each mode is required.

Thank you for following along. It has been an honor.