A year of RackN – 9 lessons from the front lines of evangelizing open physical ops

Let’s avoid this > “We’re heading right at the ground, sir! Excellent, all engines full power!”


RackN is refining our “start to scale” message, and it’s also our one year anniversary, so it’s a natural time for reflection. While it’s been a year since our founders made RackN a full-time obsession, the team has been working together for over five years now with the same vision: improve scale datacenter operations.

As a backdrop, IT-Ops is under tremendous pressure to increase agility and reduce spending.  Even worse, there’s a building pipeline of container driven change that we are still learning how to operate.

Over the year, we learned that:

  1. no one has time to improve ops
  2. everyone thinks their uniqueness is unique
  3. most sites have much more in common than is different
  4. the differences between sites are small
  5. small differences really do break automation
  6. once it breaks, it’s much harder to fix
  7. everyone plans to simplify once they stop changing everything
  8. the pace of change is accelerating
  9. apply, rinse, repeat with lesson #1

Where does that leave us besides stressed out?  Ops is not keeping up.  The solution is not going faster: we have to improve first and then accelerate.

What makes general purpose datacenter automation so difficult?  The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is ordering of interdependencies.  This is especially true on physical systems where you have to really grok* networking.

The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.

What does that heterogeneous-first automation look like? First, we’ve learned that we have to adapt to customer datacenters. That means using the DNS, DHCP and other services that you already have in place. It means dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.

This was hard and tested both our patience and architecture pattern. It would be much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” about forcing uniformity and built software that accepts variation.

So what did we do with a year?  We had to spend a lot of time listening and learning what “real operations” need.   Then we had to create software that accommodated variation without breaking downstream automation.  Now we’ve made it small enough to run on a desktop or cloud for sandboxing and a new learning cycle begins.

We’d love to have you try it out: rebar.digital.

* Grok is the correct word here.  Thinking that you merely “understand networking” is often more dangerous when it comes to automation.

How do platforms die? One step at a time [the Fidelity Gap]

The RackN team is working on the “Start to Scale” position for Digital Rebar that targets the IT industry-wide “fidelity gap” problem.  When we started on the Digital Rebar journey back in 2011 with Crowbar, we focused on “last mile” problems in metal and operations.  Only in the last few months did we recognize the importance of automating smaller “first mile” desktop and lab environments.

A fidelity gap is created when work done on one platform, a developer laptop, does not translate faithfully to the next platform, a QA lab.   Since there are gaps at each stage of deployment, we end up with the ops staircase of despair.

These gaps hide defects until they are expensive to fix and make it hard to share improvements.  Even worse, they keep teams from collaborating.

With everyone trying out Container Orchestration platforms like Kubernetes, Docker Swarm, Mesosphere or Cloud Foundry (all of which we deploy, btw), it’s important that we can gracefully scale operational best practices.

For companies implementing containers, it’s not just about turning their apps into microservice-enabled, immutable rock stars: they also need to figure out how to implement the underlying platforms at scale.

My example of fidelity gap harm is OpenStack’s “all in one, single node” DevStack.  There is no useful single system OpenStack deployment; however, that is the primary system for developers and automated testing.  This design hides production defects and usability issues from developers.  These are issues that would be exposed quickly if the community required multi-instance development.  Even worse, it keeps developers from dealing with operational consequences of their decisions.

What are we doing about fidelity gaps?  We’ve made it possible to run and faithfully provision multi-node systems in Digital Rebar on a relatively light system (16 GB RAM, 4 cores) using VMs or containers.  That system can then be fully automated with Ansible, Chef, Puppet and Salt.  Because of our abstractions, if a deployment works in Digital Rebar then it can scale up to 100s of physical nodes.

My takeaway?  If you want to get to scale, start with the end in mind.


RackN fills holes with Drill Release

Drill Man! by BruceLowell.com [creative commons]

We’re so excited about our in-process release that we’ve been relatively quiet about the last OpenCrowbar Drill release (video tour here).  That’s not a fair reflection of the capability and maturity in the code base; rather, Drill’s purpose was to set the stage for truly ground-breaking ops automation work in the next release (“Epoxy”).

So, what’s in Drill?  Scale and Containers on Metal Workloads!  [official release notes]

The primary focus for this release was proving our functional operations architectural pattern against a wide range of workloads and that is exactly what the RackN team has been doing with Ceph, Docker Swarm, Kubernetes, CloudFoundry and StackEngine workloads.

In addition to workloads, we put the platform through its paces in real ops environments at scale.  That resulted in even richer network configurations and options plus performance and tuning.  The RackN team continues to adapt the platform to match real world ops.

We believe that operations tools should adapt to their environments not vice versa.

We’ve encountered some pretty extreme quirks, and our philosophy is to embrace them rather than force users to change tools or processes.  For example, Drill automatically keeps the last IPv4 octets aligned between interfaces.  Even better, we can help slipstream migrations (like IPv4 to IPv6) in place to minimize disruptions.
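As a sketch of what that octet alignment looks like in practice (the subnets, interface names and node numbering here are illustrative, not Drill’s actual scheme):

```python
import ipaddress

def aligned_addresses(last_octet, interface_subnets):
    """Give every interface of a node the same final IPv4 octet.

    last_octet: the shared host octet (e.g. node 5 gets .5 on every NIC).
    interface_subnets: mapping of interface name -> /24 subnet string.
    """
    addresses = {}
    for nic, subnet in interface_subnets.items():
        net = ipaddress.ip_network(subnet)
        # network_address + last_octet: e.g. 10.1.0.0 -> 10.1.0.5
        addresses[nic] = str(net.network_address + last_octet)
    return addresses

print(aligned_addresses(5, {"admin": "10.1.0.0/24", "storage": "10.2.0.0/24"}))
# {'admin': '10.1.0.5', 'storage': '10.2.0.5'}
```

Keeping the host octet constant across interfaces makes it easy to map any address back to its node when troubleshooting.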

This is the top lesson you’ll see reflected in the Epoxy release:  RackN will keep finding ways to adapt to the ops environment.  

RackN fills holes with OpenCrowbar Drill Release

By Rob Hirschfeld

I’ve been relatively quiet about the OpenCrowbar Drill release and that’s not a fair reflection of the capability and maturity in the code base; however, it really just sets the stage for truly ground-breaking ops automation work in the next release (“Epoxy”).

So, what’s in Drill?  Scale and Containers on Metal Workloads!  https://github.com/opencrowbar/core/releases

The primary focus for this release was proving our functional operations architectural pattern against a wide range of workloads and that is exactly what the RackN team has been doing with Ceph, Docker, Kubernetes, CloudFoundry and StackEngine workloads.

In addition to workloads, we put the platform through its paces in real ops environments at scale.  That resulted in even richer network configurations and options plus performance and tuning.  The RackN team continues to adapt OpenCrowbar to match real world ops.

One critical lesson you’ll see more of in the Epoxy release: OpenCrowbar and the team at RackN will keep finding ways to adapt to the ops environment.  We believe that tools should adapt to their environments: we’ve encountered some pretty extreme quirks, and our philosophy is to embrace them rather than force change.

Defending Ops without Killing Unicorns

By Rob Hirschfeld

Cloud Ops is a brutal business: operators are expected to maintain a stable and robust operating environment while also embracing waves of disruptive changes using unproven technologies. While we want to promote these promising new technologies, the unicorns, operators still have to keep the lights on; consequently, most companies turn to outside experts or internal strike teams to get this new stuff working.

Our experience is that doing an on-site deployment by professional services (PS) is often much harder than expected. Why? Because of inherent mission conflict. The PS paratrooper team sent to accomplish the “install Foo!” mission is at odds with the operators’ maintain-and-defend mission. Where the short-term team is willing to blast open a wall for access, the long-term team is highly averse to collateral damage. Both teams are faced with an impossible situation.

I’ve been promoting Open Ops around a common platform (obviously, OpenCrowbar in my opinion) as a way to address cross-site standardization.

Why would a physical automation standard help? Generally, the pros expect to arrive with everything “ready state” including OS installed and all the networking ready. Unfortunately, there’s a significant gap between an OS installed and … installs are always a journey of discovery as the teams figure out the real requirements.

Here are some questions that we’ve put together to gauge if the installs are really going the way you think:

  • How often is the customer site ready for deployment?  If not, how long does that take to correct?
  • How far into a deployment do you get before an error is detected?  How often is that error repeated across all the systems?
  • How often is an “error” actually an operational requirement at the site that cannot be changed without executive approval and weeks of work?
  • How often are issues found after deployment is started that cause an install restart?
  • Can the deployment be recreated on another site?  Can the install be recreated in a test environment for troubleshooting?
  • How often are systems hand-updated or custom-updated as part of a routine installation?
  • How often are developers needed to troubleshoot issues that end up being configuration related?
  • How often are back-doors left in place to help with troubleshooting?
  • What is the upgrade process like?  Is the state of the “as left” system sufficiently documented to plan an upgrade?
  • What happens if there’s a major OS upgrade or patch required that impacts the installed application?
  • Can changes to the site be rolled out in stages?
  • Can the upgrade be automated and rehearsed?

DefCore Update – slowly taming the Interop hydra.

Last month, the OpenStack board charged the DefCore committee to tighten the specification. That means adding more required capabilities to the guidelines and reducing the number of exceptions (“flags”).  Read the official report by Chris Hoge.

Cartography by Dave McAlister is licensed under a Creative Commons Attribution 4.0 International License.

It turns out interoperability is really, really hard in heterogeneous environments because it’s not just about the API – implementation choices change behavior.

I see this in both the cloud and physical layers. Since OpenStack is setup as a multi-vendor and multi-implementation (private/public) ecosystem, getting us back to a shared least common denominator is a monumental challenge. I also see a similar legacy in physical ops with OpenCrowbar where each environment is a snowflake and operators constantly reinvent the same tooling instead of sharing expertise.

Lack of commonality means the industry wastes significant effort recreating operational knowledge for marginal return. Increasing interop means reducing variations which, in turn, increases the stakes for vendors seeking differentiation.

We’ve been working on DefCore for years so that we could get to this point. Our first real Guideline, 2015.03, was an intentionally low bar with nearly half of the expected tests flagged as non-required. While the latest guidelines do not add new capabilities, they substantially reduce the number of exceptions granted. Further, we are in process of adding networking capabilities for the planned 2016.01 guideline (ready for community review at the Tokyo summit).

Even though these changes take a long time to become fully required for vendors, we can start testing interoperability of clouds using them immediately.

While the DefCore guidelines, backed by Foundation licensing policy, do have teeth, vendors can take up to three years [1] to comply. That may sound slow, but the real authority of the program comes from customer and vendor participation, not enforcement [2].

For that reason, I’m proud that DefCore has become a truly diverse and broad initiative.

I’m further delighted by the leadership demonstrated by Egle Sigler, my co-chair, and Chris Hoge, the Foundation staff leading DefCore implementation.  Happily, their enthusiasm is also shared by many other people with long term DefCore investments including mid-cycle attendees Mark Volker (VMware), Catherine Deip (IBM) who is also a RefStack PTL, Shamail Tahir (EMC), Carol Barrett (Intel), Rocky Grober (Huawei), Van Lindberg (Rackspace), Mark Atwood (HP), Todd Moore (IBM), Vince Brunssen (IBM). We also had four DefCore related project PTLs join our mid-cycle: Kyle Mestery (Neutron), Nikhil Komawar (Glance),  John Dickinson (Swift), and Matthew Treinish (Tempest).

Thank you all for helping keep DefCore rolling and working together to tame the interoperability hydra!

[1] On the current schedule – changes will now take 1 year to become required – vendors have a three year tail! Three years? Since the last two Guidelines are active, the earliest the networking capabilities become required is after 2016.01 is superseded in January 2017. Vendors who (re)license just before that can use the mark for 12 months (until January 2018!)

[2] How can we make this faster? Simple: consumers need to demand that their vendor pass the latest guidelines. DefCore provides Guidelines, but consumers’ checkbooks are the real power in the ecosystem.

As Docker rises above (and disrupts) clouds, I’m thinking about their community landscape

Watching the lovefest of DockerCon last week had me digging up my April 2014 “Can’t Contain(erize) the Hype” post.  There’s no doubt that Docker (and containers more broadly) is delivering on its promise.  I was impressed with the container community navigating towards an open platform in RunC and vendor adoption of the trusted container platforms.

I’m a fan of containers and their potential; yet, remotely watching the scope and exuberance of Docker partnerships seems out of proportion with the current capabilities of the technology.

The latest update to the Docker technology, v1.7, introduces a lot of important network, security and storage features.  The price of all that progress is disruption to ongoing work and integration to the ecosystem.

There’s always two sides to the rapid innovation coin: “Sweet, new features!  Meh, breaking changes to absorb.”

Docker Ecosystem Explained

There remains confusion between Docker the company and Docker the technology.  I like how the chart (right) maps out potential areas in the Docker ecosystem.  There are clearly a lot of places for companies to monetize the technology; however, it’s not as clear if the company will cede lucrative regions, like orchestration, to become a competitive landscape.

While Docker has clearly delivered a lot of value in just a year, they have a fair share of challenges ahead.  

If OpenStack is a leading indicator, we can expect to see vendor battlegrounds forming around networking and storage.  Docker (the company) has a chance to show leadership and build community here, yet could cause harm by giving up the arbitrator role to be a contender instead.

One thing that would help control the inevitable border skirmishes will be clear definitions of core, ecosystem and adjacencies.  I see Docker blurring these lines with some of their tools around orchestration, networking and storage.  I believe that was part of their now-suspended kerfuffle with CoreOS.

Thinking a step further, parts of the Docker technology (RunC) have moved over to Linux Foundation governance.  I wonder if the community will drive additional shared components into open governance.  Looking at Node.js, there’s clear precedent and I wonder if Joyent’s big Docker plans have them thinking along these lines.

Is there something between a Container and VM? Apparently, yes.

The RackN team has started designing reference architectures for containers on metal (discussed on TheNewStack.io) with the hope of finding a hardware design that is cost- and performance-optimized for containers instead of simply repurposing premium virtualized cloud infrastructure.  That discussion turned up something unexpected…

That post generated a twitter thread that surfaced Hyper.sh and ClearLinux as hardware enabled (Intel VT-x) alternatives to containers.

This container alternative likely escapes the notice of many because it requires hardware capabilities that are not, or only partially, exposed inside cloud virtual machines; however, it could be a very compelling story for operators looking for containers on metal.

Here’s my basic understanding: these technologies offer container-like lightweight & elastic behavior with the isolation provided by virtual machines.  This is possible because they use CPU capabilities to isolate environments.

7/3 Update: Feedback about this post has largely been “making it easier for VMs to run docker automatically is not interesting.”  What’s your take on it?

Details behind RackN Kubernetes Workload for OpenCrowbar

Since I’ve already bragged about how this workload validates OpenCrowbar’s deep ops impact, I can get right down to the nuts and bolts of what RackN CTO Greg Althaus managed to pack into this workload.

Like any scale install, once you’ve got a solid foundation, the actual installation goes pretty quickly.  In Kubernetes’ case, that means creating strong networking and etcd configuration.

Here’s a 30 minute video showing the complete process from O/S install to working Kubernetes:

Here are the details:

Clustered etcd – distributed key store

etcd is the central data service that maintains the state for the Kubernetes deployment.  The strength of the installation rests on the correctness of etcd.  The workload builds an etcd cluster and synchronizes all the instances as nodes are added.
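For a rough picture of what “synchronizing the instances” involves, here is a sketch of building etcd’s static `--initial-cluster` bootstrap value as members join. The node names and IPs are hypothetical; 2380 is etcd’s conventional peer port:

```python
def initial_cluster(peers, peer_port=2380):
    """Build an etcd --initial-cluster value from node name -> IP pairs.

    Every member must agree on this list for the cluster to bootstrap,
    which is why the workload regenerates it as nodes are added.
    """
    return ",".join(
        f"{name}=http://{ip}:{peer_port}" for name, ip in sorted(peers.items())
    )

nodes = {"etcd-1": "10.1.0.5", "etcd-2": "10.1.0.6", "etcd-3": "10.1.0.7"}
print(initial_cluster(nodes))
# etcd-1=http://10.1.0.5:2380,etcd-2=http://10.1.0.6:2380,etcd-3=http://10.1.0.7:2380
```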

Networking with Flannel and Proxy

Flannel is the default overlay network for Kubernetes that handles IP assignment and inter-container communication with UDP encapsulation.  The workload configures Flannel for networking with etcd as the backing store.
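Flannel of this vintage reads its overlay layout from a JSON document stored in etcd, conventionally under `/coreos.com/network/config`. A minimal sketch of that configuration follows; the subnet range is illustrative, not necessarily the workload’s default:

```python
import json

# Flannel reads its overlay layout from a JSON value stored in etcd;
# /coreos.com/network/config is the conventional key. The subnet and
# backend shown are illustrative values.
flannel_key = "/coreos.com/network/config"
flannel_config = json.dumps({
    "Network": "10.244.0.0/16",   # pool flannel carves per-node subnets from
    "Backend": {"Type": "udp"},   # UDP encapsulation, as described above
})

print(flannel_key, flannel_config)
```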

An important part of the overall networking setup is the configuration of a proxy so that the nodes can get external access for Docker image repos.

Docker Setup

We install the latest Docker on the system.  That may not sound very exciting; however, Docker iterates faster than most Linux images so it’s important that we keep you current.

Master & Minion Kubernetes Nodes

Using etcd as a backend, the workload sets up one (or more) master nodes with the API server and other master services.  When the minions are configured, they are pointed to the master API server(s).  You get to choose how many masters and which systems become masters.  If you did not choose correctly, it’s easy to rinse and repeat.
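A toy sketch of that master/minion split; the node names, port and http scheme are assumptions for illustration, not the workload’s actual endpoints:

```python
def kubernetes_roles(nodes, master_count=1, api_port=8080):
    """Pick masters and point every minion at the master API servers.

    The first master_count nodes run the API server; the rest become
    minions configured with the full master endpoint list. Choosing
    differently is just a re-run with new arguments (rinse and repeat).
    """
    masters = nodes[:master_count]
    api_servers = [f"http://{m}:{api_port}" for m in masters]
    minions = {n: api_servers for n in nodes[master_count:]}
    return masters, minions

masters, minions = kubernetes_roles(["n1", "n2", "n3", "n4"], master_count=2)
print(masters)  # ['n1', 'n2']
print(minions)  # {'n3': ['http://n1:8080', 'http://n2:8080'], 'n4': [...]}
```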

Highly Available using DNS Round Robin

As the workload configures API servers, it also adds them to a DNS round robin pool (made possible by [new DNS integrations]).  Minions are configured to use the shared DNS name so that they automatically round-robin all the available API servers.  This ensures both load balancing and high availability.  The pool is automatically updated when you add or remove servers.
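The client side of that round robin can be modeled simply: one shared DNS name resolves to every API server address, and minions cycle through the answers. The addresses below are illustrative:

```python
import itertools

def api_pool(dns_records):
    """Model a DNS round-robin pool for the Kubernetes API servers.

    A single shared name resolves to every API server address; clients
    cycling through the answers get both load balancing and failover.
    """
    return itertools.cycle(dns_records)

pool = api_pool(["10.1.0.10", "10.1.0.11"])
print([next(pool) for _ in range(4)])
# ['10.1.0.10', '10.1.0.11', '10.1.0.10', '10.1.0.11']
```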

Installed on Real Metal

It’s worth including that we’ve done cluster deployments of 20 physical nodes (with 80 in process!).  Since the OpenCrowbar architecture abstracts the vendor hardware, the configuration is multi-vendor and heterogeneous.  That means that this workload (and our others) delivers tangible scale implementations quickly and reliably.

Future Work for Advanced Networking

Flannel is really a very basic SDN.  We’d like to see additional networking integrations, including OpenContrail as per Pedro Marques’ work.

At this time, we are not securing communication with etcd.  This requires key management and is a more advanced topic.

Why is RackN building this?  We are a physical ops automation company.

We are seeking to advance the state of data center operations by helping get complex scale platforms operationalized.  We want to work with the relevant communities to deliver repeatable best practices around next-generation platforms like Kubernetes.  Our specialty is creating a general environment for ops success: we work with partners who are experts on using the platforms.

We want to engage with potential users before we turn this into an open community project; however, we’ve chosen to make the code public.  Please get involved (community forum)!  You’ll need a working OpenCrowbar or RackN Enterprise install as a pre-req and we want to help you be successful.