DevOps workers, you mother was right: always bring a clean Underlay.

Why did your mom care about underwear? She wanted you to have good hygiene. What is good Ops hygiene? It’s not as simple as keeping up with the laundry, but the idea is similar. It means that we’re not going to get surprised by something in our environment that we’d taken for granted. It means that we have a fundamental level of control to keep clean. Let’s explore this in context.

l_1600_1200_9847591C-0837-4A7D-A69D-54041685E1C6.jpegI’ve struggled with the term “underlay” for infrastructure of a long time. At RackN, we generally prefer the term “ready state” to describe getting systems prepared for install; however, underlay fits very well when we consider it as the foundation for a more building up a platform like Kubernetes, Docker Swarm, Ceph and OpenStack. Even more than single operator applications, these community built platforms require carefully tuned and configured environments. In my experience, getting the underlay right dramatically reduces installation challenges of the platform.

What goes into a clean underlay? All your infrastructure and most of your configuration.

Just buying servers (or cloud instances) does not make a platform. Cloud underlay is nearly as complex, but let’s assume metal here. To turn nodes into a cluster, you need setup their RAID and BIOS. Generally, you’ll also need to configure out-of-band management IPs and security. Those RAID and BIOS settings specific to the function of each node, so you’d better get that right. Then install the operating system. That will need access keys, IP addresses, names, NTP, DNS and proxy configuration just as a start. Before you connect to the wide, make sure to update to your a local mirror and site specific requirements. Installing Docker or a SDN layer? You may have to patch your kernel. It’s already overwhelming and we have not even gotten to the platform specific details!

Buried in this long sequence of configurations are critical details about your network, storage and environment.

Any mistake here and your install goes off the rails. Imagine that your building a house: it’s very expensive to change the plumbing lines once the foundation is poured. Thankfully, software configuration is not concrete but the costs of dealing with bad setup is just as frustrating.

The underlay is the foundation of your install. It needs to be automated and robust.

The challenge compounds once an installation is already in progress because adding the application changes the underlay. When (not if) you make a deploy mistake, you’ll have to either reset the environment or make your deployment idempotent (meaning, able to run the same script multiple times safely). Really, you need to do both.

Why do you need both fast resets and component idempotency? They each help you troubleshoot issues but in different ways. Fast resets ensure that you understand the environment your application requires. Post install tweaks can mask systemic problems that will only be exposed under load. Idempotent action allows you to quickly iterate over individual steps to optimize and isolate components. Together they create resilient automation and good hygiene.

In my experience, the best deployments involved a non-recoverable/destructive performance test followed by a completely fresh install to reset the environment. The Ops equivalent of a full dress rehearsal to flush out issues. I’ve seen similar concepts promoted around the Netflix Chaos Monkey pattern.

If your deployment is too fragile to risk breaking in development and test then you’re signing up for an on-going life of fire fighting. In that case, you’ll definitely need all the “clean underware” you can find.

We need DevOps without Borders! Is that “Hybrid DevOps?”

The RackN team has been working on making DevOps more portable for over five years.  Portable between vendors, sites, tools and operating systems means that our automation needs be to hybrid in multiple dimensions by design.

Why drive for hybrid?  It’s about giving users control.

launch!I believe that application should drive the infrastructure, not the reverse.  I’ve heard may times that the “infrastructure should be invisible to the user.”  Unfortunately, lack of abstraction and composibility make it difficult to code across platforms.  I like the term “fidelity gap” to describe the cost of these differences.

What keeps DevOps from going hybrid?  Shortcuts related to platform entangled configuration management.

Everyone wants to get stuff done quickly; however, we make the same hard-coded ops choices over and over again.  Big bang configuration automation that embeds sequence assumptions into the script is not just technical debt, it’s fragile and difficult to upgrade or maintain.  The problem is not configuration management (that’s a critical component!), it’s the lack of system level tooling that forces us to overload the configuration tools.

What is system level tooling?  It’s integrating automation that expands beyond configuration into managing sequence (aka orchestration), service orientation, script modularity (aka composibility) and multi-platform abstraction (aka hybrid).

My ops automation experience says that these four factors must be solved together because they are interconnected.

What would a platform that embraced all these ideas look like?  Here is what we’ve been working towards with Digital Rebar at RackN:

Mono-Infrastructure IT “Hybrid DevOps”
Locked into a single platform Portable between sites and infrastructures with layered ops abstractions.
Limited interop between tools Adaptive to mix and match best-for-job tools.  Use the right scripting for the job at hand and never force migrate working automation.
Ad hoc security based on site specifics Secure using repeatable automated processes.  We fail at security when things get too complex change and adapt.
Difficult to reuse ops tools Composable Modules enable Ops Pipelines.  We have to be able to interchange parts of our deployments for collaboration and upgrades.
Fragile Configuration Management Service Oriented simplifies API integration.  The number of APIs and services is increasing.  Configuration management is not sufficient.
 Big bang: configure then deploy scripting Orchestrated action is critical because sequence matters.  Building a cluster requires sequential (often iterative) operations between nodes in the system.  We cannot build robust deployments without ongoing control over order of operations.

Should we call this “Hybrid Devops?”  That sounds so buzz-wordy!

I’ve come to believe that Hybrid DevOps is the right name.  More technical descriptions like “composable ops” or “service oriented devops” or “cross-platform orchestration” just don’t capture the real value.  All these names fail to capture the portability and multi-system flavor that drives the need for user control of hybrid in multiple dimensions.

Simply put, we need devops without borders!

What do you think?  Do you have a better term?

Post-OpenStack DefCore, I’m Chasing “open infrastructure” via cross-platform Interop

Like my previous DefCore interop windmill tilting, this is not something that can be done alone. Open infrastructure is a collaborative effort and I’m looking for your help and support. I believe solving this problem benefits us as an industry and individually as IT professionals.

2013-09-13_18-56-39_197So, what is open infrastructure?   It’s not about running on open source software. It’s about creating platform choice and control. In my experience, that’s what defines open for users (and developers are not users).

I’ve spent several years helping lead OpenStack interoperability (aka DefCore) efforts to ensure that OpenStack cloud APIs are consistent between vendors. I strongly believe that effort is essential to build an ecosystem around the project; however, in talking to enterprise users, I’ve learned that that their  real  interoperability gap is between that many platforms, AWS, Google, VMware, OpenStack and Metal, that they use everyday.

Instead of focusing inward to one platform, I believe the bigger enterprise need is to address automation across platforms. It is something I’m starting to call hybrid DevOps because it allows users to mix platforms, service APIs and tools.

Open infrastructure in that context is being able to work across platforms without being tied into one platform choice even when that platform is based on open source software. API duplication is not sufficient: the operational characteristics of each platform are different enough that we need a different abstraction approach.

We have to be able to compose automation in a way that tolerates substitution based on infrastructure characteristics. This is required for metal because of variation between hardware vendors and data center networking and services. It is equally essential for cloud because of variation between IaaS capabilities and service delivery models. Basically, those  minor  differences between clouds create significant challenges in interoperability at the operational level.

Rationalizing APIs does little to address these more structural differences.

The problem is compounded because the differences are not nicely segmented behind abstraction layers. If you work to build and sustain a fully integrated application, you must account for site specific needs throughout your application stack including networking, storage, access and security. I’ve described this as all deployments have 80% of the work common but the remaining 20% is mixed in with the 80% instead of being nicely layers. So, ops is cookie dough not vinaigrette.

Getting past this problem for initial provisioning on a single platform is a false victory. The real need is portable and upgrade-ready automation that can be reused and shared. Critically, we also need to build upon the existing foundations instead of requiring a blank slate. There is openness value in heterogeneous infrastructure so we need to embrace variation and design accordingly.

This is the vision the RackN team has been working towards with open source Digital Rebar project. We now able to showcase workload deployments (Docker, Kubernetes, Ceph, etc) on multiple cloud platforms that also translate to full bare metal deployments. Unlike previous generations of this tooling (some will remember Crowbar), we’ve been careful to avoid injecting external dependencies into the DevOps scripts.

While we’re able to demonstrate a high degree of portability (or fidelity) across multiple platforms, this is just the beginning. We are looking for users and collaborators who want to want to build open infrastructure from an operational perspective.

You are invited to join us in making open cross-platform operations a reality.

Deployment Fidelity – reducing tooling transistions for fun and profit

At the OpenStack Tokyo summit, I gave a short interview on Deployment Fidelity.  I’ve come to see the fidelity problem more broadly as the hybrid DevOps challenge that I described in my 2016 Predictions post as the end of mono-clouds.  Thanks Ken Hui from OpenStack Superuser TV for resurfacing this link!

A year of RackN – 9 lessons from the front lines of evangalizing open physical ops

Let’s avoid this > “We’re heading right at the ground, sir!  Excellent, all engines full power!

another scale? oars & motors. WWF managing small scale fisheries

RackN is refining our from “start to scale” message and it’s also our 1 year anniversary so it’s natural time for reflection. While it’s been a year since our founders made RackN a full time obsession, the team has been working together for over 5 years now with the same vision: improve scale datacenter operations.

As a backdrop, IT-Ops is under tremendous pressure to increase agility and reduce spending.  Even worse, there’s a building pipeline of container driven change that we are still learning how to operate.

Over the year, we learned that:

  1. no one has time to improve ops
  2. everyone thinks their uniqueness is unique
  3. most sites have much more in common than is different
  4. the differences between sites are small
  5. small differences really do break automation
  6. once it breaks, it’s much harder to fix
  7. everyone plans to simplify once they stop changing everything
  8. the pace of change is accelerating
  9. apply, rinse, repeat with lesson #1

Where does that leave us besides stressed out?  Ops is not keeping up.  The solution is not to going faster: we have to improve first and then accelerate.

What makes general purpose datacenter automation so difficult?  The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is ordering of interdependencies.  This is especially true on physical systems where you have to really grok* networking.

The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.

What does that heterogeneous-first automation look like? First, we’ve learned that to adapt to customer datacenters. That means using the DNS, DHCP and other services that you already have in place. And dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.

This was hard and tested both our patience and architecture pattern. It would be much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” about forcing uniformity and built software that accepts variation.

So what did we do with a year?  We had to spend a lot of time listening and learning what “real operations” need.   Then we had to create software that accommodated variation without breaking downstream automation.  Now we’ve made it small enough to run on a desktop or cloud for sandboxing and a new learning cycle begins.

We’d love to have you try it out: rebar.digital.

* Grok is the correct work here.  Thinking that you “understand networking” is often more dangerous when it comes to automation.

Introducing Digital Rebar. Building strong foundations for New Stack infrastructure

digital_rebarThis week, I have the privilege to showcase the emergence of RackN’s updated approach to data center infrastructure automation that is container-ready and drives “cloud-style” DevOps on physical metal.  While it works at scale, we’ve also ensured it’s light enough to run a production-fidelity deployment on a laptop.

You grow to cloud scale with a ready-state foundation that scales up at every step.  That’s exactly what we’re providing with Digital Rebar.

Over the past two years, the RackN team has been working on microservices operations orchestration in the OpenCrowbar code base.  By embracing these new tools and architecture, Digital Rebar takes that base into a new directions.  Yet, we also get to leverage a scalable heterogeneous provisioner and integrations for all major devops tools.  We began with critical data center automation already working.

Why Digital Rebar? Traditional data center ops is being disrupted by container and service architectures and legacy data centers are challenged with gracefully integrating this new way of managing containers at scale: we felt it was time to start a dialog the new foundational layer of scale ops.

Both our code and vision has substantially diverged from the groundbreaking “OpenStack Installer” MVP the RackN team members launched in 2011 from inside Dell and is still winning prizes for SUSE.

We have not regressed our leading vendor-neutral hardware discovery and configuration features; however, today, our discussions are about service wrappers, heterogeneous tooling, immutable container deployments and next generation platforms.

Over the next few days, I’ll be posting more about how Digital Rebar works (plus video demos).

Deploy to Metal? No sweat with RackN new Ansible Dynamic Inventory API

Content originally posted by Ansibile & RackN so I added a video demo.  Also, see Ansible’s original post for more details about the multi-vendor “Simple OpenStack Initiative.”

The RackN team takes our already super easy Ansible integration to a new level with added SSH Key control and dynamic inventory with the recent OpenCrowbar v2.3 (Drill) release.  These two items make full metal control more accessible than ever for Ansible users.

The platform offers full key management.  You can add keys at the system. deployment (group of machines) and machine levels.  These keys are operator settable and can be added and removed after provisioning has been completed.  If you want to control access to groups on a servers or group of server basis, OpenCrowbar provides that control via our API, CLI and UI.

We also provide a API path for Ansible dynamic inventory.  Using the simple Python client script (reference example), you can instantly a complete upgraded node inventory of your system.  The inventory data includes items like number of disks, cpus and amount of RAM.  If you’ve grouped machines in OpenCrowbar, those groups are passed to Ansible.  Even better, the metadata schema includes the networking configuration and machine status.

With no added configuration, you can immediately use Ansible as your multi-server CLI for ad hoc actions and installation using playbooks.

Of course, the OpenCrowbar tools are also available if you need remote power control or want a quick reimage of the system.

RackN respects that data centers are heterogenous.  Our vision is that your choice of hardware, operating system and network topology should not break devops deployments!  That’s why we work hard to provide useful abstracted information.  We want to work with you to help make sure that OpenCrowbar provides the right details to create best practice installations.

For working with bare metal, there’s no simpler way to deliver consistent repeatable results