100,000 of Anything is Hard – Scaling Concerns for Digital Rebar Architecture

Our architectural plans for Digital Rebar are beyond big – they are for massive distributed scale. Not up, but out. We are designing for the case where we have common automation content packages distributed over 100,000 stand-alone sites (think 5G cell towers) that are not synchronously managed. In that case, there will be version drift between the endpoints and content. For example, we may need to patch an installation script quickly over a whole fleet but want to upgrade the endpoints more slowly.

It’s a hard problem and it’s why we’ve focused on composable systems and fine-grained versioning.

It’s also part of the RackN move to a biweekly release cadence for Digital Rebar. That means we are iterating from tip development to stable every two weeks. We keep the cadence fast because we don’t want operators to have to deploy the development “tip” just to get access to features or bug fixes.

This works for several reasons. First, much of the Digital Rebar value is delivered as content instead of in the scaffolding. Each content package has its own version cycle and is not tied to Digital Rebar versions. Second, many Digital Rebar features are relatively small, incremental additions. Faster releases allow content creators and operators to access that buttery goodness more quickly without trying to manage the less stable development tip.

Feature flags are a critical enabler for this release pace. Starting in v3.2, Digital Rebar introduced system-level flags that are set when new features are added. These flags allow content developers to introspect the system in a multi-version way to see which behaviors are available on each endpoint. This is much more consistent and granular than version matching.

We are not designing a single endpoint system: we are planning for content that spans 1,000s of endpoints.

Feature flags are part of our 100,000 endpoint architecture thinking. In large scale systems, there can be significant version drift within a fleet deployment. We have to expect that automation designers will want to enable advanced features before they are universally deployed in the fleet. That means the system needs a way to easily advertise specific capabilities internally. Automation can then be written with different behaviors depending on the environment. For example, a change to job exit codes could have broken existing scripts, except that scripts used flags to determine which codes were appropriate for each endpoint. These are NOT API issues that map well to semantic versioning (semver); they are deeper system behaviors.
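
To show how that introspection might look in practice, here is a minimal Go sketch that asks a Digital Rebar Provision endpoint which features it advertises and branches on the answer. The endpoint address, credentials, JSON field names and the “sane-exit-codes” flag name are assumptions for illustration only; check your endpoint’s API documentation for the exact values.

```go
// featurecheck.go - minimal sketch: query a Digital Rebar Provision endpoint
// and branch automation behavior on an advertised feature flag.
// Assumptions: the /api/v3/info route, the credentials, the JSON keys and the
// "sane-exit-codes" flag name are illustrative, not guaranteed.
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type info struct {
	Version  string   `json:"version"`
	Features []string `json:"features"`
}

func hasFeature(i info, flag string) bool {
	for _, f := range i.Features {
		if f == flag {
			return true
		}
	}
	return false
}

func main() {
	// Self-signed certificates are common on lab endpoints; skip verification here only.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, err := http.NewRequest("GET", "https://drp.example.local:8092/api/v3/info", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.SetBasicAuth("rocketskates", "r0cketsk8ts") // replace with real credentials

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var i info
	if err := json.NewDecoder(resp.Body).Decode(&i); err != nil {
		log.Fatal(err)
	}

	// Branch behavior on the advertised flag rather than on the endpoint version.
	if hasFeature(i, "sane-exit-codes") {
		fmt.Println("endpoint advertises the new exit codes; use the new convention")
	} else {
		fmt.Println("older endpoint; fall back to the legacy exit codes")
	}
}
```

The point of the pattern is that the same automation content can run against a mixed fleet: it asks each endpoint what it can do instead of guessing from a version number.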

This matters even if you only have a single endpoint because it also enables sharing in the Digital Rebar community.

Without these changes, composable automation designed for the Digital Rebar community would quickly become brittle and hard to maintain. Our goal is to decouple endpoint and content. That same decoupling allows the community to share packages and large-scale sites to coordinate upgrades. I don’t think we’re done yet. This is a hard problem and we’re still working through the intricacies of updating and delivering composable automation.

It’s the type of complex, operational thinking that excites the RackN engineering team. I hope it excites you too because we’d love to get your thinking on how to make it even better!

 

Putting a little ooooh! in orchestration

The RackN team is proud of saying that we left the Orchestration out when we migrated from Digital Rebar v2 to v3. That would mean more if anyone actually agreed on what orchestration means… In our case, I think we can be pretty specific: Digital Rebar v3 does not manage work across multiple nodes. At this point, we’re emphatic about it because cross-machine actions add a lot of complexity and require application awareness that quickly blossoms into operational woe, torture and frustration (aka WTF).

That’s why Digital Rebar focused on doing a simple yet powerful job of multi-boot workflow on a single machine.

In the latest releases (v3.2+), we’ve delivered an easy-to-understand stage and task running system that is simple to extend, transparent in operation and extremely fast. There’s no special language (DSL) to learn or database to master. If you need those things, then we encourage you to use the excellent options from Chef, Puppet, SaltStack, Ansible and others. That’s because our primary design focus is planning work over multiple boots and operating system environments instead of between machines. Digital Rebar shines when you need 3+ reboots to automatically scrub, burn-in, inventory, install and then post-configure a machine.

But we may have crossed an orchestration line with our new cluster token capability.

Starting in the v3.4 release, automation authors will be able to use a shared profile to coordinate work between multiple machines. This is not a Digital Rebar feature per se – it’s a data pattern that leverages Digital Rebar locking, profiles and parameters to share information between machines. This allows scripts to elect leaders, create authoritative information (like tokens) and synchronize actions. The basic mechanism is simple: we create a shared machine profile that includes a token that allows editing the profile. Normally, machines can only edit themselves so we have to explicitly enable editing profiles with a special use token. With this capability, all the machines assigned to the profile can update the profile (and only that profile). The profile becomes an atomic, secure shared configuration space.

For example, when building a Kubernetes cluster using Kubeadm, the installation script needs to take different actions depending on which node is first. The first node needs to initialize the cluster master, generate a token and share its IP address. The subsequent nodes must wait until the master is initialized and then join using the token. The installation pattern is basically a first-in leader election while all others wait for the leader. There’s no need for more complex sequencing because the real install “orchestration” is done after the join when Kubernetes starts to configure the nodes.
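
To make the first-in pattern concrete, here is a minimal, self-contained Go sketch of the data pattern only. A mutex-guarded map stands in for the shared Digital Rebar profile, and the parameter names, token value and IP address are purely illustrative; a real implementation would read and write profile parameters through the DRP API using the special-use token described above.

```go
// clusterpattern.go - a minimal, in-memory model of the "cluster token" data
// pattern: the first machine to claim leadership initializes the cluster and
// publishes a join token; the rest wait for the token and then join.
// In Digital Rebar the shared state would live in a shared profile that all
// cluster machines are allowed to edit; here a mutex-guarded map stands in
// for that profile so the pattern itself is easy to see.
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedProfile stands in for a Digital Rebar profile: a small, atomic,
// shared parameter space that the cluster members can read and write.
type sharedProfile struct {
	mu     sync.Mutex
	params map[string]string
}

// claimLeader atomically sets the leader param if it is unset and reports
// whether the caller won the election (first-in wins).
func (p *sharedProfile) claimLeader(machine string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, taken := p.params["cluster/leader"]; taken {
		return false
	}
	p.params["cluster/leader"] = machine
	return true
}

func (p *sharedProfile) get(key string) (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	v, ok := p.params[key]
	return v, ok
}

func (p *sharedProfile) set(key, value string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.params[key] = value
}

func install(p *sharedProfile, machine string, wg *sync.WaitGroup) {
	defer wg.Done()
	if p.claimLeader(machine) {
		// Leader path: initialize the cluster, then publish the join token
		// and address for everyone else.
		fmt.Println(machine, "won the election, initializing the master (simulated)")
		time.Sleep(100 * time.Millisecond)
		p.set("cluster/join-token", "abcdef.0123456789abcdef")
		p.set("cluster/master-ip", "10.0.0.10")
		return
	}
	// Follower path: wait until the leader has published the token, then join.
	for {
		if token, ok := p.get("cluster/join-token"); ok {
			ip, _ := p.get("cluster/master-ip")
			fmt.Println(machine, "joining", ip, "with token", token)
			return
		}
		time.Sleep(20 * time.Millisecond)
	}
}

func main() {
	profile := &sharedProfile{params: map[string]string{}}
	var wg sync.WaitGroup
	for _, m := range []string{"node-1", "node-2", "node-3"} {
		wg.Add(1)
		go install(profile, m, &wg)
	}
	wg.Wait()
}
```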

Our experience is that recent cloud native systems are all capable of this type of shotgun start where all the nodes start in parallel with the minimal bootstrap coordination that Digital Rebar can provide.

Individually, the incremental features needed to enable cluster building were small additions to Digital Rebar. Together, they provide a simple yet powerful management underlay. At RackN, we believe that simple beats complex every day, and we’re fighting hard to make sure operations stays that way.

Podcast with Peter Miron talking NATS, Edge and the Cloud Native Computing Foundation

Joining this week’s L8ist Sh9y Podcast is Peter Miron, General Manager for the NATS project sponsored by Apcera. He provides details on this open source project, how it integrates with modern application architectures, and Apcera’s participation in the Cloud Native Computing Foundation (CNCF).

About NATS

NATS is a family of open source products that are tightly integrated but can be deployed independently. NATS is being deployed globally by thousands of companies, spanning innovative use cases including mobile apps, microservices, cloud native applications, and IoT. NATS is also available as a hosted solution, NATS Cloud.

The core NATS Server acts as a central nervous system for building distributed applications. There are dozens of clients, ranging from Java and .NET to Go. NATS Streaming extends the platform to provide for real-time streaming and big data use cases.
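
To give a flavor of that “central nervous system” role, here is a minimal publish/subscribe sketch using the official NATS Go client. The subject name and the local server URL are placeholders.

```go
// natsdemo.go - minimal NATS publish/subscribe sketch using the official Go
// client (github.com/nats-io/nats.go). Assumes a NATS server is listening on
// the default local port; the subject name is arbitrary.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a local NATS server (nats://127.0.0.1:4222 by default).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Subscribe to a subject; the handler runs for every message received.
	sub, err := nc.Subscribe("sensors.temperature", func(m *nats.Msg) {
		fmt.Printf("received on %s: %s\n", m.Subject, string(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	// Publish a message; any number of decoupled subscribers could react.
	if err := nc.Publish("sensors.temperature", []byte("21.5C")); err != nil {
		log.Fatal(err)
	}

	// Flush the connection and give the async handler a moment to fire.
	nc.Flush()
	time.Sleep(100 * time.Millisecond)
}
```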

 

Topics (Minutes.Seconds)

  • Introduction: 0.00 – 2.07
  • What is NATS?: 2.07 – 3.36
  • Built for Containers, Short Term: 3.36 – 5.14
  • Simple Example: 5.14 – 6.51
  • Container ServiceMesh Concept: 6.51 – 9.20
  • Loosely Coupled?: 9.20 – 12.02
  • Inter-process Communication: 12.02 – 15.11
  • Security: 15.11 – 18.02
  • Generic Politics Discussion: 18.02 – 24.10
  • Edge Computing & NATS: 24.10 – 28.55
  • Apps to Service Portability: 28.55 – 32.37
  • Open Source Politics – CNCF: 32.37 – 39.53
  • Conclusion: 39.53 – END

Podcast Guest: Peter Miron
General Manager for NATS team

Peter Miron is an architect at Apcera, a highly secure, policy-driven platform for cloud-native applications and microservices. He was previously the director of technology for Pershing.

Before joining Pershing, Miron worked as the SVP of engineering at Bitly and vice president at Vonage. He also worked as the CTO of Knewton.

Miron holds a bachelor’s degree in art history from Syracuse University.

 

Sirens of Open Infrastructure beacons to OpenStack Community

OpenStack is a real platform doing real work for real users.  So why does OpenStack have a reputation for not working?  It falls into the lack of core-focus paradox: being too much to too many undermines your ability to do something well.  In this case, we keep conflating the community and the code.

I have a long history with the project but have been pretty much outside of it (yay, Kubernetes!) for the last 18 months.  That perspective helps me feel like I’m getting closer to the answer after spending a few days with the community at the latest OpenStack Summit in Sydney, Australia.  While I love to think about the why, what the leaders are doing about it is very interesting too.

Fundamentally, OpenStack’s problem is that infrastructure automation is too hard and big to be solved within a single effort.  

It’s so big that any workable solution will fail for a sizable number of hopeful operators.  That does not keep people from the false aspiration that OpenStack code will perfectly fit their needs (especially if they are unwilling to trim their requirements).

But the problem is not inflated expectations for OpenStack VM IaaS code, it’s that we keep feeding them.  I have been a long time champion for a small core with a clear ecosystem boundary.  When OpenStack code claims support for other use cases, it invites disappointment and frustration.

So why is the OpenStack Foundation moving to expand its scope as an Open Infrastructure community with additional focus areas?  It’s simple: the community is asking it to.

Within the vast space of infrastructure automation, there are clusters of aligned interest.  These clusters are sufficiently narrow that they can collaborate on shared technologies and practices.  They also have a partial overlap (Venn) with adjacencies where OpenStack is already present.  There is a strong economic and social drive for members in these overlapped communities to bridge together instead of creating new, disparate groups.  Having the OpenStack Foundation organize these efforts is a natural and expected function.

The danger of this expansion comes from also carrying the expectation that the technology (code) will be carried into the adjacencies.  That’s exactly my rationale for why the original VM IaaS needs to be smaller.  The wealth of non-core projects crosses clusters of interest.  Instead of allowing these clusters to optimize around shared interests, users get the impression that they must broadly adopt unneeded or poorly fit components.  The idea of “competitive” projects should be reframed because they may overlap in function but not in use-case fit.

It’s long past time to give up the expectation that OpenStack is a “one-stop-shop” of infrastructure automation.  In my opinion, that expectation undermines the community mission by excluding adjacencies.

I believe that OpenStack must work to embrace its role as an open infrastructure community; however, it must also do the hard work to create welcoming space for adjacencies.  These adjacencies will compete with existing projects currently under the OpenStack code tent.  The community needs to embrace that the hard work done so far may simply be sunk cost for new use cases. 

It’s the OpenStack community and the experience, not the code, that creates long term value.

Building Kubernetes based highly customizable environments on OpenStack with Kubespray

This talk was given on November 8 at the OpenStack Summit Sydney event.

Abstract

Kubespray (formerly Kargo) is a project under the Kubernetes community umbrella. From the technical side, it is a set of tools that makes it possible to easily deploy a production-ready Kubernetes cluster.

Kubespray supports multiple Linux distributions for hosting the Kubernetes clusters (including Ubuntu, Debian, CentOS/RHEL and Container Linux by CoreOS) and multiple cloud providers as an underlay for the cluster deployment (AWS, DigitalOcean, GCE, Azure and OpenStack), together with the ability to use bare metal installations. It can consume Docker or rkt as the container runtime for the containerized workloads, as well as a wide variety of networking plugins (Flannel, Weave, Calico and Canal), or use the built-in cloud provider networking instead.

In this talk we will describe the options for using Kubespray to build Kubernetes environments on OpenStack and how you can benefit from it.

What can I expect to learn?

Active Kubernetes community members, Ihor Dvoretskyi and Rob Hirschfeld, will highlight the benefits of running Kubernetes on top of OpenStack and will describe how Kubespray can simplify cluster building and management for these use cases.

Complete presentation

Slides
https://www.slideshare.net/RackN/slideshelf

Speakers

Ihor Dvoretskyi

Ihor is a Developer Advocate at the Cloud Native Computing Foundation (CNCF), focused on upstream Kubernetes-related efforts. He acts as a Product Manager in the Kubernetes community, leading the Product Management Special Interest Group with the goal of growing Kubernetes as the #1 open source container orchestration platform.

Rob Hirschfeld

Rob Hirschfeld has been involved in OpenStack since the earliest days, with a focus on ops and building the infrastructure that powers cloud and storage.  He’s also co-chair of the Kubernetes Cluster Ops SIG and a four-term OpenStack board member.

 

Breaking the Silicon Floor – Digital Rebar v3.2 unlocks full life-cycle control for hardware provisioning

The difficulty in fully automating physical infrastructure environments, especially for distributed edge, adds significant cost, complexity and delay when building IT infrastructure. We’ve called this “underlay” or “ready state” in the past but “last mile” may be just as apt. The simple fact is that underlay is the foundation for everything you build above it so mistakes there are amplified.

Historically, simple systems still required manual or custom steps while complex systems were fragile and hard to learn. This dichotomy drives operators to add a cloud abstraction layer as a compromise, because the cloud adds simple provisioning APIs at the price of hidden operational complexity.

What if we had those simple APIs directly against the metal? Without the operational complexity?

That’s exactly what we’ve achieved in the latest Digital Rebar release. In this release, the RackN team refined the Digital Rebar control flows introduced in v3.1 based on customer and field experience. These flows are simple to understand, composable to build and amazingly fast in execution.

For example, you can build workflows that discover machines, run burn-in and inventory stages, install SSH keys, and automatically register the machines for Terraform consumption. Our Terraform provider can then take those machines, make new workflow requests like “install CentOS,” and report when they are ready. When you’re finished, another workflow will tear down the system and scrub the data. That’s very cloud-like behavior, but directly on metal.
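
As a rough sketch of what one of those cloud-like requests could look like directly against the DRP API, the fragment below patches a machine’s Stage to kick off an install workflow. The endpoint address, bearer token, machine UUID, stage name and the exact PATCH semantics are assumptions for illustration; in practice our Terraform provider wraps this kind of call for you.

```go
// setstage.go - rough sketch of driving a machine into an install workflow by
// patching its Stage via the Digital Rebar Provision API (RFC 6902 JSON Patch).
// The endpoint URL, bearer token, machine UUID and "centos-install" stage name
// are placeholders, and the PATCH semantics shown here are an assumption.
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
)

func main() {
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // lab-only shortcut
	}}

	// JSON Patch: replace the machine's Stage to start the install workflow.
	patch := []byte(`[{"op": "replace", "path": "/Stage", "value": "centos-install"}]`)

	url := "https://drp.example.local:8092/api/v3/machines/3945838b-0f8b-4d1e-9a2f-000000000000"
	req, err := http.NewRequest("PATCH", url, bytes.NewReader(patch))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer REPLACE_WITH_TOKEN")

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("stage change request returned", resp.Status)
}
```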

These workflows are designed to drive automatic behavior (like joining a Kubernetes cluster), simplify API requests (like target state for Terraform), or prepare environments for orchestration (like dynamic inventory for Ansible). They reflect our design goal to ensure that Digital Rebar integrates upstack easily.

Our point with Digital Rebar is to drive full automation down into the physical layer. By fixing the underlay, our approach accelerates and simplifies the orchestration and platform layers above. We’re excited about the progress and invite you to take 5 minutes to try our quick start.

Digital Rebar Releases V3.2 – Stage Workflow

In v3.2, Digital Rebar continues to refine the groundbreaking provisioning workflow introduced in v3.1. Updates to the workflow make it easier to consume by external systems like Terraform. We’ve also improved the consistency and performance of both the content and service.

Note: we are accelerating the release schedule for Digital Rebar with a target of 4 to 6 weeks per release. The goal is to incrementally capture new features in stable releases so there is not a lengthy delay before fixes and features are available.

Here’s a list of features for the v3.2 release.

  • Promoted stage automation to release status in open source – these were RackN content during beta
  • Plugins now include content layers – they don’t require separate content and versioning is easier
  • Feature flags on endpoint and content – allows automation to detect if needed requirements are in place before attempting to use them
  • Improve exit codes from jobs – improves coordination and consistency in jobs
  • Allow the runner to continue processing into the newly installed OS – helps with Terraform handoff and direct disk imaging
  • Add tooling for direct image deploy to sledgehammer – self explanatory
  • Change CLI to use Server models instead of swagger generated code – improves consistency and maintainability of the CLI
  • Machine Inventory (gohai utility) – collects machine information (in Golang!) so that automation can make decisions based on configuration
  • General bug fixes and performance enhancements – this was a release theme
  • Make it easier to export content from an endpoint – user requested feature
  • Improve how tokens and secrets are handled by the server – based on audit

The release of workflow and the addition of inventory mean that Digital Rebar v3 effectively replaces all key functions of v2 with a significantly smaller footprint, minimal learning curve and improved performance. One major v2 feature, multi-node coordination, is not on any roadmap for v3 because we believe those use cases are well served by upstack integrations like Terraform and Ansible.

HashiConf 2017: Messy yet Effective Hybrid Portability

Last week, I was able to attend the HashiConf 2017 event in my hometown of Austin, Texas.  HashiCorp has a significant following of loyal fans for their platforms and the show reflected their enthusiasm for the HashiCorp clean and functional design aesthetic.  I count the RackN team in that list – we embedded Consul deeply into Digital Rebar v2 and recently announced a cutting edge bare metal Terraform integration (demo video) with Digital Rebar Provision (v3).

Overall, the show was impressively executed.  It was a comfortable size for connecting with attendees, and most of the attendees were users instead of vendors.  The announcements at the show were also notable.  HashiCorp announced enterprise versions of all their popular platforms, including Consul, Vault, Nomad and Terraform.  The enterprise versions include a cross-cutting service, Sentinel, that provides a policy engine to help enforce corporate governance.

Since all the tools are open source, creating an enterprise version can cause angst in the community.  I felt that they handled the introduction well and the additions were well received.  Typically, governance controls are a good demarcation for Enterprise features.

I was particularly impressed with the breadth and depth of Terraform use discussed at the event.  Terraform is enjoying broad adoption as a cluster builder, so it was not surprising to see it featured in many talks.  The primary benefits highlighted were cloud portability and infrastructure as code.

This was surprising to me because Terraform plans are not actually cloud agnostic – they have to be coded to match the resources exposed by the target.

When I asked people about this, the answer was simple: the Terraform format itself provides sufficient abstraction.  The benefit of having a single tool and format across multiple infrastructures creates very effective portability.

Except that the lack of cloud abstractions also drove a messy pattern that I saw in multiple sessions.  Many companies have written custom (“soon to be open sourced”™) Terraform plan generators in their own custom markup languages.  That’s right – there’s an emerging, snowflaked Terraform generator pattern.  I completely understand the motivation to build this layer; however, it strikes me as an anti-pattern.

Infrastructure portability (aka hybrid) is both a universal goal and frighteningly complex.  Clearly, Terraform is a step in the right direction, but it’s only a step.  At HashiConf, I enjoyed watching companies trying to take that next step with varying degrees of success.  Let’s get some popcorn and see how it turns out!

Until then, check out our Digital Rebar Terraform provider.  It will make your physical infrastructure “cloud equivalent” so you can run similar plans between cloud and metal.

For more information on the Digital Rebar Terraform provider, listen to this recent podcast.

First Digital Rebar Online Meetup Next Week

Welcome to the first Digital Rebar online meetup!  In our inaugural meetup we’ll provide an introduction to  Digital Rebar Provision, name our mascot, discuss current and future features, and do a short demo of the product. The meetup is Sept 26, 2017 at 11:00am PST. Please join the community at https://www.meetup.com/digitalrebar/ and register for the event.

Online Link – https://zoom.us/j/3403934274  

We will cover the following topics:

  • Welcome !!
  • Introduction to Digital Rebar Provision (DRP) and RackN
  • Naming the Digital Rebar mascot [1]
  • Discussion on DRP version 3.1 features
  • Feature and roadmap planning for DRP version 3.2
  • Use Github Projects or Trello Board
  • Demo of DRP workload deployment
  • Getting in touch with the Digital Rebar community and RackN
  • Questions and answers period

NOTES:

Please note that we’ll be using Zoom.us for our meeting, so please check in a few minutes early and make sure you have the Zoom client installed and working.

[1]
Name the mascot: https://twitter.com/digitalrebar/status/907724637487935488
Digital Rebar Provision:  http://rebar.digital/
RackN: https://www.rackn.com/

Digital Rebar v3.1 Release Announcement

We’ve made open network provisioning radically simpler.  So simple, you can install in 5 minutes and be provisioning in under 30.  That’s a bold claim, but it’s also an essential deliverable for us to bridge the Ops execution gap in a way that does not disrupt your existing tool chains.

We’ve got a remarkable list of feature additions between Digital Rebar Provision (DRP) v3.0 and v3.1 that take it from basic provisioning to a powerful distributed infrastructure automation tool.

But first, we need to put v3.1 into a broader perspective: the new features are built from hard-learned DevOps lessons.  The v2 combination of integrated provisioning and orchestration meant we needed a lot of overhead like Docker, Compose, PostgreSQL, Consul and Rails.  That was needed for complex “one-click” cluster builds; however, it’s overkill for users of Ansible, Terraform and immutable infrastructure flows.

The v3 mantra is about starting simple and allowing users to grow automation incrementally.  RackN has been building advanced automation packages and powerful UX management to support that mission.

So what’s in the release?  The v3.0 release focused on getting the core Provision infrastructure APIs, processes and patterns working as a stand-alone service. The v3.1 release targeted major architectural needs to streamline content management and event notification, and to add out-of-band actions.

Key v3.1 Features

  • New Mascot and Logo!  We have a cloud native bare metal bear.  DRP fans should ask about stickers and t-shirts. Name coming soon! 
  • Layered Storage System. DRP storage model allows for layered storage tiers to support the content model and a read only base layer. These features allow operators to distribute content in a number of different ways and make field upgrades and multi-site synchronization possible.
  • Content packaging system.  DRP contents API allows operators to manage packages of other models via a single API call.  Content bundles are read-only and versioned so that field upgrades and patches can be distributed.
  • Plug-in system.  DRP allows API extensions and event listeners that are in the same process space as the DRP server.  This enables IPMI extensions and slack notifiers.
  • Stages, Tasks & Jobs.  DRP has a simple work queue system in which tasks are stored and tracked on machines during stages in their boot sequences.  This feature combines server and DRP client actions to create fast, simple and flexible workflows that don’t require agents or SSH access.
  • Websocket API for event subscription.  DRP clients can subscribe to system events using a long-lived websocket interface.  Subscriptions include filters so that operators can select very narrow notification scopes (see the sketch after this list).
  • Removal of the minimal embedded UI (moving to community hosted UX).   DRP decoupled the user interface from the service API.  This allows features to be added to the UX without having to replace the Service.  This also allows community members to create their own UX.  RackN has agreed to support community users at no cost on a limited version of our commercial UX.
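
To illustrate the event subscription model, here is a minimal Go sketch using the gorilla/websocket client. The websocket path, the authorization header and the registration message format are assumptions for illustration; consult the DRP event API documentation for the exact protocol.

```go
// drpevents.go - minimal sketch of subscribing to Digital Rebar Provision
// events over the websocket API using github.com/gorilla/websocket.
// The /api/v3/ws path, the bearer token header and the "register machines.*.*"
// filter message are assumptions, not guaranteed protocol details.
package main

import (
	"crypto/tls"
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

func main() {
	dialer := websocket.Dialer{
		// Lab endpoints often use self-signed certificates.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}

	header := http.Header{}
	header.Set("Authorization", "Bearer REPLACE_WITH_TOKEN")

	conn, _, err := dialer.Dial("wss://drp.example.local:8092/api/v3/ws", header)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Ask the server to send only machine events (a narrow filter).
	if err := conn.WriteMessage(websocket.TextMessage, []byte("register machines.*.*")); err != nil {
		log.Fatal(err)
	}

	// Print events as they arrive.
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("event: %s", msg)
	}
}
```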

All of these features enable DRP to perform 100% of the hardware provisioning workflows that our customers need to run a fully autonomous, CI/CD-enabled data center.  RackN has been showing examples of Ansible, Kubernetes, and Terraform-to-Metal integration as reference implementations.

Getting the physical layer right is critical to closing your infrastructure execution gaps.  DRP v3.1 goes beyond getting it right – it makes it fast, simple and open.  Take a test drive of the open source code or give RackN a call to see our advanced automation demos.