100,000 of Anything is Hard – Scaling Concerns for Digital Rebar Architecture

Our architectural plans for Digital Rebar are beyond big – they are for massive distributed scale. Not up, but out. We are designing for the case where we have common automation content packages distributed over 100,000 stand-alone sites (think 5G cell towers) that are not synchronously managed. In that case, there will be version drift between the endpoints and content. For example, we may need to patch an installation script quickly over a whole fleet but want to upgrade the endpoints more slowly.

It’s a hard problem and it’s why we’ve focused on composable systems and fine-grain versioning.

It’s also part of the RackN move into a biweekly release cadence for Digital Rebar. That means that we are iterating from tip development to stable every two weeks. It’s fast because we don’t want operators deploying the development “tip” to access features or bug fixes.

This works for several reasons. First, much of the Digital Rebar value is delivered as content instead of in the scaffolding. Each content package has it’s own version cycle and is not tied to Digital Rebar versions. Second, many Digital Rebar features are relatively small, incremental additions. Faster releases allows content creators and operators to access that buttery goodness more quickly without trying to manage the less stable development tip.

Critical enablers for this release pace are feature flags. Starting in v3.2, Digital Rebar introduced the system level tags that are set when new features are added. These flags allow content developers to introspect the system in a multi-version way to see which behaviors are available in each endpoint. This is much more consistent and granular than version matching.

We are not designing a single endpoint system: we are planning for content that spans 1,000s of endpoints.

Feature flags are part of our 100,000 endpoint architecture thinking. In large scale systems, there can be significant version drift within a fleet deployment. We have to expect that automation designers want to enable advanced features before they are universally deployed in the fleet. That means that the system needs a way to easily advertise specific capabilities internally. Automation can then be written with different behaviors depending on the environment. For example, changing exit codes could have broken existing scripts except that scripts used flags to determine which codes were appropriate for the system. These are NOT API issues that work well with semantic versioning (semver), they are deeper system behaviors.

This matters even if you only have a single endpoint because it also enables sharing in the Digital Rebar community.

Without these changes, composable automation designed for the Digital Rebar community would quickly become very brittle and hard to maintain. Our goal is to ensure a decoupling of endpoint and content. This same benefit allows the community to share packages and large scale sites to coordinate upgrades. I don’t think that we’re done yet. This is a hard problem and we’re still evolving all the intricacies of updating and delivering composable automation.

It’s the type of complex, operational thinking that excites the RackN engineering team. I hope it excites you too because we’d love to get your thinking on how to make it even better!

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s