About Rob H

A Baltimore transplant to Austin, Rob thinks about ways of building at-scale infrastructure for clouds using Agile processes. He sat on the OpenStack Foundation board for four years. He co-founded RackN to enable software that creates hyperscale converged infrastructure.

2017 SRE & DevOps Influencers

It seems fitting to start 2018 by finally posting this list I started in May.  While working on my DevOpsDays “SRE vs DevOps” presentation, I pulled an SRE and DevOps reading list from some of my favorite authors.  I quickly realized that the actual influencer list needed to be expanded some – additions and suggestions welcome.  A list like this is never complete.

Offered WITHOUT ordering… I’m sorry if I missed someone!  I’ll make it up by podcasting with them!

SRE & DevOps Focused

Developer, Open Source & Social Connectors

Completely non-technical, but I have to shout out to my hard-working author friends Heidi Joy Tretheway @heiditretheway and Jennifer Willis @jenwillis.

2017 Gartner IO & DC Wrap Up

Like other Gartner events, the Infrastructure and Operations (IO) show is all about enterprises maintaining systems.  There are plenty of hype-chasing sessions, but the vibe is distinctly around working systems and practical implementations.  Think: sports coats, not t-shirts.  In that way, it’s less breathless and wild-eyed than something like KubeCon (which is busy celebrating a bumper crop of 1.0 projects).  The very essence of this show is to project an aura of calm IT stewardship.

So what keeps these seasoned IT pros awake?  Lack of cross-vendor integration.

Terry Cosgrove of Gartner said this very clearly: “most components were not designed to work together.”  This was not just a comment about the industry, but also about products within vendor suites.  In today’s acquisitive and agile market, there’s no expectation that even products from a single vendor will integrate smoothly.  Why is integration so hard?  We’re innovating so quickly that legacy APIs and new architectures don’t align well.  For enterprises that cannot simply jump to the new-new thing, integrations drive considerable value.

Cosgrove went on to add that enterprises need to OWN the integrations – they can’t delegate that to vendors.

That advice resonated with me.  We’re clearly in a best-of-breed IT environment where hybrid and portability concerns dominate discussions.  This is not about vendor lock-in but innovation.  That leads us back to the need for better integrations between products, platforms and projects.  Customers need to start rejecting products without great, documented APIs; otherwise, there is no motivation for vendors to focus on integration over adding features.

Sadly, it was left to the audience to infer the “use dollars to force vendors to integrate” message.

There were many other topics of interest at the show.  Here’s a very short synopsis of my favorites:

  • Edge is coming and will be a big deal.  We’re still having to explain what it is.  Check back next summit (or listen to our great podcasts to get ahead of the curve).
  • AI Ops is not really AI, it’s just smarter logging.  We’ll get there eventually, but it will take some time.
  • DevOps is still a thing and it’s still hard because of the culture change required.  We’re slowly getting to a point where “DevOps = Automated Processes” and that’s OK.  If you agree with that, then you’ve missed the point of systems thinking and lean.  We’re done trying to explain it to you for now.
  • No start-ups.  Sadly, disruptive innovation is antithetical to this show and that may be OK.  The audience counts on the analysts to filter this for them instead of getting it raw.

In all these cases, it’s listener beware.  There’s more behind the curtain than you are allowed to see.

Sound and Fury as AWS Pulls Back Curtain for Bare Metal Offering

Yesterday, AWS confirmed that it actually uses physical servers to run its cloud infrastructure and, gasp, no one was surprised.  The actual news, announced by AWS Chief Evangelist Jeff Barr, is that the i3.metal instances treat bare metal as just another AMI-managed instance type (see also Geekwire, Techcrunch, Venture Beat).  For AWS users, there’s no drama here because it’s an incremental add to processes they already know well.

Infrastructure as a Service (IaaS) is fundamentally about automation and APIs, not the type of infrastructure.

Lack of drama is a key principle at RackN: provisioning hardware should be as easy to automate as a virtual machine. The addition of bare metal to the AWS instance types validates two important parts of the AWS cloud automation story.  First, having control of metal is valuable and, second, operators expect image (AMI) based deployments.

There are interesting AWS-specific items to unpack around this bare metal announcement that show otherwise hidden details about AWS infrastructure.

It took Amazon a long time to create this offering because allowing users to access bare metal requires a specialized degree of isolation inside their massive data centers.  It has only recently become possible in AWS data centers because of their custom hardware and firmware.  These changes provide AWS with a hidden control layer under the operating system abstraction.  This does not mean everyone needs this hardware – it’s an AWS-specific need based on their architecture.

It’s not a surprise that AWS has built cloud-infrastructure-optimized hardware.  All the major cloud providers design purpose-built machines with specialized firmware to handle their scale, network, security and management challenges.

The specialized hardware may create challenges for users compared to regular virtualized servers.  There are already a few added requirements for AMIs before they can run on the i3.metal instances.  Any image-to-metal deployment process requires a degree of matching between the image and the target server.  That’s the reason Digital Rebar defaults to safer (but slower) kickstart and preseed processes.

Overall, this bare metal announcement is signifying nothing dramatic and that’s a very good thing.

Automating every layer of a data center should be the expected default.  Our mission has been to make metal just another type of automated infrastructure and we’re glad to have AWS finally get on the same page with us.

100,000 of Anything is Hard – Scaling Concerns for Digital Rebar Architecture

Our architectural plans for Digital Rebar are beyond big – they are for massive distributed scale. Not up, but out. We are designing for the case where we have common automation content packages distributed over 100,000 stand-alone sites (think 5G cell towers) that are not synchronously managed. In that case, there will be version drift between the endpoints and content. For example, we may need to patch an installation script quickly over a whole fleet but want to upgrade the endpoints more slowly.

It’s a hard problem and it’s why we’ve focused on composable systems and fine-grain versioning.

It’s also part of the RackN move to a biweekly release cadence for Digital Rebar. That means we iterate from tip development to a stable release every two weeks. The cadence is fast because we don’t want operators to have to deploy the development “tip” just to access new features or bug fixes.

This works for several reasons. First, much of the Digital Rebar value is delivered as content instead of in the scaffolding. Each content package has its own version cycle and is not tied to Digital Rebar versions. Second, many Digital Rebar features are relatively small, incremental additions. Faster releases allow content creators and operators to access that buttery goodness more quickly without trying to manage the less stable development tip.

Critical enablers for this release pace are feature flags. Starting in v3.2, Digital Rebar introduced system-level tags that are set when new features are added. These flags allow content developers to introspect each endpoint to see which behaviors are available there. This is much more consistent and granular than version matching.

We are not designing a single endpoint system: we are planning for content that spans 1,000s of endpoints.

Feature flags are part of our 100,000 endpoint architecture thinking. In large scale systems, there can be significant version drift within a fleet deployment. We have to expect that automation designers want to enable advanced features before they are universally deployed in the fleet. That means that the system needs a way to easily advertise specific capabilities internally. Automation can then be written with different behaviors depending on the environment. For example, changing exit codes could have broken existing scripts except that scripts used flags to determine which codes were appropriate for the system. These are NOT API issues that work well with semantic versioning (semver), they are deeper system behaviors.
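
To make that concrete, here is a minimal sketch of flag-aware automation. It is written in Python against an assumed info route and an assumed flag name (“sane-exit-codes”), so treat the specifics as illustrative rather than the exact Digital Rebar API:

```python
# Sketch only: flag-aware automation against a Digital Rebar endpoint.
# The endpoint path, payload shape and "sane-exit-codes" flag name are
# assumptions used for illustration.
import requests

def endpoint_features(drp_url, token):
    """Return the set of feature flags advertised by a single endpoint."""
    resp = requests.get(
        f"{drp_url}/api/v3/info",
        headers={"Authorization": f"Bearer {token}"},
        verify=False,  # many lab endpoints run with self-signed certs
    )
    resp.raise_for_status()
    return set(resp.json().get("features", []))

features = endpoint_features("https://drp.example.local:8092", "example-token")

# Content adapts per endpoint instead of pinning itself to an exact version.
if "sane-exit-codes" in features:
    reboot_exit_code = 190  # assumed newer convention
else:
    reboot_exit_code = 1    # assumed legacy behavior
print(f"Using reboot exit code {reboot_exit_code}")
```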

This matters even if you only have a single endpoint because it also enables sharing in the Digital Rebar community.

Without these changes, composable automation designed for the Digital Rebar community would quickly become very brittle and hard to maintain. Our goal is to ensure a decoupling of endpoint and content. This same benefit allows the community to share packages and large scale sites to coordinate upgrades. I don’t think that we’re done yet. This is a hard problem and we’re still evolving all the intricacies of updating and delivering composable automation.

It’s the type of complex, operational thinking that excites the RackN engineering team. I hope it excites you too because we’d love to get your thinking on how to make it even better!

 

Putting a little ooooh! in orchestration

The RackN team is proud of saying that we left the orchestration out when we migrated from Digital Rebar v2 to v3. That would mean more if anyone actually agreed on what orchestration means… In our case, I think we can be pretty specific: Digital Rebar v3 does not manage work across multiple nodes. At this point, we’re emphatic about it because cross-machine actions add a lot of complexity and require application awareness that quickly blossoms into operational woe, torture and frustration (aka WTF).

That’s why Digital Rebar focuses on doing a simple yet powerful job of multi-boot workflow on a single machine.

In the latest releases (v3.2+), we’ve delivered an easy-to-understand stage and task running system that is simple to extend, transparent in operation and extremely fast. There’s no special language (DSL) to learn or database to master. And if you need those things, then we encourage you to use the excellent options from Chef, Puppet, SaltStack, Ansible and others. That’s because our primary design focus is planning work over multiple boots and operating system environments instead of between machines. Digital Rebar shines when you need 3+ reboots to automatically scrub, burn-in, inventory, install and then post-configure a machine.
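
For a sense of what that looks like, here is a rough sketch of a multi-boot chain expressed as data. The stage names and the successor-map idea are illustrative assumptions about the pattern, not the literal Digital Rebar schema:

```python
# Illustrative sketch: a single machine's multi-boot workflow as an ordered
# chain of stages. Stage names and the successor map are assumptions, not the
# literal Digital Rebar v3.2 schema.
STAGE_CHAIN = [
    "scrub",           # wipe disks from an in-memory OS
    "burn-in",         # stress-test the hardware
    "inventory",       # record firmware, RAID and NIC details
    "centos-install",  # OS install (reboot into installer, then local disk)
    "post-configure",  # final tuning once the installed OS is up
]

# Each stage names its successor on success, so the endpoint can walk the
# machine through every reboot unattended.
NEXT_STAGE = {stage: STAGE_CHAIN[i + 1] for i, stage in enumerate(STAGE_CHAIN[:-1])}

current = STAGE_CHAIN[0]
while current in NEXT_STAGE:
    print(f"stage {current} complete -> advancing to {NEXT_STAGE[current]}")
    current = NEXT_STAGE[current]
```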

But we may have crossed an orchestration line with our new cluster token capability.

Starting in the v3.4 release, automation authors will be able to use a shared profile to coordinate work between multiple machines. This is not a Digital Rebar feature per se – it’s a data pattern that leverages Digital Rebar locking, profiles and parameters to share information between machines. This allows scripts to elect leaders, create authoritative information (like tokens) and synchronize actions. The basic mechanism is simple: we create a shared machine profile that includes a token that allows editing the profile. Normally, machines can only edit themselves so we have to explicitly enable editing profiles with a special use token. With this capability, all the machines assigned to the profile can update the profile (and only that profile). The profile becomes an atomic, secure shared configuration space.

For example, when building a Kubernetes cluster using Kubeadm, the installation script needs to take different actions depending on which node is first. The first node needs to initialize the cluster master, generate a token and share its IP address. The subsequent nodes must wait until the master is initialized and then join using the token. The installation pattern is basically a first-in leader election while all others wait for the leader. There’s no need for more complex sequencing because the real install “orchestration” is done after the join when Kubernetes starts to configure the nodes.
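
To show the shape of that pattern, here is a hedged Python sketch of the first-in election. The profile name, parameter keys and API paths are made up for illustration; the real integration ships as Digital Rebar content and relies on the profile token and locking described above:

```python
# Hedged sketch of the shared-profile coordination pattern. Profile name,
# parameter keys and API paths are illustrative; the real mechanism is the
# profile-scoped token plus Digital Rebar's locking, not this exact script.
import time
import requests

DRP = "https://drp.example.local:8092"
HEADERS = {"Authorization": "Bearer cluster-profile-token"}  # token limited to one profile
PROFILE = "example-k8s-cluster"

def profile_params():
    r = requests.get(f"{DRP}/api/v3/profiles/{PROFILE}", headers=HEADERS, verify=False)
    r.raise_for_status()
    return r.json().get("Params", {})

def try_claim_leader(my_ip):
    """First-in leader election: publish the master IP if nobody has yet."""
    if "cluster/master-ip" in profile_params():
        return False  # another machine already claimed leadership
    # The real content leans on Digital Rebar's locking to keep this
    # read-then-write safe; the sketch only shows the intent.
    patch = [{"op": "add", "path": "/Params/cluster~1master-ip", "value": my_ip}]
    r = requests.patch(f"{DRP}/api/v3/profiles/{PROFILE}",
                       json=patch, headers=HEADERS, verify=False)
    return r.ok

if try_claim_leader("10.0.0.11"):
    print("Leader: run 'kubeadm init', then publish cluster/join-token to the profile.")
else:
    # Followers wait for the leader to publish the join token, then join.
    while "cluster/join-token" not in profile_params():
        time.sleep(10)
    print("Follower: joining with the published token and master IP.")
```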

Our experience is that recent cloud native systems are all capable of this type of shotgun start where all the nodes start in parallel with the minimal bootstrap coordination that Digital Rebar can provide.

Individually, the incremental features needed to enable cluster building were small additions to Digital Rebar. Together, they provide a simple yet powerful management underlay. At RackN, we believe that simple beats complex every day and we’re fighting hard to make sure operations stays that way.

Terraform Bare Metal – A Leap forward for SDx

Software Defined Infrastructure (SDx) allows operators to manage data centers in a more consistent and controlled way. It allows teams to define their environment as code and use automation to execute that definition in practice. To deliver this capability for physical (aka bare metal) servers, RackN has created a Digital Rebar provider for Terraform. The provider is a simple addition that takes just seconds to enable. (Video Demonstrations at End of Blog)

The Terraform Bare Metal provider allows plans to provision and recover servers using a node resource.

The operation of this provider is simple and relies on standard workflow stages in Digital Rebar. Adding the Terraform Content Package installs a new stage that adds Terraform parameters. Including this stage in the global workflow will automatically register machines as available for Terraform. The integration uses two parameters to manage the server pool: Terraform Managed and Terraform Assigned.

When the Terraform provider asks for a node resource, it queries the Digital Rebar API for machines that are managed (true) and not assigned (false), plus whatever additional filters were specified in the plan. The provider then uses the API to set assigned to true and the requested stage (e.g. centos-install), and polls until the node enters the Complete stage. The destroy action reverses the process to release the node. Digital Rebar uses the stage changes as a trigger to restart the machine workflow.
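
For the curious, here is a rough Python rendering of that handshake as a stand-alone script. The parameter keys, stage names and API fields mirror the description above, but the provider’s actual internals and field names are assumptions:

```python
# Hedged sketch of the provider's allocate/poll/release handshake.
# Parameter keys ("terraform/managed", "terraform/assigned"), stage names and
# API field names are assumptions that mirror the description above.
import time
import requests

DRP = "https://drp.example.local:8092"
HEADERS = {"Authorization": "Bearer terraform-token"}

def find_free_machine():
    """Pick a machine that is Terraform managed but not yet assigned."""
    r = requests.get(f"{DRP}/api/v3/machines", headers=HEADERS, verify=False)
    r.raise_for_status()
    for m in r.json():
        params = m.get("Params", {})
        if params.get("terraform/managed") and not params.get("terraform/assigned"):
            return m
    return None

def allocate(machine, stage="centos-install"):
    """Mark the machine assigned and request the install stage."""
    patch = [
        {"op": "add", "path": "/Params/terraform~1assigned", "value": True},
        {"op": "replace", "path": "/Stage", "value": stage},
    ]
    r = requests.patch(f"{DRP}/api/v3/machines/{machine['Uuid']}",
                       json=patch, headers=HEADERS, verify=False)
    r.raise_for_status()

def wait_for_complete(machine, poll_seconds=15):
    """Poll until the machine's workflow reaches the complete stage."""
    while True:
        r = requests.get(f"{DRP}/api/v3/machines/{machine['Uuid']}",
                         headers=HEADERS, verify=False)
        if r.json().get("Stage") == "complete":
            return
        time.sleep(poll_seconds)

node = find_free_machine()
if node:
    allocate(node)
    wait_for_complete(node)
    print(f"{node['Uuid']} provisioned; destroy would flip terraform/assigned back to False.")
```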

Using a Terraform plan with Digital Rebar, operators can manage complex data center layouts from a single command line.

For users, all of the above steps are completely hidden. Operators can monitor the request using the Digital Rebar UX to ensure the plan is executing. In addition, plan metadata can set user or identification values on the machines when they are reserved to help track allocations. In this way, administrators can easily track and account for machines reserved via Terraform.

For full out-of-band control, users should add the RackN IPMI plugin. This adds the ability to force power states during plan execution. The provider does not require out-of-band management to function. RackN also maintains Packet.net and VirtualBox plugins with the same API as the IPMI plugin. This allows developers to easily test plans against virtual or cloud resources.

RackN customers are making big plans to use this simple and powerful integration to manage their own SDx roadmap. We’re excited to hear about new ways to improve data center operations, especially new edge ideas. Let us know what you are thinking!

Demonstration of Terraform Bare Metal Provisioning with Digital Rebar Provision V3.2

Setting up the Environment to run Digital Rebar Provision V3.2 for Terraform

Sirens of Open Infrastructure beacons to OpenStack Community

OpenStack is a real platform doing real work for real users.  So why does OpenStack have a reputation for not working?  It falls into the lack of core-focus paradox: being too much to too many undermines your ability to do something well.  In this case, we keep conflating the community and the code.

I have a long history with the project but have been pretty much outside of it (yay, Kubernetes!) for the last 18 months.  That perspective helps me feel like I’m getting closer to the answer after spending a few days with the community at the latest OpenStack Summit in Sydney, Australia.  While I love to think about the why, what the leaders are doing about it is very interesting too.

Fundamentally, OpenStack’s problem is that infrastructure automation is too hard and big to be solved within a single effort.  

It’s so big that any workable solution will fail for a sizable number of hopeful operators.  That does not keep people from the false aspiration that OpenStack code will perfectly fit their needs (especially if they are unwilling to trim their requirements).

But the problem is not inflated expectations for OpenStack VM IaaS code, it’s that we keep feeding them.  I have been a long time champion for a small core with a clear ecosystem boundary.  When OpenStack code claims support for other use cases, it invites disappointment and frustration.

So why is the OpenStack Foundation moving to expand its scope as an Open Infrastructure community with additional focus areas?  It’s simple: the community is asking it to do so.

Within the vast space of infrastructure automation, there are clusters of aligned interest.  These clusters are sufficiently narrow that they can collaborate on shared technologies and practices.  They also have a partial overlap (Venn) with adjacencies where OpenStack is already present.  There is a strong economic and social drive for members in these overlapped communities to bridge together instead of creating new disparate groups.  Having the OpenStack Foundation organize these efforts is a natural and expected function.

The danger of this expansion comes from also carrying the expectation that the technology (code) will be carried into the adjacencies.  That’s my exact rationale for why the original VM IaaS needs to be smaller.  The wealth of non-core projects crosses clusters of interest.  Instead of allowing these clusters to optimize around their shared interests, users get the impression that they must broadly adopt unneeded or poorly fit components.  The idea of “competitive” projects should be reframed because they may overlap in function but not in use-case fit.

It’s long past time to give up expectations that OpenStack is a “one-stop-shop” of infrastructure automation.  In my opinion, it undermines the community mission by excluding adjacencies.

I believe that OpenStack must work to embrace its role as an open infrastructure community; however, it must also do the hard work to create welcoming space for adjacencies.  These adjacencies will compete with existing projects currently under the OpenStack code tent.  The community needs to embrace that the hard work done so far may simply be sunk cost for new use cases. 

It’s the OpenStack community and the experience, not the code, that creates long term value.

Five ways I’m Sad, Mad and Scared: the new critical security flaw in firmware no one will patch.

There is a new security vulnerability that should be triggering a massive, fleet-wide server upgrade and patch for data center operators everywhere.  This one undermines fundamental encryption features embedded into servers’ trusted platform modules (TPMs).  According to Sophos.com, “this one’s a biggie.”

Yet, it’s unlikely anyone will actually patch their firmware to fix this serious issue.

Why?  A lack of automation.  Even if you agree with the urgency of this issue,

  1. It’s unlikely that you can perform a system wide software patch or system re-image without significant manual effort or operational risk
  2. It’s unlikely that you are actually using TPMs because they are tricky to set up and maintain
  3. It’s unlikely that you have any tooling that automates firmware updates across your fleet
  4. It’s unlikely that you have automation to gracefully roll out an update that can coordinate BIOS and operating system updates
  5. Even if you can do the above (IF YOU CAN, PLEASE CALL ME), it’s unlikely that you can coordinate updating both patching the BIOS and re-encrypting/rotating the data signed by the keys in the TPM

Being able to perform these actions should be foundational; however, I know from talking to many operators that there are serious automation and process gaps at this layer.  These gaps weaken the whole system because we neither turn on the security features embedded in our infrastructure nor automate ways to systematically maintain them.

This type of work is hard to do.  So we don’t do it, we don’t demand it and we don’t budget for it.

Our systems are way too complex to expect issues like this to be improved away by the next wave of technology.  In fact, we see the exact opposite.  The faster we move, the more flaws are injected into the system.  This is not a security problem alone.  Bugs, patches and dependencies cause even more system churn and risk.

I have not given up hoping that our industry will prioritize infrastructure automation so that we can improve our posture.  I’ve seen that fixing the bottom layers of the stack makes a meaningful difference in the layers above.  If you’ve been following our work, then you already know that is the core of our mission at RackN.

It’s up to each of us individually to start fixing the problem.  It won’t be easy but you don’t have to do it alone.  We have to do this together.

Deep Thinking & Tech + Great Guests – L8ist Sh9y Podcast

I love great conversations about technology – especially ones where the answer is not very neatly settled into winners and losers (which is ALL of them in IT).  I’m excited that RackN has (re)launched the L8ist Sh9y (aka Latest Shiny) podcast around this exact theme.

Please check out the deep and thoughtful discussion I just had with Mark Thiele (notes) of Apcera where we covered Mark’s thoughts on why public cloud will be under 20% of IT and tackled culture issues head on.

Spoiler: we have David Linthicum coming next, SO SUBSCRIBE.

I’ve been a guest on some great podcasts (Cloudcast, gcOnDemand, Datanauts, IBM Dojo, HPE, Foodfight) and have deep respect for the critical work they do in the industry.

We feel there’s still room for deep discussions specifically around automated IT Operations in cloud, data center and edge; consequently, we’re branching out to start including deep interviews in addition to our initial stable of IT Ops deep technical topics like Terraform, Edge Computing, Gartner SYM review, Kubernetes and, of course, our own Digital Rebar.

Soundcloud Subscription Information
