How scared do we need to be for Ops collaboration & investment?

Note: Yesterday RackN posted Are you impatient enough to be an SRE?  and then the CIA wikileaks news hit… perhaps the right question is “Are you scared enough to automate deeply yet?” 

Cia leak (1)As an industry, the CIA hacking release yesterday should be driving discussions about how to make our IT infrastructure more robust and fluid. It is not simply enough to harden because both the attack and the platforms are evolving to quickly.

We must be delivering solutions with continuous delivery and immutability assumptions baked in.

A more fluid IT that assumes constant updates and rebuilding from sources (immutable) is not just a security posture but a proven business benefit. For me, that means actually building from the hardware up where we patch and scrub systems regularly to shorten the half-life of all attach surfaces. It also means enabling existing security built into our systems that are generally ignored because of configuration complexity. These are hard but solvable automation challenges.

The problem is too big to fix individually: we need to collaborate in the open.

I’ve been really thinking deeply about how we accelerate SRE and DevOps collaboration across organizations and in open communities. The lack of common infrastructure foundations costs companies significant overhead and speed as teams across the globe reimplement automation in divergent ways. It also drags down software platforms that must adapt to each data center as a unique snowflake.

That’s why hybrid automation within AND between companies is an imperative. It enables collaboration.

Making automation portable able to handle the differences between infrastructure and environments is harder; however, it also enables sharing and reuse that creates allows us to improve collectively instead of individually.

That’s been a vision driving us at RackN with the open hybrid Digital Rebar project.  Curious?  Here’s RackN post that inspired this one:

From RackN’s Are you impatient enough to be an SRE?

“Like the hardware that runs it, the foundation automation layer must be commoditized. That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way.  Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on.  We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification.  That practice and pace should be the norm instead of the exception.”

Spiraling Ops Debt & the SRE coding imperative

This post is part of an SRE series grounded in the ideas inspired by the Google SRE book.

2/13 Update: You can hear an INTERACTIVE DISCUSSION based on this post with Eric Wright on his podcast, GC Online.

Every Ops team I know is underwater and doesn’t have the time to catch their breath.

Why does the load increase and leave Ops behind?  It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors.  Many teams simply step around their struggling compatriots and spin up yet more Ops work adding to the backlog. Dashing off yet another Ansible playbook to install on AWS is empowering but ultimately adds to the Ops sustaining backlog.

c2wfuvaveaaronn

Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms and those platforms are build on rapidly evolving components.  For example, Kubernetes has a 3 month release cycle.  That’s really fast; however, it built on other components like Docker, SDN and operating systems that also have fast release cycles.  That means that even your single Kubernetes infrastructure has many moving parts that may not be consistent in your own organization.  For example, cloud deploys may use CoreOS while internal ones use a Corporate approved Centos.

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen an container fueled explosion in developer productivity and AI driven-rise in new hardware-flavored instances. Both are power drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why the Google SRE teams require a 50% automation vs Ops ratio.  

If the ratio is >50 then the team slowly sinks under growing operational load.  If you are not actively decreasing the load via automation then your teams get underwater and basic ops hygiene fails.

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help.  Stop writing automation that you can buy or re-use.  Get help running it.  Don’t waste time solving problems that other people have solved.  That may mean some upfront learning and investment but if you aren’t getting out of your own way then you’ll be run over.

 

Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)

What is a Google SRE?  Charity Majors gave a great overview on Datanauts #65, Susan Fowler from Uber talks about “no ops” tensions and Patrick Hill from Atlassian wrote up a good review too.  This is not new: Ben Treynor defined it back in 2014.

DevOps is under attack.

Well, not DevOps exactly but the common misconception that DevOps is about Developers doing Ops (it’s really about lean process, system thinking, and positive culture).  It turns out the Ops is hard and, as I recently discussed with John Furrier, developers really really don’t want be that focused on infrastructure.

In fact, I see containers and serverless as a “developers won’t waste time on ops revolt.”  (I discuss this more in my 2016 retrospective).

The tension between Ops and Dev goes way back and has been a source of confusion for me and my RackN co-founders.  We believe we are developers, except that we spend our whole time focused on writing code for operations.  With the rise of Site Reliability Engineers (SRE) as a job classification, our type of black swan engineer is being embraced as a critical skill.  It’s recognized as the only way to stay ahead of our ravenous appetite for  computing infrastructure.

I’ve been writing about Site Reliability Engineering (SRE) tasks for nearly 5 years under a lot of different names such as DevOps, Ready State, Open Operations and Underlay Operations. SRE is a term popularized by Google (there’s a book!) for the operators who build and automate their infrastructure. Their role is not administration, it is redefining how infrastructure is used and managed within Google.

Using infrastructure effectively is a competitive advantage for Google and their SREs carry tremendous authority and respect for executing on that mission.

ManagersMeanwhile, we’re in the midst of an Enterprise revolt against running infrastructure. Companies, for very good reasons, are shutting down internal IT efforts in favor of using outsourced infrastructure. Operations has simply not been able to complete with the capability, flexibility and breadth of infrastructure services offered by Amazon.

SRE is about operational excellence and we keep up with the increasingly rapid pace of IT.  It’s a recognition that we cannot scale people quickly as we add infrastructure.  And, critically, it is not infrastructure specific.

Over the next year, I’ll continue to dig deeply into the skills, tools and processes around operations.  I think that SRE may be the right banner for these thoughts and I’d like to hear your thoughts about that.

MORE?  Here’s the next post in the series about Spiraling Ops Debt.  Or Skip to Podcasts with Eric Wright and Stephen Spector.

How do platforms die? One step at a time [the Fidelity Gap]

The RackN team is working on the “Start to Scale” position for Digital Rebar that targets the IT industry-wide “fidelity gap” problem.  When we started on the Digital Rebar journey back in 2011 with Crowbar, we focused on “last mile” problems in metal and operations.  Only in the last few months did we recognize the importance of automating smaller “first mile” desktop and lab environments.

A fidelityFidelity Gap gap is created when work done on one platform, a developer laptop, does not translate faithfully to the next platform, a QA lab.   Since there are gaps at each stage of deployment, we end up with the ops staircase of despair.

These gaps hide defects until they are expensive to fix and make it hard to share improvements.  Even worse, they keep teams from collaborating.

With everyone trying out Container Orchestration platforms like Kubernetes, Docker Swarm, Mesosphere or Cloud Foundry (all of which we deploy, btw), it’s important that we can gracefully scale operational best practices.

For companies implementing containers, it’s not just about turning their apps into microservice-enabled immutable-rock stars: they also need to figure out how to implement the underlying platforms at scale.

My example of fidelity gap harm is OpenStack’s “all in one, single node” DevStack.  There is no useful single system OpenStack deployment; however, that is the primary system for developers and automated testing.  This design hides production defects and usability issues from developers.  These are issues that would be exposed quickly if the community required multi-instance development.  Even worse, it keeps developers from dealing with operational consequences of their decisions.

What are we doing about fidelity gaps?  We’ve made it possible to run and faithfully provision multi-node systems in Digital Rebar on a relatively light system (16 Gb RAM, 4 cores) using VMs or containers.  That system can then be fully automated with Ansible, Chef, Puppet and Salt.  Because of our abstractions, if deployment works in Digital Rebar then it can scale up to 100s of physical nodes.

My take away?  If you want to get to scale, start with the end in mind.

DNS is critical – getting physical ops integrations right matters

Why DNS? Maintaining DNS is essential to scale ops.  It’s not as simple as naming servers because each server will have multiple addresses (IPv4, IPv6, teams, bridges, etc) on multiple NICs depending on the systems function and applications. Plus, Errors in DNS are hard to diagnose.

Names MatterI love talking about the small Ops things that make a huge impact in quality of automation.  Things like automatically building a squid proxy cache infrastructure.

Today, I get to rave about the DNS integration that just surfaced in the OpenCrowbar code base. RackN CTO, Greg Althaus, just completed work that incrementally updates DNS entries as new IPs are added into the system.

Why is that a big deal?  There are a lot of names & IPs to manage.

In physical ops, every time you bring up a physical or virtual network interface, you are assigning at least one IP to that interface. For OpenCrowbar, we are assigning two addresses: IPv4 and IPv6.  Servers generally have 3 or more active interfaces (e.g.: BMC, admin, internal, public and storage) so that’s a lot of references.  It gets even more complex when you factor in DNS round robin or other common practices.

Plus mistakes are expensive.  Name resolution is an essential service for operations.

I know we all love memorizing IPv4 addresses (just wait for IPv6!) so accurate naming is essential.  OpenCrowbar already aligns the address 4th octet (Admin .106 goes to the same server as BMC .106) but that’s not always practical or useful.  This is not just a Day 1 problem – DNS drift or staleness becomes an increasing challenging problem when you have to reallocate IP addresses.  The simple fact is that registering IPs is not the hard part of this integration – it’s the flexible and dynamic updates.

What DNS automation did we enable in OpenCrowbar?  Here’s a partial list:

  1. recovery of names and IPs when interfaces and systems are decommissioned
  2. use of flexible naming patterns so that you can control how the systems are registered
  3. ability to register names in multiple DNS infrastructures
  4. ability to understand sub-domains so that you can map DNS by region
  5. ability to register the same system under multiple names
  6. wild card support for C-Names
  7. ability to create a DNS round-robin group and keep it updated

But there’s more! The integration includes both BIND and PowerDNS integrations. Since BIND does not have an API that allows incremental additions, Greg added a Golang service to wrap BIND and provide incremental updates and deletes.

When we talk about infrastructure ops automation and ready state, this is the type of deep integration that makes a difference and is the hallmark of the RackN team’s ops focus with RackN Enterprise and OpenCrowbar.

Ops Bridges > Building a Sharable Ops Infrastructure with Composable Tool Chain Orchestration

This posted started from a discussion with Judd Maltin that he documented in a post about “wanting a composable run deck.”

Fitz and Trantrums: Breaking the Chains of LoveI’ve had several conversations comparing OpenCrowbar with other “bare metal provisioning” tools that do thing like serve golden images to PXE or IPXE server to help bootstrap deployments.  It’s those are handy tools, they do nothing to really help operators drive system-wide operations; consequently, they have a limited system impact/utility.

In building the new architecture of OpenCrowbar (aka Crowbar v2), we heard very clearly to have “less magic” in the system.  We took that advice very seriously to make sure that Crowbar was a system layer with, not a replacement to, standard operations tools.

Specifically, node boot & kickstart alone is just not that exciting.  It’s a combination of DHCP, PXE, HTTP and TFTP or DHCP and an IPXE HTTP Server.   It’s a pain to set this up, but I don’t really get excited about it anymore.   In fact, you can pretty much use open ops scripts (Chef) to setup these services because it’s cut and dry operational work.

Note: Setting up the networking to make it all work is perhaps a different question and one that few platforms bother talking about.

So, if doing node provisioning is not a big deal then why is OpenCrowbar important?  Because sustaining operations is about ongoing system orchestration (we’d say an “operations model“) that starts with provisioning.

It’s not the individual services that’s critical; it’s doing them in a system wide sequence that’s vital.

Crowbar does NOT REPLACE the services.  In fact, we go out of our way to keep your proven operations tool chain.  We don’t want operators to troubleshoot our IPXE code!  We’d much rather use the standard stuff and orchestrate the configuration in a predicable way.

In that way, OpenCrowbar embraces and composes the existing operations tool chain into an integrated system of tools.  We always avoid replacing tools.  That’s why we use Chef for our DSL instead of adding something new.

What does that leave for Crowbar?  Crowbar is providing a physical infratsucture targeted orchestration (we call it “the Annealer”) that coordinates this tool chain to work as a system.  It’s the system perspective that’s critical because it allows all of the operational services to work together.

For example, when a node is added then we have to create v4 and v6 IP address entries for it.  This is required because secure infrastructure requires reverse DNS.  If you change the name of that node or add an alias, Crowbar again needs to update the DNS.  This had to happen in the right sequence.  If you create a new virtual interface for that node then, again, you need to update DNS.   This type of operational housekeeping is essential and must be performed in the correct sequence at the right time.

The critical insight is that Crowbar works transparently alongside your existing operational services with proven configuration management tools.  Crowbar connects links in your tool chain but keeps you in the driver’s seat.

Ops Validation using Development Tests [3/4 series on Operating Open Source Infrastructure]

This post is the third in a 4 part series about Success factors for Operating Open Source Infrastructure.

turning upIn an automated configuration deployment scenario, problems surface very quickly. They prevent deployment and force resolution before progress can be made. Unfortunately, many times this appears to be a failure within the deployment automation. My personal experience has been exactly the opposite: automation creates a “fail fast” environment in which critical issues are discovered and resolved during provisioning instead of sleeping until later.

Our ability to detect and stop until these issues are resolved creates exactly the type of repeatable, successful deployment that is essential to long-term success. When we look at these deployments, the most important success factors are that the deployment is consistent, known and predictable. Our ability to quickly identify and resolve issues that do not match those patterns dramatically improves the long-term stability of the system by creating an environment that has been benchmarked against a known reference.

Benchmarking against a known reference is ultimately the most significant value that we can provide in helping customers bring up complex solutions such as Openstack and Hadoop. Being successful with these deployments over the long term means that you have established a known configuration, and that you have maintained it in a way that is explainable and reference-able to other places.

Reference Implementation

The concept of a reference implementation provides tremendous value in deployment. Following a pattern that is a reference implementation enables you to compare notes, get help and ultimately upgrade and change deployment in known, predictable ways. Customers who can follow and implement a vendors’ reference, or the community’s reference implementation, are able to ask for help on the mailing lists, call in for help and work with the community in ways that are consistent and predictable.

Let’s explore what a reference implementation looks like.

In a reference implementation you have a consistent, known state of your physical infrastructure that has been implemented based upon a RA. That implementation follows a known best practice using standard gear in a consistent, known configuration. You can therefore explain your configuration to a community of other developers, or other people who have similar configuration, and can validate that your problem is not the physical configuration. Fundamentally, everything in a reference implementation is driving towards the elimination of possible failure cause. In this case, we are making sure that the physical infrastructure is not causing problems (getting to a ready state), because other people are using a similar (or identical) physical infrastructure configuration.

The next components of a reference implementation are the underlying software configurations for operating system management monitoring network configuration, IP networking stacks. Pretty much the entire component of the application is riding on. There are a lot of moving parts and complexity in this scenario, witha high likelihood of causing failures. Implementing and deploying the software stacks in an automated way, has enabled us to dramatically reduce the potential for problems caused by misconfiguration. Because the number of permutations of software in the reference stack is so high, it is essential that successful deployment tightly manages what exactly is deployed, in such a way that they can identify, name, and compare notes with other deployments.

Achieving Repeatable Deployments

In this case, our referenced deployment consists of the exact composition of the operating system, infrastructure tooling, and capabilities for the deployment. By having a reference capability, we can ensure that we have the same:

  • Operating system
  • Monitoring
  • Configuration stacks
  • Security tooling
  • Patches
  • Network stack (including bridges and VLAN, IP table configurations)

Each one of these components is a potential failure point in a deployment. By being able to configure and maintain that configuration automatically, we dramatically increase the opportunities for success by enabling customers to have a consistent configuration between sites.

Repeatable reference deployments enable customers to compare notes with Dell and with others in the community. It enables us to take and apply what we have learned from one site to another. For example, if a new patch breaks functionality, then we can quickly determine how that was caused. We can then fix the solution, add in the complimentary fix, and deploy it at that one site. If we are aware that 90% of our other sites have exactly the same configuration, it enables those other sites to avoid a similar problem. In this way, having both a pattern and practice referenced deployment enables the community to absorb or respond much more quickly, and be successful with a changing code base. We found that it is impractical to expect things not to change.

The only thing that we can do is build resiliency for change into these deployments. Creating an automated and tested referenceable deployment is the best way to cope with change.