Ops Bridges > Building a Sharable Ops Infrastructure with Composable Tool Chain Orchestration

This posted started from a discussion with Judd Maltin that he documented in a post about “wanting a composable run deck.”

Fitz and Trantrums: Breaking the Chains of LoveI’ve had several conversations comparing OpenCrowbar with other “bare metal provisioning” tools that do thing like serve golden images to PXE or IPXE server to help bootstrap deployments.  It’s those are handy tools, they do nothing to really help operators drive system-wide operations; consequently, they have a limited system impact/utility.

In building the new architecture of OpenCrowbar (aka Crowbar v2), we heard very clearly to have “less magic” in the system.  We took that advice very seriously to make sure that Crowbar was a system layer with, not a replacement to, standard operations tools.

Specifically, node boot & kickstart alone is just not that exciting.  It’s a combination of DHCP, PXE, HTTP and TFTP or DHCP and an IPXE HTTP Server.   It’s a pain to set this up, but I don’t really get excited about it anymore.   In fact, you can pretty much use open ops scripts (Chef) to setup these services because it’s cut and dry operational work.

Note: Setting up the networking to make it all work is perhaps a different question and one that few platforms bother talking about.

So, if doing node provisioning is not a big deal then why is OpenCrowbar important?  Because sustaining operations is about ongoing system orchestration (we’d say an “operations model“) that starts with provisioning.

It’s not the individual services that’s critical; it’s doing them in a system wide sequence that’s vital.

Crowbar does NOT REPLACE the services.  In fact, we go out of our way to keep your proven operations tool chain.  We don’t want operators to troubleshoot our IPXE code!  We’d much rather use the standard stuff and orchestrate the configuration in a predicable way.

In that way, OpenCrowbar embraces and composes the existing operations tool chain into an integrated system of tools.  We always avoid replacing tools.  That’s why we use Chef for our DSL instead of adding something new.

What does that leave for Crowbar?  Crowbar is providing a physical infratsucture targeted orchestration (we call it “the Annealer”) that coordinates this tool chain to work as a system.  It’s the system perspective that’s critical because it allows all of the operational services to work together.

For example, when a node is added then we have to create v4 and v6 IP address entries for it.  This is required because secure infrastructure requires reverse DNS.  If you change the name of that node or add an alias, Crowbar again needs to update the DNS.  This had to happen in the right sequence.  If you create a new virtual interface for that node then, again, you need to update DNS.   This type of operational housekeeping is essential and must be performed in the correct sequence at the right time.

The critical insight is that Crowbar works transparently alongside your existing operational services with proven configuration management tools.  Crowbar connects links in your tool chain but keeps you in the driver’s seat.

OpenCrowbar stands up 100 node community challenge

OpenCrowbar community contributors are offering a “100 Node Challenge” by volunteering to setup a 100+ node Crowbar system to prove out the v2 architecture at scale.  We picked 100* nodes since we wanted to firmly break the Crowbar v1 upper ceiling.

going up!The goal of the challenge is to prove scale of the core provisioning cycle.  It’s intended to be a short action (less than a week) so we’ll need advanced information about the hardware configuration.  The expectation is to do a full RAID/Disk hardware configuration beyond the base IPMI config before laying down the operating system.

The challenge logistics starts with an off-site prep discussion of the particulars of the deployment, then installing OpenCrowbar at the site and deploying the node century.  We will also work with you about using OpenCrowbar to manage the environment going forward.  

Sound too good to be true?  Well, as community members are doing this on their own time, we are only planning one challenge candidate and want to find the right target.
We will not be planning custom code changes to support the deployment, however, we would be happy to work with you in the community to support your needs.  If you want help to sustain the environment or have longer term plans, I have also been approached by community members who willing to take on full or part-time Crowbar consulting engagements.
Let’s get rack’n!
* we’ll consider smaller clusters but you have to buy the drinks and pizza.

You need a Squid Proxy fabric! Getting Ready State Best Practices

Sometimes a solving a small problem well makes a huge impact for operators.  Talking to operators, it appears that automated configuration of Squid does exactly that.

Not a SQUID but...

If you were installing OpenStack or Hadoop, you would not find “setup a squid proxy fabric to optimize your package downloads” in the install guide.   That’s simply out of scope for those guides; however, it’s essential operational guidance.  That’s what I mean by open operations and creating a platform for sharing best practice.

Deploying a base operating system (e.g.: Centos) on a lot of nodes creates bit-tons of identical internet traffic.  By default, each node will attempt to reach internet mirrors for packages.  If you multiply that by even 10 nodes, that’s a lot of traffic and a significant performance impact if you’re connection is limited.

For OpenCrowbar developers, the external package resolution means that each dev/test cycle with a node boot (which is up to 10+ times a day) is bottle necked.  For qa and install, the problem is even worse!

Our solution was 1) to embed Squid proxies into the configured environments and the 2) automatically configure nodes to use the proxies.   By making this behavior default, we improve the overall performance of a deployment.   This further improves the overall network topology of the operating environment while adding improved control of traffic.

This is a great example of how Crowbar uses existing operational tool chains (Chef configures Squid) in best practice ways to solve operations problems.  The magic is not in the tool or the configuration, it’s that we’ve included it in our out-of-the-box default orchestrations.

It’s time to stop fumbling around in the operational dark.  We need to compose our tool chains in an automated way!  This is how we advance operational best practice for ready state infrastructure.

OpenCrowbar Design Principles: Attribute Injection [Series 6 of 6]

This is part 5 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar.  The content is reposted from the OpenCrowbar docs repo.

Attribute Injection

Attribute Injection is an essential aspect of the “FuncOps” story because it helps clean boundaries needed to implement consistent scripting behavior between divergent sites.

attribute_injectionIt also allows Crowbar to abstract and isolate provisioning layers. This operational approach means that deployments are composed of layered services (see emergent services) instead of locked “golden” images. The layers can be maintained independently and allow users to compose specific configurations a la cart. This approach works if the layers have clean functional boundaries (FuncOps) that can be scoped and managed atomically.

To explain how Attribute Injection accomplishes this, we need to explore why search became an anti-pattern in Crowbar v1. Originally, being able to use server based search functions in operational scripting was a critical feature. It allowed individual nodes to act as part of a system by searching for global information needed to make local decisions. This greatly added Crowbar’s mission of system level configuration; however, it also created significant hidden interdependencies between scripts. As Crowbar v1 grew in complexity, searches became more and more difficult to maintain because they were difficult to correctly scope, hard to centrally manage and prone to timing issues.

Crowbar was not unique in dealing with this problem – the Attribute Injection pattern has become a preferred alternative to search in integrated community cookbooks.

Attribute Injection in OpenCrowbar works by establishing specific inputs and outputs for all state actions (NodeRole runs). By declaring the exact inputs needed and outputs provided, Crowbar can better manage each annealing operation. This control includes deployment scoping boundaries, time sequence of information plus override and substitution of inputs based on execution paths.

This concept is not unique to Crowbar. It has become best practice for operational scripts. Crowbar simply extends to paradigm to the system level and orchestration level.

Attribute Injection enabled operations to be:

  • Atomic – only the information needed for the operation is provided so risk of “bleed over” between scripts is minimized. This is also a functional programming preference.
  • Isolated Idempotent – risk of accidentally picking up changed information from previous runs is reduced by controlling the inputs. That makes it more likely that scripts can be idempotent.
  • Cleanly Scoped – information passed into operations can be limited based on system deployment boundaries instead of search parameters. This allows the orchestration to manage when and how information is added into configurations.
  • Easy to troubleshoot – since the information is limited and controlled, it is easier to recreate runs for troubleshooting. This is a substantial value for diagnostics.

OpenCrowbar Design Principles: Emergent services [Series 5 of 6]

This is part 5 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar.  The content is reposted from the OpenCrowbar docs repo.

Emergent services

We see data center operations as a duel between conflicting priorities. On one hand, the environment is constantly changing and systems must adapt quickly to these changes. On the other hand, users of the infrastructure expect it to provide stable and consistent services for consumption. We’ve described that as “always ready, never finished.”

Our solution to this duality to expect that the infrastructure Crowbar builds is decomposed into well-defined service layers that can be (re)assembled dynamically. Rather than require any component of the system to be in a ready state, Crowbar design principles assume that we can automate the construction of every level of the infrastructure from bios to network and application. Consequently, we can hold off (re)making decisions at the bottom levels until we’ve figured out that we’re doing at the top.

Effectively, we allow the overall infrastructure services configuration to evolve or emerge based on the desired end use. These concepts are built on computer science principles that we have appropriated for Ops use; since we also subscribe to Opscode “infrastructure as code”, we believe that these terms are fitting in a DevOps environment. In the next pages, we’ll explore the principles behind this approach including concepts around simulated annealing, late binding, attribute injection and emergent design.

Emergent (aka iterative or evolutionary) design challenges the traditional assumption that all factors must be known before starting

  • Dependency graph – multidimensional relationship
  • High degree of reuse via abstraction and isolation of service boundaries.
  • Increasing complexity of deployments means more dependencies
  • Increasing revision rates of dependencies but with higher stability of APIs

OpenCrowbar Design Principles: Simulated Annealing [Series 4 of 6]

This is part 4 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar.  The content is reposted from the OpenCrowbar docs repo.

Simulated Annealing

simulated_annealingSimulated Annealing is a modeling strategy from Computer Science for seeking optimum or stable outcomes through iterative analysis. The physical analogy is the process of strengthening steel by repeatedly heating, quenching and hammering. In both computer science and metallurgy, the process involves evaluating state, taking action, factoring in new data and then repeating. Each annealing cycle improves the system even though we may not know the final target state.

Annealing is well suited for problems where there is no mathematical solution, there’s an irregular feedback loop or the datasets change over time. We have all three challenges in continuous operations environments. While it’s clear that a deployment can modeled as directed graph (a mathematical solution) at a specific point in time, the reality is that there are too many unknowns to have a reliable graph. The problem is compounded because of unpredictable variance in hardware (NIC enumeration, drive sizes, BIOS revisions, network topology, etc) that’s even more challenging if we factor in adapting to failures. An operating infrastructure is a moving target that is hard to model predictively.

Crowbar implements the simulated annealing algorithm by decomposing the operations infrastructure into atomic units, node-roles, that perform the smallest until of work definable. Some or all of these node-roles are changed whenever the infrastructure changes. Crowbar anneals the environment by exercising the node-roles in a consistent way until system re-stabilizes.

One way to visualize Crowbar annealing is to imagine children who have to cross a field but don’t have a teacher to coordinate. Once student takes a step forward and looks around then another sees the first and takes two steps. Each child advances based on what their peers are doing. None wants to get too far ahead or be left behind. The line progresses irregularly but consistently based on the natural relationships within the group.

To understand the Crowbar Annealer, we have to break it into three distinct components: deployment timeline, annealing and node-role state. The deployment timeline represents externally (user, hardware, etc) initiated changes that propose a new target state. Once that new target is committed, Crowbar anneals by iterating through all the node-roles in a reasonable order. As the Annealer runs the node-roles they update their own state. The aggregate state of all the node-roles determines the state of the deployment.

A deployment is a combination of user and system defined state. Crowbar’s job is to get deployments stable and then maintain over time.

OpenCrowbar Design Principles: Reintroduction [Series 1 of 6]

While “ready state” as a concept has been getting a lot of positive response, I forget that much of the innovation and learning behind that concept never surfaced as posts here.  The Anvil (2.0) release included the OpenCrowbar team cataloging our principles in docs.  Now it’s time to repost the team’s work into a short series over the next three days.

In architecting the Crowbar operational model, we’ve consistently twisted adapted traditional computer science concepts like late binding, simulated annealing, emergent behavior, attribute injection and functional programming to create a repeatable platform for sharing open operations practice (post 2).

Functional DevOps aka “FuncOps”

Ok, maybe that’s not going to be the 70’s era hype bubble name, but… the operational model behind Crowbar is entering its third generation and its important to understand the state isolation and integration principles behind that model is closer to functional than declarative programming.

Parliament is Crowbar’s official FuncOps sound track

The model is critical because it shapes how Crowbar approaches the infrastructure at a fundamental level so it makes it easier to interact with the platform if you see how we are approaching operations. Crowbar’s goal is to create emergent services.

We’ll expore those topics in this series to explain Crowbar’s core architectural principles.  Before we get into that, I’d like to review some history.

The Crowbar Objective

Crowbar delivers repeatable best practice deployments. Crowbar is not just about installation: we define success as a sustainable operations model where we continuously improve how people use their infrastructure. The complexity and pace of technology change is accelerating so we must have an approach that embraces continuous delivery.

Crowbar’s objective is to help operators become more efficient, stable and resilient over time.

Background

When Greg Althaus (github @GAlhtaus) and Rob “zehicle” Hirschfeld (github @CloudEdge) started the project, we had some very specific targets in mind. We’d been working towards using organic emergent swarming (think ants) to model continuous application deployment. We had also been struggling with the most routine foundational tasks (bios, raid, o/s install, networking, ops infrastructure) when bringing up early scale cloud & data applications. Another key contributor, Victor Lowther (github @VictorLowther) has critical experience in Linux operations, networking and dependency resolution that lead to made significant contributions around the Annealing and networking model. These backgrounds heavily influenced how we approached Crowbar.

First, we started with best of field DevOps infrastructure: Opscode Chef. There was already a remarkable open source community around this tool and an enthusiastic following for cloud and scale operators . Using Chef to do the majority of the installation left the Crowbar team to focus on

crowbar_engineKey Features

  • Heterogeneous Operating Systems – chose which operating system you want to install on the target servers.
  • CMDB Flexibility (see picture) – don’t be locked in to a devops toolset. Attribute injection allows clean abstraction boundaries so you can use multiple tools (Chef and Puppet, playing together).
  • Ops Annealer –the orchestration at Crowbar’s heart combines the best of directed graphs with late binding and parallel execution. We believe annealing is the key ingredient for repeatable and OpenOps shared code upgrades
  • Upstream Friendly – infrastructure as code works best as a community practice and Crowbar use upstream code
  • without injecting “crowbarisms” that were previously required. So you can share your learning with the broader DevOps community even if they don’t use Crowbar.
  • Node Discovery (or not) – Crowbar maintains the same proven discovery image based approach that we used before, but we’ve streamlined and expanded it. You can use Crowbar’s API outside of the PXE discovery system to accommodate Docker containers, existing systems and VMs.
  • Hardware Configuration – Crowbar maintains the same optional hardware neutral approach to RAID and BIOS configuration. Configuring hardware with repeatability is difficult and requires much iterative testing. While our approach is open and generic, the team at Dell works hard to validate a on specific set of gear: it’s impossible to make statements beyond that test matrix.
  • Network Abstraction – Crowbar dramatically extended our DevOps network abstraction. We’ve learned that a networking is the key to success for deployment and upgrade so we’ve made Crowbar networking flexible and concise. Crowbar networking works with attribute injection so that you can avoid hardwiring networking into DevOps scripts.
  • Out of band control – when the Annealer hands off work, Crowbar gives the worker implementation flexibility to do it on the node (using SSH) or remotely (using an API). Making agents optional means allows operators and developers make the best choices for the actions that they need to take.
  • Technical Debt Paydown – We’ve also updated the Crowbar infrastructure to use the latest libraries like Ruby 2, Rails 4, Chef 11. Even more importantly, we’re dramatically simplified the code structure including in repo documentation and a Docker based developer environment that makes building a working Crowbar environment fast and repeatable.

OpenCrowbar (CB2) vs Crowbar (CB1)?

Why change to OpenCrowbar? This new generation of Crowbar is structurally different from Crowbar 1 and we’ve investing substantially in refactoring the tooling, paying down technical debt and cleanup up documentation. Since Crowbar 1 is still being actively developed, splitting the repositories allow both versions to progress with less confusion. The majority of the principles and deployment code is very similar, I think of Crowbar as a single community.

Continue Reading > post 2

OpenCrowbar: ready to fly as OpenOps neutral platform – Dell stepping back

greg and rob

Two of Crowbar Founders: me with Greg Althaus [taken Jan 2013]

With the Anvil release in the bag, Dell announced on the community list yesterday that it has stopped active contribution on the Crowbar project.  This effectively relaunches Crowbar as a truly vendor-neutral physical infrastructure provisioning tool.

While I cannot speak for my employer, Dell, about Crowbar; I continue serve in my role as a founder of the Crowbar Project.  I agree with Eric S Raymond that founders of open source projects have a responsibility to sustain their community and ensure its longevity.

In the open DevOps bare metal provisioning market, there is nothing that matches the capabilities developed in either Crowbar v1 or OpenCrowbar.  The operations model and system focused approach is truly differentiated because no other open framework has been able to integrate networking, orchestration, discovery, provisioning and configuration management like Crowbar.

It is time for the community to take Crowbar beyond the leadership of a single hardware vendor, OS vendor, workload or CMDB tool.  OpenCrowbar offers operations freedom and flexibility to build upon an abstracted physical infrastructure (what I’ve called “ready state“).

We have the opportunity to make open operations a reality together.

As a Crowbar founder and its acting community leader, you are welcome to contact me directly or through the crowbar list about how to get engaged in the Crowbar community or help get connected to like-minded Crowbar resources.

Open Operations [4/4 series on Operating Open Source Infrastructure]

This post is the final in a 4 part series about Success factors for Operating Open Source Infrastructure.

tl;dr Note: This is really TWO tightly related posts: 
  part 1 is OpenOps background. 
  part 2 is about OpenStack, Tempest and DefCore.

2012-01-11_17-42-11_374One of the substantial challenges of large-scale deployments of open source software is that it is very difficult to come up with a best practice, or a reference implementation that can be widely explained or described by the community.

Having a best practice deployment is essential for the growth of the community because it enables multiple people to deploy the software in a repeatable, stable way. This, in turn, fosters community growth so that more people can adopt software in a consistent way. It does little good if operators have no consistent pattern for deployment, because that undermines the developers’ abilities to extend, the testers’ abilities to ensure quality, and users’ ability to repeat the success of others.

Fundamentally, the goal of an open source project, from a user’s perspective, is that they can quickly achieve and repeat the success of other people in the community.

When we look at these large-scale projects we really try to create a pattern of success that can be repeated over and over again. This ensures growth of the user base, and it also helps the developer reduce time spent troubleshooting problems.

That does not mean that every single deployment should be identical, but there is substantial value in having a limited number of success patterns. Customers can then be assured not only of quick time to value with these projects., They can also get help without having everybody else in the community attempt to untangle how one person created a site-specific. This is especially problematic if someone created an unnecessarily unique scenario. That simply creates noise and confusion in the environment., Noise is a huge cost for the community, and needs to be eliminated nor an open source project to flourish.

This isn’t any different from in proprietary software but most of these activities are hidden. A proprietary project vendor can make much stronger recommendations and install guidance because they are the only source of truth in that project. In an open source project, there are multiple sources of truth, and there are very few people who are willing to publish their exact reference implementation or test patterns. Consequently, my team has taken a strong position on creating a repeatable reference implementation for Openstack deployments, based on extensive testing. We have found that our test patterns and practices are grounded in successful customer deployments and actual, physical infrastructure deployments. So, they are very pragmatic, repeatable, and sustained.

We found that this type of testing, while expensive, is also a significant value to our customers, and something that they appreciate and have been willing to pay for.

OpenStack as an Example: Tempest for Reference Validation

The Crowbar project incorporated OpenStack Tempest project as an essential part of every OpenStack deployment. From the earliest introduction of the Tempest suite, we have understood the value of a baselining test suite for OpenStack. We believe that using the same tests the developers use for a single node test is a gate for code acceptance against a multi-node deployment, and creates significant value both for our customers and the OpenStack project as a whole.  This was part of my why I embraced the suggestion of basing DefCore on tests.

While it is important to have developer tests that gate code check-ins, the ultimate goal for OpenStack is to create scale-out multi-node deployments. This is a fundamental design objective for OpenStack.

With developers and operators using the same test suite, we are able to proactively measure the success of the code in the scale deployments in a way that provides quick feedback for the developers. If Tempest tests do not pass a multi-node environment, they are not providing significant value for developers to ensure that their code is operating against best practice scenarios. Our objective is to continue to extend the Tempest suite of tests so that they are an accurate reflection of the use cases that are encountered in a best practice, referenced deployment.

Along these lines, we expect that the community will continue to expand the Tempest test suite to match actual deployment scenarios reflected in scale and multi-node configurations. Having developers be responsible for passing these tests as part of their day-to-day activities ensures that development activities do not disrupt scale operations. Ultimately, making proactive gating tests ensures that we are creating scenarios in which code quality is continually increasing, as is our ability to respond and deploy the OpenStack infrastructure.

I am very excited and optimistic that the expanding the Tempest suite holds the key to making OpenStack the most stable, reliable, performance cloud implementation available in the market. The fact that this test suite can be extended in the community, and contributed to by a broad range of implementations, only makes that test suite more valuable and more likely to fully encompass all use cases necessary for reference implementations.

Ops Validation using Development Tests [3/4 series on Operating Open Source Infrastructure]

This post is the third in a 4 part series about Success factors for Operating Open Source Infrastructure.

turning upIn an automated configuration deployment scenario, problems surface very quickly. They prevent deployment and force resolution before progress can be made. Unfortunately, many times this appears to be a failure within the deployment automation. My personal experience has been exactly the opposite: automation creates a “fail fast” environment in which critical issues are discovered and resolved during provisioning instead of sleeping until later.

Our ability to detect and stop until these issues are resolved creates exactly the type of repeatable, successful deployment that is essential to long-term success. When we look at these deployments, the most important success factors are that the deployment is consistent, known and predictable. Our ability to quickly identify and resolve issues that do not match those patterns dramatically improves the long-term stability of the system by creating an environment that has been benchmarked against a known reference.

Benchmarking against a known reference is ultimately the most significant value that we can provide in helping customers bring up complex solutions such as Openstack and Hadoop. Being successful with these deployments over the long term means that you have established a known configuration, and that you have maintained it in a way that is explainable and reference-able to other places.

Reference Implementation

The concept of a reference implementation provides tremendous value in deployment. Following a pattern that is a reference implementation enables you to compare notes, get help and ultimately upgrade and change deployment in known, predictable ways. Customers who can follow and implement a vendors’ reference, or the community’s reference implementation, are able to ask for help on the mailing lists, call in for help and work with the community in ways that are consistent and predictable.

Let’s explore what a reference implementation looks like.

In a reference implementation you have a consistent, known state of your physical infrastructure that has been implemented based upon a RA. That implementation follows a known best practice using standard gear in a consistent, known configuration. You can therefore explain your configuration to a community of other developers, or other people who have similar configuration, and can validate that your problem is not the physical configuration. Fundamentally, everything in a reference implementation is driving towards the elimination of possible failure cause. In this case, we are making sure that the physical infrastructure is not causing problems (getting to a ready state), because other people are using a similar (or identical) physical infrastructure configuration.

The next components of a reference implementation are the underlying software configurations for operating system management monitoring network configuration, IP networking stacks. Pretty much the entire component of the application is riding on. There are a lot of moving parts and complexity in this scenario, witha high likelihood of causing failures. Implementing and deploying the software stacks in an automated way, has enabled us to dramatically reduce the potential for problems caused by misconfiguration. Because the number of permutations of software in the reference stack is so high, it is essential that successful deployment tightly manages what exactly is deployed, in such a way that they can identify, name, and compare notes with other deployments.

Achieving Repeatable Deployments

In this case, our referenced deployment consists of the exact composition of the operating system, infrastructure tooling, and capabilities for the deployment. By having a reference capability, we can ensure that we have the same:

  • Operating system
  • Monitoring
  • Configuration stacks
  • Security tooling
  • Patches
  • Network stack (including bridges and VLAN, IP table configurations)

Each one of these components is a potential failure point in a deployment. By being able to configure and maintain that configuration automatically, we dramatically increase the opportunities for success by enabling customers to have a consistent configuration between sites.

Repeatable reference deployments enable customers to compare notes with Dell and with others in the community. It enables us to take and apply what we have learned from one site to another. For example, if a new patch breaks functionality, then we can quickly determine how that was caused. We can then fix the solution, add in the complimentary fix, and deploy it at that one site. If we are aware that 90% of our other sites have exactly the same configuration, it enables those other sites to avoid a similar problem. In this way, having both a pattern and practice referenced deployment enables the community to absorb or respond much more quickly, and be successful with a changing code base. We found that it is impractical to expect things not to change.

The only thing that we can do is build resiliency for change into these deployments. Creating an automated and tested referenceable deployment is the best way to cope with change.