About Rob H

A Baltimore transplant to Austin, Rob thinks about ways of building scale infrastructure for the clouds using Agile processes. He sat on the OpenStack Foundation board for four years. He co-founded RackN enable software that creates hyperscale converged infrastructure.

Understanding OpenStack Designated Code Sections – Three critical questions

Posted on June 5, 2014 by Rob H

A collaboration with Michael Still (TC Member from Rackspace) & Joshua McKenty and Cross posted by Rackspace.

After nearly a year of discussion, the OpenStack board launched the DefCore process with 10 principles that set us on path towards a validated interoperability standard. We created the concept of “designated sections” to address concerns that using API tests to determine core would undermine commercial and community investment in a working, shared upstream implementation.

Designated sections provides the “you must include this” part of the core definition. Having common code as part of core is a central part of how DefCore is driving OpenStack operability.

So, why do we need this?

From our very formation, OpenStack has valued implementation over specification; consequently, there is a fairly strong community bias to ensure contributions are upstreamed. This bias is codified into the very structure of the GNU General Public License (GPL) but intentionally missing in the Apache Public License (APL v2) that OpenStack follows. The choice of Apache2 was important for OpenStack to attract commercial interests, who often consider GPL a “poison pill” because of the upstream requirements.

Nothing in the Apache license requires consumers of the code to share their changes; however, the OpenStack foundation does have control of how the OpenStack™ brand is used. Thus it’s possible for someone to fork and reuse OpenStack code without permission, but they cannot called it “OpenStack” code. This restriction only has strength if the OpenStack brand has value (protecting that value is the primary duty of the Foundation).

This intersection between License and Brand is the essence of why the Board has created the DefCore process.

Ok, how are we going to pick the designated code?

Figuring out which code should be designated is highly project specific and ultimately subjective; however, it’s also important to the community that we have a consistent and predictable strategy. While the work falls to the project technical leads (with ratification by the Technical Committee), the DefCore and Technical committees worked together to define a set of principles to guide the selection.

This Technical Committee resolution formally approves the general selection principles for “designated sections” of code, as part of the DefCore effort. We’ve taken the liberty to create a graphical representation (above) that visualizes this table using white for designated and black for non-designated sections. We’ve also included the DefCore principle of having an official “reference implementation.”

Here is the text from the resolution presented as a table:

Should be DESIGNATED:	Should NOT be DESIGNATED:
code provides the project external REST API, or code is shared and provides common functionality for all options, or code implements logic that is critical for cross-platform operation	code interfaces to vendor-specific functions, or project design explicitly intended this section to be replaceable, or code extends the project external REST API in a new or different way, or code is being deprecated

The resolution includes the expectation that “code that is not clearly designated is assumed to be designated unless determined otherwise. The default assumption will be to consider code designated.”

This definition is a starting point. Our next step is to apply these rules to projects and make sure that they provide meaningful results.

Wow, isn’t that a lot of code?

Not really. Its important to remember that designated sections alone do not define core: the must-pass tests are also a critical component. Consequently, designated code in projects that do not have must-pass tests is not actually required for OpenStack licensed implementation.

All I Ever Wanted Is A Composable RunDeck

Posted on June 4, 2014 by Rob H

I love the “just enough orchestration” description of OpenCrowbar. It’s Goldilocks orchestration!

OpenCrowbar Design Principles: Attribute Injection [Series 6 of 6]

Posted on May 30, 2014 by Rob H

This is part 5 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar. The content is reposted from the OpenCrowbar docs repo.

Attribute Injection

Attribute Injection is an essential aspect of the “FuncOps” story because it helps clean boundaries needed to implement consistent scripting behavior between divergent sites.

It also allows Crowbar to abstract and isolate provisioning layers. This operational approach means that deployments are composed of layered services (see emergent services) instead of locked “golden” images. The layers can be maintained independently and allow users to compose specific configurations a la cart. This approach works if the layers have clean functional boundaries (FuncOps) that can be scoped and managed atomically.

To explain how Attribute Injection accomplishes this, we need to explore why search became an anti-pattern in Crowbar v1. Originally, being able to use server based search functions in operational scripting was a critical feature. It allowed individual nodes to act as part of a system by searching for global information needed to make local decisions. This greatly added Crowbar’s mission of system level configuration; however, it also created significant hidden interdependencies between scripts. As Crowbar v1 grew in complexity, searches became more and more difficult to maintain because they were difficult to correctly scope, hard to centrally manage and prone to timing issues.

Crowbar was not unique in dealing with this problem – the Attribute Injection pattern has become a preferred alternative to search in integrated community cookbooks.

Attribute Injection in OpenCrowbar works by establishing specific inputs and outputs for all state actions (NodeRole runs). By declaring the exact inputs needed and outputs provided, Crowbar can better manage each annealing operation. This control includes deployment scoping boundaries, time sequence of information plus override and substitution of inputs based on execution paths.

This concept is not unique to Crowbar. It has become best practice for operational scripts. Crowbar simply extends to paradigm to the system level and orchestration level.

Attribute Injection enabled operations to be:

Atomic – only the information needed for the operation is provided so risk of “bleed over” between scripts is minimized. This is also a functional programming preference.
Isolated Idempotent – risk of accidentally picking up changed information from previous runs is reduced by controlling the inputs. That makes it more likely that scripts can be idempotent.
Cleanly Scoped – information passed into operations can be limited based on system deployment boundaries instead of search parameters. This allows the orchestration to manage when and how information is added into configurations.
Easy to troubleshoot – since the information is limited and controlled, it is easier to recreate runs for troubleshooting. This is a substantial value for diagnostics.

OpenCrowbar Design Principles: Emergent services [Series 5 of 6]

Posted on May 30, 2014 by Rob H

This is part 5 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar. The content is reposted from the OpenCrowbar docs repo.

Emergent services

We see data center operations as a duel between conflicting priorities. On one hand, the environment is constantly changing and systems must adapt quickly to these changes. On the other hand, users of the infrastructure expect it to provide stable and consistent services for consumption. We’ve described that as “always ready, never finished.”

Our solution to this duality to expect that the infrastructure Crowbar builds is decomposed into well-defined service layers that can be (re)assembled dynamically. Rather than require any component of the system to be in a ready state, Crowbar design principles assume that we can automate the construction of every level of the infrastructure from bios to network and application. Consequently, we can hold off (re)making decisions at the bottom levels until we’ve figured out that we’re doing at the top.

Effectively, we allow the overall infrastructure services configuration to evolve or emerge based on the desired end use. These concepts are built on computer science principles that we have appropriated for Ops use; since we also subscribe to Opscode “infrastructure as code”, we believe that these terms are fitting in a DevOps environment. In the next pages, we’ll explore the principles behind this approach including concepts around simulated annealing, late binding, attribute injection and emergent design.

Emergent (aka iterative or evolutionary) design challenges the traditional assumption that all factors must be known before starting

Dependency graph – multidimensional relationship
High degree of reuse via abstraction and isolation of service boundaries.
Increasing complexity of deployments means more dependencies
Increasing revision rates of dependencies but with higher stability of APIs

OpenCrowbar Design Principles: Simulated Annealing [Series 4 of 6]

Posted on May 29, 2014 by Rob H

This is part 4 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar. The content is reposted from the OpenCrowbar docs repo.

Simulated Annealing

Simulated Annealing is a modeling strategy from Computer Science for seeking optimum or stable outcomes through iterative analysis. The physical analogy is the process of strengthening steel by repeatedly heating, quenching and hammering. In both computer science and metallurgy, the process involves evaluating state, taking action, factoring in new data and then repeating. Each annealing cycle improves the system even though we may not know the final target state.

Annealing is well suited for problems where there is no mathematical solution, there’s an irregular feedback loop or the datasets change over time. We have all three challenges in continuous operations environments. While it’s clear that a deployment can modeled as directed graph (a mathematical solution) at a specific point in time, the reality is that there are too many unknowns to have a reliable graph. The problem is compounded because of unpredictable variance in hardware (NIC enumeration, drive sizes, BIOS revisions, network topology, etc) that’s even more challenging if we factor in adapting to failures. An operating infrastructure is a moving target that is hard to model predictively.

Crowbar implements the simulated annealing algorithm by decomposing the operations infrastructure into atomic units, node-roles, that perform the smallest until of work definable. Some or all of these node-roles are changed whenever the infrastructure changes. Crowbar anneals the environment by exercising the node-roles in a consistent way until system re-stabilizes.

One way to visualize Crowbar annealing is to imagine children who have to cross a field but don’t have a teacher to coordinate. Once student takes a step forward and looks around then another sees the first and takes two steps. Each child advances based on what their peers are doing. None wants to get too far ahead or be left behind. The line progresses irregularly but consistently based on the natural relationships within the group.

To understand the Crowbar Annealer, we have to break it into three distinct components: deployment timeline, annealing and node-role state. The deployment timeline represents externally (user, hardware, etc) initiated changes that propose a new target state. Once that new target is committed, Crowbar anneals by iterating through all the node-roles in a reasonable order. As the Annealer runs the node-roles they update their own state. The aggregate state of all the node-roles determines the state of the deployment.

A deployment is a combination of user and system defined state. Crowbar’s job is to get deployments stable and then maintain over time.

OpenCrowbar Design Principles: Late Binding [Series 3 of 6]

Posted on May 29, 2014 by Rob H

This is part 3 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar. The content is reposted from the OpenCrowbar docs repo.

Ops Late Binding

In terms of computer science languages, late binding describes a class of 4th generation languages that do not require programmers to know all the details of the information they will store until the data is actually stored. Historically, computers required very exact and prescriptive data models, but later generation languages embraced a more flexible binding.

Ops is fluid and situational.

Many DevOps tooling leverages eventual consistency to create stable deployments. This iterative approach assumes that repeated attempts of executing the same idempotent scripts do deliver this result; however, they are do not deliver predictable upgrades in situations where there are circular dependencies to resolve.

It’s not realistic to predict the exact configuration of a system in advance –

the operational requirements recursively impact how the infrastructure is configured
ops environments must be highly dynamic
resilience requires configurations to be change tolerant

Even more complex upgrade where the steps cannot be determined in advanced because the specifics of the deployment direct the upgrade.

Late Binding is a foundational topic for Crowbar that we’ve been talking about since mid-2012. I believe that it’s an essential operational consideration to handle resiliency and upgrades. We’ve baked it deeply into OpenCrowbar design.

Continue Reading > post 4

OpenCrowbar Design Principles: The Ops Challenge [Series 2 of 6]

Posted on May 28, 2014 by Rob H

This is part 2 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar. The content is reposted from the OpenCrowbar docs repo.

The operations challenge

A deployment framework is key to solving the problems of deploying, configuring, and scaling open source clusters for cloud computing.

Deploying an open source cloud can be a complex undertaking. Manual processes, can take days or even weeks working to get a cloud fully operational. Even then, a cloud is never static, in the real world cloud solutions are constantly on an upgrade or improvement path. There is continuous need to deploy new servers, add management capabilities, and track the upstream releases, while keeping the cloud running, and providing reliable services to end users. Service continuity requirements dictate a need for automation and orchestration. There is no other way to reduce the cost while improving the uptime reliability of a cloud.

These were among the challenges that drove the development of the OpenCrowbar software framework from it’s roots as an OpenStack installer into a much broader orchestration tool. Because of this evolution, OpenCrowbar has a number of architectural features to address these challenges:

Abstraction Around OrchestrationOpenCrowbar is designed to simplify the operations of large scale cloud infrastructure by providing a higher level abstraction on top of existing configuration management and orchestration tools based on a layered deployment model.
Web ArchitectureOpenCrowbar is implemented as a web application server, with a full user interface and a predictable and consistent REST API.
Platform Agnostic ImplementationOpenCrowbar is designed to be platform and operating system agnostic. It supports discovery and provisioning from a bare metal state, including hardware configuration, updating and configuring BIOS and BMC boards, and operating system installation. Multiple operating systems and heterogeneous operating systems are supported. OpenCrowbar enables use of time-honored tools, industry standard tools, and any form of scriptable facility to perform its state transition operations.
Modular ArchitectureOpenCrowbar is designed around modular plug-ins called Barclamps. Barclamps allow for extensibility and customization while encapsulating layers of deployment in manageable units.
State Transition Management EngineThe core of OpenCrowbar is based on a state machine (we call it the Annealer) that tracks nodes, roles, and their relationships in groups called deployments. The state machine is responsible for analyzing dependencies and scheduling state transition operations (transitions).
Data modelOpenCrowbar uses a dedicated database to track system state and data. As discovery and deployment progresses, system data is collected and made available to other components in the system. Individual components can access and update this data, reducing dependencies through a combination of deferred binding and runtime attribute injection.
Network AbstractionOpenCrowbar is designed to support a flexible network abstraction, where physical interfaces, BMC’s, VLANS, binding, teaming, and other low level features are mapped to logical conduits, which can be referenced by other components. Networking configurations can be created dynamically to adapt to changing infrastructure.

Continue Reading > post 3

OpenCrowbar Design Principles: Reintroduction [Series 1 of 6]

Posted on May 28, 2014 by Rob H

While “ready state” as a concept has been getting a lot of positive response, I forget that much of the innovation and learning behind that concept never surfaced as posts here. The Anvil (2.0) release included the OpenCrowbar team cataloging our principles in docs. Now it’s time to repost the team’s work into a short series over the next three days.

In architecting the Crowbar operational model, we’ve consistently ~~twisted~~ adapted traditional computer science concepts like late binding, simulated annealing, emergent behavior, attribute injection and functional programming to create a repeatable platform for sharing open operations practice (post 2).

Functional DevOps aka “FuncOps”

Ok, maybe that’s not going to be the 70’s era hype bubble name, but… the operational model behind Crowbar is entering its third generation and its important to understand the state isolation and integration principles behind that model is closer to functional than declarative programming.

Parliament is Crowbar’s official FuncOps sound track

The model is critical because it shapes how Crowbar approaches the infrastructure at a fundamental level so it makes it easier to interact with the platform if you see how we are approaching operations. Crowbar’s goal is to create emergent services.

We’ll expore those topics in this series to explain Crowbar’s core architectural principles. Before we get into that, I’d like to review some history.

The Crowbar Objective

Crowbar delivers repeatable best practice deployments. Crowbar is not just about installation: we define success as a sustainable operations model where we continuously improve how people use their infrastructure. The complexity and pace of technology change is accelerating so we must have an approach that embraces continuous delivery.

Crowbar’s objective is to help operators become more efficient, stable and resilient over time.

Background

When Greg Althaus (github @GAlhtaus) and Rob “zehicle” Hirschfeld (github @CloudEdge) started the project, we had some very specific targets in mind. We’d been working towards using organic emergent swarming (think ants) to model continuous application deployment. We had also been struggling with the most routine foundational tasks (bios, raid, o/s install, networking, ops infrastructure) when bringing up early scale cloud & data applications. Another key contributor, Victor Lowther (github @VictorLowther) has critical experience in Linux operations, networking and dependency resolution that lead to made significant contributions around the Annealing and networking model. These backgrounds heavily influenced how we approached Crowbar.

First, we started with best of field DevOps infrastructure: Opscode Chef. There was already a remarkable open source community around this tool and an enthusiastic following for cloud and scale operators . Using Chef to do the majority of the installation left the Crowbar team to focus on

Key Features

Heterogeneous Operating Systems – chose which operating system you want to install on the target servers.
CMDB Flexibility (see picture) – don’t be locked in to a devops toolset. Attribute injection allows clean abstraction boundaries so you can use multiple tools (Chef and Puppet, playing together).
Ops Annealer –the orchestration at Crowbar’s heart combines the best of directed graphs with late binding and parallel execution. We believe annealing is the key ingredient for repeatable and OpenOps shared code upgrades
Upstream Friendly – infrastructure as code works best as a community practice and Crowbar use upstream code
without injecting “crowbarisms” that were previously required. So you can share your learning with the broader DevOps community even if they don’t use Crowbar.
Node Discovery (or not) – Crowbar maintains the same proven discovery image based approach that we used before, but we’ve streamlined and expanded it. You can use Crowbar’s API outside of the PXE discovery system to accommodate Docker containers, existing systems and VMs.
Hardware Configuration – Crowbar maintains the same optional hardware neutral approach to RAID and BIOS configuration. Configuring hardware with repeatability is difficult and requires much iterative testing. While our approach is open and generic, the team at Dell works hard to validate a on specific set of gear: it’s impossible to make statements beyond that test matrix.
Network Abstraction – Crowbar dramatically extended our DevOps network abstraction. We’ve learned that a networking is the key to success for deployment and upgrade so we’ve made Crowbar networking flexible and concise. Crowbar networking works with attribute injection so that you can avoid hardwiring networking into DevOps scripts.
Out of band control – when the Annealer hands off work, Crowbar gives the worker implementation flexibility to do it on the node (using SSH) or remotely (using an API). Making agents optional means allows operators and developers make the best choices for the actions that they need to take.
Technical Debt Paydown – We’ve also updated the Crowbar infrastructure to use the latest libraries like Ruby 2, Rails 4, Chef 11. Even more importantly, we’re dramatically simplified the code structure including in repo documentation and a Docker based developer environment that makes building a working Crowbar environment fast and repeatable.

OpenCrowbar (CB2) vs Crowbar (CB1)?

Why change to OpenCrowbar? This new generation of Crowbar is structurally different from Crowbar 1 and we’ve investing substantially in refactoring the tooling, paying down technical debt and cleanup up documentation. Since Crowbar 1 is still being actively developed, splitting the repositories allow both versions to progress with less confusion. The majority of the principles and deployment code is very similar, I think of Crowbar as a single community.

Continue Reading > post 2

Just for fun, putting themes to OpenStack Conferences

Posted on May 23, 2014 by Rob H

I’ve been to every OpenStack summit and, in retrospect, each one has a different theme. I see these as community themes beyond the releases train that cover how the OpenStack ecosystem has changed.

The themes are, of course, highly subjective and intented to spark reflection and discussion.

City	Release	Theme	My Commentary
ATL	Ice House	Its my sandbox!	The new marketplace is great and there are also a lot of vendors who want to differentiate their offering and are not sure where to play.
HK	Havana	Project land grab	It felt like a PTL gold rush as lots of new projects where tossed into the ecosystem mix. I’m wary of perceived “anointed” projects that define “the way” to do things.
PDX	Grizzly	Shiny new things	We went from having a defined core set of projects to a much richer and varied platforms, environments and solutions.
SD	Folsom	Breaking up is hard to do	Nova began to fragment (cinder & ~~quantum~~ neutron)
SF	Essex	New kids are here	Move over Rackspace. Lots of new operating systems, providers, consulting and hosting companies participating. Stackalytics makes it into a real commit race.
BOS	Diablo	Race to be the first	Everyone was trying to show that OpenStack could be used for real work. Lots of startups launched.
SJC	Cactus	Oh, you like us! We need some process	This is real so everyone was exporing OpenStack. We clearly needed to figure out how to work together. This is where we migrated to git.
SA	Bexar	We’re going to take over the world	We handed out rose-colored classes that mostly turned out pretty accurate; however. many some top names from that time are not in the community now (Citrix, NASA, Accenture, and others).
ATX	Austin	We choose “none of the above”	There was a building sense of potential energy while companies figured out that 1) there was a gap and 2) they wanted to fill it together.

OpenStack ATL Recap to the 11s: the danger of drama + 5 challenges & 5 successes

Posted on May 22, 2014 by Rob H

I’ve come to accept that the “Hallway Track” is my primary session at OpenStack events. I want to thank the many people in the community who make that the best track. It’s not only full of deep technical content; there are also healthy doses of intrigue, politics and “let’s fix that” in the halls.

I think honest reflection is critical to OpenStack growth (reflections from last year). My role as a Board member must not translate into pom-pom waving robot cheerleader.

What I heard that’s working:

Foundation event team did a great job on the logistics and many appreciate the user and operator focus. There’s is no doubt that OpenStack is being deployed at scale and helping transform cloud infrastructure. I think that’s a great message.
DefCore criteria were approved by the Board. The overall process and impact was talked about positively at the summit. To accelerate, we need +1s and feedback because “crickets” means we need to go slower. I’ll have to dedicate a future post to next steps and “designated sections.”
Marketplace! Great turn out by vendors of all types, but I’m not hearing about them making a lot of money from OpenStack (which is needed for them to survive). I like the diversity of the marketplace: consulting, aaServices, installers, networking, more networking, new distros, and ecosystem tools.
There’s some real growth in aaS services for openstack (database, load balancer, dns, etc). This is the ecosystem that many want OpenStack to drive because it helps displace Amazon cloud. I also heard concerns that to be sure they are pluggable so companies can complete on implementation.
Lots of process changes to adapt to growing pains. People felt that the community is adapting (yeah!) but were concerned having to re-invent tooling (meh).

There are also challenges that people brought to me:

Our #1 danger is drama. Users and operators want collaboration and friendly competition. They are turned off by vendor conflict or strong-arming in the community (e.g.: the WSJ Red Hat article and fallout). I’d encourage everyone to breathe more and react less.
Lack of product management is risking a tragedy of the commons. Helping companies work together and across projects is needed for our collaboration processes to work. I’ll be exploring this with Sean Roberts in future posts.
Making sure there’s profit being generated from shared code. We need to remember that most of the development is corporate funded so we need to make sure that companies generate revenue. The trend of everyone creating unique distros may indicate a problem.
We need to be more operator friendly. I know we’re trying but we create distance with operators when we insist on creating new tools instead of using the existing ecosystem. That also slows down dealing with upgrades, resilient architecture and other operational concerns.
Anointed projects concerns have expanded since Hong Kong. There’s a perception that Heat (orchestration), Triple0 (provisioning), Solum (platform) are considered THE only way OpenStack solves those problems and other approaches are not welcome. While that encourages collaboration, it also chills competition and discussion.
There’s a lot of whispering about the status of challenged projects: neutron (works with proprietary backends but not open, may not stay integrated) and openstack boot-strap (state of TripleO/Ironic/Heat mix). The issue here is NOT if they are challenged but finding ways to discuss concerns openly (see anointed projects concern).

I’d enjoy hearing more about success and deeper discussion around concerns. I use community feedback to influence my work in the community and on the board. If you think I’ve got it right or wrong then please let me know.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Author Archives: Rob H