OpenCrowbar Design Principles: The Ops Challenge [Series 2 of 6]

Posted on May 28, 2014 by Rob H

This is part 2 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar. The content is reposted from the OpenCrowbar docs repo.

The operations challenge

A deployment framework is key to solving the problems of deploying, configuring, and scaling open source clusters for cloud computing.

Deploying an open source cloud can be a complex undertaking. Manual processes, can take days or even weeks working to get a cloud fully operational. Even then, a cloud is never static, in the real world cloud solutions are constantly on an upgrade or improvement path. There is continuous need to deploy new servers, add management capabilities, and track the upstream releases, while keeping the cloud running, and providing reliable services to end users. Service continuity requirements dictate a need for automation and orchestration. There is no other way to reduce the cost while improving the uptime reliability of a cloud.

These were among the challenges that drove the development of the OpenCrowbar software framework from it’s roots as an OpenStack installer into a much broader orchestration tool. Because of this evolution, OpenCrowbar has a number of architectural features to address these challenges:

Abstraction Around OrchestrationOpenCrowbar is designed to simplify the operations of large scale cloud infrastructure by providing a higher level abstraction on top of existing configuration management and orchestration tools based on a layered deployment model.
Web ArchitectureOpenCrowbar is implemented as a web application server, with a full user interface and a predictable and consistent REST API.
Platform Agnostic ImplementationOpenCrowbar is designed to be platform and operating system agnostic. It supports discovery and provisioning from a bare metal state, including hardware configuration, updating and configuring BIOS and BMC boards, and operating system installation. Multiple operating systems and heterogeneous operating systems are supported. OpenCrowbar enables use of time-honored tools, industry standard tools, and any form of scriptable facility to perform its state transition operations.
Modular ArchitectureOpenCrowbar is designed around modular plug-ins called Barclamps. Barclamps allow for extensibility and customization while encapsulating layers of deployment in manageable units.
State Transition Management EngineThe core of OpenCrowbar is based on a state machine (we call it the Annealer) that tracks nodes, roles, and their relationships in groups called deployments. The state machine is responsible for analyzing dependencies and scheduling state transition operations (transitions).
Data modelOpenCrowbar uses a dedicated database to track system state and data. As discovery and deployment progresses, system data is collected and made available to other components in the system. Individual components can access and update this data, reducing dependencies through a combination of deferred binding and runtime attribute injection.
Network AbstractionOpenCrowbar is designed to support a flexible network abstraction, where physical interfaces, BMC’s, VLANS, binding, teaming, and other low level features are mapped to logical conduits, which can be referenced by other components. Networking configurations can be created dynamically to adapt to changing infrastructure.

Continue Reading > post 3

OpenCrowbar Design Principles: Reintroduction [Series 1 of 6]

Posted on May 28, 2014 by Rob H

While “ready state” as a concept has been getting a lot of positive response, I forget that much of the innovation and learning behind that concept never surfaced as posts here. The Anvil (2.0) release included the OpenCrowbar team cataloging our principles in docs. Now it’s time to repost the team’s work into a short series over the next three days.

In architecting the Crowbar operational model, we’ve consistently ~~twisted~~ adapted traditional computer science concepts like late binding, simulated annealing, emergent behavior, attribute injection and functional programming to create a repeatable platform for sharing open operations practice (post 2).

Functional DevOps aka “FuncOps”

Ok, maybe that’s not going to be the 70’s era hype bubble name, but… the operational model behind Crowbar is entering its third generation and its important to understand the state isolation and integration principles behind that model is closer to functional than declarative programming.

Parliament is Crowbar’s official FuncOps sound track

The model is critical because it shapes how Crowbar approaches the infrastructure at a fundamental level so it makes it easier to interact with the platform if you see how we are approaching operations. Crowbar’s goal is to create emergent services.

We’ll expore those topics in this series to explain Crowbar’s core architectural principles. Before we get into that, I’d like to review some history.

The Crowbar Objective

Crowbar delivers repeatable best practice deployments. Crowbar is not just about installation: we define success as a sustainable operations model where we continuously improve how people use their infrastructure. The complexity and pace of technology change is accelerating so we must have an approach that embraces continuous delivery.

Crowbar’s objective is to help operators become more efficient, stable and resilient over time.

Background

When Greg Althaus (github @GAlhtaus) and Rob “zehicle” Hirschfeld (github @CloudEdge) started the project, we had some very specific targets in mind. We’d been working towards using organic emergent swarming (think ants) to model continuous application deployment. We had also been struggling with the most routine foundational tasks (bios, raid, o/s install, networking, ops infrastructure) when bringing up early scale cloud & data applications. Another key contributor, Victor Lowther (github @VictorLowther) has critical experience in Linux operations, networking and dependency resolution that lead to made significant contributions around the Annealing and networking model. These backgrounds heavily influenced how we approached Crowbar.

First, we started with best of field DevOps infrastructure: Opscode Chef. There was already a remarkable open source community around this tool and an enthusiastic following for cloud and scale operators . Using Chef to do the majority of the installation left the Crowbar team to focus on

Key Features

Heterogeneous Operating Systems – chose which operating system you want to install on the target servers.
CMDB Flexibility (see picture) – don’t be locked in to a devops toolset. Attribute injection allows clean abstraction boundaries so you can use multiple tools (Chef and Puppet, playing together).
Ops Annealer –the orchestration at Crowbar’s heart combines the best of directed graphs with late binding and parallel execution. We believe annealing is the key ingredient for repeatable and OpenOps shared code upgrades
Upstream Friendly – infrastructure as code works best as a community practice and Crowbar use upstream code
without injecting “crowbarisms” that were previously required. So you can share your learning with the broader DevOps community even if they don’t use Crowbar.
Node Discovery (or not) – Crowbar maintains the same proven discovery image based approach that we used before, but we’ve streamlined and expanded it. You can use Crowbar’s API outside of the PXE discovery system to accommodate Docker containers, existing systems and VMs.
Hardware Configuration – Crowbar maintains the same optional hardware neutral approach to RAID and BIOS configuration. Configuring hardware with repeatability is difficult and requires much iterative testing. While our approach is open and generic, the team at Dell works hard to validate a on specific set of gear: it’s impossible to make statements beyond that test matrix.
Network Abstraction – Crowbar dramatically extended our DevOps network abstraction. We’ve learned that a networking is the key to success for deployment and upgrade so we’ve made Crowbar networking flexible and concise. Crowbar networking works with attribute injection so that you can avoid hardwiring networking into DevOps scripts.
Out of band control – when the Annealer hands off work, Crowbar gives the worker implementation flexibility to do it on the node (using SSH) or remotely (using an API). Making agents optional means allows operators and developers make the best choices for the actions that they need to take.
Technical Debt Paydown – We’ve also updated the Crowbar infrastructure to use the latest libraries like Ruby 2, Rails 4, Chef 11. Even more importantly, we’re dramatically simplified the code structure including in repo documentation and a Docker based developer environment that makes building a working Crowbar environment fast and repeatable.

OpenCrowbar (CB2) vs Crowbar (CB1)?

Why change to OpenCrowbar? This new generation of Crowbar is structurally different from Crowbar 1 and we’ve investing substantially in refactoring the tooling, paying down technical debt and cleanup up documentation. Since Crowbar 1 is still being actively developed, splitting the repositories allow both versions to progress with less confusion. The majority of the principles and deployment code is very similar, I think of Crowbar as a single community.

Continue Reading > post 2

Just for fun, putting themes to OpenStack Conferences

Posted on May 23, 2014 by Rob H

I’ve been to every OpenStack summit and, in retrospect, each one has a different theme. I see these as community themes beyond the releases train that cover how the OpenStack ecosystem has changed.

The themes are, of course, highly subjective and intented to spark reflection and discussion.

City	Release	Theme	My Commentary
ATL	Ice House	Its my sandbox!	The new marketplace is great and there are also a lot of vendors who want to differentiate their offering and are not sure where to play.
HK	Havana	Project land grab	It felt like a PTL gold rush as lots of new projects where tossed into the ecosystem mix. I’m wary of perceived “anointed” projects that define “the way” to do things.
PDX	Grizzly	Shiny new things	We went from having a defined core set of projects to a much richer and varied platforms, environments and solutions.
SD	Folsom	Breaking up is hard to do	Nova began to fragment (cinder & ~~quantum~~ neutron)
SF	Essex	New kids are here	Move over Rackspace. Lots of new operating systems, providers, consulting and hosting companies participating. Stackalytics makes it into a real commit race.
BOS	Diablo	Race to be the first	Everyone was trying to show that OpenStack could be used for real work. Lots of startups launched.
SJC	Cactus	Oh, you like us! We need some process	This is real so everyone was exporing OpenStack. We clearly needed to figure out how to work together. This is where we migrated to git.
SA	Bexar	We’re going to take over the world	We handed out rose-colored classes that mostly turned out pretty accurate; however. many some top names from that time are not in the community now (Citrix, NASA, Accenture, and others).
ATX	Austin	We choose “none of the above”	There was a building sense of potential energy while companies figured out that 1) there was a gap and 2) they wanted to fill it together.

OpenCrowbar: ready to fly as OpenOps neutral platform – Dell stepping back

Posted on May 20, 2014 by Rob H

Two of Crowbar Founders: me with Greg Althaus [taken Jan 2013]

With the Anvil release in the bag, Dell announced on the community list yesterday that it has stopped active contribution on the Crowbar project. This effectively relaunches Crowbar as a truly vendor-neutral physical infrastructure provisioning tool.

While I cannot speak for my employer, Dell, about Crowbar; I continue serve in my role as a founder of the Crowbar Project. I agree with Eric S Raymond that founders of open source projects have a responsibility to sustain their community and ensure its longevity.

In the open DevOps bare metal provisioning market, there is nothing that matches the capabilities developed in either Crowbar v1 or OpenCrowbar. The operations model and system focused approach is truly differentiated because no other open framework has been able to integrate networking, orchestration, discovery, provisioning and configuration management like Crowbar.

It is time for the community to take Crowbar beyond the leadership of a single hardware vendor, OS vendor, workload or CMDB tool. OpenCrowbar offers operations freedom and flexibility to build upon an abstracted physical infrastructure (what I’ve called “ready state“).

We have the opportunity to make open operations a reality together.

As a Crowbar founder and its acting community leader, you are welcome to contact me directly or through the crowbar list about how to get engaged in the Crowbar community or help get connected to like-minded Crowbar resources.

Parable of Kitten Taming

Posted on May 19, 2014 by Rob H

It’s time to return to story of Barney and Bailum. Last year, I wrote about their separate paths through the circus business: Bailum succeeding with a lean model and Barney failing with a “go big” strategy. This parable opens with Bailum taking pity on Barney and bringing him into her thriving animal training business.

Bailum had grown her Lion taming business from the ground up. She started from humble beginnings with untrained dogs; consequently, she’d learned about building rapport and trust with her performers. She never considered them to be animals. To her, everyone in her organization (especially the animals) was a valued contributor. She’d seen first-hand that just one bad link in the chain could cause a great performance team to turn sour. Her acts won awards and she was proud to have them in the spotlight while she focused on building trust and a sustainable culture.

Unfortunately, Barney did not share his sister’s experience or values. He only saw the name that she’d built for the company and felt that he could use his position and relationship to promote himself. Even though he knew nothing of animal training, he was eager to redirect his staff into new areas. Reading market data and without consulting his trainers, he decided that a cute kitten acts would attract more business than the company’s successful dogs acts.

Overnight, he released the dogs and acquired kittens from a local shelter. Some of his trainers simply quit while others made an attempt to follow the new direction. Barney was impatient for success and started watching the trainers learning to work with the frisky felines. Progress was slow and Barney vented his frustration by yelling at the trainers and ultimately putting shock collars on the kittens. In short order, the trainers had left and Barney was being sprayed, scratched and bitten by the cats.

When Bailum learned about her brother’s management approach she was mortified; unfortunately, he had also signed contracts promising kitten acts to their customers. After restructuring her familial entanglements, she took a personal interest in training the kittens. She immediately recognized that cats require independence instead of direction compared to dogs. Starting from careful handling, then bringing in her lion tamers and rewarding positive results, she created working troop. The final results were so effective (and logistics so much easier) that Bailum ultimately transformed her business to focus on them exclusively.

Moral: you can’t force cats to bark but, with the right approach, kittens can outperform lions

Open Operations [4/4 series on Operating Open Source Infrastructure]

Posted on May 8, 2014 by Rob H

This post is the final in a 4 part series about Success factors for Operating Open Source Infrastructure.

tl;dr Note: This is really TWO tightly related posts: 
  part 1 is OpenOps background. 
  part 2 is about OpenStack, Tempest and DefCore.

One of the substantial challenges of large-scale deployments of open source software is that it is very difficult to come up with a best practice, or a reference implementation that can be widely explained or described by the community.

Having a best practice deployment is essential for the growth of the community because it enables multiple people to deploy the software in a repeatable, stable way. This, in turn, fosters community growth so that more people can adopt software in a consistent way. It does little good if operators have no consistent pattern for deployment, because that undermines the developers’ abilities to extend, the testers’ abilities to ensure quality, and users’ ability to repeat the success of others.

Fundamentally, the goal of an open source project, from a user’s perspective, is that they can quickly achieve and repeat the success of other people in the community.

When we look at these large-scale projects we really try to create a pattern of success that can be repeated over and over again. This ensures growth of the user base, and it also helps the developer reduce time spent troubleshooting problems.

That does not mean that every single deployment should be identical, but there is substantial value in having a limited number of success patterns. Customers can then be assured not only of quick time to value with these projects., They can also get help without having everybody else in the community attempt to untangle how one person created a site-specific. This is especially problematic if someone created an unnecessarily unique scenario. That simply creates noise and confusion in the environment., Noise is a huge cost for the community, and needs to be eliminated nor an open source project to flourish.

This isn’t any different from in proprietary software but most of these activities are hidden. A proprietary project vendor can make much stronger recommendations and install guidance because they are the only source of truth in that project. In an open source project, there are multiple sources of truth, and there are very few people who are willing to publish their exact reference implementation or test patterns. Consequently, my team has taken a strong position on creating a repeatable reference implementation for Openstack deployments, based on extensive testing. We have found that our test patterns and practices are grounded in successful customer deployments and actual, physical infrastructure deployments. So, they are very pragmatic, repeatable, and sustained.

We found that this type of testing, while expensive, is also a significant value to our customers, and something that they appreciate and have been willing to pay for.

OpenStack as an Example: Tempest for Reference Validation

The Crowbar project incorporated OpenStack Tempest project as an essential part of every OpenStack deployment. From the earliest introduction of the Tempest suite, we have understood the value of a baselining test suite for OpenStack. We believe that using the same tests the developers use for a single node test is a gate for code acceptance against a multi-node deployment, and creates significant value both for our customers and the OpenStack project as a whole. This was part of my why I embraced the suggestion of basing DefCore on tests.

While it is important to have developer tests that gate code check-ins, the ultimate goal for OpenStack is to create scale-out multi-node deployments. This is a fundamental design objective for OpenStack.

With developers and operators using the same test suite, we are able to proactively measure the success of the code in the scale deployments in a way that provides quick feedback for the developers. If Tempest tests do not pass a multi-node environment, they are not providing significant value for developers to ensure that their code is operating against best practice scenarios. Our objective is to continue to extend the Tempest suite of tests so that they are an accurate reflection of the use cases that are encountered in a best practice, referenced deployment.

Along these lines, we expect that the community will continue to expand the Tempest test suite to match actual deployment scenarios reflected in scale and multi-node configurations. Having developers be responsible for passing these tests as part of their day-to-day activities ensures that development activities do not disrupt scale operations. Ultimately, making proactive gating tests ensures that we are creating scenarios in which code quality is continually increasing, as is our ability to respond and deploy the OpenStack infrastructure.

I am very excited and optimistic that the expanding the Tempest suite holds the key to making OpenStack the most stable, reliable, performance cloud implementation available in the market. The fact that this test suite can be extended in the community, and contributed to by a broad range of implementations, only makes that test suite more valuable and more likely to fully encompass all use cases necessary for reference implementations.

Ops Validation using Development Tests [3/4 series on Operating Open Source Infrastructure]

Posted on May 5, 2014 by Rob H

This post is the third in a 4 part series about Success factors for Operating Open Source Infrastructure.

In an automated configuration deployment scenario, problems surface very quickly. They prevent deployment and force resolution before progress can be made. Unfortunately, many times this appears to be a failure within the deployment automation. My personal experience has been exactly the opposite: automation creates a “fail fast” environment in which critical issues are discovered and resolved during provisioning instead of sleeping until later.

Our ability to detect and stop until these issues are resolved creates exactly the type of repeatable, successful deployment that is essential to long-term success. When we look at these deployments, the most important success factors are that the deployment is consistent, known and predictable. Our ability to quickly identify and resolve issues that do not match those patterns dramatically improves the long-term stability of the system by creating an environment that has been benchmarked against a known reference.

Benchmarking against a known reference is ultimately the most significant value that we can provide in helping customers bring up complex solutions such as Openstack and Hadoop. Being successful with these deployments over the long term means that you have established a known configuration, and that you have maintained it in a way that is explainable and reference-able to other places.

Reference Implementation

The concept of a reference implementation provides tremendous value in deployment. Following a pattern that is a reference implementation enables you to compare notes, get help and ultimately upgrade and change deployment in known, predictable ways. Customers who can follow and implement a vendors’ reference, or the community’s reference implementation, are able to ask for help on the mailing lists, call in for help and work with the community in ways that are consistent and predictable.

Let’s explore what a reference implementation looks like.

In a reference implementation you have a consistent, known state of your physical infrastructure that has been implemented based upon a RA. That implementation follows a known best practice using standard gear in a consistent, known configuration. You can therefore explain your configuration to a community of other developers, or other people who have similar configuration, and can validate that your problem is not the physical configuration. Fundamentally, everything in a reference implementation is driving towards the elimination of possible failure cause. In this case, we are making sure that the physical infrastructure is not causing problems (getting to a ready state), because other people are using a similar (or identical) physical infrastructure configuration.

The next components of a reference implementation are the underlying software configurations for operating system management monitoring network configuration, IP networking stacks. Pretty much the entire component of the application is riding on. There are a lot of moving parts and complexity in this scenario, witha high likelihood of causing failures. Implementing and deploying the software stacks in an automated way, has enabled us to dramatically reduce the potential for problems caused by misconfiguration. Because the number of permutations of software in the reference stack is so high, it is essential that successful deployment tightly manages what exactly is deployed, in such a way that they can identify, name, and compare notes with other deployments.

Achieving Repeatable Deployments

In this case, our referenced deployment consists of the exact composition of the operating system, infrastructure tooling, and capabilities for the deployment. By having a reference capability, we can ensure that we have the same:

Operating system
Monitoring
Configuration stacks
Security tooling
Patches
Network stack (including bridges and VLAN, IP table configurations)

Each one of these components is a potential failure point in a deployment. By being able to configure and maintain that configuration automatically, we dramatically increase the opportunities for success by enabling customers to have a consistent configuration between sites.

Repeatable reference deployments enable customers to compare notes with Dell and with others in the community. It enables us to take and apply what we have learned from one site to another. For example, if a new patch breaks functionality, then we can quickly determine how that was caused. We can then fix the solution, add in the complimentary fix, and deploy it at that one site. If we are aware that 90% of our other sites have exactly the same configuration, it enables those other sites to avoid a similar problem. In this way, having both a pattern and practice referenced deployment enables the community to absorb or respond much more quickly, and be successful with a changing code base. We found that it is impractical to expect things not to change.

The only thing that we can do is build resiliency for change into these deployments. Creating an automated and tested referenceable deployment is the best way to cope with change.

Networking in Cloud Environments, SDN, NFV, and why it matters [part 1 of 2]

Posted on May 1, 2014 by Rob H

Scott Jensen is an Engineering Director and colleague of mine from Dell with deep networking and operations experience. He had first hand experience deploying OpenStack and Hadoop and has a critical role in defining Dell’s Reference Architectures in those areas. When I saw this writeup about cloud networking, I asked if it would be OK to share it with you.

Guest Post 1 of 2 by Scott Jensen:

Having a basis in enterprise data center networking, Cloud computing I have many conversations with customers implementing a cloud infrastructure. Their design the networking infrastructure can and should be different from a classic network configuration and many do not understand why. Either due to a lack of knowledge in networking or due to a lack of understanding as to why cloud computing is different from virtualization. Once you have an understanding of both of these areas you can begin to see why emerging technologies such as SDN (Software Defined Networking) and NFV (Network Function Virtualization) begin to address some of the issues that Cloud Computing can cause with your network.

Networking is all about traffic flows. In order to properly design your infrastructure you need to understand where traffic is originating, where it is going and how much traffic will be following a specific route and at what times.

There are many differences between Cloud Computing and virtualization. In many cases people I will talk to think of Cloud as virtualization in a different environment. Of course this will work just fine however it does not take advantage of the goodness that a Cloud infrastructure can bring. Some of the major differences between Virtualization and Cloud Computing have profound effects on how the network is utilized. This all has to do with the application. That is really what it is all about anyway. Rob Hirschfeld has a great post on the difference between Pets and Cattle which describes this well.

Pets and Cattle as a workload evolution

In typical virtualized infrastructures, the applications have a fairly common pattern. Many people describe these as Pets and are managed largely the same as a physical system. They have a name, they are one of a kind, they are cared for, and when the die it can be traumatic (I know I have been there).

They run on large stateful VMs
They have a lifecycle which is typically very long such as years
The applications themselves are not designed to tolerate failures. Other technologies are brought in to ensure uptime.
The application is scaled up when demands increase. This is done by adding more memory or CPU to the VM.

Cloud applications are different. Some people describe them as cattle and they are treated like cattle in many ways. They do not necessarily have a name and if one dies it is sad but not a really big deal. We should probably figure out what killed it but life goes on.

They run on smaller stateless VMs
They have a lifecycle measured in hours or months. Sometimes even less than an hour.
The application is designed to expect failures
The application scales out by increasing the number of instances which is running when the demand increases.

In his follow-up post next week, Scott discusses how this impacts the network and how SDN and NFV promises to help.

OpenCrowbar.Anvil released – hammering out a gold standard in open bare metal provisioning

Posted on April 30, 2014 by Rob H

I’m excited to be announcing OpenCrowbar’s first release, Anvil, for the community. Looking back on our original design from June 2012, we’ve accomplished all of our original objectives and more.

Now that we’ve got the foundation ready, our next release (OpenCrowbar Broom) focuses on workload development on top of the stable Anvil base. This means that we’re ready to start working on OpenStack, Ceph and Hadoop. So far, we’ve limited engagement on workloads to ensure that those developers would not also be trying to keep up with core changes. We follow emergent design so I’m certain we’ll continue to evolve the core; however, we believe the Anvil release represents a solid foundation for workload development.

There is no more comprehensive open bare metal provisioning framework than OpenCrowbar. The project’s focus on a complete operations model that comprehends hardware and network configuration with just enough orchestration delivers on a system vision that sets it apart from any other tool. Yet, Crowbar also plays nicely with others by embracing, not replacing, DevOps tools like Chef and Puppet.

Now that the core is proven, we’re porting the Crowbar v1 RAID and BIOS configuration into OpenCrowbar. By design, we’ve kept hardware support separate from the core because we’ve learned that hardware generation cycles need to be independent from the operations control infrastructure. Decoupling them eliminates release disruptions that we experienced in Crowbar v1 and makes it much easier to use to incorporate hardware from a broad range of vendors.

Here are some key components of Anvil

UI, CLI and API stable and functional
Boot and discovery process working PLUS ability to handle pre-populating and configuration
Chef and Puppet capabilities including Birk Shelf v3 support to pull in community upstream DevOps scripts
Docker, VMs and Physical Servers
Crowbar’s famous “late-bound” approach to configuration and, critically, networking setup
IPv6 native, Ruby 2, Rails 4, preliminary scale tuning
Remarkably flexible and transparent orchestration (the Annealer)
Multi-OS Deployment capability, Ubuntu, CentOS, or Different versions of the same OS

Getting the workloads ported is still a tremendous amount of work but the rewards are tremendous. With OpenCrowbar, the community has a new way to collaborate and integration this work. It’s important to understand that while our goal is to start a quarterly release cycle for OpenCrowbar, the workload release cycles (including hardware) are NOT tied to OpenCrowbar. The workloads choose which OpenCrowbar release they target. From Crowbar v1, we’ve learned that Crowbar needed to be independent of the workload releases and so we want OpenCrowbar to focus on maintaining a strong ops platform.

This release marks four years of hard-earned Crowbar v1 deployment experience and two years of v2 design, redesign and implementation. I’ve talked with DevOps teams from all over the world and listened to their pains and needs. We have a long way to go before we’re deploying 1000 node OpenStack and Hadoop clusters, OpenCrowbar Anvil significantly moves the needle in that direction.

Thanks to the Crowbar community (Dell and SUSE especially) for nurturing the project, and congratulations to the OpenCrowbar team getting us this to this amazing place.

DevOps for Non-Profits?! The Miracle Foundation does IRL Puppies v. Cattle

Posted on April 30, 2014 by Rob H

In what’s become an annual tradition, I’m taking a post to think about the intersection of Cloud and Non-profits using my better-half’s employer, The Miracle Foundation, as my inspiration (and to help support their Mothers’ Day campaign).

Their deceptively simple sounding mission is to nurture children – they’ve just added some minor wrinkles like the children are orphans, in economically challenged areas generally tucked away in remote areas of India half way around the world from their Austin HQ. That does nothing to dampen their tenacious drive to ensure that these children have the benefits of food, health care, housing, education and, most critically, nurturing caregivers.

How does that relate to the Puppies & Cattle analogy?

Like any scalable operation, they need to create highly repeatable processes to deliver their service. The Miracle Foundation service, environments where house mothers nurture children, is by its very nature a “puppy” since each child must be treated uniquely; however, everything leading up to the point of delivery must be “cattle-like” to they can scale the care they give. For example, unique lesson plan is good while a unique chart of accounts is not.

Last year, I talked about how the Miracle Foundation was using quantitative measures to evaluate quality of care. They’ve used these metrics very effectively in their operations to identify places where they must standardize (like accounting practices, health care regimens and dietary requirements) and high touch places where they cannot (selecting and promoting homes out of incubation). Exactly like cloud deployments, success means finding places where variation creates complexity (cattle) and ones where it increases value (puppies).

I’ve been impressed to see how the Miracle Foundation identified the need for standardized house-mother training curriculum as part of this analysis. Their years of experience across a breath of orphanages has shown that giving clear guidance and setting standards for the people in direct contact with the children nets tremendous results; however, just making sure this training is delivered means building up a lot of other process and standardization.

If you think your job of building DevOps scripts and practice is hard then you need to step away from the keyboard for a while. This organization, and other non-profits like it, are taking on similar challenges with real people across distances that are more than just a few router hops from your desktop. I’m inspired by how they take on these challenges and fascinated at how much commonality there is between my work and theirs.

If you’re interested in their mission, please visit them for more details.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Category Archives: Development