DevOps for Non-Profits?! The Miracle Foundation does IRL Puppies v. Cattle

In what’s become an annual tradition, I’m taking a post to think about the intersection of Cloud and Non-profits using my better-half’s employer, The Miracle Foundation, as my inspiration (and to help support their Mothers’ Day campaign).

Their deceptively simple-sounding mission is to nurture children – they’ve just added some minor wrinkles: the children are orphans, living in economically challenged and often remote areas of India, halfway around the world from their Austin HQ.  That does nothing to dampen their tenacious drive to ensure that these children have the benefits of food, health care, housing, education and, most critically, nurturing caregivers.

How does that relate to the Puppies & Cattle analogy?

Like any scalable operation, they need to create highly repeatable processes to deliver their service.  The Miracle Foundation’s service – environments where house mothers nurture children – is by its very nature a “puppy” since each child must be treated uniquely; however, everything leading up to the point of delivery must be “cattle-like” so they can scale the care they give.  For example, a unique lesson plan is good while a unique chart of accounts is not.

Last year, I talked about how the Miracle Foundation was using quantitative measures to evaluate the quality of care.  They’ve used these metrics very effectively in their operations to identify places where they must standardize (like accounting practices, health care regimens and dietary requirements) and high-touch places where they cannot (selecting and promoting homes out of incubation).  Exactly like cloud deployments, success means finding places where variation creates complexity (cattle) and ones where it increases value (puppies).

I’ve been impressed to see how the Miracle Foundation identified the need for a standardized house-mother training curriculum as part of this analysis.  Their years of experience across a breadth of orphanages have shown that giving clear guidance and setting standards for the people in direct contact with the children nets tremendous results; however, just making sure this training is delivered means building up a lot of other process and standardization.

If you think your job of building DevOps scripts and practices is hard, then you need to step away from the keyboard for a while.  This organization, and other non-profits like it, are taking on similar challenges with real people across distances that are more than just a few router hops from your desktop.  I’m inspired by how they take on these challenges and fascinated by how much commonality there is between my work and theirs.

If you’re interested in their mission, please visit them for more details.

DevOps Concept: “Ready State” Infrastructure as hand-off milestone

Working for Dell, it’s no surprise that I have a lot of discussions around building up and maintaining the physical infrastructure to run data centers at scale.  Generally the context is around OpenCrowbar, Hadoop or OpenStack Ironic/TripleO/Heat, but the concerns are really universal in my cloud operations experience.

Three Teams

Typically, deployments have three distinct phases: 1) mechanically plug together the systems, 2) get the systems ready to the OS and network level and then 3) install the application.  Often these phases are so distinct that they are handled by completely different teams!

That’s a problem because errors or unexpected changes from one phase are very expensive to address once you change teams.  The solution has been to become more and more prescriptive about what the system looks like between the second (“ready”) and third (“installed”) phase.  I’ve taken to calling this hand-off achieving a ready state infrastructure.

I define a “ready state” infrastructure as having been configured so that the application lay-down steps are simple and predictable.

In my experience, most application deployment guides start with a ready state assumption.  They read like “Step 0: rack, configure, provision and tweak the nodes and network to have this specific starting configuration.”   If you are really lucky then “specific configuration” is actually a documented and validated reference architecture.

The magic of cloud IaaS is that it always creates ready state infrastructure.  If I request 10 servers with 2 NICs running Ubuntu 14.04 then that’s exactly what I get.  The fact that cloud always provisions a ready state infrastructure has become an essential operating assumption for cloud orchestration and configuration management.
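
To make that operating assumption concrete, here is a minimal sketch of such a request using OpenStack’s python-novaclient; the credentials, image, flavor and network IDs are placeholders, not a recommendation for any particular cloud:

```python
# Sketch: asking an IaaS cloud for ready state infrastructure.
# Credentials and the image/flavor/network identifiers are placeholders.
from novaclient import client

nova = client.Client("2", "user", "password", "project", "http://keystone:5000/v2.0")

# Ten identical servers, each with two NICs, running Ubuntu 14.04.
nova.servers.create(
    name="ready-state-node",
    image=nova.images.find(name="ubuntu-14.04"),
    flavor=nova.flavors.find(name="m1.medium"),
    nics=[{"net-id": "TENANT_NET_UUID"}, {"net-id": "STORAGE_NET_UUID"}],
    min_count=10,
)
```

The specific client library is not the point; the point is that the request is declarative and the cloud either hands back exactly that configuration or fails cleanly.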

Unfortunately, hardware provisioning is messy.  It takes significant effort to configure a physical system into a ready state.  This is caused by a number of factors:

  1. You can’t alter physical infrastructure with programming (an API) – for example, if the server enumerates the NICs differently than you expected, you have to adapt to that.
  2. You have to respect the physical topology of the system – for example, production deployments use teamed NICs that have to use different switches for redundancy.  You can’t make assumptions; you have to set up the team based on the specific configuration.
  3. You have to build up the configuration in sequence – for example, you can’t set up the RAID configuration after the operating system is installed.  If you make a bad choice, you’ll likely have to repeat the whole deployment sequence, and some bad choices (like using the wrong subnets) result in a total system rebuild (see the sketch after this list).
  4. Hardware fails and is non-uniform – for example, in any order of sufficient size, you will have NIC failures due to everything from simple mechanical card-seating issues to BIOS interface mismatches.  Troubleshooting these issues can occupy significant time.
  5. Component configurations are interlocked – for example, a change to the switch settings could result in DHCP failures when systems are rebooted (real experience).  You cannot always work node-to-node, you must deal with the infrastructure as an integrated system.
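
To illustrate the sequencing point (#3), here is a minimal sketch of provisioning as an ordered pipeline where an early bad choice forces a rebuild.  The phase names and the apply step are illustrative placeholders, not OpenCrowbar’s actual workflow:

```python
# Sketch: physical provisioning as an ordered pipeline.  A failure (or a
# wrong earlier choice, like a bad subnet) invalidates everything after it.
PHASES = ["raid_config", "bios_config", "os_install", "network_config"]

def apply_phase(node, phase, settings):
    """Placeholder: push one phase's settings to the node and report success."""
    print(f"applying {phase} to {node}: {settings}")
    return True

def provision(node, desired_state):
    completed = []
    for phase in PHASES:
        if not apply_phase(node, phase, desired_state.get(phase, {})):
            # Unlike a cloud VM, you cannot simply re-run the failed step;
            # a bad early choice usually means starting over from bare metal.
            raise RuntimeError(f"{phase} failed on {node}; rebuild required")
        completed.append(phase)
    return completed

if __name__ == "__main__":
    provision("node-01", {"raid_config": {"level": "RAID10"}})
```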

Being consistent at turning discovered state into ready state is a complex and unique problem space.  As I explore this bare metal provisioning space in the community, I am more and more convinced that it has a distinct architecture from applications built for ready state operations.

My hope in this post is to test whether the concept of “ready state” infrastructure is helpful in describing the transition point between provisioning and installation.  Please let me know what you think!

Can’t Contain(erize) the Hype – is Docker real or a bubble?

Editorial Note: This was written in April 2014.  Check out how we are using Docker in our latest architectures.

The new application portability darling, Docker, was so popular at this week’s Red Hat Summit that I was expecting Miley Cyrus’ flock of paparazzi to abandon her in favor of Ben Golub.

Personally, I find Docker to be a useful tool, and we’ve been embedding it into our dev and test processes for DefCore TCUP (at Conference), OpenCrowbar Admin and Dev Nodes.  To me, these are concrete and clear use cases.

There are clearly a lot more great use-cases for Docker, but I can’t help but feel like it’s being thrown into architectural layer cakes and markitectures as a substitute for the non-words “cloud”, “amazing” and “revolutionary.”

How do I distinguish hot from hype?  I look for places where Docker is solving just one problem set instead of being a magic wand solution to a raft of systemic issues.

Places where I think Docker is potent and disruptive

  • Creating a portable and consistent environment for dev, test and delivery (see the sketch after this list)
  • Helping Linux distros keep updating the kernel without breaking user space (RHEL 7 anyone?)
  • Reducing the virtualization overhead of tenant isolation (containers are lighter)
  • Reducing the virtualization overhead for DevOps developers testing multi-node deployments
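
To make the first bullet concrete, here is a minimal sketch that shells out to the Docker CLI so that a laptop and a CI worker get an identical, disposable test environment; the image tag and test command are placeholders:

```python
# Sketch: a disposable, consistent test environment via the Docker CLI.
# The image tag and the "make test" command are placeholders.
import subprocess

IMAGE = "myproject/test-env:latest"

def run_tests():
    # Build the environment from a checked-in Dockerfile so everyone
    # gets the same dependencies.
    subprocess.check_call(["docker", "build", "-t", IMAGE, "."])
    # Run the suite in a fresh container; --rm discards it afterwards,
    # so no state leaks between runs.
    subprocess.check_call(["docker", "run", "--rm", IMAGE, "make", "test"])

if __name__ == "__main__":
    run_tests()
```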

But I’m concerned that we’re expecting too many silver bullets

  • Packaging is still tricky:  Creating a locked box helps solve part of the downstream problem (you know what you have) but not the upstream problem (you don’t know what you depend on).
  • Container sprawl: Breaking deployments into more functional discrete parts is smart, but that means we have MORE PARTS to manage.   There’s an inflection point between separation of concerns and sprawl.
  • PaaS Adoption: Docker helps with PaaS, but it solves neither the “you have to model your apps for a PaaS” problem nor the “PaaS needs scalable data services” problem.

Speaking of Miley Cyrus, it’s not the container that matters, but what’s on the inside.  Docker can take a lesson from Miley: attention is great but you’ve still got to be able to sing.  I’m not sure about Miley, but I am digging the tracks that Docker is laying down.  Docker is worth putting on your playlist.

Rocking Docker – OpenCrowbar builds solid foundation & life-cycle [VIDEOS]

Docker has been gathering a substantial amount of interest as an additional way to solve application portability and dependency hell.  We’ve been enthusiastic participants in this fledgling community, both with Docker in OpenStack and with my work on DefCore’s Tempest in a Container (TCUP).

In OpenCrowbar, we’ve embedded Docker much more deeply to solve a few difficult and critical problems: speeding up the development of multi-node deployments and building the environment for the containers.  Check out my OpenCrowbar does Docker video or the community demo!

Bootstrapping Docker into a DevOps management framework turns out to be non-trivial because integrating new nodes into a functioning operating environment is very different with Docker than with physical servers or VMs.  Containers don’t PXE boot and have more limited configuration options.

How did we do this?  Unlike other bare metal provisioning frameworks, we made sure that Crowbar did not require DHCP+PXE as the only node discovery process.  While we default to and fully support PXE with our sledgehammer discovery image, we also allow operators to pre-populate the Crowbar database using our API and make configuration adjustments before the node is discovered/created.
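
For illustration only, pre-registering a node over the API might look something like the sketch below.  The endpoint path, payload fields and credentials are assumptions for this example rather than the documented OpenCrowbar contract, so check the project’s API docs for the real schema:

```python
# Sketch: pre-populating Crowbar with a node before PXE discovery.
# The endpoint, fields and credentials below are assumptions, not the
# documented OpenCrowbar API.
import requests

CROWBAR = "http://crowbar-admin:3000"              # placeholder admin address

node = {
    "name": "docker-node-01.example.com",          # hypothetical node name
    "bootenv": "local",                            # assumed flag to skip the PXE path
}

resp = requests.post(f"{CROWBAR}/api/v2/nodes", json=node,
                     auth=("crowbar", "crowbar"))  # placeholder credentials
resp.raise_for_status()
print("registered node", resp.json().get("id"))
```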

We even went a step further and enabled the Crowbar dependency graph to take alternate routes (we call it the “provides” role).  This enhancement is essential for dealing with “alike but different” infrastructure like Docker.

The result is that you can request Docker nodes in OpenCrowbar (using the API only for now) and it will automatically create the containers and attach them into Crowbar management.  It’s important to stress that we are not adding existing containers to Crowbar by adding an agent; instead, Crowbar manages the container’s life-cycle and then the work inside the container.

Getting around the PXE cycle by using containers as part of Crowbar substantially improves Ops development cycle time because we don’t have to wait for boot > discovery > reboot > install to create a clean environment.  Bringing fresh Docker containers into a dev system takes seconds instead.

The next step is equally powerful: Crowbar should be able to configure the Docker host environment on host nodes (not just the Admin node as we are now demonstrating).  Setting up the host can be very complex: you need to have the correct RAID, BIOS, operating system and multi-NIC networking configuration.  All of these factors must be handled with a system perspective that matches your Ops environment.  Luckily, this is exactly Crowbar’s sweet spot!
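
As a rough sketch of what “ready to host containers” can mean at the operating system level (setting aside the RAID/BIOS layers Crowbar also handles), a pre-flight check might look like this.  The specific checks are illustrative assumptions, not Crowbar’s actual role logic:

```python
# Sketch: minimal OS-level readiness checks before treating a node as a
# Docker host.  The checks are illustrative, not Crowbar's role logic.
import os
import shutil
import subprocess

def docker_host_ready():
    problems = []
    if shutil.which("docker") is None:
        problems.append("docker binary not installed")
    else:
        try:
            subprocess.check_call(["docker", "info"],
                                  stdout=subprocess.DEVNULL,
                                  stderr=subprocess.DEVNULL)
        except subprocess.CalledProcessError:
            problems.append("docker daemon not responding")
    if not os.path.isdir("/sys/fs/cgroup"):
        problems.append("cgroup filesystem not mounted")
    return problems

if __name__ == "__main__":
    issues = docker_host_ready()
    print("ready" if not issues else "not ready: " + "; ".join(issues))
```

Crowbar’s job is to make checks like these pass automatically, in the right order, as part of the node’s life-cycle.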

Until we’ve got that pulled together, OpenCrowbar’s ability to use upstream cookbooks and this latest Dev/Test focused step provides remarkable out-of-the-gate advantages for everyone building multi-node DevOps tools.

Enjoy!

PS: It’s worth noting that we’ve already been using Docker to run & develop the Crowbar Admin server.  This extra step makes Crowbar even more Dockeriffic.

Mayflies and Dinosaurs (extending Puppies and Cattle)

Josh McKenty and I were discussing the common misconception of the “Puppies and Cattle” analogy. His position is not anti-puppy! He believes puppies are sometimes unavoidable and should be isolated into portable containers (VMs) so they can be shuffled around seamlessly. His more provocative point is that we want our underlying infrastructure to be cattle so it remains highly elastic and flexible. More cattle means a more resilient system. To me, this is a fundamental CloudOps design objective.

We realized that the perfect cloud infrastructure would structurally discourage the creation of puppies.

Imagine a cloud in which servers were automatically decommissioned after a week of use. In a sort of anti-SLA, any VM running for more than 168 hours would be (gracefully) terminated. This would force a constant churn of resources within the infrastructure that enables true cattle-like management. This cloud would be able to very gracefully rebalance load and handle disruptive management operations because the workloads are designed for the churn.

We called these servers mayflies due to their limited life span.
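
A minimal sketch of the idea, assuming an OpenStack cloud and python-novaclient (credentials and the auth URL are placeholders), could be as blunt as a cron job like this:

```python
# Sketch: terminate any VM older than one week (the mayfly anti-SLA).
# Assumes python-novaclient; credentials and auth URL are placeholders.
from datetime import datetime, timedelta
from novaclient import client

MAX_AGE = timedelta(hours=168)

nova = client.Client("2", "user", "password", "project", "http://keystone:5000/v2.0")

for server in nova.servers.list():
    created = datetime.strptime(server.created, "%Y-%m-%dT%H:%M:%SZ")
    if datetime.utcnow() - created > MAX_AGE:
        # A graceful version would notify owners, drain traffic and
        # snapshot state before terminating.
        print("reaping mayfly:", server.name)
        server.delete()
```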

While this approach requires a high degree of automation, the most successful cloud operators I have met are effectively building workloads with this requirement. If we require application workloads to be elastic and fault-resilient, then we have a much higher degree of flexibility with the underlying infrastructure. I’ve seen this in practice with several OpenStack clouds: operators who helped applications deploy using automation were able to decommission “old” clouds much more gracefully. They effectively turned their entire cloud into a cow. Sadly, the ones without that investment puppified™ the ops infrastructure and created a much more brittle environment.

The opposite of a mayfly is the dinosaur: a server that is so brittle and locked that the slightest disturbance wipes out everything it touches.

Dinosaurs are puppies grown into a T-Rex with rows of massive razor-sharp teeth and tiny manicured hands. These are systems that are so unique and historical that there’s no way to recreate them if there’s a failure. The original maintainers’ exit happy hour was celebrated by people who were laid off two CEOs ago. The impact of dinosaurs goes beyond their operational risk; they are typically impossible to extend or maintain and, consequently, ossify other servers around them. This type of server drains elasticity from your ops team.

Puppies do not grow up to become dogs, they become dinosaurs.

It’s a classic lean adage to do hard things more frequently. Perhaps it’s time to start creating mayflies in your ops infrastructure.

OpenCrowbar reaches critical milestone – boot, discover and forge on!

We started the Crowbar project because we needed to make OpenStack deployments fast, repeatable and sharable.  We wanted a tool that looked at deployments as a system and integrated with our customers’ operations environment.  Crowbar was born as an MVP and quickly grew into a more dynamic tool that could deploy OpenStack, Hadoop, Ceph and other applications, but most critically we recognized that our knowledge gaps were substantial and we wanted to collaborate with others on the learning.  The result of that learning was a rearchitecture effort that we started at OSCON in 2012.

After nearly two years, I’m proud to show off the framework that we’ve built: OpenCrowbar addresses the limitations of Crowbar 1.x and adds critical new capabilities.

So what’s in OpenCrowbar?  Pretty much what we targeted at the launch and we’ve added some wonderful surprises too:

  • Heterogeneous Operating Systems – choose which operating system you want to install on the target servers.
  • CMDB Flexibility – don’t be locked in to a devops toolset.  Attribute injection allows clean abstraction boundaries so you can use multiple tools (Chef and Puppet, playing together).
  • Ops Annealer – the orchestration at Crowbar’s heart combines the best of directed graphs with late binding and parallel execution.  We believe annealing is the key ingredient for repeatable and OpenOps shared code upgrades.
  • Upstream Friendly – infrastructure as code works best as a community practice, and Crowbar uses upstream code without injecting “crowbarisms” that were previously required.  So you can share your learning with the broader DevOps community even if they don’t use Crowbar.
  • Node Discovery (or not) – Crowbar maintains the same proven discovery image based approach that we used before, but we’ve streamlined and expanded it.  You can use Crowbar’s API outside of the PXE discovery system to accommodate Docker containers, existing systems and VMs.
  • Hardware Configuration – Crowbar maintains the same optional hardware-neutral approach to RAID and BIOS configuration.  Configuring hardware with repeatability is difficult and requires much iterative testing.  While our approach is open and generic, my team at Dell works hard to validate on a specific set of gear: it’s impossible to make statements beyond that test matrix.
  • Network Abstraction – Crowbar dramatically extended our DevOps network abstraction.  We’ve learned that networking is the key to success for deployment and upgrade, so we’ve made Crowbar networking flexible and concise.  Crowbar networking works with attribute injection so that you can avoid hardwiring networking into DevOps scripts.
  • Out of band control – when the Annealer hands off work, Crowbar gives the worker implementation flexibility to do it on the node (using SSH) or remotely (using an API).  Making agents optional allows operators and developers to make the best choices for the actions that they need to take.
  • Technical Debt Paydown – We’ve also updated the Crowbar infrastructure to use the latest libraries like Ruby 2, Rails 4 and Chef 11.  Even more importantly, we’ve dramatically simplified the code structure, including in-repo documentation and a Docker-based developer environment that makes building a working Crowbar environment fast and repeatable.

Why change to OpenCrowbar?  This new generation of Crowbar is structurally different from Crowbar 1, and we’ve invested substantially in refactoring the tooling, paying down technical debt and cleaning up documentation.  Since Crowbar 1 is still being actively developed, splitting the repositories allows both versions to progress with less confusion.  The majority of the principles and deployment code is very similar, so I think of Crowbar as a single community.

Interested?  Our new Docker Admin node is quick to setup and can boot and manage both virtual and physical nodes.

5 differences between Cloud ops and Bare Metal ops

Cloud APIs are about abstracting operations to simplify deployment.  We want users of our cloud infrastructure to operate with blissful unawareness of the underlying networking topology, storage configuration and physical infrastructure.  From their perspective, the cloud is perfectly elastic, totally configurable and wonderfully consistent. Cloud Admins, on the other hand, need visibility and controls that expose the complexity while keeping it rational. These are profoundly different concerns.

Maintaining the illusion of clean and simple Cloud ops infrastructure is very valuable; however, it’s just an illusion.  The black metal box behind those APIs is complex, messy, unpredictable and dynamic.

1. Metal Ops has to deal with network topology and details like whether an operating system enumerates the NICs correctly, bonding the correct NIC pair and which 10G network to use for storage traffic (see the sketch after this list). In networking, the topology determines how much traffic you can subscribe to a link and how to provide resiliency. Networking does not exist in isolation: you must consider the boundary firewalls and routers that either block or allow traffic, because without connectivity the cloud is useless. Details like access and registration in DNS, NTP and DHCP provide the foundations of our stable operations. These details are (and should be) hidden from the cloud user.

2. Metal Ops has to deal with firmware issues at every level.  It matters to the server if it boots into BIOS or UEFI mode.  We have to manage the fact that RAID partitions need to be optimized based on the workload and type of drive.  We have to consider if there are specialized drivers and caches to manage and security features (like Intel TXT) to activate.  These details are (and should be) hidden from the cloud user.

3. Metal Ops have to consider the security of their infrastructure.  We have to manage where the admin control network crosses security domains.  It matters which layer 2 networks have access to which parts of the infrastructure.  Separation of responsibility for network vs. storage vs. compute is a reality that is not going away. These details are (and should be) hidden from the cloud user.

4. Metal Ops have to manage operating system compatibility.  I know personally that vendors test and certify their operating systems on an enormous matrix of silicon.  I have also learned that the matrix of possible combinations is far larger and fundamentally impossible to cover at the edges.  There’s a reason that operators seek homogeneous environments and LTS releases. These details are (and should be) hidden from the cloud user.

5. Metal Ops have to deal with hardware failures. By simple statistics, the larger the system the more things will break, and metal ops have to cope with this reality. We have to expose failure zones and boundaries to make intelligent responses (like moving data from a failed drive to a non-adjacent one) that require intimate knowledge of system topology that is intentionally hidden in cloud ops. Further, we have to have monitoring and management tooling that knows how to identify which NIC in a bond failed or flash the lights on the failed drive of an array. These details are (and should be) hidden from the cloud user.
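
To make the NIC-enumeration point (item 1) concrete, here is a minimal sketch that surfaces how a Linux host actually enumerated its interfaces so that automation can match them to the intended bond and switch layout by MAC address, instead of trusting that “eth0” means what the rack diagram says it means:

```python
# Sketch: report interface name -> MAC so provisioning can map NICs to the
# intended bond/switch topology instead of assuming a fixed enumeration order.
import os

def enumerate_nics():
    nics = {}
    for name in sorted(os.listdir("/sys/class/net")):
        if name == "lo":
            continue
        with open(f"/sys/class/net/{name}/address") as f:
            nics[name] = f.read().strip()
    return nics

if __name__ == "__main__":
    for name, mac in enumerate_nics().items():
        print(f"{name}\t{mac}")
```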

Cloud’s power is being able to abstract away this complexity.  Dealing with it gracefully behind the scenes requires transparency and details that make the Metal Ops job fundamentally different.

While both can be highly automated and pass my “Cloud is Infrastructure with an API” test, their objectives are different.

Crowbar lays it all out: RAID & BIOS configs officially open sourced

Today, Dell (my employer) announced a plethora of updates to our open source derived solutions (OpenStack and Hadoop). These solutions include the latest bits (Grizzly and Cloudera) for each project. And there’s another important notice for people tracking the Crowbar project: we’ve opened the remainder of its provisioning capability.

Yes, you can now build the open version of Crowbar and it has the code to configure a bare metal server.

Let me be very specific about this… my team at Dell tests Crowbar on a limited set of hardware configurations. Specifically, Dell server versions R720 + R720XD (using WSMAN and iDRAC) and C6220 + C8000 (using open tools). Even on those servers, we have a limited RAID and NIC matrix; consequently, we are not positioned to duplicate other field configurations in our lab. So, while we’re excited to work with the community, caveat emptor open source.

Another thing about RAID and BIOS is that it’s REALLY HARD to get right. I know this because our team spends a lot of time testing and tweaking these, now open, parts of Crowbar. I’ve learned that doing hard things creates value; however, it also means that contributors to these barclamps need to be prepared to get some silicon under their fingernails.

I’m proud that we’ve reached this critical milestone and I hope that it encourages you to play along.

PS: It’s worth noting that community activity on Crowbar has really increased. I’m excited to see all the excitement.

7 takeaways from DevOps Days Austin


I spent Tuesday and Wednesday at DevOpsDays Austin and continue to be impressed with the enthusiasm and collaborative nature of the DOD events.  We also managed to have a very robust and engaged Twitter backchannel thanks to an impressive pace set by Gene Kim!

I’ve still got a 5+ post backlog from the OpenStack summit, but wanted to do a quick post while it’s top of mind.

My takeaways from DevOpsDays Austin:

  1. DevOpsDays spends a lot of time talking about culture.  I’m a huge believer in the importance of culture as the foundation for the type of fundamental changes that we’re making in the IT industry; however, it’s also a sign that we’re still in the minority if we have to talk about culture evangelism.
  2. Process and DevOps are tightly coupled.  It’s very clear that Lean/Agile/Kanban are essential for DevOps success (nice job by Dominica DeGrandis).  No one even suggested DevOps+Waterfall as a joke (but Patrick Debois had a picture of a xeroxed butt in his preso which is pretty close).
  3. Still need more Dev people to show up!  My feeling is that we’ve got a lot of operators who are engaging with developers and fewer developers who are engaging with operators (the “opsdev” people).
  4. Chef Omnibus installer is very compelling.  This approach addresses issues with packaging that were created because we did not have configuration management.  Now that we have good tooling, we can separate the concerns between bits, configuration, services and dependencies.  This is one thing to watch and something I expect to see in Crowbar.
  5. The old mantra still holds: If something is hard, do it more often.
  6. Eli Goldratt’s The Goal is alive again thanks to Gene Kim’s smart new novel, The Phoenix Project, about DevOps and IT (I highly recommend both; start with Kim).
  7. Not DevOps, but 3D printing is awesome.  This is clearly a game changing technology; however, it takes some effort to get right.  Dell brought a Solidoodle 3D printer to the event to try and print OpenStack & Crowbar logos (watch for this in the future).

I’d be interested in hearing what other people found interesting!  Please comment here and let me know.

OpenStack steps toward Interoperability with Tempest, RAs & RefStack.org

I’m a cautious supporter of OpenStack leading with implementation (over API specification); however, it clearly has risks. OpenStack has the benefit of many live sites operating at significant scale. The short-term cost is that those sites were not fully interoperable (progress is being made!). Even if they were, we lack the means to validate that they are.

The interoperability challenge was a major theme of the Havana Summit in Portland last week (panel I moderated).  Solving it creates significant benefits for the OpenStack community and real financial opportunities for the OpenStack ecosystem.

This is a journey that we are on together – it’s not a deliverable from a single company or a release that we will complete and move on.

There were several themes that Monty and I presented during Heat for Reference Architectures (slides).  It’s pretty obvious that interop is valuable (I discuss why you should care in this earlier post) and running a cloud means dealing with hardware, software and ops in equal measures.  We also identified lots of important items like Open Operations, Upstreaming, Reference Architecture/Implementation and Testing.

During the session, I think we did a good job stating how we can use Heat for an RA to make incremental steps, and I had a session about upgrades (slides).

Even with all this progress, testing for interoperability was one of the largest gaps.

The challenge is not whether we should test, but how to create a set of tests that everyone will accept as adequate.  Approaching that goal with a standardization or specification objective is likely an impossible challenge.

Joshua McKenty & Monty Taylor found a starting point for interoperability FITS testing: “let’s use the Tempest tests we’ve got.”

We should question the assumption that faithful implementation test specifications (FITS) for interoperability are only useful with a matching specification and significant API coverage.  Any level of coverage provides useful information and, more importantly, visibility accelerates contributions to the test base.

I can speak from experience that this approach has merit.  The Crowbar team at Dell has been including OpenStack Tempest as part of our reference deployment since Essex, and it runs as part of our automated test infrastructure against every build.  This process does not catch every issue, but passing Tempest is a very good indication that you’ve got a workable OpenStack deployment.
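
For what it’s worth, the gate itself is conceptually simple: deploy the cloud, run Tempest, and fail the build on a non-zero exit.  The sketch below assumes the modern `tempest run --smoke` entry point (older releases used testr or nose), so treat the exact invocation as an assumption rather than our actual tooling:

```python
# Sketch: gate a CI build on Tempest results.  The "tempest run --smoke"
# invocation is an assumption; older Tempest releases used testr/nose.
import subprocess
import sys

def gate_on_tempest(workdir):
    result = subprocess.call(["tempest", "run", "--smoke"], cwd=workdir)
    if result != 0:
        print("Tempest smoke tests failed; rejecting build", file=sys.stderr)
    return result

if __name__ == "__main__":
    sys.exit(gate_on_tempest("/opt/tempest"))   # placeholder workspace path
```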