About Rob H

A Baltimore transplant to Austin, Rob thinks about ways of building scale infrastructure for the clouds using Agile processes. He sat on the OpenStack Foundation board for four years. He co-founded RackN to enable software that creates hyperscale converged infrastructure.

Networking in Cloud Environments, SDN, NFV, and why it matters [part 1 of 2]

Posted on May 1, 2014 by Rob H

Scott Jensen is an Engineering Director and colleague of mine from Dell with deep networking and operations experience. He has first-hand experience deploying OpenStack and Hadoop and has a critical role in defining Dell’s Reference Architectures in those areas. When I saw this writeup about cloud networking, I asked if it would be OK to share it with you.

Guest Post 1 of 2 by Scott Jensen:

Having a basis in enterprise data center networking and Cloud computing, I have many conversations with customers implementing a cloud infrastructure. The design of their networking infrastructure can and should be different from a classic network configuration, and many do not understand why, either due to a lack of knowledge in networking or due to a lack of understanding as to why cloud computing is different from virtualization. Once you have an understanding of both of these areas, you can begin to see why emerging technologies such as SDN (Software Defined Networking) and NFV (Network Function Virtualization) begin to address some of the issues that Cloud Computing can cause with your network.

Networking is all about traffic flows. In order to properly design your infrastructure you need to understand where traffic is originating, where it is going and how much traffic will be following a specific route and at what times.

There are many differences between Cloud Computing and virtualization. In many cases, people I talk to think of Cloud as virtualization in a different environment. Of course this will work just fine; however, it does not take advantage of the goodness that a Cloud infrastructure can bring. Some of the major differences between Virtualization and Cloud Computing have profound effects on how the network is utilized. This all has to do with the application. That is really what it is all about anyway. Rob Hirschfeld has a great post on the difference between Pets and Cattle which describes this well.

Pets and Cattle as a workload evolution

In typical virtualized infrastructures, the applications have a fairly common pattern. Many people describe these as Pets, and they are managed largely the same as a physical system. They have a name, they are one of a kind, they are cared for, and when they die it can be traumatic (I know, I have been there).

They run on large stateful VMs
They have a lifecycle which is typically very long such as years
The applications themselves are not designed to tolerate failures. Other technologies are brought in to ensure uptime.
The application is scaled up when demands increase. This is done by adding more memory or CPU to the VM.

Cloud applications are different. Some people describe them as cattle and they are treated like cattle in many ways. They do not necessarily have a name and if one dies it is sad but not a really big deal. We should probably figure out what killed it but life goes on.

They run on smaller stateless VMs
They have a lifecycle measured in hours or months. Sometimes even less than an hour.
The application is designed to expect failures
The application scales out by increasing the number of instances which are running when the demand increases.
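
As a rough illustration of the scale-out pattern in the list above, the sketch below computes how many identical instances a cattle-style application would run for a given load. The function name, the per-instance capacity and the minimum instance count are hypothetical values chosen only for illustration.

```python
import math


def instances_needed(requests_per_sec: float,
                     capacity_per_instance: float,
                     min_instances: int = 2) -> int:
    """Return how many identical stateless instances cover the load.

    Scale-out adds instances instead of growing a single VM. The
    min_instances floor (an assumed policy) keeps a spare instance
    around so that losing one is not a big deal.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    required = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, required)


if __name__ == "__main__":
    # Hypothetical loads, each instance assumed to handle 100 requests/sec.
    for load in (50, 450, 1200):
        print(load, instances_needed(load, capacity_per_instance=100))
```

A scale-up (Pets) application would instead keep the instance count at one and grow the memory or CPU of that single VM.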

In his follow-up post next week, Scott discusses how this impacts the network and how SDN and NFV promise to help.

OpenCrowbar.Anvil released – hammering out a gold standard in open bare metal provisioning

Posted on April 30, 2014 by Rob H

I’m excited to be announcing OpenCrowbar’s first release, Anvil, for the community. Looking back on our original design from June 2012, we’ve accomplished all of our original objectives and more.

Now that we’ve got the foundation ready, our next release (OpenCrowbar Broom) focuses on workload development on top of the stable Anvil base. This means that we’re ready to start working on OpenStack, Ceph and Hadoop. So far, we’ve limited engagement on workloads to ensure that those developers would not also be trying to keep up with core changes. We follow emergent design so I’m certain we’ll continue to evolve the core; however, we believe the Anvil release represents a solid foundation for workload development.

There is no more comprehensive open bare metal provisioning framework than OpenCrowbar. The project’s focus on a complete operations model that comprehends hardware and network configuration with just enough orchestration delivers on a system vision that sets it apart from any other tool. Yet, Crowbar also plays nicely with others by embracing, not replacing, DevOps tools like Chef and Puppet.

Now that the core is proven, we’re porting the Crowbar v1 RAID and BIOS configuration into OpenCrowbar. By design, we’ve kept hardware support separate from the core because we’ve learned that hardware generation cycles need to be independent from the operations control infrastructure. Decoupling them eliminates release disruptions that we experienced in Crowbar v1 and makes it much easier to incorporate hardware from a broad range of vendors.

Here are some key components of Anvil

UI, CLI and API stable and functional
Boot and discovery process working PLUS ability to handle pre-populating and configuration
Chef and Puppet capabilities including Berkshelf v3 support to pull in community upstream DevOps scripts
Docker, VMs and Physical Servers
Crowbar’s famous “late-bound” approach to configuration and, critically, networking setup
IPv6 native, Ruby 2, Rails 4, preliminary scale tuning
Remarkably flexible and transparent orchestration (the Annealer)
Multi-OS deployment capability: Ubuntu, CentOS, or different versions of the same OS

Getting the workloads ported is still a tremendous amount of work but the rewards are tremendous. With OpenCrowbar, the community has a new way to collaborate and integrate this work. It’s important to understand that while our goal is to start a quarterly release cycle for OpenCrowbar, the workload release cycles (including hardware) are NOT tied to OpenCrowbar. The workloads choose which OpenCrowbar release they target. From Crowbar v1, we’ve learned that Crowbar needed to be independent of the workload releases and so we want OpenCrowbar to focus on maintaining a strong ops platform.

This release marks four years of hard-earned Crowbar v1 deployment experience and two years of v2 design, redesign and implementation. I’ve talked with DevOps teams from all over the world and listened to their pains and needs. We have a long way to go before we’re deploying 1000 node OpenStack and Hadoop clusters, but OpenCrowbar Anvil significantly moves the needle in that direction.

Thanks to the Crowbar community (Dell and SUSE especially) for nurturing the project, and congratulations to the OpenCrowbar team for getting us to this amazing place.

DevOps for Non-Profits?! The Miracle Foundation does IRL Puppies v. Cattle

Posted on April 30, 2014 by Rob H

In what’s become an annual tradition, I’m taking a post to think about the intersection of Cloud and Non-profits using my better-half’s employer, The Miracle Foundation, as my inspiration (and to help support their Mothers’ Day campaign).

Their deceptively simple sounding mission is to nurture children – they’ve just added some minor wrinkles like the children are orphans, in economically challenged areas generally tucked away in remote areas of India half way around the world from their Austin HQ. That does nothing to dampen their tenacious drive to ensure that these children have the benefits of food, health care, housing, education and, most critically, nurturing caregivers.

How does that relate to the Puppies & Cattle analogy?

Like any scalable operation, they need to create highly repeatable processes to deliver their service. The Miracle Foundation service, environments where house mothers nurture children, is by its very nature a “puppy” since each child must be treated uniquely; however, everything leading up to the point of delivery must be “cattle-like” so they can scale the care they give. For example, a unique lesson plan is good while a unique chart of accounts is not.

Last year, I talked about how the Miracle Foundation was using quantitative measures to evaluate quality of care. They’ve used these metrics very effectively in their operations to identify places where they must standardize (like accounting practices, health care regimens and dietary requirements) and high touch places where they cannot (selecting and promoting homes out of incubation). Exactly like cloud deployments, success means finding places where variation creates complexity (cattle) and ones where it increases value (puppies).

I’ve been impressed to see how the Miracle Foundation identified the need for a standardized house-mother training curriculum as part of this analysis. Their years of experience across a breadth of orphanages has shown that giving clear guidance and setting standards for the people in direct contact with the children nets tremendous results; however, just making sure this training is delivered means building up a lot of other process and standardization.

If you think your job of building DevOps scripts and practice is hard then you need to step away from the keyboard for a while. This organization, and other non-profits like it, are taking on similar challenges with real people across distances that are more than just a few router hops from your desktop. I’m inspired by how they take on these challenges and fascinated at how much commonality there is between my work and theirs.

If you’re interested in their mission, please visit them for more details.

Reference Deployments are Critical [2/4 series on Operating Open Source Infrastructure]

Posted on April 29, 2014 by Rob H

This post is the second in a 4 part series about Success factors for Operating Open Source Infrastructure.

When we look at reference deployments, there are several things that make a good reference deployment, and one that is useful to the community.

First, a reference deployment needs to be specific and useful. It has to be identified as solving a specific problem using the software, and it has to have a specific configuration that can be described in a way that creates a workable scenario for that problem. There may be multiple useful reference implementations. In that case, each one needs to be identified by its expected behavior. For example, our deployments include a compute centric configuration that has hardware configurations and network configurations adapted to compute focused applications.

They also include storage focused configurations that are specifically targeted at enabling cheap and deep storage nodes for that type of situation. Both configurations are important and valid but they require different implementations, different details and different reference architectures. As long as it is clear that there are multiple patterns, the community is perfectly able to absorb and use these patterns.

Establishment of a widely adopted best practice is a central success criterion for any project.

Best practices ensure that deployers of the technology can not only purchase implementations that will be successful, but can also compare notes and work with their community. A significant adoption curve happens after the establishment of these best practices because at that point, the risk of purchase dramatically drops, and the ability to support radically increases. The next thing that’s important in the establishment of these technologies is that the reference implementation or the reference architecture has a way to be configured in a repeatable way.

Very often, this takes the form of deployment books or manuals. While useful in small deployments, in a hyperscale deployment the books really have diminishing value. This is because the level of human error – the chance of making a fundamental mistake during configuration – increases exponentially with the number of nodes, because each node is tightly interconnected with other nodes within the system.
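
To make the scaling argument concrete, here is a minimal sketch that assumes each node is configured by hand with a small, independent chance of a mistake. Both the independence assumption and the 1% figure are simplifications introduced for illustration, not numbers from a real deployment; since nodes are interconnected, real sites may fare worse.

```python
def chance_of_any_error(nodes: int, per_node_error: float) -> float:
    """Probability that at least one of `nodes` manual setups has a mistake.

    Assumes (as a simplification) that each node is misconfigured
    independently with probability `per_node_error`.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    if not 0.0 <= per_node_error <= 1.0:
        raise ValueError("per_node_error must be between 0 and 1")
    # The chance that every node is correct shrinks geometrically.
    return 1.0 - (1.0 - per_node_error) ** nodes


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(chance_of_any_error(n, 0.01), 3))
```

Under these assumptions, a 1% per-node error rate gives roughly a 63% chance of at least one mistake across 100 nodes, which is why a manual that works for a handful of servers stops being enough at scale.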

My team at Dell launched the Crowbar project as a way to reduce or mitigate this effort substantially. We recognized that the number one cause of delays and impacts in time to value in a hyperscale deployment is configuration and set-up. Any simple mistake made during configuration, even down to ordering of the gear, or physical defects within the infrastructure, will create dramatic delays in troubleshooting and diagnosing those issues. By automating the process, we have ensured that we can bootstrap the system quickly.

The goal of automated best practice is to bootstrap in a conforming and repeatable way. This enables the community to work together immediately towards return on investment, and greatly reduces the risk of problems caused by human error. For example, it’s typical within a site for us to find that network configurations do not match the specifications. In many cases, we find issues with the core networking infrastructure not matching the way it was originally designed. We also find failures on physical infrastructure, disk failures, system mismatches, and unanticipated configuration. With a human setup, any one of these problems might be missed or overlooked.

Validated reference architectures, while valuable, are no longer sufficient. Automated reference configurations have become the key to successfully delivered solutions.

Interested in more? Read part 3

Success Factors of Operating Open Source Infrastructure [Series Intro]

Posted on April 29, 2014 by Rob H

Building a best practices platform is essential to helping companies share operations knowledge. In the fast-moving world of open source software, sharing documentation about what to do is not sufficient. We must share the how to do it also because the operations process is tightly coupled to achieving ongoing success.

Further, since change is constant, we need to change our definition of “stability” to reflect a much more iterative and fluid environment.

Baseline testing is an essential part of this platform. It enables customers to ensure not only fast time to value, but also that the tests consistently conform with industry best practices, even as the system is upgraded and migrates towards a continuous deployment infrastructure.

The details are too long for a single post so I’m going to explore this as three distinct topics over the next two weeks.

Reference Deployments talks about needing an automated way to repeat configuration between sites.
Ops Validation using Development Tests talks about having a way to verify that everyone uses a common reference platform.
Shared Open Operations / DevOps (pending) talks about putting reference deployment and common validation together to create a true open operations practice.

OpenStack, Hadoop, Ceph, Docker and other open source projects are changing the landscape for information technology. Customers seeking to become successful with these evolving platforms must look beyond the software bits, and consider both the culture and operations. The culture is critical because interacting with the open source projects community (directly or through a proxy) can help ensure success using the software. Operations are critical because open source projects expect the community to help find and resolve issues. This results in more robust and capable products. Consequently, users of open source software must operate in a more fluid environment.

My team at Dell saw this need as we navigated the early days of OpenStack. The Crowbar project started because we saw that the community needed a platform that could adapt and evolve with the open source projects that our advanced customers were implementing. Our ability to deliver an open operations platform enables the community to collaborate, and to skip over routine details to refocus on shared best practices.

My recent focus on the OpenStack DefCore work reinforces these original goals. Using tests to help provide a common baseline is a concrete, open and referenceable way to promote interoperability. I hope that this in turn drives a dialog around best practices and shared operations because those help mature the community.

Why I’m learning open source best practice from Middle School Students

Posted on April 28, 2014 by Rob H

Engineering in open source projects is a different skillset than most of us have ever been trained for; happily, there is a rising cohort of engineers and scientists who have been learning to work in exactly the ways that industry is now demanding. Here’s the background…

I’ve been helping mentor two FIRST Robotics teams (FLL & FTC) this season and had the privilege to accompany the FLL team (which includes my daughter) to the FIRST World Festival where a global mix of students from 6 to 18 competed, collaborated and celebrated for a wide range of awards and recognition. The experience is humbling – these students are taking on challenges (for fun) that would scare off most adults.

While I could go on and on about my experience and the FIRST mission, I’d rather share some of what my 12 year old daughter wrote to her coach after the competition:

Thank you Coach for all of the lessons and advice you have shared with me this season. I really appreciate all of the time and effort you have put into making this team the best we could be. You have taught us so much and we will definitely walk away from this season with the new skills and experiences. You were an amazing coach and not only did you help and support us, you also gave us the freedom to be independent so we can learn skills like leadership, time management and meeting with busy schedules. I loved being on this team and I hope this will not be the last of the Hedgehogs.

FIRST designs the program so that these experiences are the norm, not the exception.

Here are some of the critical open source engineering skills I observed first hand at all levels of the competition.

Collaboration: at all levels, participants are strongly rewarded for collaborating, mentoring and working together. Teams simply cannot advance without mastering this skill.
Consensus: judges actively test and watch for consensus behavior. This is actively coached and encouraged because the teams quickly learn to appreciate a diversity of strengths.
Risk Taking with Delivery: the very nature of competition encourages teams to think big and balance risk with delivery.
Celebration: this has to be experienced but the competitions are often compared to rock concerts. Everyone is involved and every aspect is celebrated. FIRST is a culture.
Situational Judgment: this competition is fast and intense so participants learn to think on their feet. This type of experience is amazingly valuable and hard to get in classroom settings.

In my experience, everyone in open source needs more practice and experience DOING open source work. I suggest getting involved in these programs as a mentor, judge or volunteer because it’s the most effective hands-on open source training I can imagine.

DevOps Concept: “Ready State” Infrastructure as hand-off milestone

Posted on April 25, 2014 by Rob H

Working for Dell, it’s no surprise that I have a lot of discussions around building up and maintaining the physical infrastructure to run data centers at scale. Generally the context is around OpenCrowbar, Hadoop or OpenStack Ironic/TripleO/Heat but the concerns are really universal in my cloud operations experience.

Typically, deployments have three distinct phases: 1) mechanically plug together the systems, 2) get the systems ready to the OS and network level and then 3) install the application. Often these phases are so distinct that they are handled by completely different teams!

That’s a problem because errors or unexpected changes from one phase are very expensive to address once you change teams. The solution has been to become more and more prescriptive about what the system looks like between the second (“ready”) and third (“installed”) phase. I’ve taken to calling this hand-off achieving a ready state infrastructure.

I define a “ready state” infrastructure as having been configured so that the application lay down steps are simple and predictable.

In my experience, most application deployment guides start with a ready state assumption. They read like “Step 0: rack, configure, provision and tweak the nodes and network to have this specific starting configuration.” If you are really lucky then “specific configuration” is actually a documented and validated reference architecture.

The magic of cloud IaaS is that it always creates ready state infrastructure. If I request 10 servers with 2 NICs running Ubuntu 14.04 then that’s exactly what I get. The fact that cloud always provisions a ready state infrastructure has become an essential operating assumption for cloud orchestration and configuration management.

Unfortunately, hardware provisioning is messy. It takes significant effort to configure a physical system into a ready state. This is caused by a number of factors:

You can’t alter physical infrastructure with programming (an API) – for example, if the server enumerates the NICs differently than you expected, you have to adapt to that.
You have to respect the physical topology of the system – for example, production deployments use teamed NICs that have to use different switches for redundancy. You can’t make assumptions, you have to set up the team based on the specific configuration.
You have to build up the configuration in sequence – for example, you can’t setup the RAID configuration after the operating system is installed. If you made a bad choice then you’ll likely have to repeat the whole sequence of the deployment and some bad choices (like using the wrong subnets) result in a total system rebuild.
Hardware fails and is non-uniform – for example, in any order of sufficient size you will have NIC failures due to everything from simple mechanical card seating issues to BIOS interface mismatches. Troubleshooting these issues can occupy significant time.
Component configurations are interlocked – for example, a change to the switch settings could result in DHCP failures when systems are rebooted (real experience). You cannot always work node-to-node, you must deal with the infrastructure as an integrated system.
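
One way to picture the hand-off is as a comparison between what discovery found on a node and what the application install expects. The sketch below is hypothetical: the `ReadyStateSpec` and `DiscoveredNode` types and their fields are illustrative assumptions and are not taken from OpenCrowbar or any other tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReadyStateSpec:
    """What the application install expects (hypothetical fields)."""
    os_release: str
    nic_count: int
    raid_level: str


@dataclass(frozen=True)
class DiscoveredNode:
    """What discovery actually found on a physical node (hypothetical)."""
    name: str
    os_release: str
    nic_count: int
    raid_level: str


def ready_state_gaps(node: DiscoveredNode, spec: ReadyStateSpec) -> List[str]:
    """List the ways a node differs from the spec; empty means ready."""
    gaps = []
    if node.os_release != spec.os_release:
        gaps.append(f"os_release: want {spec.os_release}, found {node.os_release}")
    if node.nic_count < spec.nic_count:
        gaps.append(f"nic_count: want {spec.nic_count}, found {node.nic_count}")
    if node.raid_level != spec.raid_level:
        gaps.append(f"raid_level: want {spec.raid_level}, found {node.raid_level}")
    return gaps


if __name__ == "__main__":
    spec = ReadyStateSpec(os_release="Ubuntu 14.04", nic_count=2, raid_level="RAID10")
    node = DiscoveredNode("node-01", "Ubuntu 14.04", 1, "RAID10")
    for gap in ready_state_gaps(node, spec):
        print(node.name, gap)
```

A per-node check like this only covers part of the problem. As the list above notes, switch settings and NIC teaming are interlocked across nodes, so a real system would also need to validate the infrastructure as a whole.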

Being consistent at turning discovered state into ready state is a complex and unique problem space. As I explore this bare metal provisioning space in the community, I am more and more convinced that it has a distinct architecture from applications built for ready state operations.

My hope in this post is to test whether the concept of “ready state” infrastructure is helpful in describing the transition point between provisioning and installation. Please let me know what you think!

OpenStack automated high-availability deploy reality, SUSE shows off chops with Crowbar

Posted on April 23, 2014 by Rob H

While I’ve been focused on delivering next-generation kick-aaS-i-ness with Crowbar v2 (now called OpenCrowbar) and helping Dell and Red Hat co-engineer an OpenStack Powered Cloud, SUSE has been continuing to expand and polish the OpenStack deployment on Crowbar v1. I’m always impressed by commit activity (SUSE is the top committer in the Crowbar project) and was excited to see their Havana launch announcement.

Using Crowbar v1, SUSE is delivering a seriously robust automated OpenStack Havana implementation. They have taken the time to build high availability (HA) across the framework including for Neutron, Heat and Ceilometer.

As an OpenStack Foundation board member, I hear a lot of hand-wringing in the community about ops practices and asking “is OpenStack ready for the enterprise?” While I’m not sure how to really define “enterprise,” I do know that the SUSE Cloud Havana release shows that it’s possible to deliver a repeatable and robust OpenStack deployment.

This effort shows some serious DevOps automation chops and, since Crowbar is open, everyone in the community can benefit from their tuning. Of course, I’d love to see these great capabilities migrate into the very active StackForge Chef OpenStack cookbooks that OpenCrowbar is designed to leverage.

Creating HA automation is a great achievement and an important milestone in capturing the true golden fleece – automated release-to-release upgrades. We built the OpenCrowbar annealer with this objective in mind and I feel like it’s within reach.

Can’t Contain(erize) the Hype – is Docker real or a bubble?

Posted on April 18, 2014 by Rob H

Editorial Note: This was written in April 2014. Check out how we are using Docker in our latest architectures.

The new application portability darling, Docker, was so popular at this week’s Red Hat Summit that I was expecting Miley Cyrus’ flock of paparazzi to abandon her in favor of Ben Golub.

Personally, I find Docker to be a useful tool and we’ve been embedding it into our dev and test processes in useful ways for DefCore TCUP (at Conference), OpenCrowbar Admin and Dev Nodes. To me, these are concrete and clear use cases.

There are clearly a lot more great use-cases for Docker, but I can’t help but feel like it’s being thrown into architectural layer cakes and markitectures as a substitute for the non-words “cloud”, “amazing” and “revolutionary.”

How do I distinguish hot from hype? I look for places where Docker is solving just one problem set instead of being a magic wand solution to a raft of systemic issues.

Places where I think Docker is potent and disruptive

Creating a portable and consistent environment for dev, test and delivery
Helping Linux distros keep updating the kernel without breaking user space (RHEL 7 anyone?)
Reducing the virtualization overhead of tenant isolation (containers are lighter)
Reducing the virtualization overhead for DevOps developers testing multi-node deployments

But I’m concerned that we’re expecting too many silver bullets

Packaging is still tricky: Creating a locked box helps solve part of the downstream problem (you know what you have) but not the upstream problem (you don’t know what you depend on).
Container sprawl: Breaking deployments into more functional discrete parts is smart, but that means we have MORE PARTS to manage. There’s an inflection point between separation of concerns and sprawl.
PaaS Adoption: Docker helps with PaaS but it solves neither the “you have to model your apps for a PaaS” nor the “PaaS needs scalable data services” problems.

Speaking of Miley Cyrus, it’s not the container that matters, but what’s on the inside. Docker can take a lesson from Miley: attention is great but you’ve still got to be able to sing. I’m not sure about Miley, but I am digging the tracks that Docker is laying down. Docker is worth putting on your play list.

Rocking Docker – OpenCrowbar builds solid foundation & life-cycle [VIDEOS]

Posted on April 14, 2014 by Rob H

Docker has been gathering a substantial amount of interest as an additional way to solve application portability and dependency hell. We’ve been enthusiastic participants in this fledgling community (Docker in OpenStack), including my work in DefCore’s Tempest in a Container (TCUP).

In OpenCrowbar, we’ve embedded Docker much deeper to solve a few difficult & critical problems: speeding up developing multi-node deployments and building the environment for the containers. Check out my OpenCrowbar does Docker video or the community demo!

Bootstrapping Docker into a DevOps management framework turns out to be non-trivial because integrating new nodes into a functioning operating environment is very different on Docker than using physical servers or VMs. Containers don’t PXE boot and have more limited configuration options.

How did we do this? Unlike other bare metal provisioning frameworks, we made sure that Crowbar did not require DHCP+PXE as the only node discovery process. While we default to and fully support PXE with our sledgehammer discovery image, we also allow operators to pre-populate the Crowbar database using our API and make configuration adjustments before the node is discovered/created.

We even went a step farther and enabled the Crowbar dependency graph to take alternate routes (we call it the “provides” role). This enhancement is essential for dealing with “alike but different” infrastructure like Docker.
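
The idea of a dependency graph that can take alternate routes can be sketched as follows. This is only an illustration of the general technique: the role names, the `provides` mapping and the `satisfied` helper are invented for this example and do not reflect Crowbar’s actual implementation or API.

```python
from typing import Dict, List, Set


def satisfied(requirement: str,
              active_roles: Set[str],
              provides: Dict[str, List[str]]) -> bool:
    """Return True if an active role is the requirement or provides it.

    `provides` maps a role name to the requirements it can stand in for,
    which lets different kinds of node reach the same point in the graph
    by different routes.
    """
    if requirement in active_roles:
        return True
    return any(requirement in provides.get(role, []) for role in active_roles)


if __name__ == "__main__":
    # Hypothetical roles: a container never PXE boots, yet it can still
    # satisfy a downstream role that needs a managed node.
    provides = {
        "pxe-discovered-node": ["managed-node"],
        "docker-container-node": ["managed-node"],
    }
    print(satisfied("managed-node", {"docker-container-node"}, provides))
    print(satisfied("managed-node", {"unrelated-role"}, provides))
```

The design choice here is that downstream roles depend on a capability rather than on one specific upstream role, so “alike but different” infrastructure can plug in without changing the roles that follow.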

The result is that you can request Docker nodes in OpenCrowbar (using the API only for now) and it will automatically create the containers and attach them into Crowbar management. It’s important to stress that we are not adding existing containers to Crowbar by adding an agent; instead, Crowbar manages the container’s life-cycle and then the work inside the container.

Getting around the PXE cycle using containers as part of Crowbar substantially improves Ops development cycle time because we don’t have to wait for boot > discovery > reboot > install to create a clean environment. Bringing fresh Docker containers into a dev system takes seconds instead.

The next step is equally powerful: Crowbar should be able to configure the Docker host environment on host nodes (not just the Admin node as we are now demonstrating). Setting up the host can be very complex: you need to have the correct RAID, BIOS, Operating System and multi-NIC networking configuration. All of these factors must be handled with a system perspective that matches your Ops environment. Luckily, this is exactly Crowbar’s sweet spot!

Until we’ve got that pulled together, OpenCrowbar’s ability to use upstream cookbooks and this latest Dev/Test focused step provide remarkable out of the gate advantages for everyone building multi-node DevOps tools.

Enjoy!

PS: It’s worth noting that we’ve already been using Docker to run & develop the Crowbar Admin server. This extra step makes Crowbar even more Dockeriffic.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Author Archives: Rob H