Physical Ops = Plumbers of the Internet. Celebrating dirty IT jobs 8 bit style

I must be crazy because I like to make products that take on the hard and thankless jobs in IT.  It’s not glamorous, but someone needs to do them.

Analogies are required when explaining what I do to most people.  For them, I’m not a specialist in physical data center operations, I’m an Internet plumber who is part of the team you call when your virtual toilet backs up.  I’m good with that: it’s work that’s useful, messy and humble.

Plumbing, like the physical Internet, disappears from most people’s consciousness once it’s out of sight under the floor, cabinet or modem closet.  And like plumbers, we can’t do physical ops without getting dirty.  Unlike cloud-based ops with clean APIs and virtual services, you can’t do physical ops without touching something physical.  Even if you’ve got great telepresence, you cannot get away from physical realities like NIC and SATA enumeration, BIOS management and network topology.  I’m delighted that cloud has abstracted away that layer for most people but that does not mean we can ignore it.

Physical ops lacks the standardization of plumbing.  There are many cross-vendor standards but innovation and vendor variation make consistency as unlikely as a unicorn winning the Rainbow Triple Crown.

For physical ops, it feels like we’re the internet’s most famous plumber, Mario, facing Donkey Kong.  We’ve got to scale ladders, jump fireballs and swing between chains.  The job is made harder because there are no half measures.  Sometimes you can find the massive hammer and blast your way through but that’s just a short-term thing.

Unfortunately, there’s a real enemy here: complexity.

Just like Donkey Kong keeps dashing off with the princess, operations continue to get more and more complex.  Like with Mario, the solution is not to bypass the complexity; it’s to get better and faster at navigating the obstacles that get thrown at you.  Physical ops is about self-reliance and adaptability.  In that case, there are a lot of lessons to be learned from Mario.

If I’m an internet plumber then I’m happy to embrace Mario as my mascot.  Plumbers of the internet unite!

Ops is Ops, except when it ain’t. Breaking down the impedance mismatches between physical and cloud ops.

We’ve made great strides in ops automation, but there’s no one-size-fits-all approach to ops because abstractions have limitations.

Perhaps it’s my Industrial Engineering background, but I’m a huge fan of operational automation and tooling. I can remember my first experience with VMware ESX and thinking that it needed tooling automation.  Since then, I’ve watched as cloud ops has revolutionized application development and deployment.  We are just at the beginning of the cloud automation curve and our continuous deployment tooling and platform services deliver exponential increases in value.

These cloud breakthroughs are fundamental to Ops and have uncovered real best practices for operators.  Unfortunately, many of the cloud-specific scripts and tools do not translate well to physical ops.  How can we correct that?

Now that I focus on physical ops, I’m in awe of the capabilities being unleashed by cloud ops. Looking at Netflix’s Chaos Monkey pattern alone, we’ve reached a point where it’s practical to inject artificial failures to improve application robustness.  The idea of breaking things on purpose as an optimization is both terrifying and exhilarating.

In the last few years, I’ve watched (and led) an application of these cloud tool chains down to physical infrastructure.  Fundamentally, there’s a great fit between DevOps configuration management (Chef, Puppet, Salt, Ansible) tooling and physical ops.  Most of the configuration and installation work (post-ready state) is fundamentally the same regardless of whether the services are physical, virtual or containerized.  Installing PostgreSQL is pretty much the same for any platform.

But pretty much the same is not exactly the same.  The differences between platforms often prevent us from translating useful work between frames.  In physics, we’d call that an impedance mismatch: where similar devices cannot work together due to minor variations.

An example of this Ops impedance mismatch is networking.  Virtual systems present interfaces and networks that are specific to the desired workload while physical systems present all the available physical interfaces plus additional system interfaces like VLANs, bridges and teams.  On a typical server, there are at least 10 available interfaces and you don’t get to choose which ones are connected – you have to discover the topology.  To complicate matters, the interface list will vary depending on both the server model and the site requirements.

By comparison, it’s trivial in virtual: you get only the NICs you need and they are ordered consistently based on your network requests.  While the basic script is the same, it’s essential that it identify the correct interface.  That’s simple in cloud scripting and highly variable for physical!
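To make the contrast concrete, here is a minimal Python sketch of the "identify the correct interface" step. Everything in it is an assumption for illustration: the `Nic` record, the `pick_interface` helper, the interface names and the switch-port labels are hypothetical, and the inventory is hard-coded rather than discovered from real hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    name: str         # OS-assigned name, e.g. "eth0" or "em2" (hypothetical)
    kind: str         # "physical", "vlan", "bridge" or "team"
    link_up: bool     # True when the port is actually cabled and live
    switch_port: str  # neighbor label from a discovery step; "" if unknown


def pick_interface(nics, wanted_switch_prefix):
    """Return the connected physical NIC whose discovered switch port
    starts with the wanted prefix, or None when nothing matches."""
    candidates = [
        n for n in nics
        if n.kind == "physical"
        and n.link_up
        and n.switch_port.startswith(wanted_switch_prefix)
    ]
    # Sort by name so repeated runs make the same choice.
    candidates.sort(key=lambda n: n.name)
    return candidates[0] if candidates else None


# Virtual case: exactly the NIC that was requested.
virtual = [Nic("eth0", "physical", True, "vswitch-admin/1")]

# Physical case: many interfaces, only some of them cabled.
physical = [
    Nic("em1", "physical", False, ""),
    Nic("em2", "physical", True, "tor-admin/12"),
    Nic("p1p1", "physical", True, "tor-storage/3"),
    Nic("br0", "bridge", True, ""),
    Nic("em2.100", "vlan", True, ""),
]

print(pick_interface(virtual, "vswitch-admin").name)  # eth0
print(pick_interface(physical, "tor-admin").name)     # em2
```

The selection logic is identical in both cases; what changes is how much filtering the physical inventory needs before the same script can safely proceed.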

Another example is drive configuration.  Hardware presents seemingly limitless combinations of RAID, JBOD, SSD and HDD.  These differences have dramatic performance and density implications that are, by design, completely obfuscated in cloud resources.

The solution is to create functional abstractions between the application configuration and the networking configuration.  The abstraction isolates configuration differences between the scripts.  So the application setup can be reused even if the networking is radically different.
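One way to picture that separation is the rough Python sketch below. It is not OpenCrowbar code: the class names, the `interface_for` method, the "database" role and the `render_app_config` function are all hypothetical, chosen only to show an application layer that asks for a logical role instead of a hardware-specific interface name.

```python
class VirtualNetwork:
    """Cloud case: the platform hands over exactly the NIC requested."""

    def interface_for(self, role):
        return "eth0"


class PhysicalNetwork:
    """Physical case: a site-specific role map built by an earlier layer."""

    def __init__(self, role_map):
        self._role_map = dict(role_map)

    def interface_for(self, role):
        # Raises KeyError if the networking layer never prepared this role.
        return self._role_map[role]


def render_app_config(network):
    """Application layer: knows which role it needs, not which hardware."""
    return {
        "bind_interface": network.interface_for("database"),
        "port": 5432,  # PostgreSQL's default port
    }


cloud = render_app_config(VirtualNetwork())
metal = render_app_config(PhysicalNetwork({"database": "bond0"}))
print(cloud["bind_interface"], metal["bind_interface"])  # eth0 bond0
```

`render_app_config` never changes between platforms; only the object passed into it does, which is the reuse the abstraction is meant to buy.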

With some of our latest OpenCrowbar work, we’re finally able to create practical abstractions for physical ops that are repeatable site to site.  For example, we have patterns that allow us to functionally separate the network from the application layer.  Using that separation, we can build network interfaces in one layer and allow the next to assume the networking is correct as if it was a virtual machine.  That’s a very important advance because it allows us to finally share and reuse operational scripts.

We’ll never fully eliminate the physical vs cloud impedance issue, but I think we can make the gaps increasingly small if we continue to 1) isolate automation layers with clear APIs and 2) tune operational abstractions so they can be reused.

Mayflies and Dinosaurs (extending Puppies and Cattle)

Josh McKenty and I were discussing the common misconception of the “Puppies and Cattle” analogy. His position is not anti-puppy! He believes puppies are sometimes unavoidable and should be isolated into portable containers (VMs) so they can be shuffled around seamlessly. His more provocative point is that we want our underlying infrastructure to be cattle so it remains highly elastic and flexible. More cattle means a more resilient system. To me, this is a fundamental CloudOps design objective.

We realized that the perfect cloud infrastructure would structurally discourage the creation of puppies.

Imagine a cloud in which servers were automatically decommissioned after a week of use. In a sort of anti-SLA, any VM running for more than 168 hours would be (gracefully) terminated. This would force a constant churn of resources within the infrastructure that enables true cattle-like management. This cloud would be able to very gracefully rebalance load and handle disruptive management operations because the workloads are designed for the churn.
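As a rough sketch of that anti-SLA, the Python below picks out the VMs that have outlived the 168-hour limit. The `expired` helper, the VM names and the timestamps are made up for illustration, and a real reaper would drain each workload gracefully before terminating it.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=168)  # one week, per the anti-SLA described above


def expired(vms, now):
    """Return the names of VMs running for more than MAX_AGE, oldest first.

    `vms` maps a VM name to its (timezone-aware) start time.
    """
    too_old = [
        (started, name)
        for name, started in vms.items()
        if now - started > MAX_AGE
    ]
    return [name for started, name in sorted(too_old)]


now = datetime(2014, 1, 15, 12, 0, tzinfo=timezone.utc)
vms = {
    "web-1": now - timedelta(hours=200),
    "web-2": now - timedelta(hours=24),
    "db-1": now - timedelta(hours=169),
}
print(expired(vms, now))  # ['web-1', 'db-1']
```

Running a check like this on a schedule is what produces the constant churn: nothing survives long enough to become a puppy.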

We called these servers mayflies due to their limited life span.

While this approach requires a high degree of automation, the most successful cloud operators I have met are effectively building workloads with this requirement. If we require application workloads to be elastic and fault-resilient then we have a much higher degree of flexibility with the underlying infrastructure. I’ve seen this in practice with several OpenStack clouds: operators who helped applications deploy using automation were able to decommission “old” clouds much more gracefully. They effectively turned their entire cloud into a cow. Sadly, the ones without that investment puppified™ the ops infrastructure and created a much more brittle environment.

The opposite of a mayfly is the dinosaur: a server that is so brittle and locked that the slightest disturbance wipes out everything it touches.

Dinosaurs are puppies grown into a T-Rex with rows of massive razor sharp teeth and tiny manicured hands. These are systems that are so unique and historical that there’s no way to recreate them if there’s a failure. The original maintainer’s exit happy hour was celebrated by people who were laid off two CEOs ago. The impact of dinosaurs goes beyond their operational risk; they are typically impossible to extend or maintain and, consequently, ossify other servers around them. This type of server drains elasticity from your ops team.

Puppies do not grow up to become dogs, they become dinosaurs.

It’s a classic lean adage to do hard things more frequently. Perhaps it’s time to start creating mayflies in your ops infrastructure.

Do Be Dense! Dell C8000 unit merges best of bladed and rackable servers

“Double wide” is not a term I’ve commonly applied to servers, but that’s one of the cool things about this new class of servers that Dell, my employer, started shipping today.

My team has been itching for the chance to start cloud and big data reference architectures using this super dense and flexible chassis. You’ll see it included in our next Apache Hadoop release and we’ve already got customers who are making it the foundation of their deployments (Texas Adv Computing Center case study).

If you’re tracking the latest big data & cloud hardware then the Dell PowerEdge C8000 is worth some investigation.

Basically, the Dell C8000 is a chassis that holds a flexible configuration of compute or storage sleds. It’s not a blade frame because the sleds minimize shared infrastructure. In our experience, cloud customers like the dedicated i/o and independence of sleds (as per the Bootstrapping clouds white paper). Those attributes are especially well suited for Hadoop and OpenStack because they support a “flat edges” and scale out design. While i/o independence is valued, we also want shared power infrastructure and density for efficiency reasons. Using a chassis design seems to capture the best of both worlds.

The novelty for the Dell PowerEdge C8000 is that the chassis are scary flexible. You are not locked into a pre-loaded server mix.

There is a plethora of sled choices so that you can mix choices for power, compute density and spindle counts. That includes double-wide sleds positively brimming with drives and expanded GPU processors. Drive density is important for big data configurations that are disk i/o hungry; however, our experience is that customer deployments are highly varied based on the planned workload. There are also significant big data trends towards compute, network, and balanced hardware configurations. Using the C8000 as a foundation is powerful because it can cater to all of these use-case mixes.

That reminds me! Mike Pittaro (our team’s Hadoop lead architect) did an excellent Deploy Hadoop using Crowbar video.

Interested in more opinions about the C8000? Check out Barton George & David Meyer.

The real workloads begin: Crowbar’s Sophomore Year

Given Crowbar’s frenetic Freshman year, it’s impossible to predict everything that Crowbar could become. I certainly aspire to see the project gain a stronger developer community and the seeds of this transformation are sprouting. I also see that community driven work is positioning Crowbar to break beyond being a platform for the OpenStack and Apache Hadoop solutions that pay the bills for my team at Dell to invest in Crowbar development.

I don’t have to look beyond the summer to see important development for Crowbar because of the substantial goals of the Crowbar 2.0 refactor.

Crowbar 2.0 is really just around the corner so I’d like to set some longer range goals for our next year.

  • Growing acceptance of Crowbar as an in data center extension for DevOps tools (what I call CloudOps)
  • Deeper integration into more operating environments beyond the core Linux flavors (like virtualization hosts, closed and special purpose operating systems).
  • Improvements in dynamic networking configuration
  • Enabling more online network connected operating modes
  • Taking on production ops challenges of scale, high availability and migration
  • Formalization of our community engagement with summits, user groups, and broader developer contributions.

For example, Crowbar 2.0 will be able to handle downloading packages and applications from the internet. Online content is not a major benefit without being able to stage and control how those new packages are deployed; consequently, our goals remain tightly focused on improvements in orchestration.

These changes create a foundation that enables a more dynamic operating environment. Ultimately, I see Crowbar driving towards a vision of fully integrated continuous operations; however, Greg & Rob’s Crowbar vision is the topic for tomorrow’s post.

Seven Cloud Success Criteria to consider before you pick a platform

From my desk at Dell, I have a unique perspective.   In addition to a constant stream of deep customer interactions about our many cloud solutions (even going back pre-OpenStack to Joyent & Eucalyptus), I have been an active advocate for OpenStack, involved in many discussions with and about CloudStack and regularly talk shop with Dell’s VIS Creator (our enterprise focused virtualization products) teams.  And, if you go back ten years to 2002, I patented the concept of hybrid clouds with Dave McCrory.

Rather than offering opinions in the Cloud v. Cloud fray, I’m suggesting that cloud success means taking a system view.

Platform choice is only part of the decision: operational readiness, application types and organization culture are critical foundations before platform.

Over the last two years at Dell, I’ve found that seven points outweigh customers’ choice of platform.

  1. Running clouds requires building operational expertise both at the application and infrastructure layers.  CloudOps is real.
  2. Application architectures matter for cloud deployment because they can redefine the SLA requirements and API expectations
  3. Development community and collaboration are a significant value because sharing around open operations offers significant returns.
  4. We need to build an accelerating pace of innovation into our core operating principles
  5. There are still significant technology gaps to fill (networking & storage) and we will discover new gaps as we go
  6. We can no longer discuss public and private clouds as distinct concepts.   True hybrid clouds are not here yet, but everyone can already see their massive shadow.
  7. There is always more than one right technological answer.  Avoid analysis paralysis by making incrementally correct decisions (committing, moving forward, learning and then re-evaluating).

Dell Crowbar Project: Open Source Cloud Deployer expands into the Community

Note: Cross posted on Dell Tech Center Blogs.

Background: Crowbar is an open source cloud deployment framework originally developed by Dell to support our OpenStack and Hadoop powered solutions.  Recently, its scope has increased to include a DevOps operations model and other deployments for additional cloud applications.

It’s only been a matter of months since we open sourced the Dell Crowbar Project at OSCON in June 2011; however, the progress and response to the project have been overwhelming.  Crowbar is transforming into a community tool that is hardware, operating system, and application agnostic.  With that in mind, it’s time for me to provide a recap of Crowbar for those just learning about the project.

Crowbar started out simply as an installer for the “Dell OpenStack™-Powered Cloud Solution” with the objective of deploying a cloud from unboxed servers to a completely functioning system in under four hours.  That meant doing all the BIOS, RAID, Operations services (DNS, NTP, DHCP, etc.), networking, O/S installs and system configuration required to create a complete cloud infrastructure.  It was a big job, but one that we’d been piecing together on earlier cloud installation projects.  A key part of the project involved collaborating with Opscode Chef Server on the many system configuration tasks.  Ultimately, we met and exceeded the target with a complete OpenStack install in less than two hours.

In the process of delivering Crowbar as an installer, we realized that Chef, and tools like it, were part of a larger cloud movement known as DevOps.

The DevOps approach to deployment builds up systems in a layered model rather than using packaged images.  This layered model means that parts of the system are relatively independent and highly flexible.  Users can choose which components of the system they want to deploy and where to place those components.  For example, Crowbar deploys Nagios by default, but users can disable that component in favor of their own monitoring system.  It also allows for new components to identify that Nagios is available and automatically register themselves as clients and set up application-specific profiles.  In this way, Crowbar’s use of a DevOps layered deployment model provides flexibility for BOTH modularized and integrated cloud deployments.
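The Nagios example can be sketched in a few lines of Python. This is only an illustration of the layered idea, not Crowbar’s actual mechanism: the `Deployment` class, its `deploy` method and the component names are hypothetical.

```python
class Deployment:
    """Toy model of a layered deployment with an optional monitoring layer."""

    def __init__(self):
        self.layers = []     # deployed component names, in deployment order
        self.monitored = []  # components registered as monitoring clients

    def deploy(self, name):
        # A new component checks whether the monitoring layer is present
        # and, if so, registers itself as a client.
        if name != "nagios" and "nagios" in self.layers:
            self.monitored.append(name)
        self.layers.append(name)


# Integrated deployment: Nagios is present, so later layers register.
integrated = Deployment()
for component in ["nagios", "openstack", "hadoop"]:
    integrated.deploy(component)
print(integrated.monitored)  # ['openstack', 'hadoop']

# Modular deployment: Nagios disabled in favor of an external monitor.
modular = Deployment()
for component in ["openstack", "hadoop"]:
    modular.deploy(component)
print(modular.monitored)  # []
```

The same component list works either way; each layer adapts to what is already deployed rather than assuming a fixed image.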

We believe that operations that embrace layered deployments are essential for success because they allow our customers to respond to the accelerating pace of change.  We call this model for cloud data centers “CloudOps.”

Based on the flexibility of Crowbar, our team decided to use it as the deployment model for our Apache™ Hadoop™ project (“Dell | Apache Hadoop Solution”).  While a good fit, adding Hadoop required expanding Crowbar in several critical ways.

  1. We had to make major changes in our installation and build processes to accommodate multi-operating system support (RHEL 5.6 and Ubuntu 10.10 as of Oct 2011).
  2. We introduced a modularization concept that we call “barclamps” that package individual layers of the deployment infrastructure.  These barclamps reach from the lowest system levels (IPMI, BIOS, and RAID) to the highest (OpenStack and Hadoop).

Barclamps are a very significant architecture pattern for Crowbar:

  1. They allow other applications to plug into the framework and leverage other barclamps in the solution.  For example, VMware created a Cloud Foundry barclamp and DreamHost has created a Ceph barclamp.  Both barclamps are examples of applications that can leverage Crowbar for a repeatable and predictable cloud deployment.
  2. They are independent modules with their own life cycle.  Each one has its own code repository and can be imported into a live system after initial deployment.  This allows customers to expand and manage their system after initial deployment.
  3. They have many components such as Chef Cookbooks, custom UI for configuration, dependency graphs, and even localization support.
  4. They offer services that other barclamps can consume.  The Network barclamp delivers many essential services for bootstrapping clouds including IP allocation, NIC teaming, and node VLAN configuration.
  5. They can provide extensible logic to evaluate a system and make deployment recommendations.  So far, no barclamps have implemented more than the most basic proposals; however, they have the potential for much richer analysis.
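The dependency graphs mentioned in the list above suggest how a deployment order could be derived. The Python sketch below (Python 3.9+ for `graphlib`) is an assumption-laden illustration: the barclamp names echo the ones in this post, but the specific dependencies and the `deploy_order` helper are invented here and do not come from Crowbar.

```python
from graphlib import TopologicalSorter

# Hypothetical map of barclamp -> barclamps it depends on.
BARCLAMP_DEPS = {
    "ipmi": set(),
    "bios": {"ipmi"},
    "raid": {"bios"},
    "network": {"raid"},
    "nagios": {"network"},
    "openstack": {"network"},
    "hadoop": {"network"},
}


def deploy_order(deps):
    """Return barclamp names ordered so prerequisites always come first."""
    return list(TopologicalSorter(deps).static_order())


order = deploy_order(BARCLAMP_DEPS)
print(order)  # lowest system layers first, applications last
```

A topological sort only guarantees that each barclamp follows its prerequisites; the relative order of independent barclamps (here, the three that depend only on the network) is left to the sorter.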

Making these changes was a substantial investment by Dell, but it greatly expands the community’s ability to participate in Crowbar development.  We believe these changes were essential to our team’s core values of open and collaborative development.

Most recently, our team moved Crowbar development into the open.  This change was reflected in our work on OpenStack Diablo (+ Keystone and Dashboard) with contributions by Opscode and Rackspace Cloud Builders.  Rather than work internally and push updates at milestones, we are now coding directly from the Crowbar repositories on GitHub.  It is important to note that for licensing reasons, Dell has not open sourced the optional BIOS and RAID barclamps.  This level of openness better positions us to collaborate with the Crowbar community.

For a young project, we’re very proud of the progress that we’ve made with Crowbar.  We are starting a new chapter that brings new challenges such as expanding community involvement, roadmap transparency, and growing Dell support capabilities.  You will also begin to see optional barclamps that interact with proprietary and licensed hardware and software.  All of these changes are part of growing Crowbar into a framework that can support a vibrant and rich ecosystem.

We are doing everything we can to make it easy to become part of the Crowbar community.  Please join our mailing list, download the open source code or ISO, create a barclamp, and make your voice heard.  Since Dell is funding the core development on this project, contacting your Dell salesperson and telling them how much you appreciate our efforts goes a long way too.