Short lived VM (Mayflies) research yields surprising scheduling benefit

Last semester, Alex Hirschfeld (my son) did a simulation to explore the possible efficiency benefits of the Mayflies concept proposed by Josh McKenty and me.

Mayflies swarming from Wikipedia

In the initial phase of the research, he simulated a data center using load curves designed to oversubscribe the resources (he’s still interesting in actual load data).  This was sufficient to test the theory and find something surprising: mayflies can really improve scheduling.

Alex found an unexpected benefit comes when you force mayflies to have a controlled “die off.”  It allows your scheduler to be much smarter.

Let’s assume that you have a high mayfly ratio (70%), that means every day 10% of your resources would turn over.  If you coordinate the time window and feed that information into your scheduler, then it can make much better load distribution decisions.  Alex’s simulation showed that this approach basically eliminated hot spots and server over-crowding.

Here’s a snippet of his report explaining the effect in his own words:

On a system that is more consistent and does not have a massive virtual machine through put, Mayflies may not help with balancing the systems load, but with the social engineering aspect, it can increase the stability of the system.

Most of the time, the requests for new virtual machines on a cloud are immutable. They came in at a time and need to be fulfilled in the order of their request. Mayflies has the potential to change that. If a request is made, it has the potential to be added to a queue of mayflies that need to be reinitialized. This creates a queue of virtual machine requests that any load balancing algorithm can work with.

Mayflies can make load balancing a system easier. Knowing the exact size of the virtual machine that is going to be added and knowing when it will die makes load balancing for dynamic systems trivial.

Physical Ops = Plumbers of the Internet. Celebrating dirty IT jobs 8 bit style

I must be crazy because I like to make products that take on the hard and thankless jobs in IT.  Its not glamorous, but someone needs to do them.

marioAnalogies are required when explaining what I do to most people.  For them, I’m not a specialist in physical data center operations, I’m an Internet plumber who is part of the team you call when your virtual toilet backs up.  I’m good with that – it’s work that’s useful, messy and humble.

Plumbing, like the physical Internet, disappears from most people’s conscious once it’s out of sight under the floor, cabinet or modem closet.  And like plumbers, we can’t do physical ops without getting dirty.  Unlike cloud-based ops with clean APIs and virtual services, you can’t do physical ops without touching something physical.  Even if you’ve got great telepresence, you cannot get away from physical realities like NIC and SATA enumeration, BIOS management and network topology.  I’m delighted that cloud has abstracted away that layer for most people but that does not mean we can ignore it.

Physical ops lacks the standardization of plumbing.  There are many cross-vendor standards but innovation and vendor variation makes consistency as unlikely as a unicorn winning the Rainbow Triple Crown.

493143-donkey_kong_1For physical ops, it feels like we’re the internet’s most famous plumber, Mario, facing Donkey Kong.  We’ve got to scale ladders, jump fireballs and swing between chains.  The job is made harder because there’s no half measures.  Sometimes you can find the massive hammer and blast your way through but that’s just a short term thing.

Unfortunately, there’s a real enemy here: complexity.

Just like Donkey Kong keeps dashing off with the princess, operations continue to get more and more complex.  Like with Mario, the solution is not to bypass the complexity; it’s to get better and faster at navigating the obstacles that get thrown at you.  Physical ops is about self-reliance and adaptability.  In that case, there are a lot of lessons to be learned from Mario.

If I’m an internet plumber then I’m happy to embrace Mario as my mascot.  Plumbers of the internet unite!

Ops is Ops, except when it ain’t. Breaking down the impedance mismatches between physical and cloud ops.

We’ve made great strides in ops automation, but there’s no one-size-fits-all approach to ops because abstractions have limitations.

IMG_20141108_035537967Perhaps it’s my Industrial Engineering background, I’m a huge fan of operational automation and tooling. I can remember my first experience with VMware ESX and thinking that it needed tooling automation.  Since then, I’ve watched as cloud ops has revolutionized application development and deployment.  We are just at the beginning of the cloud automation curve and our continuous deployment tooling and platform services deliver exponential increases in value.

These cloud breakthroughs are fundamental to Ops and uncovered real best practices for operators.  Unfortunately, much of the cloud specific scripts and tools do not translate well to physical ops.  How can we correct that?

Now that I focus on physical ops, I’m in awe of the capabilities being unleashed by cloud ops. Looking at Netflix chaos monkeys pattern alone, we’ve reached a point where it’s practical to inject artificial failures to improve application robustness.  The idea of breaking things on purpose as an optimization is both terrifying and exhilarating.

In the last few years, I’ve watched (and lead) an application of these cloud tool chains down to physical infrastructure.  Fundamentally, there’s a great fit between DevOps configuration management (Chef, Puppet, Salt, Ansible) tooling and physical ops.  Most of the configuration and installation work (post-ready state) is fundamentally the same regardless if the services are physical, virtual or containerized.  Installing PostgreSQL is pretty much the same for any platform.

But pretty much the same is not exactly the same.  The differences between platforms often prevent us from translating useful work between frames.  In physics, we’d call that an impedance mismatch: where similar devices cannot work together dues to minor variations.

An example of this Ops impedance mismatch is networking.  Virtual systems present interfaces and networks that are specific to the desired workload while physical systems present all the available physical interfaces plus additional system interfaces like VLANs, bridges and teams.  On a typical server, there at least 10 available interfaces and you don’t get to choose which ones are connected – you have to discover the topology.  To complicate matters, the interface list will vary depending on both the server model and the site requirements.

It’s trivial in virtual by comparison, you get only the NICs you need and they are ordered consistently based on your network requests.  While the basic script is the same, it’s essential that it identify the correct interface.  That’s simple in cloud scripting and highly variable for physical!

Another example is drive configuration.  Hardware presents limitless options of RAID, JBOD plus SSD vs HDD.  These differences have dramatic performance and density implications that are, by design, completely obfuscated in cloud resources.

The solution is to create functional abstractions between the application configuration and the networking configuration.  The abstraction isolates configuration differences between the scripts.  So the application setup can be reused even if the networking is radically different.

With some of our OpenCrowbar latest work, we’re finally able to create practical abstractions for physical ops that’s repeatable site to site.  For example, we have patterns that allow us to functionally separate the network from the application layer.  Using that separation, we can build network interfaces in one layer and allow the next to assume the networking is correct as if it was a virtual machine.  That’s a very important advance because it allows us to finally share and reuse operational scripts.

We’ll never fully eliminate the physical vs cloud impedance issue, but I think we can make the gaps increasingly small if we continue to 1) isolate automation layers with clear APIs and 2) tune operational abstractions so they can be reused.

Mayflies and Dinosaurs (extending Puppies and Cattle)

Dont Be FragileJosh McKenty and I were discussing the common misconception of the “Puppies and Cattle” analogy. His position is not anti-puppy! He believes puppies are sometimes unavoidable and should be isolated into portable containers (VMs) so they can be shuffled around seamlessly. His more provocative point is that we want our underlying infrastructure to be cattle so it remains highly elastic and flexible. More cattle means a more resilient system. To me, this is a fundamental CloudOps design objective.

We realized that the perfect cloud infrastructure would structurally discourage the creation of puppies.

Imagine a cloud in which servers were automatically decommissioned after a week of use. In a sort of anti-SLA, any VM running for more than 168 hours would be (gracefully) terminated. This would force a constant churn of resources within the infrastructure that enables true cattle-like management. This cloud would be able to very gracefully rebalance load and handle disruptive management operations because the workloads are designed for the churn.

We called these servers mayflies due to their limited life span.

While this approach requires a high degree of automation, the most successful cloud operators I have met are effectively building workloads with this requirement. If we require application workloads to be elastic and fault-resilient then we have a much higher degree of flexibility with the underlying infrastructure. I’ve seen this in practice with several OpenStack clouds: operators with helped applications deploy using automation were able to decommission “old” clouds much more gracefully. They effectively turned their entire cloud into a cow. Sadly, the ones without that investment puppified™ the ops infrastructure and created a much more brittle environment.

The opposite of a mayfly is the dinosaur: a server that is so brittle and locked that the slightest disturbance wipes out everything it touches.

Dinosaurs are puppies grown into a T-Rex with rows of massive razor sharp teeth and tiny manicured hands. These are systems that are so unique and historical that there’s no way to recreate them if there’s a failure. The original maintainers exit happy hour was celebrated by people who were laid-off two CEOs ago. The impact of dinosaurs goes beyond their operational risk; they are typically impossible to extend or maintain and, consequently, ossify other server around them. This type of server drains elasticity from your ops team.

Puppies do not grow up to become dogs, they become dinosaurs.

It’s a classic lean adage to do hard things more frequently. Perhaps it’s time to start creating mayflies in your ops infrastructure.

Do Be Dense! Dell C8000 unit merges best of bladed and rackable servers

“Double wide” is not a term I’ve commonly applied to servers, but that’s one of the cool things about this new class of servers that Dell, my employer, started shipping today.

My team has been itching for the chance to start cloud and big data reference architectures using this super dense and flexible chassis. You’ll see it included in our next Apache Hadoop release and we’ve already got customers who are making it the foundation of their deployments (Texas Adv Computing Center case study).

If you’re tracking the latest big data & cloud hardware then the Dell PowerEdge C8000 is worth some investigation.

Basically, the Dell C8000 is a chassis that holds a flexible configuration of compute or storage sleds. It’s not a blade frame because the sleds minimize shared infrastructure. In our experience, cloud customers like the dedicated i/o and independence of sleds (as per the Bootstrapping clouds white paper). Those attributes are especially well suited for Hadoop and OpenStack because they support a “flat edges” and scale out design. While i/o independence is valued, we also want shared power infrastructure and density for efficiency reasons. Using a chassis design seems to capture the best of both worlds.

The novelty for the Dell PowerEdge C8000 is that the chassis are scary flexible. You are not locked into a pre-loaded server mix.

There are a plethora of sled choices so that you can mix choices for power, compute density and spindle counts. That includes double-wide sleds positively brimming with drives and expanded GPU processers. Drive density is important for big data configurations that are disk i/o hungry; however, our experience is the customer deployments are highly varied based on the planned workload. There are also significant big data trends towards compute, network, and balanced hardware configurations. Using the C8000 as a foundation is powerful because it can cater to all of these use-case mixes.

That reminds me! Mike Pittaro (our team’s Hadoop lead architect) did an excellent Deploy Hadoop using Crowbar video.

Interested in more opinions about the C8000? Check out Barton George & David Meyer.

The real workloads begin: Crowbar’s Sophomore Year

Given Crowbar‘s frenetic Freshman year, it’s impossible to predict everything that Crowbar could become. I certainly aspire to see the project gain a stronger developer community and the seeds of this transformation are sprouting. I also see that community driven work is positioning Crowbar to break beyond being platforms for OpenStack and Apache Hadoop solutions that pay the bills for my team at Dell to invest in Crowbar development.

I don’t have to look beyond the summer to see important development for Crowbar because of the substantial goals of the Crowbar 2.0 refactor.

Crowbar 2.0 is really just around the corner so I’d like to set some longer range goals for our next year.

  • Growing acceptance of Crowbar as an in data center extension for DevOps tools (what I call CloudOps)
  • Deeper integration into more operating environments beyond the core Linux flavors (like virtualization hosts, closed and special purpose operating systems.
  • Improvements in dynamic networking configuration
  • Enabling more online network connected operating modes
  • Taking on production ops challenges of scale, high availability and migration
  • Formalization of our community engagement with summits, user groups, and broader developer contributions.

For example, Crowbar 2.0 will be able to handle downloading packages and applications from the internet. Online content is not a major benefit without being able to stage and control how those new packages are deployed; consequently, our goals remains tightly focused improvements in orchestration.

These changes create a foundation that enables a more dynamic operating environment. Ultimately, I see Crowbar driving towards a vision of fully integrated continuous operations; however, Greg & Rob’s Crowbar vision is the topic for tomorrow’s post.

Seven Cloud Success Criteria to consider before you pick a platform

From my desk at Dell, I have a unique perspective.   In addition to a constant stream of deep customer interactions about our many cloud solutions (even going back pre-OpenStack to Joyent & Eucalyptus), I have been an active advocate for OpenStack, involved in many discussions with and about CloudStack and regularly talk shop with Dell’s VIS Creator (our enterprise focused virtualization products) teams.  And, if you go back ten years to 2002, patented the concept of hybrid clouds with Dave McCrory.

Rather than offering opinions in the Cloud v. Cloud fray, I’m suggesting that cloud success means taking a system view.

Platform choice is only part of the decision: operational readiness, application types and organization culture are critical foundations before platform.

Over the last two years at Dell, I found seven points outweigh customers’ choice of platform.

  1. Running clouds requires building operational expertise both at the application and infrastructure layers.  CloudOps is real.
  2. Application architectures matter for cloud deployment because they can redefine the SLA requirements and API expectations
  3. Development community and collaboration is a significant value because sharing around open operations offers significant returns.
  4. We need to build an accelerating pace of innovation into our core operating principles
  5. There are still significant technology gaps to fill (networking & storage) and we will discover new gaps as we go
  6. We can no longer discuss public and private clouds as distinct concepts.   True hybrid clouds are not here yet, but everyone can already see their massive shadow.
  7. There is always more than one right technological answer.  Avoid analysis paralysis by making incrementally correct decisions (committing, moving forward, learning and then re-evaluating).