Short lived VM (Mayflies) research yields surprising scheduling benefit

Last semester, Alex Hirschfeld (my son) did a simulation to explore the possible efficiency benefits of the Mayflies concept proposed by Josh McKenty and me.

In the initial phase of the research, he simulated a data center using load curves designed to oversubscribe the resources (he’s still interested in actual load data).  This was sufficient to test the theory and find something surprising: mayflies can really improve scheduling.

Alex found that an unexpected benefit comes when you force mayflies to have a controlled “die off.”  It allows your scheduler to be much smarter.

Let’s assume that you have a high mayfly ratio (70%).  With a 7 day lifespan, that means roughly 10% of your resources would turn over every day.  If you coordinate the time window and feed that information into your scheduler, then it can make much better load distribution decisions.  Alex’s simulation showed that this approach basically eliminated hot spots and server over-crowding.
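The arithmetic behind that turnover figure is small enough to write down.  This is only a sketch: the function name is mine, and it assumes a 7 day lifespan with expirations spread evenly across the window.

```python
def daily_turnover(mayfly_ratio, lifespan_days):
    """Fraction of total capacity that expires each day.

    Assumes expirations are staggered evenly over the lifespan window.
    """
    if lifespan_days <= 0:
        raise ValueError("lifespan_days must be positive")
    return mayfly_ratio / lifespan_days


# 70% of capacity is mayflies, each living 7 days (assumed lifespan)
print(f"{daily_turnover(0.70, 7):.0%} of resources turn over per day")
```

A shorter lifespan or a higher mayfly ratio raises the daily churn the scheduler gets to work with.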

Here’s a snippet of his report explaining the effect in his own words:

On a system that is more consistent and does not have a massive virtual machine throughput, Mayflies may not help with balancing the system’s load, but with the social engineering aspect, it can increase the stability of the system.

Most of the time, the requests for new virtual machines on a cloud are immutable. They came in at a time and need to be fulfilled in the order of their request. Mayflies has the potential to change that. If a request is made, it has the potential to be added to a queue of mayflies that need to be reinitialized. This creates a queue of virtual machine requests that any load balancing algorithm can work with.

Mayflies can make load balancing a system easier. Knowing the exact size of the virtual machine that is going to be added and knowing when it will die makes load balancing for dynamic systems trivial.
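To make the point about known sizes and known death dates concrete, here is a minimal placement sketch.  It is not Alex’s simulation: the `VM`, `Server` and `place` names, the day-based clock and the scoring rule are all assumptions of mine.  The idea is that a scheduler that knows when each mayfly dies can prefer a server that is busy today but about to drain over one that will stay moderately loaded.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    size: int     # resource units requested
    dies_on: int  # day index on which the mayfly is retired


@dataclass
class Server:
    capacity: int
    vms: list = field(default_factory=list)

    def load_on(self, day):
        """Load from VMs that are still alive on the given day."""
        return sum(vm.size for vm in self.vms if vm.dies_on > day)


def projected_load(server, first_day, last_day):
    """Total load the server will carry over [first_day, last_day)."""
    return sum(server.load_on(d) for d in range(first_day, last_day))


def place(servers, vm, today):
    """Place vm on the server with the least projected load over vm's life.

    With no future arrivals modeled, load only falls as mayflies die, so a
    VM that fits today keeps fitting. Returns the server index, or None.
    """
    candidates = [
        (projected_load(s, today, vm.dies_on), i)
        for i, s in enumerate(servers)
        if s.load_on(today) + vm.size <= s.capacity
    ]
    if not candidates:
        return None
    _, best = min(candidates)
    servers[best].vms.append(vm)
    return best


# Server 0 is busier today but drains tomorrow; server 1 stays loaded all week.
servers = [Server(10, [VM(6, 1)]), Server(10, [VM(4, 7)])]
chosen = place(servers, VM(2, 7), today=0)
print(chosen)
```

A scheduler that only looks at today’s load would pick server 1 here; the expiry-aware one picks server 0 because it knows that load is about to go away.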

Research showing that Short Lived Servers (“mayflies”) create efficiency at scale [DATA REQUESTED]

Last summer, Josh McKenty and I extended the puppies and cattle metaphor to limited life cattle we called “mayflies.” It was an attempt to help drive the cattle mindset (I think of it as social engineering, or maybe PsychOps) by forcing churn. I’ve come to think of it as a step in between cattle and chaos monkeys (see Adrian Cockcroft).

While our thoughts were mainly on ops patterns, I’ve heard that there could be a real operational benefit from encouraging this behavior. The increased turnover in the environment improves scheduler optimization, planned load drains and coping with platform/environment migration.

Now we have a chance to quantify this benefit: a college student (disclosure: he’s my son) has created a data center emulation to see if Mayflies help with utilization. His model appears to work.

Now he needs some real world data; here’s his request for assistance [note: he needs data by 1/20 to be included in this term]:


I am Alexander Hirschfeld, a freshman at Rose-Hulman Institute of Technology. I am working on an independent study about Mayflies, a new idea in virtual machine management in cloud computing. Part of this management is load balancing and resource allocation for virtual machines across a collection of servers. The emulation that I am working on needs a realistic set of data to be the most accurate when modeling the results of using the methods outlined by the theory of mayflies.

Mayflies are an extension of the puppies versus cattle approach to machines; they are the extreme version of cattle as they have a known limited lifespan, such as 7 days. This requires the users of the cloud to build inherently more automated and fault-resistant applications. If you could send me a collection of the requests for new virtual machines (per standard unit of time and their requested specs/size), as well as an average lifetime for the virtual machines (or a graph or list of designated/estimated lifetimes), and a basic summary of the collection of servers running the virtual machines (number, ram, cores), I would be better able to understand how Mayflies can affect a cloud.

Alexander Hirschfeld, twitter: @d-qoi

Needless to say, I’m really excited about the progress on demonstrating some of the impact of this practice and am looking forward to posting about his results in the near future.

If you post in the comments, I will make sure you are connected to Alex.

OpenCrowbar bootstrap positions SSH Keys for hand-offs

I was reading a ComputerWorld article about how Google and Amazon achieve scale.  The theme: you must do better than linear cost scale and the only way to achieve that is to automate and commoditize hardware.  I find interesting parallels in the Crowbar physical devops effort.

As the OpenCrowbar team continues to explore the concepts around “ready state,” I discover more and more small ops nuisances that need to be included in the build up before installing software.  These small items quickly add up at scale, breaking the rule above.

I’ve already posted about the performance benefit of building a Squid Proxy fabric as part of the underlying ops environment.  As we work on Chef Metal, SaltStack and Packstack integrations (private beta), we’ve rediscovered the importance of management/population of SSH public keys.

In cloud infrastructure, key injection is taken for granted; however, it’s not an automatic behavior in physical ops.  OpenCrowbar handles keys by default, but other tools (like Cobbler or Razor) expect that you will use kickstart to inject your SSH keys when you install the operating system.

Including keys in hand generated kickstart scripts (I’m using “kickstart” generically to cover preseed, auto-yast, jumpstart, etc.) is a potentially dangerous security practice since it makes it difficult to propagate and manage your keys.  It also means that every time a new operating system update is released, you may have to update and retest your kickstarts.  OpenCrowbar has the same challenge, but our approach allows everyone to share in the work because our bootstrapping files are scripted and generic.
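As an illustration of why scripted key handling beats keys pasted into a kickstart, here is a small sketch of an idempotent merge into an `authorized_keys` file.  This is not OpenCrowbar’s actual mechanism; the function name and the key strings are placeholders I made up.

```python
def merge_authorized_keys(existing, new_keys):
    """Return authorized_keys content with each new key appended once.

    existing: current file content as a string (may be empty)
    new_keys: iterable of public key lines to ensure are present
    """
    lines = [ln.strip() for ln in existing.splitlines() if ln.strip()]
    seen = set(lines)
    for key in new_keys:
        key = key.strip()
        if key and key not in seen:
            lines.append(key)
            seen.add(key)
    return ("\n".join(lines) + "\n") if lines else ""


# Placeholder keys, not real key material
current = "ssh-ed25519 AAAA ops@admin\n"
wanted = ["ssh-ed25519 AAAA ops@admin", "ssh-ed25519 BBBB dev@laptop"]
merged = merge_authorized_keys(current, wanted)
print(merged, end="")
```

Because running it twice produces the same file, the same script can be re-applied on every node whenever the key list changes, without touching the OS install templates.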

OpenCrowbar takes care of these ready state configurations in our integrations with these DevOps platforms.  Our experience has been that little items like SSH keys and proxy configurations can create a disproportionate advantage when running scale ops or during iterative development.

Apply, Rinse, Repeat! How do I get that DevOps conditioner out of my hair?

I’ve been trying to explain the pain Tao of physical ops in a way that’s accessible to people without scale ops experience.   It comes down to a yin-yang of two elements: exploding complexity and iterative learning.

Exploding complexity is pretty easy to grasp when we stack up the number of control elements inside a single server (OS RAID, 2 SSD cache levels, 20 disk JBOD, and UEFI oh dear), the networks that server is connected to, the multi-layer applications installed on the servers, and the change rate of those applications.  Multiply that times 100s of servers and we’ve got a problem of unbounded scope even before I throw in SDN overlays.

But that’s not the real challenge!  The bigger problem is that it’s impossible to design for all those parameters in advance.

When my team started doing scale installs 5 years ago, we assumed we could ship a preconfigured system.  After a year of trying, we accepted the reality that it’s impossible to plan out a scale deployment; instead, we had to embrace a change tolerant approach that I’ve started calling “Apply, Rinse, Repeat.”

Using Crowbar to embrace the in-field nature of design, we discovered a recurring pattern of installs: we always performed at least three full cycle installs to get to ready state during every deployment.

  1. The first cycle was completely generic to provide a working baseline and validate the physical environment.
  2. The second cycle attempted to integrate to the operational environment and helped identify gaps and needed changes.
  3. The third cycle could usually interconnect with the environment and generally exposed new requirements in the external environment.
  4. The subsequent cycles represented additional tuning, patches or redesigns that could only be realized after load was applied to the system in situ.

Every time we tried to shortcut the Apply-Rinse-Repeat cycle, it actually made the total installation longer!  Ultimately, we accepted that the only defense was to focus on reducing A-R-R cycle time so that we could spend more time learning before the next cycle started.

In scale-out infrastructure, tools & automation matter

Scale out platforms like Hadoop have different operating rules.  I heard an interesting story today in which the overall system ran three times faster (a run went from 15 mins down to 5 mins) after the removal of a node.

In a distributed system that coordinates work between multiple nodes, it only takes one bad node to dramatically impact the overall performance of the entire system.
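A toy model shows why.  The timings below are invented to echo the 15 to 5 minute story, the helper names are mine, and the sketch assumes the remaining nodes absorb the bad node’s work without slowing down, which a real cluster will not do perfectly.

```python
from statistics import median


def job_minutes(node_minutes):
    """A coordinated job finishes only when its slowest node does."""
    return max(node_minutes.values())


def stragglers(node_minutes, factor=2.0):
    """Names of nodes slower than `factor` times the median node."""
    cutoff = factor * median(node_minutes.values())
    return sorted(n for n, m in node_minutes.items() if m > cutoff)


# Made-up per-node run times in minutes
run = {"node1": 5.0, "node2": 4.8, "node3": 5.0, "node4": 15.0}
slow = stragglers(run)
healthy = {n: m for n, m in run.items() if n not in slow}

print(job_minutes(run))      # whole job waits on node4
print(slow)
print(job_minutes(healthy))  # job time once node4 is removed
```

Three healthy nodes beat four nodes with one bad member, because the job time is a maximum and not an average.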

Finding and correcting this type of failure can be difficult.  While natural variability, hardware faults or bugs cause some issues, the human element is by far the most likely cause.   If you can turn down noise injected by human error then you’ve got a chance to find the real system related issues.

Consequently, I’ve found that management tooling and automation are essential for success.  Management tools help diagnose the cause of the issue and automation creates repeatable configurations that reduce the risk of human injected variability.

I’d also like to give a shout out to benchmarks as part of your tooling suite.  Without having a reasonable benchmark it would be impossible to actually know that your changes improved performance.

Teaming Related Post Script: In considering the concept of system performance, I realized that distributed human systems (aka teams) have a very similar characteristic.  A single person can have a disproportionate impact on overall team performance.

CAP Chasm: why clouds say “no SANank you” to SANs

My personal bias against SANs in cloud architectures is well documented; however, I am in the minority at my employer (Dell) and few enterprise IT shops share my view.  In his recent post about CAP theorem, Dave McCrory has persuaded me to look beyond their failure to bask in my flawless reasoning.  Apparently, this crazy CAP thing explains why some people love SANs (enterprise) and others don’t (clouds).

The deal with CAP is that you can only have two of Consistency, Availability, or Partition Tolerance.  Since everyone wants Availability, the choice is really between Consistency and Partition Tolerance.  Seeking Availability, you’ve got two approaches:

  1. Legacy applications tried to eliminate faults to achieve Consistency with physically redundant scale up designs. 
  2. Cloud applications assume faults to achieve Partition Tolerance with logically redundant scale out designs.

According to CAP, legacy and cloud approaches are so fundamentally different that they create a “CAP Chasm” in which the very infrastructure fabric needed to deploy these applications is different.

As a cloud geek, I consider the inherent cost and scale limitations of a CA approach much too restrictive.   My first hand experience is that our customers and partners share my view: they have embraced AP patterns.  These patterns make more efficient use of resources, dictate simpler infrastructure layout, scale like hormone-crazed rabbits at a carrot farm, and can be deployed on less expensive commodity hardware.

As a CAP theorem enlightened IT professional, I can finally accept that there are other intellectually valid infrastructure models. 

See Mom?  I can play nicely with others after all.

Are Clouds using Dark Cycles?

Or “Darth Vader vs Godzilla”

Way way back in January, I’d heard loud and clear that companies were not expecting to mix cloud computing loads.  I was treated like a three-eyed Japanese tree slug for suggesting that we could mix HPC and Analytics loads with business applications in the same clouds.  The consensus was that companies would stand up independent clouds for each workload.  The analysis work was too important to interrupt and the business applications too critical to risk.

It has always rankled me that all those unused compute cycles (“the dark cycles”) could be put to good use.  It appeals to my eco-geek side to make the best possible use of all those idle servers.   Dave McCrory and I even wrote some cloud patents around this.

However, I succumbed to the scorn and accepted the separation.

Now all of a sudden, this idea seems to be playing Godzilla to a Tokyo shaped cloud data center.  I see several forces merging together to resurrect mixing workloads.

  1. Hadoop (and other map-reduce Analytics) are becoming required business tools
  2. Public clouds are making it possible to quickly (if not cheaply) set up analytic clouds
  3. Governance of virtualization is getting better
  4. Companies want to save some $$$

This trend will only continue as Moore’s Law improves the compute density for hardware.  Since we are moving towards scale out designs that distribute applications over multiple nodes, it is not practical to expect an application to consume all the power of a single computer.

That leaves a lot of lonely dark cycles looking for work.

Just Striping in the RAIN

Or “behold the power of the unreliable”

In a previous post, I discussed the concept of a Redundant Array of Inexpensive Nodes (RAIN) as a way to create more reliable and scalable applications.   Deploying a RAIN application is like being the House in Vegas – it’s about having enough size that the odds come out in your favor.  Even if one player is on a roll, you can predict that nearly everyone else is paying your rent.  Imagine what would happen if all the winning gamblers were in your casino!  If you don’t want to go bankrupt when deploying a RAIN app, then ensure that the players spread out all over the Strip.
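The “odds in your favor” claim can be checked with a binomial calculation.  This is a sketch under a strong assumption: nodes fail independently, which is the math equivalent of making sure the players are spread out all over the Strip.  The function name and the 95% per-node figure are illustrative, not measured.

```python
from math import comb


def prob_at_least(n, k, p_up):
    """Probability that at least k of n independent nodes are up."""
    return sum(
        comb(n, i) * p_up**i * (1 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


# One node that is up 95% of the time vs. an app that needs any 8 of 10 such nodes
single = prob_at_least(1, 1, 0.95)
rain = prob_at_least(10, 8, 0.95)
print(f"single node: {single:.4f}")
print(f"8 of 10 nodes: {rain:.4f}")
```

If failures are correlated (all the winning gamblers in one casino), the independence assumption breaks and the array stops protecting you.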

One of my core assumptions is that you’ll deploy a RAIN application on a cloud.   This is significant because we’re assuming that your nodes are:

  1. idle most of the time because your traffic loads are cyclic
  2. unreliable because the cloud provider does not offer much SLA
  3. divisible because renting ten 1/10ths of a server costs roughly the same as a whole one
  4. burstable because 1/10th servers can sometimes consume that extra 9/10th server

The burstable concept is a dramatic power multiplier that tips the whole RAIN equation heavily towards clouds. 

Bursting means that under load, your 10 1/10th servers (roughly 1 server in cost) could instantly expand to the power of 10 full servers!  That reflects an order of magnitude spike in demand without any change in your application.

In the past, we’ve racked extra servers to handle this demand.  That meant that we had a lot of extra capacity taking up rack space, clubbing innocent migratory electrons for their soft velvety fur, and committing over provisioning atrocities.  

Today, multi-tenant clouds allow us to average out these bursts by playing the odds on when application bursts will occur.  In this scenario, the infrastructure provider benefits from the fact that applications need to be over provisioned.  The application author benefits because they can instantly tap more resources on demand.  Now, that is what I would call a win-win synergy!
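Here is a rough sketch of what “playing the odds” looks like from the provider’s side.  The model is an assumption of mine: each tenant holds ten 1/10th servers (about 1 server when idle, 10 when bursting), tenants burst independently with a fixed probability, and the provider sizes for a chosen confidence level.  The numbers are illustrative only.

```python
from math import comb


def bursts_to_cover(tenants, p_burst, confidence):
    """Smallest b such that P(simultaneous bursts <= b) >= confidence."""
    cumulative = 0.0
    for b in range(tenants + 1):
        cumulative += (
            comb(tenants, b) * p_burst**b * (1 - p_burst) ** (tenants - b)
        )
        if cumulative >= confidence:
            return b
    return tenants  # guard against floating point shortfall


tenants = 100
b = bursts_to_cover(tenants, p_burst=0.05, confidence=0.999)

dedicated = tenants * 10            # every tenant racks servers for its own peak
shared = (tenants - b) * 1 + b * 10  # idle tenants use 1 server, bursting ones 10

print(f"size for {b} simultaneous bursts")
print(f"dedicated: {dedicated} servers, shared: {shared} servers")
```

The gap between the two totals is the over provisioning that the multi-tenant cloud gets to average away.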

All this goodness depends on

  • standard patterns & practices that developers use to scale out RAIN applications
  • platform improvements in cloud infrastructure to enable smooth scale out
  • commercial models that fairly charge for multi-tenant over-subscription
  • workable security that understands the underlying need for co-mingling workloads

The growing dominance of cloud deployments looks different when you understand the multiplying interplay between multi-tenant clouds and cloud-ready RAIN applications.

Cloud Application Life Cycle

Or “you learn by doing, and doing, and doing”

One of the most consistent comments I hear about cloud applications is that the cloud fundamentally changes the way applications are written.  I’m not talking about the technologies, but the processes and infrastructure.

Since our underlying assumption of a cloud application is that node failure is expected, our development efforts need to build in that assumption before any code is written.  Consequently, cloud apps should be written directly on cloud infrastructure.
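Building in the failure assumption can be as simple as never calling a single node directly.  The sketch below is hypothetical: nodes are modeled as plain callables, and `call_with_failover`, `dead_node` and `live_node` are names invented for the example.

```python
def call_with_failover(nodes, request):
    """Try each node in turn and return the first successful response.

    nodes: list of callables that take a request and may raise ConnectionError
    """
    errors = []
    for node in nodes:
        try:
            return node(request)
        except ConnectionError as exc:
            errors.append(str(exc))  # remember the failure, move to next node
    raise ConnectionError(f"all {len(nodes)} nodes failed: {errors}")


def dead_node(request):
    raise ConnectionError("node down")


def live_node(request):
    return f"ok: {request}"


response = call_with_failover([dead_node, live_node], "GET /status")
print(response)
```

Code written this way only gets exercised when nodes really do disappear, which is the argument for developing on infrastructure where they do.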

In old school development, I would have all the components for my application on my desktop.  That’s necessary for daily work, but does not give me a warm fuzzy for success in production.

Today’s scale production environments involve replicated data with synchronization lags, shared multi-writer memcache, load balancers, and mixed code versions.  There is no way that I can simulate that on my desktop!   There is no way I can fully anticipate how that will behave all together!

The traditional alternative is to wait.  Wait for QA to try and find bugs through trial and error.  Or (more likely) wait for users to discover the problem post deployment.

My alternative is to constantly deploy the application to a system that matches production.    As a bonus, I then attack the deployment with integration tests and simulators.

If you’re thinking that is too much effort then you are not thinking deeply enough.  This model forces developers to invest in install and deployment automation.  That means that you will be able to test earlier in the cycle.  It means you will be able to fix issues more quickly.  And that you’ll be able to ship more often.  It means that you can involve operations and networking specialists well before production.  You may even see more collaboration between your development, quality, and operations teams.

Forget about that last one – if those teams actually worked together you might accidentally ship product on time.  Gasp!