Short lived VM (Mayflies) research yields surprising scheduling benefit

Last semester, Alex Hirschfeld (my son) did a simulation to explore the possible efficiency benefits of the Mayflies concept proposed by Josh McKenty and me.

Mayflies swarming from Wikipedia

In the initial phase of the research, he simulated a data center using load curves designed to oversubscribe the resources (he’s still interesting in actual load data).  This was sufficient to test the theory and find something surprising: mayflies can really improve scheduling.

Alex found an unexpected benefit comes when you force mayflies to have a controlled “die off.”  It allows your scheduler to be much smarter.

Let’s assume that you have a high mayfly ratio (70%), that means every day 10% of your resources would turn over.  If you coordinate the time window and feed that information into your scheduler, then it can make much better load distribution decisions.  Alex’s simulation showed that this approach basically eliminated hot spots and server over-crowding.

Here’s a snippet of his report explaining the effect in his own words:

On a system that is more consistent and does not have a massive virtual machine through put, Mayflies may not help with balancing the systems load, but with the social engineering aspect, it can increase the stability of the system.

Most of the time, the requests for new virtual machines on a cloud are immutable. They came in at a time and need to be fulfilled in the order of their request. Mayflies has the potential to change that. If a request is made, it has the potential to be added to a queue of mayflies that need to be reinitialized. This creates a queue of virtual machine requests that any load balancing algorithm can work with.

Mayflies can make load balancing a system easier. Knowing the exact size of the virtual machine that is going to be added and knowing when it will die makes load balancing for dynamic systems trivial.

Research showing that Short Lived Servers (“mayflies”) create efficiency at scale [DATA REQUESTED]

Last summer, Josh McKenty and I extended the puppies and cattle metaphor to limited life cattle we called “mayflies.” It was an attempt to help drive the cattle mindset (I think of it as social engineering, or maybe PsychOps) by forcing churn. I’ve come to think of it a step in between cattle and chaos monkeys (see Adrian Cockcroft).

While our thoughts were on mainly ops patterns, I’ve heard that there could be a real operational benefit from encouraging this behavior. The increased turn over in the environment improves scheduler optimization, planned load drains and coping with platform/environment migration.

Now we have a chance to quantify this benefit: a college student (disclosure: he’s my son) has created a data center emulation to see if Mayflies help with utilization. His model appears to work.

Now, he needs some real world data, here’s his request for assistance [note: he needs data by 1/20 to be included in this term]:

Hello!

I am Alexander Hirschfeld, a freshman at Rose-Hulman Institute of Technology. I am working on an independent study about Mayflies, a new idea in virtual machine management in cloud computing. Part of this management is load balancing and resource allocation for virtual machines across a collection of servers. The emulation that I am working on needs a realistic set of data to be the most accurate when modeling the results of using the methods outlined by the theory of mayflies.

Mayflies are an extension of the puppies verses cattle approach to machines, they are the extreme version of cattle as they have a known limited lifespan, such as 7 days. This requires the users of the cloud to build inherently more automated and fault-resistant applications. If you could send me a collection of the requests for new virtual machines(per standard unit of time and their requested specs/size), as well as an average lifetime for the virtual machines (or a graph or list of designated/estimated life times), and a basic summary of the collection of servers running the virtual machines(number, ram, cores), I would be better able to understand how Mayflies can affect a cloud.

Thanks,
Alexander Hirschfeld, twitter: @d-qoi

Needless to say, I’m really excited about the progress on demonstrating some the impact of this practice and am looking forward to posting about his results in the near future.

If you post in the comments, I will make sure you are connected to Alex.

OpenStack DefCore Enters Execution Phase. Help!

OpenStack DefCore Committee has established the principles and first artifacts required for vendors using the OpenStack trademark.  Over the next release cycle, we will be applying these to the Ice House and Juno releases.

Like a rockLearn more?  Hear about it LIVE!  Rob will be doing two sessions about DefCore next week (will be recorded):

  1. Tues Dec 16 at 9:45 am PST- OpenStack Podcast #14 with Jeff Dickey
  2. Thurs Dec 18 at 9:00 am PST – Online Meetup about DefCore with Rafael Knuth (optional RSVP)

At the December 2014 OpenStack Board meeting, we completed laying the foundations for the DefCore process that we started April 2013 in Portland. These are a set of principles explaining how OpenStack will select capabilities and code required for vendors using the name OpenStack. We also published the application of these governance principles for the Havana release.

  1. The OpenStack Board approved DefCore principles to explain
    the landscape of core including test driven capabilities and designated code (approved Nov 2013)
  2. the twelve criteria used to select capabilities (approved April 2014)
  3. the creation of component and framework layers for core (approved Oct 2014)
  4. the ten principles used to select designated sections (approved Dec 2014)

To test these principles, we’ve applied them to Havana and expressed the results in JSON format: Havana Capabilities and Havana Designated Sections. We’ve attempted to keep the process transparent and community focused by keeping these files as text and using the standard OpenStack review process.

DefCore’s work is not done and we need your help!  What’s next?

  1. Vote about bylaws changes to fully enable DefCore (change from projects defining core to capabilities)
  2. Work out going forward process for updating capabilities and sections for each release (once authorized by the bylaws, must be approved by Board and TC)
  3. Bring Havana work forward to Ice House and Juno.
  4. Help drive Refstack process to collect data from the field

OpenStack DefCore Process Flow: Community Feedback Cycles for Core [6 points + chart]

If you’ve been following my DefCore posts, then you already know that DefCore is an OpenStack Foundation Board managed process “that sets base requirements by defining 1) capabilities, 2) code and 3) must-pass tests for all OpenStack™ products. This definition uses community resources and involvement to drive interoperability by creating the minimum standards for products labeled OpenStack™.”

In this post, I’m going to be very specific about what we think “community resources and involvement” entails.

The draft process flow chart was provided to the Board at our OSCON meeting without additional review.  It below boils down to a few key points:

  1. We are using the documents in the Gerrit review process to ensure that we work within the community processes.
  2. Going forward, we want to rely on the technical leadership to create, cluster and describe capabilities.  DefCore bootstrapped this process for Havana.  Further, Capabilities are defined by tests in Tempest so test coverage gaps (like Keystone v2) translate into Core gaps.
  3. We are investing in data driven and community involved feedback (via Refstack) to engage the largest possible base for core decisions.
  4. There is a “safety valve” for vendors to deal with test scenarios that are difficult to recreate in the field.
  5. The Board is responsible for approving the final artifacts based on the recommendations.  By having a transparent process, community input is expected in advance of that approval.
  6. The process is time sensitive.  There’s a need for the Board to produce Core definition in a timely way after each release and then feed that into the next one.  Ideally, the definitions will be approved at the Board meeting immediately following the release.

DefCore Process Draft

Process shows how the key components: designated sections and capabilities start from the previous release’s version and the DefCore committee manages the update process.  Community input is a vital part of the cycle.  This is especially true for identifying actual use of the capabilities through the Refstack data collection site.

  • Blue is for Board activities
  • Yellow is or user/vendor community activities
  • Green is for technical community activities
  • White is for process artifacts

This process is very much in draft form and any input or discussion is welcome!  I expect DefCore to take up formal review of the process in October.

Patchwork Onion delivers stability & innovation: the graphics that explains how we determine OpenStack Core

This post was coauthored by the DefCore chairs, Rob Hirschfeld & Joshua McKenty.

The OpenStack board, through the DefCore committee, has been working to define “core” for commercial users using a combination of minimum required capabilities (APIs) and code (Designated Sections).  These minimums are decided on a per project basis so it can be difficult to visualize the impact on the overall effect on the Integrated Release.

Patchwork OnionWe’ve created the patchwork onion graphic to help illustrate how core relates to the integrated release.  While this graphic is pretty complex, it was important to find a visual way to show how different DefCore identifies distinct subsets of APIs and code from each project.  This graphic tries to show how that some projects have no core APIs and/or code.

For OpenStack to grow, we need to have BOTH stability and innovation.  We need to give clear guidance to the community what is stable foundation and what is exciting sandbox.  Without that guidance, OpenStack is perceived as risky and unstable by users and vendors. The purpose of defining “Core” is to be specific in addressing that need so we can move towards interoperability.

Interoperability enables an ecosystem with multiple commercial vendors which is one of the primary goals of the OpenStack Foundation.

Ecosystem OnionOriginally, we thought OpenStack would have “core” and “non-core” projects and we baked that expectation into the bylaws.  As we’ve progressed, it’s clear that we need a less binary definition.  Projects themselves have a maturity cycle (ecosystem -> incubated -> integrated) and within the project some APIs are robust and stable while others are innovative and fluctuating.

Encouraging this mix of stabilization and innovation has been an important factor in our discussions about DefCore.  Growing the user base requires encouraging stability and growing the developer base requires enabling innovation within the same projects.

The consequence is that we are required to clearly define subsets of capabilities (APIs) and implementation (code) that are required within each project.  Designating 100% of the API or code as Core stifles innovation because stability dictates limiting changes while designating 0% of the code (being API only) lessens the need to upstream.  Core reflects the stability and foundational nature of the code; unfortunately, many people incorrectly equate “being core” with the importance of the code, and politics ensues.

To combat the politics, DefCore has taken a transparent, principles-based approach to selecting core.   You can read about in Rob’s upcoming “Ugly Babies” post (check back on 8/14) .

Share the love & vote for OpenStack Paris Summit Sessions (closes Wed 8/6)

 

This is a friendly PSA that OpenStack Paris Summit session community voting ends on Wednesday 8/6.  There are HUNDREDS (I heard >1k) submissions so please set aside some time to review a handful.

Robot VoterMY PLEA TO YOU > There is a tendency for companies to “vote-up” sessions from their own employees.  I understand the need for the practice BUT encourage you to make time to review other sessions too.  Affiliation voting is fine, robot voting is not.

If you are interested topics that I discuss on this blog, here’s a list of sessions I’m involved in:

 

 

DefCore Advances at the Core > My take on the OSCON’14 OpenStack Board Meeting

Last week’s day-long Board Meeting (Jonathan’s summary) focused on three major topics: DefCore, Contribute Licenses (CLA/DCO) and the “Win the Enterprise” initiative. In some ways, these three topics are three views into OpenStack’s top issue: commercial vs. individual interests.

But first, let’s talk about DefCore!

DefCore took a major step with the passing of the advisory Havana Capabilities (the green items are required). That means that vendors in the community now have a Board approved minimum requirements.  These are not enforced for Havana so that the community has time to review and evaluate.

Designated Sections (1)For all that progress, we only have half of the Havana core definition complete. Designated Sections, the other component of Core, will be defined by the DefCore committee for Board approval in September. Originally, we expected the TC to own this part of the process; however, they felt it was related to commercial interested (not technical) and asked for the Board to manage it.

The coming meetings will resolve the “is Swift code required” question and that topic will require a dedicated post.  In many ways, this question has been the challenge for core definition from the start.  If you want to join the discussion, please subscribe to the DefCore list.

The majority of the board meeting was spent discussion other weighty topics that are work a brief review.

Contribution Licenses revolve around developer vs broader community challenge. This issue is surprisingly high stakes for many in the community. I see two primary issues

  1. Tension between corporate (CLA) vs. individual (DCO) control and approval
  2. Concern over barriers to contribution (sadly, there are many but this one is in the board’s controls)

Win the Enterprise was born from product management frustration and a fragmented user base. My read on this topic is that we’re pushing on the donkey. I’m hearing serious rumbling about OpenStack operability, upgrade and scale.  This group is doing a surprisingly good job of documenting these requirements so that we will have an official “we need this” statement. It’s not clear how we are going to turn that statement into either carrots or sticks for the donkey.

Overall, there was a very strong existential theme for OpenStack at this meeting: are we a companies collaborating or individuals contributing?  Clearly, OpenStack is both but the proportions remain unclear.

Answering this question is ultimately at the heart of all three primary topics. I expect DefCore will be on the front line of this discussion over the next few weeks (meeting 1, 2, and 3). Now is the time to get involved if you want to play along.