DefCore Core Capabilities Selection Criteria SIMPLIFIED -> how we are picking Core

I’ve posted about the early DefCore core capabilities selection criteria before; we’ve put them into practice and discussed them with the community.  The feedback was simple: tl;dr.  You’ve got the right direction, but make it simpler!

So we pulled the 12 criteria into four primary categories:

  1. Usage: the capability is widely used (Refstack will collect data)
  2. Direction: the capability advances OpenStack technically
  3. Community: the capability builds the OpenStack community experience
  4. System: the capability integrates with other parts of OpenStack

These categories summarize critical values that we want in OpenStack and so make sense as the primary factors used when we select core capabilities.  While we strive to make the DefCore process objective and quantitative, we must recognize that these choices drive community behavior.

With this perspective, let’s review the selection criteria.  To make it easier to cross reference, we’ve given each criterion a shortened name:

Shows Proven Usage

  • “Widely Deployed” Candidates are widely deployed capabilities.  We favor capabilities that are supported by multiple public cloud providers and private cloud products.
  • “Used by Tools” Candidates are widely used capabilities: should be included if supported by common tools (RightScale, Scalr, CloudForms, …)
  • “Used by Clients” Candidates are widely used capabilities: should be included if part of common client libraries (Fog, Apache jclouds, etc.)

Aligns with Technical Direction

  • “Future Direction” Should reflect the future technical direction (from the project technical teams and the TC) and help manage deprecated capabilities.
  • “Stable” The test must have been stable for more than two releases because we don’t want core capabilities that lack dependable APIs.
  • “Complete” Where the code being tested has a designated area of alternate implementation (extension framework) as per the Core Principles, there should be parity in capability tested across extension implementations.  This also implies that the capability test is not configuration specific or locked to non-open technology.

Plays Well with Others

  • “Discoverable” The capability being tested is Service Discoverable (it can be found in Keystone and via service introspection; see the sketch after this list).
  • “Doc’d” Should be well documented, particularly the expected behavior.  This can be a very subjective measure and we expect to refine this definition over time.
  • “Core in Last Release”  A test that is a must-pass test should stay a must-pass test.  This makes core capabilities sticky from release to release.  Leaving Core is disruptive to the ecosystem.
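
To make the “Discoverable” criterion concrete, here is a minimal sketch (not part of the DefCore tooling) of how a client might introspect the Keystone v3 service catalog to see which capabilities a cloud advertises.  The endpoint, credentials, and the service types being checked are illustrative assumptions.

```python
import requests

# Hypothetical endpoint and credentials for illustration only.
KEYSTONE = "https://cloud.example.com:5000/v3"
AUTH = {
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {"user": {"name": "demo",
                                  "domain": {"name": "Default"},
                                  "password": "secret"}},
        },
        "scope": {"project": {"name": "demo", "domain": {"name": "Default"}}},
    }
}

# A scoped Keystone v3 token response carries the service catalog in its body.
resp = requests.post(f"{KEYSTONE}/auth/tokens", json=AUTH)
resp.raise_for_status()
catalog = resp.json()["token"]["catalog"]

# A capability is "discoverable" if its service type shows up in the catalog.
advertised = {service["type"] for service in catalog}
for required in ("compute", "object-store", "image"):  # example service types
    print(required, "found" if required in advertised else "MISSING")
```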

Takes a System View

  • “Foundation” Test capabilities that are required by other must-pass tests and/or depended on by many other capabilities.
  • “Atomic” The capability is unique and cannot be built out of other must-pass capabilities.
  • “Proximity” (sometimes called a Test Cluster) selects for capabilities that are related to core capabilities.  This helps ensure that related capabilities are managed together.

Note: The 13th “non-admin” criterion has been removed because Admin APIs cannot be used for interoperability and cannot be considered Core.

Networking in Cloud Environments, SDN, NFV, and why it matters [part 1 of 2]

Scott Jensen is an Engineering Director and colleague of mine from Dell with deep networking and operations experience.  He has firsthand experience deploying OpenStack and Hadoop and a critical role in defining Dell’s Reference Architectures in those areas.  When I saw this writeup about cloud networking, I asked if it would be OK to share it with you.

Guest Post 1 of 2 by Scott Jensen:

Having a background in enterprise data center networking and cloud computing, I have many conversations with customers implementing a cloud infrastructure.  The design of their networking infrastructure can and should be different from a classic network configuration, and many do not understand why, either due to a lack of networking knowledge or a lack of understanding of why cloud computing is different from virtualization.  Once you understand both of these areas, you can begin to see why emerging technologies such as SDN (Software Defined Networking) and NFV (Network Function Virtualization) begin to address some of the issues that Cloud Computing can cause with your network.

Networking is all about traffic flows.  In order to properly design your infrastructure you need to understand where traffic is originating, where it is going and how much traffic will be following a specific route and at what times.

There are many differences between Cloud Computing and virtualization.  In many cases, people I talk to think of Cloud as virtualization in a different environment.  Of course this will work just fine; however, it does not take advantage of the goodness that a Cloud infrastructure can bring.  Some of the major differences between Virtualization and Cloud Computing have profound effects on how the network is utilized.  This all has to do with the application.  That is really what it is all about anyway.  Rob Hirschfeld has a great post on the difference between Pets and Cattle which describes this well.

Pets and Cattle as a workload evolution

In typical virtualized infrastructures, the applications have a fairly common pattern.  Many people describe these as Pets, and they are managed largely the same as a physical system.  They have a name, they are one of a kind, they are cared for, and when they die it can be traumatic (I know, I have been there).

  • They run on large stateful VMs
  • They have a lifecycle that is typically very long, often measured in years
  • The applications themselves are not designed to tolerate failures.  Other technologies are brought in to ensure uptime.
  • The application is scaled up when demands increase.  This is done by adding more memory or CPU to the VM.

Cloud applications are different.  Some people describe them as cattle and they are treated like cattle in many ways.  They do not necessarily have a name and if one dies it is sad but not a really big deal.  We should probably figure out what killed it but life goes on.

  • They run on smaller stateless VMs
  • They have a lifecycle measured in hours or months.  Sometimes even less than an hour.
  • The application is designed to expect failures
  • The application scales out by increasing the number of running instances when demand increases (see the sketch below).
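
As a rough illustration of the scale-up vs. scale-out difference, here is a sketch using the OpenStack SDK; the cloud name, server, image, flavor, and network values are placeholder assumptions, not a specific deployment.

```python
import openstack

# Hypothetical cloud, server, image, flavor, and network names for illustration.
conn = openstack.connect(cloud="example-cloud")

# Pet-style scale UP: give the one long-lived VM a bigger flavor.
pet = conn.compute.find_server("crm-db-01")
big = conn.compute.find_flavor("m1.xlarge")
conn.compute.resize_server(pet, big.id)
# (a real resize must then be confirmed with conn.compute.confirm_server_resize)

# Cattle-style scale OUT: launch more small, disposable instances.
image = conn.compute.find_image("app-worker-image")
small = conn.compute.find_flavor("m1.small")
for i in range(3):
    conn.compute.create_server(
        name=f"worker-{i}",
        image_id=image.id,
        flavor_id=small.id,
        networks=[{"uuid": "NETWORK_UUID"}],  # placeholder network UUID
    )
```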

In his follow-up post next week, Scott discusses how this impacts the network and how SDN and NFV promise to help.

OpenCrowbar.Anvil released – hammering out a gold standard in open bare metal provisioning

I’m excited to be announcing OpenCrowbar’s first release, Anvil, for the community.  Looking back on our original design from June 2012, we’ve accomplished all of our original objectives and more.
Now that we’ve got the foundation ready, our next release (OpenCrowbar Broom) focuses on workload development on top of the stable Anvil base.  This means that we’re ready to start working on OpenStack, Ceph and Hadoop.  So far, we’ve limited engagement on workloads to ensure that those developers would not also be trying to keep up with core changes.  We follow emergent design so I’m certain we’ll continue to evolve the core; however, we believe the Anvil release represents a solid foundation for workload development.
There is no more comprehensive open bare metal provisioning framework than OpenCrowbar.  The project’s focus on a complete operations model that comprehends hardware and network configuration with just enough orchestration delivers on a system vision that sets it apart from any other tool.  Yet, Crowbar also plays nicely with others by embracing, not replacing, DevOps tools like Chef and Puppet.
Now that the core is proven, we’re porting the Crowbar v1 RAID and BIOS configuration into OpenCrowbar.  By design, we’ve kept hardware support separate from the core because we’ve learned that hardware generation cycles need to be independent from the operations control infrastructure.  Decoupling them eliminates release disruptions that we experienced in Crowbar v1 and makes it much easier to incorporate hardware from a broad range of vendors.
Here are some key components of Anvil:
  • UI, CLI and API stable and functional
  • Boot and discovery process working PLUS ability to handle pre-populating and configuration
  • Chef and Puppet capabilities including Berkshelf v3 support to pull in community upstream DevOps scripts
  • Docker, VMs and Physical Servers
  • Crowbar’s famous “late-bound” approach to configuration and, critically, networking setup
  • IPv6 native, Ruby 2, Rails 4, preliminary scale tuning
  • Remarkably flexible and transparent orchestration (the Annealer)
  • Multi-OS deployment capability: Ubuntu, CentOS, or different versions of the same OS
Getting the workloads ported is still a tremendous amount of work but the rewards are tremendous.  With OpenCrowbar, the community has a new way to collaborate and integrate this work.  It’s important to understand that while our goal is to start a quarterly release cycle for OpenCrowbar, the workload release cycles (including hardware) are NOT tied to OpenCrowbar.  The workloads choose which OpenCrowbar release they target.  From Crowbar v1, we’ve learned that Crowbar needed to be independent of the workload releases and so we want OpenCrowbar to focus on maintaining a strong ops platform.
This release marks four years of hard-earned Crowbar v1 deployment experience and two years of v2 design, redesign and implementation.  I’ve talked with DevOps teams from all over the world and listened to their pains and needs.  We have a long way to go before we’re deploying 1000-node OpenStack and Hadoop clusters, but OpenCrowbar Anvil significantly moves the needle in that direction.
Thanks to the Crowbar community (Dell and SUSE especially) for nurturing the project, and congratulations to the OpenCrowbar team for getting us to this amazing place.

Reference Deployments are Critical [2/4 series on Operating Open Source Infrastructure]

This post is the second in a 4 part series about Success factors for Operating Open Source Infrastructure.

When we look at reference deployments, there are several things that make a good reference deployment and make it useful to the community.

First, a reference deployment needs to be specific and useful. It has to be identified as solving a specific problem using the software, and it has to have a specific configuration that can be described in a way that creates a workable scenario. There may be multiple useful reference implementations; in that case, each one needs to be identified by its expected behavior. For example, our deployments include a compute-centric configuration that has hardware and network configurations adapted to compute-focused applications.

We also have a storage-focused configuration that is specifically targeted at enabling cheap-and-deep storage nodes for that type of situation. Both configurations are important and valid, but they require different implementations, different details and different reference architectures. As long as it is clear that there are multiple patterns, the community is perfectly able to absorb and use these patterns.

Establishment of a widely adopted best practice is a central success criterion for any project.

Best practices ensure that deployers of the technology can not only purchase implementations that will be successful, but can also compare notes to work with their community. A significant adoption curve happens after the establishment of these best practices because at that point, the risk of purchase dramatically drops, and the ability to support radically increases. The next thing that’s important in the establishment of these technologies is that the reference implementation or reference architecture can be configured in a repeatable way.

Very often, this takes the form of deployment books or manuals. While useful in small deployments, in a hyperscale deployment the books really have diminishing value. This is because the level of human error – the chance of making a fundamental mistake during configuration – increases exponentially with the number of nodes, because each node is tightly interconnected with other nodes within the system.

My team at Dell launched the Crowbar project as a way to reduce or mitigate this effort substantially. We recognized that the number one cause of delays and impacts in time to value in a hyperscale deployment is configuration and set-up. Any simple mistake made during configuration, even down to ordering of the gear, or physical defects within the infrastructure, will create dramatic delays in troubleshooting and diagnosing those issues. By automating the process, we have ensured that we can bootstrap the system quickly.

The goal of automated best practice is to bootstrap in a conforming and repeatable way. This enables the community to work together immediately towards return on investment, and greatly reduces the risk of problems caused by human error. For example, it’s typical within a site for us to find that network configurations do not match the specifications. In many cases, we find issues with the core networking infrastructure not matching the way it was originally designed. We also find failures on physical infrastructure, disk failures, system mismatches, and unanticipated configurations. Any one of these problems with a human setup might be missed or overlooked.
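
As a simple illustration of what such an automated conformance check could look like (a hypothetical sketch, not Crowbar code), consider comparing a node’s discovered facts against the reference specification; the field names and values below are invented:

```python
# Hypothetical sketch: compare discovered node facts against a reference spec.
REFERENCE_SPEC = {
    "nic_count": 4,
    "nic_speed_gbps": 10,
    "disk_count": 12,
    "switch_vlan": 200,
}

def validate_node(discovered: dict) -> list:
    """Return a list of mismatches between discovered facts and the spec."""
    problems = []
    for key, expected in REFERENCE_SPEC.items():
        actual = discovered.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected}, found {actual}")
    return problems

# Example: a node that lost a disk and whose cabling landed on the wrong VLAN.
node = {"nic_count": 4, "nic_speed_gbps": 10, "disk_count": 11, "switch_vlan": 100}
for issue in validate_node(node):
    print(issue)
```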

Validated reference architectures, while valuable, are no longer sufficient.   Automated reference configurations have become the key to successfully delivered solutions.

Interested in more?  Read part 3

Success Factors of Operating Open Source Infrastructure [Series Intro]

Building a best practices platform is essential to helping companies share operations knowledge.   In the fast-moving world of open source software, sharing documentation about what to do is not sufficient.  We must also share how to do it, because the operations process is tightly coupled to achieving ongoing success.

Further, since change is constant, we need to change our definition of “stability” to reflect a much more iterative and fluid environment.

Baseline testing is an essential part of this platform. It enables customers to ensure not only fast time to value, but also that the system consistently conforms with industry best practices, even as it is upgraded and migrates towards a continuous deployment infrastructure.

The details are too long for a single post so I’m going to explore this as three distinct topics over the next two weeks.

  1. Reference Deployments talks about the need for an automated way to repeat configuration between sites.
  2. Ops Validation using Development Tests talks about having a way to verify that everyone uses a common reference platform
  3. Shared Open Operations / DevOps (pending) talks about putting reference deployment and common validation together to create a true open operations practice.

OpenStack, Hadoop, Ceph, Docker and other open source projects are changing the landscape for information technology. Customers seeking to become successful with these evolving platforms must look beyond the software bits, and consider both the culture and operations.  The culture is critical because interacting with the open source projects community (directly or through a proxy) can help ensure success using the software. Operations are critical because open source projects expect the community to help find and resolve issues. This results in more robust and capable products. Consequently, users of open source software must operate in a more fluid environment.

My team at Dell saw this need as we navigated the early days of OpenStack.  The Crowbar project started because we saw that the community needed a platform that could adapt and evolve with the open source projects that our advanced customers were implementing. Our ability to deliver an open operations platform enables the community to collaborate, and to skip over routine details to refocus on shared best practices.

My recent focus on the OpenStack DefCore work reinforces these original goals.  Using tests to help provide a common baseline is a concrete, open and referenceable way to promote interoperability.  I hope that this in turn drives a dialog around best practices and shared operations because those help mature the community.

OpenStack automated high-availability deploy reality, SUSE shows off chops with Crowbar

While I’ve been focused on delivering next-generation kick-aaS-i-ness with Crowbar v2 (now called OpenCrowbar) and helping Dell and Red Hat co-engineer an OpenStack Powered Cloud, SUSE has been continuing to expand and polish the OpenStack deployment on Crowbar v1.  I’m always impressed by their commit activity (SUSE is the top committer in the Crowbar project) and was excited to see their Havana launch announcement.

Using Crowbar v1, SUSE is delivering a seriously robust automated OpenStack Havana implementation.  They have taken the time to build high availability (HA) across the framework including for Neutron, Heat and Ceilometer.

As an OpenStack Foundation board member, I hear a lot of hand-wringing in the community about ops practices and asking “is OpenStack ready for the enterprise?”  While I’m not sure how to really define “enterprise,” I do know that SUSE Cloud (now on the Havana release) shows that it’s possible to deliver a repeatable and robust OpenStack deployment.

This effort shows some serious DevOps automation chops and, since Crowbar is open, everyone in the community can benefit from their tuning.   Of course, I’d love to see these great capabilities migrate into the very active StackForge Chef OpenStack cookbooks that OpenCrowbar is designed to leverage.

Creating HA automation is a great achievement and an important milestone in capturing the true golden fleece – automated release-to-release upgrades.  We built the OpenCrowbar annealer with this objective in mind and I feel like it’s within reach.

Running with scissors > DefCore “must-pass” Road Show Starts [VIDEOS]

The OpenStack DefCore committee has been very active during this cycle turning the core definition principles into an actual list of “must-pass” capabilities (working page).  This in turn gives the community something tangible enough to review and evaluate.

TL;DR!  We appreciate those in the community who have been patient enough to help define and learn the process we’re using to make selections; however, we also recognize that most people want to jump to the results.

This week, we started a “DefCore roadshow” with the goal of learning how to make this huge body of capabilities, process and impact easier to digest (draft write-up for review & Troy Toman’s notes).  So far we’ve had two great sessions on this topic.  We took notes and recorded at both meetups (San Francisco & Austin).

My takeaways from these initial meetups are:

  • Jump to the Capabilities right away, the process history is not needed up front
  • You need more graphics – specifically, one for the selection criteria (what do you think of my 1st attempt?)
  • Work from some examples of scored capabilities
  • Include some specific use-cases with a user, 2 types of private cloud and a public cloud to help show the impact

Overall, people like what they are hearing.  It makes sense and decisions are justified.

We need more feedback!  Please help us figure out how to explain this for the broader community.

Anyone else find picking OpenStack summit sessions overwhelming?!

The scope and diversity of sessions for the upcoming OpenStack conference in Atlanta are simply overwhelming.   As a board member, that’s a positive sign of our success as a community; however, it’s also a challenge as we attempt to pick topics.  That’s why we turn to you, the OpenStack community, to help sift and select the content.

Even if you are not attending, we need your help in selection!  Content from the summits is archived and has a much larger audience outside of the two conference days.  Your voice matters for the community.

While it’s a simple matter to ask you to vote for my DefCore presentation and some excellent ones from my peers at Dell, I’d also like to share some of my thoughts about general trends I saw illustrated by the offerings:

  • Swift has a strong following as a solution outside of other products
  • Ceph seems to be emerging as a critical component with Cinder
  • Neutron has breadth but not depth in practice
  • HA and Upgrades remain challenges
  • We are starting to see specializations emerge (like NFV)
  • OpenStack case studies!  There are many – some of uncertain utility as references
  • Some community members and companies are super prolific in submitting sessions.  Perhaps these sessions are all great but on first pass it seems out of balance.
  • Vendor pitch or conference session?  You often get both in the same session.  We’re still not certain how to balance this.

The number and diversity of sessions is staggering – we need your help on voting.

We also need you to be part of the dialog about the conference and summits to make sure they are meeting the community needs.  My review of the sessions indicates that we are trying to serve many different audiences in a very limited time window.  I’m interested in hearing yours!  Review some sessions and let me know.

OpenStack Board Elections: What I’ll do in 2014: DefCore, Ops, & Community

OpenStack Community,

The time has come for you to choose who will fill the eight community seats on the Board (ballot links went out Sunday evening CST).  I’ve had the privilege to serve you in that capacity for 16 months and would like to continue.  I have a leadership role in Core Definition and want to continue that work.

Here are some of the reasons that I am a strong board member:

  • Proven & Active Leadership on Board – I have been very active and vocal representing the community on the Board.  In addition to my committed leadership in Core Definition, I have played important roles shaping the Gold Member grooming process and trying to adjust our election process.  I am an outspoken yet pragmatic voice for the community in board meetings.
  • Technical Leader but not on the TC – The Board needs members who are technical yet detached from the individual projects enough to represent outside and contrasting views.
  • Strong User Voice – As the senior OpenStack technologist at Dell, I have broad reach within Dell and the Red Hat partnership, with exposure to a truly broad and deep part of the community.  This makes me highly accessible to a lot of people both in and entering the community.
  • Operations Leadership – Dell was an early leader in OpenStack Operations (via OpenCrowbar) and continues to advocate strongly for key readiness activities like upgrade and high availability.  In addition, I’ve led the effort to converge advanced cookbooks from the OpenCrowbar project into the OpenStack StackForge upstreams.  This is not a trivial effort but the right investment to make for our community.
  • And there’s more… you can read about my previous Board history in my 2012 and 2013 “why vote for me” posts or my general OpenStack comments.

And now a plea to vote for other candidates too!

I had hoped that we could change the election process to limit blind corporate affinity voting; however, the board was not able to make this change without a more complex set of bylaws changes.  Based on the diversity and size of OpenStack community, I hope that this issue may no longer be a concern.  Even so, I strongly believe that the best outcome for the OpenStack Board is to have voters look beyond corporate affiliation and consider a range of factors including business vs. technical balance, open source experience, community exposure, and ability to dedicate time to OpenStack.

How are we picking the OpenStack DefCore “must pass” tests?

This post comes with a WARNING LABEL… THE FOLLOWING SELECTION CRITERIA ARE PRELIMINARY TO GET FEEDBACK AND HELP VALIDATE THE PROCESS.
UPDATE 5/7/14 > see the OFFICIAL version.
ORIGINAL TEXT

As part of the DefCore work, we have the challenge of taking all the Tempest tests and figuring out which ones are the “must-pass” tests that will define core (our note pages).  We want to have a very transparent and objective process for picking the tests so we need to have well defined criteria and a selection process.

Figuring out the process will be iterative.  The list below represents a working set of selection criteria that are applied to the tests.  The DefCore committee will determine relative weights for the criteria after the tests have been scored because it was clear in discussion that not all of these criteria should have equal weight.
Once a test passes the minimum criteria score and becomes "must-pass," the criteria score does not matter – the criteria are only used for selecting tests. As per the Core principles, passing all "must-pass" tests will be required to be considered core.
So what are these 13 preliminary criteria (source)?
1. Test is required to be stable for >2 releases (because things leaving Core are bad)
  • the least number/amount of must pass tests as possible (due to above)
  • but noting that the number will increase over time
  • least amount of change from current requirements as possible (nova, swift 2 versions)
  • (Acknowledge that deprecation is punted for now, but can be executed by TC)
2. Where the code being tested has a designated area of alternate implementation (extension framework) as per the Core Principles, there should be parity in capability tested across extension implementations
  • Test is not configuration specific (test cannot meet criteria if it requires a specific configuration)
  • Test does not require a non-open extension to pass (only the OpenStack code)
3. Capability being tested is Service Discoverable (can be found in Keystone and via service introspection) – MONTY TO FIX WORDING around REST/DOCS, etc.
  • Nearly core or “compatible” clouds need to be introspected to see what’s missing
  • Not clear at this point if it’s project or capability level enforced.  Perhaps for Elephant it’s project but it should move to capability for later
4A, 4B & 4C. Candidates are widely used capabilities
  • 4A favor capabilities that are supported by multiple public cloud providers and private cloud products
    • Allow the committee to use expert judgement to promote capabilities that need to resolve the “chicken-and-egg”
    • Goals are both diversity and quantity of users
  • 4B. Should be included if supported by common tools (ecosystem products included)
  • 4C. Should be included if part of common libraries (Fog, Apache jclouds, etc)
5. Test capabilities that are required by other must-pass tests and/or depended on by many other capabilities
6. Should reflect future technical direction (from the project technical teams and the TC)
  • Deprecated capabilities would be excluded (or phased out)
  • This could potentially become a “stick” if used incorrectly because we could force capabilities
7. Should be well documented, particularly the expected behavior.
  • includes the technical references for others in the project as well as documentation for the users and or developers accessing the feature or functionality
8. A test that is a must-pass test should stay a must-pass test (makes must-pass tests sticky release per release)
9. A test for a Capability with must-pass tests is more likely to be considered must-pass
10. Capability is unique and cannot be built out of other must-pass capabilities
  • Candidates favor capabilities that users cannot implement themselves given the presence of other capabilities
  • consider the pain to users if a cloud doesn’t have the capability – not so much pain if they can run it themselves
  • “Unique capabilities that cannot be built out of other must-pass capabilities should not be considered as strongly”
11. Tests do not require administrative rights to execute
We expect these criteria to change based on implementation experience and community input; however, we felt that further discussion without implementation was getting diminishing returns.  It’s important to remember that not all of the criteria are equal; they will have relative weights to help tune the results.
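
To make the scoring mechanics concrete, here is a minimal sketch of how weighted criteria scores and a must-pass threshold could work.  The weights, the 0/1 scores, and the cutoff below are invented for illustration; the real values are set by the DefCore committee after the tests have been scored.

```python
# Hypothetical weights keyed by the shortened criteria names; values are invented.
WEIGHTS = {
    "widely_deployed": 3, "used_by_tools": 2, "used_by_clients": 2,
    "future_direction": 2, "stable": 1, "complete": 1,
    "discoverable": 1, "documented": 1, "core_in_last_release": 3,
    "foundation": 2, "atomic": 1, "proximity": 1,
}
MUST_PASS_THRESHOLD = 12  # invented cutoff for illustration

def score(capability: dict) -> int:
    """Weighted sum of 0/1 criteria scores for one capability's tests."""
    return sum(WEIGHTS[name] * capability.get(name, 0) for name in WEIGHTS)

# Example capability scored 0/1 against each criterion (made-up scores).
compute_boot = {
    "widely_deployed": 1, "used_by_tools": 1, "used_by_clients": 1,
    "future_direction": 1, "stable": 1, "core_in_last_release": 1,
    "foundation": 1, "documented": 1,
}
total = score(compute_boot)
print(total, "must-pass" if total >= MUST_PASS_THRESHOLD else "not selected")
```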