How are we picking the OpenStack DefCore “must pass” tests?

This post comes with a WARNING LABEL… THE FOLLOWING SELECTION CRITERIA ARE PRELIMINARY TO GET FEEDBACK AND HELP VALIDATE THE PROCESS.
UPDATE 5/7/14 > see the OFFICIAL version.
ORIGINAL TEXT

As part of the DefCore work, we have the challenge of taking all the Tempest tests and figuring out which ones are the “must-pass” tests that will define core (our note pages).  We want to have a very transparent and objective process for picking the tests so we need to have well defined criteria and a selection process.

Figuring out the process will be iterative.  The list below represents a working set of selection criteria that are applied to the tests.  The DefCore committee will determine relative weights for the criteria after the tests have been scored because it was clear in discussion that not all of these criteria should have equal weight.
Once a test passes the minimum criteria score and becomes “must-pass,” the criteria score does not matter – the criteria are only used for selecting tests. As per the Core principles, passing all “must-pass” tests will be required to be considered core.
So what are these 13 preliminary criteria (source)?
1. Test is required to be stable for >2 releases (because things leaving Core are bad)
  • as few must-pass tests as possible (due to above)
  • but noting that the number will increase over time
  • as little change from current requirements as possible (nova, swift 2 versions)
  • (Acknowledge that deprecation is punted for now, but can be executed by TC)
2. Where the code being tested has a designed area of alternate implementation (extension framework) as per the Core Principles, there should be parity in capability tested across extension implementations
  • Test is not configuration specific (test cannot meet criteria if it requires a specific configuration)
  • Test does not require a non-open extension to pass (only the OpenStack code)
3. Capability being tested is Service Discoverable (can be found in Keystone and via service introspection; a discovery sketch follows this list) – MONTY TO FIX WORDING around REST/DOCS, etc.
  • Nearly core or “compatible” clouds need to be introspected to see what’s missing
  • Not clear at this point if it’s project or capability level enforced.  Perhaps for Elephant it’s project but it should move to capability for later
4A, 4B & 4C. Candidates are widely used capabilities
  • 4A favor capabilities that are supported by multiple public cloud providers and private cloud products
    • Allow the committee to use expert judgement to promote capabilities that need to resolve the “chicken-and-egg” problem
    • Goals are both diversity and quantity of users
  • 4B. Should be included if supported by common tools (including ecosystem products)
  • 4C. Should be included if part of common libraries (Fog, Apache jclouds, etc)
5. Test capabilities that are required by other must-pass tests and/or depended on by many other capabilities
6. Should reflect future technical direction (from the project technical teams and the TC)
  • Deprecated capabilities would be excluded (or phased out)
  • This could potentially become a “stick” if used incorrectly because we could force capabilities
7. Should be well documented, particularly the expected behavior.
  • includes the technical references for others in the project as well as documentation for the users and or developers accessing the feature or functionality
8. A test that is a must-pass test should stay a must-pass test (makes must-pass tests sticky release per release)
9. A test for a Capability with must-pass tests is more likely to be considered must-pass
10. Capability is unique and cannot be built out of other must-pass capabilities
  • Candidates favor capabilities that users cannot implement themselves given the presence of other capabilities
  • consider the pain to users if a cloud doesn’t have the capability – not so much pain if they can run it themselves
  • “Capabilities that can be built out of other must-pass capabilities should not be considered as strongly”
11. Tests do not require administrative rights to execute
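To make criterion 3 concrete, here is a minimal sketch of introspecting a cloud’s service catalog with the Keystone v2 Python client. The credentials and endpoint below are placeholders, and real DefCore capability discovery may work differently.

```python
# Minimal sketch of service discovery via Keystone (criterion 3).
# Credentials and auth_url are placeholders; assumes python-keystoneclient's v2.0 API.
from keystoneclient.v2_0 import client

keystone = client.Client(username="admin",
                         password="secret",
                         tenant_name="admin",
                         auth_url="http://keystone.example.com:5000/v2.0")

# Listing the registered services shows which capabilities a cloud advertises,
# the kind of introspection a "nearly core" cloud could be checked against.
for service in keystone.services.list():
    print(service.type, service.name)
```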
We expect these criteria to change based on implementation experience and community input; however, we felt that further discussion without implementation was yielding diminishing returns.  It’s important to remember that the criteria are not all equal; they will have relative weights to help tune the results.
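To illustrate how weighting could work, here is a minimal sketch; the criteria names, weights, per-criterion scores, and threshold below are hypothetical placeholders, not values the committee has agreed on.

```python
# Hypothetical example of weighted criteria scoring for a single Tempest test.
# Criteria names, weights, scores, and the threshold are illustrative only.
weights = {"stable_2_releases": 15, "discoverable": 10, "widely_used": 20,
           "foundational": 15, "future_direction": 10, "documented": 10,
           "previously_must_pass": 10, "unique_capability": 5, "non_admin": 5}

scores = {"stable_2_releases": 1, "discoverable": 1, "widely_used": 1,
          "foundational": 0, "future_direction": 1, "documented": 1,
          "previously_must_pass": 0, "unique_capability": 1, "non_admin": 1}

assert sum(weights.values()) == 100   # e.g. a 100-point weighting system

total = sum(weights[c] * scores[c] for c in weights)
THRESHOLD = 70                        # hypothetical minimum criteria score
print(total, "must-pass" if total >= THRESHOLD else "not must-pass")
```

Once a test clears whatever threshold is set, it becomes must-pass and the score itself no longer matters, per the principle above.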

OpenStack Core Definition (DefCore) Progress in 6 key areas

DefCore Elephant Cycle

I’m excited to report about the OpenStack Board progress on defining OpenStack core.  At the Hong Kong summit, Joshua McKenty and I were asked to chair a new standing committee, now known as DefCore, to define “OpenStack Core” based on the core principles that we determined over the last 6 months (aka “the spider”).

Joshua and I took on the challenge with gusto and I’m proud to say that we’ve already made significant progress against an aggressive timeline to have the pilot must-pass tests for Havana defined before the Juno Summit in May 2014.  It’s important to remember that we’re moving from a project based definition of core to test-driven capabilities because this best addresses our interoperability objectives.

In the 8 weeks since the summit, we’ve had six very productive meetings (etherpads for Prep, DefCore.1, DefCore.2, Criteria 1 and 2) with detailed notes and recorded content. Here’s my summary of our results so far:

  1. An Aggressive Timeline for having pilot Havana must-pass tests approved by the Juno summit in May 2014.  That drives the schedule backward toward a preliminary list in March.  Once we have a pilot list for Havana, we expect to have Ice House done +90 days and Juno done at the Paris summit.

  2. Test Selection Criteria: a preliminary set of 14 criteria (needs a stand alone post) that will be used to quantitatively score the current 700+ tests.  We also agreed to use a max 100 point weighting system for the criteria.  The weights and score requirements will be refined iteratively once we have done a first scoring pass.  Our objective is to make must-pass test selection as objective and transparent as possible (post with details).

  3. Distinction between Capability & Test is important because we recognize that individual tests may validate multiple capabilities and individual capabilities may have multiple tests.  Our hope is to present the results in terms of capabilities not individual tests.

  4. Holding Off on Bylaws Changes needed to clarify how OpenStack manages core definition.  It was widely expected that the DefCore committee would have to make changes to the OpenStack bylaws; however, we believe we can proceed without rushing changes.  We have an active subcommittee preparing changes in advance of the next DefCore cycle.

  5. Program vs. Project Definition efforts are needed to help take pressure off requests to have “projects promoted to core status” and how the OpenStack trademark is used for projects.  We are trying to clarify that OpenStack Programs (e.g.: OpenStack™ Compute) carry the trademark while OpenStack Projects (e.g.: Nova and Glance) are members of those programs and do not carry the OpenStack trademark directly.  Consequently, we’d expect people to say “OpenStack Compute Project Nova” instead of “OpenStack Nova.”  This approach addresses several issues that impact DefCore Board activities around trademark, core and brand.

  6. RefStack Development and Use Cases provide the framework for community reporting of test results.  We consider this infrastructure critical to getting community input about must-pass tests and also sharing interoperability information.  This effort is just beginning and needs help from the community.

For all this progress, we are only starting!  We’ve cleared the blocks preventing implementation and that will expose a new set of issues to discuss.  Look for us to start applying the criteria to tests in the coming months.  That will quickly expose the strengths and weaknesses of our criteria set.  We’ve also got to make progress on Program vs. Project and start RefStack coding.

We want community participation!  Please let us know what you think.

Mark Stouse’s “Making Predictions for 14” series

I was invited to be part of Mark Stouse’s 2014 big data & cloud predictions series.  His questions had me thinking deeply about the past year and I’m happy to repost them here with links to the other predictors too, including Robert Scoble, Shel Israel, and David H. Deans.

1.  Describe in one sentence what you do and why you’re good at it.

I specialize in architecture for infrastructure software for scale data center operations (aka “cloud”) and I have 14 years of battle scars that inform my designs.

 2.  Cloud Computing, Big Data or Consumerization: Which trend do you feel is having the most impact on IT today and why?

Cloud, Data & Consumerization are all connected, so there’s no one clear “most impactful” winner except that all three are forcing IT to rethink how we handle operations.   The pace of change for these categories (many of which are open source driven) is so fast that traditional IT governance cannot keep up.  I’m specifically talking about the DevOps and Lean Software Delivery paradigms.  These approaches do not mean that we’re trading speed for quality; in fact, I’ve seen that we’re adopting techniques that deliver both higher quality and speed.

 3.  What do you think is the biggest misconception about Cloud computing/Big Data/Consumerization?

That someone can purchase them as a SKU.  These are really architectural concepts that impact how we solve problems rather than specific products.  My experience is that customers overlook their need to understand how to change their business to take advantage of these technologies.  It’s the same classic ROI challenge for most new technologies – the returns don’t materialize unless the business also changes to leverage them.

 4.  Which (Cloud Computing/Big Data/Consumerization) trend has surprised you most in the last five years?

Open source has surprised me because we’ve seen it transform from a cost concern into a supply chain concern.  When I started doing open source work for Dell, customers were very interested in innovation and controlling license costs.  This has really changed over the last few years.  Today, customers are more concerned with community participation and transparency of their product code base.  This surprised me until I realized that they are really seeking to ensure that they had maximum control and visibility into their “IT Supply Chain.”   It may seem like a paradox, but open source software is uniquely positioned to help companies maintain more control of their critical IT because they are not tightly coupled to a single vendor.

 5.  How has Cloud Computing/Big Data/Consumerization had the biggest impact in YOUR life to date?

Beyond it being my career, I believe these technologies have created a new degree of freedom for me.  I’m answering these questions from the SFO airport where I’m carrying all of the tools I need to do my job in a space small enough to fit under the seat in front of me plus a free Wifi connection.  I believe we are only just learning how access to information and portable computing will change our experience.  This learning process will be both liberating and painful as we work out the right balances between access, identity and privacy.

 6.  On a lighter note – If Cloud/Big Data/Consumerization could be personified by a superhero, which superhero would it be and why?

The Hulk.   Looks like a friendly geek but it’s going to crush you if you’re not careful.

 7.  What aspect of (Cloud Computing/Big Data/ Consumerization) are you most excited about in the future, and what excites you about it?

The Internet of Things (even if I hate the term) is very exciting because we’re moving into a place where we have real ways to connect our virtual and physical lives.  That translates into cool technologies like self-driving cars and smart power utilities.  I think it will also motivate a revolution in how people interact with computers and each other.  It’s going to open up a whole new dimension on our personal interaction with our surroundings.   I’m specifically thinking about a book “Rainbows End” by Vernor Vinge that paints this future in vivid detail.

Competition should be core to OpenStack Technical Meritocracy

In my work at Dell, Technical Meritocracy means that we recognize and promote demonstrated talent into leadership roles. As a leader, one has to make technical judgments (OK, informed opinions) that focus limited resources in the (hopefully) right places. Being promoted does not automatically make someone right all the time.

I believe that good leaders recognize the value of a diverse set of opinions and the learning value of lean deliverables.

OpenStack is an amazingly diverse and evolving community. Leading in OpenStack requires a level of humility that forces me to reconsider my organizational, hierarchical thinking around “technical meritocracy.” Instead of a hierarchy where leadership chooses right and wrong, rising in the community meritocracy is about encouraging technical learning and user participation.

OpenStack is a melting pot of many interests and companies. Some of them are naturally aligned (customers+vendors) and others are competitive (vendors). The vast majority of contribution to OpenStack is sponsored – companies pay people to participate and fund the foundation that organizes events. That does not diminish our enthusiasm for the community or open values, but it adds an additional dimension.

If we are really seeking a Technical Meritocracy, we must create a place where ideas, teams, projects and companies can pursue different approaches within OpenStack. This is essential to our long term success because it provides a clear way for people to experiment within the project. Pushing away alternate approaches is likely to lead to forking. Specifically, I believe that the most likely competitor to any current OpenStack project will be that project’s .next version!

Calls for a “benevolent dictator” imply that our meritocracy has a single person with perspective on right and wrong. Not only is OpenStack simply too complex for that, I see our central design tenet as enabling multiple approaches to work it out in the community. This is especially important because many aspects of OpenStack are not one-size-fits-all. The target diversity of our community requires that we enable multiple approaches so we can expand our user base.

The risk of anointing a single person, approach or project as “the OpenStack way” may appear to streamline the project, but it really stifles innovation. We have a healthy ecosystem of vendors who gladly express opinions about the right way to implement OpenStack. They help us test OpenStack technical merit by finding out which opinions appeal to users. It is essential to our success to enable a vibrant diversity because I don’t think there’s a single right answer or approach.

In every case, those vendor opinions are based on focused markets and customer needs; consequently, our job in the community is to respect and incorporate these divergent needs and find consensus.

Spinning up OpenStack “DefCore” Committee by spotting elephants

This week, Joshua McKenty, I, and a handful of interested individuals (board member Eileen Evans included) met to start organizing the DefCore* Committee.  This standing committee was established by an OpenStack Foundation resolution just before the Hong Kong Summit.  Joshua and I were nominated as co-chairs (and about half the board volunteered to be members).  This action was an immediate result of the unanimous passage of the 10 Principles that I was driving in the DefCore “Spider” cycle.

We heard overwhelmingly at the Hong Kong summit that defining core should be a major focus for the Board.

The good news is that we’re doing exactly that in DefCore.  Our challenge is to go quickly but not get ahead of community consensus.  So far, that means eating the proverbial elephant in small bites and intentionally deferring topics where we cannot find consensus.

This meeting was primarily about Joshua and me figuring out how to drive DefCore quickly (go fast!) without exceeding the community’s ability to review and discuss (build consensus!).  While we had future-post-worthy conceptual discussions, we had a substantial agenda of get-it-done in front of us too.

Here’s a summary of key outcomes from the meeting:

1)      We’ve established a tentative schedule for our first two meetings (12/3 and 12/17).

  1. We’ve started building agendas for these two meetings.
  2. We’ve also established rules for governance that include members to do homework!

2)      We’ve agreed it’s important to present a bylaws change to the committee for consideration by the board.

  1. This change is to address confusion around how core is defined and possibly move towards the bylaws defining a core process not a list of core projects.
  2. This is on an accelerated track because we’d like to include it with the Community Board Member elections.

3)      We’ve broken DefCore into clear “cycles” so we can be clearer about concrete objectives and what items are out of scope for a cycle.  We’re using names to designate cycles for clarity.

  1. The first cycle, “Spider,” was about finding the connections between core issues and defining a process to resolve the tension in those connections.
  2. This cycle, “Elephant,” is about breaking the Core definition down into concrete must-pass tests.
  3. The next cycle(s) will be named when we get there.  For now, they are all “Future”
  4. We agreed there is a lot of benefit from being clear to community about items that we “kick down the road” for future cycles.  And, yes, we will proactively cut off discussion of these items out of respect for time.

4)      We reviewed the timeline proposed at the end of Spider and added it to the agenda.

  1. The timeline assumes a staged introduction starting with Havana and accelerating for each release.
  2. We are working the timeline backwards to ensure time for Board, TC and community input.

5)      We agreed that consensus is going to be a focus for keeping things moving

  1. This will likely drive to a smaller core definition
  2. We will actively defer issues that cannot reach consensus in the Elephant cycle.

6)      We identified some concepts that may help guide the process in this cycle

  1. We likely need to create categories beyond “core” to help bucket tests
  2. Committee discussion is needed but debate will be time limited

7)      We identified the need to start on test criteria immediately

  1. Board member John Zannos (in absentia) offered to help lead this effort
  2. In defining test criteria, we are likely to have lively discussions about “OpenStack’s values”

8)      We identified some out of scope topics that are important but too big to solve.

  1. We are calling these “elephants” (or the elephant in the room).
  2. The list of elephants needs to be agreed by DefCore and clearly communicated
  3. We expect that the Elephant cycle will make discussing these topics more fruitful

9)      We talked about RefStack code features

  1. Allowing users to upload/post test results from their clouds to enable white box test reporting
  2. Allowing users who have uploaded results to +/- vote on tests they think are important
  3. We established a requirement that posting results requires an OpenStack ID
  4. We established a requirement that only a single Corporate designate (provided by the Foundation) can make a result official for their company.
  5. Collecting opt-in data with test results using tags for things like alternate implementation use, host operating system(s), deployment method, size of cloud, and hypervisor (a hypothetical payload is sketched after this list).
  6. We discussed (but did not resolve) that it could be possible to have people run RefStack against public cloud end points and post their results
  7. We agreed that RefStack needs to be able to run locally or as a hosted site.

10)   We identified a lot of missing communication channels

  1. We created a DefCore wiki page to be a home for information.
  2. Joshua and I (and others?) will work with the Foundation staff to create “what is core” video to help the community understand the Principles and objectives for the Elephant cycle.
  3. We are in the process of setting up mail lists, IRC, blog tags, etc.
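To illustrate the opt-in tagging idea from item 9 above, here is a hypothetical sketch of what a tagged result upload might look like. RefStack’s actual schema and API were not defined at this point, so every field name and the endpoint below are assumptions.

```python
# Hypothetical RefStack result payload; field names and the endpoint are assumptions,
# not an actual RefStack API (which was still being designed at this point).
result = {
    "openstack_id": "user@example.com",      # posting results requires an OpenStack ID
    "official": False,                       # only a corporate designate can mark results official
    "passed_tests": ["tempest.api.compute.test_example"],
    "tags": {                                # opt-in metadata
        "alternate_implementations": ["ceph"],
        "host_os": "Ubuntu 12.04",
        "deployment_method": "crowbar",
        "cloud_size_nodes": 40,
        "hypervisor": "kvm",
    },
}

# A local or hosted RefStack instance could accept this with a simple POST, e.g.:
#   requests.post("https://refstack.example.com/v1/results", json=result)
```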

Yikes!  That’s a lot of progress priming the pump for our first DefCore meeting!

* We picked “DefCore” for the core definition committee name.  One overriding reason for the name is that it has very clean search results.  Since the word “core” is so widely used, we wanted to make sure that commentary on this topic is easy to track against the noisy term core.  We also liked 1) the reference to DefCon and 2) that the Urban Dictionary defines it as going deaf from standing too close to the speakers.

Crowbar HK Hack Report

Overall, I’m happy with our three days of hacking on Crowbar 2.  We’ve reached the critical “deploys workload” milestone and I’m excited about how well the design is working and how clearly we’ve been able to articulate our approach in code & UI.

Of course, it’s worth noting again that Crowbar 1 has also made significant progress on OpenStack Havana workloads running on Ubuntu, CentOS/RHEL, and SUSE/SLES.

Here are the focus items from the hack:

  • Documentation – cleaned up documentation specifically by updating the README in all the projects to point to the real documentation in an effort to help people find useful information faster.  Reminder: if unsure, put documentation in barclamp-crowbar/doc!
  • Docker Integration for Crowbar 2 made progress.  You can now install Docker from internal packages on an admin node.  We have a strategy for allowing containers to be workload nodes.
  • Ceph installed as workload is working.  This workload revealed the need for UI improvements and additional flags for roles (hello “cluster”)
  • Progress on OpenSUSE and Fedora as Crowbar 2 install targets.  This gets us closer to true multi-O/S support.
  • OpenSUSE 13.1 setup as a dev environment including tests.  This is a target working environment.
  • Being 12 hours offset from the US really impacted remote participation.

One thing that became obvious during the hack is that we’ve reached a point in Crowbar 2 development where it makes sense to move the work into distinct repositories.  There are build, organization and packaging changes that would simplify Crowbar 2 and make it easier to start using; however, we’ve been trying to maintain backwards compatibility with Crowbar 1.  This is becoming impossible; consequently, it appears time to split them.  Here are some items for consideration:

  1. Crowbar 2 could collect barclamps into larger “workload” repos so there would be far fewer repos (although possibly still barclamps within a workload).  For example, there would be a “core” set that includes all the current CB2 barclamps.  OpenStack, Ceph and Hadoop would be their own sets.
  2. Crowbar 2 would have a clearly named “build” or “tools” repo instead of having it called “crowbar”
  3. Crowbar 2 framework would be either part of “core” or called “framework”
  4. We would put these in a new organization (“Crowbar2” or “Crowbar-2”) so that the clutter of Crowbar’s current organization is avoided.

While we clearly need to break apart the repo, this suggestion needs more community discussion!

Looking to Leverage OpenStack Havana? Crowbar delivers 3xL!

The Crowbar community has a tradition of “day zero ops” community support for the latest OpenStack release at the summit using our pull-from-source capability.  This release we’ve really gone the extra mile by doing it on THREE Linux distros (Ubuntu, RHEL & SLES) in parallel with a significant number of projects and new capabilities included.

I’m especially excited about the Crowbar implementation of Havana Docker support, which required advanced configuration with Nova and Glance.  The community also added Heat and Ceilometer in the last release cycle, plus High Availability (“Titanium”) deployment work is in active development.  Did I mention that Crowbar is rocking OpenStack deployments?  No, because it’s redundant to mention that.  We’ll upload ISOs of this work for easy access later in the week.

While my team at Dell remains a significant contributor to this work, I’m proud to point out SUSE Cloud’s leadership and contributions as well (including the new Ceph barclamp & integration).  Crowbar has become a true multi-party framework!

 

Want to learn more?  If you’re in Hong Kong, we are hosting a Crowbar Developer Community Meetup on Monday, November 4, 2013, 9:00 AM to 12:00 PM (HKT) in the SkyCity Marriott SkyZone Meeting Room.  Dell, dotCloud/Docker, SUSE and others will lead a lively technical session to review and discuss the latest updates, advantages and future plans for the Crowbar Operations Platform. You can expect to see some live code demos, and participate in a review of the results of a recent Crowbar 2 hackathon.  Confirm your seat here – space is limited!  (I expect that we’ll also stream this event using Google Hangout, watch Twitter #Crowbar for the feed)

My team at Dell has a significant presence at the OpenStack Summit in Hong Kong (details about activities including sponsored parties).  Be sure to seek out my fellow OpenStack Board Member Joseph George, Dell OpenStack Product Manager Kamesh Pemmaraju and Enstratius/Dell Multi-Cloud Manager Founder George Reese.

Note: The work referenced in this post is about Crowbar v1.  We’ve also reached critical milestones with Crowbar v2 and will begin implementing Havana on that platform shortly.

OpenStack Havana provides foundation for XXaaS you need

It’s been a long time, and a lot of summits, since I posted how OpenStack was ready for workloads (back in Cactus!).  We’ve seen remarkable growth of both the platform technology and the community surrounding it.  So much growth that now we’re struggling to define “what is core” for the project and I’m proud to be on the Foundation Board helping to lead that charge.

So what’s exciting in Havana?

There’s a lot I am excited about in the latest OpenStack release.

Complete Split of Compute / Storage / Network services

In the beginning, OpenStack IaaS was one service (Nova).  We’ve been breaking that monolith into distinct concerns (Compute, Network, Storage) for the last several releases and I think Havana is the first release where all three of the services are robust enough to take production workloads.

This is a major milestone for OpenStack because knowledge that the APIs were changing inhibited adoption.

ENABLING TECH INTEGRATION: Docker & Ceph

We’ve been hanging out with the Ceph and Docker teams, so you can expect to see some interesting integration.  These two are proof that it’s a fallacy that only OpenStack projects are critical to OpenStack, because neither of these technologies is moving under the official OpenStack umbrella.  I am looking forward to seeing both have dramatic impacts on how clouds are deployed.

Docker promises to make Linux Containers (LXC) more portable and easier to use.  This containerization approach provides near bare-metal performance without compromising VM portability.  More importantly, you can oversubscribe LXC much more than VMs.  This allows you to dramatically improve system utilization and unlocks some other interesting quality of service tricks.

Ceph is showing signs of becoming the scale-out storage king.  Beyond its solid data dispersion algorithm, a key aspect of its mojo is that it delivers both block and object storage.  I’ve seen a lot of interest in consolidating both types of storage into a single service.  Ceph delivers on that plus performance and cost.  It’s a real winner.

Crowbar Integration & High Availability Configuration!

We’ve been making amazing strides in the Crowbar + OpenStack integration!  As usual, we’re planning our zero day community build (on the “Roxy” branch) to get people started thinking about operationalizing OpenStack.   This is going to be especially interesting because we’re introducing it first on Crowbar 1 with plans to quickly migrate to Crowbar 2 where we can leverage the attribute injection pattern that OpenStack cookbooks also use.  Ultimately, we expect those efforts to converge.  The fact that Dell is putting reference implementations of HA deployment best practices into the open community is a major win for OpenStack.

Tests, Tests, Tests & Continuous Delivery

OpenStack continues to drive higher standards for reviews, integration and testing.  I’m especially excited by the volume of activity around our review system (although review backlogs are a challenge).  In addition, the community continues to invest in test suites like the Tempest project.  These are direct benefits to operators beyond simple code quality.  Our team uses Tempest to baseline field deployments.  This means that OpenStack test suites help validate live deployments, not just lab configurations.

We achieve a greater level of quality when we gate code check-ins on tests that matter to real deployments.  In fact, that premise is the basis for our “what is core” process.  It also means that more operators can choose to deploy OpenStack continuously from trunk (which I consider to be a best practice for scale ops).

Where did we fall short?

With growth comes challenges: Havana is the most complex release yet.  The number of projects that are part of the OpenStack integrated release family continues to expand.  While these new projects show the powerful innovation engine at work within OpenStack, they also make the project larger and more difficult to comprehend (especially for n00bs).  We continue to invest in Crowbar as a way to serve the community by making OpenStack more accessible and providing open best practices.

We are still struggling to resolve questions about interoperability (defining core should help) and portability.  We spent a lot of time at the last two summits on interoperability, but I don’t feel like we are much closer than before.  Hopefully, progress on Core will break the log jam.

Looking ahead to Ice House?

I and many leaders from Dell will be at the Ice House Summit in Hong Kong listening and learning.

The top of my list is the family of XXaaS services (Database aaS, Load Balancer aaS, Firewall aaS, etc.) that have appeared.  I’m a firm believer that clouds are more than compute+network+storage.  With a stable core, OpenStack is ready to expand into essential platform services.

If you are at the summit, please join Dell (my employer) and Intel for the OpenStack Summit Welcome Reception (RSVP!) kickoff networking and social event on Tuesday November 5, 2013 from 6:30 – 8:30pm at the SkyBistro in the SkyCity Marriott.   My teammate, Kamesh Pemmaraju, has a complete list of all Dell the panels and events.

OpenStack Neutron using Linux Bridges (technical explanation)

Apparently this is “Showcase Dell OpenStack/Crowbar Team Member Week” because today I’m proxy posting for Dell OpenStack engineer Chris Dearborn.  Chris has been leading our OpenStack Neutron deployment for Grizzly and Havana.

If you’re familiar with OpenStack Networking, skip over my introductory preamble and jump right down to the meat under “SDN Client Connection: Linux Bridge.”  Hopefully we can convince Chris to put together more in this series and cover GRE and VLAN configurations too.

OpenStack and Software Defined Network

Software Defined Networking (SDN) is an emerging concept that describes a family of functionality.  Like cloud, the exact meaning of SDN appears to be in the eye (or brochure) of the company providing the technology.  Overall, the concept for SDN is to have programmable networks that can be automatically provisioned.

Early approaches to this used the OpenFlow™ API to programmatically modify switch forwarding tables (OSI Layer 2) on a flow by flow basis across multiple switches.  While highly controlled, OpenFlow has proven difficult to implement at scale in dynamic environments; consequently, many SDN implementations are now using overlay networks based on inventoried VLANs and/or dynamic tunnels.

Inventoried VLAN overlay networks create a stable base layer 2 infrastructure that can be inventoried and handed out dynamically on-demand.  Generally, the management infrastructure dynamically connects the end-points (typically virtual machines) to a dedicated existing layer 2 network.  This provides all of the isolation desired without having to thrash the underlying network switch infrastructure.

Dynamic tunnel overlay networks also use client connection points to isolate traffic but do not rely on switch layer 2.  Instead, they encapsulate traffic before sending it over a shared network.  This avoids having to match dynamic networks to static inventory; however, it also adds substantial encapsulation overhead to the network communication.  Consequently, tunnels provide more flexibility and less up-front configuration but with lower performance.

OpenStack Networking, project Neutron (previously Quantum), is responsible for connecting virtual machines setup by OpenStack Compute (aka Nova) to the software defined networking infrastructure.  By design, Neutron accommodates different implementation plug-ins.  That allows operators to choose between different approaches including the addition of commercial offerings.   While it is possible to use open source capabilities for small deployments and trials, most large scale deployments choose proprietary SDN technologies.

The Crowbar OpenStack installation allows operators to choose between “Open vSwitch GRE Tunnels” and “Linux Bridge VLAN” configuration.  The GRE option is more flexible and requires less up front configuration; however, the encapsulation used by GRE will degrade performance.  The Linux Bridge VLAN option requires more upfront configuration and design.

Since GRE works with minimal configuration, let’s explore what’s required for Crowbar to set up OpenStack Neutron Linux Bridge VLAN networking.

Note: This review assumes that you already have a working knowledge of Crowbar and OpenStack.

Background

Before we dig into how OpenStack configures SDN, we need to understand how we connect between virtual machines running in the system and the physical network.  This connection uses Linux Bridges.  For GRE tunnels, Crowbar configures an Open vSwitch (aka OVS) on the node to create and manage the tunnels.

One challenge with SDN traffic isolation is that we can no longer assume that virtual machines with network access can reach destinations on our same network.  This means that the infrastructure must provide paths (aka gateways and routers) between the tenant and infrastructure networks.  A major part of the OpenStack configuration includes setting up these connections when new tenant networks are created.

Note: In the OpenStack Grizzly and earlier releases, open source code for network routers were not configured in a highly available or redundant way.  This problem is addressed in the Havana release.

For the purposes of this explanation, the “network node” is the shared infrastructure server that bridges networks.  The “compute node” is any one of the servers hosting guest virtual machines.  Traffic in the cloud can be between virtual machines within the cloud instance (internal) or between a virtual machine and something outside the OpenStack cloud instance (external).

Let’s make sure we’re on the same page with terminology.

  • OSI Layer 2 – just above physical connections (layer 1), Layer two manages traffic between servers including providing logical separation of traffic.
  • VLAN – Virtual Local Area Networks are switch-enforced isolation zones created by adding one of 4096 possible tags (of which 4094 are usable) to the network traffic (aka tagged traffic).
  • Tenant – a group of users in a cloud that are logically isolated (cannot see other traffic or information) but still using shared resources.
  • Switch – a physical device used to provide layer 1 networking connections between end points.  May provide additional services on other OSI layers such as VLANs.
  • Network Node – an OpenStack infrastructure server that connects tenant networks to infrastructure networks.
  • Compute Node – an OpenStack server that runs user workloads in virtual machines or containers.

SDN Client Connection: Linux Bridge 


The VLAN range for Linux Bridge is configurable in /etc/quantum/quantum.conf by changing the network_vlan_ranges parameter.  Note that this parameter is set by the Crowbar Neutron chef recipe.  The VLAN range is configured to start at whatever the “vlan” attribute in the nova_fixed network in the bc-template-network.json is set to.  The VLAN range end is hard coded to end at the VLAN start plus 2000.

Reminder: The maximum usable VLAN tag is 4094, so the VLAN tag for nova_fixed should never be set to anything greater than 2094 to be safe.
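A minimal sketch of that range arithmetic follows; the 500 starting value and the physnet1 name are assumptions for illustration, since the real values come from the nova_fixed network definition and the Crowbar recipe.

```python
# Sketch of the VLAN range math described above; the specific values are illustrative.
NOVA_FIXED_VLAN = 500        # the "vlan" attribute from bc-template-network.json (assumed)
RANGE_SPAN = 2000            # span hard coded by the Crowbar Neutron recipe
MAX_USABLE_VLAN = 4094       # 12-bit VLAN ID field; 0 and 4095 are reserved

vlan_start = NOVA_FIXED_VLAN
vlan_end = vlan_start + RANGE_SPAN
assert vlan_end <= MAX_USABLE_VLAN, "nova_fixed vlan is set too high"

# Roughly what lands in network_vlan_ranges (physnet name assumed):
print("network_vlan_ranges = physnet1:{0}:{1}".format(vlan_start, vlan_end))
```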

Networks are assigned the next available VLAN tag as they are created.  For instance, the first manually created network will be assigned VLAN 501, the next VLAN 502, etc.  Note that this is independent of what tenant the new network resides in.

The convention in Linux Bridge is to name the various network constructs using the first 11 characters of the UUID of the associated Neutron object.  This allows you to run the quantum CLI command to list the objects you are interested in and grep on the 11 UUID characters from the network construct name, which shows which Neutron object a given network construct maps to.
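Here is a small sketch of that convention; the helper names are mine, but the prefixes and the 11-character truncation follow the description above and in the Network Creation and VM Creation sections below.

```python
# Sketch of the Linux Bridge naming convention: construct names embed the first
# 11 characters of the associated Neutron object's UUID. Helper names are hypothetical.

def short_id(uuid):
    """First 11 characters of a Neutron object UUID, as used in construct names."""
    return uuid[:11]

def bridge_name(network_uuid):
    return "br" + short_id(network_uuid)      # one bridge per network

def subinterface_name(nic, vlan_tag):
    return "{0}.{1}".format(nic, vlan_tag)    # VLAN subinterface slaved to the bridge

def tap_name(port_uuid):
    return "tap" + short_id(port_uuid)        # TAP device for a VM's Neutron port

# Example: grep `brctl show` or `ip link` output for short_id(network_uuid) to see
# which constructs belong to a given Neutron network (UUID below is made up).
network_uuid = "f1b3e4c2-9a7d-4e21-8c7d-1234567890ab"
print(bridge_name(network_uuid), subinterface_name("eth0", 501))
```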


Network Creation

When a network is created, a corresponding bridge is created and is given the name br<network_uuid>.  A subinterface of the NIC is also created and is named <interface_name>.<vlan_tag>.  This subinterface is slaved to the bridge.  Note that this only happens when the network is needed (when a VM is created on the network).

This occurs on both the network node and the compute nodes.

Additional Steps Taken On The Network Node During Network Creation

On the network node, a bridge and subinterface is created per network and the subinterface is slaved to the bridge as described above.  If the network is attached to the router, then a TAP interface that the router listens on is created and slaved to the bridge.  If DHCP is selected, then another TAP interface is created that the dnsmasq process talks to, and that interface is also slaved to the bridge.

VM Creation On A Compute Node

When a VM is created, a TAP interface is created and named tap<port_uuid>.  The port is the Neutron port that the VM is plugged in to.  This TAP interface is slaved to the bridge associated with the network that the user selected when creating the VM.  Note that this occurs on compute nodes only.

Determining the dnsmasq port/tap interface for a network

The TAP port associated with dnsmasq for a network can be determined by first getting the uuid of the network, then looking on the network node in /var/lib/quantum/dhcp/<network_uuid>/interface.  The interface will be named ns-<uuid>.  Note that this is only the first 11 characters of the uuid.  The tap interface will be named tap<uuid> (again, only the first 11 characters).
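A minimal sketch of that lookup, run on the network node and assuming the /var/lib/quantum/dhcp path described above:

```python
# Sketch: find the dnsmasq-related interface names for a network, per the description above.
# Assumes the Grizzly/Havana-era path /var/lib/quantum/dhcp/<network_uuid>/interface.
import os

def dnsmasq_interfaces(network_uuid, dhcp_root="/var/lib/quantum/dhcp"):
    path = os.path.join(dhcp_root, network_uuid, "interface")
    with open(path) as f:
        ns_name = f.read().strip()              # e.g. "ns-" plus 11 UUID characters
    tap = "tap" + ns_name[len("ns-"):]          # matching TAP device slaved to the bridge
    return ns_name, tap

# Usage on the network node: dnsmasq_interfaces("<network_uuid>")
```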


Summary

Understanding OpenStack Networking is critical to operating a successful cloud deployment.  The Crowbar Team at Dell has invested significant effort to automate the configuration of Neutron.  This helps you eliminate the risk of manual configuration and leverage our extensive testing and field experience.

If you are interested in seeing the exact sequences used by Crowbar, please visit the Crowbar Github repository for the “Quantum Barclamp.”

OpenStack Core Online Forum, Oct 16 13:30 UTC / Oct 22 0100 UTC

OpenStack Community, you are invited to an online discussion about OpenStack Core on October 16th at UTC 13:30 (8:30 am US Central) and October 22nd at UTC 0100 (8:00 pm US Central).

At the next OpenStack Foundation Board meeting, we will be setting a timeline for implementing an OpenStack Core Definition process that promotes a clear and implementation driven metric for deciding which projects should be considered “required.”  This is your chance to review and influence the process!

We’ll review the OpenStack Core Definition process (20 minutes) and then open up the channel for discussion using the IRC (#openstack-meeting) & Google Hangout on Air (link posted in IRC).

The forum will be coordinated through the IRC channel for links and questions.

Can’t make it?  The session was recorded > here!