Nextcast #14 Transcription on OpenStack & Crowbar > “we can’t hand out trophies to everyone”

Last week, I was a guest on the NextCast OpenStack podcast hosted by Niki Acosta (EMC) [Jeff Dickey could not join].   I’ve taken some time to transcribe highlights.

We had a great discussion nextcastabout OpenStack, Ops and Crowbar.  I appreciate Niki’s insightful questions and an opportunity to share my opinions.  I feel that we covered years of material in just 1 hour and I appreciate the opportunity to appear on the podcast.

Video from full post (youtube) and the audio for download.

Plus, a FULL TRANSCRIPT!  Here’s my Next Cast #14 Short Transcripton

The objective of this transcription is to help navigate the recording, not replace it.  I did not provide complete context for remarks.

  • 04:30 Birth of Crowbar (to address Ops battle scars)
  • 08:00 The need for repeatable Ready State baseline to help community work together
  • 10:30 Should hardware matter in OpenStack? It has to, details and topology matters not vendor.
  • 11:20 OpenCompute – people are trying to open source hardware design
  • 11:50 When you are dealing with hardware, it matters. You have to get it right.
  • 12:40 Customers are hardware heterogeneous by design (and for ops tooling). Crowbar is neutral territory
  • 14:50 It’s not worth telling people they are wrong, because they are not. There are a lot of right ways to install OpenStack
  • 16:10 Sometimes people make expensive choices because it’s what they are comfortable with and it’s not helpful for me to them they a wrong – they are not.
  • 16:30 You get into a weird corner if you don’t tell anyone no. And an equally weird corner if you tell everyone yes.
  • 18:00 Aspirations of having an interoperable cloud was much harder than the actual work to build it
  • 18:30 Community want to say yes, “bring your code” but to operators that’s very frustrating because they want to be able to make substitutions
  • 19:30 Thinking that if something is included then it’s required – that’s not clear
  • 19:50 Interlock Dilemma [see my back reference]
  • 20:10 Orwell Animal Farm reference – “all animals equal but pigs are more equal”
  • 22:20 Rob defines DefCore, it’s not big and scary
  • 22:35 DefCore is about commercial use, not running the technical project
  • 23:35 OpenStack had to make money for the companies are paying for the developers who participate… they need to see ROI
  • 24:00 OpenStack is an infrastructure project, stability is the #1 feature
  • 24:40 You have to give a reason why you are saying no and a path to yes
  • 25:00 DefCore is test driven: quantitative results
  • 26:15 Balance between whole project and parts – examples are Swiftstack (wants Object only) and Dreamhost (wants Compute only)
  • 27:00 DefCore created core components vs platform levels
  • 27:30 No vendor has said they can implement DefCore without some effort
  • 28:10 We have outlets for vendors who do not want to implement the process
  • 28:30 The Board is not in a position to make technical call about what’s in, we had to build a process for community input
  • 29:10 We had to define something that could say, “this is it and we have to move on”
  • 29:50 What we want is for people to start with the core and then bring in the other projects. We want to know what people are adding so we can make that core in time
  • 30:10 This is not a recommendation is a base.
  • 30:35 OpenStack is a bubble – does not help if we just get together to pad each other on the back, we want to have a thriving ecosystem
  • 31:15 Question: “have vendors been selfish”
  • 31:35 Rob rephrased as “does OpenStack have a tragedy of the commons” problem
  • 32:30 We need to make sure that everyone is contributing back upstream
  • 32:50 Benefit of a Benevolent Dictator is that they can block features unless community needs are met
  • 33:10 We have NOT made it clear where companies should be contributing to the community. We are not doing a good job directing community efforts
  • 33:45 Hidden Influencers becomes OpenStack Product group
  • 34:55 Hidden Influencers were not connecting at the summit in a public way (like developers were)
  • 35:20 Developers could not really make big commitments of their time without the buy in from their managers (product and line)
  • 35:50 Subtle selfishness – focusing on your own features can disrupt the whole release where things would flow better if they helped others
  • 37:40 Rob was concerned that there was a lot of drift between developers and company’s product descriptions
  • 38:20 BYLAWS CHANGES – vote! here’s why we need to change
  • 38:50 Having whole projects designated as core sucks – code in core should be slower and less changing. Innovation at the core will break interoperability
  • 39:40 Hoping that core will help product managers understand where they are using the standard and adding values
  • 41:10 All babies are ugly > with core, that’s good. We are looking for the grown ups who can do work and deliver value. Babies are things you nurture and help grow because they have potential.
  • 42:00 We undermine our credibility in the community when we talk about projects that are babies as if they were ready.
  • 43:15 DefCore’s job was to help pick projects. If everyone is core then we look like a youth soccer team where everyone is getting a trophy
  • 44:30 Question: “What do you tell to users to instill confidence in OpenStack”
  • 44:50 first thing: focus on operations and automation. Table stakes (for any cloud) is getting your deployments automated. Puppies vs Cattle.
  • 45:25 People who were successful with early OpenStack were using automated deployments against the APIs.
  • 46:00 DevOps is a fundamental part of cloud computing – if you’re hand-built and not automated then you are old school IT.
  • 46:40 Niki references Gartner “Bimodal IT” [excellent reference, go read it!]
  • 47:20 VMWare is a great crutch for OpenStack. We can use VMWare for the puppies.
  • 47:45 OpenStack is not going to run on every servers (perhaps that’s heresy) but it does not make sense in every workload
  • 48:15 One size does not fit all – we need to be good at what we’re good at
  • 48:30 OpenStack needs to focus on doing something really well. That means helping people who want to bring automated workloads into the cloud
  • 49:20 Core was about sending a signal about what’s ready and people can rely on
  • 49:45 Back in 2011, I was saying OpenStack was ready for people who would make the operational investment
  • 50:30 We use Crowbar because it makes it easier to do automated deployments for infrastructure like Hadoop and Ceph where you want access to the physical media
  • 51:00 We should be encouraging people to use OpenStack for its use cases
  • 51:30 Existential question for OpenStack: are we a suite or product. The community is split here
  • 51:30 In comparing with Amazon, does OpenStack have to implement it or build an ecosystem to compete
  • 53:00 As soon as you make something THE OpenStack project (like Heat) you are sending a message that the alternates are not welcome
  • 54:30 OpenStack ends up in a trap if we pick a single project and make it the way that we are going do something. New implementations are going to surface from WITHIN the projects and we need to ready for that.
  • 55:15 new implementations are coming, we have to be ready for that. We can make ourselves vulnerable to splitting if we do not prepare.
  • 56:00 API vs Implementation? This is something that splits the community. Ultimately we to be an API spec but we are not ready for that. We have a lot of work to do first using the same code base.
  • 56:50 DefCore has taken a balanced approach using our diversity as a strength
  • 57:20 Bylaws did not allow for enough flexibility for what is core
  • 59:00 We need voters for the quorum!
  • 59:30 Rob recommended Rocky Grober (Huawei) and Shamail Tahir (EMC) for future shows

OpenStack DefCore Enters Execution Phase. Help!

OpenStack DefCore Committee has established the principles and first artifacts required for vendors using the OpenStack trademark.  Over the next release cycle, we will be applying these to the Ice House and Juno releases.

Like a rockLearn more?  Hear about it LIVE!  Rob will be doing two sessions about DefCore next week (will be recorded):

  1. Tues Dec 16 at 9:45 am PST- OpenStack Podcast #14 with Jeff Dickey
  2. Thurs Dec 18 at 9:00 am PST – Online Meetup about DefCore with Rafael Knuth (optional RSVP)

At the December 2014 OpenStack Board meeting, we completed laying the foundations for the DefCore process that we started April 2013 in Portland. These are a set of principles explaining how OpenStack will select capabilities and code required for vendors using the name OpenStack. We also published the application of these governance principles for the Havana release.

  1. The OpenStack Board approved DefCore principles to explain
    the landscape of core including test driven capabilities and designated code (approved Nov 2013)
  2. the twelve criteria used to select capabilities (approved April 2014)
  3. the creation of component and framework layers for core (approved Oct 2014)
  4. the ten principles used to select designated sections (approved Dec 2014)

To test these principles, we’ve applied them to Havana and expressed the results in JSON format: Havana Capabilities and Havana Designated Sections. We’ve attempted to keep the process transparent and community focused by keeping these files as text and using the standard OpenStack review process.

DefCore’s work is not done and we need your help!  What’s next?

  1. Vote about bylaws changes to fully enable DefCore (change from projects defining core to capabilities)
  2. Work out going forward process for updating capabilities and sections for each release (once authorized by the bylaws, must be approved by Board and TC)
  3. Bring Havana work forward to Ice House and Juno.
  4. Help drive Refstack process to collect data from the field

Ironic + Crowbar: United in Vision, Complementary in Approach

This post is co-authored by Devanda van der Veen, OpenStack Ironic PTL, and Rob Hirschfeld, OpenCrowbar Founder.  We discuss how Ironic and Crowbar work together today and into the future.

Normalizing the APIs for hardware configuration is a noble and long-term goal.  While the end result, a configured server, is very easy to describe; the differences between vendors’ hardware configuration tools are substantial.  These differences make it impossible challenging to create repeatable operations automation (DevOps) on heterogeneous infrastructure.

Illustration to show potential changes in provisioning control flow over time.

Illustration to show potential changes in provisioning control flow over time.

The OpenStack Ironic project is a multi-vendor community solution to this problem at the server level.  By providing a common API for server provisioning, Ironic encourages vendors to write drivers for their individual tooling such as iDRAC for Dell or iLO for HP.

Ironic abstracts configuration and expects to be driven by an orchestration system that makes the decisions of how to configure each server. That type of orchestration is the heart of Crowbar physical ops magic [side node: 5 ways that physical ops is different from cloud]

The OpenCrowbar project created extensible orchestration to solve this problem at the system level.  By decomposing system configuration into isolated functional actions, Crowbar can coordinate disparate configuration actions for servers, switches and between systems.

Today, the Provisioner component of Crowbar performs similar functions as Ironic for operating system installation and image lay down.  Since configuration activity is tightly coupled with other Crowbar configuration, discovery and networking setup, it is difficult to isolate in the current code base.  As Ironic progresses, it should be possible to shift these activities from the Provisioner to Ironic and take advantage of the community-based configuration drivers.

The immediate synergy between Crowbar and Ironic comes from accepting two modes of operation for OpenStack: bootstrapping infrastructure and multi-tenant server allocation.

Crowbar was designed as an operational platform that seeds an OpenStack ready environment.  Once that environment is configured, OpenStack can take over ownership of the resources and allow Ironic to manage and deliver “hypervisor-free” servers for each tenant.  In that way, we can accelerate the adoption of OpenStack for self-service metal.

Physical operations is messy and challenging, but we’re committed to working together to make it suck less.  Operators of the world unite!

To thrive, OpenStack must better balance dev, ops and business needs.

OpenStack has grown dramatically in many ways but we have failed to integrate development, operations and business communities in a balanced way.

My most urgent observation from Paris is that these three critical parts of the community are having vastly different dialogs about OpenStack.

Clouds DownAt the Conference, business people were talking were about core, stability and utility while the developers were talking about features, reorganizing and expanding projects. The operators, unfortunately segregated in a different location, were trying to figure out how to share best practices and tools.

Much of this structural divergence was intentional and should be (re)evaluated as we grow.

OpenStack events are split into distinct focus areas: the conference for business people, the summit for developers and specialized days for operators. While this design serves a purpose, the community needs to be taking extra steps to ensure communication. Without that communication, corporate sponsors and users may find it easier to solve problems inside their walls than outside in the community.

The risk is clear: vendors may find it easier to work on a fork where they have business and operational control than work within the community.

Inside the community, we are working to help resolve this challenge with several parallel efforts. As a community member, I challenge you to get involved in these efforts to ensure the project balances dev, biz and ops priorities.  As a board member, I feel it’s a leadership challenge to make sure these efforts converge and that’s one of the reasons I’ve been working on several of these efforts:

  • OpenStack Project Managers (was Hidden Influencers) across companies in the ecosystem are getting organized into their own team. Since these managers effectively direct the majority of OpenStack developers, this group will allow
  • DefCore Committee works to define a smaller subset of the overall OpenStack Project that will be required for vendors using the OpenStack trademark and logo. This helps the business community focus on interoperability and stability.
  • Technical leadership (TC) lead “Big Tent” concept aligns with DefCore work and attempts to create a stable base platform while making it easier for new projects to enter the ecosystem. I’ve got a lot to say about this, but frankly, without safeguards, this scares people in the ops and business communities.
  • An operations “ready state” baseline keeps the community from being able to share best practices – this has become a pressing need.  I’d like to suggest as OpenCrowbar an outside of OpenStack a way to help provide an ops neutral common starting point. Having the OpenStack developer community attempting to create an installer using OpenStack has proven a significant distraction and only further distances operators from the community.

We need to get past seeing the project primarily as a technology platform.  Infrastructure software has to deliver value as an operational tool for enterprises.  For OpenStack to thrive, we must make sure the needs of all constituents (Dev, Biz, Ops) are being addressed.

Self-Exposure: Hidden Influencers become OpenStack Product Working Group

Warning to OpenStack PMs: If you are not actively involved in this effort then you (and your teams) will be left behind!

ManagersThe Hidden Influencers (now called “OpenStack Product Working Group”) had a GREAT and PRODUCTIVE session at the OpenStack (full notes):

  1. Named the group!  OpenStack Product Working Group (now, that’s clarity in marketing) [note: I was incorrect saying “Product Managers” earlier].
  2. Agreed to use the mailing list for communication.
  3. Committed to a face-to-face mid-cycle meetup (likely in South Bay)
  4. Output from the meetup will be STRATEGIC DIRECTION doc to board (similar but broader than “Win the Enterprise”)
  5. Regular meeting schedule – like developers but likely voice interactive instead of IRC.  Stefano Maffulli is leading.

PMs starting this group already direct the work for a super majority (>66%) of active contributors.

The primary mission for the group is to collaborate and communicate around development priorities so that we can ensure that project commitments get met.

It was recognized that the project technical leads are already strapped coordinating release and technical objectives.  Further, the product managers are already but independently engaged in setting strategic direction, we cannot rely on existing OpenStack technical leadership to have the bandwidth.

This effort will succeed to the extent that we can help the broader community tied in and focus development effort back to dollars for the people paying for those developers.  In my book, that’s what product managers are supposed to do.  Hopefully, getting this group organized will help surface that discussion.

This is a big challenge considering that these product managers have to balance corporate, shared project and individual developers’ requirements.  Overall, I think Allison Randall summarized our objectives best: “we’re herding cats in the same direction.”

Leveling OpenStack’s Big Tent: is OpenStack a product, platform or suite?

Question of the day: What should OpenStack do with all those eager contributors?  Does that mean expanding features or focusing on a core?

IMG_20141108_101906In the last few months, the OpenStack technical leadership (Sean Dague, Monty Taylor) has been introducing two interconnected concepts: big tent and levels.

  • Big tent means expanding the number of projects to accommodate more diversity (both in breath and depth) in the official OpenStack universe.  This change accommodates the growth of the community.
  • Levels is a structured approach to limiting integration dependencies between projects.  Some OpenStack components are highly interdependent and foundational (Keystone, Nova, Glance, Cinder) while others are primarily consumers (Horizon, Saraha) of lower level projects.

These two concepts are connected because we must address integration challenges that make it increasingly difficult to make changes within the code base.  If we substantially expand the code base with big tent then we need to make matching changes to streamline integration efforts.  The levels proposal reflects a narrower scope at the base than we currently use.

By combining big tent and levels, we are simultaneously growing and shrinking: we grow the community and scope while we shrink the integration points.  This balance may be essential to accommodate OpenStack’s growth.

UNIVERSALLY, the business OpenStack community who wants OpenStack to be a product.  Yet, what’s included in that product is unclear.

Expanding OpenStack projects tends to turn us into a suite of loosely connected functions rather than a single integrated platform with an ecosystem.  Either approach is viable, but it’s not possible to be both simultaneously.

On a cautionary note, there’s an anti-Big Tent position I heard expressed at the Paris Conference.  It’s goes like this: until vendors start generating revenue from the foundation components to pay for developer salaries; expanding the scope of OpenStack is uninteresting.

Recent DefCore changes also reflect the Big Tent thinking by adding component and platform levels.  This change was an important and critical compromise to match real-world use patterns by companies like Swiftstack (Object), DreamHost (Compute+Ceph), Piston (Compute+Ceph) and others; however, it creates the need to explain “which OpenStack” these companies are using.

I believe we have addressed interoperability in this change.  It remains to be seen if OpenStack vendors will choose to offer the broader platform or limit to themselves to individual components.  If vendors chase the components over platform then OpenStack becomes a suite of loosely connect products.  It’s ultimately a customer and market decision.

It’s not too late to influence these discussions!  I’m very interested in hearing from people in the community which direction they think the project should go.

Unicorn captured! Unpacking multi-node OpenStack Juno from ready state.

OpenCrowbar Packstack install demonstrates that abstracting hardware to ready state smooths install process.  It’s a working balance: Crowbar gets the hardware, O/S & networking right while Packstack takes care of OpenStack.

LAYERSThe Crowbar team produced the first open OpenStack installer back in 2011 and it’s been frustrating to watch the community fragment around building a consistent operational model.  This is not an OpenStack specific problem, but I think it’s exaggerated in a crowded ecosystem.

When I step back from that experience, I see an industry wide pattern of struggle to create scale deployments patterns that can be reused.  Trying to make hardware uniform is unicorn hunting, so we need to create software abstractions.  That’s exactly why IaaS is powerful and the critical realization behind the OpenCrowbar approach to physical ready state.

So what has our team created?  It’s not another OpenStack installer – we just made the existing one easier to use.

We build up a ready state infrastructure that makes it fast and repeatable to use Packstack, one of the leading open OpenStack installers.  OpenCrowbar can do the same for the OpenStack Chef cookbooks or Salt Formula.   It can even use Saltstack, Chef and Puppet together (which we do for the Packstack work)!  Plus we can do it on multiple vendors hardware and with different operating systems.   Plus we build the correct networks!

For now, the integration is available as a private beta (inquiries welcome!) because our team is not in the OpenStack support business – we are in the “get scale systems to ready state and integrate” business.  We are very excited to work with people who want to take this type of functionality to the next level and build truly repeatable, robust and upgradable application deployments.

Tweaking DefCore to subdivide OpenStack platform

The following material will be a major part of the discussion for The OpenStack Board meeting on Monday 10/20.  Comments and suggest welcome!

OpenStack in PartsAuthor’s 2/26/2015 Note:  This proposal was approved by the Board in October 2014.

For nearly two years, the OpenStack Board has been moving towards creating a common platform definition that can help drive interoperability.  At the last meeting, the Board paused to further review one of the core tenants of the DefCore process (Item #3: Core definition can be applied equally to all usage models).

Outside of my role as DefCore chair, I see the OpenStack community asking itself an existential question: “are we one platform or a suite of projects?”  I’m having trouble believing “we are both” is an acceptable answer.

During the post-meeting review, Mark Collier drafted a Foundation supported recommendation that basically creates an additional core tier without changing the fundamental capabilities & designated code concepts.  This proposal has been reviewed by the DefCore committee (but not formally approved in a meeting).

The original DefCore proposed capabilities set becomes the “platform” level while capability subsets are called “components.”  We are considering two initial components, Compute & Object, and both are included in the platform (see illustration below).  The approach leaves the door open for new core component to exist both under and outside of the platform umbrella.

In the proposal, OpenStack vendors who meet either component or platform requirements can qualify for the “OpenStack Powered” logo; however, vendors using the only a component (instead of the full platform) will have more restrictive marks and limitations about how they can use the term OpenStack.

This approach addresses the “is Swift required?” question.  For platform, Swift capabilities will be required; however, vendors will be able to implement the Compute component without Swift and implement the Object component without Nova/Glance/Cinder.

It’s important to note that there is only one yard stick for components or the platform: the capabilities groups and designed code defined by the DefCore process.  From that perspective, OpenStack is one consistent thing.  This change allows vendors to choose sub-components if that serves their business objectives.

It’s up to the community to prove the platform value of all those sub-components working together.

OpenStack Goldilocks’ Syndrome: three questions to help us find our bearings

Goldilocks Atlas

Action: Please join Stefano. Allison, Sean and me in Paris on Monday, November 3rd, in the afternoon (schedule link)

If wishes were fishes, OpenStack’s rapid developer and user rise would include graceful process and commercial transitions too.  As a Foundation board member, it’s my responsibility to help ensure that we’re building a sustainable ecosystem for the project.  That’s a Goldilock’s challenge because adding either too much or too little controls and process will harm the project.

In discussions with the community, that challenge seems to breaks down into three key questions:

After last summit, a few of us started a dialog around Hidden Influencers that helps to frame these questions in an actionable way.  Now, it’s time for us to come together and talk in Paris in the hallways and specifically on Monday, November 3rd, in the afternoon (schedule link).   From there, we’ll figure out about next steps using these three questions as a baseline.

If you’ve got opinions about these questions, don’t wait for Paris!  I’d love to start the discussion here in the comments, on twitter (@zehicle), by phone, with email or via carrier pidgins.

To improve flow, we must view OpenStack community as a Software Factory

This post was sparked by a conversation at OpenStack Atlanta between OpenStack Foundation board members Todd Moore (IBM) and Rob Hirschfeld (Dell/Community).  We share a background in industrial and software process and felt that sharing lean manufacturing translates directly to helping face OpenStack challenges.

While OpenStack has done an amazing job of growing contributors, scale has caused our code flow processes to be bottlenecked at the review stage.  This blocks flow throughout the entire system and presents a significant risk to both stability and feature addition.  Flow failures can ultimately lead to vendor forking.

Fundamentally, Todd and I felt that OpenStack needs to address system flows to build an integrated product.  The post expands on the “hidden influencers” issue and adds an additional challenge because improving flow requires that the community influences better understands the need to optimize work inter-project in a more systematic way.

Let’s start by visualizing the “OpenStack Factory”

Factory Floor

Factory Floor from Alpha Industries Wikipedia page

Imagine all of OpenStack’s 1000s of developers working together in a single giant start-up warehouse.  Each project in its own floor area with appropriate fooz tables, break areas and coffee bars.  It’s easy to visualize clusters of intent developers talking around tables or coding in dark corners while PTLs and TC members dash between groups coordinating work.

Expand the visualization so that we can actually see the code flowing between teams as little colored boxes.  Giving project has a unique color allows us to quickly see dependencies between teams.  Some features are piled up waiting for review inside teams while others are waiting on pallets between projects waiting on needed cross features have not completed.  At release time, we’d be able to see PTLs sorting through stacks of completed boxes to pick which ones were ready to ship.

Watching a factory floor from above is a humbling experience and a key feature of systems thinking enlightenment in both The Phoenix Project and The Goal.  It’s very easy to be caught up in a single project (local optimization) and miss the broader system implications of local choices.

There is a large body of work about Lean Process for Manufacturing

You’ve already visualized OpenStack code creation as a manufacturing floor: it’s a small step to accept that we can use the same proven processes for software and physical manufacturing.

As features move between teams (work centers), it becomes obvious that we’ve created a very highly interlocked sequence of component steps needed to deliver product; unfortunately, we have minimal coordination between the owners of the work centers.  If a feature is needs a critical resource (think programmer) to progress then we rely on the resource to allocate time to the work.  Since that person’s manager may not agree to the priority, we have a conflict between system flow and individual optimization.

That conflict destroys flow in the system.

The number #1 lesson from lean manufacturing is that putting individual optimization over system optimization reduces throughput.  Since our product and people managers are often competitors, we need to work doubly hard to address system concerns.  Worse yet our inventory of work in process and the interdependencies between projects is harder to discern.  Unlike the manufacturing floor, our developers and project leads cannot look down upon it and see the physical work as it progresses from station to station in one single holistic view.  The bottlenecks that throttle the OpenStack workflow are harder to see but we can find them, as can be demonstrated later in this post.

Until we can engage the resource owners in balancing system flow, OpenStack’s throughput will decline as we add resources.  This same principle is at play in the famous aphorism: adding developers makes a late project later.

Is there a solution?

There are lessons from Lean Manufacturing that can be applied

  1. Make quality a priority (expand tests from function to integration)
  2. Ensure integration from station to station (prioritize working together over features)
  3. Make sure that owners of work are coordinating (expose hidden influencers)
  4. Find and mange from the bottleneck (classic Lean says find the bottleneck and improve that)
  5. Create and monitor a system view
  6. Have everyone value finished product, not workstation output

Added Subscript: I highly recommend reading Daniel Berrange’s email about this.