unBIOSed? Is Redfish an IPMI retread or can vendors find unification?

Server management interfaces stink.  They are inconsistent both between vendors and within their own product suites.  Ideally, Vendors would agree on a single API; however, it’s not clear if the diversity is a product of competition or actual platform variation.  Likely, it’s both.

From RedFish SiteWhat is Redfish?  It’s a REST API for server configuration that aims to replace both IPMI and vendor specific server interfaces (like WSMAN).  Here’s the official text from RedfishSpecification.org.

Redfish is a modern intelligent [server] manageability interface and lightweight data model specification that is scalable, discoverable and extensible.  Redfish is suitable for a multitude of end-users, from the datacenter operator to an enterprise management console.

I think that it’s great to see vendors trying to get on the same page and I’m optimistic that we could get something better than IPMI (that’s a very low bar).  However, I don’t expect that vendors can converge to a single API; it’s just not practical due to release times and pressures to expose special features.  I think the divergence in APIs is due both to competitive pressures and to real variance between platforms.

Even if we manage to a grand server management unification; the problem of interface heterogeneity has a long legacy tail.

In the best case reality, we’re going from N versions to N+1 (and likely N*2) versions because the legacy gear is still around for a long time.  Adding Redfish means API sprawl is going to get worse until it gets back to being about the same as it is now.

Putting pessimism aside, the sprawl problem is severe enough that it’s worth supporting Redfish on the hope that it makes things better.

That’s easy to say, but expensive to do.  If I was making hardware (I left Dell in Oct 2014), I’d consider it an expensive investment for an uncertain return.  Even so, several major hardware players are stepping forward to help standardize.  I think Redfish would have good ROI for smaller vendors looking to displace a major player can ride on the standard.

Redfish is GREAT NEWS for me since RackN/Crowbar provides hardware abstraction and heterogeneous interface support.  More API variation makes my work more valuable.

One final note: if Redfish improves hardware security in a real way then it could be a game changer; however, embedded firmware web servers can be tricky to secure and patch compared to larger application focused software stacks.  This is one area what I’m hoping to see a lot of vendor collaboration!  [note: this should be it’s own subject – the security issue is more than API, it’s about system wide configuration.  stay tuned!]

Ironic + Crowbar: United in Vision, Complementary in Approach

This post is co-authored by Devanda van der Veen, OpenStack Ironic PTL, and Rob Hirschfeld, OpenCrowbar Founder.  We discuss how Ironic and Crowbar work together today and into the future.

Normalizing the APIs for hardware configuration is a noble and long-term goal.  While the end result, a configured server, is very easy to describe; the differences between vendors’ hardware configuration tools are substantial.  These differences make it impossible challenging to create repeatable operations automation (DevOps) on heterogeneous infrastructure.

Illustration to show potential changes in provisioning control flow over time.

Illustration to show potential changes in provisioning control flow over time.

The OpenStack Ironic project is a multi-vendor community solution to this problem at the server level.  By providing a common API for server provisioning, Ironic encourages vendors to write drivers for their individual tooling such as iDRAC for Dell or iLO for HP.

Ironic abstracts configuration and expects to be driven by an orchestration system that makes the decisions of how to configure each server. That type of orchestration is the heart of Crowbar physical ops magic [side node: 5 ways that physical ops is different from cloud]

The OpenCrowbar project created extensible orchestration to solve this problem at the system level.  By decomposing system configuration into isolated functional actions, Crowbar can coordinate disparate configuration actions for servers, switches and between systems.

Today, the Provisioner component of Crowbar performs similar functions as Ironic for operating system installation and image lay down.  Since configuration activity is tightly coupled with other Crowbar configuration, discovery and networking setup, it is difficult to isolate in the current code base.  As Ironic progresses, it should be possible to shift these activities from the Provisioner to Ironic and take advantage of the community-based configuration drivers.

The immediate synergy between Crowbar and Ironic comes from accepting two modes of operation for OpenStack: bootstrapping infrastructure and multi-tenant server allocation.

Crowbar was designed as an operational platform that seeds an OpenStack ready environment.  Once that environment is configured, OpenStack can take over ownership of the resources and allow Ironic to manage and deliver “hypervisor-free” servers for each tenant.  In that way, we can accelerate the adoption of OpenStack for self-service metal.

Physical operations is messy and challenging, but we’re committed to working together to make it suck less.  Operators of the world unite!

Physical Ops = Plumbers of the Internet. Celebrating dirty IT jobs 8 bit style

I must be crazy because I like to make products that take on the hard and thankless jobs in IT.  Its not glamorous, but someone needs to do them.

marioAnalogies are required when explaining what I do to most people.  For them, I’m not a specialist in physical data center operations, I’m an Internet plumber who is part of the team you call when your virtual toilet backs up.  I’m good with that – it’s work that’s useful, messy and humble.

Plumbing, like the physical Internet, disappears from most people’s conscious once it’s out of sight under the floor, cabinet or modem closet.  And like plumbers, we can’t do physical ops without getting dirty.  Unlike cloud-based ops with clean APIs and virtual services, you can’t do physical ops without touching something physical.  Even if you’ve got great telepresence, you cannot get away from physical realities like NIC and SATA enumeration, BIOS management and network topology.  I’m delighted that cloud has abstracted away that layer for most people but that does not mean we can ignore it.

Physical ops lacks the standardization of plumbing.  There are many cross-vendor standards but innovation and vendor variation makes consistency as unlikely as a unicorn winning the Rainbow Triple Crown.

493143-donkey_kong_1For physical ops, it feels like we’re the internet’s most famous plumber, Mario, facing Donkey Kong.  We’ve got to scale ladders, jump fireballs and swing between chains.  The job is made harder because there’s no half measures.  Sometimes you can find the massive hammer and blast your way through but that’s just a short term thing.

Unfortunately, there’s a real enemy here: complexity.

Just like Donkey Kong keeps dashing off with the princess, operations continue to get more and more complex.  Like with Mario, the solution is not to bypass the complexity; it’s to get better and faster at navigating the obstacles that get thrown at you.  Physical ops is about self-reliance and adaptability.  In that case, there are a lot of lessons to be learned from Mario.

If I’m an internet plumber then I’m happy to embrace Mario as my mascot.  Plumbers of the internet unite!

Who’s the grown-up here?  It’s the VM not the Iron!

This ANALOGY exploring Virtual vs Physical Ops is Joint posting by Rob Hirschfeld, RackN, and Russel Doty, Redhat.RUSSEL DOTY

babyCompared to provisioning physical servers, getting applications running in a virtual machine is like coaching an adult soccer team – the players are ready, you just have to get them to the field and set the game in motion.  The physical servers can be compared to a grade school team – tremendous potential, but they can require a lot of coaching and intervention. And they don’t always play nice.

Russell Doty and I were geeking on the challenges of configuring physical servers when we realized that our friends in cloud just don’t have these problems.  When they ask for a server, it’s delivered to them on a platter with an SLA.  It’s a known configuration – calm, rational and well-behaved.  By comparison, hardware is cranky, irregular and sporadic.  To us, it sometimes feels like we are more in the babysitting business. Yes, we’ve had hardware with the colic!

Continuing the analogy, physical operations requires a degree of child-proofing and protection that is (thankfully) hidden behind cloud abstractions of hardware.  More importantly, it requires a level of work that adults take for granted like diaper changes (bios/raid setup), food preparation (network configs), and self-entertainment (O/S updates).

And here’s where the analogy breaks down…

The irony here is that the adults (vms) are the smaller, weaker part of the tribe.  Not only that, these kids have to create the environment that the “adults” run on.

If you’re used to dealing with adults to get work done, you’re going to be in for a shock when you ask the kids to do the same job.

That’s why the cloud is such a productive platform for software.  It’s an adults-only environment – the systems follow the rules and listen to your commands.  Even further, cloud systems know how to dress themselves (get an O/S), rent an apartment (get an IP and connect) and even get credentials (get a driver’s license).

These “little things” are taken for granted in the cloud are not automatic behaviors for physical infrastructure.

Of course, there are trade-offs – most notably performance and “scale up” scalability. The closer you need to get to hardware performance, on cpu, storage, or networks, the closer you need to get to the hardware.

It’s the classic case of standardizing vs. customization. And a question of how much time you are prepared to put into care and feeding!

To thrive, OpenStack must better balance dev, ops and business needs.

OpenStack has grown dramatically in many ways but we have failed to integrate development, operations and business communities in a balanced way.

My most urgent observation from Paris is that these three critical parts of the community are having vastly different dialogs about OpenStack.

Clouds DownAt the Conference, business people were talking were about core, stability and utility while the developers were talking about features, reorganizing and expanding projects. The operators, unfortunately segregated in a different location, were trying to figure out how to share best practices and tools.

Much of this structural divergence was intentional and should be (re)evaluated as we grow.

OpenStack events are split into distinct focus areas: the conference for business people, the summit for developers and specialized days for operators. While this design serves a purpose, the community needs to be taking extra steps to ensure communication. Without that communication, corporate sponsors and users may find it easier to solve problems inside their walls than outside in the community.

The risk is clear: vendors may find it easier to work on a fork where they have business and operational control than work within the community.

Inside the community, we are working to help resolve this challenge with several parallel efforts. As a community member, I challenge you to get involved in these efforts to ensure the project balances dev, biz and ops priorities.  As a board member, I feel it’s a leadership challenge to make sure these efforts converge and that’s one of the reasons I’ve been working on several of these efforts:

  • OpenStack Project Managers (was Hidden Influencers) across companies in the ecosystem are getting organized into their own team. Since these managers effectively direct the majority of OpenStack developers, this group will allow
  • DefCore Committee works to define a smaller subset of the overall OpenStack Project that will be required for vendors using the OpenStack trademark and logo. This helps the business community focus on interoperability and stability.
  • Technical leadership (TC) lead “Big Tent” concept aligns with DefCore work and attempts to create a stable base platform while making it easier for new projects to enter the ecosystem. I’ve got a lot to say about this, but frankly, without safeguards, this scares people in the ops and business communities.
  • An operations “ready state” baseline keeps the community from being able to share best practices – this has become a pressing need.  I’d like to suggest as OpenCrowbar an outside of OpenStack a way to help provide an ops neutral common starting point. Having the OpenStack developer community attempting to create an installer using OpenStack has proven a significant distraction and only further distances operators from the community.

We need to get past seeing the project primarily as a technology platform.  Infrastructure software has to deliver value as an operational tool for enterprises.  For OpenStack to thrive, we must make sure the needs of all constituents (Dev, Biz, Ops) are being addressed.

Self-Exposure: Hidden Influencers become OpenStack Product Working Group

Warning to OpenStack PMs: If you are not actively involved in this effort then you (and your teams) will be left behind!

ManagersThe Hidden Influencers (now called “OpenStack Product Working Group”) had a GREAT and PRODUCTIVE session at the OpenStack (full notes):

  1. Named the group!  OpenStack Product Working Group (now, that’s clarity in marketing) [note: I was incorrect saying “Product Managers” earlier].
  2. Agreed to use the mailing list for communication.
  3. Committed to a face-to-face mid-cycle meetup (likely in South Bay)
  4. Output from the meetup will be STRATEGIC DIRECTION doc to board (similar but broader than “Win the Enterprise”)
  5. Regular meeting schedule – like developers but likely voice interactive instead of IRC.  Stefano Maffulli is leading.

PMs starting this group already direct the work for a super majority (>66%) of active contributors.

The primary mission for the group is to collaborate and communicate around development priorities so that we can ensure that project commitments get met.

It was recognized that the project technical leads are already strapped coordinating release and technical objectives.  Further, the product managers are already but independently engaged in setting strategic direction, we cannot rely on existing OpenStack technical leadership to have the bandwidth.

This effort will succeed to the extent that we can help the broader community tied in and focus development effort back to dollars for the people paying for those developers.  In my book, that’s what product managers are supposed to do.  Hopefully, getting this group organized will help surface that discussion.

This is a big challenge considering that these product managers have to balance corporate, shared project and individual developers’ requirements.  Overall, I think Allison Randall summarized our objectives best: “we’re herding cats in the same direction.”

Starting RackN – Delivering open ops by pulling an OpenCrowbar Bunny out of our hat

When Dell pulled out from OpenCrowbar last April, I made a commitment to our community to find a way to keep it going.  Since my exit from Dell early in October 2014, that commitment has taken the form of RackN.

Rack N BlackToday, we’re ready to help people run and expand OpenCrowbar (days away from v2.1!). We’re also seeking investment to make the project more “enterprise-ready” and build integrations that extend ready state.

RackN focuses on maintenance and support of OpenCrowbar for ready state physical provisioning.  We will build the community around Crowbar as an open operations core and extend it with a larger set of hardware support and extensions.  We are building partnerships to build application integration (using Chef, Puppet, Salt, etc) and platform workloads (like OpenStack, Hadoop, Ceph, CloudFoundry and Mesos) above ready state.

I’ve talked with hundreds of people about the state of physical data center operations at scale. Frankly, it’s a scary state of affairs: complexity is increasing for physical infrastructure and we’re blurring the lines by adding commodity networking with local agents into the mix.

Making this jumble of stuff work together is not sexy cloud work – I describe it as internet plumbing to non-technical friends.  It’s unforgiving, complex and full of sharp edge conditions; however, people are excited to hear about our hardware abstraction mission because it solves a real pain for operators.

I hope you’ll stay tuned, or even play along, as we continue the Open Ops journey.

Tweaking DefCore to subdivide OpenStack platform

The following material will be a major part of the discussion for The OpenStack Board meeting on Monday 10/20.  Comments and suggest welcome!

OpenStack in PartsAuthor’s 2/26/2015 Note:  This proposal was approved by the Board in October 2014.

For nearly two years, the OpenStack Board has been moving towards creating a common platform definition that can help drive interoperability.  At the last meeting, the Board paused to further review one of the core tenants of the DefCore process (Item #3: Core definition can be applied equally to all usage models).

Outside of my role as DefCore chair, I see the OpenStack community asking itself an existential question: “are we one platform or a suite of projects?”  I’m having trouble believing “we are both” is an acceptable answer.

During the post-meeting review, Mark Collier drafted a Foundation supported recommendation that basically creates an additional core tier without changing the fundamental capabilities & designated code concepts.  This proposal has been reviewed by the DefCore committee (but not formally approved in a meeting).

The original DefCore proposed capabilities set becomes the “platform” level while capability subsets are called “components.”  We are considering two initial components, Compute & Object, and both are included in the platform (see illustration below).  The approach leaves the door open for new core component to exist both under and outside of the platform umbrella.

In the proposal, OpenStack vendors who meet either component or platform requirements can qualify for the “OpenStack Powered” logo; however, vendors using the only a component (instead of the full platform) will have more restrictive marks and limitations about how they can use the term OpenStack.

This approach addresses the “is Swift required?” question.  For platform, Swift capabilities will be required; however, vendors will be able to implement the Compute component without Swift and implement the Object component without Nova/Glance/Cinder.

It’s important to note that there is only one yard stick for components or the platform: the capabilities groups and designed code defined by the DefCore process.  From that perspective, OpenStack is one consistent thing.  This change allows vendors to choose sub-components if that serves their business objectives.

It’s up to the community to prove the platform value of all those sub-components working together.

OpenStack Goldilocks’ Syndrome: three questions to help us find our bearings

Goldilocks Atlas

Action: Please join Stefano. Allison, Sean and me in Paris on Monday, November 3rd, in the afternoon (schedule link)

If wishes were fishes, OpenStack’s rapid developer and user rise would include graceful process and commercial transitions too.  As a Foundation board member, it’s my responsibility to help ensure that we’re building a sustainable ecosystem for the project.  That’s a Goldilock’s challenge because adding either too much or too little controls and process will harm the project.

In discussions with the community, that challenge seems to breaks down into three key questions:

After last summit, a few of us started a dialog around Hidden Influencers that helps to frame these questions in an actionable way.  Now, it’s time for us to come together and talk in Paris in the hallways and specifically on Monday, November 3rd, in the afternoon (schedule link).   From there, we’ll figure out about next steps using these three questions as a baseline.

If you’ve got opinions about these questions, don’t wait for Paris!  I’d love to start the discussion here in the comments, on twitter (@zehicle), by phone, with email or via carrier pidgins.

Need a physical ops baseline? Crowbar continues to uniquely fill gap

Robots Everywhere!I’ve been watching to see if other open “bare metal” projects would morph to match the system-level capabilities that we proved in Crowbar v1 and honed in the re-architecture of OpenCrowbar.  The answer appears to be that Crowbar simply takes a broader approach to solving the physical ops repeatably problem.

Crowbar Architect Victor Lowther says “What makes Crowbar a better tool than Cobbler, Razor, or Foreman is that Crowbar has an orchestration engine that can be used to safely and repeatably deploy complex workloads across large numbers of machines. This is different from (and better than, IMO) just being able to hand responsibility off to Chef/Puppet/Salt, because we can manage the entire lifecycle of a machine where Cobbler, Razor and Chef cannot, we can describe how we want workloads configured at a more abstract level than Foreman can, and we do it all using the same API and UI.”

Since we started with a vision of an integrated system to address the “apply-rinse-repeat” cycle; it’s no surprise that Crowbar remains the only open platform that’s managed to crack the complete physical deployment life-cycle.

The Crowbar team realized that it’s not just about automation setting values: physical ops requires orchestration to make sure the values are set in the correct sequence on the appropriate control surface including DNS, DHCP, PXE, Monitoring, et cetera.  Unlike architectures for aaS platforms, the heterogeneous nature of the physical control planes requires a different approach.

We’ve seen that making more and more complex kickstart scripts or golden images is not a sustainable solution.  There is simply too much hardware variation and dependency thrash for operators to collaborate with those tools.  Instead, we’ve found that decomposing the provisioning operations into functional layers with orchestration is much more multi-site repeatable.

Accepting that physical ops (discovered infrastructure) is fundamentally different from cloud ops (created infrastructure) has been critical to architecting platforms that were resilient enough for the heterogeneous infrastructure of data centers.

If we want to start cleaning up physical ops, we need to stop looking at operating system provisioning in isolation and start looking at the full server bring up as just a part of a broader system operation that includes networking, management and operational integration.