Last Wednesday (3/11/15), I had the privilege of talking with the vBrownBag crowd about Functional Ops and bare metal deployment. In this hour, I talk about how functional operations (FuncOps) works as an extension of ready state. FuncOps is a critical concept for providing abstractions to scale heterogeneous physical operations.
Timing for this was fantastic since we’d just worked out ESXi install capability for OpenCrowbar (it will be exposed for work starting on Drill, the next Crowbar release cycle).
Building cloud infrastructure requires a rock-solid foundation.
In this hour, Rob Hirschfeld will demo automated tooling, specifically OpenCrowbar, to prepare and integrate physical infrastructure to ready state and then use PackStack to install OpenStack.
The OpenCrowbar project started in 2011 as an OpenStack installer and has grown into a general purpose provisioning and infrastructure orchestration framework that works in parallel with multiple hardware vendors, operating systems and devops tools. These tools create a fast, durable and repeatable environment to install OpenStack, Ceph, Kubernetes, Hadoop or other scale platforms.
Rob will show off the latest features and discuss key concepts from the Crowbar operational model including Ready State, Functional Operations and Late Binding. These concepts, built into Crowbar, can be applied generally to make your operations more robust and scalable.
With the OpenCrowbar v2.1 out, I’ve been asked to update the video library of Crowbar demos. Since a complete tour is about 3 hours, I decided to cut it down into focused demos that would allow you to start at an area of interest and work backwards.
I’ve linked all the videos below by title. Here’s a visual table of contents:
Crowbar v2.1 demo: Visual Table of Contents [click for playlist]
The heart of the demo series is the Annealer and Ready State (video #3).
Applying architecture and computer science principles to infrastructure automation helps us build better controls. In this post, we create an OSI-like model that helps decompose the ops environment.
The RackN team discussions about “what is Ready State” have led to some interesting realizations about physical ops. One of the most critical has been splitting the operational configuration (DNS, NTP, SSH Keys, Monitoring, Security, etc) from the application configuration.
Interactions between these layers are much more dynamic than developers and operators expect.
In cloud deployments, you can ask for the virtual infrastructure to be configured in advance via the IaaS and/or golden base images. In hardware, the environment build-up needs to be more incremental because variations in physical infrastructure and operations have to be accommodated.
Greg Althaus, Crowbar co-founder, and I put together this seven-layer model (it started as three and grew) because we needed to be more specific in discussions about provisioning and upgrade activity. The system view helps explain how layers 5 and 6 operate at the system layer.
The Seven Layers of our DIP:
0. shared infrastructure – the base layer is about the interconnects between the nodes. In this model, we care about the specific linkage to the node: VLAN tags on the switch port, which switch is connected, which PDU ID turns it on.
1. firmware and management – nodes have substantial driver (RAID/BIOS/IPMI) software below the operating system that must be configured correctly. In some cases, these configurations have external interfaces (BMC) that require out-of-band access, while others can only be configured in pre-install environments (I call that side-band).
2. operating system – while the operating system is critical, operators are striving to keep this layer as thin as possible to avoid overhead. Even so, there are critical security, networking and device mapping functions that must be configured. Local resource management items like mapping media or building network teams and bridges are layer 2 functions.
3. operations clients – this layer connects the node to the logical data center infrastructure in basic ways like time sync (NTP) and name resolution (DNS). It’s also where more sophisticated operators configure things like distributed cache, centralized logging and system health monitoring. Configuration management agents like Chef, Puppet or SaltStack are installed at the “top” of this layer to complete ready state.
4. applications – once all the baseline is set up, this is the unique workload. It can range from platforms for other applications (like OpenStack or Kubernetes) to the software itself, like Ceph, Hadoop or anything else.
5. operations management – the external system references for layer 3 must be factored into the operations model because they often require synchronized configuration. Examples include registering a server name and IP addresses in DNS, updating an inventory database or adding its thresholds to a monitoring infrastructure. For scale and security, it is critical to keep the node configuration (layer 3) constantly synchronized with the central management systems.
6. cluster coordination – no application stands alone; consequently, actions from layer 4 nodes must be coordinated with other nodes. This ranges from database registration and load balancing to complex upgrades with live data migration. Working in layer 4 without layer 6 coordination creates unmanageable infrastructure.
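To make the layering concrete, here is a minimal Python sketch of the model as a data structure. It numbers the layers from 0, which is an assumption inferred from the text (the operating system is called level 2 and applications layer 4). It also assumes that ready state means layers 0 through 3 are configured, since the operations clients layer is described as completing ready state. The names `Layer`, `is_ready_state` and `missing_for_ready_state` are illustrative only and do not come from Crowbar.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The seven layers, numbered from 0 (numbering inferred from the text)."""
    SHARED_INFRASTRUCTURE = 0
    FIRMWARE_AND_MANAGEMENT = 1
    OPERATING_SYSTEM = 2
    OPERATIONS_CLIENTS = 3
    APPLICATIONS = 4
    OPERATIONS_MANAGEMENT = 5
    CLUSTER_COORDINATION = 6


# Assumption: ready state covers every node-level layer up to and
# including operations clients (layers 0-3).
READY_STATE_LAYERS = frozenset(
    layer for layer in Layer if layer <= Layer.OPERATIONS_CLIENTS
)

# Layers 5 and 6 act on the system as a whole rather than on one node.
SYSTEM_LAYERS = frozenset(
    {Layer.OPERATIONS_MANAGEMENT, Layer.CLUSTER_COORDINATION}
)


def is_ready_state(configured):
    """Return True when every ready-state layer appears in `configured`."""
    return READY_STATE_LAYERS <= set(configured)


def missing_for_ready_state(configured):
    """Return the ready-state layers still to be configured, lowest first."""
    return sorted(READY_STATE_LAYERS - set(configured))


if __name__ == "__main__":
    # A node with hardware and OS done but no NTP/DNS/agent setup yet.
    node = {
        Layer.SHARED_INFRASTRUCTURE,
        Layer.FIRMWARE_AND_MANAGEMENT,
        Layer.OPERATING_SYSTEM,
    }
    print(is_ready_state(node))
    print([layer.name for layer in missing_for_ready_state(node)])
```

Treating the layers as ordered values makes it easy to ask whether a node is safe to hand to an application installer, which is the boundary that ready state is meant to draw.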
This seven layer operations model helps us discuss which actions are required when provisioning a scale infrastructure. In my experience, many developers want to work exclusively in layer 4 and overlook the need to have a consistent and managed infrastructure in all the other layers. We enable this thinking in cloud and platform as a service (PaaS) and that helps improve developer productivity.
We cannot overlook the other layers in physical ops; however, working to ready state helps us create more cloud-like boundaries. Those boundaries are a natural segue to my upcoming post about functional operations (older efforts here).
Normalizing the APIs for hardware configuration is a noble and long-term goal. While the end result, a configured server, is very easy to describe, the differences between vendors’ hardware configuration tools are substantial. These differences make it challenging to create repeatable operations automation (DevOps) on heterogeneous infrastructure.
Illustration to show potential changes in provisioning control flow over time.
The OpenStack Ironic project is a multi-vendor community solution to this problem at the server level. By providing a common API for server provisioning, Ironic encourages vendors to write drivers for their individual tooling such as iDRAC for Dell or iLO for HP.
Ironic abstracts configuration and expects to be driven by an orchestration system that makes the decisions of how to configure each server. That type of orchestration is the heart of Crowbar physical ops magic [side note: 5 ways that physical ops is different from cloud].
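The split described here, a common API with vendor drivers underneath and an orchestrator on top making the decisions, can be sketched in a few lines of Python. This is only an illustration of the pattern: `PowerDriver`, `FakeIdracDriver`, `FakeIloDriver` and `power_on_all` are hypothetical names, they are not Ironic's actual driver interface, and the drivers return strings instead of talking to real hardware.

```python
from abc import ABC, abstractmethod


class PowerDriver(ABC):
    """Hypothetical common interface; NOT Ironic's real driver API."""

    @abstractmethod
    def set_power(self, node_id: str, on: bool) -> str:
        """Switch a node on or off and describe what was done."""


class FakeIdracDriver(PowerDriver):
    """Stand-in for a Dell iDRAC driver; only formats a message."""

    def set_power(self, node_id: str, on: bool) -> str:
        return f"idrac:{node_id}:{'on' if on else 'off'}"


class FakeIloDriver(PowerDriver):
    """Stand-in for an HP iLO driver; only formats a message."""

    def set_power(self, node_id: str, on: bool) -> str:
        return f"ilo:{node_id}:{'on' if on else 'off'}"


# The orchestrator picks a driver by vendor; the vendor keys are assumptions.
DRIVERS = {"dell": FakeIdracDriver(), "hp": FakeIloDriver()}


def power_on_all(nodes):
    """Decide *what* to do per node; each driver decides *how*."""
    return [DRIVERS[vendor].set_power(node_id, True) for node_id, vendor in nodes]


if __name__ == "__main__":
    for line in power_on_all([("node-1", "dell"), ("node-2", "hp")]):
        print(line)
```

The orchestration code never branches on vendor-specific behavior, which is what lets the same automation run across heterogeneous hardware.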
The OpenCrowbar project created extensible orchestration to solve this problem at the system level. By decomposing system configuration into isolated functional actions, Crowbar can coordinate disparate configuration actions for servers, switches and between systems.
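One way to picture "isolated functional actions" is as small functions that take a state and return a new one, with an orchestrator that runs them in dependency order. The sketch below is a simplified illustration under that assumption, not Crowbar's implementation: the action names, the dependency graph and `run_actions` are all made up for the example, and it relies on `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


# Each action is isolated: it reads a state dict and returns a new one
# without mutating its input or calling other actions directly.
def configure_bios(state):
    return {**state, "bios": "configured"}


def configure_raid(state):
    return {**state, "raid": "configured"}


def set_switch_vlan(state):
    return {**state, "vlan": "tagged"}


def install_os(state):
    return {**state, "os": "installed"}


def configure_ntp(state):
    return {**state, "ntp": "synced"}


# Hypothetical dependency graph: action name -> (function, prerequisites).
ACTIONS = {
    "bios": (configure_bios, []),
    "raid": (configure_raid, ["bios"]),
    "vlan": (set_switch_vlan, []),
    "os": (install_os, ["raid", "vlan"]),
    "ntp": (configure_ntp, ["os"]),
}


def run_actions(actions, state):
    """Run every action after its prerequisites; return (order, final state)."""
    graph = {name: deps for name, (_, deps) in actions.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        func, _ = actions[name]
        state = func(state)
    return order, state


if __name__ == "__main__":
    order, final = run_actions(ACTIONS, {})
    print(order)
    print(final)
```

Because server actions (BIOS, RAID) and switch actions (VLAN) are peers in the same graph, the orchestrator can coordinate across devices without any single script needing to know about the others.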
Today, the Provisioner component of Crowbar performs functions similar to Ironic for operating system installation and image lay down. Since this configuration activity is tightly coupled with other Crowbar configuration, discovery and networking setup, it is difficult to isolate in the current code base. As Ironic progresses, it should be possible to shift these activities from the Provisioner to Ironic and take advantage of the community-based configuration drivers.
The immediate synergy between Crowbar and Ironic comes from accepting two modes of operation for OpenStack: bootstrapping infrastructure and multi-tenant server allocation.
Crowbar was designed as an operational platform that seeds an OpenStack ready environment. Once that environment is configured, OpenStack can take over ownership of the resources and allow Ironic to manage and deliver “hypervisor-free” servers for each tenant. In that way, we can accelerate the adoption of OpenStack for self-service metal.
Physical operations is messy and challenging, but we’re committed to working together to make it suck less. Operators of the world unite!
meh. Compared to cloud, Ops on physical infrastructure sinks.
Unfortunately, the cloud and scale platforms need to run somewhere, so someone’s got to deal with it. In fact, we’ve got to deal with crates of cranky servers and flocks of finicky platforms. It’s enough to keep a good operator down.
There is a light at the end of the tunnel! We can make it repeatable to provision OpenStack, Hadoop and other platforms.
As a community, we’re steadily bringing best practices and proven automation from cloud ops down into the physical space. On the OpenCrowbar project, we’re accelerating this effort using the ready state concept as a hand off point for “physical-cloud equivalency” and exploring the concept of “functional operations” to make DevOps scripts more portable.
OpenStack has grown dramatically in many ways but we have failed to integrate development, operations and business communities in a balanced way.
My most urgent observation from Paris is that these three critical parts of the community are having vastly different dialogs about OpenStack.
At the Conference, business people were talking about core, stability and utility while the developers were talking about features, reorganizing and expanding projects. The operators, unfortunately segregated in a different location, were trying to figure out how to share best practices and tools.
Much of this structural divergence was intentional and should be (re)evaluated as we grow.
OpenStack events are split into distinct focus areas: the conference for business people, the summit for developers and specialized days for operators. While this design serves a purpose, the community needs to be taking extra steps to ensure communication. Without that communication, corporate sponsors and users may find it easier to solve problems inside their walls than outside in the community.
The risk is clear: vendors may find it easier to work on a fork where they have business and operational control than work within the community.
Inside the community, we are working to help resolve this challenge with several parallel efforts. As a community member, I challenge you to get involved in these efforts to ensure the project balances dev, biz and ops priorities. As a board member, I feel it’s a leadership challenge to make sure these efforts converge and that’s one of the reasons I’ve been working on several of these efforts:
OpenStack Project Managers (was Hidden Influencers) across companies in the ecosystem are getting organized into their own team. Since these managers effectively direct the majority of OpenStack developers, this group will allow
DefCore Committee works to define a smaller subset of the overall OpenStack Project that will be required for vendors using the OpenStack trademark and logo. This helps the business community focus on interoperability and stability.
The technical leadership (TC) led “Big Tent” concept aligns with DefCore work and attempts to create a stable base platform while making it easier for new projects to enter the ecosystem. I’ve got a lot to say about this, but frankly, without safeguards, this scares people in the ops and business communities.
The lack of an operations “ready state” baseline keeps the community from being able to share best practices – this has become a pressing need. I’d like to suggest OpenCrowbar, which sits outside of OpenStack, as a way to help provide an ops-neutral common starting point. Having the OpenStack developer community attempting to create an installer using OpenStack has proven a significant distraction and only further distances operators from the community.
We need to get past seeing the project primarily as a technology platform. Infrastructure software has to deliver value as an operational tool for enterprises. For OpenStack to thrive, we must make sure the needs of all constituents (Dev, Biz, Ops) are being addressed.