Ironic + Crowbar: United in Vision, Complementary in Approach

Posted on December 9, 2014 by Rob H

This post is co-authored by Devananda van der Veen, OpenStack Ironic PTL, and Rob Hirschfeld, OpenCrowbar Founder. We discuss how Ironic and Crowbar work together today and into the future.

Normalizing the APIs for hardware configuration is a noble and long-term goal. While the end result, a configured server, is very easy to describe, the differences between vendors’ hardware configuration tools are substantial. These differences make it ~~impossible~~ challenging to create repeatable operations automation (DevOps) on heterogeneous infrastructure.

Illustration to show potential changes in provisioning control flow over time.

The OpenStack Ironic project is a multi-vendor community solution to this problem at the server level. By providing a common API for server provisioning, Ironic encourages vendors to write drivers for their individual tooling such as iDRAC for Dell or iLO for HP.
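As a rough illustration of the “common API, vendor drivers” idea, here is a minimal Python sketch. It is not Ironic’s actual driver interface: the names `PowerDriver`, `FakeIdracDriver`, `FakeIloDriver` and `power_on_all` are hypothetical, and the drivers only return strings instead of talking to real hardware.

```python
from abc import ABC, abstractmethod


class PowerDriver(ABC):
    """Hypothetical common interface; NOT Ironic's real driver API."""

    @abstractmethod
    def set_power(self, node: str, on: bool) -> str:
        """Return a description of the vendor-specific action taken."""


class FakeIdracDriver(PowerDriver):
    # Stand-in for Dell iDRAC tooling; a real driver would call the BMC.
    def set_power(self, node: str, on: bool) -> str:
        return f"idrac:{node}:{'on' if on else 'off'}"


class FakeIloDriver(PowerDriver):
    # Stand-in for HP iLO tooling; also returns a string only.
    def set_power(self, node: str, on: bool) -> str:
        return f"ilo:{node}:{'on' if on else 'off'}"


def power_on_all(nodes: dict[str, PowerDriver]) -> list[str]:
    """Use one API for every node, whichever vendor driver backs it."""
    return [driver.set_power(name, True) for name, driver in nodes.items()]


if __name__ == "__main__":
    fleet = {"node1": FakeIdracDriver(), "node2": FakeIloDriver()}
    print(power_on_all(fleet))
```

The point of the sketch is that the caller never branches on the vendor; supporting another vendor means adding another driver class, not changing the caller.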

Ironic abstracts configuration and expects to be driven by an orchestration system that makes the decisions of how to configure each server. That type of orchestration is the heart of Crowbar’s physical ops magic [side note: 5 ways that physical ops is different from cloud].

The OpenCrowbar project created extensible orchestration to solve this problem at the system level. By decomposing system configuration into isolated functional actions, Crowbar can coordinate disparate configuration actions for servers, switches and between systems.

Today, the Provisioner component of Crowbar performs functions similar to Ironic’s for operating system installation and image lay down. Since configuration activity is tightly coupled with other Crowbar configuration, discovery and networking setup, it is difficult to isolate in the current code base. As Ironic progresses, it should be possible to shift these activities from the Provisioner to Ironic and take advantage of the community-based configuration drivers.

The immediate synergy between Crowbar and Ironic comes from accepting two modes of operation for OpenStack: bootstrapping infrastructure and multi-tenant server allocation.

Crowbar was designed as an operational platform that seeds an OpenStack ready environment. Once that environment is configured, OpenStack can take over ownership of the resources and allow Ironic to manage and deliver “hypervisor-free” servers for each tenant. In that way, we can accelerate the adoption of OpenStack for self-service metal.

Physical operations is messy and challenging, but we’re committed to working together to make it suck less. Operators of the world unite!

Physical Ops = Plumbers of the Internet. Celebrating dirty IT jobs 8 bit style

Posted on December 4, 2014 by Rob H

I must be crazy because I like to make products that take on the hard and thankless jobs in IT. It’s not glamorous, but someone needs to do them.

Analogies are required when explaining what I do to most people. For them, I’m not a specialist in physical data center operations, I’m an Internet plumber who is part of the team you call when your virtual toilet backs up. I’m good with that – it’s work that’s useful, messy and humble.

Plumbing, like the physical Internet, disappears from most people’s consciousness once it’s out of sight under the floor, cabinet or modem closet. And like plumbers, we can’t do physical ops without getting dirty. Unlike cloud-based ops with clean APIs and virtual services, you can’t do physical ops without touching something physical. Even if you’ve got great telepresence, you cannot get away from physical realities like NIC and SATA enumeration, BIOS management and network topology. I’m delighted that cloud has abstracted away that layer for most people but that does not mean we can ignore it.

Physical ops lacks the standardization of plumbing. There are many cross-vendor standards but innovation and vendor variation make consistency as unlikely as a unicorn winning the Rainbow Triple Crown.

For physical ops, it feels like we’re the internet’s most famous plumber, Mario, facing Donkey Kong. We’ve got to scale ladders, jump fireballs and swing between chains. The job is made harder because there are no half measures. Sometimes you can find the massive hammer and blast your way through but that’s just a short term thing.

Unfortunately, there’s a real enemy here: complexity.

Just like Donkey Kong keeps dashing off with the princess, operations continue to get more and more complex. Like with Mario, the solution is not to bypass the complexity; it’s to get better and faster at navigating the obstacles that get thrown at you. Physical ops is about self-reliance and adaptability. In that case, there are a lot of lessons to be learned from Mario.

If I’m an internet plumber then I’m happy to embrace Mario as my mascot. Plumbers of the internet unite!

Who’s the grown-up here? It’s the VM not the Iron!

Posted on December 3, 2014 by Rob H

This analogy exploring virtual vs. physical ops is a joint posting by Rob Hirschfeld, RackN, and Russell Doty, Red Hat.

Compared to provisioning physical servers, getting applications running in a virtual machine is like coaching an adult soccer team – the players are ready, you just have to get them to the field and set the game in motion. The physical servers can be compared to a grade school team – tremendous potential, but they can require a lot of coaching and intervention. And they don’t always play nice.

Russell Doty and I were geeking on the challenges of configuring physical servers when we realized that our friends in cloud just don’t have these problems. When they ask for a server, it’s delivered to them on a platter with an SLA. It’s a known configuration – calm, rational and well-behaved. By comparison, hardware is cranky, irregular and sporadic. To us, it sometimes feels like we are more in the babysitting business. Yes, we’ve had hardware with the colic!

Continuing the analogy, physical operations requires a degree of child-proofing and protection that is (thankfully) hidden behind cloud abstractions of hardware. More importantly, it requires a level of work that adults take for granted like diaper changes (bios/raid setup), food preparation (network configs), and self-entertainment (O/S updates).

And here’s where the analogy breaks down…

The irony here is that the adults (VMs) are the smaller, weaker part of the tribe. Not only that, the kids (physical servers) have to create the environment that the “adults” run on.

If you’re used to dealing with adults to get work done, you’re going to be in for a shock when you ask the kids to do the same job.

That’s why the cloud is such a productive platform for software. It’s an adults-only environment – the systems follow the rules and listen to your commands. Even further, cloud systems know how to dress themselves (get an O/S), rent an apartment (get an IP and connect) and even get credentials (get a driver’s license).

These “little things” that are taken for granted in the cloud are not automatic behaviors for physical infrastructure.

Of course, there are trade-offs – most notably performance and “scale up” scalability. The closer you need to get to hardware performance, on cpu, storage, or networks, the closer you need to get to the hardware.

It’s the classic case of standardizing vs. customization. And a question of how much time you are prepared to put into care and feeding!

To thrive, OpenStack must better balance dev, ops and business needs.

Posted on November 21, 2014 by Rob H

OpenStack has grown dramatically in many ways but we have failed to integrate development, operations and business communities in a balanced way.

My most urgent observation from Paris is that these three critical parts of the community are having vastly different dialogs about OpenStack.

At the Conference, business people were talking about core, stability and utility while the developers were talking about features, reorganizing and expanding projects. The operators, unfortunately segregated in a different location, were trying to figure out how to share best practices and tools.

Much of this structural divergence was intentional and should be (re)evaluated as we grow.

OpenStack events are split into distinct focus areas: the conference for business people, the summit for developers and specialized days for operators. While this design serves a purpose, the community needs to be taking extra steps to ensure communication. Without that communication, corporate sponsors and users may find it easier to solve problems inside their walls than outside in the community.

The risk is clear: vendors may find it easier to work on a fork where they have business and operational control than work within the community.

Inside the community, we are working to help resolve this challenge with several parallel efforts. As a community member, I challenge you to get involved in these efforts to ensure the project balances dev, biz and ops priorities. As a board member, I feel it’s a leadership challenge to make sure these efforts converge and that’s one of the reasons I’ve been working on several of these efforts:

OpenStack Project Managers (was Hidden Influencers) across companies in the ecosystem are getting organized into their own team. Since these managers effectively direct the majority of OpenStack developers, this group will allow
DefCore Committee works to define a smaller subset of the overall OpenStack Project that will be required for vendors using the OpenStack trademark and logo. This helps the business community focus on interoperability and stability.
Technical leadership (TC) led “Big Tent” concept aligns with DefCore work and attempts to create a stable base platform while making it easier for new projects to enter the ecosystem. I’ve got a lot to say about this, but frankly, without safeguards, this scares people in the ops and business communities.
The lack of an operations “ready state” baseline keeps the community from being able to share best practices – this has become a pressing need. I’d like to suggest OpenCrowbar, a project outside of OpenStack, as a way to help provide an ops-neutral common starting point. Having the OpenStack developer community attempting to create an installer using OpenStack has proven a significant distraction and only further distances operators from the community.

We need to get past seeing the project primarily as a technology platform. Infrastructure software has to deliver value as an operational tool for enterprises. For OpenStack to thrive, we must make sure the needs of all constituents (Dev, Biz, Ops) are being addressed.

Need a physical ops baseline? Crowbar continues to uniquely fill gap

Posted on October 14, 2014 by Rob H

I’ve been watching to see if other open “bare metal” projects would morph to match the system-level capabilities that we proved in Crowbar v1 and honed in the re-architecture of OpenCrowbar. The answer appears to be that Crowbar simply takes a broader approach to solving the physical ops repeatability problem.

Crowbar Architect Victor Lowther says “What makes Crowbar a better tool than Cobbler, Razor, or Foreman is that Crowbar has an orchestration engine that can be used to safely and repeatably deploy complex workloads across large numbers of machines. This is different from (and better than, IMO) just being able to hand responsibility off to Chef/Puppet/Salt, because we can manage the entire lifecycle of a machine where Cobbler, Razor and Chef cannot, we can describe how we want workloads configured at a more abstract level than Foreman can, and we do it all using the same API and UI.”

Since we started with a vision of an integrated system to address the “apply-rinse-repeat” cycle, it’s no surprise that Crowbar remains the only open platform that’s managed to crack the complete physical deployment life-cycle.

The Crowbar team realized that it’s not just about automation setting values: physical ops requires orchestration to make sure the values are set in the correct sequence on the appropriate control surface including DNS, DHCP, PXE, Monitoring, et cetera. Unlike architectures for aaS platforms, the heterogeneous nature of the physical control planes requires a different approach.

We’ve seen that making more and more complex kickstart scripts or golden images is not a sustainable solution. There is simply too much hardware variation and dependency thrash for operators to collaborate with those tools. Instead, we’ve found that decomposing the provisioning operations into functional layers with orchestration is much more multi-site repeatable.
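To show what “functional layers with orchestration” can look like, here is a minimal Python sketch that orders layers by their dependencies. The layer names and the dependency map are assumptions made up for illustration, not OpenCrowbar’s actual roles, and the standard library’s `graphlib` (Python 3.9+) stands in for a real orchestration engine.

```python
from graphlib import TopologicalSorter

# Hypothetical layers and dependencies, for illustration only;
# these are not OpenCrowbar's actual role names.
# Each key lists the layers that must finish before it can run.
DEPENDS_ON = {
    "bios-config": {"discovery"},
    "raid-config": {"discovery"},
    "network-config": {"discovery"},
    "os-install": {"bios-config", "raid-config", "network-config"},
    "monitoring": {"os-install"},
}


def plan(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every layer runs after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step in plan(DEPENDS_ON):
        print(step)
```

Because each layer only declares what it needs, a site with different hardware can swap or add a layer without rewriting one giant kickstart script; a circular dependency raises `graphlib.CycleError` instead of silently running in the wrong order.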

Accepting that physical ops (discovered infrastructure) is fundamentally different from cloud ops (created infrastructure) has been critical to architecting platforms that were resilient enough for the heterogeneous infrastructure of data centers.

If we want to start cleaning up physical ops, we need to stop looking at operating system provisioning in isolation and start looking at the full server bring up as just a part of a broader system operation that includes networking, management and operational integration.

Apply, Rinse, Repeat! How do I get that DevOps conditioner out of my hair?

Posted on October 2, 2014 by Rob H

I’ve been trying to explain the ~~pain~~ Tao of physical ops in a way that’s accessible to people without scale ops experience. It comes down to a yin-yang of two elements: exploding complexity and iterative learning.

Exploding complexity is pretty easy to grasp when we stack up the number of control elements inside a single server (OS RAID, 2 SSD cache levels, 20 disk JBOD, and UEFI oh dear), the networks that server is connected to, the multi-layer applications installed on the servers, and the change rate of those applications. Multiply that times 100s of servers and we’ve got a problem of unbounded scope even before I throw in SDN overlays.

But that’s not the real challenge! The bigger problem is that it’s impossible to design for all those parameters in advance.

When my team started doing scale installs 5 years ago, we assumed we could ship a preconfigured system. After a year of trying, we accepted the reality that it’s impossible to plan out a scale deployment; instead, we had to embrace a change tolerant approach that I’ve started calling “Apply, Rinse, Repeat.”

Using Crowbar to embrace the in-field nature of design, we discovered a recurring pattern of installs: we always performed at least three full cycle installs to get to ready state during every deployment.

The first cycle was completely generic to provide a working baseline and validate the physical environment.
The second cycle attempted to integrate to the operational environment and helped identify gaps and needed changes.
The third cycle could usually interconnect with the environment and generally exposed new requirements in the external environment.
The subsequent cycles represented additional tuning, patches or redesigns that could only be realized after load was applied to the system in situ.

Every time we tried to shortcut the Apply-Rinse-Repeat cycle, it actually made the total installation longer! Ultimately, we accepted that the only defense was to focus on reducing A-R-R cycle time so that we could spend more time learning before the next cycle started.

To improve flow, we must view OpenStack community as a Software Factory

Posted on September 15, 2014 by Rob H

This post was sparked by a conversation at OpenStack Atlanta between OpenStack Foundation board members Todd Moore (IBM) and Rob Hirschfeld (Dell/Community). We share a background in industrial and software process and felt that lean manufacturing experience translates directly to helping face OpenStack challenges.

While OpenStack has done an amazing job of growing contributors, scale has caused our code flow processes to be bottlenecked at the review stage. This blocks flow throughout the entire system and presents a significant risk to both stability and feature addition. Flow failures can ultimately lead to vendor forking.

Fundamentally, Todd and I felt that OpenStack needs to address system flows to build an integrated product. The post expands on the “hidden influencers” issue and adds an additional challenge because improving flow requires that the community’s influencers better understand the need to optimize inter-project work in a more systematic way.

Let’s start by visualizing the “OpenStack Factory”

Factory Floor from Alpha Industries Wikipedia page

Imagine all of OpenStack’s 1000s of developers working together in a single giant start-up warehouse. Each project in its own floor area with appropriate fooz tables, break areas and coffee bars. It’s easy to visualize clusters of intent developers talking around tables or coding in dark corners while PTLs and TC members dash between groups coordinating work.

Expand the visualization so that we can actually see the code flowing between teams as little colored boxes. Giving each project a unique color allows us to quickly see dependencies between teams. Some features are piled up waiting for review inside teams while others are sitting on pallets between projects, waiting on needed cross-project features that have not been completed. At release time, we’d be able to see PTLs sorting through stacks of completed boxes to pick which ones were ready to ship.

Watching a factory floor from above is a humbling experience and a key feature of systems thinking enlightenment in both The Phoenix Project and The Goal. It’s very easy to be caught up in a single project (local optimization) and miss the broader system implications of local choices.

There is a large body of work about Lean Process for Manufacturing

You’ve already visualized OpenStack code creation as a manufacturing floor: it’s a small step to accept that we can use the same proven processes for software and physical manufacturing.

As features move between teams (work centers), it becomes obvious that we’ve created a very highly interlocked sequence of component steps needed to deliver product; unfortunately, we have minimal coordination between the owners of the work centers. If a feature needs a critical resource (think programmer) to progress then we rely on the resource to allocate time to the work. Since that person’s manager may not agree to the priority, we have a conflict between system flow and individual optimization.

That conflict destroys flow in the system.

The #1 lesson from lean manufacturing is that putting individual optimization over system optimization reduces throughput. Since our product and people managers are often competitors, we need to work doubly hard to address system concerns. Worse yet, our inventory of work in process and the interdependencies between projects are harder to discern. Unlike on the manufacturing floor, our developers and project leads cannot look down upon it and see the physical work as it progresses from station to station in one single holistic view. The bottlenecks that throttle the OpenStack workflow are harder to see but we can find them, as can be demonstrated later in this post.

Until we can engage the resource owners in balancing system flow, OpenStack’s throughput will decline as we add resources. This same principle is at play in the famous aphorism: adding developers makes a late project later.

Is there a solution?

There are lessons from Lean Manufacturing that can be applied

Make quality a priority (expand tests from function to integration)
Ensure integration from station to station (prioritize working together over features)
Make sure that owners of work are coordinating (expose hidden influencers)
Find and manage from the bottleneck (classic Lean says find the bottleneck and improve that)
Create and monitor a system view
Have everyone value finished product, not workstation output
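The bottleneck lesson can be sketched with a toy model in Python. The station names and weekly rates below are invented for illustration and are not measured OpenStack data; the model only assumes that work passes through the stations in sequence, so the slowest station caps what ships.

```python
# Hypothetical weekly capacity of each station, in changes per week.
STATIONS = {"code": 120, "review": 40, "gate": 90, "release": 100}


def bottleneck(rates: dict[str, int]) -> tuple[str, int]:
    """Return the slowest station and its rate, which caps end-to-end throughput."""
    name = min(rates, key=rates.get)
    return name, rates[name]


def backlog_growth(rates: dict[str, int], upstream: str, downstream: str) -> int:
    """Changes per week that pile up between two adjacent stations."""
    return max(0, rates[upstream] - rates[downstream])


if __name__ == "__main__":
    print(bottleneck(STATIONS))
    print(backlog_growth(STATIONS, "code", "review"))

    # Adding developers raises only the "code" rate in this model.
    more_devs = dict(STATIONS, code=240)
    print(bottleneck(more_devs))
    print(backlog_growth(more_devs, "code", "review"))
```

In this simple model, doubling the coding rate leaves throughput pinned at the review rate and only makes the pile waiting for review grow faster. It does not capture coordination overhead, which is where an actual decline in throughput would come from.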

Added Postscript: I highly recommend reading Daniel Berrange’s email about this.

Boot me up! out-of-band IPMI rocks then shuts up and waits

Posted on July 16, 2014 by Rob H

It’s hard to get excited about re-implementing functionality from v1 unless the v2 happens to also be freaking awesome. It’s awesome because the OpenCrowbar architecture allows us to do it “the right way” with real out-of-band controls against the open WSMAN APIs.

With out-of-band control, we can easily turn systems on and off using OpenCrowbar orchestration. This means that it’s now standard practice to power off nodes after discovery & inventory until they are ready for OS installation. This is especially interesting because many servers’ RAID and BIOS can be configured out-of-band without powering on at all.

Frankly, Crowbar 1 (cutting edge in 2011) was a bit hacky. All of the WSMAN control was done in-band but looped through a gateway on the admin server so we could access the out-of-band API. We also used the vendor (Dell) tools instead of open API sets.

That means that OpenCrowbar hardware configuration is truly multi-vendor. I’ve got Dell & SuperMicro servers booting and out-of-band managed. Want more vendors? I’ll give you my shipping address.

OpenCrowbar does this out of the box and in the open so that everyone can participate. That’s how we solve this problem as an industry and start to cope with hardware snowflaking.

And this out-of-band management gets even more interesting…

Since we’re talking to servers out-of-band (without the server being “on”) we can configure systems before they are even booted for provisioning. Since OpenCrowbar does not require a discovery boot, you could pre-populate all your configurations via the API and have the Disk and BIOS settings ready before they are even booted (for models like the Dell iDRAC where the BMCs start immediately on power connect).

Those are my favorite features, but there’s more to love:

The new design does not require a network gateway (v1 did) between the admin and BMC networks (which was a security issue).
The configuration detects and preserves existing assigned IPs. This is a big deal in lab configurations where you are reusing the same machines and have scripted remote consoles.
OpenCrowbar offers an API to turn machines on/off using the out-of-band BMC network.
The system detects if nodes have IPMI (VMs & containers do not) and skips configuration BUT still manages to have power control using SSH (and could use VM APIs in the future).
Of course, we automatically set up the BMC network based on your desired configuration.
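For readers who have not driven a BMC directly, here is a small Python sketch of out-of-band power control with an SSH fallback. It is not OpenCrowbar’s API: it only builds command lines for the widely used `ipmitool` CLI (`-I lanplus`, `-H`, `-U`, `-E` and `chassis power` are real ipmitool options), while the function names, the `admin` user and the plain `ssh <host> poweroff` fallback are assumptions for illustration.

```python
from typing import Optional

VALID_ACTIONS = {"on", "off", "status"}


def ipmi_power_command(bmc_host: str, user: str, action: str) -> list[str]:
    """Build an ipmitool command line for out-of-band power control.

    -E tells ipmitool to read the password from the IPMI_PASSWORD
    environment variable, so it never appears in the argument list.
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-E", "chassis", "power", action]


def power_off_command(host: str, bmc_host: Optional[str] = None,
                      user: str = "admin") -> list[str]:
    """Prefer the BMC; fall back to SSH for nodes without one (VMs, containers)."""
    if bmc_host:
        return ipmi_power_command(bmc_host, user, "off")
    # Hypothetical in-band fallback; only works while the OS is running.
    return ["ssh", host, "poweroff"]


if __name__ == "__main__":
    print(power_off_command("node1", bmc_host="10.0.0.5"))
    print(power_off_command("vm1"))
```

The sketch returns the command instead of running it, so it can be inspected or handed to `subprocess.run` once real addresses and credentials are in place. Note that the SSH path can only power a node off; turning it back on still needs a BMC or a VM API.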

a Ready State analogy: “roughed in” brings it Home for non-ops-nerds

Posted on July 15, 2014 by Rob H

I’ve been seeing great acceptance on the concept of ops Ready State. Technologists from both ops and dev immediately understand the need to “draw a line in the sand” between system prep and installation. We also admit that getting physical infrastructure to Ready State is largely taken for granted; however, it often takes multiple attempts to get it right and even small application changes can require a full system rebuild.

Since even small changes can redefine the ready state requirements, changing Ready State can feel like being told to tear down your house so you can remodel the kitchen.

A friend asked me to explain “Ready State” in non-technical terms. So far, the best analogy that I’ve found is when a house is “Roughed In.” It’s helpful if you’ve ever been part of house construction but may not be universally accessible so I’ll explain.

Getting to Rough In means that all of the basic infrastructure of the house is in place but nothing is finished. The foundation is poured, the plumbing lines are placed, the electrical mains are ready, the roof is on and the walls are up. The house is being built according to architectural plans, and major decisions like how many rooms there are and the function of the rooms (bathroom, kitchen, great room, etc.) have been made. For Ready State, that’s like having the servers racked and set up with Disk, BIOS, and network configured.

While we’ve built a lot, rough in is a relatively early milestone in construction. Even major items like type of roof, siding and windows can still be changed. Speaking of windows, this is like installing an operating system in Ready State. We want to consider this as a distinct milestone because there’s still room to make changes. Once the roof and exteriors are added, it becomes much more disruptive and expensive to make changes.

Once the house is roughed in, the finishing work begins. Almost nothing from roughed in will be visible to the people living in the house. Like a Ready State setup, the users interact with what gets laid on top of the infrastructure. For homes it’s the walls, counters, fixtures and so on. For operators, it’s applications like Hadoop, OpenStack or CloudFoundry.

Taking this analogy back to where we started, what if we could make rebuilding an entire house take just a day?! In construction, that’s simply not practical; however, we’re getting to a place in Ops where automation makes it possible to reconstruct the infrastructure configuration much faster.

While we can’t re-pour the foundation (aka swap out physical gear) instantly, we should be able to build up from there to ready state in a much more repeatable way.

SDN’s got Blind Spots! What are these Projects Ignoring? [Guest Post by Scott Jensen]

Posted on July 8, 2014 by Rob H

Scott Jensen returns as a guest poster about SDN! I’m delighted to share his pointed insights that expand on his previous two-part series about NFV and SDN. I especially like his Rumsfeldian “unknowable workloads.”

In my [Scott’s] last post, I talked about why SDN is important in cloud environments; however, I’d like to challenge the underlying assumption that SDN cures all ops problems.

The SDN implementations which I have looked at make the following base assumption about the physical network. From the OpenContrail documentation:

The role of the physical underlay network is to provide an “IP fabric” – its responsibility is to provide unicast IP connectivity from any physical device (server, storage device, router, or switch) to any other physical device. An ideal underlay network provides uniform low-latency, non-blocking, high-bandwidth connectivity from any point in the network to any other point in the network.

The basic idea is to build an overlay network on top of the physical network in order to utilize a variety of protocols (Netflow, VLAN, VXLAN, MPLS etc.) and build the networking infrastructure which is needed by the applications and more importantly allow the applications to modify this virtual infrastructure to build the constructs that they need to operate correctly.

All well and good; however, what about the Physical Networks?

That is where you will run into bandwidth issues, QOS issues, latency differences and where the rubber really meets the road. Ignoring the physical network’s configuration can (and probably will) cause the entire system to perform poorly.

Does it make sense to just assume that you have uniform low latency connectivity to all points in the network? In many cases, it does not. For example:

Accesses to storage arrays have a different traffic pattern than a distributed storage system.
Compute resources which are used to house VMs which are running web applications are different than those which run database applications.
Some applications are specifically sensitive to certain networking issues such as available bandwidth, Jitter, Latency and so forth.
Others will perform actions over the network at certain times of the day but then will not require the network resources for the rest of the day. Classic examples of this are system backups or replication events.

If it is truly unknown how the infrastructure you are trying to implement will be utilized, then you may have no choice other than to over-provision the physical network. In building a public cloud, where the users will run whichever applications they wish, it may not be possible to engineer the appropriate traffic patterns.

This unknowable workload is exactly what these types of SDN projects are trying to target!

When designing these systems you do have a good idea of how it will be utilized or at least how specific portions of the system will be utilized and you need to account for that when building up the physical network under the SDN.
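One common way to put a number on how far a physical network is from the “non-blocking” ideal is the oversubscription ratio of a switch: total server-facing bandwidth divided by total uplink bandwidth. The Python sketch below is a back-of-the-envelope aid, not part of any SDN project; the function names, port counts and speeds are assumptions chosen for illustration.

```python
def oversubscription(server_ports: int, server_gbps: int,
                     uplinks: int, uplink_gbps: int) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth on one switch."""
    downstream = server_ports * server_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


def is_non_blocking(server_ports: int, server_gbps: int,
                    uplinks: int, uplink_gbps: int) -> bool:
    """True when the uplinks can carry every server port at full rate."""
    return oversubscription(server_ports, server_gbps, uplinks, uplink_gbps) <= 1.0


if __name__ == "__main__":
    # Hypothetical top-of-rack switch: 48 x 10G server ports, 4 x 40G uplinks.
    print(oversubscription(48, 10, 4, 40))
    # The same switch with 12 x 40G uplinks.
    print(is_non_blocking(48, 10, 12, 40))
```

A ratio of 3.0 may be fine for web servers and painful for a distributed storage system, which is why knowing how portions of the system will be used should feed the physical design under the overlay.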

It is my belief that SDN applications should not just create an overlay. That is part of the story, but they should also take into account the physical infrastructure and assist with modifying the configuration of the physical devices. This balance achieves the best use of the network for both the applications which are running in the environment AND for the systems which they run on or rely upon for their operations.

We need to reframe our thinking about SDN because we cannot just keep assuming that the speeds of the network will follow Moore’s Law and that you can assume that the Network is an unlimited resource.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture
