Boot me up! out-of-band IPMI rocks then shuts up and waits

It’s hard to get excited about re-implementing functionality from v1 unless the v2 happens to also be freaking awesome.  It’s awesome because the OpenCrowbar architecture allows us to do it “the right way” with real out-of-band controls against the open WSMAN APIs.

With out-of-band control, we can easily turn systems on and off using OpenCrowbar orchestration.  This means that it’s now standard practice to power off nodes after discovery & inventory until they are ready for OS installation.  This is especially interesting because many servers’ RAID and BIOS can be configured out-of-band without powering on at all.

Frankly, Crowbar 1 (cutting edge in 2011) was a bit hacky.  All of the WSMAN control was done in-band but looped through a gateway on the admin server so we could access the out-of-band API.  We also used the vendor (Dell) tools instead of open API sets.

Because v2 uses the open APIs, OpenCrowbar hardware configuration is truly multi-vendor.  I’ve got Dell & SuperMicro servers booting and out-of-band managed.  Want more vendors?  I’ll give you my shipping address.

OpenCrowbar does this out of the box and in the open so that everyone can participate.  That’s how we solve this problem as an industry and start to cope with hardware snowflaking.

And this out-of-band management gets even more interesting…

Since we’re talking to servers out-of-band (without the server being “on”) we can configure systems before they are even booted for provisioning.  Since OpenCrowbar does not require a discovery boot, you could pre-populate all your configurations via the API and have the Disk and BIOS settings ready before they are even booted (for models like the Dell iDRAC where the BMCs start immediately on power connect).

Those are my favorite features, but there’s more to love:

  • The new design does not require a network gateway between the admin and BMC networks (v1 did, which was a security issue).
  • The configuration detects and preserves existing assigned IPs.  This is a big deal in lab configurations where you are reusing the same machines and have scripted remote consoles.
  • OpenCrowbar offers an API to turn machines on/off using the out-of-band BMC network.
  • The system detects whether nodes have IPMI (VMs & containers do not) and, if not, skips configuration BUT still manages power control using SSH (and could use VM APIs in the future).
  • Of course, we automatically set up the BMC network based on your desired configuration.
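To make the IPMI-versus-SSH fallback concrete, here is a minimal sketch in Python.  It is not OpenCrowbar code: the function name, the default user and the decision to refuse power-on over SSH are all assumptions for illustration.  The only real tooling it relies on is the standard ipmitool command line, and it only builds the command rather than running it.

```python
from typing import List, Optional


def power_command(action: str, bmc_host: Optional[str], node_host: str,
                  user: str = "ADMIN") -> List[str]:
    """Build (but do not run) a command line that powers a node on or off.

    Hypothetical helper: nodes with a BMC address are controlled
    out-of-band with ipmitool; nodes without one (VMs, containers)
    fall back to SSH, which can only shut down a running system.
    """
    if action not in ("on", "off"):
        raise ValueError("action must be 'on' or 'off'")
    if bmc_host is not None:
        # Out-of-band: talk to the BMC over the lanplus (IPMI v2.0) interface.
        # -E makes ipmitool read the password from IPMI_PASSWORD.
        return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
                "-E", "chassis", "power", action]
    if action == "off":
        # In-band fallback: ask the operating system to shut itself down.
        return ["ssh", f"root@{node_host}", "poweroff"]
    # Assumption: without a BMC or a VM API there is nothing to wake the node.
    raise ValueError("cannot power on a node without out-of-band access")
```

A caller would hand the returned list to something like subprocess.run; keeping the command construction separate makes the fallback logic easy to test without touching real hardware.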

 

a Ready State analogy: “roughed in” brings it Home for non-ops-nerds

I’ve been seeing great acceptance on the concept of ops Ready State.  Technologists from both ops and dev immediately understand the need to “draw a line in the sand” between system prep and installation.  We also admit that getting physical infrastructure to Ready State is largely taken for granted; however, it often takes multiple attempts to get it right and even small application changes can require a full system rebuild.

Since even small changes can redefine the Ready State requirements, changing Ready State can feel like being told to tear down your house so you can remodel the kitchen.

A friend asked me to explain “Ready State” in non-technical terms.  So far, the best analogy that I’ve found is when a house is “Roughed In.”  It’s helpful if you’ve ever been part of house construction but may not be universally accessible so I’ll explain.

Getting to Rough In means that all of the basic infrastructure of the house is in place but nothing is finished.  The foundation is poured, the plumbing lines are placed, the electrical mains are ready, the roof is on and the walls are up.  The house is being built according to architectural plans, and major decisions like how many rooms there are and the function of the rooms (bathroom, kitchen, great room, etc.) have been made.  For Ready State, that’s like having the servers racked and set up with Disk, BIOS, and network configured.

While we’ve built a lot, rough in is a relatively early milestone in construction.  Even major items like type of roof, siding and windows can still be changed.  Speaking of windows, this is like installing an operating system in Ready State.  We want to consider this as a distinct milestone because there’s still room to make changes.  Once the roof and exteriors are added, it becomes much more disruptive and expensive to make changes.

Once the house is roughed in, the finishing work begins.  Almost nothing from roughed in will be visible to the people living in the house.  Like a Ready State setup, the users interact with what gets laid on top of the infrastructure.  For homes it’s the walls, counters, fixtures and flooring.  For operators, it’s applications like Hadoop, OpenStack or CloudFoundry.

Taking this analogy back to where we started, what if we could make rebuilding an entire house take just a day?!  In construction, that’s simply not practical; however, we’re getting to a place in Ops where automation makes it possible to reconstruct the infrastructure configuration much faster.

While we can’t re-pour the foundation (aka swap out physical gear) instantly, we should be able to build up from there to ready state in a much more repeatable way.

SDN’s got Blind Spots! What are these Projects Ignoring? [Guest Post by Scott Jensen]

Scott Jensen returns as a guest poster about SDN!  I’m delighted to share his pointed insights that expand on his previous two-part series about NFV and SDN.  I especially like his Rumsfeldian “unknowable workloads.”

In my [Scott's] last post, I talked about why SDN is important in cloud environments; however, I’d like to challenge the underlying assumption that SDN cures all ops problems.

The SDN implementations that I have looked at make the following base assumption about the physical network.  From the OpenContrail documentation:

The role of the physical underlay network is to provide an “IP fabric” – its responsibility is to provide unicast IP connectivity from any physical device (server, storage device, router, or switch) to any other physical device. An ideal underlay network provides uniform low-latency, non-blocking, high-bandwidth connectivity from any point in the network to any other point in the network.

The basic idea is to build an overlay network on top of the physical network in order to utilize a variety of protocols (Netflow, VLAN, VXLAN, MPLS etc.) and build the networking infrastructure which is needed by the applications and more importantly allow the applications to modify this virtual infrastructure to build the constructs that they need to operate correctly.

All well and good; however, what about the Physical Networks?

That is where you will run into bandwidth issues, QoS issues and latency differences; it is where the rubber really meets the road.  Ignoring the physical network’s configuration can (and probably will) cause the entire system to perform poorly.

Does it make sense to just assume that you have uniform low latency connectivity to all points in the network?  In many cases, it does not.  For example:

  • Accesses to storage arrays have a different traffic pattern than a distributed storage system.
  • Compute resources hosting VMs that run web applications behave differently than those hosting database applications.
  • Some applications are specifically sensitive to certain networking issues such as available bandwidth, jitter, latency and so forth.
  • Others perform actions over the network at certain times of the day but do not require the network resources for the rest of the day.  Classic examples of this are system backups or replication events.

If you truly do not know how the infrastructure you are trying to implement will be utilized, then you may have no choice but to over-provision the physical network.  In building a public cloud, the users will run whichever applications they wish, so it may not be possible to engineer the appropriate traffic patterns.

This unknowable workload is exactly what these types of SDN projects are trying to target!

In most other cases, however, when designing these systems you do have a good idea of how they will be utilized, or at least how specific portions of the system will be utilized, and you need to account for that when building up the physical network under the SDN.

It is my belief that SDN applications should not just create an overlay.  That is part of the story, but they should also take into account the physical infrastructure and assist with modifying the configuration of the physical devices.  This balance achieves the best use of the network for both the applications which are running in the environment AND for the systems which they run on or rely upon for their operations.

We need to reframe our thinking about SDN because we cannot just keep assuming that the speeds of the network will follow Moore’s Law and that the network is an unlimited resource.

Ops Bridges > Building a Sharable Ops Infrastructure with Composable Tool Chain Orchestration

This post started from a discussion with Judd Maltin that he documented in a post about “wanting a composable run deck.”

I’ve had several conversations comparing OpenCrowbar with other “bare metal provisioning” tools that do things like serve golden images to a PXE or IPXE server to help bootstrap deployments.  While those are handy tools, they do nothing to really help operators drive system-wide operations; consequently, they have a limited system impact/utility.

In building the new architecture of OpenCrowbar (aka Crowbar v2), we heard very clearly that we should have “less magic” in the system.  We took that advice very seriously to make sure that Crowbar was a system layer that works with, not a replacement for, standard operations tools.

Specifically, node boot & kickstart alone is just not that exciting.  It’s a combination of DHCP, PXE, HTTP and TFTP or DHCP and an IPXE HTTP Server.  It’s a pain to set this up, but I don’t really get excited about it anymore.  In fact, you can pretty much use open ops scripts (Chef) to set up these services because it’s cut-and-dried operational work.

Note: Setting up the networking to make it all work is perhaps a different question and one that few platforms bother talking about.

So, if doing node provisioning is not a big deal then why is OpenCrowbar important?  Because sustaining operations is about ongoing system orchestration (we’d say an “operations model”) that starts with provisioning.

It’s not the individual services that are critical; it’s doing them in a system-wide sequence that’s vital.
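One simple way to picture a system-wide sequence is as a dependency graph that gets sorted before anything runs.  The sketch below is only an illustration and does not describe how Crowbar orders its work internally: the service names and their dependencies are assumptions, and the sorting comes from Python’s standard graphlib module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: each service maps to the services it needs first.
DEPENDS_ON = {
    "dns": {"network"},
    "dhcp": {"network", "dns"},
    "tftp": {"network"},
    "http": {"network"},
    "pxe-boot": {"dhcp", "tftp", "http"},
    "os-install": {"pxe-boot"},
}


def run_order(depends_on):
    """Return one valid order in which every service follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


ORDER = run_order(DEPENDS_ON)
```

Each service is still configured by its own standard tooling; the only thing the graph adds is the guarantee that, for example, PXE boot is never attempted before DHCP, TFTP and HTTP are in place.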

Crowbar does NOT REPLACE the services.  In fact, we go out of our way to keep your proven operations tool chain.  We don’t want operators to troubleshoot our IPXE code!  We’d much rather use the standard stuff and orchestrate the configuration in a predictable way.

In that way, OpenCrowbar embraces and composes the existing operations tool chain into an integrated system of tools.  We always avoid replacing tools.  That’s why we use Chef for our DSL instead of adding something new.

What does that leave for Crowbar?  Crowbar provides orchestration targeted at physical infrastructure (we call it “the Annealer”) that coordinates this tool chain to work as a system.  It’s the system perspective that’s critical because it allows all of the operational services to work together.

For example, when a node is added then we have to create v4 and v6 IP address entries for it in DNS.  This is required because secure infrastructure requires reverse DNS.  If you change the name of that node or add an alias, Crowbar again needs to update the DNS.  This has to happen in the right sequence.  If you create a new virtual interface for that node then, again, you need to update DNS.  This type of operational housekeeping is essential and must be performed in the correct sequence at the right time.
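As a rough illustration of that housekeeping, the Python sketch below works out the forward and reverse entries a node would need.  The function name and the tuple layout are assumptions for this example, not Crowbar’s data model; the reverse names come from the standard ipaddress module.

```python
import ipaddress
from typing import List, Tuple


def dns_records(fqdn: str, addresses: List[str]) -> List[Tuple[str, str, str]]:
    """Return (name, type, value) tuples: forward entries first, then reverse.

    Hypothetical helper: every address gets an A or AAAA entry plus the
    matching PTR entry so that reverse DNS resolves back to the node.
    """
    forward, reverse = [], []
    for text in addresses:
        ip = ipaddress.ip_address(text)
        record_type = "A" if ip.version == 4 else "AAAA"
        forward.append((fqdn, record_type, str(ip)))
        # reverse_pointer yields the in-addr.arpa / ip6.arpa name for the PTR.
        reverse.append((ip.reverse_pointer, "PTR", fqdn))
    return forward + reverse
```

Renaming the node or adding an interface means recomputing this list and applying the difference, which is exactly the kind of step that has to be sequenced with everything else.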

The critical insight is that Crowbar works transparently alongside your existing operational services with proven configuration management tools.  Crowbar connects links in your tool chain but keeps you in the driver’s seat.

OpenCrowbar stands up 100 node community challenge

OpenCrowbar community contributors are offering a “100 Node Challenge” by volunteering to set up a 100+ node Crowbar system to prove out the v2 architecture at scale.  We picked 100* nodes since we wanted to firmly break the Crowbar v1 upper ceiling.

The goal of the challenge is to prove scale of the core provisioning cycle.  It’s intended to be a short action (less than a week) so we’ll need advance information about the hardware configuration.  The expectation is to do a full RAID/Disk hardware configuration beyond the base IPMI config before laying down the operating system.

The challenge logistics start with an off-site prep discussion of the particulars of the deployment, then installing OpenCrowbar at the site and deploying the node century.  We will also work with you about using OpenCrowbar to manage the environment going forward.

Sound too good to be true?  Well, as community members are doing this on their own time, we are only planning one challenge candidate and want to find the right target.

We will not be planning custom code changes to support the deployment; however, we would be happy to work with you in the community to support your needs.  If you want help to sustain the environment or have longer term plans, I have also been approached by community members who are willing to take on full or part-time Crowbar consulting engagements.

Let’s get rack’n!

* we’ll consider smaller clusters but you have to buy the drinks and pizza.

You need a Squid Proxy fabric! Getting Ready State Best Practices

Sometimes solving a small problem well makes a huge impact for operators.  Talking to operators, it appears that automated configuration of Squid does exactly that.

If you were installing OpenStack or Hadoop, you would not find “setup a squid proxy fabric to optimize your package downloads” in the install guide.   That’s simply out of scope for those guides; however, it’s essential operational guidance.  That’s what I mean by open operations and creating a platform for sharing best practice.

Deploying a base operating system (e.g.: CentOS) on a lot of nodes creates bit-tons of identical internet traffic.  By default, each node will attempt to reach internet mirrors for packages.  If you multiply that by even 10 nodes, that’s a lot of traffic and a significant performance impact if your connection is limited.

For OpenCrowbar developers, the external package resolution means that each dev/test cycle with a node boot (which can happen 10+ times a day) is bottlenecked.  For QA and install, the problem is even worse!

Our solution was 1) to embed Squid proxies into the configured environments and 2) to automatically configure nodes to use the proxies.  By making this behavior the default, we improve the overall performance of a deployment.  This further improves the overall network topology of the operating environment while adding improved control of traffic.
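For a sense of what “configure nodes to use the proxies” amounts to, here is a small Python sketch that renders the usual client-side settings.  It is not the Chef recipe Crowbar ships: the function name and the choice of which files to cover are assumptions, and the port is simply Squid’s default.

```python
from typing import Dict

SQUID_PORT = 3128  # Squid's default listening port


def proxy_settings(proxy_host: str, port: int = SQUID_PORT) -> Dict[str, str]:
    """Render the lines a node needs so package downloads go through the proxy.

    Hypothetical helper: returns one line per place a client typically
    looks for an HTTP proxy.
    """
    url = f"http://{proxy_host}:{port}"
    return {
        "env": f"http_proxy={url}",                     # shell environment
        "yum.conf": f"proxy={url}",                     # yum on CentOS/RHEL
        "apt.conf": f'Acquire::http::Proxy "{url}";',   # apt on Debian/Ubuntu
    }
```

With every node pointed at the same cache, a package is fetched from the internet mirror once and served locally to every node that asks for it afterwards.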

This is a great example of how Crowbar uses existing operational tool chains (Chef configures Squid) in best practice ways to solve operations problems.  The magic is not in the tool or the configuration, it’s that we’ve included it in our out-of-the-box default orchestrations.

It’s time to stop fumbling around in the operational dark.  We need to compose our tool chains in an automated way!  This is how we advance operational best practice for ready state infrastructure.

OpenCrowbar Design Principles: Attribute Injection [Series 6 of 6]

This is part 6 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar.  The content is reposted from the OpenCrowbar docs repo.

Attribute Injection

Attribute Injection is an essential aspect of the “FuncOps” story because it helps create the clean boundaries needed to implement consistent scripting behavior between divergent sites.

It also allows Crowbar to abstract and isolate provisioning layers. This operational approach means that deployments are composed of layered services (see emergent services) instead of locked “golden” images. The layers can be maintained independently and allow users to compose specific configurations à la carte. This approach works if the layers have clean functional boundaries (FuncOps) that can be scoped and managed atomically.

To explain how Attribute Injection accomplishes this, we need to explore why search became an anti-pattern in Crowbar v1. Originally, being able to use server based search functions in operational scripting was a critical feature. It allowed individual nodes to act as part of a system by searching for global information needed to make local decisions. This greatly aided Crowbar’s mission of system level configuration; however, it also created significant hidden interdependencies between scripts. As Crowbar v1 grew in complexity, searches became more and more difficult to maintain because they were difficult to correctly scope, hard to centrally manage and prone to timing issues.

Crowbar was not unique in dealing with this problem – the Attribute Injection pattern has become a preferred alternative to search in integrated community cookbooks.

Attribute Injection in OpenCrowbar works by establishing specific inputs and outputs for all state actions (NodeRole runs). By declaring the exact inputs needed and outputs provided, Crowbar can better manage each annealing operation. This control includes deployment scoping boundaries, time sequence of information plus override and substitution of inputs based on execution paths.

This concept is not unique to Crowbar. It has become best practice for operational scripts. Crowbar simply extends the paradigm to the system level and orchestration level.

Attribute Injection enables operations to be:

  • Atomic – only the information needed for the operation is provided so risk of “bleed over” between scripts is minimized. This is also a functional programming preference.
  • Isolated Idempotent – risk of accidentally picking up changed information from previous runs is reduced by controlling the inputs. That makes it more likely that scripts can be idempotent.
  • Cleanly Scoped – information passed into operations can be limited based on system deployment boundaries instead of search parameters. This allows the orchestration to manage when and how information is added into configurations.
  • Easy to troubleshoot – since the information is limited and controlled, it is easier to recreate runs for troubleshooting. This is a substantial value for diagnostics.
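The pattern can be sketched in a few lines of Python.  This is an illustration of declared inputs and outputs in general, not OpenCrowbar’s NodeRole implementation: the function names, the attribute keys and the error handling are all assumptions.

```python
from typing import Any, Callable, Dict, Iterable

Attrs = Dict[str, Any]


def run_with_injection(script: Callable[[Attrs], Attrs],
                       wants: Iterable[str],
                       provides: Iterable[str],
                       system_state: Attrs) -> Attrs:
    """Run a script with only its declared inputs and check its outputs."""
    wants = list(wants)
    missing = [key for key in wants if key not in system_state]
    if missing:
        raise KeyError(f"declared inputs not available: {missing}")
    # Inject a copy holding just the declared keys; nothing else is visible.
    inputs = {key: system_state[key] for key in wants}
    outputs = script(inputs)
    undeclared = set(outputs) - set(provides)
    if undeclared:
        raise ValueError(f"undeclared outputs: {sorted(undeclared)}")
    return outputs


def configure_ntp(attrs: Attrs) -> Attrs:
    """Example script: it can read only the attributes it was given."""
    return {"ntp.configured_servers": sorted(attrs["ntp.servers"])}
```

Because the script never searches the global state, a run can be reproduced for troubleshooting by replaying the same small input dictionary.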