Ops is Ops, except when it ain’t. Breaking down the impedance mismatches between physical and cloud ops.

We’ve made great strides in ops automation, but there’s no one-size-fits-all approach to ops because abstractions have limitations.

IMG_20141108_035537967Perhaps it’s my Industrial Engineering background, I’m a huge fan of operational automation and tooling. I can remember my first experience with VMware ESX and thinking that it needed tooling automation.  Since then, I’ve watched as cloud ops has revolutionized application development and deployment.  We are just at the beginning of the cloud automation curve and our continuous deployment tooling and platform services deliver exponential increases in value.

These cloud breakthroughs are fundamental to Ops and uncovered real best practices for operators.  Unfortunately, much of the cloud specific scripts and tools do not translate well to physical ops.  How can we correct that?

Now that I focus on physical ops, I’m in awe of the capabilities being unleashed by cloud ops. Looking at Netflix chaos monkeys pattern alone, we’ve reached a point where it’s practical to inject artificial failures to improve application robustness.  The idea of breaking things on purpose as an optimization is both terrifying and exhilarating.

In the last few years, I’ve watched (and lead) an application of these cloud tool chains down to physical infrastructure.  Fundamentally, there’s a great fit between DevOps configuration management (Chef, Puppet, Salt, Ansible) tooling and physical ops.  Most of the configuration and installation work (post-ready state) is fundamentally the same regardless if the services are physical, virtual or containerized.  Installing PostgreSQL is pretty much the same for any platform.

But pretty much the same is not exactly the same.  The differences between platforms often prevent us from translating useful work between frames.  In physics, we’d call that an impedance mismatch: where similar devices cannot work together dues to minor variations.

An example of this Ops impedance mismatch is networking.  Virtual systems present interfaces and networks that are specific to the desired workload while physical systems present all the available physical interfaces plus additional system interfaces like VLANs, bridges and teams.  On a typical server, there at least 10 available interfaces and you don’t get to choose which ones are connected – you have to discover the topology.  To complicate matters, the interface list will vary depending on both the server model and the site requirements.

It’s trivial in virtual by comparison, you get only the NICs you need and they are ordered consistently based on your network requests.  While the basic script is the same, it’s essential that it identify the correct interface.  That’s simple in cloud scripting and highly variable for physical!

Another example is drive configuration.  Hardware presents limitless options of RAID, JBOD plus SSD vs HDD.  These differences have dramatic performance and density implications that are, by design, completely obfuscated in cloud resources.

The solution is to create functional abstractions between the application configuration and the networking configuration.  The abstraction isolates configuration differences between the scripts.  So the application setup can be reused even if the networking is radically different.

With some of our OpenCrowbar latest work, we’re finally able to create practical abstractions for physical ops that’s repeatable site to site.  For example, we have patterns that allow us to functionally separate the network from the application layer.  Using that separation, we can build network interfaces in one layer and allow the next to assume the networking is correct as if it was a virtual machine.  That’s a very important advance because it allows us to finally share and reuse operational scripts.

We’ll never fully eliminate the physical vs cloud impedance issue, but I think we can make the gaps increasingly small if we continue to 1) isolate automation layers with clear APIs and 2) tune operational abstractions so they can be reused.

Starting RackN – Delivering open ops by pulling an OpenCrowbar Bunny out of our hat

When Dell pulled out from OpenCrowbar last April, I made a commitment to our community to find a way to keep it going.  Since my exit from Dell early in October 2014, that commitment has taken the form of RackN.

Rack N BlackToday, we’re ready to help people run and expand OpenCrowbar (days away from v2.1!). We’re also seeking investment to make the project more “enterprise-ready” and build integrations that extend ready state.

RackN focuses on maintenance and support of OpenCrowbar for ready state physical provisioning.  We will build the community around Crowbar as an open operations core and extend it with a larger set of hardware support and extensions.  We are building partnerships to build application integration (using Chef, Puppet, Salt, etc) and platform workloads (like OpenStack, Hadoop, Ceph, CloudFoundry and Mesos) above ready state.

I’ve talked with hundreds of people about the state of physical data center operations at scale. Frankly, it’s a scary state of affairs: complexity is increasing for physical infrastructure and we’re blurring the lines by adding commodity networking with local agents into the mix.

Making this jumble of stuff work together is not sexy cloud work – I describe it as internet plumbing to non-technical friends.  It’s unforgiving, complex and full of sharp edge conditions; however, people are excited to hear about our hardware abstraction mission because it solves a real pain for operators.

I hope you’ll stay tuned, or even play along, as we continue the Open Ops journey.

OpenCrowbar bootstrap positions SSH Keys for hand-offs

I was reading a ComputerWorld article about how Google and Amazon achieve scale.  The theme: you must do better than linear cost scale and the only way to achieve that is to automate and commoditize hardware.  I find interesting parallels in the Crowbar physical devops effort.

KeysAs the OpenCrowbar team continues to explore the concepts around “ready state,” I discover more and more small ops nuisances that need to be included in the build up before installing software.  These small items quickly add up at scale breaking the rule above.

I’ve already posted about the performance benefit of building a Squid Proxy fabric as part of the underlying ops environment.  As we work on Chef Metal, SaltStack and Packstack integrations (private beta), we’ve rediscovered the importance of management/population of SSH public keys.

In cloud infrastructure, key injection is taken for granted; however, it’s not an automatic behavior in the physical ops.  Since OpenCrowbar handles keys by default but other tools (like Cobbler or Razor) expect that you will use kickstart to inject your SSH keys when you install the Operating System..

Including keys in kickstart (which I’m using generically instead of preseed, auto-yast, jumpstart, etc) hand generated scripts is a potentially dangerous security practice since it makes it difficult to propagate and manage your keys.  It also means that every time a new operating system update is released that you may have to update and retest your kickstarts.  OpenCrowbar has the same challenge but our approach allows everyone can share in the work because our bootstrapping files are scripted and generic.

OpenCrowbar takes care of these ready state configurations in our integrations with these DevOps platforms.  Our experience has been that little items like SSH keys and proxy configurations can make a disproportionate advantage in running scale ops or during iterative development.

Need a physical ops baseline? Crowbar continues to uniquely fill gap

Robots Everywhere!I’ve been watching to see if other open “bare metal” projects would morph to match the system-level capabilities that we proved in Crowbar v1 and honed in the re-architecture of OpenCrowbar.  The answer appears to be that Crowbar simply takes a broader approach to solving the physical ops repeatably problem.

Crowbar Architect Victor Lowther says “What makes Crowbar a better tool than Cobbler, Razor, or Foreman is that Crowbar has an orchestration engine that can be used to safely and repeatably deploy complex workloads across large numbers of machines. This is different from (and better than, IMO) just being able to hand responsibility off to Chef/Puppet/Salt, because we can manage the entire lifecycle of a machine where Cobbler, Razor and Chef cannot, we can describe how we want workloads configured at a more abstract level than Foreman can, and we do it all using the same API and UI.”

Since we started with a vision of an integrated system to address the “apply-rinse-repeat” cycle; it’s no surprise that Crowbar remains the only open platform that’s managed to crack the complete physical deployment life-cycle.

The Crowbar team realized that it’s not just about automation setting values: physical ops requires orchestration to make sure the values are set in the correct sequence on the appropriate control surface including DNS, DHCP, PXE, Monitoring, et cetera.  Unlike architectures for aaS platforms, the heterogeneous nature of the physical control planes requires a different approach.

We’ve seen that making more and more complex kickstart scripts or golden images is not a sustainable solution.  There is simply too much hardware variation and dependency thrash for operators to collaborate with those tools.  Instead, we’ve found that decomposing the provisioning operations into functional layers with orchestration is much more multi-site repeatable.

Accepting that physical ops (discovered infrastructure) is fundamentally different from cloud ops (created infrastructure) has been critical to architecting platforms that were resilient enough for the heterogeneous infrastructure of data centers.

If we want to start cleaning up physical ops, we need to stop looking at operating system provisioning in isolation and start looking at the full server bring up as just a part of a broader system operation that includes networking, management and operational integration.

Apply, Rinse, Repeat! How do I get that DevOps conditioner out of my hair?

I’ve been trying to explain the pain Tao of physical ops in a way that’s accessible to people without scale ops experience.   It comes down to a yin-yang of two elements: exploding complexity and iterative learning.

Science = Explosions!Exploding complexity is pretty easy to grasp when we stack up the number of control elements inside a single server (OS RAID, 2 SSD cache levels, 20 disk JBOD, and UEFI oh dear), the networks that server is connected to, the multi-layer applications installed on the servers, and the change rate of those applications.  Multiply that times 100s of servers and we’ve got a problem of unbounded scope even before I throw in SDN overlays.

But that’s not the real challenge!  The bigger problem is that it’s impossible to design for all those parameters in advance.

When my team started doing scale installs 5 years ago, we assumed we could ship a preconfigured system.  After a year of trying, we accepted the reality that it’s impossible to plan out a scale deployment; instead, we had to embrace a change tolerant approach that I’ve started calling “Apply, Rinse, Repeat.”

Using Crowbar to embrace the in-field nature of design, we discovered a recurring pattern of installs: we always performed at least three full cycle installs to get to ready state during every deployment.

  1. The first cycle was completely generic to provide a working baseline and validate the physical environment.
  2. The second cycle attempted to integrate to the operational environment and helped identify gaps and needed changes.
  3. The third cycle could usually interconnect with the environment and generally exposed new requirements in the external environment
  4. The subsequent cycles represented additional tuning, patches or redesigns that could only be realized after load was applied to the system in situ.

Every time we tried to shortcut the Apply-Rinse-Repeat cycle, it actually made the total installation longer!  Ultimately, we accepted that the only defense was to focus on reducing A-R-R cycle time so that we could spend more time learning before the next cycle started.

a Ready State analogy: “roughed in” brings it Home for non-ops-nerds

I’ve been seeing great acceptance on the concept of ops Ready State.  Technologists from both ops and dev immediately understand the need to “draw a line in the sand” between system prep and installation.  We also admit that getting physical infrastructure to Ready State is largely taken for granted; however, it often takes multiple attempts to get it right and even small application changes can require a full system rebuild.

Since even small changes can redefine the ready state requirements, changing Ready State can feel like being told to tear down your house so you remodel the kitchen.

Foundation RawA friend asked me to explain “Ready State” in non-technical terms.  So far, the best analogy that I’ve found is when a house is “Roughed In.”  It’s helpful if you’ve ever been part of house construction but may not be universally accessible so I’ll explain.

Foundation PouredGetting to Rough In means that all of the basic infrastructure of the house is in place but nothing is finished.  The foundation is poured, the plumbing lines are placed, the electrical mains are ready, the roof on and the walls are up.  The house is being built according to architectural plans and major decisions like how many rooms there are and the function of the rooms (bathroom, kitchen, great room, etc).  For Ready State, that’s like having the servers racked and setup with Disk, BIOS, and network configured.

Framed OutWhile we’ve built a lot, rough in is a relatively early milestone in construction.  Even major items like type of roof, siding and windows can still be changed.  Speaking of windows, this is like installing an operating system in Ready State.  We want to consider this as a distinct milestone because there’s still room to make changes.  Once the roof and exteriors are added, it becomes much more disruptive and expensive to make.

Roughed InOnce the house is roughed in, the finishing work begins.  Almost nothing from roughed in will be visible to the people living in the house.  Like a Ready State setup, the users interact with what gets laid on top of the infrastructure.  For homes it’s the walls, counters, fixtures and following.  For operators, its applications like Hadoop, OpenStack or CloudFoundry.

Taking this analogy back to where we started, what if we could make rebuilding an entire house take just a day?!  In construction, that’s simply not practical; however, we’re getting to a place in Ops where automation makes it possible to reconstruct the infrastructure configuration much faster.

While we can’t re-pour the foundation (aka swap out physical gear) instantly, we should be able to build up from there to ready state in a much more repeatable way.

SDN’s got Blind Spots! What are these Projects Ignoring? [Guest Post by Scott Jensen]

Scott Jensen returns as a guest poster about SDN!  I’m delighted to share his pointed insights that expand on previous 2 Part serieS about NFV and SDN.  I especially like his Rumsfeldian “unknowable workloads”

In my [Scott’s] last post, I talked about why SDN is important in cloud environments; however, I’d like to challenge the underlying assumption that SDN cures all ops problems.

SDN implementations which I have looked at make the following base assumption about the physical network.  From the OpenContrails documentation:

The role of the physical underlay network is to provide an “IP fabric” – its responsibility is to provide unicast IP connectivity from any physical device (server, storage device, router, or switch) to any other physical device. An ideal underlay network provides uniform low-latency, non-blocking, high-bandwidth connectivity from any point in the network to any other point in the network.

The basic idea is to build an overlay network on top of the physical network in order to utilize a variety of protocols (Netflow, VLAN, VXLAN, MPLS etc.) and build the networking infrastructure which is needed by the applications and more importantly allow the applications to modify this virtual infrastructure to build the constructs that they need to operate correctly.

All well and good; however, what about the Physical Networks?

Under Provisioned / FunnyEarth.comThat is where you will run into bandwidth issues, QOS issues, latency differences and where the rubber really meets the road.  Ignoring the physical networks configuration can (and probably will) cause the entire system to perform poorly.

Does it make sense to just assume that you have uniform low latency connectivity to all points in the network?  In many cases, it does not.  For example:

  • Accesses to storage arrays have a different traffic pattern than a distributed storage system.
  • Compute resources which are used to house VMs which are running web applications are different than those which run database applications.
  • Some applications are specifically sensitive to certain networking issues such as available bandwidth, Jitter, Latency and so forth.
  • Where others will perform actions over the network at certain times of the day but then will not require the network resources for the rest of the day.  Classic examples of this are system backups or replication events.

Over Provisioned / zilya.netIf the infrastructure you are trying to implement is truly unknown as to how it will be utilized then you may have no choice than to over-provision the physical network.  In building a public cloud, the users will run whichever application they wish it may not be possible to engineer the appropriate traffic patterns.

This unknowable workload is exactly what these types of SDN projects are trying to target!

When designing these systems you do have a good idea of how it will be utilized or at least how specific portions of the system will be utilized and you need to account for that when building up the physical network under the SDN.

It is my belief that SDN applications should not just create an overlay.  That is part of the story, but should also take into account the physical infrastructure and assist with modifying the configuration of the Physical devices.  This balance achieves the best use of the network for both the applications which are running in the environment AND for the systems which they run on or rely upon for their operations.

Correctly ProvisionedWe need to reframe our thinking about SDN because we cannot just keep assuming that the speeds of the network will follow Moore’s Law and that you can assume that the Network is an unlimited resource.