Delicious 7 Layer DIP (DevOps Infrastructure Provisioning) model with graphic!

Applying architecture and computer science principles to infrastructure automation helps us build better controls.  In this post, we create an OSI-like model that helps decompose the ops environment.

The RackN team discussions about “what is Ready State” have led to some interesting realizations about physical ops.  One of the most critical has been splitting the operational configuration (DNS, NTP, SSH Keys, Monitoring, Security, etc) from the application configuration.

Interactions between these layers is much more dynamic than developers and operators expect.  

In cloud deployments, you can use ask for the virtual infrastructure to be configured in advance via the IaaS and/or golden base images.  In hardware, the environment build up needs to be more incremental because that variations in physical infrastructure and operations have to be accommodated.

Greg Althaus, Crowbar co-founder, and I put together this 7 layer model (it started as 3 and grew) because we needed to be more specific in discussion about provisioning and upgrade activity.  The system view helps explain how layer 5 and 6 operate at the system layer.

7 Layer DIP

The Seven Layers of our DIP:

  1. shared infrastructure – the base layer is about the interconnects between the nodes.  In this model, we care about the specific linkage to the node: VLAN tags on the switch port, which switch is connected, which PDU ID controls turns it on.
  2. firmware and management – nodes have substantial driver (RAID/BIOS/IPMI) software below the operating system that must be configured correctly.   In some cases, these configurations have external interfaces (BMC) that require out-of-band access while others can only be configured in pre-install environments (I call that side-band).
  3. operating system – while the operating system is critical, operators are striving to keep this layer as thin to avoid overhead.  Even so, there are critical security, networking and device mapping functions that must be configured.  Critical local resource management items like mapping media or building network teams and bridges are level 2 functions.
  4. operations clients – this layer connects the node to the logical data center infrastructure is basic ways like time synch (NTP) and name resolution (DNS).  It’s also where more sophisticated operators configure things like distributed cache, centralized logging and system health monitoring.  CMDB agents like Chef, Puppet or Saltstack are installed at the “top” of this layer to complete ready state.
  5. applications – once all the baseline is setup, this is the unique workload.  It can range from platforms for other applications (like OpenStack or Kubernetes) or the software itself like Ceph, Hadoop or anything.
  6. operations management – the external system references for layer 3 must be factored into the operations model because they often require synchronized configuration.  For example, registering a server name and IP addresses in a DNS, updating an inventory database or adding it’s thresholds to a monitoring infrastructure.  For scale and security, it is critical to keep the node configuration (layer 3) constantly synchronized with the central management systems.
  7. cluster coordination – no application stands alone; consequently, actions from layer 4 nodes must be coordinated with other nodes.  This ranges from database registration and load balancing to complex upgrades with live data migration. Working in layer 4 without layer 6 coordination creates unmanageable infrastructure.

This seven layer operations model helps us discuss which actions are required when provisioning a scale infrastructure.  In my experience, many developers want to work exclusively in layer 4 and overlook the need to have a consistent and managed infrastructure in all the other layers.  We enable this thinking in cloud and platform as a service (PaaS) and that helps improve developer productivity.

We cannot overlook the other layers in physical ops; however, working to ready state helps us create more cloud-like boundaries.  Those boundaries are a natural segue my upcoming post about functional operations (older efforts here).

Physical Ops = Plumbers of the Internet. Celebrating dirty IT jobs 8 bit style

I must be crazy because I like to make products that take on the hard and thankless jobs in IT.  Its not glamorous, but someone needs to do them.

marioAnalogies are required when explaining what I do to most people.  For them, I’m not a specialist in physical data center operations, I’m an Internet plumber who is part of the team you call when your virtual toilet backs up.  I’m good with that – it’s work that’s useful, messy and humble.

Plumbing, like the physical Internet, disappears from most people’s conscious once it’s out of sight under the floor, cabinet or modem closet.  And like plumbers, we can’t do physical ops without getting dirty.  Unlike cloud-based ops with clean APIs and virtual services, you can’t do physical ops without touching something physical.  Even if you’ve got great telepresence, you cannot get away from physical realities like NIC and SATA enumeration, BIOS management and network topology.  I’m delighted that cloud has abstracted away that layer for most people but that does not mean we can ignore it.

Physical ops lacks the standardization of plumbing.  There are many cross-vendor standards but innovation and vendor variation makes consistency as unlikely as a unicorn winning the Rainbow Triple Crown.

493143-donkey_kong_1For physical ops, it feels like we’re the internet’s most famous plumber, Mario, facing Donkey Kong.  We’ve got to scale ladders, jump fireballs and swing between chains.  The job is made harder because there’s no half measures.  Sometimes you can find the massive hammer and blast your way through but that’s just a short term thing.

Unfortunately, there’s a real enemy here: complexity.

Just like Donkey Kong keeps dashing off with the princess, operations continue to get more and more complex.  Like with Mario, the solution is not to bypass the complexity; it’s to get better and faster at navigating the obstacles that get thrown at you.  Physical ops is about self-reliance and adaptability.  In that case, there are a lot of lessons to be learned from Mario.

If I’m an internet plumber then I’m happy to embrace Mario as my mascot.  Plumbers of the internet unite!

Who’s the grown-up here?  It’s the VM not the Iron!

This ANALOGY exploring Virtual vs Physical Ops is Joint posting by Rob Hirschfeld, RackN, and Russel Doty, Redhat.RUSSEL DOTY

babyCompared to provisioning physical servers, getting applications running in a virtual machine is like coaching an adult soccer team – the players are ready, you just have to get them to the field and set the game in motion.  The physical servers can be compared to a grade school team – tremendous potential, but they can require a lot of coaching and intervention. And they don’t always play nice.

Russell Doty and I were geeking on the challenges of configuring physical servers when we realized that our friends in cloud just don’t have these problems.  When they ask for a server, it’s delivered to them on a platter with an SLA.  It’s a known configuration – calm, rational and well-behaved.  By comparison, hardware is cranky, irregular and sporadic.  To us, it sometimes feels like we are more in the babysitting business. Yes, we’ve had hardware with the colic!

Continuing the analogy, physical operations requires a degree of child-proofing and protection that is (thankfully) hidden behind cloud abstractions of hardware.  More importantly, it requires a level of work that adults take for granted like diaper changes (bios/raid setup), food preparation (network configs), and self-entertainment (O/S updates).

And here’s where the analogy breaks down…

The irony here is that the adults (vms) are the smaller, weaker part of the tribe.  Not only that, these kids have to create the environment that the “adults” run on.

If you’re used to dealing with adults to get work done, you’re going to be in for a shock when you ask the kids to do the same job.

That’s why the cloud is such a productive platform for software.  It’s an adults-only environment – the systems follow the rules and listen to your commands.  Even further, cloud systems know how to dress themselves (get an O/S), rent an apartment (get an IP and connect) and even get credentials (get a driver’s license).

These “little things” are taken for granted in the cloud are not automatic behaviors for physical infrastructure.

Of course, there are trade-offs – most notably performance and “scale up” scalability. The closer you need to get to hardware performance, on cpu, storage, or networks, the closer you need to get to the hardware.

It’s the classic case of standardizing vs. customization. And a question of how much time you are prepared to put into care and feeding!

why is hardware hard? Ready State Physical Ops Meetup on Tuesday 12/2 9am PT

meh.  Compared to cloud, Ops on physical infrastructure sinks.physical outlet

Unfortunately, the cloud and scale platforms need to run someone so someone’s got to deal with it.  In fact, we’ve got to deal with crates of cranky servers and flocks of finicky platforms.  It’s enough to keep a good operator down.

If that’s you, or someone you care about, join us for a physical ops support group meetup on Tuesday 9am PT (11 central).

There is a light at the end of the tunnel!  We can make it repeatable to provision OpenStack, Hadoop and other platforms.

As a community, we’re steadily bringing best practices and proven automation from cloud ops down into the physical space.   On the OpenCrowbar project, we’re accelerating this effort using the ready state concept as a hand off point for “physical-cloud equivalency” and exploring the concept of “functional operations” to make DevOps scripts more portable.

Join me and Rafael Knuth for a discussion about how operators can work together to break the cycle.

Ops is Ops, except when it ain’t. Breaking down the impedance mismatches between physical and cloud ops.

We’ve made great strides in ops automation, but there’s no one-size-fits-all approach to ops because abstractions have limitations.

IMG_20141108_035537967Perhaps it’s my Industrial Engineering background, I’m a huge fan of operational automation and tooling. I can remember my first experience with VMware ESX and thinking that it needed tooling automation.  Since then, I’ve watched as cloud ops has revolutionized application development and deployment.  We are just at the beginning of the cloud automation curve and our continuous deployment tooling and platform services deliver exponential increases in value.

These cloud breakthroughs are fundamental to Ops and uncovered real best practices for operators.  Unfortunately, much of the cloud specific scripts and tools do not translate well to physical ops.  How can we correct that?

Now that I focus on physical ops, I’m in awe of the capabilities being unleashed by cloud ops. Looking at Netflix chaos monkeys pattern alone, we’ve reached a point where it’s practical to inject artificial failures to improve application robustness.  The idea of breaking things on purpose as an optimization is both terrifying and exhilarating.

In the last few years, I’ve watched (and lead) an application of these cloud tool chains down to physical infrastructure.  Fundamentally, there’s a great fit between DevOps configuration management (Chef, Puppet, Salt, Ansible) tooling and physical ops.  Most of the configuration and installation work (post-ready state) is fundamentally the same regardless if the services are physical, virtual or containerized.  Installing PostgreSQL is pretty much the same for any platform.

But pretty much the same is not exactly the same.  The differences between platforms often prevent us from translating useful work between frames.  In physics, we’d call that an impedance mismatch: where similar devices cannot work together dues to minor variations.

An example of this Ops impedance mismatch is networking.  Virtual systems present interfaces and networks that are specific to the desired workload while physical systems present all the available physical interfaces plus additional system interfaces like VLANs, bridges and teams.  On a typical server, there at least 10 available interfaces and you don’t get to choose which ones are connected – you have to discover the topology.  To complicate matters, the interface list will vary depending on both the server model and the site requirements.

It’s trivial in virtual by comparison, you get only the NICs you need and they are ordered consistently based on your network requests.  While the basic script is the same, it’s essential that it identify the correct interface.  That’s simple in cloud scripting and highly variable for physical!

Another example is drive configuration.  Hardware presents limitless options of RAID, JBOD plus SSD vs HDD.  These differences have dramatic performance and density implications that are, by design, completely obfuscated in cloud resources.

The solution is to create functional abstractions between the application configuration and the networking configuration.  The abstraction isolates configuration differences between the scripts.  So the application setup can be reused even if the networking is radically different.

With some of our OpenCrowbar latest work, we’re finally able to create practical abstractions for physical ops that’s repeatable site to site.  For example, we have patterns that allow us to functionally separate the network from the application layer.  Using that separation, we can build network interfaces in one layer and allow the next to assume the networking is correct as if it was a virtual machine.  That’s a very important advance because it allows us to finally share and reuse operational scripts.

We’ll never fully eliminate the physical vs cloud impedance issue, but I think we can make the gaps increasingly small if we continue to 1) isolate automation layers with clear APIs and 2) tune operational abstractions so they can be reused.

API Driven Metal = OpenCrowbar + Chef Provisioning

The OpenCrowbar community created a Chef-Provisioning driver that allows you to quickly build hardware clusters using Chef cookbooks.

2012-08-05_14-13-18_310When we started using Chef in 2011, there was a distinct gap around bootstrapping systems.  The platform did a great job of automation and even connecting services together (via the Search anti-pattern, see below) but lacked a way to build the initial clusters automatically.

The current answer to this problem from Chef is refreshingly simply: a cookbook API extension called Chef Provisioning.  This approach uses the regular Chef DSL in recipes to create request and bind a cluster into Chef.  Basically, the code simply builds an array of nodes using an API that creates the nodes if they are missing from the array in the code.  Specifically, when a node is missing from the array, Chef calls out to create the node in an external system.

For clouds, that means using the API to request a server and then inject credentials for Chef management.  It’s trickier for physical gear because you cannot just make a server in the configuration you need it in.  Physical systems must first be discovered and profiled to ready state: the system must know how many NICs and disk drives are available to correctly configure the hardware prior to laying down the Operating System.

Consequently, Chef Provisioning automation is more about reallocation of existing discovered physical assets to Chef.  That’s exactly the approach the OpenCrowbar team took for our Chef Provisioning driver.

OpenCrowbar interacts with Chef Provisioning by pulling nodes from the System deployment into a Chef Provisioning deployment.  That action then allows the API client to request specific configurations like Operating System or network that need to be setup for Chef to execute.  Once these requests are made, Crowbar will simply run its normal annealing processes to ready state and then injects the Chef credentials.  Chef waits until the work queue is empty and then takes over management of the asset.  When Chef is finished, Crowbar can be instructed to reconfigure the node back to a base state.

Does that sound simple?  It is simple because the Crowbar APIs match the Chef needs very cleanly.

It’s worth noting that this integration is a great test of the OpenCrowbar API design.  Over the last two years, we’ve evolved the API to make it more final result focused.  Late binding is a critical concept for the project and the APIs reflect that objective.  For Chef Provisioning, we allow the integration to focus on simple requests like “give me a node then put this O/S on the node and go.”  Crowbar has the logic needed to figure out how to accomplish those objectives without much additional instruction.

Bonus Side Note: Why Search can become an anti-pattern?  

Search is an incredibly powerful feature in Chef that allows cross-role and cross-node integration; unfortunately, it’s also very difficult to maintain as complexity and contributor counts grow.  The reason is that search creates “forward dependencies” in the scripts that require operators creating data to be aware of downstream, hidden consumers.  High Availability (HA) is a clear example.  If I add a new “cluster database” role to the system then it is very likely to return multiple results for database searches.  That’s excellent until I learn that my scripts have coded search to assume that we only return one result for database lookups.  It’s very hard to find these errors since the searches are decoupled and downstream of the database cookbook.  Ultimately, the community had to advise against embedded search for shared cookbooks

OpenCrowbar bootstrap positions SSH Keys for hand-offs

I was reading a ComputerWorld article about how Google and Amazon achieve scale.  The theme: you must do better than linear cost scale and the only way to achieve that is to automate and commoditize hardware.  I find interesting parallels in the Crowbar physical devops effort.

KeysAs the OpenCrowbar team continues to explore the concepts around “ready state,” I discover more and more small ops nuisances that need to be included in the build up before installing software.  These small items quickly add up at scale breaking the rule above.

I’ve already posted about the performance benefit of building a Squid Proxy fabric as part of the underlying ops environment.  As we work on Chef Metal, SaltStack and Packstack integrations (private beta), we’ve rediscovered the importance of management/population of SSH public keys.

In cloud infrastructure, key injection is taken for granted; however, it’s not an automatic behavior in the physical ops.  Since OpenCrowbar handles keys by default but other tools (like Cobbler or Razor) expect that you will use kickstart to inject your SSH keys when you install the Operating System..

Including keys in kickstart (which I’m using generically instead of preseed, auto-yast, jumpstart, etc) hand generated scripts is a potentially dangerous security practice since it makes it difficult to propagate and manage your keys.  It also means that every time a new operating system update is released that you may have to update and retest your kickstarts.  OpenCrowbar has the same challenge but our approach allows everyone can share in the work because our bootstrapping files are scripted and generic.

OpenCrowbar takes care of these ready state configurations in our integrations with these DevOps platforms.  Our experience has been that little items like SSH keys and proxy configurations can make a disproportionate advantage in running scale ops or during iterative development.