Working for Dell, it’s no surprise that I have a lot of discussions around building up and maintaining the physical infrastructure to run data centers at scale. Generally the context is OpenCrowbar, Hadoop or OpenStack Ironic/TripleO/Heat, but the concerns are really universal in my cloud operations experience.
Typically, deployments have three distinct phases: 1) mechanically plug together the systems, 2) get the systems ready to the OS and network level and then 3) install the application. Often these phases are so distinct that they are handled by completely different teams!
That’s a problem because errors or unexpected changes from one phase are very expensive to address once you change teams. The solution has been to become more and more prescriptive about what the system looks like between the second (“ready”) and third (“installed”) phase. I’ve taken to calling this hand-off achieving a ready state infrastructure.
I define a “ready state” infrastructure as having been configured so that the application lay down steps are simple and predictable.
In my experience, most application deployment guides start with a ready state assumption. They read like “Step 0: rack, configure, provision and tweak the nodes and network to have this specific starting configuration.” If you are really lucky then “specific configuration” is actually a documented and validated reference architecture.
The magic of cloud IaaS is that it always creates ready state infrastructure. If I request 10 servers with 2 NICs running Ubuntu 14.04 then that’s exactly what I get. The fact that cloud always provisions a ready state infrastructure has become an essential operating assumption for cloud orchestration and configuration management.
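To make that operating assumption concrete, here is a minimal sketch in Python of what "checking for ready state" could look like for the request above. The `NodeSpec` and `Node` types and both function names are hypothetical, invented for illustration; they are not part of OpenCrowbar or any cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """What the application deployment guide assumes about every node."""
    nic_count: int
    os_name: str
    os_version: str


@dataclass(frozen=True)
class Node:
    """What was actually discovered or provisioned (hypothetical inventory record)."""
    name: str
    nic_count: int
    os_name: str
    os_version: str


def ready_state_gaps(node, spec):
    """Return readable mismatches; an empty list means the node matches the spec."""
    gaps = []
    if node.nic_count != spec.nic_count:
        gaps.append(f"{node.name}: expected {spec.nic_count} NICs, found {node.nic_count}")
    if (node.os_name, node.os_version) != (spec.os_name, spec.os_version):
        gaps.append(
            f"{node.name}: expected {spec.os_name} {spec.os_version}, "
            f"found {node.os_name} {node.os_version}"
        )
    return gaps


def is_ready_state(nodes, spec, count):
    """True only if we got the requested number of nodes and every one matches."""
    return len(nodes) == count and all(not ready_state_gaps(n, spec) for n in nodes)
```

In a cloud, a check like `is_ready_state(nodes, NodeSpec(2, "Ubuntu", "14.04"), 10)` is effectively always true, which is why nobody bothers to write it. On physical gear, it is the check that the hand-off between teams depends on.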
Unfortunately, hardware provisioning is messy. It takes significant effort to configure a physical system into a ready state. This is caused by a number of factors:
- You can’t alter physical infrastructure with programming (an API) – for example, if the server enumerates the NICs differently than you expected, you have to adapt to that.
- You have to respect the physical topology of the system – for example, production deployments use teamed NICs that have to be connected to different switches for redundancy. You can’t make assumptions; you have to set up the team based on the specific configuration.
- You have to build up the configuration in sequence – for example, you can’t set up the RAID configuration after the operating system is installed. If you made a bad choice then you’ll likely have to repeat the whole sequence of the deployment, and some bad choices (like using the wrong subnets) result in a total system rebuild.
- Hardware fails and is non-uniform – for example, in any order of sufficient size you will have NIC failures due to everything from simple mechanical card seating issues to BIOS interface mismatches. Troubleshooting these issues can occupy significant time.
- Component configurations are interlocked – for example, a change to the switch settings could result in DHCP failures when systems are rebooted (real experience). You cannot always work node-to-node; you must deal with the infrastructure as an integrated system.
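Two of the factors above, sequencing and topology, are easy to sketch in code. The following Python is illustrative only: the phase names in `BUILD_SEQUENCE`, the NIC and switch names, and both functions are assumptions made up for this example, not an actual provisioning API.

```python
# Hypothetical phase names. The only ordering taken from the discussion above
# is that RAID must be configured before the OS, and the OS before the application.
BUILD_SEQUENCE = ["raid", "os_install", "application_install"]


def phases_to_redo(changed_phase):
    """Changing one phase means repeating that phase and every phase after it."""
    start = BUILD_SEQUENCE.index(changed_phase)
    return BUILD_SEQUENCE[start:]


def team_is_redundant(team, nic_to_switch):
    """A NIC team gives switch redundancy only if its members reach 2+ switches.

    team: list of NIC names in the team
    nic_to_switch: discovered cabling, mapping NIC name -> switch name
    """
    switches = {nic_to_switch[nic] for nic in team}
    return len(team) >= 2 and len(switches) >= 2
```

For example, `phases_to_redo("raid")` returns all three phases, while `team_is_redundant(["eth0", "eth1"], {"eth0": "sw-a", "eth1": "sw-a"})` is false because both NICs land on the same switch. Note that the cabling map has to be discovered from the physical system; it cannot be assumed from the NIC names.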
Being consistent at turning discovered state into ready state is a complex and unique problem space. As I explore this bare metal provisioning space in the community, I am more and more convinced that it has a distinct architecture from applications built for ready state operations.
My hope in this post is to test whether the concept of “ready state” infrastructure is helpful in describing the transition point between provisioning and installation. Please let me know what you think!