Working for Dell, it’s no surprise that I have a lot of discussions about building up and maintaining the physical infrastructure to run data centers at scale. Generally the context is OpenCrowbar, Hadoop or OpenStack Ironic/TripleO/Heat, but the concerns are universal in my cloud operations experience.
Typically, deployments have three distinct phases: 1) mechanically plug together the systems, 2) get the systems ready at the OS and network level and then 3) install the application. Often these phases are so distinct that they are handled by completely different teams!
That’s a problem because errors or unexpected changes from one phase are very expensive to address once you change teams. The solution has been to become more and more prescriptive about what the system looks like between the second (“ready”) and third (“installed”) phases. I’ve taken to calling this hand-off achieving a ready state infrastructure.
I define a “ready state” infrastructure as one that has been configured so that the application lay-down steps are simple and predictable.
In my experience, most application deployment guides start with a ready state assumption. They read like “Step 0: rack, configure, provision and tweak the nodes and network to have this specific starting configuration.” If you are really lucky, that “specific configuration” is actually a documented and validated reference architecture.
The magic of cloud IaaS is that it always creates ready state infrastructure. If I request 10 servers with 2 NICs running Ubuntu 14.04 then that’s exactly what I get. The fact that cloud always provisions a ready state infrastructure has become an essential operating assumption for cloud orchestration and configuration management.
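That operating assumption can be sketched as a declarative spec check. The field names and node data below are invented for illustration; a real system would discover them from the provisioned servers:

```python
# Hypothetical sketch: the "ready state" contract expressed as a declarative
# spec that provisioned nodes are verified against. Field names and node
# data are illustrative, not from any specific cloud API.

READY_SPEC = {"nic_count": 2, "os": "Ubuntu 14.04"}

def is_ready(node, spec=READY_SPEC):
    """Return True if a discovered node satisfies the ready-state spec."""
    return all(node.get(key) == value for key, value in spec.items())

# Cloud IaaS reliably delivers nodes that pass this check; the rest of
# this post is about how hard that same guarantee is on bare metal.
nodes = [
    {"name": "node-%02d" % i, "nic_count": 2, "os": "Ubuntu 14.04"}
    for i in range(10)
]

assert all(is_ready(n) for n in nodes)
```

Orchestration and configuration management tools effectively bake this check in as a precondition rather than something they verify.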
Unfortunately, hardware provisioning is messy. It takes significant effort to configure a physical system into a ready state. This is caused by a number of factors:
- You can’t alter physical infrastructure programmatically (via an API) – for example, if the server enumerates the NICs differently than you expected, you have to adapt to that.
- You have to respect the physical topology of the system – for example, production deployments use teamed NICs that must connect to different switches for redundancy. You can’t make assumptions; you have to set up the team based on the specific configuration.
- You have to build up the configuration in sequence – for example, you can’t set up the RAID configuration after the operating system is installed. If you make a bad choice, you’ll likely have to repeat the whole deployment sequence, and some bad choices (like using the wrong subnets) result in a total system rebuild.
- Hardware fails and is non-uniform – for example, in any order of sufficient size you will have NIC failures due to everything from simple mechanical card seating issues to BIOS interface mismatches. Troubleshooting these issues can occupy significant time.
- Component configurations are interlocked – for example, a change to the switch settings could result in DHCP failures when systems are rebooted (a real experience). You cannot always work node-by-node; you must deal with the infrastructure as an integrated system.
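The sequencing constraint above can be sketched as ordered phases where changing an early phase invalidates everything after it. The phase names here are illustrative, not from any specific provisioning tool:

```python
# Hypothetical sketch of sequence-dependent hardware provisioning: each
# phase builds on the previous one, so a change to an early phase (RAID,
# subnets) forces every later phase to be redone.

PHASES = ["bios", "raid", "os_install", "network_config"]

def rebuild_from(completed, changed_phase):
    """Invalidate the changed phase and everything that comes after it.

    Returns (phases still valid, phases that must be redone in order).
    """
    idx = PHASES.index(changed_phase)
    kept = [p for p in completed if PHASES.index(p) < idx]
    redo = PHASES[idx:]
    return kept, redo

# Changing RAID after the OS is installed means redoing the OS too:
kept, redo = rebuild_from(["bios", "raid", "os_install"], "raid")
```

This is why a wrong subnet choice, which sits early in the real sequence, cascades into a total rebuild.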
Being consistent at turning discovered state into ready state is a complex and unique problem space. As I explore this bare metal provisioning space in the community, I am more and more convinced that it requires a distinct architecture from applications built for ready state operations.
My hope in this post is to test whether the concept of “ready state” infrastructure is helpful in describing the transition point between provisioning and installation. Please let me know what you think!