I’ve been trying to explain the pain (er, Tao) of physical ops in a way that’s accessible to people without scale ops experience. It comes down to a yin-yang of two elements: exploding complexity and iterative learning.
Exploding complexity is pretty easy to grasp when we stack up the number of control elements inside a single server (OS RAID, 2 SSD cache levels, 20-disk JBOD, and UEFI, oh dear), the networks that server is connected to, the multi-layer applications installed on the servers, and the change rate of those applications. Multiply that by hundreds of servers and we’ve got a problem of unbounded scope even before I throw in SDN overlays.
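To put rough numbers on that, here is a back-of-the-envelope sketch. The layer names and option counts are made up for illustration (they are assumptions, not measurements from a real deployment); the point is only that the counts multiply within a server and then compound across servers.

```python
from math import prod

# Hypothetical number of independent choices per layer of one server.
# Illustrative assumptions only, not data from a real install.
options_per_layer = {
    "os": 3,
    "raid": 4,
    "ssd_cache": 4,
    "jbod_layout": 5,
    "firmware": 2,
    "network": 4,
    "application_stack": 6,
}


def configs_per_server(options):
    """Distinct configurations of one server: the product of all layer choices."""
    return prod(options.values())


def fleet_states(options, servers):
    """Distinct fleet states if every server may be configured independently."""
    return configs_per_server(options) ** servers


if __name__ == "__main__":
    print("per server:", configs_per_server(options_per_layer))
    digits = len(str(fleet_states(options_per_layer, 100)))
    print("digits in the 100-server state count:", digits)
```

Even with these modest made-up counts, the fleet-wide figure is far too large to enumerate, and that is before change rate or overlays enter the picture.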
But that’s not the real challenge! The bigger problem is that it’s impossible to design for all those parameters in advance.
When my team started doing scale installs 5 years ago, we assumed we could ship a preconfigured system. After a year of trying, we accepted the reality that it’s impossible to plan out a scale deployment; instead, we had to embrace a change-tolerant approach that I’ve started calling “Apply, Rinse, Repeat.”
Using Crowbar to embrace the in-field nature of design, we discovered a recurring pattern of installs: during every deployment, we performed at least three full-cycle installs to get to ready state.
- The first cycle was completely generic to provide a working baseline and validate the physical environment.
- The second cycle attempted to integrate with the operational environment and helped identify gaps and needed changes.
- The third cycle could usually interconnect with the environment and generally exposed new requirements in the external environment.
- The subsequent cycles represented additional tuning, patches or redesigns that could only be realized after load was applied to the system in situ.
Every time we tried to shortcut the Apply-Rinse-Repeat cycle, it actually made the total installation longer! Ultimately, we accepted that the only defense was to focus on reducing A-R-R cycle time so that we could spend more time learning before the next cycle started.
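The shape of that loop can be sketched in a few lines. This is a minimal illustration, not how Crowbar works: `apply_rinse_repeat`, `fake_install`, and the config keys are all hypothetical names, and the "install" is a stand-in function that reports which settings it found missing.

```python
def apply_rinse_repeat(config, install, max_cycles=10):
    """Run full installs until one cycle surfaces no new findings.

    `install` is a hypothetical callable that performs a complete install
    with `config` and returns a dict of changes the field exposed.
    """
    history = []
    for cycle in range(1, max_cycles + 1):
        findings = install(config)            # Apply: full install, start to finish
        history.append((cycle, dict(findings)))
        if not findings:                      # Nothing new learned: ready state
            break
        config = {**config, **findings}       # Rinse: fold the lessons back in
    return config, history                    # Repeat happens via the loop


def fake_install(config):
    """Stand-in for a real deployment; reports one missing setting per pass."""
    if "vlan" not in config:
        return {"vlan": 120}
    if "ntp" not in config:
        return {"ntp": "10.0.0.1"}
    return {}


if __name__ == "__main__":
    final, history = apply_rinse_repeat({"os": "generic"}, fake_install)
    for cycle, findings in history:
        print(cycle, findings)
    print(final)
```

Note that nothing in the loop tries to predict findings up front. The only lever it leaves is how fast one pass through `install` runs, which is the cycle time we ended up optimizing.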