Cloud APIs are about abstracting operations to simplify deployment. We want users of our cloud infrastructure to operate with blissful unawareness of the underlying networking topology, storage configuration and physical infrastructure. From their perspective, the cloud is perfectly elastic, totally configurable and wonderfully consistent. Cloud Admins, on the other hand, need visibility and controls that expose the complexity while keeping it rational. These are profoundly different concerns.
Maintaining the illusion of clean and simple Cloud ops infrastructure is very valuable; however, it’s just an illusion. The black metal box behind those APIs is complex, messy, unpredictable and dynamic.
1. Metal Ops has to deal with network topology and details like whether an operating system enumerates the NICs correctly, which NIC pair to bond and which 10g network to use for storage traffic. In networking, the topology determines how much traffic you can subscribe to a link and how to provide resiliency. Networking does not exist in isolation: you must consider the boundary firewalls and routers that either block or allow traffic, because without connectivity the cloud is useless. Details like access to and registration in DNS, NTP and DHCP provide the foundations of stable operations. These details are (and should be) hidden from the cloud user.
2. Metal Ops has to deal with firmware issues at every level. It matters to the server whether it boots into BIOS or UEFI mode. We have to manage the fact that RAID partitions need to be optimized based on the workload and type of drive. We have to consider whether there are specialized drivers and caches to manage and security features (like Intel TXT) to activate. These details are (and should be) hidden from the cloud user.
3. Metal Ops has to consider the security of its infrastructure. We have to manage where the admin control network crosses security domains. It matters which layer 2 networks have access to which parts of the infrastructure. Separation of responsibility for network vs. storage vs. compute is a reality that is not going away. These details are (and should be) hidden from the cloud user.
4. Metal Ops has to manage operating system compatibility. I know personally that vendors test and certify their operating systems on an enormous matrix of silicon. I have also learned that the matrix of possible combinations is far larger and fundamentally impossible to cover at the edges. There’s a reason that operators seek homogeneous environments and LTS releases. These details are (and should be) hidden from the cloud user.
5. Metal Ops has to deal with hardware failures. By simple statistics, the larger the system, the more things will break, and Metal Ops has to cope with this reality. We have to expose failure zones and boundaries to make intelligent responses (like moving data from a failed drive to a non-adjacent one); those responses require intimate knowledge of system topology that is intentionally hidden in cloud ops. Further, we need monitoring and management tooling that knows how to identify which NIC in a bond failed or flash the lights on the failed drive of an array. These details are (and should be) hidden from the cloud user.
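To make the failure-zone point concrete, here is a minimal sketch of picking a "non-adjacent" replacement for a failed drive. Everything in it is an assumption for illustration: the `Drive` record, the rack and chassis fields, and the `pick_replacement` helper are hypothetical names, not part of any real inventory or storage tool. It only shows why the decision needs topology data that a cloud API deliberately hides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    # Hypothetical inventory record; real tooling would carry far more detail.
    name: str
    rack: str
    chassis: str
    healthy: bool = True


def pick_replacement(failed, drives):
    """Return the healthy drive sharing the fewest failure zones with `failed`.

    Returns None when no healthy candidate exists.
    """
    candidates = [d for d in drives if d.healthy and d.name != failed.name]
    if not candidates:
        return None

    def shared_zones(d):
        same_rack = d.rack == failed.rack
        # Chassis names are assumed to be unique only within a rack.
        same_chassis = same_rack and d.chassis == failed.chassis
        return int(same_rack) + int(same_chassis)

    # Tie-break on name so the choice is deterministic.
    return min(candidates, key=lambda d: (shared_zones(d), d.name))


failed = Drive("d0", rack="r1", chassis="c1", healthy=False)
inventory = [
    failed,
    Drive("d1", rack="r1", chassis="c1"),  # same chassis: most adjacent
    Drive("d2", rack="r1", chassis="c2"),  # same rack, different chassis
    Drive("d3", rack="r2", chassis="c1"),  # different rack: least adjacent
]
target = pick_replacement(failed, inventory)
print(target.name if target else "no healthy drive available")
```

The cloud user never sees racks or chassis; Metal Ops cannot make this choice without them.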
Cloud’s power is being able to abstract away this complexity. Dealing with it gracefully behind the scenes requires transparency and detail, which makes the Metal Ops job fundamentally different.
While both can be highly automated and pass my “Cloud is Infrastructure with an API” test, their objectives are different.