Scale-out platforms like Hadoop have different operating rules. I heard an interesting story today in which overall system performance was tripled (the run went from 15 minutes down to 5) by removing a node.
In a distributed system that coordinates work across multiple nodes, it takes only one bad node to dramatically drag down the performance of the entire system. Because work is typically partitioned evenly and the job completes only when the last node finishes, one slow node sets the pace for everyone.
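To make the effect concrete, here is a minimal sketch of that math. The node count and speeds are hypothetical, not from the story above; the point is only that the job time is the maximum over the nodes, so one degraded node dominates:

```python
def job_runtime(node_speeds, total_work=1000):
    """Total job time when work is split evenly across nodes.

    The job finishes only when the slowest node finishes, so one
    degraded node sets the pace for the whole cluster.
    """
    work_per_node = total_work / len(node_speeds)
    return max(work_per_node / speed for speed in node_speeds)

# Nine healthy nodes plus one degraded node (hypothetical numbers).
healthy = [1.0] * 9
cluster_with_straggler = healthy + [0.25]   # one node at quarter speed

print(job_runtime(healthy))                  # ~111 time units on 9 good nodes
print(job_runtime(cluster_with_straggler))   # 400 time units: 10 nodes, one bad
```

Note the counterintuitive result: the ten-node cluster is nearly four times slower than the nine-node cluster, which is exactly the shape of the story above.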
Finding and correcting this type of failure can be difficult. While natural variability, hardware faults, and bugs cause some issues, the human element is by far the most likely cause. If you can reduce the noise injected by human error, then you have a chance of finding the real system-related issues.
Consequently, I’ve found that management tooling and automation are essential for success. Management tools help diagnose the cause of an issue, and automation creates repeatable configurations that reduce the risk of human-injected variability.
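As a sketch of what that automation can look like, the snippet below fingerprints each node's configuration and flags the odd one out, since a single hand-edited config file is a classic source of human-injected variability. The node names and the Hadoop property are illustrative; in practice you would fetch the configs over SSH or from your configuration management system:

```python
import hashlib
from collections import Counter

def config_fingerprint(config_text: str) -> str:
    """Hash a node's config so differing nodes are easy to spot."""
    return hashlib.sha256(config_text.encode()).hexdigest()[:12]

def find_drifted_nodes(node_configs: dict[str, str]) -> list[str]:
    """Return the nodes whose config differs from the majority."""
    fingerprints = {node: config_fingerprint(cfg)
                    for node, cfg in node_configs.items()}
    majority, _ = Counter(fingerprints.values()).most_common(1)[0]
    return [node for node, fp in fingerprints.items() if fp != majority]

# Hypothetical configs, shown as already-fetched text.
configs = {
    "node-01": "mapred.child.java.opts=-Xmx1g\n",
    "node-02": "mapred.child.java.opts=-Xmx1g\n",
    "node-03": "mapred.child.java.opts=-Xmx512m\n",  # hand-edited outlier
}
print(find_drifted_nodes(configs))  # ['node-03']
```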
I’d also like to give a shout-out to benchmarks as part of your tooling suite. Without a reasonable benchmark, it's impossible to know whether your changes actually improved performance.
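A benchmark doesn't have to be elaborate. Even a small harness that runs the same job several times and reports the spread lets you say whether a change helped or whether you're just seeing noise. Here's a minimal sketch; the run_job function is a stand-in for whatever job you are actually measuring:

```python
import statistics
import time

def benchmark(run_job, runs=5):
    """Time a job several times and report median and spread.

    A single run can mislead; the median plus the standard deviation
    tells you whether a change is a real improvement or just noise.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_job()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), statistics.pstdev(timings)

# Stand-in workload; replace with the job you actually care about.
def run_job():
    sum(i * i for i in range(1_000_000))

median, spread = benchmark(run_job)
print(f"median {median:.3f}s, stdev {spread:.3f}s")
```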
Teaming-Related Post Script: While considering system performance, I realized that distributed human systems (aka teams) have a very similar characteristic: a single person can have a disproportionate impact on overall team performance.