In scale-out infrastructure, tools & automation matter

WiseScale out platforms like Hadoop have different operating rules.  I heard an interesting story today in which the performance of the overall system was improved 300% (run went from 15 mins down to 5 mins) by the removal of a node.

In a distributed system that coordinates work between multiple nodes, it only takes one bad node to dramatically impact the overall performance of the entire system.

Finding and correcting this type of failure can be difficult.  While natural variability, hardware faults or bugs cause some issues, the human element is by far the most likely cause.   If you can turn down noise injected by human error then you’ve got a chance to find the real system related issues.

Consequently, I’ve found that management tooling and automation are essential for success.  Management tools help diagnose the cause of the issue and automation creates repeatable configurations that reduce the risk of human injected variability.

I’d also like to give a shout out to benchmarks as part of your tooling suite.  Without having a reasonable benchmark it would be impossible to actually know that your changes improved performance.

Teaming Related Post Script: In considering the concept of system performance, I realized that distributed human systems (aka teams) have a very similar characteristic.  A single person can have a disproportionate impact on overall team performance.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s