My personal bias against SANs in cloud architectures is well documented; however, I am in the minority at my employer (Dell) and few enterprise IT shops share my view. In his recent post about the CAP theorem, Dave McCrory has persuaded me to look beyond their failure to bask in my flawless reasoning. Apparently, this crazy CAP thing explains why some people love SANs (enterprise) and others don’t (clouds).
The deal with CAP is that you can only have two of Consistency, Availability, or Partition Tolerance. Since everyone wants Availability, the choice is really between Consistency and Partition Tolerance. Seeking Availability, you’ve got two approaches:
Legacy applications tried to eliminate faults to achieve Consistency with physically redundant, scale-up designs.
Cloud applications assume faults to achieve Partition Tolerance with logically redundant, scale-out designs.
According to CAP, legacy and cloud approaches are so fundamentally different that they create a “CAP Chasm” in which the very infrastructure fabric needed to deploy these applications is different.
As a cloud geek, I consider the inherent cost and scale limitations of a CA approach far too constraining. My first-hand experience is that our customers and partners share my view: they have embraced AP patterns. These patterns make more efficient use of resources, dictate simpler infrastructure layouts, scale like hormone-crazed rabbits at a carrot farm, and can be deployed on less expensive commodity hardware.
As a CAP theorem enlightened IT professional, I can finally accept that there are other intellectually valid infrastructure models.
An application that runs “in the cloud” is designed fundamentally differently than a traditional enterprise application. Cloud apps live on unreliable, oversubscribed infrastructure; consequently, we must adopt the mindset that drove the first RAID storage systems and create a Redundant Array of Inexpensive Nodes (RAIN).
The drivers for RAIN are the same as for RAID: it’s more cost-effective and far more scalable to assemble a set of inexpensive units redundantly than to build a single large, super-reliable unit. Each node in the array handles a fraction of the overall workload, so the application design must partition the workload into atomic units.
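To make “partition the workload into atomic units” concrete, here’s a minimal sketch of hash-based assignment of work units to nodes. All the names here are hypothetical, and real deployments usually use consistent hashing so that adding or removing a node doesn’t reshuffle every key:

```python
import hashlib

def node_for(key: str, nodes: list) -> str:
    """Pick the node responsible for a workload unit by hashing its key.

    Each atomic unit of work (identified by a string key) lands on exactly
    one node, so the array shares the load without central coordination.
    """
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Hypothetical array of inexpensive nodes.
nodes = ["node-a", "node-b", "node-c"]
for order_id in ("order-1001", "order-1002", "order-1003"):
    print(order_id, "->", node_for(order_id, nodes))
```

The naive modulo shown here is the simplest possible partitioner; its weakness is that changing the node count remaps most keys, which is exactly the problem consistent hashing was invented to soften.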
I’ve attempted a rough mapping of RAIN into RAID-style levels. Not a perfect fit, but helpful.
RAIN 0 – no redundancy. If one part fails, the whole application dies. Think of a web server handing off to a backend system that fronts for the database. You may succeed in subdividing the workload to improve throughput, but a failure in any component breaks the system.
RAIN 1 – active-passive clustering. If one part fails, a second steps in to take over the workload. Simple redundancy, yet expensive because half your resources sit idle.
RAIN 2 – active-active clustering. Both parts of the application perform work so resource utilization is better, but now you’ve got a data synchronization problem.
RAIN 5 – multiple active nodes can process the load; losing any single node reduces capacity but does not break the application.
RAIN 6 – multiple nodes with specific dedicated stand-by capacity. Sometimes called “N+1” deployment, this approach works well with failure-ready designs.
RAIN 5-1 or 5-2 – multiple front end nodes (“farm”) backed by a redundant database.
RAIN 5-5 – multiple front end nodes with a distributed database tier.
RAIN 50 – mixed use nodes where data is stored local to the front end nodes.
RAIN 551 or 552 – geographic distribution of an application so that nodes run in multiple data centers with data synchronization.
RAIN 555 – nirvana (no, I’m not going to suggest a 666).
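To illustrate the gap between the bottom rungs of that ladder, here is a toy sketch (not from the original taxonomy; all names are made up) contrasting RAIN 0, where any component failure kills the application, with RAIN 1’s active-passive failover:

```python
class Node:
    """A single inexpensive node that can be up or down."""
    def __init__(self, name):
        self.name = name
        self.alive = True

def serve_rain0(nodes, request):
    """RAIN 0: no redundancy -- every component must be up to serve."""
    if all(n.alive for n in nodes):
        return f"served {request}"
    raise RuntimeError("application down: a component failed")

def serve_rain1(active, passive, request):
    """RAIN 1: active-passive -- the idle standby takes over on failure."""
    node = active if active.alive else passive
    if not node.alive:
        raise RuntimeError("both nodes failed")
    return f"{node.name} served {request}"

web, db = Node("web"), Node("db")
primary, standby = Node("primary"), Node("standby")

print(serve_rain0([web, db], "req-1"))          # works while everything is up
primary.alive = False                           # kill the active node
print(serve_rain1(primary, standby, "req-2"))   # the standby picks up the work
```

The RAIN 1 standby doubles the hardware bill while doing nothing in steady state, which is exactly why the higher levels spread the work across all nodes instead.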
Unlike RAID, there’s an extra hardware dimension to RAIN. All our careful redundancy goes out the window if the nodes are packed onto the same server and/or network path. We’ll save that for another post.
I hope you’ll agree that Clouds create RAINy apps.