Are Clouds using Dark Cycles?

Or “Darth Vader vs Godzilla”

Way way back in January, I’d heard loud and clear that companies were not expecting to mix cloud computing loads.  I was treated like a three-eyed Japanese tree slug for suggesting that we could mix HPC and Analytics loads with business applications in the same clouds.  The consensus was that companies would stand up independent clouds for each workload.  The analysis work was too important to interrupt and the business applications too critical to risk.

It has always rankled me that all those unused compute cycles (“the dark cycles”) go to waste when they could be put to good use.  It appeals to my eco-geek side to make the best possible use of all those idle servers.  Dave McCrory and I even wrote some cloud patents around this.

However, I succumbed to the scorn and accepted the separation.

Now all of a sudden, this idea seems to be playing Godzilla to a Tokyo-shaped cloud data center.  I see several forces merging to resurrect the idea of mixing workloads.

  1. Hadoop (and other map-reduce Analytics) are becoming required business tools (a small sketch of that kind of job follows this list)
  2. Public clouds are making it possible to quickly (if not cheaply) setup analytic clouds
  3. Governance of virtualization is getting better
  4. Companies want to save some $$$
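
To ground the first item, here is a minimal sketch of the kind of map-reduce job that ends up hunting for spare cycles.  It follows the Hadoop Streaming convention (mapper and reducer reading stdin and writing tab-separated pairs); the word-count task and file names are purely illustrative.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming-style word count (illustrative only).
    # Hadoop runs the mapper over input splits, sorts the mapper output by key,
    # and then runs the reducer over the sorted stream.
    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                print("%s\t1" % word)

    def reducer(lines):
        pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print("%s\t%d" % (word, sum(int(count) for _, count in group)))

    if __name__ == "__main__":
        # "python wordcount.py" as the mapper, "python wordcount.py reduce" as the reducer
        reducer(sys.stdin) if sys.argv[1:] == ["reduce"] else mapper(sys.stdin)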

This trend will only continue as Moore’s Law improves hardware compute density.  Since our designs are heading towards scale-out architectures that distribute applications over multiple nodes, it is not practical to expect a single application to consume all the power of a single computer.

That leaves a lot of lonely dark cycles looking for work.

Network World on Ubuntu Cloud

My team at Dell is working on solutions around this cloud strategy.  I like the approach that Canonical & Eucalyptus are taking concerning the use of open source (KVM), ad hoc API standards (Amazon), and flexible storage configurations (DAS or SAN).

Looking at usage trends, stateless server designs (as we get closer to PaaS) will allow us to rethink how we architect hypervisor-based clouds.  Of course, this requires us to rethink application architectures and the OS choices that we make to run them.

Thanks to BartonGeorge.net for the link that got this thought started.  Network World says…

“Ubuntu Enterprise Cloud provides tight integration between Ubuntu and Eucalyptus and a series of CLI tools (made even more simple by apps like HybridFox which gives them a GUI) that follows along Amazon’s construction. Work done for Ubuntu Enterprise Cloud ends up being somewhat reusable if you’re transporting your work to Amazon.”
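
For a hedged illustration of that reuse, here is a minimal sketch using the boto 2.x library.  The keys and hostname are placeholders, and port 8773 with the “/services/Eucalyptus” path reflects the historical Eucalyptus defaults rather than anything specific to this setup; pointing the same client at Amazon is essentially a matter of dropping the custom region.

    # Sketch: the same boto client code can target a Eucalyptus/UEC endpoint or Amazon EC2.
    # The credentials, hostname, and port below are placeholders (8773 and
    # /services/Eucalyptus were the customary Eucalyptus defaults).
    import boto
    from boto.ec2.regioninfo import RegionInfo

    region = RegionInfo(name="eucalyptus", endpoint="cloud.example.com")
    conn = boto.connect_ec2(
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
        is_secure=False,
        region=region,
        port=8773,
        path="/services/Eucalyptus",
    )

    # The EC2-style calls stay the same whether the endpoint is private or Amazon's.
    for image in conn.get_all_images():
        print(image.id, image.location)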

Just Striping in the RAIN

Or “behold the power of the unreliable”

In a previous post, I discussed the concept of a Redundant Array of Inexpensive Nodes (RAIN) as a way to create more reliable and scalable applications.  Deploying a RAIN application is like being the House in Vegas – it’s about having enough scale that the odds come out in your favor.  Even if one player is on a roll, you can predict that nearly everyone else is paying your rent.  Imagine what would happen if all the winning gamblers were in your casino!  If you don’t want to go bankrupt when deploying a RAIN app, then make sure the players are spread out all over the Strip.

One of my core assumptions is that you’ll deploy a RAIN application on a cloud.  This is significant because we’re assuming that your nodes are

  1. idle most of the time because your traffic loads are cyclic
  2. unreliable because the cloud provider does not offer much of an SLA
  3. divisible because renting ten 1/10ths of a server costs roughly the same as a whole one
  4. burstable because 1/10th servers can sometimes consume that extra 9/10th server

The burstable concept is a dramatic power multiplier that tips the whole RAIN equation heavily towards clouds. 

Bursting means that under load, your ten 1/10th servers (roughly one server in cost) could instantly expand to the power of ten full servers!  That absorbs an order-of-magnitude spike in demand without any change to your application.
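
As a back-of-envelope sketch of that multiplier (the 1/10th slice size is just the example from this post, not a magic number):

    # Back-of-envelope: steady-state cost vs. burst capacity for fractional nodes.
    SLICE = 0.10      # each node rents 1/10th of a physical server
    NODES = 10        # ten slices spread across ten different hosts

    steady_cost = NODES * SLICE    # ~1 server-equivalent of spend
    burst_capacity = NODES * 1.0   # every slice can balloon to its whole host

    print("steady-state cost:", steady_cost, "server-equivalents")       # 1.0
    print("peak burst capacity:", burst_capacity, "server-equivalents")  # 10.0
    print("burst multiplier: %dx" % (burst_capacity / steady_cost))      # 10x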

In the past, we’ve racked extra servers to handle this demand.  That meant that we had a lot of extra capacity taking up rack space, clubbing innocent migratory electrons for their soft velvety fur, and committing over-provisioning atrocities.

Today, multi-tenant clouds allow us to average out these bursts by playing the odds on when application bursts will occur.  In this scenario, the infrastructure provider benefits from the fact that applications need to be over-provisioned.  The application author benefits because they can instantly tap more resources on demand.  Now, that is what I would call a win-win synergy!
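
Here is a rough simulation of why those odds favor the provider.  The numbers are my assumptions for illustration (100 tenants, each idling at a tenth of a server and bursting independently about 5% of the time), not measurements from any real cloud.

    # Rough Monte Carlo sketch of multi-tenant burst averaging.
    # Assumptions (mine, for illustration): 100 tenants, each idling at 0.1 servers
    # and bursting to a full server ~5% of the time, independently of one another.
    import random

    TENANTS = 100
    BURST_PROBABILITY = 0.05
    IDLE_DEMAND, BURST_DEMAND = 0.1, 1.0
    TRIALS = 10000

    peaks = []
    for _ in range(TRIALS):
        demand = sum(
            BURST_DEMAND if random.random() < BURST_PROBABILITY else IDLE_DEMAND
            for _ in range(TENANTS)
        )
        peaks.append(demand)
    peaks.sort()

    # Provisioning every tenant for its own worst case needs 100 servers;
    # provisioning for the pool's observed 99.9th percentile needs far fewer.
    print("worst-case provisioning:", TENANTS * BURST_DEMAND)
    print("99.9th percentile of pooled demand:", peaks[int(0.999 * TRIALS)])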

All this goodness depends on

  • standard patterns & practices that developers use to scale out RAIN applications
  • platform improvements in cloud infrastructure to enable smooth scale out
  • commercial models that fairly charge for multi-tenant over-subscription
  • workable security that understands the underlying need for co-mingling workloads

The growing dominance of cloud deployments looks different when you understand the multiplying interplay between multi-tenant clouds and cloud-ready RAIN applications.

Green Clouds?

This is an interesting take on clouds by the Guardian.  Dell’s new cloud offerings are more power efficient; however, we are racking lots and lots of servers.  It’s like everyone in China buying fuel-efficient cars – they are better than Hummers, but still going to use gas.

We’re clearly entering an age where compute consumed per person is going up dramatically.  They are correct that the cost and environmental impact of that compute is hidden from the consumer.  I have a front row seat to these cloud data centers and I can verify that lots and lots of new servers are being brought online every day. 

Welcome back to 2001.

Rethinking Storage

Or “UNthinking SANs”

Back in 2001, I was co-founder of a start-up building the first Internet virtualized cloud.  Dual CPU 1U pizza box servers were brand new and we were ready to build out an 8 node, 64 VM cloud!  It was going to be a dream – all that RAM and CPU just begging to be oversubscribed.  It was enough to make Turing weep for joy.

Unfortunately, all those VMs needed lots and lots of storage.

Never fear, EMC was more than happy to quote us a lovely SAN with plenty of redundant HBAs and interconnected fabric switches.  It was all so shiny and cool yet totally unscalable and obscenely expensive.  Yes, unscalable because that nascent 8-node cloud was already at the port limit for the solution!  Yes, expensive because that $50,000 hardware solution would have needed a $1,000,000 storage solution!

The funny part is that even after learning all that, we still wanted to buy the SAN.  It was just that cool.

We never bought that SAN, but we did buy a very workable NAS device.  Then it was my job to change (“pragmatic-ize”) our architecture so that our cloud management did not require expensive shiny objects.

Our ultimate solution used the NAS for master images that were accessed by many nodes.  Those requests were mostly reads and easy to optimize.  Writes went to differencing disks kept on each node’s local disk, which scaled out nicely.  On some systems, we were even able to keep the masters local and save bandwidth.  This same strategy could easily be applied in current “stateless” VM deployments.
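
A minimal sketch of that layout using today’s tooling; qemu-img copy-on-write overlays are my stand-in here (the original setup used different plumbing), and the NAS and local paths are placeholders.

    # Sketch: golden master images on shared NAS, per-VM differencing (copy-on-write)
    # disks on node-local storage.  qemu-img/qcow2 is a stand-in for whatever the
    # original differencing-disk mechanism was; the paths are placeholders.
    import subprocess

    NAS_MASTER = "/mnt/nas/masters/base-os.qcow2"   # read-mostly, shared by many nodes
    LOCAL_DIR = "/var/lib/vms"                      # node-local, absorbs all the writes

    def create_overlay(vm_name):
        """Create a thin local overlay whose backing file is the NAS master."""
        overlay = "%s/%s.qcow2" % (LOCAL_DIR, vm_name)
        subprocess.check_call([
            "qemu-img", "create",
            "-f", "qcow2",      # overlay format
            "-b", NAS_MASTER,   # reads fall through to the shared master image
            "-F", "qcow2",      # backing file format
            overlay,
        ])
        return overlay

    if __name__ == "__main__":
        print(create_overlay("vm01"))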

Some of the SANless benefits are:

  • Less cost
  • Simplicity of networking and management
  • Nearer to linear scale out
  • Improved I/O throughput
  • Better fault tolerance (storage faults are isolated to individual nodes)

Of course, there are costs:

  • More spindles means more energy use (depending on drive selection and other factors)
  • Lack of centralized data management
  • Potentially wasted space because each system carries excess capacity
  • The need to synchronize data stored in multiple locations

These are real costs; however, I believe the data management problems are unsolved issues for SAN deployments too.  Data proliferation is simply hidden inside of the VMs.

Today, I observe many different SAN-focused architectures and cringe.  These same solutions could be much simpler, more scalable, and dramatically more affordable with minimal (or even no) changes.  If you’re serious about deploying a cloud based on commodity systems, then you seriously need to re-evaluate your storage.

Dell goes to the Clouds (hardware & Joyent)

As a Dell employee, I’ve had the privilege of being on the front lines of Dell’s cloud strategy.  Until today, I have not been able to post about the exciting offerings that we’ve been brewing.

Two related components have been occupying my days.  The first is the new cloud-optimized hardware and the second is the agreement to offer private clouds using Joyent’s infrastructure.  Over the next few weeks, I’ll be exploring some of the implications of these technologies; I’ve already touched on them in previous posts.

Cloud-optimized hardware grew out of lessons learned in Dell’s custom mega-volume hardware business (that’s another story!).  This hardware is built for applications and data centers that embrace scale-out designs.  These customers build applications that are so fault tolerant that they can focus on power, density, and cost optimizations instead of IT hardening.  It’s a different way of looking at the data center because they see the applications and the hardware as a whole system.

To me, that system view is the soul of cloud computing.

The Dell-Joyent relationship is a departure from the expected.  As a founder of Surgient, I’m no stranger to hypervisor private clouds; however, Joyent takes a fundamentally different approach.  Riding on top of OpenSolaris’ paravirtualization, this cloud solution virtually eliminates the overhead and complexity that seem to be the default for other virtualization solutions.  I especially like Joyent’s application architectures and their persistent vision of how to build scale-out applications from the ground up.

To me, scale should be baked into the heart of cloud applications.

So when I look at Dell’s offerings, I think we’ve captured the heart and soul of true cloud computing.

Cloud Application Life Cycle

Or “you learn by doing, and doing, and doing”

One of the most consistent comments I hear about cloud applications is that it fundamentally changes the way applications are written.  I’m not talking about the technologies, but the processes and infrastructure.

Since our underlying assumption for a cloud application is that node failure is expected, our development efforts need to build in that assumption before any code is written.  Consequently, cloud apps should be written directly on cloud infrastructure.

In old school development, I would have all the components for my application on my desktop.  That’s necessary for daily work, but does not give me a warm fuzzy for success in production.

Today’s at-scale production environments involve replicated data with synchronization lags, shared multi-writer memcache, load balancers, and mixed code versions.  There is no way that I can simulate that on my desktop!  There is no way I can fully anticipate how it will all behave together!

The traditional alternative is to wait.  Wait for QA to try and find bugs through trial and error.  Or (more likely) wait for users to discover the problem post deployment.

My alternative is to constantly deploy the application to a system that matches production.    As a bonus, I then attack the deployment with integration tests and simulators.
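
As a hedged sketch of what “attacking the deployment” can look like, here is a small integration test aimed at a production-shaped staging environment.  The STAGING_URL and the /health endpoint are placeholders; the point is only that the tests run against real load balancers, replicas, and caches rather than a developer desktop.

    # Sketch: smoke/integration tests run against a production-shaped staging deploy.
    # STAGING_URL and the /health endpoint are placeholders for whatever your
    # deployment automation stands up.
    import os
    import unittest
    import urllib.request

    STAGING_URL = os.environ.get("STAGING_URL", "http://staging.example.com")

    class DeploymentSmokeTest(unittest.TestCase):
        def fetch(self, path):
            return urllib.request.urlopen(STAGING_URL + path, timeout=10)

        def test_health_endpoint_is_up(self):
            # Fails fast if the deployment automation left the service down.
            self.assertEqual(self.fetch("/health").status, 200)

        def test_reads_survive_the_load_balancer(self):
            # Repeated requests exercise whichever replica answers, not just
            # one warm process on a developer's desktop.
            for _ in range(20):
                self.assertEqual(self.fetch("/health").status, 200)

    if __name__ == "__main__":
        unittest.main()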

If you’re thinking that is too much effort, then you are not thinking deeply enough.  This model forces developers to invest in install and deployment automation.  That means that you will be able to test earlier in the cycle.  It means you will be able to fix issues more quickly.  And that you’ll be able to ship more often.  It means that you can involve operations and networking specialists well before production.  You may even see more collaboration between your development, quality, and operations teams.

Forget about that last one – if those teams actually worked together you might accidentally ship product on time.  Gasp!