DevOps: There’s a new sheriff in Cloudville

Lately there’s been a flurry of interest in (and hiring demand for) DevOps gurus.  It’s obvious to me that there’s about as much agreement on the job description of a DevOps tech as there is on the ethical use of ground unicorn horn.

I look at the world very simply:

  • Developers = generate revenue
  • Ops = control expenses
  • DevOps = write code, set up infrastructure, ??? IDK!

Before I risk my supply of ethically obtained unicorn powder by defining DevOps, I want to explore why DevOps is suddenly hot.  With the cloud driving horizontally scaled applications (see the RAIN posts), there’s been a sea change in the type of expertise needed to manage an application.

Stereotypically, Ops teams get code over the transom from Dev teams.  They have the job of turning the code into a smoothly running application.  That requires rigid controls and safeguards.  Historically, Ops could manage most of the scale and security aspects of an application with scale-up, reliability, and network security practices.  These practices naturally created some IT expense and policy rigidity; however, that’s what it takes to keep the lights on with 5 nines (or 5 nyets if you’re an IT customer).

Stereotypically, Dev teams live a carpe diem struggle to turn their latest code into deployed product with the least delay.  They have the job of capturing mercurial customer value by changing applications rapidly.  Traditionally, they have assumed that problems like scale, reliability, and security could be added after the fact or fixed as they are discovered.  These practices naturally created a need to constantly evolve.

In the go-go cloud world, Dev teams are bypassing Ops by getting infrastructure directly from an IaaS provider.  Meanwhile, IaaS does not provide Ops the tools, access, and controls that they have traditionally relied on for control and management.  Consequently, Dev teams have found themselves staging, managing, and deploying applications with little expertise in operations.  Further, Ops teams have found themselves handed running cloud applications that they must secure, scale, and maintain without the tools they have historically relied on.
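
To make that shift concrete, here’s a minimal sketch of what “getting infrastructure directly from an IaaS provider” looks like.  It assumes AWS EC2 via the boto3 library; the AMI ID, instance type, and tag values are placeholders, not recommendations.

```python
# Minimal sketch: a developer provisions a server directly from an IaaS
# provider (AWS EC2 via boto3) with no Ops team in the loop.
# The AMI ID, instance type, and tag values are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "dev-owned-web-node"}],
    }],
)

print("Launched", response["Instances"][0]["InstanceId"])
```

Nothing in that snippet passes through the controls Ops would normally apply, which is exactly the gap DevOps is being asked to close.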

DevOps has emerged as the way to fill that gap.  The DevOps hero is comfortable flying blind on an outsourced virtualized cloud, dealing with Ops issues to tighten controls and talking shop with Dev to make needed changes to architecture.  It’s a very difficult job because of the scope of skills and the utter lack of proven best practices.

So what is a day in the life of a DevOp?   Here’s my list:

  • Design and deploy scale out architecture
  • Identify and solve performance bottlenecks
  • Interact with developers to leverage cloud services
  • Interact with operations to integrate with enterprise services
  • Audit and secure applications
  • Manage application footprint based on scale
  • Automate actions on managed infrastructure (a minimal sketch follows this list)
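
To illustrate that last item, here’s a minimal automation sketch: sweep a list of nodes, hit their health endpoints, and hand anything unhealthy to a remediation step.  The node addresses and the replace_node() hook are hypothetical placeholders, not a real API.

```python
# Minimal automation sketch: poll node health endpoints and trigger a
# remediation hook for anything that fails the check.
# The addresses and replace_node() are hypothetical placeholders.
import urllib.error
import urllib.request

NODES = [
    "http://10.0.1.11:8080/health",
    "http://10.0.1.12:8080/health",
    "http://10.0.1.13:8080/health",
]

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the node answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def replace_node(url: str) -> None:
    """Hypothetical remediation hook: restart or re-provision the node."""
    print(f"Replacing unhealthy node behind {url}")

if __name__ == "__main__":
    for node in NODES:
        if not is_healthy(node):
            replace_node(node)
```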

This job is so difficult that I think the market cannot supply the needed experts.  That deficit is becoming a forcing function: the cloud industry is being driven to adopt technologies and architectures that reduce its dependence on DevOps skills.  Now, that’s the topic for a future post!

Making Cloud Applications RAIN, part 1

An application that runs “in the cloud” is designed fundamentally differently from a traditional enterprise application.  Cloud apps live on fundamentally unreliable, oversubscribed infrastructure; consequently, we must adopt the same mindset that drove the first RAID storage systems and create a Redundant Array of Inexpensive Nodes (RAIN).

The drivers for RAIN are the same as those for RAID.  It’s more cost effective and much more scalable to put together a set of inexpensive units redundantly than to build a single large super-reliable unit.  Each node in the array handles a fraction of the overall workload, so application design must partition the workload into atomic units.
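
To show what partitioning into atomic units might look like, here’s a minimal sketch that routes each work item to a node with a stable hash, so every node carries its own fraction of the load.  The node names are hypothetical, and a real system would layer retries and rebalancing on top.

```python
# Minimal partitioning sketch: route each atomic work item to one node in
# the array via a stable (non-randomized) hash so every node carries a
# fraction of the workload.  Node names are hypothetical placeholders.
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def node_for(key: str) -> str:
    """Map a work item key to a node using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Example: spread customer orders across the array.
for order_id in ("order-1001", "order-1002", "order-1003"):
    print(order_id, "->", node_for(order_id))
```

Simple placement like this gets messy when nodes join or leave (consistent hashing is the usual fix), but the core idea stands: the workload has to break into units that any node can pick up independently.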

I’ve attempted to map RAIN roughly onto RAID-style levels.  It’s not a perfect fit, but it’s helpful.

  • RAIN 0 – no redundancy.  If one part fails then the whole application dies.  Think of a web server handing off to a backend system that fronts for the database.  You may succeed in subdividing the workload to improve throughput, but a failure in any component breaks the system.
  • RAIN 1 – active-passive clustering.  If one part fails, then a second steps in to take over the workload.  Simple redundancy, but expensive because half your resources sit idle.
  • RAIN 2 – active-active clustering.  Both parts of the application perform work so resource utilization is better, but now you’ve got a data synchronization problem.
  • RAIN 5 – multiple nodes can process the load (see the failover sketch after this list).
  • RAIN 6 – multiple nodes with specific dedicated stand-by capacity.  Sometimes called “N+1” deployment, this approach works well with failure-ready designs.
  • RAIN 5-1 or 5-2 – multiple front end nodes (“farm”) backed by a redundant database.
  • RAIN 5-5 – multiple front end nodes with a distributed database tier.
  • RAIN 50 – mixed use nodes where data is stored local to the front end nodes.
  • RAIN 551 or 552 – geographical distribution of an application so that nodes are running in multiple data centers with data synchronization.
  • RAIN 555 – nirvana (no, I’m not going to suggest a 666).
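
As referenced under RAIN 5, here’s a minimal client-side sketch of what a farm of interchangeable nodes buys you: any node can process the request, so the caller just moves on when one fails.  The node URLs are hypothetical, and a real deployment would put a load balancer or service discovery in front of this logic.

```python
# Minimal RAIN 5-style sketch: any node in the farm can handle the request,
# so the caller fails over to the next node when one is down.
# Node URLs are hypothetical placeholders.
import random
import urllib.error
import urllib.request

NODES = [
    "http://app-node-1.example.internal:8080",
    "http://app-node-2.example.internal:8080",
    "http://app-node-3.example.internal:8080",
]

def fetch(path: str, timeout: float = 2.0) -> bytes:
    """Try the farm in random order; return the first successful response."""
    last_error = None
    for node in random.sample(NODES, len(NODES)):
        try:
            with urllib.request.urlopen(node + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # that node is down; try the next one
    raise RuntimeError(f"all nodes failed: {last_error}")

# Example: any surviving node can serve the status page.
# print(fetch("/status"))
```

Contrast that with RAIN 0, where the first failed component takes the whole application down.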

Unlike RAID, there’s an extra hardware dimension to RAIN.  All our careful redundancy goes out the window if the nodes are packed onto the same server and/or network path.  We’ll save that for another post. 

I hope you’ll agree that Clouds create RAINy apps.