PaaS Simplified: an application architecture that responds to load


In addition to attending the great sessions at the OpenStack Design Conference, our Dell team realized that we’ve been making Platform as a Service (PaaS) much more complex than it needs to be.  Stripping away the detritus is important because the answer to “what is a PaaS” seems to change on a daily basis, so boiling it down to its most fundamental form is essential.

At its core, a PaaS is an application that changes its architecture based on the load.  That’s it; no further definition is required.
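To make that definition concrete, here is a minimal control-loop sketch.  The `app` and `cloud` objects are hypothetical stand-ins for whatever monitoring and provisioning interfaces a platform exposes – this shows the shape of the idea, not any product’s API:

```python
import time

def respond_to_load(app, cloud):
    """Re-shape the application's architecture based on observed load."""
    rps = app.requests_per_second()            # observed load (hypothetical call)
    if rps > 1000 and not app.has_cache_tier():
        cloud.add_cache_tier(app)              # change the architecture itself
    web_nodes = max(1, rps // 100)             # one web node per 100 req/s
    cloud.scale_tier(app, 'web', web_nodes)    # grow and shrink with demand

# The platform, not the developer, owns this loop:
# while True:
#     respond_to_load(app, cloud)
#     time.sleep(60)
```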

I’ve been playing with this definition since April and am finding that it’s a much more productive definition of PaaS than any that I’ve used so far.  The reason is that it:

  1. is application focused,
  2. is not language or service bound, and
  3. captures the business use cases.

Of course, I’m going to have to provide more backup in future posts.  I want to invite discussion about this perspective on PaaS.  I’m especially interested in seeing how recent offerings from VMware (OpenPaaS/CloudFoundry) or Amazon (Elastic Beanstalk) measure against this concept.

Bad Premise: Cloud Outages are *not* driving IT back to premises


I wrote this in response to Lauren Carlson’s (Software Advice) blog post.  Lauren – I’d be more likely to agree with the statement that “SLAs are dead.”  Here’s why…

<soapbox>

Recent industry buzz about cloud service level agreements (SLAs) and reliability misses the core point about cloud.  Cloud is about agility, business models, consumerization of software, and the merciless pursuit of efficiency.

The fact that Amazon EC2 built its base without an “enterprise” SLA is exhibit #1 that the IT world changed and it’s not going back.

Here are my reasons why IT pandoras can’t get cloud back into the box.

#1. Cloud has vastly superior network connectivity

The concept of your users accessing your applications from inside your firewall is so 2005.  Today’s reality is that a significant amount of network access is externally routed, which means that applications need to live where they have excellent bandwidth to their users and to other applications.

#2. Cloud has elastic consumption of resources

Cloud is not less expensive infrastructure; it is mainly more flexible infrastructure.  If you’re worried about an outage, then cloud is exactly the investment for you because you can position a backup site at another location without having to pay for always-on resources.  It’s much harder to take down a site that invests the time to design a system that dynamically reallocates load between sites.

#3. Cloud drives more robust architecture

The fact that cloud delivery is more opaque and modular, without a five 9s SLA, has driven a cloud application architecture revolution (see CAP).  We have shifted the app paradigm from robust scale-up hardware to robust scale-out software.  Also significant, DevOps innovations have made deployments repeatable and adaptable.

The only “logical” argument for pulling applications back from the cloud is to assert control over more of the delivery chain for your application.  It’s the same reason that we think driving is safer than flying – we’re the ones sitting behind the wheel when we drive.  News flash – driving is NOT safer than flying.

Cloud applications are not about hardware infrastructure, they are about SOFTWARE.  Perhaps one of the greatest disservices foisted on the market was saying cloud is synonymous with “Infrastructure as a Service” and “Virtualization.”  Cloud applications are powerful because we created ways that circumvent the limitations of IaaS and VMs!

</soapbox>

Not all APIs are equal: the power of API + implementation (OpenStack vs LibCloud vs DeltaCloud)


I’ve been getting a lot of questions about Apache LibCloud and RedHat’s DeltaCloud vs. OpenStack.  While all of these projects offer APIs, only OpenStack is based on an implementation.

Having an implementation means that the API is backed by code that delivers the functionality of the API.  This means that an implementation-based API more closely reflects the actual workings of the system, while a “pure” API must abstract the workings of multiple systems.  The API-only approach ends up becoming a least common denominator instead of a vision of the pure use cases.

LibCloud and DeltaCloud are important and useful.  They provide abstractions that help developers write applications without being tied to a specific cloud vendor.  While lack of lock-in is a concrete benefit, it comes at a price.  The price is that the API shim cannot expose features that differentiate the platforms.  This may represent a significant loss of functionality or performance.
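A quick Apache LibCloud sketch illustrates the trade-off.  Credentials are placeholders, and the `ex_` method shown is the EC2 driver’s extension mechanism as I understand recent LibCloud versions:

```python
from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

# Portable, least-common-denominator calls: these work on any provider.
EC2Driver = get_driver(Provider.EC2)
driver = EC2Driver('ACCESS_KEY', 'SECRET_KEY')   # placeholder credentials
nodes = driver.list_nodes()
sizes = driver.list_sizes()

# Provider-specific features fall outside the abstraction.  LibCloud's
# convention is to expose them as ex_* extension methods - and the moment
# you call one, you are tied to that driver again.
driver.ex_create_security_group('web', 'web tier firewall')  # EC2-only
```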

When developers implement directly against an implemented API, they can take advantage of the full feature set of their target cloud.  They can also test and verify more directly.  These are significant benefits that result in richer, more robust and faster to market products.

Both approaches have their place and are needed in the market.  If I needed to write against multiple clouds for portability, then LibCloud is a slam dunk.  If I needed rich features and an ecosystem, then OpenStack or Amazon are better choices.

The unexpected openness of OpenStack: why it’s important to learn from others’ operations experience.

During the OpenStack Design Conference, Forrester’s James Staten (@Staten7) raved about OpenStack’s transparency compared to AWS.  Within the enclave of OpenStack supporters (Dell alone sent >14 people to the summit), his post drew considerable attention but did little to really further the value proposition.

“Open deployments” are a much more significant value to implementors than transparency from open source code.

For any technology solution, there are significant challenges that will only be understood when the system is under stress.  In some cases, these challenges are code defects; however, many will be related to configuration and deployment choices that are site specific.  Correcting these issues results in the design patterns and practices that create a robust infrastructure; consequently, the process of hardening a solution is critical to its ultimate stability and success.

When a solution, like AWS, is deployed and managed by a single entity, it is extremely rare for operational lessons learned and best practices to make it to the larger community.  Amazon’s recent post mortem is a welcome exception.  This is not a bad thing (see Roman Stanek’s contrasting point); it is just the reality of a proprietary cloud.  AWS operates as a black box, and I don’t believe that Amazon’s operational experience would be relevant to others unless they were also operationally transparent.

While it makes business sense to remain operationally opaque, service providers lose the benefit of external lessons learned when there is no community working in parallel with them.

OpenStack’s community has an opportunity to iterate on CloudOps patterns and practices at a dramatically faster rate than any single provider.  This creates distinct value for OpenStack adopters because they can shorten or eliminate their own challenges because other adopters will have the same pains and benefit from the same fixes.

It is critical to understand that the benefit is conferred on both the party sharing the problem (they get advice and support) and the party lending assistance (they avoid the problem).  This is distinctly different from proprietary clouds, where sharing is likely to cause embarrassment and unlikely to create helpful outcomes.

I am not advocating that all OpenStack deployments be the same or follow a prescriptive pattern.

I believe that each installation will be unique in some way; however, there will be enough commonalities and shared code to make sharing worthwhile.  This is especially true for adopters who start with tools like Crowbar that leverage community-based Chef recipes and automation scripts.  Tools that encourage automation and shared scripts help accelerate the establishment of robust deployment patterns and practices.

Ultimately, the ability to collaborate on cloud operation practice does more to strengthen OpenStack than developers, code reviews or corporate endorsements.

BlackOps: 7 tenets for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale


In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list, just behind world peace.  Basically, a cloud taxonomy is a layer cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch.  Unfortunately, our clock-like layer cake has evolved into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.

The key to de-spaghettifying our cloud taxonomy was realizing that clouds have two distinct sides: an external, well-known API and internal “black box” operations.  Each side has different objectives that together create an elastic, scalable cloud.

The objective of the API side is to provide the smallest usable “surface area” for cloud consumers.  Surface area describes the scope of the interface that is published to the users.  The smaller the area, the easier it is for users to comprehend and the harder it is for them to break.  Amazon’s EC2 & S3 APIs set the standard for small surface area design and spawned a huge cloud ecosystem.
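For a feel of just how small that surface area is, here is a sketch using the boto library (credentials, AMI id, and bucket name are placeholders): a handful of verbs cover the core compute and storage use cases.

```python
import boto  # the classic Python AWS library

ec2 = boto.connect_ec2('ACCESS_KEY', 'SECRET_KEY')   # placeholder credentials
s3 = boto.connect_s3('ACCESS_KEY', 'SECRET_KEY')

# Compute: run it, list it, kill it.  That covers most of the EC2 surface.
reservation = ec2.run_instances('ami-12345678', instance_type='m1.small')
instance = reservation.instances[0]
ec2.terminate_instances([instance.id])

# Storage: buckets of keys with get/put semantics.  That is most of S3.
bucket = s3.create_bucket('example-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('hello cloud')
print(key.get_contents_as_string())
```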


To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem.  The cloud ecosystem exists primarily beyond the API of the cloud.  It provides users with flexible options that address their specific use cases.  Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.

The objective of the internal side of clouds is to create a perfect black box that gives API users the illusion of a perfectly performing, strictly partitioned, and totally elastic resource pool.  To the consumer, it should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it matters a great deal to the cloud operator.

Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps). 


The BlackOps challenge is that clouds cannot wait until all of the answers are known because issues with (and solutions for) scale architecture are difficult to predict in advance.  Even worse, taking the time to solve them in advance likely means that you will miss the market.

Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale.  To cope with constant change, BlackOps teams live by seven tenets that help them manage their infrastructure efficiently in a dynamic environment.

  1. Operational ownership – don’t wait for all the king’s horses and consultants to put you back together again (but asking for help is OK).
  2. Simple APIs – reduce the ways that consumers can stress the system, making the scale challenges more predictable.
  3. Efficiency-based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
  4. Automated processes & verification – ensure that changes and fixes can propagate at scale while errors are self-correcting (see the sketch after this list).
  5. Frequent incremental rolling adjustments – prevent the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”).
  6. Passion for operational simplicity – at hyperscale, technical debt compounds very quickly.  Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
  7. Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work.  You must catch problems early, before they infect your scale infrastructure.  The only thing more frustrating than fixing a problem at scale is fixing the same problem multiple times.
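Tenets 4 and 5 are the most mechanical, so here is a minimal sketch of how they combine.  The `deploy`, `health_check`, and `rollback` callables are hypothetical hooks for whatever tooling (Chef recipes, Crowbar scripts, etc.) actually does the work:

```python
def rolling_update(nodes, new_version, deploy, health_check, rollback,
                   batch_size=5):
    """Tenets 4 & 5: small rolling batches with automated verification."""
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            deploy(node, new_version)
        # Automated verification: never let an unchecked change propagate.
        if not all(health_check(node) for node in batch):
            for node in batch:
                rollback(node)   # self-correct, then halt the rollout
            raise RuntimeError('batch at node %d failed verification' % start)
```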

It’s no surprise that these are exactly the Agile & Lean principles.  The pace of change in cloud is so fast and fluid that BlackOps must use an operational model that embraces iterative and rolling deployment.

Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.

I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies.  BlackOps is a highly disciplined process based on well-understood principles from just-in-time (JIT) and lean manufacturing.  Best of all, teams that embrace it are fast to market, able to deliver high quality, and capable of responding to change.

Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project.  I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.

Why cloud compute will be free

Today at Dell, I was presenting to our storage teams about cloud storage (aka the “storage banana”) and Dave “Data Gravity” McCrory reminded me that I had not yet posted my epiphany explaining “why cloud compute will be free.”  This realization derives from topics that he and I have blogged about but never stated so simply.

Overlooking the fact that compute is already free at Google and Amazon, you must understand that it’s a cloud-eat-cloud world out there where losing a customer places your cloud in jeopardy.  Speaking of Jeopardy…

Answer: Something sought by cloud hosts to make profits (and further the agenda of our AI overlords).

Question: What is lock-in?

Hopefully, it’s already obvious to you that clouds are all about data.  Cloud data takes three primary forms:

  1. Data in transformation (compute)
  2. Data in motion (network)
  3. Data at rest (storage)

These three forms combine to create cloud architecture applications (service oriented, externalized state).

The challenge is to find a compelling charge model that both:

  1. Makes it hard to leave your cloud AND
  2. Encourages customers to use your resources effectively (see #1 in Azure Top 20 post)

While compute demand is relatively elastic, storage demand is consistent, predictable, and constantly growing.  Data is easily measured and difficult to move.  In this way, data represents the perfect anchor for cloud customers (model rule #1).  A host with a growing data consumption footprint will have a long-term, predictable revenue base.

However, storage consumption alone does not encourage model rule #2.  Since storage is the foundation for the cloud, hosts can fairly judge resource use by measuring data egress, ingress and sidegress (attrib @mccrory 2/20/11).  This means tracking not only data in and out of the cloud, but also data transacted between the provider’s own cloud services.  For example, Azure charges for both data at rest ($0.15/GB/mo) and data in motion ($0.01/10K transactions).
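Plugging those quoted rates into a quick back-of-the-envelope model shows how the meter runs (the workload numbers below are invented for illustration, and the rates will certainly drift):

```python
# Hypothetical monthly bill at the Azure rates quoted above:
# $0.15/GB/mo for data at rest, $0.01 per 10K storage transactions.
data_at_rest_gb = 500        # stored data - and it grows every month
transactions = 30_000_000    # data in motion between services

storage_cost = data_at_rest_gb * 0.15              # $75.00
transaction_cost = (transactions / 10_000) * 0.01  # $30.00
print('monthly bill: $%.2f' % (storage_cost + transaction_cost))  # $105.00
```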

Consequently, the financially healthiest providers are the ones with the most customer data.

If hosting success is all about building a large, persistent storage footprint, then service providers will give away services that drive data at rest and/or in motion.  Giving away compute eliminates the barrier for customers to set up web sites, develop applications, and build their businesses.  As these accounts grow, they will deposit data in the cloud’s data bank and ultimately deposit dollars in the host’s piggy bank.

However, there is a no-free-lunch caveat: free compute will not carry a meaningful service level agreement (SLA).  The host will continue to charge customers who need their applications to operate consistently.  I expect that we’ll see free compute (or “spare compute,” from the cloud provider’s perspective) heavily used for early life-cycle work (development, test, proof-of-concept) and for background analytic applications.

The market is starting to wake up to the idea that cloud is not about IaaS – it’s about who has the data and the networks.

Oh, dem golden spindles!  Oh, dem golden spindles!

Seattle Cloud Camp, Dec 2010

While I was in Seattle for Azure training preparing for Dell’s Azure Appliance, Dave @McCrory suggested that we also attend the Seattle Cloud Camp (SCC Tweets).  This event was very well attended (200 people!).  With heavy attendance by Amazon (at their HQ), Microsoft (in the ’hood), and Google, there was a substantial cloud vendor presence (>25% from those vendors alone).  Notable omission: VMware.

My reflections on the event, by segment.

Opening Sessions:

  • Most of the opening sessions were too light for the audience.  I thought we were past the “what is cloud” level, sigh.
  • Of note, the Amazon security presentation by Steve Riley was fun and entertaining.
  • Picking on a Dell competitor specifically: calling your cloud solution “WAS” is a branding #fail (not that DCSWA is much better).

Unpanel of self-appointed cloud experts:

  • The unpanel covered some decent topics (@adronbh captured them on Twitter); unfortunately, none of the answers really stood out to me.  Except for NoSQL.
  • The unpanel discussion about NoSQL drew 2 answers: 1) It’s not NoSQL, it’s eventually consistent instead of strictly consistent (note: I’ve been calling it “Storage++”).  2) We’ll see more and more choices in this area as we tune the models for utility, and then we’ll see some consolidation.  The suggestion was that NoSQL would follow the same explosion/contraction pattern as SQL databases.

Session on Cloud APIs (my suggested topic)

  • The Cloud API topic was well attended (30+).  The overwhelming majority of the attendees were using Amazon.
  • The idea of having “standard” APIs for cloud functions drew some interest but was not well received because standards were felt to stifle innovation.  We are still too early.
  • It was postulated, but not generally agreed, that cloud aggregation (DeltaCloud, RightScale, etc.) is workable.  This was considered a reason not to require standard cloud APIs.
  • CloudCamp sponsor, Skytap, has their own API.  These APIs are value added and provide extra abstraction levels.
  • It was said that there are a LOT (50 now, 500 soon) of smaller hosts that want to enter the cloud space.  These hosts will need an API – some are inventing their own.
  • I brought up the concept discussed at OpenStack that the logical abstraction for cloud network APIs is a “vlan.”  This created confusion because some thought that I meant actual 802.1q tags.  NO!  I just meant the ABSTRACTION of a VLAN connecting VMs together (see the sketch after this list).
  • There was agreement from the clouderati in the room that cloud networking was f’ed up, but most people were not ready to discuss it.
  • Cloud APIs have some basics that are working (semantics around VMs) but still have lots of holes, notably networking, application, services, and identity.
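To make the “VLAN as abstraction” point concrete, here is a purely hypothetical sketch – none of these calls exist in any real cloud API; they only illustrate the shape of the abstraction (a named broadcast domain connecting VMs, however it is implemented underneath):

```python
# Hypothetical network API sketch - the `cloud` object and every call on
# it are invented for illustration; no actual 802.1q tags implied.
web_net = cloud.create_vlan('web-tier')    # a named virtual broadcast domain
db_net = cloud.create_vlan('db-tier')

web_vm = cloud.create_vm('web-1', networks=[web_net])
db_vm = cloud.create_vm('db-1', networks=[web_net, db_net])  # dual-homed

cloud.connect(web_net, internet=True)      # only the web tier is routable
```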

Session on Google App Engine (GAE)

  • GAE has got a lot going on, especially in the social/mobile space.
  • Do not think a lack of news about GAE means that they are going slow; it’s just the opposite.  It looks like they are totally kicking ass with a very focused strategy.  I suspect that they are just waiting for the market to catch up.
  • GAE understands what a “platform” really is.  They talk about their platform as the SERVICES that they are offering.  The code is just code.  The services are impressive and include identity, mail, analysis, SQL (business only), map (as in Map-Reduce), prediction (yes, prediction!), storage, etc.  The total list was nearly 20 distinct services.
  • GAE compared themselves to Azure, not Amazon.

VM != Cloud! Comparison draws ire, misses point

Having the benefit of working with both Dave McCrory and Joyent on a daily basis at Dell, I cannot resist weighing in on the blog pong between them.

Dave’s post comparing VM pricing prompted Joyent to blog that VMs are not the only measure of cloud.

While I completely agree that clouds are not all about VMs, I think that Joyent is too limited in their definition of cloud in their reply.  We’re seeing an emergence of services as the differentiator between clouds.

Looking at Amazon, Azure, and Google, the clear way to reduce cloud spend is to migrate applications to consume their services (SQL, Storage, Bus, etc).

If cloud users are primarily concerned about price per hour (which I’m not convinced is the case), then they have real motivation to migrate from purely VM (or SmartMachine™) based applications to ones that use services.

PaaS, much ado about network services

There’s a surprising amount of hair pulling regarding IaaS vs PaaS.  People in the industry get into shouting matches about this topic as if it mattered more than Lindsay Lohan’s journey through rehab.

The cold hard reality is that while pundits are busy writing XaaS white papers, developers are off just writing software.  We are writing software that fits within cloud environments (weak SLA, small VMs), saves money (hosted data instead of data in VMs), and changes quickly (interpreted languages).  We’re doing it using an expanding toolkit of networked components like databases, object stores, shared caches, and message queues.

Using network components in an application architecture is about as novel as building houses made of bricks.  So, what makes cloud architectures any better or different?

Nothing!  There is no difference if you buy VMs, install services, and wire together your application in its own little cloud bubble.  If I wanted to bait trolls, I’d call that an IaaS deployment.

However, there’s an emerging economic driver to leverage lower cost and more elastic infrastructure by using services provided by hosts rather than standing them up in a VM.  These services replace dedicated infrastructure with managed, network-attached services, and they have become a key differentiator for all the cloud vendors:

  • At Google App Engine, they include Big Tables, Queues, MemCache, etc
  • At Microsoft Azure, they include SQL Azure, Azure Storage, AppFabric, etc
  • At Amazon AWS, they include S3, SimpleDB, RDS (MySQL), Queue & Notify, etc

Using these services allows developers to focus on the business problems we are solving instead of building out infrastructure to run our applications.  We also save money because consuming an elastic managed network service is less expensive (and more consumption based) than standing up dedicated VMs to operate the services.

Ultimately, an application can be written as stateless code (really, “externalized state” is a more accurate description) that relies on these services for persistence.  If a host were to dynamically instantiate instances of that code based on incoming requests, then my application’s resource requirements would become strictly consumption based.  I would describe that as true cloud architecture.
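Here is a minimal sketch of that externalized-state style.  The `cache`, `store`, and `queue` objects stand in for managed services (a shared cache, an object store, a message queue), and `do_business_logic` is a placeholder for the actual work:

```python
def handle_request(request, cache, store, queue, do_business_logic):
    """Keep no state in process memory or on local disk, so any
    instance - created or destroyed on demand - can serve any request."""
    key = 'sessions/%s' % request.session_id
    session = cache.get(request.session_id)       # hot state in shared cache
    if session is None:
        session = store.get(key)                  # cold state in object store

    result = do_business_logic(request, session)  # pure computation

    store.put(key, session)                       # persist before returning
    queue.send({'event': 'handled', 'session': request.session_id})
    return result
```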

On a bold day, I would even consider an environment that offered that architecture to be a platform.  Some may even dare to describe that as a PaaS; however, I think it’s a mistake to look to the service offering for the definition when it’s driven by the application designers’ decisions to use network services.

While we argue about PaaS vs IaaS, developers are just doing what they need.  Today they may stand-up their own services and tomorrow they incorporate 3rd party managed services.  The choice is not a binary switch, a layer cake, or a holy war.

The choice is about choosing the most cost-effective and scalable resource model.

API vs. API: How Amazon EC2 kicks VMware, RackSpace, and Microsoft

My day job is to try to choose and influence cloud technologies, so it’s no surprise to hear different vendors pitching why their cloud API is more open, standards-based, or performant.  They have convincing yet irrelevant arguments: the primary measure of a cloud API is the size of its ecosystem.

The API’s ecosystem is the number (and vitality) of the upstream partners, SaaS services, PaaS vendors, and ISVs that have built their business on top of that API.  The fundamental truth of this model, like all ad hoc IT standards, is that success is built on business traction, not on technical merit or endorsement by standards bodies.

So which Cloud API will be the winner?  We’re just rounding the first turn and Amazon is ahead.  Let’s look at the lead fillies:

  • Amazon EC2/S3 has the clear leadership.  Their API is widely copied (without clear license to do so!), includes storage, and their billing model is highly innovative.
  • Microsoft Azure is making a big push.  Windows continues to dominate as a platform and their SQL cloud helps address application porting.  In addition, their PaaS integration provides a forward migration path.
  • VMware vCloud has taken the high road through the official standards bodies.  VMware dominates the private cloud space and their vCenter API represents a larger ecosystem than any other virtualization API.  This ecosystem guarantees that vCloud will be widely adopted, but whether they can cross over into public clouds is fuzzier.
  • RackSpace has an interesting position by offering both dedicated and shared hosting.  Their service and API have been around for a long time; they have just not created the buzz that Amazon gets.  They could be a swing vote depending on their future decisions around Cloud APIs.

But maybe we don’t have to pick a single winner!  Perhaps there’s an option for a trifecta bet.  This scenario of building a multi-API abstraction layer is getting a lot of interest and creating a lot of value.  Vendors include RightScale, DeltaCloud (was RedHat, now Apache), and jclouds.

Right now, I’m sitting in the Delta Cloud session at RedHat Summit/JBoss World.  One of my concerns about API aggregation is that the API abstraction has to be either a least common denominator (LCD) or have strange exceptions.  For example, the speaker is saying that approaches to firewalls are very different across clouds or completely missing.  This creates a serious aggravation for aggregation: does the API leave a gap, favor one provider, or invent yet another way to solve the problem?

I believe the cloud API race is not just a single horse race for the Cloud Computing Cup; it’s more like the Triple Crown.  The real winning API will cover compute, network, and storage management.

Then again, accelerating PaaS adoption could make these IaaS Clouds into buggy whip manufacturers.

Disclosure:  My employer, Dell, is a partner with many of the companies listed above.