Microsoft Azure Cloud – Top 20 Lessons Learned about MS’s PaaS

Last week Dave McCrory (@McCrory) and I (@Zehicle) had the benefit of intensive Azure training at Microsoft HQ to support Dell’s Azure Stamp.

We’ve assembled a top 20 list of things to know about programming for Azure (and really any PaaS leaning cloud):

  1. If you want performance, optimize to reduce fees. Azure (and any cloud) is architected to penalize you if you use its resources poorly. The challenge is to fix this before your boss gets the tab for your unenlightened design decisions.
  2. Coding .NET on Azure is easy; architecting for Azure requires learning. Clouds put things in different places than you are used to and the rules are different. Expect a learning curve.
  3. Partitioning = parallelism. Learn to love partitions in all their forms, because your app will be throttled if you throw everything into a single partition! On the upside, each partition operates in parallel and even better, they usually don’t cost extra (SQL is the exception).
  4. Roles are flexible. You can run web servers (Apache, etc) on a worker and worker tasks on a web role. This is a good way to save some change since you pay per role instance. It’s counter to separation of concerns, but financially you should also combine workers into a single role where possible.
  5. Understand walking deployments. You can (and should) have simultaneous versions of the code operating against the same data so that you can roll upgrades (à la Timothy Fitz/Eric Ries), reducing risk without reducing performance. You should expect your data schema to simultaneously span multiple code versions.
  6. Learn about Update Domains (UDs). Update domains allow rolling upgrades and changes to applications and services. They are part of how you partition your overall application. If you’re planning a VIP swap deployment, then you won’t care.
  7. Each service = ONE external IP. You can have many VMs backing each service (and multiple roles in a service) and Azure will load balance between them so you can scale out each service. Think of each service as a clonable entity: there will be at least 1 and more can be added if you want to scale.
  8. Understand the difference between VIP and DIP. VIPs are Virtual IPs: external, public, and metered. DIPs are internal, private, and load balanced. Azure provides an API to discover your DIPs – do not assume you know them because they are DYNAMIC IPs. Azure won’t let you see other DIPs inside the system.
  9. Azure has rich diagnostics, but beware. Azure leverages the existing diagnostics built into their system, but has to get the data off box since instances are volatile. This means that problems can be hard to isolate, while excessive logging can impact performance and generate fees. Microsoft lets you target individual systems for elevated logging levels. You can also Terminal Server to a VM for troubleshooting (with caution).
  10. The new Azure admin console rocks. Take your pick between Silverlight or MMC Snap-in.
  11. Everything goes into Azure Storage. Learn to love it. Queues -> storage. Tables -> storage. Blobs -> storage. Logging -> storage. Code Repo -> storage. vDisk -> storage. SQL -> SQL (they march to their own drummer).
  12. Queues are essential, but tricky. Learn the meaning of idempotent because using queues requires you to handle failures and timeouts. The scary part is that it will work nicely until you exceed some limits and then you’ll experience cascading failure. Whee! Oh yea, and queues require polling (which stinks as a notification model).
  13. SQL Azure is mostly just like MS SQL. Microsoft did a smart thing in keeping Cloud SQL highly compatible with Local SQL. The biggest difference is that partitions are limited in size. If you embrace the size limits you will get better performance. So stop pushing BLOBs into databases and start sharding.
  14. Duplicating data in tables will improve performance. This has to do with how partitions and keys operate but is an interesting architecture for NoSQL – stage data for use. Don’t be afraid to stage the same data in multiple ways. It may be faster/cheaper to write data twice if it becomes easier to find when you search it 1000s of times.
  15. Table data can be “warmed up.” Storage has logic that makes frequently accessed items faster (sort of like a cache ;). If you can anticipate load spikes then you should warm the data just before the spike.
  16. Storage billing is both amount and transactions. You can get burned on a small, but busy set of data. Note: you will pay even if you 404 a request.
  17. Azure has a CDN. Leveraging Microsoft’s Content Delivery Network (CDN) will improve performance for your users with small, low latency, high request items. You need to change your URLs for those assets. Best practice is to use some versioning in the URI so that you can force changes. Remember, CDN is SLOWER for the first hit when the data is not in cache so avoid CDN for low volume assets!
  18. Provisioning time is not instant. Azure needs anywhere from 1-3 minutes to spin a new instance of a role. Build this lag into your architecture and dynamic scale plans. New databases and partitions are fast.
  19. The VM Role is maintained by YOU. Using the VM role is a handy shortcut, but has a long list of gotchas. Some of note: 1) the VM can be “reset” to the last VM image state that you uploaded, 2) you are responsible for VM OS upgrades and patches, 3) VMs must be clonable because they will operate in parallel.
  20. Azure supports more than .NET. You can set up anything in a worker (and now VM) role, but there are nuances to doing this effectively. You really need to understand how Azure works and had better be ready to crack open Visual Studio for some things even if you’re writing in Java.
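
To make item 12 a little more concrete, here is a minimal sketch of an idempotent queue worker in Python. It does not use Azure's queue API: ToyQueue, Message, drain, and the 30-second timeout are all made-up stand-ins, and the only behavior assumed is that a message read from the queue is hidden for a while and reappears if nobody deletes it.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    msg_id: str               # stable ID assigned by the producer (assumption)
    body: str
    visible_at: float = 0.0   # hidden from readers until this time


@dataclass
class ToyQueue:
    """In-memory stand-in for a cloud queue with a visibility timeout."""
    messages: list = field(default_factory=list)

    def put(self, msg):
        self.messages.append(msg)

    def get(self, now, timeout):
        # Polling read: return the first visible message and hide it
        # for `timeout` seconds. Returns None when nothing is visible.
        for msg in self.messages:
            if msg.visible_at <= now:
                msg.visible_at = now + timeout
                return msg
        return None

    def delete(self, msg):
        self.messages.remove(msg)


def drain(queue, handler, processed_ids, now=0.0, timeout=30.0):
    """Process every visible message; safe to re-run after a crash."""
    while (msg := queue.get(now, timeout)) is not None:
        if msg.msg_id not in processed_ids:   # idempotency check
            handler(msg.body)
            processed_ids.add(msg.msg_id)
        queue.delete(msg)                     # delete only after success
```

If a worker handles a message but dies before deleting it, the message becomes visible again after the timeout. Because drain checks processed_ids first, the second delivery is deleted without running the handler twice. In a real deployment that set of IDs would have to live somewhere durable, which is left out of this sketch.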

We hope this list helps you navigate Azure deployments. No matter what cloud you use, understanding Azure’s architecture will help you write better cloud scale applications.
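
Items 3 and 14 in the list above (partitions and duplicated data) can be sketched in a few lines of Python. This is a toy in-memory table, not Azure Table Storage; ToyTable, save_order, and the "customer:" / "day:" key prefixes are hypothetical names chosen for illustration.

```python
class ToyTable:
    """In-memory stand-in for a table addressed by (partition key, row key)."""

    def __init__(self):
        self.partitions = {}

    def insert(self, partition_key, row_key, entity):
        self.partitions.setdefault(partition_key, {})[row_key] = entity

    def query_partition(self, partition_key):
        # Touches exactly one partition: the cheap case for a partitioned store.
        return list(self.partitions.get(partition_key, {}).values())


def save_order(table, order):
    # Write the same entity twice, staged for the two lookups we expect.
    table.insert("customer:" + order["customer"], order["id"], order)
    table.insert("day:" + order["day"], order["id"], order)
```

Each write costs two inserts, but "all orders for a customer" and "all orders for a day" each become a single-partition read instead of a scan across everything.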

We’d love to hear your suggestions and recommendations!

Mirrored on both blogs: Rob Hirschfeld’s Blog & Dave McCrory’s Blog

Jevons Paradox

I’ve been finding it necessary to quote Jevons paradox several times lately and realized that I have NOT referenced it here.  Quite simply, understanding Jevons paradox is essential to understanding cloud.

The concept of the paradox is that when we use a resource more efficiently (for example, gas in cars), the demand for that resource goes up (we move further into the exurbs because driving is cheaper).  Notably, as Moore’s law drives computer efficiency up, we are using more and more computers.  Specifically, I have more computers in my house every year even though the efficiency of just one smart phone far (far) exceeds the power of my son’s Sinclair 1000.
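
The arithmetic behind the paradox can be sketched with a toy model. The model is my assumption, not part of the original argument: price per unit of service is proportional to the resource it consumes (1 / efficiency), and demand follows a constant-elasticity curve. The function name and the base demand of 100 are illustrative only.

```python
def resource_use(efficiency, elasticity, base_demand=100.0):
    """Total resource consumed under a constant-elasticity demand curve.

    Assumes price per unit of service = 1 / efficiency and
    demand = base_demand * price ** -elasticity.
    """
    price = 1.0 / efficiency
    demand = base_demand * price ** -elasticity
    return demand / efficiency   # units of service * resource per unit
```

Under these assumptions, doubling efficiency multiplies total resource use by 2 ** (elasticity - 1). With an elasticity of 1.5, use rises by about 41%; with an elasticity of 0.5, it falls. The paradox only bites when demand is elastic enough, which is the bet being made here about cloud.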

In cloud computing, Jevons paradox points us to the expectation that the rush of applications and activity in the cloud will continue to accelerate.  Since I expect competition and Moore’s law to drive increasing gains in cloud efficiency (and therefore customer advantageous price signals), the market will happily convert these utilization improvements into more and more interesting capabilities.

The cloud expansion means that we can sustain more providers entering the market.  In fact, Jevons would tell us that more providers will likely INCREASE demand for cloud as competition and capacity put downward pressure on prices.  [Q: where will they make up the margin?   A: Adjacent Services]

The losers in cloud’s exploration of Jevons paradox are non-cloud deployments (Dell strategists, are you listening?).  These systems suffer because their ability to improve their efficiency is limited.

As I look down the road on cloud, I can see many opportunities for current applications to take advantage of cheaper cloud resources to provide even more value.  For example, adding map-reduce analytics to scan a customer’s data can provide tremendous insights.  Today, it’s a luxury like flying on the Concorde.  Tomorrow, it will be like hopping the Nerd Bird from San Jose to Austin – just a normal part of our daily lives.

 

Note: A shout out to Dave McCrory who introduced me to Jevons paradox.

Seattle Cloud Camp, Dec 2010

While I was in Seattle for Azure training preparing for Dell’s Azure Appliance, Dave @McCrory suggested that we also attend the Seattle Cloud Camp (SCC Tweets).  This event was very well attended (200 people!).  With heavy attendance by Amazon (at their HQ), Microsoft (in the ‘hood), and Google, there was a substantial cloud vendor presence (>25% from those vendors alone).  Notable omission: VMware.

My reflections on the event, by segment.

Opening Sessions:

  • Most of the opening sessions were too light for the audience.  I thought we were past the “what is cloud” level, sigh.
  • Of note, the Amazon security presentation by Steve Riley was fun and entertaining.
  • Picking on a Dell competitor specifically: calling your cloud solution “WAS” is a branding #fail (not that DCSWA is much better).

Unpanel of self-appointed cloud extroverts (er, experts):

  • The unpanel covered some decent topics (@adronbh captured them on twitter); unfortunately, none of the answers really stood out to me, except for NoSQL.
  • The unpanel discussion about NoSQL drew 2 answers.  1) It’s not NoSQL, it’s eventually consistent instead of strictly consistent.  (note: I’ve been calling it “Storage++”) 2) We’ll see more and more choices in this area as we tune the models for utility then we’ll see some consolidation.  The suggestion was that NoSQL would follow the same explosion/contraction pattern of SQL databases.

Session on Cloud APIs (my suggested topic)

  • The Cloud API topic was well attended (30+).  The overwhelming majority of the attendees were using Amazon.
  • The idea of “standard” APIs for cloud functions drew some interest but was not well received because it was felt to stifle innovation.  We are still too early.
  • It was postulated but not generally agreed that cloud aggregation (DeltaCloud, RightScale, etc) is workable.  This was considered a reason to not require standard clouds.
  • CloudCamp sponsor, Skytap, has their own API.  These APIs are value added and provide extra abstraction levels.
  • It was said that there are a LOT (50 now, 500 soon) smaller hosts that want to enter the cloud space.  These hosts will need an API – some are inventing their own.
  • I brought up the concept discussed at OpenStack that the logical abstraction for cloud network APIs is a “vlan.”  This created confusion because some thought that I meant actual 802.1q tags.  NO!  I just meant that it was the ABSTRACTION of a VLAN connecting VMs together.
  • There was agreement from the clouderati in the room that cloud networking was f’ed up, but most people were not ready to discuss.
  • Cloud APIs have some basics that are working (semantics around VMs) but still have lots of holes.  Notably: networking, application, services, and identity.

Session on Google App Engine (GAE)

  • GAE has got a lot going on, especially in the social/mobile space.
  • Do not think a lack of news about GAE means that they are going slow; it’s just the opposite.  It looks like they are totally kicking ass with a very focused strategy.  I suspect that they are just waiting for the market to catch up.
  • GAE understands what a “platform” really is.  They talk about their platform as the SERVICES that they are offering.  The code is just code.  The services are impressive and include identity, mail, analysis, SQL (business only), map (as in Map-Reduce), prediction (yes, prediction!), storage, etc.  The total list was nearly 20 distinct services.
  • GAE compared themselves to Azure, not Amazon.

Getting cozy with “Adjacent Services”

I’ve had a busy week with Azure Training and Cloud Camp Seattle.  It’s going to take a few days to unwind specific posts about both, but I wanted to hit some shiny new thoughts.

Services helping each other

  • Adjacent Services are dedicated and/or public services (XaaS) that are offered alongside generic public cloud offerings.   For a company like Dell (my employer), this could be specific brands of storage or databases (e.g. Oracle).  I believe these are much higher margin XaaS than IaaS.
  • Layer 7 Load Balancers represent a more intelligent link between load direction and the applications.  I heard people using this term in multiple contexts.  For example, in Azure, the apps can set themselves as “offline” and they will stop getting traffic; then they can turn themselves online when they are ready for more.
  • Cloud Rollout/Migration is a rolling upgrade scheme where you can send traffic to 2 versions of your application at the same time!  You upgrade by zones and if you have >2 zones then you’ll have two active versions at the same time.  Your data models need to accommodate this.
  • We don’t have enough Agile Cloud programming books (like Dave Thomas’ RoR Intro).  We need a cloud programming book that STARTS WITH INTEGRATION TESTS and shows how to use all the adjacent services.  I may just have to write one (or three).
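
The rollout bullet above says the data model has to accommodate two live code versions. A minimal sketch of what that can look like, assuming a hypothetical customer record where v1 stores a single "name" field and v2 splits it into "first_name" and "last_name" (the field names, schema_version marker, and function names are all invented for illustration):

```python
def read_customer(record):
    """Read a customer record written by either code version."""
    version = record.get("schema_version", 1)   # v1 records carry no marker
    if version == 1:
        first, _, last = record["name"].partition(" ")
    else:
        first, last = record["first_name"], record["last_name"]
    return {"first_name": first, "last_name": last}


def write_customer_v2(first, last):
    # New code keeps the old "name" field so v1 readers in zones that
    # have not been upgraded yet keep working.
    return {
        "schema_version": 2,
        "first_name": first,
        "last_name": last,
        "name": f"{first} {last}",
    }
```

The old field can be dropped only after every zone runs the new version, so the cleanup is a separate, later rollout.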

Thanks to the many people at Microsoft for the great Azure training sessions.  I’ll add more names, but for now I have links to Steve Marx (Smarx.com) & Sriram Krishnan (SriramKrishnan.com).

OpenStack videos peek into cloud shakers

Barton George (Dell’s cloud evangelist and cloud shouter) has posted videos from the OpenStack conference last week:

McCrory lays out VMware vision

Props are due to Dave McCrory for his fine investigative work reading the VMware cloudy tea leaves.  Over the weekend, he posted a series of articles about VMware’s Open PaaS and VMforce offerings.  This is a significant write-up based on information gleaned from their public code check-ins that he validated with them after the fact.

I have not had time to digest it yet – check back later for actual commentary.

OpenStack Day 2 Aspiration: Dreaming & Breathing

Between partnering meetings, I bounced through biz and tech sessions during Day 2 of the OpenStack conference (day 1 notes).   After my impression summary, I’m including some succinct impressions, pictures, and copies of presentations by my Dell team-mates Greg Althaus & Brent Douglas.

Clouds on the road to Bexar
My overwhelming impression is a healthy tension between aspirational* and practical discussions.  The community appetite for big, broad, and bodacious features is understandably high: cloud seems on track as a solution for IT problems, but there is still an impedance mismatch between current apps and cloud capabilities.
As service providers ASPire to address these issues, some OpenStack blue print discussions tended to digress towards more forward-looking or long-term designs.  However, watching the crowd, there was also a quietly heads down and pragmatic audience ready to act and implement.  For this action focused group, delivering a working cloud was the top priority.  The Rackers and Nebulizers have product to deploy and will not be distracted from the immediate concerns of living, breathing shippable code.
I find the tension between dreaming aspiration (cloud futures) and breathing aspiration (cloud delivery) necessary to the vitality of OpenStack.
[Day 3 update, these coders are holding the floor.  People who are coding have moved into the front seats of the fishbowl and the process is working very nicely.]
Specific Comments (sorry, not linking everything):
  • Cloud networking is a mess and there is substantial opportunity for innovation here.  Nicira was making an impression talking about how Open vSwitch and OpenFlow could address this at the edge switches.  Interesting, but messy.
  • I was happy with our (Dell’s) presentations: real clouds today (Bexas111010DataCenterChanges) and what to deploy on (Bexar111010OpenStackOnDCS).
  • SheepDog was presented as a way to handle block storage.  Not an iSCSI solution, works directly w/ KVM.  Strikes me as too limiting – I’d rather see just using iSCSI.  We talked about GlusterFS or Ceph (NewDream).  This area needs a lot of work to catch up with Amazon EBS.  Unfortunately, persisting data on VM “local” disks is still the dominant paradigm.
  • Discussions about how to scale drifted towards aspirational.
  • Scalr did a side presentation about automating failover.
  • Discussion about migration from Eucalyptus to OpenStack got side tracked with aspirations for a “hot” migration.  Ultimately, the differences between the networks were a problem.  The practical issue is discovering the meta data – host info is not entirely available from the API.
  • Talked about an API for cloud networking.  This blue print was heavily attended and messy.  The possible network topologies present too many challenges to describe easily.  Fundamentally, there seems to be consensus that the API should have a very very simple concept of connecting VM end points to a logical segment.  That approach leverages the accepted (but outdated) VLAN semantic, but implementation will have to be topology aware.  Ouch!
  • Day 3 topic, live migration: Big crowd arguing with bated breath about this.  The summary: “show us how to do it without shared storage THEN we’ll talk about the API.”
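
The "very very simple concept" from the networking bullet can be sketched as a data structure. This is not an OpenStack API; LogicalSegment and its methods are hypothetical names, and the sketch deliberately says nothing about 802.1q tags or physical topology.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalSegment:
    """A named segment that VM end points attach to."""
    name: str
    endpoints: set = field(default_factory=set)

    def attach(self, vm_endpoint):
        self.endpoints.add(vm_endpoint)

    def detach(self, vm_endpoint):
        self.endpoints.discard(vm_endpoint)

    def can_reach(self, a, b):
        # Two end points can talk only when both are attached to this segment.
        return a in self.endpoints and b in self.endpoints
```

Everything hard (mapping a segment onto real switches and hosts) stays behind this interface, which is exactly the topology-aware part the session could not agree on.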
Executive Tweet:  #OpenStack getting down to business.  Big dreams.  Real problems.  Delivering Code.
 
Note: I nominate Aspirational for 2010 buzzword of the year.

[Photos: Greg presenting; big crowd on Day 1]

OpenStack Bexar Design Summit Day 1

Yesterday, Dell sent me to be part of our OpenStack vanguard for the design summit.  The conference is fascinating and productive for the content of the sessions and even more interesting for the hallway meetings.

It’s obvious looking at the board composition that RackSpace and NASA Nova are driving most of the development; however, there is palpable community interest and enthusiasm.  Participants and contributors showed up in force at this event.

RackSpace and NASA leadership provides critical momentum for the community.  Code is the smallest part of their contribution; their commitment to run the code at scale in production is the magic rocket fuel powering OpenStack.  I’ve had many conversations with partners and prospects planning to follow RackSpace into production with a 3-6 month lag.

Beyond that primary conference arc, my impressions:

  • Core vendors like Citrix, Dell, Canonical are signing up to do primary work for the code base.  They are taking ownership for their own components in the stack.
  • Universally, people comment about the speed of progress and amount of code being generated.  Did I mention that there is a lot of code being written?
  • Networking is still a major challenge.  OpenStack (with Citrix’s Xen support) is driving Open vSwitch as a replacement for iptables management.
  • IPv6 gets lackadaisical treatment in the US, but is urgent in Japan/Asia where their core infrastructure is ALREADY IPv6.  Their frustration in getting attention here should be a canary in the cloud mine (but is not).  They proposed a gateway model where VMs have dual addresses: IPv4 gets NATed while IPv6 is a pass-through.  Seems to me that going IPv6 internally is the real solution.
  • Cloud bursting is still too fuzzy a thing to talk about in a big group.  The session about it covered so many use-cases that we did not accomplish anything.  Some people wanted to talk about cloud API proxy while others (myself included) wanted to talk about managing apps between clouds.  My $0.02 is that vendors like RightScale solve the API proxy issue so it’s the networking issues that need focus.  We need to get back to the use-cases!

Executive Tweet: #openstack: Partners & Code = great progress.  Networking = needs more love

Other notes:

CAP Chasm: why clouds say “no SANank you” to SANs

My personal bias against SANs in cloud architectures is well documented; however, I am in the minority at my employer (Dell) and few enterprise IT shops share my view.  In his recent post about CAP theorem, Dave McCrory has persuaded me to look beyond their failure to bask in my flawless reasoning.  Apparently, this crazy CAP thing explains why some people love SANs (enterprise) and others don’t (clouds).

The deal with CAP is that you can only have two of Consistency, Availability, or Partitioning Tolerance.  Since everyone wants Availability, the choice is really between Consistency or Partitioning.  Seeking Availability you’ve got two approaches:

  1. Legacy applications tried to eliminate faults to achieve Consistency with physically redundant scale up designs. 
  2. Cloud applications assume faults to achieve Partitioning Tolerance with logically redundant scale out design.

According to CAP, legacy and cloud approaches are so fundamentally different that they create a “CAP Chasm” in which the very infrastructure fabric needed to deploy these applications is different.

As a cloud geek, I consider the inherent cost and scale limitations of a CA approach much too limiting.   My firsthand experience is that our customers and partners share my view: they have embraced AP patterns.  These patterns make more efficient use of resources, dictate simpler infrastructure layout, scale like hormone-crazed rabbits at a carrot farm, and can be deployed on less expensive commodity hardware.

As a CAP theorem enlightened IT professional, I can finally accept that there are other intellectually valid infrastructure models. 

See Mom?  I can play nicely with others after all.

VM != Cloud! Comparison draws ire, misses point

Having the requirement (er, benefit) of working with both Dave McCrory and Joyent on a daily basis at Dell, I cannot resist weighing in on the blog pong between them.

Dave’s post comparing VM pricing prompted Joyent to blog that VMs are not the only measure of cloud.

While I completely agree that clouds are not all about VMs, I think that Joyent is too limited in their definition of cloud in their reply.  We’re seeing an emergence of services as the differentiator between clouds.

Looking at Amazon, Azure, and Google, the clear way to reduce cloud spend is to migrate applications to consume their services (SQL, Storage, Bus, etc).

If cloud users are primarily concerned about price per hour (which I’m not convinced is the case) then they have real motivation to migrate from purely VM (or SmartMachine(tm)) based applications to ones that use services.