AWS Ops patterns set the standard: embrace that and accelerate

RackN creates infrastructure-agnostic automation so you can run physical and cloud infrastructure with the same elastic operational patterns.  If you want to make infrastructure unimportant, then your hybrid DevOps objective is simple:

Create multi-infrastructure Amazon equivalence for ops automation.

Ecosystem View of AWS

Even if you are not an AWS fan, they are the universal yardstick (15 minute & 40 minute presos). That goes for other clouds (public and private) and for physical infrastructure too. Their footprint is simply so pervasive that you cannot ignore “works on AWS” as a need even if you don’t need to work on AWS.  Like PCs in the late-80s, we can use vendor competition to create user choice of infrastructure. That requires a baseline for equivalence between the choices. In the 90s, the Windows monopoly provided those APIs.

Why should you care about hybrid DevOps? As we increase operational portability, we empower users to make economic choices that foster innovation.  That’s valuable even for AWS locked users.

We’re not talking about “give me a VM” here! The real operational need is to build accessible, interconnected systems – what is sometimes called “the underlay.” It’s more about networking, configuration and credentials than simple compute resources. We need consistent ways to automate systems that can talk to each other and static services, have access to dependency repositories (code, mirrors and container hubs) and can establish trust with other systems and administrators.

These “post” provisioning tasks are sophisticated and complex. They cannot be statically predetermined. They must be handled dynamically based on the actual resource being allocated. Without automation, this process becomes manual, glacial and impossible to maintain. Does that sound like traditional IT?

Side Note on Containers: For many developers, we are adding platforms like Docker, Kubernetes and CloudFoundry that do these integrations automatically for their part of the application stack. This is a tremendous benefit for their use-cases. Sadly, hiding the problem from one set of users does not eliminate it! The teams implementing and maintaining those platforms still have to deal with underlay complexity.

I am emphatically not looking for AWS API compatibility: we are talking about emulating their service implementation choices.  We have plenty of ways to abstract APIs. Ops is a post-API issue.

In fact, I believe that red herring leads us to a bad place where innovation is locked behind legacy APIs.  Steal APIs where it makes sense, but don’t blindly require them because it’s the layer under them where the real compatibility challenges lurk.

Side Note on OpenStack APIs (why they diverge): Trying to implement AWS APIs without duplicating all their behaviors is more frustrating than a fresh API without the implied AWS contracts.  This is exactly the problem with OpenStack variation.  The APIs work but there is not a behavior contract behind them.

For example, transitioning to IPv6 is difficult to deliver because Amazon still relies on IPv4. That gap makes it impossible to create hybrid automation that leverages IPv6 because it won’t work on AWS. In my world, we had to disable default use of IPv6 in Digital Rebar when we added AWS. Another example? Amazon’s regional AMI pattern, thankfully, is not replicated by Google; however, its absence means there’s no consistent image naming pattern.  In my experience, a bad pattern is generally better than inconsistent implementations.

As market dominance drives us to benchmark on Amazon, we are stuck with the good, bad and ugly aspects of their service.

For very pragmatic reasons, even AWS automation is highly fragmented. There are a large and shifting number of distinct system identifiers (AMIs, regions, flavors) plus a range of user-configured choices (security groups, keys, networks). Even within a single provider, these options make it impossible to maintain a generic automation process.  Since other providers logically model themselves on AWS, we will continue to expect AWS-like behaviors from them.  Variation from those norms adds effort.
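
To make that fragmentation concrete, here is a minimal Python sketch of the lookup a generic request needs before it can touch any provider. Every identifier in it (the request fields, image IDs, flavor and region names) is a hypothetical placeholder, not a real AMI, a real provider value, or a Digital Rebar API. The only thing it borrows from the text is the shape of the problem: AWS images vary by region, Google's do not, and one logical request fans out into provider-specific values that have to be resolved at allocation time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRequest:
    """Provider-neutral request: what the operator actually cares about."""
    os: str
    size: str
    location: str


# Hypothetical catalog. Real values differ per provider and per region and
# shift over time, which is why they cannot be statically predetermined.
CATALOG = {
    "aws": {
        # AWS-style: the image identifier depends on the region.
        "image": {("ubuntu", "us-west"): "ami-EXAMPLE-west",
                  ("ubuntu", "us-east"): "ami-EXAMPLE-east"},
        "flavor": {"small": "example.small", "large": "example.large"},
        "region": {"us-west": "example-west-1", "us-east": "example-east-1"},
    },
    "google": {
        # Google-style: one image name regardless of region.
        "image": {("ubuntu", "us-west"): "example-ubuntu-image",
                  ("ubuntu", "us-east"): "example-ubuntu-image"},
        "flavor": {"small": "example-standard-1", "large": "example-standard-4"},
        "region": {"us-west": "example-west1", "us-east": "example-east1"},
    },
}


def resolve(provider, req):
    """Translate a neutral request into one provider's identifiers."""
    try:
        entry = CATALOG[provider]
        return {
            "image": entry["image"][(req.os, req.location)],
            "flavor": entry["flavor"][req.size],
            "region": entry["region"][req.location],
        }
    except KeyError as missing:
        # A gap in any table means the "generic" request is not portable.
        raise LookupError(f"{provider} has no mapping for {missing}") from None


if __name__ == "__main__":
    request = NodeRequest(os="ubuntu", size="small", location="us-west")
    for provider in CATALOG:
        print(provider, resolve(provider, request))
```

Even this toy version needs three tables per provider, and it has not touched security groups, keys or networks yet.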

Failure to follow AWS without clear reason and alternative path is frustrating to users.

Do you agree?  Join us with Digital Rebar creating a real hybrid operations platform.

To avoid echo chamber, OpenStack must embrace competitive cloud ecosystem

Japanese Bullet Train View

I was in Japan before the Tokyo summit on a bullet train to Kyoto watching the mix of heavy industry and bucolic mountains pass by. That scene reflects an OpenStack duality: we want to be both a dominant platform delivering core cloud services and an open source values driven collective.

First, I fundamentally believe in the success of OpenStack as the open virtual infrastructure management platform.

I believe that we have solved the virtual compute/storage/network problem sufficiently to become the de facto open IaaS platform. While not perfect, the technologies are sufficient assuming we continue to improve ease of use and operational hardening. Pursuing that base capability is my primary motivation for DefCore work.

I don’t believe that the OpenStack community is, or should try to become, the authority on “all things cloud.”

In the presence of Amazon, VMware, Microsoft and Google, we cannot make that claim with any degree of self-respect. Even newcomers like DigitalOcean have an undeniable footprint and influence. Those vendor platforms drive cloud ecosystems and technologies which foster fast innovation because there is no friction to joining their ecosystems and they are large and stable enough to represent a target market. We’ve seen clear signs from Rackspace, HP and others that platform diversity improves cloud strength.

I continue to think we (OpenStack) spend too much time evaluating what is “in” or “out” of the project and too little time talking about what’s “on,” “under” and “with” the project like Kubernetes, Mesos, Docker, SDN, Hadoop and Ceph. That type of thinking creates distance between OpenStack efforts and the majority of the market.

What motivates the drive to an all open captive community? It’s the reasonable concern that critical parts of the infrastructure will become pay-to-play. For example, what if a non-OpenStack alternative to Heat Orchestration gained popularity with OpenStack implementers? Perhaps something that also ran on Amazon. That would create external pressure that would drive internal priorities. These “non-OpenStack” products would then have influence without having to contribute back to upstream.

Can we afford to have external entities driving internal priorities? Hell yes, that’s what customer adoption looks like.

OpenStack does not own the market sufficiently to create a cloud echo chamber. The next wave of cloud innovation (my money is on container platforms) will follow the path of least resistance and widest adoption. We need to embrace that these innovations will not all be inside our community so that we can welcome them as part of our ecosystem. The community needs to find peace with that.

Why cloud compute will be free

Today at Dell, I was presenting to our storage teams about cloud storage (aka the “storage banana”) and Dave “Data Gravity” McCrory reminded me that I had not yet posted my epiphany explaining “why cloud compute will be free.”  This realization derives from other topics that he and I have blogged but not stated so simply.

Overlooking the fact that compute is already free at Google and Amazon, you must understand that it’s a cloud eat cloud world out there where losing a customer places your cloud in jeopardy.  Speaking of Jeopardy…

Answer: Something sought by cloud hosts to make profits (and further the agenda of our AI overlords).

Question: What is lock-in?

Hopefully, it’s already obvious to you that clouds are all about data.  Cloud data takes three primary forms:

  1. Data in transformation (compute)
  2. Data in motion (network)
  3. Data at rest (storage)

These three forms combine to create cloud architecture applications (service oriented, externalized state).

The challenge is to find a compelling charge model that both:

  1. Makes it hard to leave your cloud AND
  2. Encourages customers to use your resources effectively (see #1 in Azure Top 20 post)

While compute demands are relatively elastic, storage demand is very consistent, predictable and constantly grows.  Data is easily measured and difficult to move.  In this way, data represents the perfect anchor for cloud customers (model rule #1).  A host with a growing data consumption footprint will have a long-term predictable revenue base.

However, storage consumption alone does not encourage model rule #2.  Since storage is the foundation for the cloud, hosts can fairly judge resource use by measuring data egress, ingress and sidegress (attrib @mccrory 2/20/11).  This means tracking not only data in and out of the cloud, but also data transacted between the provider’s own cloud services.  For example, Azure charges for both data at rest ($0.15/GB/mo) and data in motion ($0.01/10K).
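
A quick back-of-the-envelope sketch in Python shows how the two meters combine. The two rates are the ones quoted above; reading “10K” as 10,000 storage transactions is my assumption, and the function name and workloads are made up for illustration.

```python
STORAGE_RATE_PER_GB_MONTH = 0.15  # from the text: data at rest, $/GB/month
TRANSACTION_RATE = 0.01           # from the text: data in motion, $ per "10K"
TRANSACTION_BLOCK = 10_000        # ASSUMPTION: "10K" means 10,000 transactions


def monthly_storage_bill(stored_gb, transactions):
    """Return (at_rest, in_motion, total) in dollars for one month."""
    at_rest = stored_gb * STORAGE_RATE_PER_GB_MONTH
    in_motion = transactions / TRANSACTION_BLOCK * TRANSACTION_RATE
    return round(at_rest, 2), round(in_motion, 2), round(at_rest + in_motion, 2)


if __name__ == "__main__":
    # A large, quiet data set: the bill is almost all storage.
    print(monthly_storage_bill(stored_gb=100, transactions=1_000_000))
    # A tiny, busy data set: the bill is almost all transactions.
    print(monthly_storage_bill(stored_gb=1, transactions=50_000_000))
```

Under these assumptions the second workload stores one hundredth of the data and still costs about three times as much, which is the behavior-shaping side of the charge model (rule #2).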

Consequently, the financially healthiest providers are the ones with most customer data.

If hosting success is all about building a larger, persistent storage footprint then service providers will give away services that drive data at rest and/or in motion.  Giving away compute means eliminating the barrier for customers to set up web sites, develop applications, and build their business.  As these accounts grow, they will deposit data in the cloud’s data bank and ultimately deposit dollars in their piggy bank.

However, there is a no-free-lunch caveat:  free compute will not have a meaningful service level agreement (SLA).  The host will continue to charge customers who need their applications to operate consistently.  I expect that we’ll see free compute (or “spare compute” from the cloud providers’ perspective) highly used for early life-cycle (development, test, proof-of-concept) and background analytic applications.

The market is starting to wake up to the idea that cloud is not about IaaS – it’s about who has the data and the networks.

Oh, dem golden spindles!  Oh, dem golden spindles!

Microsoft Azure Cloud – Top 20 Lessons Learned about MS’s PaaS

Last week Dave McCrory (@McCrory) and I (@Zehicle) had the benefit of intensive Azure training at Microsoft HQ to support Dell’s Azure Stamp.

We’ve assembled a top 20 list of things to know about programming for Azure (and really any PaaS leaning cloud):

  1. If you want performance, optimize to reduce fees. Azure (and any cloud) is architected to penalize you if you use their resources poorly. The challenge is to fix this before your boss gets the tab for your unenlightened design decisions.
  2. Coding .NET on Azure is easy; architecting for Azure requires learning. Clouds put things in different places than you are used to and the rules are different. Expect a learning curve.
  3. Partitioning = parallelism. Learn to love partitions in all their forms, because your app will be throttled if you throw everything into a single partition! On the upside, each partition operates in parallel and even better, they usually don’t cost extra (SQL is the exception).
  4. Roles are flexible. You can run web servers (Apache, etc) on a worker and worker tasks on a web role. This is a good way to save some change since you pay per role instance. It’s counter to separation of concerns, but financially you should also combine workers into a single role where possible.
  5. Understand walking deployments. You can (and should) have simultaneous versions of the code operating against the same data so that you can roll upgrades (à la Timothy Fitz/Eric Ries) to reduce risk without reducing performance. You should expect your data schema to simultaneously span multiple code versions.
  6. Learn about Update Domains (UDs). Deployment domains allow rolling upgrades and changes to Applications and Services. They are part of how you partition your overall application. If you’re planning a VIP swap deployment, then you won’t care.
  7. Each service = ONE external IP. You can have many VMs backing each service (and multiple roles in a service) and Azure will load balance between them so you can scale out each service. Think of each service as a clonable entity: there will be at least 1 and more can be added if you want to scale.
  8. Understand the difference between VIP and DIP. VIPs stand for Virtual IPs and are external, public, and metered. DIPs are internal, private, and load balanced. Azure provides an API to discover your DIPs – do not assume you know them because they are DYNAMIC IPs. Azure won’t let you see other DIPs inside the system.
  9. Azure has rich diagnostics, but beware. Azure leverages the existing diagnostics built into their system, but has to get the data off box since instances are volatile. This means that problems can be hard to isolate while excessive logging can impact performance and generate fees. Microsoft lets you target individual systems for elevated levels. You can also Terminal Server to a VM for troubleshooting (with caution).
  10. The new Azure admin console rocks. Take your pick between Silverlight or MMC Snap-in.
  11. Everything goes into Azure Storage. Learn to love it. Queues -> storage. Tables -> storage. Blobs -> storage. Logging -> storage. Code Repo -> storage. vDisk -> storage. SQL -> SQL (they march to their own drummer).
  12. Queues are essential, but tricky. Learn the meaning of idempotent because using queues requires you to handle failures and timeouts. The scary part is that it will work nicely until you exceed some limits and then you’ll experience cascading failure. Whee! Oh yeah, and queues require polling (which stinks as a notification model).
  13. SQL Azure is mostly just like MS SQL. Microsoft did a smart thing in keeping Cloud SQL highly compatible with Local SQL. The biggest caveat is that each partition is limited in size. If you embrace the size limits you will get better performance. So stop pushing BLOBs into databases and start sharding.
  14. Duplicating data in tables will improve performance. This has to do with how partitions and keys operate but is an interesting architecture for NoSQL – stage data for use. Don’t be afraid to stage the same data in multiple ways. It may be faster/cheaper to write data twice if it becomes easier to find when you search it 1000s of times.
  15. Table data can be “warmed up.” Storage has logic that makes frequently accessed items faster (sort of like a cache). If you can anticipate load spikes then you should warm the data just before the spike.
  16. Storage billing is both amount and transactions. You can get burned on a small, but busy set of data. Note: you will pay even if you 404 a request.
  17. Azure has a CDN. Leveraging Microsoft’s Content Delivery Network (CDN) will improve performance for your users with small, low latency, high request items. You need to change your URLs for those assets. Best practice is to use some versioning in the URI so that you can force changes. Remember, CDN is SLOWER for the first hit when the data is not in cache so avoid CDN for low volume assets!
  18. Provisioning time is not instant. Azure needs anywhere from 1-3 minutes to spin a new instance of a role. Build this lag into your architecture and dynamic scale plans. New databases and partitions are fast.
  19. The VM Role is maintained by YOU. Using the VM role is a handy shortcut, but has a long list of gotchas. Some of note: 1) the VM can be “reset” to the last VM image state that you uploaded, 2) you are responsible for VM OS upgrades and patches, 3) VMs must be clonable because they will operate in parallel.
  20. Azure supports more than .NET. You can set up anything in a worker (and now VM) role, but there are nuances to doing this effectively. You really need to understand how Azure works and had better be ready to crack open Visual Studio for some things even if you’re writing in Java.
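
Item 12 deserves a concrete picture, so here is a minimal, hedged Python sketch of an idempotent consumer. It does not use the Azure SDK: the deliveries are a plain list standing in for a queue that can hand you the same message twice after a failure or timeout, and the `Ledger` class and message IDs are hypothetical names invented for the example.

```python
class Ledger:
    """Applies each message at most once, keyed by message ID."""

    def __init__(self):
        self.balance = 0
        # In a real system this record would have to live in durable storage;
        # an in-memory set is only good enough for a sketch.
        self._seen = set()

    def apply(self, message_id, amount):
        """Return True if the message changed state, False if it was a repeat."""
        if message_id in self._seen:
            return False  # duplicate delivery: safe to ignore
        self.balance += amount
        self._seen.add(message_id)
        return True


def drain(deliveries, ledger):
    """Process every delivery and count how many actually changed state."""
    return sum(1 for message_id, amount in deliveries
               if ledger.apply(message_id, amount))


if __name__ == "__main__":
    # "m1" shows up twice, as it would if the first attempt timed out
    # before the worker could delete the message.
    deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
    ledger = Ledger()
    print(drain(deliveries, ledger), ledger.balance)
```

Without the `_seen` check the repeated message would be counted twice, which is exactly the kind of bug that stays hidden until load pushes you into timeouts.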

We hope this list helps you navigate Azure deployments. No matter what cloud you use, understanding Azure’s architecture will help you write better cloud scale applications.

We’d love to hear your suggestions and recommendations!

Mirrored on both blogs: Rob Hirschfeld’s Blog & Dave McCrory’s Blog

Seattle Cloud Camp, Dec 2010

While I was in Seattle for Azure training preparing for Dell’s Azure Appliance, Dave @McCrory suggested that we also attend the Seattle Cloud Camp (SCC Tweets).  This event was very well attended (200 people!).  With heavy attendance by Amazon (at their HQ), Microsoft (in the ‘hood), and Google, there was a substantial cloud vendor presence (>25% from those vendors alone).  Notable omission: VMware.

My reflections on the event, by segment.

Opening Sessions:

  • Most of the opening sessions were too light for the audience.  I thought we were past the “what is cloud” level, sigh.
  • Of note, the Amazon security presentation by Steve Riley was fun and entertaining.
  • Picking on a Dell competitor specifically: calling your cloud solution “WAS” is a branding #fail (not that DCSWA is much better).

Unpanel of self-appointed cloud extroverts, er, experts:

  • The unpanel covered some decent topics (@adronbh captured them on Twitter); unfortunately, none of the answers really stood out to me.  Except for NoSQL.
  • The unpanel discussion about NoSQL drew 2 answers.  1) It’s not NoSQL, it’s eventually consistent instead of strictly consistent.  (note: I’ve been calling it “Storage++”) 2) We’ll see more and more choices in this area as we tune the models for utility then we’ll see some consolidation.  The suggestion was that NoSQL would follow the same explosion/contraction pattern of SQL databases.

Session on Cloud APIs (my suggested topic)

  • The Cloud API topic was well attended (30+).  The overwhelming majority of the attendees were using Amazon.
  • There was some interest in having “standard” APIs for cloud functions, but the idea was not well received because it was felt to stifle innovation.  We are still too early.
  • It was postulated but not generally agreed that cloud aggregation (DeltaCloud, RightScale, etc) is workable.  This was considered a reason to not require standard clouds.
  • CloudCamp sponsor, Skytap, has their own API.  These APIs are value added and provide extra abstraction levels.
  • It was said that there are a LOT (50 now, 500 soon) of smaller hosts that want to enter the cloud space.  These hosts will need an API – some are inventing their own.
  • I brought up the concept discussed at OpenStack that the logical abstraction for cloud network APIs is a “vlan.”  This created confusion because some thought that I meant actual 802.1q tags.  NO!  I just meant that it was the ABSTRACTION of a VLAN connecting VMs together.
  • There was agreement from the clouderati in the room that cloud networking was f’ed up, but most people were not ready to discuss.
  • Cloud APIs have some basics that are working (semantics around VMs) but still have lots of holes.  Notably: networking, application, services, and identity.

Session on Google App Engine (GAE)

  • GAE has got a lot going on, especially in the social/mobile space.
  • Do not think a lack of news about GAE means that they are going slow; it’s just the opposite.  It looks like they are totally kicking ass with a very focused strategy.  I suspect that they are just waiting for the market to catch up.
  • GAE understands what a “platform” really is.  They talk about their platform as the SERVICES that they are offering.  The code is just code.  The services are impressive and include identity, mail, analysis, SQL (business only), map (as in Map-Reduce), prediction (yes, prediction!), storage, etc.  The total list was nearly 20 distinct services.
  • GAE compared themselves to Azure, not Amazon.

Getting cozy with “Adjacent Services”

I’ve had a busy week with Azure Training and Cloud Camp Seattle.  It’s going to take a few days to unwind specific posts about both, but I wanted to hit some shiny new thoughts.

Services helping each other

  • Adjacent Services are dedicated and/or public services (XaaS) that are offered alongside generic public cloud offerings.   For a company like Dell (my employer), this could be specific brands of storage or databases (e.g. Oracle).  I believe these are much higher margin XaaS than IaaS.
  • Layer 7 Load Balancers represent a more intelligent link between load direction and the applications. I heard people using this term in multiple contexts.   For example, in Azure, apps can set themselves as “offline” and they will stop getting traffic; then they can turn themselves back online when they are ready for more.
  • Cloud Rollout/Migration is a rolling upgrade scheme where you can send traffic to 2 versions of your application at the same time!  You upgrade by zones and if you have >2 zones then you’ll have two active versions at the same time.  Your data models need to accommodate this.
  • We don’t have enough Agile Cloud programming books (like Dave Thomas’ RoR Intro).  We need a cloud programming book that STARTS WITH INTEGRATION TESTS and shows how to use all the adjacent services.  I may just have to write one (or three).
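
The zone-by-zone rollout above is easy to reason about with a tiny simulation. This is a sketch under two assumptions of mine that the training did not spell out: zones are upgraded one at a time, and a zone serves no traffic while it is being upgraded. The zone and version names are placeholders.

```python
def active_versions_per_step(zones, old="v1", new="v2"):
    """For each upgrade step, list the versions still serving traffic.

    ASSUMPTION: the zone being upgraded is out of rotation for that step,
    and zones are upgraded strictly one after another.
    """
    version = {zone: old for zone in zones}
    steps = []
    for upgrading in zones:
        serving = {version[zone] for zone in zones if zone != upgrading}
        steps.append((upgrading, sorted(serving)))
        version[upgrading] = new  # the zone rejoins on the new version
    return steps


if __name__ == "__main__":
    for zones in (["a", "b"], ["a", "b", "c"]):
        print(zones)
        for upgrading, serving in active_versions_per_step(zones):
            print(f"  upgrading {upgrading}: serving {serving}")
```

With two zones, only one version is ever serving at a time; with three, the middle step has the old and new versions live together, which is why the data model has to tolerate both.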

Thanks to the many people at Microsoft for the great Azure training sessions.  I’ll add more names, but for now I have links to Steve Marx (Smarx.com) & Sriram Krishnan (SriramKrishnan.com).

VM != Cloud! Comparison draws ire, misses point

Having the requirement, er, benefit of working with both Dave McCrory and Joyent on a daily basis at Dell, I cannot resist weighing in on the blog pong between them.

Dave’s post comparing VM pricing prompted Joyent to blog that VMs are not the only measure of cloud.

While I completely agree that clouds are not all about VMs, I think that Joyent is too limited in their definition of cloud in their reply.  We’re seeing an emergence of services as the differentiator between clouds.

Looking at Amazon, Azure, and Google, the clear way to reduce cloud spend is to migrate applications to consume their services (SQL, Storage, Bus, etc).

If cloud users are primarily concerned about price per hour (which I’m not convinced is the case) then they have real motivation to migrate from purely VM (or SmartMachine(tm)) based applications to ones that use services.