Substituting Action for Knowledge – adopting “ready, fire, aim” as a strategy (and when to run like hell)

Today my mother-in-law (a practicing psychiatrist) was bemoaning the current medical practice of substituting action for knowledge. In her world, many doctors make rapid changes to their patients’ therapy. Their goal is to address the issue immediately presented (the patient feels sad, so the doctor prescribes antidepressants) rather than taking time to understand the patient’s history or make changes incrementally and measure their impact. It feels like another example of our cultural compulsion to fix problems as quickly as possible.

Her comments made me question the core way that I evangelize!

Do Lean and Agile substitute action for knowledge? No. We use action to acquire knowledge.

The fundamental assumption that drives poor decision-making is that we have enough information to make a design, solve a problem or define a market. Lean and Agile’s core tenet is that we must attack this assumption: we must assume that we cannot gather enough information to fully define our objective. The good news is that even without much analysis we know a lot! We know:

  • roughly what we want to do (road map)
  • the first steps we should take (tactics)
  • who will be working on the problem (team members)
  • generally how much effort it will take (time & team size)
  • who has the problem that we are trying to solve (market)

We also know that we’ll learn a lot more as we get closer to our target. Every delay in starting effectively pushes our “day of clarity” further into the future. For that reason, it is essential that we build a process that constantly reviews and adjusts its targets.

We need to build a process that makes rapid progress and acquires knowledge as that progress is made.

In Agile, we translate this need into the decorations of our process: reviews for learning, retrospectives for adjustments, planning for taking action and short iterations to drive the feedback loop.  Agile’s mantra is “ready, fire, aim, fire, aim, fire, aim, …” which is very different from simply jumping out of a plane without a parachute and hoping you’ll find a haystack to land in.

For cloud deployments, this means building operational knowledge in stages.  Technology is simply evolving too quickly, and best practices too slowly, for anyone to wait for a packaged solution to solve all their cloud infrastructure problems.  We tried this and it does not work: clouds are a mixture of hardware, software and operations.  More accurately, clouds are an operational model supported by hardware and software.

Currently, 80% of cloud deployment effort is operations (or “DevOps”).

When I listen to people’s plans about building product or deploying cloud, I get very skeptical when they take a lot of time to aim at objects far off on the horizon.  Perhaps they are worried that they will substitute action for knowledge; however, I think they would be better served to test their knowledge with a little action.

My MIL agrees – she sees her patients frequently and makes small adjustments to their treatment as needed.  Wow, that’s an Rx for Agile!

Notes from 2011 Cloud Connect Event Day 2 (#ccevent)

With the OpenStack launch behind me, I have some time to attend the Cloud Connect Event.  I missed all the DevOps sessions, but got to geek out on the NoSQL & Big Data sessions.   I jumped to the private cloud track (based on Twitter traffic) and was rewarded for the shift.

I’m surprised at how much of this cloud conference is dedicated to private cloud.  At other cloud conferences I’ve attended, the focus has been on learning how to use the cloud (specifically the public cloud).  This is the first cloud show I’ve attended with so much emphasis, dialog and vendor feeding around private cloud.  This was a suits & slacks show with few jeans, t-shirts, and ponytails.  Perhaps private cloud is where the $$$ is being spent now?

It definitely feels like using cloud has become assumed, but the best practices and tools are just emerging.

The Twitter #ccevent stream is interesting but ephemeral.  I’m posting my raw (spelling optional) notes (below the more tag) because there is a lot of great content from the show to support and extend the Twitter stream.  I’ll try to italicize some of the better lines.


“Flatness at the Edges” guides hyperscale cloud design

As I’m working on a larger “cloud bootstrapping” white paper (look for a pending Dell release), I stumbled on an apparent unifying principle for hyperscale cloud design.  I’m interested in feedback about this concept to see if it fairly encapsulates a common target for cloud hardware, networking and software design.

“Flatness at the Edges” is one of the guiding principles of hyperscale cloud designs.  

Flatness means that cloud infrastructure avoids creating tiers where possible.  For example, having a blade in a frame aggregating networking that is connected to a SAN via a VLAN is a tiered design in which the components are vertically coupled.  A single node with local disk connected directly to the switch has all the same components but in a single “flat” layer.  

Edges are the bottom tier (or “leaves” to us CS geeks) of the cloud.  Being flat creates a lot of edges because most of the components are self-contained.  To scale and reduce complexity, clouds must rely on the edges to make independent decisions such as how to route network traffic, where to replicate data, or when to throttle VMs.  The anti-example of edge design is using VLANs to segment tenants, because VLANs (a limited resource) require configuration at the switching tier to manage traffic generated by an edge component.  We are effectively distributing an intelligence overhead tax across each component of the cloud rather than relying on a “centralized overcloud” to rule them all.
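To make “independent decisions at the edge” concrete, here is a minimal sketch (my illustration, not from the white paper) of one way each node can decide where data should be replicated using only a shared hash function, with no central controller involved:

```python
import hashlib
from bisect import bisect_right

# Every edge node knows the same member list and hash function, so each one
# can compute replica placement locally -- no "centralized overcloud" needed.
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]

def _ring_position(value: str) -> int:
    """Map a string onto a fixed hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Build the ring once; every node derives the identical structure.
RING = sorted((_ring_position(n), n) for n in NODES)

def replica_nodes(object_key: str, copies: int = 3) -> list[str]:
    """Walk clockwise around the ring to pick the nodes that hold the object."""
    start = bisect_right([pos for pos, _ in RING], _ring_position(object_key))
    return [RING[(start + i) % len(RING)][1] for i in range(copies)]

if __name__ == "__main__":
    # Any edge node, given only the key, arrives at the same answer.
    print(replica_nodes("vm-image-1234"))
```

The point is not the hashing itself: it is that placement is computed at the edge from shared knowledge, which is exactly the property that VLAN-style central configuration breaks.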

Combining flatness and edges evolves these sympathetic concepts into a full-fledged cloud design principle.

Interested in discussing this face to face?  I’ll be presenting this and other cloud setup concepts at the SJC OpenStack meetup on 2/3.

Microsoft Azure Cloud – Top 20 Lessons Learned about MS’s PaaS

Last week Dave McCrory (@McCrory) and I (@Zehicle) had the benefit of intensive Azure training at Microsoft HQ to support Dell’s Azure Stamp.

We’ve assembled a top 20 list of things to know about programming for Azure (and really any PaaS leaning cloud):

  1. If you want performance, optimize to reduce fees. Azure (and any cloud) is architected to penalize you if you use their resources poorly. The challenge is to fix this before your boss gets the tab for your unenlightened design decisions.
  2. Coding .NET on Azure is easy; architecting for Azure requires learning. Clouds put things in different places than you are used to, and the rules are different. Expect a learning curve.
  3. Partitioning = parallelism. Learn to love partitions in all their forms, because your app will be throttled if you throw everything into a single partition! On the upside, each partition operates in parallel and even better, they usually don’t cost extra (SQL is the exception).
  4. Roles are flexible. You can run web servers (Apache, etc) on a worker and worker tasks on a web role. This is a good way to save some change since you pay per role instance. It’s counter to separation of concerns, but financially you should also combine workers into a single role where possible.
  5. Understand walking deployments. You can (and should) have simultaneous versions of the code operating against the same data so that you can roll upgrades (ala Timothy Fitz/Eric Ries) to reduce risk without reducing performance. You should expect your data schema to simultaneously span multiple code versions.
  6. Learn about Update Domains (UDs). Update domains allow rolling upgrades and changes to Applications and Services. They are part of how you partition your overall application. If you’re planning a VIP swap deployment, then you won’t care.
  7. Each service = ONE external IP. You can have many VMs backing each service (and multiple roles in a service) and Azure will load balance between them so you can scale out each service. Think of each service as a clonable entity: there will be at least 1 and more can be added if you want to scale.
  8. Understand the difference between VIPs and DIPs. VIPs stand for Virtual IPs and are external, public, and metered. DIPs are internal, private, and load balanced. Azure provides an API to discover your DIPs – do not assume you know them because they are DYNAMIC IPs. Azure won’t let you see other DIPs inside the system.
  9. Azure has rich diagnostics, but beware. Azure leverages the existing diagnostics built into their system, but has to get the data off the box since instances are volatile. This means that problems can be hard to isolate, while excessive logging can impact performance and generate fees. Microsoft lets you target individual systems for elevated logging levels. You can also Terminal Server into a VM for troubleshooting (with caution).
  10. The new Azure admin console rocks. Take your pick between Silverlight or MMC Snap-in.
  11. Everything goes into Azure Storage. Learn to love it. Queues -> storage. Tables -> storage. Blobs -> storage. Logging -> storage. Code Repo -> storage. vDisk -> storage. SQL -> SQL (they march to their own drummer).
  12. Queues are essential, but tricky. Learn the meaning of idempotent, because using queues requires you to handle failures and timeouts. The scary part is that it will work nicely until you exceed some limits and then you’ll experience cascading failure. Whee! Oh yeah, and queues require polling (which stinks as a notification model). See the idempotent-consumer sketch after this list.
  13. SQL Azure is just mostly like MS SQL. Microsoft did a smart thing in keeping Cloud SQL highly compatible with local SQL. The biggest note is the limited partition size. If you embrace the size limits, you will get better performance. So stop pushing BLOBs into databases and start sharding.
  14. Duplicating data in tables will improve performance. This has to do with how partitions and keys operate but is an interesting architecture for NoSQL – stage data for use. Don’t be afraid to stage the same data in multiple ways. It may be faster/cheaper to write data twice if it becomes easier to find when you search it 1000s of times.
  15. Table data can be “warmed up.” Storage has logic that makes frequently accessed items faster (sort of like a cache ;). If you can anticipate load spikes then you should warm the data just before the spike.
  16. Storage billing is based on both the amount stored and the number of transactions. You can get burned on a small, but busy, set of data. Note: you will pay even if a request 404s.
  17. Azure has a CDN. Leveraging Microsoft’s Content Delivery Network (CDN) will improve performance for your users with small, low-latency, high-request items. You need to change your URLs for those assets. Best practice is to use some versioning in the URI so that you can force changes. Remember, the CDN is SLOWER for the first hit when the data is not in cache, so avoid the CDN for low-volume assets!
  18. Provisioning time is not instant. Azure needs anywhere from 1-3 minutes to spin a new instance of a role. Build this lag into your architecture and dynamic scale plans. New databases and partitions are fast.
  19. The VM Role is maintained by YOU. Using the VM role is a handy shortcut, but it has a long list of gotchas. Some of note: 1) the VM can be “reset” to the last VM image state that you uploaded, 2) you are responsible for VM OS upgrades and patches, 3) VMs must be clonable because they will operate in parallel.
  20. Azure supports more than .NET. You can set up anything in a worker (and now VM) role, but there are nuances to doing this effectively. You really need to understand how Azure works and had better be ready to crack open Visual Studio for some things even if you’re writing in Java.
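As promised in item 12, here is a minimal sketch of an idempotent queue consumer. It is illustrative only (plain Python against an in-memory queue, not the Azure SDK); the point is that a queued message may be delivered more than once, so the consumer records processed message IDs and treats duplicates as no-ops:

```python
import queue

# Stand-in for a cloud queue: messages are (message_id, payload) tuples and,
# just like a real queue after a visibility timeout, may be delivered twice.
work_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()
processed_ids: set[str] = set()   # would live in durable storage in real life

def handle(payload: str) -> None:
    """The actual work; must be safe to skip when a duplicate arrives."""
    print(f"provisioning {payload}")

def poll_once() -> None:
    """One polling pass: drain the queue, skipping anything already seen."""
    while True:
        try:
            message_id, payload = work_queue.get_nowait()
        except queue.Empty:
            return
        if message_id in processed_ids:   # duplicate delivery -> no-op
            continue
        handle(payload)
        processed_ids.add(message_id)     # record only after success

if __name__ == "__main__":
    # Simulate a redelivered message: "msg-1" shows up twice, is handled once.
    for item in [("msg-1", "vm-42"), ("msg-1", "vm-42"), ("msg-2", "vm-43")]:
        work_queue.put(item)
    poll_once()
```

In a real deployment the processed-ID set would sit in durable storage and expire over time, but the idempotency contract is the same: re-processing a message must be harmless.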

We hope this list helps you navigate Azure deployments. No matter what cloud you use, understanding Azure’s architecture will help you write better cloud scale applications.

We’d love to hear your suggestions and recommendations!

Mirrored on both blogs: Rob Hirschfeld’s Blog & Dave McCrory’s Blog

Exploding the Cloud Storage Banana

Storage Banana shows how cloud persistence is functionally diverse and optimized

Internally, my group (specifically Dave McCrory & Greg Althaus) has been kicking around some new ways of expressing clouds in an effort to help reconcile Dell’s traditional and cloud focused businesses.  We’ve found it challenging to translate CAP theorem and externalized application state into more enterprise-ready concepts.

Our latest effort led to a pleasantly succinct explanation of why cloud storage is different from enterprise storage.  Ultimately, it’s a matter of control and optimization.  Cloud persistence (cache, queue, tables, objects) is functionally diverse in order to optimize for price and performance, while enterprise storage (SAN, NAS, SQL) is driven by control and centralization.  Unfortunately for enterprises, the data genie is out of the Pandora’s box with respect to architectures that drive much lower cost and higher performance.

The background on this irresistible transformation begins with seeing storage as a spectrum of services as per the table below.

| Category | Characteristics | Storage Type | Protocols & Examples |
| --- | --- | --- | --- |
| Enterprise | Consistent | Block (SAN): iSCSI, InfiniBand | Amazon EBS, EqualLogic, EMC Symmetrix |
| Enterprise | Consistent | File (NAS): NFS, CIFS | NetApp, PowerVault, EMC CLARiiON |
| Enterprise | Consistent | Database (ACID) | MS SQL, Oracle 11g, MySQL, Postgres |
| Cloud | Distributed, Partitioned | Object | DX/Caringo, OpenStack Swift, EMC Atmos |
| Cloud | Distributed, Partitioned | Map/Reduce | Hadoop DFS |
| Cloud | Distributed, Partitioned | Key Value | Cassandra, CouchDB, Riak, Redis, Mongo |
| Cloud | Distributed, Partitioned | Queue (Bus) | RabbitMQ, ActiveMQ, ZeroMQ, OpenMQ, Celery |
| Cloud | Transitory | Messaging | AMQP, MSMQ (.NET) |
| Cloud | Transitory | Shared RAM | MemCache, Tokyo Cabinet |

From this table, I approximated the relative price and performance for each component in the storage spectrum.

The result was the “cloud storage banana” graph.  In this graph, enterprise storage is clustered in the “compromise” quadrant, where there’s a high price for relatively low performance.  Cloud persistence refuses to be clustered at all.  To save cost and enable distributed data, applications will use cheap but slow object storage.  This drives the need for high-speed RAM-based caches and distributed buses. These approaches are required when developers build fault tolerance at the application level.
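As a concrete illustration (mine, not from the original post) of pairing cheap object storage with a fast RAM cache, here is a minimal read-through cache sketch; the object store and cache are simple in-memory stand-ins rather than any particular product’s API:

```python
import time

# In-memory stand-ins: a slow-but-cheap object store and a fast RAM cache.
object_store = {"profile:42": b"...big blob fetched from object storage..."}
ram_cache: dict[str, bytes] = {}

def get_object_slow(key: str) -> bytes:
    """Simulate an object-store GET: durable and cheap, but high latency."""
    time.sleep(0.05)                      # pretend network + disk latency
    return object_store[key]

def read_through(key: str) -> bytes:
    """Serve from cache when possible; on a miss, fetch and populate."""
    if key in ram_cache:                  # fast path: RAM
        return ram_cache[key]
    value = get_object_slow(key)          # slow path: object storage
    ram_cache[key] = value                # the application, not the storage
    return value                          # tier, decides what to cache

if __name__ == "__main__":
    read_through("profile:42")   # first call pays the object-store latency
    read_through("profile:42")   # second call is served from the RAM cache
```

The application accepts that the cache can vanish at any time; that is the fault tolerance moving up the stack.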

Enterprises have enjoyed the false luxury of perceived hardware reliability.  When that assumption is removed, applications are freed to scale more gracefully and to consider resource cost in their consumption plans.

When we compare the enterprise Pandora’s box storage to the cloud persistence banana, a more general pattern emerges.  The cloud persistence pattern represents a fragmentation of monolithic, IT-controlled services into a more function-driven architecture.  In this case, we see the desire for speed, distribution and cost control forcing change in application design patterns.

We also see similar dispersion patterns driving changes in compute and networking conventions.

So next time your corporate IT refuses to deploy RabbitMQ or memcached, just remember my mother’s sage advice for cloud architects: “time flies like an arrow, fruit flies like a banana.”

OpenStack videos peek into cloud shakers

Barton George (Dell’s cloud evangelist and cloud shouter) has posted videos from the OpenStack conference last week.

OpenStack Bexar Design Summit Day 1

Yesterday, Dell sent me to be part of our OpenStack vanguard for the design summit.  The conference is fascinating and productive for the content of the sessions and even more interesting for the hallway meetings.

It’s obvious from the board composition that RackSpace and NASA Nova are driving most of the development; however, there is palpable community interest and enthusiasm.  Participants and contributors showed up in force at this event.

RackSpace and NASA leadership provides critical momentum for the community.  Code is the smallest part of their contribution; their commitment to run the code at scale in production is the magic rocket fuel powering OpenStack. I’ve had many conversations with partners and prospects planning to follow RackSpace into production with a 3-6 month lag.

Beyond that primary conference arc, my impressions:

  • Core vendors like Citrix, Dell and Canonical are signing up to do primary work on the code base.  They are taking ownership of their own components in the stack.
  • Universally, people comment on the speed of progress and the amount of code being generated.  Did I mention that there is a lot of code being written?
  • Networking is still a major challenge.  OpenStack (with Citrix’s Xen support) is driving Open vSwitch as a replacement for iptables management.
  • IPv6 gets lackadaisical treatment in the US, but is urgent in Japan/Asia where the core infrastructure is ALREADY IPv6.  Their frustration at getting attention here should be a canary in the cloud mine (but is not).  They proposed a gateway model where VMs have dual addresses: IPv4 gets NATed while IPv6 is a pass-through. Seems to me that going IPv6 internally is the real solution.
  • Cloud bursting is still too fuzzy a topic to discuss in a big group.  The session about it covered so many use-cases that we did not accomplish anything.  Some people wanted to talk about cloud API proxies while others (myself included) wanted to talk about managing apps between clouds.  My $0.02 is that vendors like RightScale solve the API proxy issue, so it’s the networking issues that need focus.  We need to get back to the use-cases!

Executive Tweet: #openstack: Partners & Code = great progress.  Networking = needs more love


CAP Chasm: why clouds say “no SANk you” to SANs

My personal bias against SANs in cloud architectures is well documented; however, I am in the minority at my employer (Dell) and few enterprise IT shops share my view.  In his recent post about CAP theorem, Dave McCrory has persuaded me to look beyond their failure to bask in my flawless reasoning.  Apparently, this crazy CAP thing explains why some people love SANs (enterprise) and others don’t (clouds).

The deal with CAP is that you can only have two of Consistency, Availability, and Partition Tolerance.  Since everyone wants Availability, the choice is really between Consistency and Partition Tolerance.  Seeking Availability, you’ve got two approaches:

  1. Legacy applications tried to eliminate faults to achieve Consistency with physically redundant, scale-up designs.
  2. Cloud applications assume faults to achieve Partition Tolerance with logically redundant, scale-out designs (see the sketch below).
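To make “logically redundant scale-out” a bit more concrete, here is a minimal sketch (my own illustration, using in-memory replicas rather than real servers) of an AP-style write: the application sends the write to every replica and declares success once a majority acknowledge, tolerating the replicas that are partitioned away:

```python
import random

# In-memory stand-ins for three replicas; in an AP design the application
# (not a SAN or a central controller) owns the redundancy.
replicas = [dict() for _ in range(3)]

def write_to_replica(replica: dict, key: str, value: str) -> bool:
    """Simulate a replica write that can fail because of a partition."""
    if random.random() < 0.2:        # pretend this replica is unreachable
        return False
    replica[key] = value
    return True

def quorum_write(key: str, value: str, quorum: int = 2) -> bool:
    """Succeed when a majority of replicas acknowledge the write."""
    acks = sum(write_to_replica(r, key, value) for r in replicas)
    return acks >= quorum            # lagging replicas catch up later

if __name__ == "__main__":
    print("write accepted:", quorum_write("session:9", "active"))
```

The trade is explicit: the write survives a partition, but a reader hitting the lagging replica may briefly see stale data, which is exactly the Consistency that approach #1 refuses to give up.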

According to CAP, Legacy and cloud approaches are so fundamentally different that they create a “CAP Chasm” in which the very infrastructure fabric needed to deploy these applications is different.

As a cloud geek, I consider the inherent cost and scale limitations of a CA approach much too limiting.  My first-hand experience is that our customers and partners share my view: they have embraced AP patterns.  These patterns make more efficient use of resources, dictate simpler infrastructure layouts, scale like hormone-crazed rabbits at a carrot farm, and can be deployed on less expensive commodity hardware.

As a CAP theorem enlightened IT professional, I can finally accept that there are other intellectually valid infrastructure models. 

See Mom?  I can play nicely with others after all.

VM != Cloud! Comparison draws ire, misses point

Having the benefit of working with both Dave McCrory and Joyent on a daily basis at Dell, I cannot resist weighing in on the blog pong between them.

Dave’s post comparing VM pricing prompted Joyent to blog that VMs are not the only measure of cloud.

While I completely agree that clouds are not all about VMs, I think Joyent’s reply defines cloud too narrowly.  We’re seeing the emergence of services as the differentiator between clouds.

Looking at Amazon, Azure, and Google, the clear way to reduce cloud spend is to migrate applications to consume their services (SQL, Storage, Bus, etc).

If cloud users are primarily concerned about price per hour (which I’m not convinced is the case), then they have real motivation to migrate from purely VM (or SmartMachine™) based applications to ones that use services.

Shaken or stirred? Cloud Cocktail leads to insights

Part of my professional & personal mission is to kick over mental ant hills.  In the cloud space, I believe that people are trying way too hard to force cloud into neat little buckets.  That leads me to try to reorient around new visualizations.  The purpose of doing this is to strip away historical thought patterns that limit our ability to envision future patterns (meaning: attitude adjustment).

The Cloud Cocktail

With that overly erudite preamble, here’s a tasty potion that I mixed up for you to enjoy on your way to real libations at ACL.

The technologies underlying cloud are complex; however, the core components of cloud are simple: applications, networked services and virtualized infrastructure.  These three components, in varying proportions and garnished with management APIs, form the basis for all cloud solutions.

This cocktail-napkin sketch of a cloud may appear sparse, but it provides the key insights that drive a vision for how to adapt and respond to clouds’ rapid metamorphosis.  It would be ideal to point to a single set of technologies and declare it to be the Cloud; unfortunately, cloud is a transformation, not an end-state.