32nd rule to measure complexity + 6 hyperscale network design rules

If you’ve studied computer science then you know there are algorithms that calculate “complexity.” Unfortunately, these have little practical use for data center operators.  My complexity rule does not require a PhD:

The 32nd rule: If it takes more than 30 seconds to pick out what would be impacted by a device failure then your design is too complex.

6 Hyperscale Network Design Rules

  1. Cost Matters
  2. Keep Networks Flat
  3. Filter at the Edge
  4. Design Fault Zones
  5. Plan for Local Traffic
  6. Offer load balancers (to your users)

Sorry for the teaser… I’ll be able to release more substance behind this list soon.   Until then comments are (as always) welcome!

 

 

 

“Flatness at the Edges” guides hyperscale cloud design

As I’m working on a larger “cloud bootstrapping” white paper (look for a pending Dell release), I stumbled on an apparent unifying principle for hyperscale cloud design.  I’m interested in feedback about this concept to see if it fairly encapsulates a common target for cloud hardware, networking and software design.

“Flatness at the Edges” is one of the guiding principles of hyperscale cloud designs.  

Flatness means that cloud infrastructure avoids creating tiers where possible.  For example, having a blade in a frame aggregating networking that is connected to a SAN via a VLAN is a tiered design in which the components are vertically coupled.  A single node with local disk connected directly to the switch has all the same components but in a single “flat” layer.  

Edges are the bottom tier (or “leaves” to us CS geeks) of the cloud.  Being flat creates a lot of edges because most of the components are self contained.  To scale and reduce complexity, clouds must rely on the edges to make independent decisions such as how to route network traffic, where to replicate data, or when to throttle VMs.  The anti-example of edge design is using VLANs to segment tenants because VLANs (a limited resource) require configuration at the switching tier to manage traffic generated by an edge component.  We are effectively distributing an intelligence overhead tax on each component of the cloud rather than relying on a “centralized overcloud” to rule them all. 

Combining flatness and edges evolves the sympathetic concepts into full-fledged cloud design principle.

Interested in discussing this face to face?  I’ll presenting this and other cloud setup concepts that the SJC OpenStack meetup on 2/3.

Adding #11 to RightScale CEO’s Top 10 Cloud Myths

Generally, I think of a “Top 10 Cloud Myths” post as pure self-serving marketing fluffery, so I was pleasantly surprised to see Michael Crandal (RightScale’s CEO) producing a list with some substance.   Don’t get me wrong, the list is still a RightScale value prop 101.  It’s just that they have the good fortune to be addressing real problems and creating real value.

So, here’s my Myth #11 “We have to re-write our applications to run in the cloud.”  While that’s largely a myth; it may be a good myth to keep around because many applications SHOULD be rewritten – not for the cloud, but for changing usage patterns (more mobile users, more remote users, more SOA clients, etc)

OpenStack Swift Retriever Demo Online (with JavaScript xmlhttprequest image retrieval)

This is a follow-up to my earlier post with the addition of WORKING CODE and an ONLINE DEMO. Before you go all demo happy, you need to have your own credentials to either a local OpenStack Swift (object storage) system or RackSpace CloudFiles.

The demo is written entirely using client side JavaScript. That is really important because it allows you to test Swift WITHOUT A WEB SERVER. All the other Swift/Rackspace libraries (there are several) are intended for your server application to connect and then pass the file back to the client. In addition, the API uses meta tags that are not settable from the browser so you can’t just browse into your Swift repos.

Here’s what the demo does:

  1. Login to your CloudFiles site – returns the URL & Token for further requests.
  2. Get a list of your containers
  3. See the files in each container (click on the container)
  4. Retrieve the file (click on the file) to see a preview if it is an image file

The purpose of this demo is to be functional, not esthetic. Little hacks like pumping the config JSON data to the bottom of the page are helpful for debugging and make the action more obvious. Comments and suggestions are welcome.

The demo code is 4 files:

  1. demo.html has all the component UI and javascript to update the UI
  2. demo.js has the Swift interfacing code (I’ll show a snippet below) to interact with Swift in a generic way
  3. demo.css is my lame attempt to make the page readable
  4. jQuery.js is some first class code that I’m using to make my code shorter and more functional.

1-17 update: in testing, we are working out differences with Swift and RackSpace. Please expect updates.

HACK NOTE: This code does something unusual and interesting. It uses the JavaScript XmlHttpRequest object to retrieve and render a BINARY IMAGE file. Doing this required pulling together information from several sources. I have not seen anyone pull together a document for the whole process onto a single page! The key to making this work is overrideMimeType (line G), truncating the 32 bit string to 16 bit ints ( & 0xFF in encode routine), using Base64 encoding (line 8 and encode routine), and then “src=’data:image/jpg;base64,[DATA GOES HERE]'” in the tag (see demo.html file).

Here’s a snippet of the core JavaScript code (full code) to interact with Swift. Fundamentally, the API is very simple: inject the token into the meta data (line E-F), request the /container/file that you want (line D), wait for the results (line H & 2). I made it a little more complex because the same function does EITHER binary or JSON returns. Enjoy!

retrieve : function(config, path, status, binary, results) {

1   xmlhttp = new XMLHttpRequest();

2   xmlhttp.onreadystatechange=function()  //callback

3      {

4         if (xmlhttp.readyState==4 && xmlhttp.status==200) {

5            var out = xmlhttp.responseText;

6            var type = xmlhttp.getResponseHeader("content-type");

7            if (binary)

8               results(Swift.encode(out), type);

9            else

A               results(JSON.parse(out));

B         }

C      }

D   xmlhttp.open('GET',config.site+'/'+path+'?format=json', true)

E   xmlhttp.setRequestHeader('Host', config.host);

F   xmlhttp.setRequestHeader('X-Auth-Token', config.token);

G   if (binary) xmlhttp.overrideMimeType('text/plain; charset=x-user-defined');

H   xmlhttp.send();
}

OpenStack Swift Demo (in a browser)

I’m working on mini-demo project for OpenStack Swift.  To keep things very simple and easy to understand, I decided that the whole demo would work in JavaScript in the browser.  I also choose to use RackSpace’s CloudFiles as a Swift target for testing since they have the same API are are universally available (unlike my lab systems).

One advantage of this approach is that FireBug makes it very nice to debug and check the activity of the code.  Unfortunately, FireBug also seems to eat the headers that I need.  *Let me phrase that in a google friend way so that someone else will not loose the 2 hours I just lost*

“XmlHttpRequest setRequestHeader FireFox Not Respected when using FireBug”

It works great in Safari. So onward and upward.  So far, I’ve got step #1 ready – getting the authorization token back from the cloud site.

Here’s the HTML page (you need jQuery too).  Basically, it uses the username and key from the inputs to set “x-auth-user” and “x-auth-key” header attributes.  These attributes will allow Swift to return a token that you can use on future requests when you want to do useful work.

<!DOCTYPE html>

<html>

<head>

<title>Dell Swift Demo [0.0]</title>

<script src=”jquery.js” type=”text/javascript”></script>

<script type=”text/javascript” charset=”utf-8″>

var xmlhttp = null;

function swiftLogin() {

var usr = $(‘input:text[name=usr]’).val();

var key = $(‘input:text[name=key]’).val();

// code for IE7+, Firefox, Chrome, Opera, Safari (UR SOL IE<7)

xmlhttp = new XMLHttpRequest();

xmlhttp.onreadystatechange=function() //callback

{

if (xmlhttp.readyState==2)

{

$(‘#status’).replaceWith(xmlhttp.getResponseHeader(“X-Auth-Token”));

}

}

xmlhttp.open(‘GET’,’https://auth.api.rackspacecloud.com/v1.0&#8242;, true);

xmlhttp.setRequestHeader(‘Host’, ‘auth.api.rackspacecloud.com’);

xmlhttp.setRequestHeader(‘X-Auth-User’, usr);

xmlhttp.setRequestHeader(‘X-Auth-Key’, key);

xmlhttp.send();

}

</script>

</head>

<body>

<div id=”credentials”>

<fieldset id=”credentials” class=””>

<legend>Swift Login</legend>

<label for=”user”>User: </label><input type=”text” name=”usr” value=”user” id=”user”>

<label for=”key”>Key: </label><input type=”text” name=”key” value=”key” id=”key”>

<input type=”button” name=”Login” value=”login” id=”Login” onclick=”swiftLogin();”>

</fieldset>

</div>

<div id=”status”>[pending]</div>

<div id=”footer”>Time?</div>

<script type=”text/javascript”>

$(‘#footer’).replaceWith((new Date).toString());

swiftLogin();

</script>

</body>

</html>

Cloud Gravity – launching apps into the clouds

Dave McCrory‘s Cloud Gravity series (Data Gravity & Escape Velocity) brings up some really interesting concepts and has lead to some spirited airplane discussions while Dell shuttled us to an end of year strategy meeting.  Note: whoever was on American 34 seats 22A/C – we apologize if we were too geek-rowdy for you.

Dave’s Cloud Gravity is the latest unfolding of how clouds are evolving as application architectures before more platform capable.  I’ve explored these concepts in previous posts (Storage Banana, PaaS vs IaaS, CAP Chasm) to show how cloud applications are using services differently than traditional applications.

Dave’s Escape Velocity post got me thinking about how cleanly Data Gravity fits with cloud architecture change and CAP theorem.

My first sketch shows how traditional applications are tightly coupled with the data they manipulate.  For example, most apps work directly on files or a database direct connection.  These apps rely on very consistent and available data access.  They are effectively in direct contact with their data much like a building resting on it’s foundation.  That works great until your building is too small (or too large).  In that case, you’re looking a substantial time delay before you can expand your capcity.

Cloud applications have broken into orbit around their data.  They still have close proximity to the data but they do their work via more generic network connections.  These connections add some latency, but allow much more flexible and dynamic applications.  Working within the orbit analogy, it’s much much easier realign assets in orbit (cloud servers) to help do work than to move buildings around on the surface.

In the cloud application orbital analogy, components of applications may be located in close proximity if they need fast access to the data.  Other components may be located farther away depending on resource availability, price or security.  The larger (or more valuable) the data, the more likely it will pull applications into tight orbits.

My second sketch extends to analogy to show that our cloud universe is not simply point apps and data sources.  There truly a universe of data on the internet with hugh sources (Facebook, Twitter, New York Stock Exchange, my blog, etc) creating gravitational pull that brings other data into orbit around them.  Once again, applications can work effectively on data at stellar distances but benefit from proximity (“location does not matter, but proximity does”).

Looking at data gravity in this light leads me to expect a data race where clouds (PaaS and SaaS) seek to capture as much data as possible.

Microsoft Azure Cloud – Top 20 Lessons Learned about MS’s PaaS

Last week Dave McCrory (@McCrory) and I (@Zehicle) had the benefit of intensive Azure training at Microsoft HQ to support Dell’s Azure Stamp.

We’ve assembled a top 20 list of things to know about programming for Azure (and really any PaaS leaning cloud):

  1. If you want performance, optimize to reduce fees. Azure (and any cloud) is architected to penalize you if you use their resources poorly. The challenge is to fix this before your boss get the tab for your unenlightened design decisions.
  2. Coding .NET on Azure easy, architecting for Azure requires learning. Clouds put things in different places than you are used to and the rules are different. Expect a learning curve.
  3. Partitioning = parallelism. Learn to love partitions in all their forms, because your app will be throttled if you throw everything into a single partition! On the upside, each partition operates in parallel and even better, they usually don’t cost extra (SQL is the exception).
  4. Roles are flexible. You can run web servers (Apache, etc) on a worker and worker tasks on a web role. This is a good way to save some change since you pay per role instance. It’s counter to separation of concerns, but financially you should also combine workers into a single role where possible.
  5. Understand walking deployments. You can (and should) have simultaneous versions of the code operating against the same data so that you can roll upgrades (ala Timothy Fitz/Eric Ries) to reduce risk and without reducing performance. You should expect your data schema to simultaneously span mutiple code versions.
  6. Learn about Update Domains (UDs). Deployment domains allow rolling upgrades and changes to Applications and Services. They are part of how you partition your overall application. If you’re planning a VIP swap deployment, then you won’t care.
  7. Each service = ONE external IP. You can have many VMs backing each service (and multiple roles in a service) and Azure will load balance between them so you can scale out each service. Think of each service as a clonable entity: there will be at least 1 and more can be added if you want to scale.
  8. Understand between VIP and DIP. VIPs stand for Virtual IPs and are external, public, and metered. DIPs are internal, private, and load balanced. Azure provides an API to discover your DIPs – do not assume you know them because they are DYNAMIC IPs. Azure won’t let you see other DIPs inside the system.
  9. Azure has rich diagnostics, but beware. Azure leverages the existing diagnostics built into their system, but has to get the data off box since instances are volitile. This means that problems can be hard to isolate while excessive logging can impact performance and generate fees. Microsoft lets you target individual systems for elevated levels. You can also Terminal Server to a VM for troubleshooting (with caution).
  10. The new Azure admin console rocks. Take your pick between Silverlight or MMC Snap-in.
  11. Everything goes into Azure Storage. Learn to love it. Queues -> storage. Tables -> storage. Blobs -> storage. Logging -> storage. Code Repo -> storage. vDisk -> storage. SQL -> SQL (they march to their own drummer).
  12. Queues are essential, but tricky. Learn the meaning of idempotent because using queues requires you to handle failures and timeouts. The scary part is that it will work nicely until you exceed some limits and then you’ll experience cascading failure. Whee! Oh yea, and queues require polling (which stinks as a notification model).
  13. SQL Azure is just mostly like MS SQL. Microsoft did a smart thing in keeping Cloud SQL so it was highly compatible with Local SQL. The biggest note is that limited in size of partition. If you embrace the size limits you will get better performance. So stop pushing BLOBs into databases and start sharding.
  14. Duplicating data in tables will improve performance. This has to do with how partitions and keys operate but is an interesting architecture for NoSQL – stage data for use. Don’t be afraid to stage the same data in multiple ways. It may be faster/cheaper to write data twice if it becomes easier to find when you search it 1000s of times.
  15. Table data can be “warmed up.” Storage has logic that makes frequently accessed items faster (sort of like a cache ;). If you can anticipate load spikes then you should warm the data just before the spike.
  16. Storage billing is both amount and transactions. You can get burned on a small, but busy set of data. Note: you will pay even if you 404 a request.
  17. Azure has a CDN. Leveraging Microsoft’s Content Delivery Network (CDN) will improve performance for your users with small, low latency, high request items. You need to change your URLs for those assets. Best practice is to use some versioning in the URI so that you can force changes. Remember, CDN is SLOWER for the first hit when the data is not in cache so avoid CDN for low volume assets!
  18. Provisioning time is not instant. Azure needs anywhere from 1-3 minutes to spin a new instance of a role. Build this lag into your architecture and dynamic scale plans. New databases and partitions are fast.
  19. The VM Role is maintained by YOU. Using the VM role is a handy shortcut, but has a long list of gotcha’s. Some of note: 1) the VM can be “reset” to the last VM image state that you uploaded, 2) you are responsble for VM OS upgrades and patches, 3) VMs must be clonable because they will operate in parallel.
  20. Azure supports more than .NET. You can setup anything in a worker (and now VM) role, but there are nuances to doing this effectively. You really need to understand how Azure works and had better be ready to crack open Visual Studio for some things even if you’re writing in Java.

We hope this list helps you navigate Azure deployments. No matter what cloud you use, understanding Azure’s architecture will help you write better cloud scale applications.

We’d love to hear your suggestions and recommendations!

Mirrored on both blogs: Rob Hirschfeld’s Blog & Dave McCrory’s Blog

Jevon’s Paradox

I’ve been finding it necessary to quote Jevon’s paradox several times lately and realized that I have NOT referenced it here.  Quite simply, understanding Jevon’s paradox is essential to understanding cloud.

The concept of the paradox is that when we make something more efficient (for example gas in cars), the demand for that resource goes up (we move further into the exurbs because driving is cheaper).  Notably, as Moore’s law drives computer efficiency up, we are using more and more computers.  Specifically, I have more computers in my house every year even though the efficiency of just one smart phone far (far) exceeds power of my son’s Sinclair 1000.

In cloud computing, Jevon’s paradox points us to the expectation that the rush of applications and activity in the cloud will continue to accelerate.  Since I expect competition and Moore’s law to drive increasing gains in cloud efficiency (and therefore customer advantageous price signals) the market will happily convert these utilization improvements into more and more interesting capabilities.

The cloud expansion means that we can sustain more providers entering the market.  In fact, Jevon would tell us that more providers will likely INCREASE demand for cloud as competition and capacity put downward pressure on prices.  [Q: where will they make up the margin?   A: Adjacent Services]

The loser in cloud’s exploration of Jevon’s paradox are non-cloud deployments (Dell strategists are you listening?).  These systems suffer because their ability to improve their efficiency is limited.

As I look down the road on cloud, I can see many opportunities for current applications to take advantage of cheaper cloud resources to provide even more value.  For example, adding map-reduce analytics to scan a customer’s data can provide tremendous insights.  Today, it’s a luxury like flying on the Concorde.  Tomorrow, it will be part like hopping the Nerd Bird from San Jose to Austin – just a normal part of our daily lives.

 

Note: A shout out to Dave McCrory who introduced me to Jevon’s Paradox.

Getting cozy with “Adjacent Services”

I’ve had a busy week with Azure Training and Cloud Camp Seattle.  It’s going to take a few days to unwind specific posts about both, but I wanted to hit some shiny new thoughts.

Services helping each other

  • Adjacent Services are dedicated and/or public services (XaaS) that are offered along side generic public cloud offerings.   For a company like Dell (my employer), this could be specific brands of storage or databases (e.g. Oracle).  I believe these are much higher margin XaaS than IaaS.
  • Layer 7 Load Balancers represent a more intelligent link between load direction and the applications. I heard people using this term in multiple contexts.   For example,  In Azure, the apps can set themselves as “offline” and they will stop getting traffic then they can turn themselves online when they are ready for more.
  • Cloud Rollout/Migration is a rolling upgrade scheme where you can send traffic to 2 versions of your application at the same time!  You upgrade by zones and if you have >2 zones then you’ll have two active versions at the same time.  Your data models need to accommodate this.
  • We don’t have enough Agile Cloud programming books (like Dave Thomas’ RoR Intro).  We need a cloud programming book that STARTS WITH INTEGRATION TESTS and shows how to use all the adjacent services.  I may just have to write one (or three).

Thanks to many many at Microsoft for the great Azure training sessions.  I’ll add more names, but for now I have links to Steve Marx (Smarx.com) & Srirm Krishnan (Sriram Krishnan.com) .

Exploding the Cloud Storage Banana

Storage Banana shows how cloud persistence is functionally diverse and optimized

Internally, my group (specifically Dave McCrory & Greg Althaus) has been kicking around some new ways of expressing clouds in an effort to help reconcile Dell’s traditional and cloud focused businesses.  We’ve found it challenging to translate CAP theorem and

externalized application state into more enterprise-ready concepts.

Our latest effort led to a pleasantly succinct explanation of why cloud storage is different than enterprise storage.  Ultimately, it’s a matter of control and optimization.  Cloud persistence (Cache, Queue, Tables, Objects) is functionally diverse in order to optimize for price and performance while enterprise storage (SAN, NAS, SQL) is control and centralization driven.  Unfortunately for enterprises, the data genie is out of the Pandora’s box with respect to architectures that drive much lower cost and higher performance.

The background on this irresistible transformation begins with seeing storage as a spectrum of services as per the table below.

Enterprise:

Consistent

 

Block (SAN) iSCSI, Infiband:

Amazon EBS, EqualLogic, EMC Symmeterix

File (NAS) NFS, CIFS:

NetApp, PowerVault, EMC Clariion

Database (ACID) MS SQL, Oracle 11g, MySQL, Postgres
Cloud:

Distributed

Partitioned

Object DX/Caringo, OpenStack Swift, EMC Atmos
Map/Reduce Hadoop DFS
Key Value Cassandra, CouchDB, Riak, Reddis, Mongo
Queue (Bus) RabbitMQ, ActiveMQ, ZeroMQ, OpenMQ, Celery
Cloud:

Transitory

 

Messaging AMPQ, MSMQ (.NET)
Shared RAM MemCache, Tokyo Cabinet

From this table, I approximated the relative price and performance for each component in the storage spectrum.

The result was the “cloud storage banana” graph.  In this graph, enterprise storage is clustered in the “compromise” quadrant where there’s a high price for relatively low performance.  The cloud persistence refuses to be clustered at all.  To save cost and enable distributed data, applications will use cheap but slow object storage.  This drives the need for high speed RAM based cache and distributed buses. These approaches are required when developers build fault tolerance at the application level.

Enterprises have enjoyed the false luxury of perceived hardware reliability.  Where these assumptions are removed, applications are freed to scale more gracefully and consider resource cost in their consumption plans.

When we compare the enterprise Pandora’s box storage to the cloud persistence banana, a more general pattern emerges.  The cloud persistence pattern represents a fragmentation of monolithic, IT controlled services into a more functional driven architecture.  In this case, we see desire for speed, distribution and cost forcing change to application design patterns.

We also see similar dispersion patterns driving changes in compute and networking conventions.

So next time your corporate IT refuses to deploy Rabbit MQ or MemCacheD, just remember my mother’s sage advice for cloud architects: “time flies like an arrow, fruit flies like an banana.”