The 451 Group Cloudscape report strikes chord misses harmony (DevOps, Hybrid Cloud, Orchestration) November 2, 2011
Posted by Rob H in Architecture, Clouds, Dave McCrory, DevOps.Tags: 451, applications, Architecture, cloud, DevOps, hybrid, networking, orchestration, ProTier, Storage
3 comments
It’s impossible to resist posting about this month’s 451 Group Cloudscape report when it calls me out by name as a leading cloud innovator:
… ProTier founders Dave McCrory and Rob Hirschfeld. ProTier [note: now part of Quest] was, indeed, the first VMware ecosystem vendor to be tracked by The 451 Group. In the face of a skeptical world, these entrepreneurs argued that virtualization needed automation in order to realize its full potential, and that the test lab was the low-hanging fruit. Subsequent events have more than vindicated their view (pg. 33).
It’s even better when the report is worth reading and offers insights into forces shaping the industry. It’s nice to be “more than vindicated” on an amazing journey we started over 10 years ago!
Rather than recite 451′s points (hybrid cloud = automation + orchestration + devops + pixie dust), I’d rather look at the problem different way as a counterpoint.
The problem is “how do we deal with applications that are scattered over multiple data centers?”
I do not think orchestration is the complete answer. Current orchestration is too focused on moving around virtual machines (aka workloads).
Ultimately, the solution lies in application architecture; however, I feel that is also a misdirection because cloud is redefining what an “application architecture” means.
Applications are a dynamic mix of compute, storage, and connectivity.
We’re entering an age when all of these ingredients will be delivered as elastic services that will be managed by the applications themselves. The concept of self management is an extension of DevOps principles that fuse application function and deployment. There are missing pieces, but I’m seeing the innovation moving to fill those gaps.
If you want to see the future of cloud applications then look at the network and storage services that are emerging. They will tell you more about the future than orchestration.
Virtualizing #OpenStack Nova: looking at the many ways to skin the CAcTus (#KVM v #XenServer v #ESX) May 27, 2011
Posted by Rob H in Uncategorized.Tags: cloud, esx, kvm, Nova, OpenStack, vcenter, virtualization, VMware, xen
3 comments
<service bulletin> Server virtualization is not cloud: it is a commonly used technology that creates convenient resource partitions for cloud operations and infrastructure as a service providers. </service bulletin>
OpenStack claims support for nearly every virtualization platform on the market. While the basics of “what is virtualization” are common across all platforms, there are important variances in how these platforms are deployed. It is important to understand these variances to make informed choices about virtualization platforms.
Your virtualization model choice will have deep implications on your server/networking choice, deployment methodology and operations infrastructure.
My focus is on architecture not specific hypervisors so I’m generalizing to just three to make the each architecture description more concrete:
- KVM (open source) is highly used by developers and single host systems
- XenServer (open/freemium) leads public cloud infrastructure (Amazon EC2, Rackspace Cloud, and GoGrid)
- ESX/vCenter (licensed) leads enterprise virtualized infrastructure
Of course, there are many more hypervisors and many different ways to deploy the three I’m referencing.
This picture shows all three options as a single system. In practice, only operators wishing to avoid exposure to RESTful recreational activities would implement multiple virtualization architectures in a single system. Let’s explore the three options:
OS + Hypervisor (KVM) architecture deploys the hypervisor a free standing application on top of an operating system (OS). In this model, the service provider manages the OS and the hypervisor independently. This means that the OS needs to be maintained, but is also allows the OS to be enhanced to better manage the cloud or add other functions (share storage). Because they are least restricted, free standing hypervisors lead the virtualization innovation wave.
Bare Metal Hypervisor (XenServer) architecture integrates the hypervisor and the OS as a single unit. In this model, the service provider manages the hypervisor as a single unit. This makes it easier to support and maintain the hypervisor because the platform can be tightly controlled; however, it limits the operator’s ability to extend or multi-purpose the server. In this model, operators may add agents directly to the individual hypervisor but would not make changes to the underlying OS or resource allocation.
Clustered Hypervisor (ESX + vCenter) architecture integrates multiple servers into a single hypervisor pool. In this model, the service provider does not manage the individual hypervisor; instead, they operate the environment through the cluster supervisor. This makes it easier to perform resource balancing and fault tolerance within the domain of the cluster; however, the operator must rely on the supervisor because directly managing the system creates a multi-master problem. Lack of direct management improves supportability at the cost of flexibility. Scale is also a challenge for clustered hypervisors because their span of control is limited to practical resource boundaries: this means that large clouds add complexity as they deal with multiple clusters.
Clearly, choosing a virtualization architecture is difficult with significant trade-offs that must be considered. It would be easy to get lost in the technical weeds except that the ultimate choice seems to be more stylistic.
Ultimately, the choice of virtualization approach comes down to your capability to manage and support cloud operations. The Hypervisor+OS approach maximum flexibility and minimum cost but requires an investment to build a level competence. Generally, this choice pervades an overall approach to embrace open cloud operations. Selecting more controlled models for virtualization reduces risk for operations and allows operators to leverage (at a price, of course) their vendor’s core competencies and mature software delivery timelines.
While all of these choices are seeing strong adoption in the general market, I have been looking at the OpenStack community in particular. In that community, the primary architectural choice is an agent per host instead of clusters. KVM is favored for development and is the hypervisor of NASA’s Nova implementation. XenServer has strong support from both Citrix and Rackspace.
Choice is good: know thyself.
BlackOps: 7 tenants for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale April 18, 2011
Posted by Rob H in Agile, Amazon, CloudOps, Clouds, Lean, OpenStack.Tags: amazon, BlackOps, cloud, CloudOps, hyperscale, OpenStack, rightscale, Taxonomy
7 comments
In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list just behind world dominance peace. Basically, a cloud taxonomy is layer cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch. Unfortunately, our clock-like layer cake has evolved to into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.
The key to de-spaghettifying our cloud taxomony was to realize that clouds have two distinct sides: an external well-known API and internal “black box” operations. Each side has different objectives that together create an elastic, scalable cloud.
The objective of the API side is to provide the smallest usable “surface area” for cloud consumers. Surface area describes the scope of the interface that is published to the users. The smaller the area, the easier it is for users to comprehend and harder it is for them to break. Amazon’s EC2 & S3 APIs set the standards for small surface area design and spawned a huge cloud ecosystem.
To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem. The cloud ecosystem exists primarily beyond the API of the cloud. It provides users with flexible options that address their specific use cases. Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.
The objective of the internal side of clouds is to create a prefect black box to give API users the illusion of a perfectly performing, strictly partitioned and totally elastic resource pool. To the consumer, it does should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it does matter a great deal to the cloud operator.
Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps).
The BlackOps challenge is that clouds cannot wait until all of the answers are known because issues (or solutions) to scale architecture are difficult to predict in advance. Even worse, taking the time to solve them in advance likely means that you will miss the market.
Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale. To cope with constant changes, BlackOps live by seven tenants that help manage their infrastructure efficiently in a dynamic environment.
- Operational ownership – don’t wait for all the king’s horses and consultants to put your back together again (but asking for help is OK).
- Simple APIs – reduce the ways that consumers can stress the system making the scale challenges more predictable.
- Efficiency based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
- Automated processes & verification – ensures that changes and fixes can propagate at scale while errors are self-correcting.
- Frequent incremental rolling adjustments – prevents the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”)
- Passion for operational simplicity – at hyperscale, technical debt compounds very quickly. Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
- Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work. You must catch problems early before they infect your scale infrastructure. The only thing more frustrating than fixing a problem at scale, if fixing the same problem multiple times.
It’s no surprise that these are exactly the Agile & Lean principles. The pace of change of cloud is so fast and fluid, that BlackOps must use an operational model that embraces iterative and rolling deployment.
Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.
I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies. BlackOps is a highly disciplined process based on well understood principles from just-in-time (JIT) and lean manufacturing. Best of all, they are fast to market, able to deliver high quality and capable of responding to change.
Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project. I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.
Substituting Action for Knowledge – adopting “ready, fire, aim” as a strategy (and when to run like hell) March 28, 2011
Posted by Rob H in Agile, DevOps, Lean, Open Trends.Tags: Agile, antidepressants, cloud, DevOps, Lean, operational model, operations, roadmap
1 comment so far
Today my mother-in-law (a practicing psychiatrist) was bemoaning the current medical practice of substituting action for knowledge. In her world, many doctors will make rapid changes to their patients’ therapy. Their goal is to address the issues immediately presented (patient feels sad so Dr prescribes antidepressants) rather than taking time to understand the patients’ history or make changes incrementally and measure impacts. It feels like another example of our cultural compulsion to fix problems as quickly as possible.
Her comments made me question the core way that I evangelize!
Do Lean and Agile substitute action for knowledge? No. We use action to acquire knowledge.
The fundamental assumption that drives poor decision-making is that we have enough information to make a design, solve a problem or define a market. Lean and Agile’s more core tenet is that we must attack this assumption. We must assume that we can’t gather enough information to fully define our objective. The good news, is that even without much analysis we know a lot! We know:
- roughly what we want to do (road map)
- the first steps we should take (tactics)
- who will be working on the problem (team members)
- generally how much effort it will take (time & team size)
- who has the problem that we are trying to solve (market)
We also know that we’ll learn a lot more as we get closer to our target. Every delay in starting effectively pushed our “day of clarity” further into the future. For that reason, it is essential that we build a process that constantly reviews and adjusts its targets.
We need to build a process that acquires knowledge as progress is made and makes rapid progress.
In Agile, we translate this need into the decorations of our process: reviews for learning, retrospectives for adjustments, planning for taking action and short iterations to drive the feedback loop. Agile’s mantra is “ready, fire, aim, fire, aim, fire, aim, …” which is very different from simply jumping out of a plane without a parachute and hoping you’ll find a haystack to land in.
For cloud deployments, this means building operational knowledge in stages. Technology is simply evolving too quickly and best practices too slowly for anyone to wait for a packaged solution to solve all their cloud infrastructure problems. We tried this and it does not work: clouds are a mixture hardware, software and operations. More accurately, clouds are an operational model supported by hardware and software.
Currently, 80% of cloud deployment effort is operations (or “DevOps“).
When I listen to people’s plans about building product or deploying cloud, I get very skeptical when they take a lot of time to aim at objects far off on the horizon. Perhaps they are worried that they will substitute action for knowledge; however, I think they would be better served to test their knowledge with a little action.
My MIL agrees – she sees her patients frequently and makes small adjustments to their treatment as needed. Wow, that’s an Rx for Agile!
Notes from 2011 Cloud Connect Event Day 2 (#ccevent) March 10, 2011
Posted by Rob H in CAP Theorem, Cloud Connect, Clouds, Economics.Tags: cloud, CloudConnect, NoSQL, private cloud, public cloud, Reddit
add a comment
With the OpenStack launch behind me, I have some time to attend the Cloud Connect Event. I missed all the DevOps sessions, but was getting to geek out on the NoSQL & Big Data sessions. I jumped to the private cloud track (based on Twitter traffic) and was rewarded for the shift.
I’m surprised at how much focus this cloud conference is dedicated to private cloud. At other cloud conferences I’ve attended, the focus has been on learning how to use the cloud (specifically the public cloud). This is the first cloud show I’ve attended that has so much emphasis, dialog and vendor feeding around private. This was a suits & slacks show with few jeans, t-shirts, and pony tails. Perhaps private cloud is where the $$$ is being spent now?
It definitely feels like using cloud has become assumed, but the best practices and tools are just emerging.
The twitter #ccevent stream is interesting but temporal. I’m posting my raw (spelling optional) notes (below the more tag) because there is a lot of great content from the show to support and extend the twitter stream. I’ll try to italicize some of the better lines.
(more…)
“Flatness at the Edges” guides hyperscale cloud design January 29, 2011
Posted by Rob H in Architecture, Clouds, Events.Tags: cloud, edge computing, flat network, hyperscale, meetup, OpenStack, VLAN
8 comments
As I’m working on a larger “cloud bootstrapping” white paper (look for a pending Dell release), I stumbled on an apparent unifying principle for hyperscale cloud design. I’m interested in feedback about this concept to see if it fairly encapsulates a common target for cloud hardware, networking and software design.
“Flatness at the Edges” is one of the guiding principles of hyperscale cloud designs.
Flatness means that cloud infrastructure avoids creating tiers where possible. For example, having a blade in a frame aggregating networking that is connected to a SAN via a VLAN is a tiered design in which the components are vertically coupled. A single node with local disk connected directly to the switch has all the same components but in a single “flat” layer.
Edges are the bottom tier (or “leaves” to us CS geeks) of the cloud. Being flat creates a lot of edges because most of the components are self contained. To scale and reduce complexity, clouds must rely on the edges to make independent decisions such as how to route network traffic, where to replicate data, or when to throttle VMs. The anti-example of edge design is using VLANs to segment tenants because VLANs (a limited resource) require configuration at the switching tier to manage traffic generated by an edge component. We are effectively distributing an intelligence overhead tax on each component of the cloud rather than relying on a “centralized overcloud” to rule them all.
Combining flatness and edges evolves the sympathetic concepts into full-fledged cloud design principle.
Interested in discussing this face to face? I’ll presenting this and other cloud setup concepts that the SJC OpenStack meetup on 2/3.
Microsoft Azure Cloud – Top 20 Lessons Learned about MS’s PaaS December 13, 2010
Posted by Rob H in .NET, Architecture, Azure, Clouds.Tags: .NET, azure, cloud, Load Balanacer, PaaS, Partitioning, Walking Deployments
5 comments
Last week Dave McCrory (@McCrory) and I (@Zehicle) had the benefit of intensive Azure training at Microsoft HQ to support Dell’s Azure Stamp.
We’ve assembled a top 20 list of things to know about programming for Azure (and really any PaaS leaning cloud):
- If you want performance, optimize to reduce fees. Azure (and any cloud) is architected to penalize you if you use their resources poorly. The challenge is to fix this before your boss get the tab for your unenlightened design decisions.
- Coding .NET on Azure easy, architecting for Azure requires learning. Clouds put things in different places than you are used to and the rules are different. Expect a learning curve.
- Partitioning = parallelism. Learn to love partitions in all their forms, because your app will be throttled if you throw everything into a single partition! On the upside, each partition operates in parallel and even better, they usually don’t cost extra (SQL is the exception).
- Roles are flexible. You can run web servers (Apache, etc) on a worker and worker tasks on a web role. This is a good way to save some change since you pay per role instance. It’s counter to separation of concerns, but financially you should also combine workers into a single role where possible.
- Understand walking deployments. You can (and should) have simultaneous versions of the code operating against the same data so that you can roll upgrades (ala Timothy Fitz/Eric Ries) to reduce risk and without reducing performance. You should expect your data schema to simultaneously span mutiple code versions.
- Learn about Update Domains (UDs). Deployment domains allow rolling upgrades and changes to Applications and Services. They are part of how you partition your overall application. If you’re planning a VIP swap deployment, then you won’t care.
- Each service = ONE external IP. You can have many VMs backing each service (and multiple roles in a service) and Azure will load balance between them so you can scale out each service. Think of each service as a clonable entity: there will be at least 1 and more can be added if you want to scale.
- Understand between VIP and DIP. VIPs stand for Virtual IPs and are external, public, and metered. DIPs are internal, private, and load balanced. Azure provides an API to discover your DIPs – do not assume you know them because they are DYNAMIC IPs. Azure won’t let you see other DIPs inside the system.
- Azure has rich diagnostics, but beware. Azure leverages the existing diagnostics built into their system, but has to get the data off box since instances are volitile. This means that problems can be hard to isolate while excessive logging can impact performance and generate fees. Microsoft lets you target individual systems for elevated levels. You can also Terminal Server to a VM for troubleshooting (with caution).
- The new Azure admin console rocks. Take your pick between Silverlight or MMC Snap-in.
- Everything goes into Azure Storage. Learn to love it. Queues -> storage. Tables -> storage. Blobs -> storage. Logging -> storage. Code Repo -> storage. vDisk -> storage. SQL -> SQL (they march to their own drummer).
- Queues are essential, but tricky. Learn the meaning of idempotent because using queues requires you to handle failures and timeouts. The scary part is that it will work nicely until you exceed some limits and then you’ll experience cascading failure. Whee! Oh yea, and queues require polling (which stinks as a notification model).
- SQL Azure is
justmostly like MS SQL. Microsoft did a smart thing in keeping Cloud SQL so it was highly compatible with Local SQL. The biggest note is that limited in size of partition. If you embrace the size limits you will get better performance. So stop pushing BLOBs into databases and start sharding. - Duplicating data in tables will improve performance. This has to do with how partitions and keys operate but is an interesting architecture for NoSQL – stage data for use. Don’t be afraid to stage the same data in multiple ways. It may be faster/cheaper to write data twice if it becomes easier to find when you search it 1000s of times.
- Table data can be “warmed up.” Storage has logic that makes frequently accessed items faster (sort of like a cache
. If you can anticipate load spikes then you should warm the data just before the spike. - Storage billing is both amount and transactions. You can get burned on a small, but busy set of data. Note: you will pay even if you 404 a request.
- Azure has a CDN. Leveraging Microsoft’s Content Delivery Network (CDN) will improve performance for your users with small, low latency, high request items. You need to change your URLs for those assets. Best practice is to use some versioning in the URI so that you can force changes. Remember, CDN is SLOWER for the first hit when the data is not in cache so avoid CDN for low volume assets!
- Provisioning time is not instant. Azure needs anywhere from 1-3 minutes to spin a new instance of a role. Build this lag into your architecture and dynamic scale plans. New databases and partitions are fast.
- The VM Role is maintained by YOU. Using the VM role is a handy shortcut, but has a long list of gotcha’s. Some of note: 1) the VM can be “reset” to the last VM image state that you uploaded, 2) you are responsble for VM OS upgrades and patches, 3) VMs must be clonable because they will operate in parallel.
- Azure supports more than .NET. You can setup anything in a worker (and now VM) role, but there are nuances to doing this effectively. You really need to understand how Azure works and had better be ready to crack open Visual Studio for some things even if you’re writing in Java.
We hope this list helps you navigate Azure deployments. No matter what cloud you use, understanding Azure’s architecture will help you write better cloud scale applications.
We’d love to hear your suggestions and recommendations!
Mirrored on both blogs: Rob Hirschfeld’s Blog & Dave McCrory’s Blog






