Open Operations [4/4 series on Operating Open Source Infrastructure]

This post is the final in a 4 part series about Success factors for Operating Open Source Infrastructure.

tl;dr Note: This is really TWO tightly related posts: 
  part 1 is OpenOps background. 
  part 2 is about OpenStack, Tempest and DefCore.

One of the substantial challenges of large-scale deployments of open source software is that it is very difficult to come up with a best practice, or a reference implementation, that can be widely explained or described by the community.

Having a best practice deployment is essential for the growth of the community because it enables multiple people to deploy the software in a repeatable, stable way. This, in turn, fosters community growth so that more people can adopt software in a consistent way. It does little good if operators have no consistent pattern for deployment, because that undermines the developers’ abilities to extend, the testers’ abilities to ensure quality, and users’ ability to repeat the success of others.

Fundamentally, the goal of an open source project, from a user’s perspective, is that they can quickly achieve and repeat the success of other people in the community.

When we look at these large-scale projects we really try to create a pattern of success that can be repeated over and over again. This ensures growth of the user base, and it also helps the developer reduce time spent troubleshooting problems.

That does not mean that every single deployment should be identical, but there is substantial value in having a limited number of success patterns. Customers can then be assured not only of quick time to value with these projects, but also of getting help without asking everybody else in the community to untangle a site-specific configuration. This is especially problematic when someone creates an unnecessarily unique scenario, because it simply adds noise and confusion to the environment. Noise is a huge cost for the community, and it needs to be eliminated for an open source project to flourish.

This isn’t any different from proprietary software, but in that world most of these activities are hidden. A proprietary vendor can make much stronger recommendations and installation guidance because they are the only source of truth for that product. In an open source project, there are multiple sources of truth, and very few people are willing to publish their exact reference implementation or test patterns. Consequently, my team has taken a strong position on creating a repeatable reference implementation for OpenStack deployments, based on extensive testing. Our test patterns and practices are grounded in successful customer deployments on actual, physical infrastructure. So they are very pragmatic, repeatable, and sustainable.

We found that this type of testing, while expensive, also delivers significant value to our customers, and is something that they appreciate and have been willing to pay for.

OpenStack as an Example: Tempest for Reference Validation

The Crowbar project incorporated the OpenStack Tempest project as an essential part of every OpenStack deployment. From the earliest introduction of the Tempest suite, we have understood the value of a baselining test suite for OpenStack. We believe that using the same tests that developers use as a single-node gate for code acceptance against a multi-node deployment creates significant value both for our customers and for the OpenStack project as a whole.  This was part of why I embraced the suggestion of basing DefCore on tests.

While it is important to have developer tests that gate code check-ins, the ultimate goal for OpenStack is to create scale-out multi-node deployments. This is a fundamental design objective for OpenStack.

With developers and operators using the same test suite, we are able to proactively measure the success of the code in scale deployments in a way that provides quick feedback to developers. If Tempest tests do not pass in a multi-node environment, they are not giving developers the assurance that their code operates against best practice scenarios. Our objective is to continue to extend the Tempest suite so that it is an accurate reflection of the use cases encountered in a best practice reference deployment.
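
To make that feedback loop concrete, here is a minimal sketch of the operator side of the equation: re-run the same smoke tests that gate developer commits, but against a deployed multi-node cloud, and flag anything that regresses from a known-good baseline. The tempest invocation, the output parsing and the baseline file are illustrative assumptions, not our actual tooling.

  # Minimal sketch: re-run the developer-facing Tempest smoke tests against a
  # deployed multi-node cloud and diff the results against a saved baseline.
  # The CLI invocation and output format below are assumptions; adapt them to
  # the Tempest version you run.
  import json
  import subprocess

  BASELINE_FILE = "tempest_baseline.json"   # hypothetical results from a known-good reference site

  def run_smoke_tests():
      """Invoke Tempest and collect per-test pass/fail results."""
      proc = subprocess.run(["tempest", "run", "--smoke"],   # assumed CLI entry point
                            capture_output=True, text=True, check=False)
      results = {}
      for line in proc.stdout.splitlines():
          # Assumed line format: "<test_id> ... ok" or "<test_id> ... FAIL"
          if line.endswith("ok"):
              results[line.split()[0]] = "pass"
          elif line.endswith("FAIL"):
              results[line.split()[0]] = "fail"
      return results

  def regressions(results):
      """Tests that pass on the reference deployment but fail here."""
      with open(BASELINE_FILE) as handle:
          baseline = json.load(handle)
      return [test for test, status in results.items()
              if status == "fail" and baseline.get(test) == "pass"]

  if __name__ == "__main__":
      failed = regressions(run_smoke_tests())
      if failed:
          print("Deployment drifted from the reference baseline:")
          for test in failed:
              print("  regression:", test)
      else:
          print("Multi-node deployment matches the reference baseline.")

The parsing details are not the point; the point is that operators and developers are judged by the same tests, so a regression here is meaningful to both audiences.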

Along these lines, we expect that the community will continue to expand the Tempest test suite to match actual deployment scenarios reflected in scale and multi-node configurations. Having developers be responsible for passing these tests as part of their day-to-day activities ensures that development activities do not disrupt scale operations. Ultimately, making proactive gating tests ensures that we are creating scenarios in which code quality is continually increasing, as is our ability to respond and deploy the OpenStack infrastructure.

I am very excited and optimistic that expanding the Tempest suite holds the key to making OpenStack the most stable, reliable, and performant cloud implementation available in the market. The fact that this test suite can be extended in the community, and contributed to by a broad range of implementations, only makes it more valuable and more likely to fully encompass all the use cases necessary for reference implementations.

Networking in Cloud Environments, SDN, NFV, and why it matters [part 2 of 2]

Scott Jensen is an Engineering Director and colleague of mine from Dell with deep networking and operations experience.  He has first-hand experience deploying OpenStack and Hadoop and has a critical role in defining Dell’s Reference Architectures in those areas.  When I saw this writeup about cloud networking (first post), I asked if it would be OK to post it here and share it with you.

GUEST POST 2 OF 2 BY SCOTT JENSEN:

So what is different about Cloud and how does it impact the network?

In a traditional data center this was not all that difficult (relatively speaking).  You knew what was going to be running on what system (physically) and could plan your infrastructure accordingly.  The majority of the traffic moved in a North/South direction: basically, from outside the infrastructure (the internet, for example) to inside, with the response going back out.  You knew that if you had to design a communication channel from an application server to a database server, this could be isolated from the other traffic, as they did not usually reside on the same system.


Virtualization made this more difficult.  In this model you are sharing system resources across different applications.  From the network’s point of view, there are a large number of systems available behind a couple of links.  Live Migration puts another wrinkle in the design, as you now have to deal with a specific system moving from one physical server to another.  Network Virtualization helps out a lot with this: you can now move virtual ports from one physical server to another to ensure that when a virtual machine moves between physical servers, the network is still available.  In many cases you managed these virtual networks the same way you managed your physical network.  As a matter of fact, they were designed to emulate the physical as much as possible.  The virtual machines still looked a lot like the physical ones they replaced and can be treated in very much the same way from a traffic flow perspective.  The traffic is still primarily a North/South pattern.

Cloud, however, is a different ball of wax.  Think about the characteristics of the Cattle described above.  A cloud application is smaller and purpose built.  The majority of its traffic is between VMs, as the different tiers that were traditionally on the same system or in the same VM are now spread across multiple VMs.  Therefore its traffic patterns are primarily East/West.  You cannot forget that there is still a North/South pattern, the same as in the other models, which is typically user interaction.  The application is stateless so that many copies of itself can run in tandem, allowing it to elastically scale up and down based on need; as such, VMs are constantly appearing and disappearing from the network.  As these VMs are spawned on the system they may be right next to each other, on different servers, or potentially in different Data Centers.  But it gets even better.

Cloud architectures are typically multi-tenant.  This means that multiple customers will utilize this infrastructure and need to be isolated from each other.  And of course Clouds are self-service.  Users/developers can design, build and deploy whenever they want, including designing the network interconnects that their applications need to function.  All of this will cause overlapping IP address domains, multiple virtual networks at both L2 and L3, and requirements for dynamically configuring QoS, Load Balancers and Firewalls.  The last of our headaches is not the least: Cloud systems tend to breed like rabbits or multiply like coat hangers in the closet.  There are more and more systems as 10 servers become 40, which becomes 100, then 1000 and so on.

 

So what is a poor Network Engineer to do?

First, get a handle on what this Cloud thing is supposed to be for.  If you are one of the lucky ones who can dictate the use of the infrastructure then rock on!  Unfortunately, that does not seem to be the way it goes for many.  In the case where you just cannot predict how the infrastructure will be used, I am reminded of the phrase “there is no replacement for displacement”.  Fast links, non-blocking switches and Network Fabrics are all necessary for the physical network but will not get you there on their own.  Since, as a network administrator, you cannot predict the traffic patterns, who can?  Well, the developer and the application itself.  This is what SDN is all about.  It allows a programmatic interface to what is called an overlay network: a series of tunnels/flows which can build virtual networks on top of the physical network, giving that pesky application what it was looking for.  In some cases you may want to make changes to the physical infrastructure, for example change the configuration of the Firewall or Load Balancer or other network equipment.  SDN vendors are creating plug-ins that can make those types of configurations.  But if this is not good enough for you, there is NFV.  The basic idea here is: why have specialized hardware for your core network infrastructure when we can run those functions virtualized as well?  Let’s run them in VMs too, hook them into the virtual network, use SDN to configure them, and we can now virtualize the routers, load balancers, firewalls and switches.  These technologies are very much in a state of flux right now, but they are promising nonetheless.  Now if we could just virtualize the monitoring and troubleshooting of these environments I’d be happy.
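
To make the overlay idea a bit more concrete, here is a small sketch in which two tenants self-serve identical 10.0.0.0/24 networks on the same physical fabric and rely on the overlay for isolation.  It is written against the openstacksdk network calls as I understand them, so treat the exact signatures, the clouds.yaml entries and the tenant names as assumptions to verify in your own environment.

  # Illustrative sketch only: two tenants self-serve the same 10.0.0.0/24 CIDR
  # as isolated overlay networks. Uses the openstacksdk network proxy; verify
  # call signatures against your SDK version.
  import openstack

  def build_tenant_network(conn, tag):
      """Create an overlay network, a subnet, and a router for one tenant."""
      net = conn.network.create_network(name=f"{tag}-net")
      subnet = conn.network.create_subnet(
          name=f"{tag}-subnet",
          network_id=net.id,
          ip_version=4,
          cidr="10.0.0.0/24",   # identical for every tenant; the overlay keeps them isolated
      )
      router = conn.network.create_router(name=f"{tag}-router")
      conn.network.add_interface_to_router(router, subnet_id=subnet.id)
      return net, subnet, router

  if __name__ == "__main__":
      for tenant in ("tenant-a", "tenant-b"):
          # Assumes a clouds.yaml entry per tenant with that tenant's credentials.
          conn = openstack.connect(cloud=tenant)
          build_tenant_network(conn, tenant)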


Ops Validation using Development Tests [3/4 series on Operating Open Source Infrastructure]

This post is the third in a 4 part series about Success factors for Operating Open Source Infrastructure.

In an automated configuration deployment scenario, problems surface very quickly. They prevent deployment and force resolution before progress can be made. Unfortunately, many times this appears to be a failure within the deployment automation. My personal experience has been exactly the opposite: automation creates a “fail fast” environment in which critical issues are discovered and resolved during provisioning instead of sleeping until later.

Our ability to detect these issues and stop until they are resolved creates exactly the type of repeatable, successful deployment that is essential to long-term success. When we look at these deployments, the most important success factors are that the deployment is consistent, known and predictable. Our ability to quickly identify and resolve issues that do not match those patterns dramatically improves the long-term stability of the system by creating an environment that has been benchmarked against a known reference.

Benchmarking against a known reference is ultimately the most significant value that we can provide in helping customers bring up complex solutions such as OpenStack and Hadoop. Being successful with these deployments over the long term means that you have established a known configuration, and that you have maintained it in a way that is explainable and referenceable to other places.

Reference Implementation

The concept of a reference implementation provides tremendous value in deployment. Following a pattern that is a reference implementation enables you to compare notes, get help, and ultimately upgrade and change your deployment in known, predictable ways. Customers who can follow and implement a vendor’s reference, or the community’s reference implementation, are able to ask for help on the mailing lists, call in for help, and work with the community in ways that are consistent and predictable.

Let’s explore what a reference implementation looks like.

In a reference implementation you have a consistent, known state of your physical infrastructure that has been implemented based upon a reference architecture (RA). That implementation follows a known best practice using standard gear in a consistent, known configuration. You can therefore explain your configuration to a community of other developers, or other people who have a similar configuration, and can validate that your problem is not the physical configuration. Fundamentally, everything in a reference implementation is driving towards the elimination of possible failure causes. In this case, we are making sure that the physical infrastructure is not causing problems (getting to a ready state), because other people are using a similar (or identical) physical infrastructure configuration.

The next components of a reference implementation are the underlying software configurations: operating system, management, monitoring, network configuration, and the IP networking stacks that pretty much every component of the application rides on. There are a lot of moving parts and complexity in this scenario, with a high likelihood of causing failures. Implementing and deploying the software stacks in an automated way has enabled us to dramatically reduce the potential for problems caused by misconfiguration. Because the number of permutations of software in the reference stack is so high, it is essential that a successful deployment tightly manages exactly what is deployed, in such a way that operators can identify it, name it, and compare notes with other deployments.

Achieving Repeatable Deployments

In this case, our reference deployment consists of the exact composition of the operating system, infrastructure tooling, and capabilities for the deployment. By having a reference capability, we can ensure that we have the same components (see the sketch after this list):

  • Operating system
  • Monitoring
  • Configuration stacks
  • Security tooling
  • Patches
  • Network stack (including bridge, VLAN, and IP table configurations)
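
A minimal sketch of what “the same” can mean in practice: capture those components as a manifest and diff each site against the reference, so any drift is explicit and nameable. Every name and version below is an illustrative placeholder rather than our actual reference content.

  # Illustrative reference manifest; every value here is a placeholder.
  REFERENCE = {
      "operating_system": "ubuntu-12.04.3",
      "monitoring": "nagios-3.5",
      "configuration_stack": "chef-11.8",
      "security_tooling": "ossec-2.7",
      "patch_level": "2014-03",
      "network_stack": {"bridges": ["br-admin", "br-public"], "vlans": [100, 200]},
  }

  def drift(site_manifest, reference=REFERENCE):
      """Return the components where a site differs from the reference deployment."""
      return {key: (site_manifest.get(key), expected)
              for key, expected in reference.items()
              if site_manifest.get(key) != expected}

  # Example: one site applied a newer patch level than the reference.
  site = dict(REFERENCE, patch_level="2014-04")
  print(drift(site))   # {'patch_level': ('2014-04', '2014-03')}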

Each one of these components is a potential failure point in a deployment. By being able to configure and maintain that configuration automatically, we dramatically increase the opportunities for success by enabling customers to have a consistent configuration between sites.

Repeatable reference deployments enable customers to compare notes with Dell and with others in the community. It enables us to take what we have learned from one site and apply it to another. For example, if a new patch breaks functionality, then we can quickly determine what caused it. We can then fix the problem, add in the corresponding fix, and deploy it at that one site. If we are aware that 90% of our other sites have exactly the same configuration, those other sites can avoid a similar problem. In this way, having both a pattern and a practice for reference deployments enables the community to absorb change and respond much more quickly, and to be successful with a changing code base. We found that it is impractical to expect things not to change.

The only thing that we can do is build resiliency for change into these deployments. Creating an automated and tested referenceable deployment is the best way to cope with change.


OpenCrowbar.Anvil released – hammering out a gold standard in open bare metal provisioning

I’m excited to be announcing OpenCrowbar’s first release, Anvil, for the community.  Looking back on our original design from June 2012, we’ve accomplished all of our original objectives and more.
Now that we’ve got the foundation ready, our next release (OpenCrowbar Broom) focuses on workload development on top of the stable Anvil base.  This means that we’re ready to start working on OpenStack, Ceph and Hadoop.  So far, we’ve limited engagement on workloads to ensure that those developers would not also be trying to keep up with core changes.  We follow emergent design so I’m certain we’ll continue to evolve the core; however, we believe the Anvil release represents a solid foundation for workload development.
There is no more comprehensive open bare metal provisioning framework than OpenCrowbar.  The project’s focus on a complete operations model that comprehends hardware and network configuration with just enough orchestration delivers on a system vision that sets it apart from any other tool.  Yet, Crowbar also plays nicely with others by embracing, not replacing, DevOps tools like Chef and Puppet.
Now that the core is proven, we’re porting the Crowbar v1 RAID and BIOS configuration into OpenCrowbar.  By design, we’ve kept hardware support separate from the core because we’ve learned that hardware generation cycles need to be independent from the operations control infrastructure.  Decoupling them eliminates release disruptions that we experienced in Crowbar v1 and makes it much easier to incorporate hardware from a broad range of vendors.
Here are some key components of Anvil:
  • UI, CLI and API stable and functional
  • Boot and discovery process working PLUS ability to handle pre-populating and configuration
  • Chef and Puppet capabilities including Berkshelf v3 support to pull in community upstream DevOps scripts
  • Docker, VMs and Physical Servers
  • Crowbar’s famous “late-bound” approach to configuration and, critically, networking setup
  • IPv6 native, Ruby 2, Rails 4, preliminary scale tuning
  • Remarkably flexible and transparent orchestration (the Annealer)
  • Multi-OS deployment capability: Ubuntu, CentOS, or different versions of the same OS
Getting the workloads ported is still a tremendous amount of work but the rewards are tremendous.  With OpenCrowbar, the community has a new way to collaborate on and integrate this work.  It’s important to understand that while our goal is to start a quarterly release cycle for OpenCrowbar, the workload release cycles (including hardware) are NOT tied to OpenCrowbar.  The workloads choose which OpenCrowbar release they target.  From Crowbar v1, we learned that Crowbar needed to be independent of the workload releases, so we want OpenCrowbar to focus on maintaining a strong ops platform.
This release marks four years of hard-earned Crowbar v1 deployment experience and two years of v2 design, redesign and implementation.  I’ve talked with DevOps teams from all over the world and listened to their pains and needs.  We have a long way to go before we’re deploying 1000 node OpenStack and Hadoop clusters, but OpenCrowbar Anvil significantly moves the needle in that direction.
Thanks to the Crowbar community (Dell and SUSE especially) for nurturing the project, and congratulations to the OpenCrowbar team for getting us to this amazing place.

 

Reference Deployments are Critical [2/4 series on Operating Open Source Infrastructure]

This post is the second in a 4 part series about Success factors for Operating Open Source Infrastructure.

When we look at reference deployments, there are several things that make a reference deployment good and useful to the community.

First, a reference deployment needs to be specific and useful. It has to be identified as solving a specific problem using the software, and it has to have a specific configuration that can be described in a way that creates a workable scenario. There may be multiple useful reference implementations, and in that case each one needs to be identified by its expected behavior. For example, our deployments include a compute-centric configuration with hardware and network configurations adapted to compute-focused applications.

Our deployments also include a storage-focused configuration specifically targeted at enabling cheap-and-deep storage nodes for that type of situation. Both configurations are important and valid, but they require different implementations, different details and different reference architectures. As long as it is clear that there are multiple patterns, the community is perfectly able to absorb and use them.

Establishment of a widely adopted best practice is a central success criterion for any project.

Best practices ensure that deployers of the technology can not only purchase implementations that will be successful, but can also compare notes and work with their community. A significant adoption curve happens after the establishment of these best practices because at that point the risk of purchase drops dramatically, and the ability to support increases radically. The next thing that’s important in the establishment of these technologies is that the reference implementation or reference architecture has a way to be configured in a repeatable way.

Very often, this takes the form of deployment guides or manuals. While useful in small deployments, in a hyperscale deployment the books really have diminishing value. This is because the level of human error – the chance of making a fundamental mistake during configuration – increases exponentially with the number of nodes, because each node is tightly interconnected with other nodes within the system.
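
A rough back-of-the-envelope illustration of why written guides stop scaling: if each hand-configured node carries even a 1% chance of a fundamental mistake (an assumed figure, purely for illustration), the odds that the deployment contains at least one error climb very quickly with node count.

  # Assumed 1% chance of a manual misconfiguration per node (illustrative only).
  p_error = 0.01
  for nodes in (10, 100, 1000):
      at_least_one_mistake = 1 - (1 - p_error) ** nodes
      print(f"{nodes:>5} nodes: {at_least_one_mistake:.1%} chance of at least one mistake")
  # Roughly 10% at 10 nodes, 63% at 100 nodes, and near certainty at 1000 nodes.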

My team at Dell launched the Crowbar project as a way to reduce or mitigate this effort substantially. We recognized that the number one cause of delays and impacts in time to value in a hyperscale deployment is configuration and set-up. Any simple mistake made during configuration, even down to ordering of the gear, or physical defects within the infrastructure, will create dramatic delays in troubleshooting and diagnosing those issues. By automating the process, we have ensured that we can bootstrap the system quickly.

The goal of automated best practice is to bootstrap in a conforming and repeatable way. This enables the community to work together immediately towards return on investment, and greatly reduces the risk of problems caused by human error. For example, it’s typical within a site for us to find that network configurations do not match the specifications. In many cases, we find issues with the core networking infrastructure not matching the way it was originally designed. We also find failures on physical infrastructure, disk failures, system mismatches, and unanticipated configurations. Any one of these problems might be missed or overlooked in a human setup.
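
The sketch below shows the flavor of fail-fast checks that automation can run on every node before any software is laid down: does the NIC count match the reference, and is every expected interface actually up? The interface names and counts are illustrative assumptions, and this is not Crowbar’s internal logic.

  # Illustrative pre-flight checks run during automated bootstrap.
  import psutil   # third-party library used here to read interface facts

  EXPECTED_IFACES = ["eth0", "eth1", "eth2", "eth3"]   # assumed reference NIC layout

  def interface_problems():
      """Compare enumerated NICs against the reference expectations."""
      stats = psutil.net_if_stats()   # {name: stats with .isup, .speed, ...}
      problems = []
      for name in EXPECTED_IFACES:
          if name not in stats:
              problems.append(f"{name} not enumerated (ordering or hardware mismatch)")
          elif not stats[name].isup:
              problems.append(f"{name} is down (check cabling or switch configuration)")
      return problems

  if __name__ == "__main__":
      issues = interface_problems()
      if issues:
          raise SystemExit("fail fast: " + "; ".join(issues))
      print("node matches reference network expectations; continuing deployment")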

Validated reference architectures, while valuable, are no longer sufficient.   Automated reference configurations have become the key to successfully delivered solutions.

Interested in more?  Read part 3


Success Factors of Operating Open Source Infrastructure [Series Intro]

Building a best practices platform is essential to helping companies share operations knowledge.   In the fast-moving world of open source software, sharing documentation about what to do is not sufficient.  We must also share how to do it, because the operations process is tightly coupled to achieving ongoing success.

Further, since change is constant, we need to change our definition of “stability” to reflect a much more iterative and fluid environment.

Baseline testing is an essential part of this platform. It enables customers to ensure not only fast time to value, but also that the system consistently conforms with industry best practices, even as it is upgraded and migrates towards a continuous deployment infrastructure.

The details are too long for a single post so I’m going to explore this as three distinct topics over the next two weeks.

  1. Reference Deployments talks about needing an automated way to repeat configuration between sites.
  2. Ops Validation using Development Tests talks about having a way to verify that everyone uses a common reference platform.
  3. Shared Open Operations / DevOps (pending) talks about putting reference deployment and common validation together to create a true open operations practice.

OpenStack, Hadoop, Ceph, Docker and other open source projects are changing the landscape for information technology. Customers seeking to become successful with these evolving platforms must look beyond the software bits and consider both the culture and operations.  The culture is critical because interacting with the open source project’s community (directly or through a proxy) can help ensure success using the software. Operations are critical because open source projects expect the community to help find and resolve issues, which results in more robust and capable products. Consequently, users of open source software must operate in a more fluid environment.

My team at Dell saw this need as we navigated the early days of OpenStack.  The Crowbar project started because we saw that the community needed a platform that could adapt and evolve with the open source projects that our advanced customers were implementing. Our ability to deliver an open operations platform enables the community to collaborate, and to skip over routine details to refocus on shared best practices.

My recent focus on the OpenStack DefCore work reinforces these original goals.  Using tests to help provide a common baseline is a concrete, open and referenceable way to promote interoperability.  I hope that this in turn drives a dialog around best practices and shared operations because those help mature the community.

DevOps Concept: “Ready State” Infrastructure as hand-off milestone

Working for Dell, it’s no surprise that I have a lot of discussions around building up and maintaining the physical infrastructure to run data centers at scale.  Generally the context is around OpenCrowbar, Hadoop or OpenStack Ironic/TripleO/Heat, but the concerns are really universal in my cloud operations experience.

Three Teams

Typically, deployments have three distinct phases: 1) mechanically plug together the systems, 2) get the systems ready at the OS and network level, and then 3) install the application.  Often these phases are so distinct that they are handled by completely different teams!

That’s a problem because errors or unexpected changes from one phase are very expensive to address once you change teams.  The solution has been to become more and more prescriptive about what the system looks like between the second (“ready”) and third (“installed”) phase.  I’ve taken to calling this hand-off achieving a ready state infrastructure.

I define a “ready state” infrastructure as having been configured so that the application lay down steps are simple and predictable.

In my experience, most application deployment guides start with a ready state assumption.  They read like “Step 0: rack, configure, provision and tweak the nodes and network to have this specific starting configuration.”   If you are really lucky then “specific configuration” is actually a documented and validated reference architecture.

The magic of cloud IaaS is that it always creates ready state infrastructure.  If I request 10 servers with 2 NICs running Ubuntu 14.04 then that’s exactly what I get.  The fact that cloud always provisions a ready state infrastructure has become an essential operating assumption for cloud orchestration and configuration management.
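
One way to make that hand-off explicit, sketched here with assumed field names and values: the provisioning phase emits a ready state declaration per node, and the application phase refuses to begin its lay down steps until the declaration matches what it requires.

  # Illustrative "ready state" contract; field names and values are assumptions.
  from dataclasses import dataclass

  @dataclass(frozen=True)
  class ReadyState:
      os_image: str
      nic_count: int
      teamed_nics: bool
      raid_profile: str
      subnets: tuple

  REQUIRED = ReadyState(
      os_image="ubuntu-14.04",
      nic_count=2,
      teamed_nics=True,
      raid_profile="raid10-os",
      subnets=("10.10.0.0/24", "10.20.0.0/24"),
  )

  def hand_off(node_state: ReadyState) -> None:
      """Gate the application install on the ready state contract."""
      if node_state != REQUIRED:
          raise RuntimeError(f"not ready: got {node_state}, need {REQUIRED}")
      print("ready state confirmed; application lay-down can begin")

  # The provisioning phase would construct and publish this per node:
  hand_off(REQUIRED)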

Unfortunately, hardware provisioning is messy.  It takes significant effort to configure a physical system into a ready state.  This is caused by a number of factors:

  1. You can’t alter physical infrastructure with programming (an API) – for example, if the server enumerates the NICs differently than you expected, you have to adapt to that.
  2. You have to respect the physical topology of the system – for example, production deployments use teamed NICs that have to be connected to different switches for redundancy.  You can’t make assumptions; you have to set up the team based on the specific configuration.
  3. You have to build up the configuration in sequence – for example, you can’t setup the RAID configuration after the operating system is installed.  If you made a bad choice then you’ll likely have to repeat the whole sequence of the deployment and some bad choices (like using the wrong subnets) result in a total system rebuild.
  4. Hardware fails and is non-uniform – for example, in any order of sufficient size you will have NIC failures due to everything from simple mechanical card seating issues to BIOS interface mismatches.  Troubleshooting these issues can occupy significant time.
  5. Component configurations are interlocked – for example, a change to the switch settings could result in DHCP failures when systems are rebooted (real experience).  You cannot always work node-to-node, you must deal with the infrastructure as an integrated system.

Being consistent at turning discovered state into ready state is a complex and unique problem space.  As I explore this bare metal provisioning space in the community, I am more and more convinced that it has a distinct architecture from applications built for ready state operations.

My hope in this post is to test whether the concept of “ready state” infrastructure is helpful in describing the transition point between provisioning and installation.  Please let me know what you think!

Mayflies and Dinosaurs (extending Puppies and Cattle)

Josh McKenty and I were discussing the common misconception of the “Puppies and Cattle” analogy. His position is not anti-puppy! He believes puppies are sometimes unavoidable and should be isolated into portable containers (VMs) so they can be shuffled around seamlessly. His more provocative point is that we want our underlying infrastructure to be cattle so it remains highly elastic and flexible. More cattle means a more resilient system. To me, this is a fundamental CloudOps design objective.

We realized that the perfect cloud infrastructure would structurally discourage the creation of puppies.

Imagine a cloud in which servers were automatically decommissioned after a week of use. In a sort of anti-SLA, any VM running for more than 168 hours would be (gracefully) terminated. This would force a constant churn of resources within the infrastructure that enables true cattle-like management. This cloud would be able to very gracefully rebalance load and handle disruptive management operations because the workloads are designed for the churn.
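
A minimal sketch of that anti-SLA, written against a hypothetical cloud client rather than any specific SDK: walk the running servers and terminate anything older than 168 hours. A real implementation would drain and notify before terminating.

  # Hypothetical cloud client: assumed to expose list_servers() returning objects
  # with .id and a timezone-aware .created_at, plus delete_server(id).
  from datetime import datetime, timedelta, timezone

  MAX_AGE = timedelta(hours=168)   # one week

  def reap_mayflies(cloud):
      """Terminate any VM that has outlived the anti-SLA."""
      now = datetime.now(timezone.utc)
      for server in cloud.list_servers():
          age = now - server.created_at
          if age > MAX_AGE:
              print(f"terminating {server.id}: running for {age}")
              cloud.delete_server(server.id)   # a real reaper would drain gracefully

  # Run on a schedule (cron or a periodic task) against each project/tenant.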

We called these servers mayflies due to their limited life span.

While this approach requires a high degree of automation, the most successful cloud operators I have met are effectively building workloads with this requirement. If we require application workloads to be elastic and fault-resilient, then we have a much higher degree of flexibility with the underlying infrastructure. I’ve seen this in practice with several OpenStack clouds: operators who helped applications deploy using automation were able to decommission “old” clouds much more gracefully. They effectively turned their entire cloud into a cow. Sadly, the ones without that investment puppified™ the ops infrastructure and created a much more brittle environment.

The opposite of a mayfly is the dinosaur: a server that is so brittle and locked that the slightest disturbance wipes out everything it touches.

Dinosaurs are puppies grown into a T-Rex with rows of massive razor sharp teeth and tiny manicured hands. These are systems that are so unique and historical that there’s no way to recreate them if there’s a failure. The original maintainer’s exit happy hour was celebrated by people who were laid off two CEOs ago. The impact of dinosaurs goes beyond their operational risk; they are typically impossible to extend or maintain and, consequently, ossify other servers around them. This type of server drains elasticity from your ops team.

Puppies do not grow up to become dogs, they become dinosaurs.

It’s a classic lean adage to do hard things more frequently. Perhaps it’s time to start creating mayflies in your ops infrastructure.

5 differences between Cloud ops and Bare Metal ops

Cloud APIs are about abstracting operations to simplify deployment.  We want users of our cloud infrastructure to operate with blissful unawareness of the underlying networking topology, storage configuration and physical infrastructure.  From their perspective, the cloud is perfectly elastic, totally configurable and wonderfully consistent. Cloud Admins, on the other hand, need visibility and controls that expose the complexity while keeping it rational. These are profoundly different concerns.

Maintaining the illusion of clean and simple Cloud ops infrastructure is very valuable; however, it’s just an illusion.  The black metal box behind those APIs is complex, messy, unpredictable and dynamic.

1. Metal Ops has to deal with network topology and details like whether an operating system enumerates the NICs correctly, bonding the correct NIC pair, and which 10g network to use for the storage traffic. In networking, the topology determines how much traffic you can subscribe to a link and how to provide resiliency. Networking does not exist in isolation: you must consider the boundary firewalls and routers that block or allow traffic, because without connectivity the cloud is useless. Details like access and registration in DNS, NTP and DHCP provide the foundations of stable operations. These details are (and should be) hidden from the cloud user.

2. Metal Ops has to deal with firmware issues at every level.  It matters to the server if it boots into BIOS or UEFI mode.  We have to manage the fact that RAID partitions need to be optimized based on the workload and type of drive.  We have to consider if there are specialized drivers and caches to manage and security features (like Intel TXT) to activate.  These details are (and should be) hidden from the cloud user.

3. Metal Ops have to consider the security of their infrastructure.  We have to manage where the admin control network crosses security domains.  It matters which layer 2 networks have access to which parts of the infrastructure.  Separation of responsibility for network vs. storage vs. compute is a reality that is not going away. These details are (and should be) hidden from the cloud user.

4. Metal Ops have to manage operating system compatibility.  I know personally that vendors test and certify their operating systems on an enormous matrix of silicon.  I also have learned that the matrix of possible combinations is far larger and fundamentally impossible at the edges.  There’s a reason that operators seek homogeneous environments and LTS releases. These details are (and should be) hidden from the cloud user.

5. Metal Ops have to deal with hardware failures. By simple statistics, the larger the system the more things will break, and metal ops have to cope with this reality. We have to expose failure zones and boundaries to make intelligent responses (like moving data from a failed drive to a non-adjacent one) that require intimate knowledge of system topology that is intentionally hidden in cloud ops. Further, we have to have monitoring and management tooling that knows how to identify which NIC in a bond failed (see the sketch below) or flash the lights on the failed drive of an array. These details are (and should be) hidden from the cloud user.
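
As one small example of the tooling in point 5, the sketch below identifies which slave NIC in a Linux bond has failed by reading the kernel’s /proc/net/bonding report; the bond name is an assumption and the remediation (lights, tickets, replacement) is left out.

  # Identify failed slave NICs in a Linux bond via /proc/net/bonding (bond name assumed).
  def failed_bond_slaves(bond="bond0"):
      """Return slave interfaces whose MII status is not 'up'."""
      failed, current = [], None
      with open(f"/proc/net/bonding/{bond}") as report:
          for line in report:
              line = line.strip()
              if line.startswith("Slave Interface:"):
                  current = line.split(":", 1)[1].strip()
              elif line.startswith("MII Status:") and current:
                  if line.split(":", 1)[1].strip() != "up":
                      failed.append(current)
                  current = None
      return failed

  if __name__ == "__main__":
      for nic in failed_bond_slaves():
          print(f"bond0 slave {nic} is down; flag the port and drive a replacement")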

Cloud’s power is being able to abstract away this complexity.  Dealing with it gracefully behind the scenes requires transparency and details that make the Metal Ops job fundamentally different.

While both can be highly automated and pass my “Cloud is Infrastructure with an API” test, their objectives are different.

OpenStack Havana provides foundation for XXaaS you need

It’s been a long time, and a lot of summits, since I posted about how OpenStack was ready for workloads (back in Cactus!).  We’ve seen remarkable growth of both the platform technology and the community surrounding it.  So much growth that now we’re struggling to define “what is core” for the project, and I’m proud to be on the Foundation Board helping to lead that charge.

So what’s exciting in Havana?

There’s a lot I am excited about in the latest OpenStack release.

Complete Split of Compute / Storage / Network services

In the beginning, OpenStack IaaS was one service (Nova).  We’ve been breaking that monolith into distinct concerns (Compute, Network, Storage) for the last several releases, and I think Havana is the first release where all three of the services are robust enough to take production workloads.

This is a major milestone for OpenStack because knowledge that the APIs were changing inhibited adoption.

ENABLING TECH INTEGRATION: Docker & Ceph

We’ve been hanging out with the Ceph and Docker teams, so you can expect to see some interesting integrations.  These two disprove the fallacy that only OpenStack projects are critical to OpenStack, because neither of these technologies is moving under the official OpenStack umbrella.  I am looking forward to seeing both have dramatic impacts on how clouds are deployed.

Docker promises to make Linux Containers (LXC) more portable and easier to use.  This paravirtualization approach provides near bare metal performance without compromising VM portability.  More importantly, you can oversubscribe LXC much more than VMs.  This allows you to dramatically improve system utilization and unlocks some other interesting quality of service tricks.

Ceph is showing signs of becoming the scale-out storage king.  Beyond its solid data dispersion algorithm, a key aspect of its mojo is that it delivers both block and object storage.  I’ve seen a lot of interest in consolidating both types of storage into a single service.  Ceph delivers on that plus performance and cost.  It’s a real winner.

Crowbar Integration & High Availability Configuration!

We’ve been making amazing strides in the Crowbar + OpenStack integration!  As usual, we’re planning our zero day community build (on the “Roxy” branch) to get people started thinking about operationalizing OpenStack.   This is going to be especially interesting because we’re introducing it first on Crowbar 1 with plans to quickly migrate to Crowbar 2 where we can leverage the attribute injection pattern that OpenStack cookbooks also use.  Ultimately, we expect those efforts to converge.  The fact that Dell is putting reference implementations of HA deployment best practices into the open community is a major win for OpenStack.

Tests, Tests, Tests & Continuous Delivery

OpenStack continues to drive higher standards for reviews, integration and testing.  I’m especially excited by the volume of activity around our review system (although backlogs in reviews are challenges).  In addition, the community continues to invest in test suites like the Tempest project.  These are direct benefits to operators beyond simple code quality.  Our team uses Tempest to baseline field deployments.  This means that OpenStack test suites help validate live deployments, not just lab configurations.

We achieve a greater level of quality when we gate code check-ins on tests that matter to real deployments.   In fact, that premise is the basis for our “what is core” process.  It also means that more operators can choose to deploy OpenStack continuously from trunk (which I consider to be a best practice for scale ops).

Where did we fall short?

With growth comes challenges; Havana is the most complex release yet.  The number of projects that are part of the OpenStack integrated release family continues to expand.  While these new projects show the powerful innovation engine at work within OpenStack, they also make the project larger and more difficult to comprehend (especially for n00bs).  We continue to invest in Crowbar as a way to serve the community by making OpenStack more accessible and providing open best practices.

We are still struggling to resolve questions about interoperability (defining core should help) and portability.  We spent a lot of time at the last two summits on interoperability, but I don’t feel like we are much closer than before.  Hopefully, progress on Core will break the log jam.

Looking ahead to Ice House?

I and many leaders from Dell will be at the Ice House Summit in Hong Kong listening and learning.

The top of my list is the family of XXaaS services (Database aaS, Load Balancer aaS, Firewall aaS, etc.) that have appeared.  I’m a firm believer that clouds are more than compute+network+storage.  With a stable core, OpenStack is ready to expand into essential platform services.

If you are at the summit, please join Dell (my employer) and Intel for the OpenStack Summit Welcome Reception (RSVP!), a kickoff networking and social event, on Tuesday November 5, 2013 from 6:30 – 8:30pm at the SkyBistro in the SkyCity Marriott.   My teammate, Kamesh Pemmaraju, has a complete list of all the Dell panels and events.