Josh McKenty and I were discussing the common misconception of the “Puppies and Cattle” analogy. His position is not anti-puppy! He believes puppies are sometimes unavoidable and should be isolated into portable containers (VMs) so they can be shuffled around seamlessly. His more provocative point is that we want our underlying infrastructure to be cattle so it remains highly elastic and flexible. More cattle means a more resilient system. To me, this is a fundamental CloudOps design objective.
We realized that the perfect cloud infrastructure would structurally discourage the creation of puppies.
Imagine a cloud in which servers were automatically decommissioned after a week of use. In a sort of anti-SLA, any VM running for more than 168 hours would be (gracefully) terminated. This would force a constant churn of resources within the infrastructure that enables true cattle-like management. This cloud would be able to very gracefully rebalance load and handle disruptive management operations because the workloads are designed for the churn.
We called these servers mayflies due to their limited life span.
While this approach requires a high degree of automation, the most successful cloud operators I have met are effectively building workloads with this requirement. If we require application workloads to be elastic and fault-resilient then we have a much higher degree of flexibility with the underlying infrastructure. I’ve seen this in practice with several OpenStack clouds: operators with helped applications deploy using automation were able to decommission “old” clouds much more gracefully. They effectively turned their entire cloud into a cow. Sadly, the ones without that investment puppified™ the ops infrastructure and created a much more brittle environment.
The opposite of a mayfly is the dinosaur: a server that is so brittle and locked that the slightest disturbance wipes out everything it touches.
Dinosaurs are puppies grown into a T-Rex with rows of massive razor sharp teeth and tiny manicured hands. These are systems that are so unique and historical that there’s no way to recreate them if there’s a failure. The original maintainers exit happy hour was celebrated by people who were laid-off two CEOs ago. The impact of dinosaurs goes beyond their operational risk; they are typically impossible to extend or maintain and, consequently, ossify other server around them. This type of server drains elasticity from your ops team.
Puppies do not grow up to become dogs, they become dinosaurs.
The interoperability challenge was a major theme of the Havana Summit in Portland last week (panel I moderated) . Solving it creates significant benefits for the OpenStack community. These benefits have significant financial opportunities for the OpenStack ecosystem.
This is a journey that we are on together – it’s not a deliverable from a single company or a release that we will complete and move on.
During the session, I think we did a good job stating how we can use Heat for an RA to make incremental steps. and I had a session about upgrade (slides).
Even with all this progress, Testing for interoperability was one of the largest gaps.
The challenge is not if we should test, but how to create a set of tests that everyone will accept as adequate. Approach that goal with standardization or specification objective is likely an impossible challenge.
We should question the assumption that faithful implementation test specifications (FITS) for interoperability are only useful with a matching specification and significant API coverage. Any level of coverage provides useful information and, more importantly, visibility accelerates contributions to the test base.
I can speak from experience that this approach has merit. The Crowbar team at Dell has been including OpenStack Tempest as part of our reference deployment since Essex and it runs as part of our automated test infrastructure against every build. This process does not catch every issue, but passing Tempest is a very good indication that you’ve got the a workable OpenStack deployment.
The OpenStack Board spent several hours (yes, hours) discussing interoperability related topics at the last board meeting. Fundamentally, the community benefits when uses can operate easily across multiple OpenStack deployments (their own and/or public clouds).
Cloud interoperability: the ability to transfer workloads between systems without changes to the deployment operations management infrastructure.
This is NOT hybrid (which I defined as a workload transparently operating in multiple systems); however it is a prereq to achieve scalable hybrid operation.
Interoperability matters because the OpenStack value proposition is all about creating a common platform. IT World does a good job laying out the problem (note, I work for Dell). To create sites that can interoperate, we have to some serious lifting:
I could not be happier with the results Crowbar collaborators and my team at Dell achieved around the 1st Crowbar design summit. We had great discussions and even better participation.
The attendees represented major operating system vendors, configuration management companies, OpenStack hosting companies, OpenStack cloud software providers, OpenStack consultants, OpenStack private cloud users, and (of course) a major infrastructure provider. That’s a very complete cross-section of the cloud community.
I knew from the start that we had too little time and, thankfully, people were tolerant of my need to stop the discussions. In the end, we were able to cover all the planned topics. This was important because all these features are interlocked so discussions were iterative. I was impressed with the level of knowledge at the table and it drove deep discussion. Even so, there are still parts of Crowbar that are confusing (networking, late binding, orchestration, chef coupling) even to collaborators.
In typing up these notes, it becomes even more blindingly obvious that the core features for Crowbar 2 are highly interconnected. That’s no surprise technically; however, it will make the notes harder to follow because of knowledge bootstrapping. You need take time and grok the gestalt and surf the zeitgeist.
Collaboration Invitation: I wanted to remind readers that this summit was just the kick-off for a series of open weekly design (Tuesdays 10am CDT) and coordination (Thursdays 8am CDT) meetings. Everyone is welcome to join in those meetings – information is posted, recorded, folded, spindled and mutilated on the Crowbar 2 wiki page.
These notes are my reflection of the online etherpad notes that were made live during the meeting. I’ve grouped them by design topic.
We are refactoring Crowbar at this time because we have a collection of interconnected features that could not be decoupled
Some items (Database use, Rails3, documentation, process) are not for debate. They are core needs but require little design.
There are 5 key topics for the refactor: online mode, networking flexibility, OpenStack pull from source, heterogeneous/multi operating systems, being CDMB agnostic
Due to time limits, we have to stop discussions and continue them online.
We are hoping to align Crowbar 2 beta and OpenStack Folsom release.
Online / Connected Mode
Online mode is more than simply internet connectivity. It is the foundation of how Crowbar stages dependencies and components for deploy. It’s required for heterogeneous O/S, pull from source and it has dependencies on how we model networking so nodes can access resources.
We are thinking to use caching proxies to stage resources. This would allow isolated production environments and preserves the run everything from ISO without a connection (that is still a key requirement to us).
Suse’s Crowbar fork does not build an ISO, instead it relies on RPM packages for barclamps and their dependencies.
Pulling packages directly from the Internet has proven to be unreliable, this method cannot rely on that alone.
Install From Source
This feature is mainly focused on OpenStack, it could be applied more generally. The principals that we are looking at could be applied to any application were the source code is changing quickly (all of them?!). Hadoop is an obvious second candidate.
We spent some time reviewing the use-cases for this feature. While this appears to be very dev and pre-release focused, there are important applications for production. Specifically, we expect that scale customers will need to run ahead of or slightly adjacent to trunk due to patches or proprietary code. In both cases, it is important that users can deploy from their repository.
We discussed briefly our objective to pull configuration from upstream (not just OpenStack, but potentially any common cookbooks/modules). This topic is central to the CMDB agnostic discussion below.
The overall sentiment is that this could be a very powerful capability if we can manage to make it work. There is a substantial challenge in tracking dependencies – current RPMs and Debs do a good job of this and other configuration steps beyond just the bits. Replicating that functionality is the real obstacle.
CMDB agnostic (decoupling Chef)
This feature is confusing because we are not eliminating the need for a configuration management database (CMDB) tool like Chef, instead we are decoupling Crowbar from the a single CMDB to a pluggable model using an abstraction layer.
It was stressed that Crowbar does orchestration – we do not rely on convergence over multiple passes to get the configuration correct.
We had strong agreement that the modules should not be tightly coupled but did need a consistent way (API? Consistent namespace? Pixie dust?) to share data between each other. Our priority is to maintain loose coupling and follow integration by convention and best practices rather than rigid structures.
The abstraction layer needs to have both import and export functions
Crowbar will use attribute injection so that Cookbooks can leverage Crowbar but will not require Crowbar to operate. Crowbar’s database will provide the links between the nodes instead of having to wedge it into the CMDB.
In 1.x, the networking was the most coupled into Chef. This is a major part of the refactor and modeling for Crowbar’s database.
There are a lot of notes captured about this on the etherpad – I recommend reviewing them
Heterogeneous OS (bare metal provisioning and beyond)
This topic was the most divergent of all our topics because most of the participants were using some variant of their own bare metal provisioning project (check the etherpad for the list).
Since we can’t pack an unlimited set of stuff on the ISO, this feature requires online mode.
Most of these projects do nothing beyond OS provisioning; however, their simplicity is beneficial. Crowbar needs to consider users who just want a stream-lined OS provisioning experience.
Given Crowbar‘s frenetic Freshman year, it’s impossible to predict everything that Crowbar could become. I certainly aspire to see the project gain a stronger developer community and the seeds of this transformation are sprouting. I also see that community driven work is positioning Crowbar to break beyond being platforms for OpenStack and Apache Hadoop solutions that pay the bills for my team at Dell to invest in Crowbar development.
I don’t have to look beyond the summer to see important development for Crowbar because of the substantial goals of the Crowbar 2.0 refactor.
Crowbar 2.0 is really just around the corner so I’d like to set some longer range goals for our next year.
Growing acceptance of Crowbar as an in data center extension for DevOps tools (what I call CloudOps)
Deeper integration into more operating environments beyond the core Linux flavors (like virtualization hosts, closed and special purpose operating systems.
Taking on production ops challenges of scale, high availability and migration
Formalization of our community engagement with summits, user groups, and broader developer contributions.
For example, Crowbar 2.0 will be able to handle downloading packages and applications from the internet. Online content is not a major benefit without being able to stage and control how those new packages are deployed; consequently, our goals remains tightly focused improvements in orchestration.
These changes create a foundation that enables a more dynamic operating environment. Ultimately, I see Crowbar driving towards a vision of fully integrated continuous operations; however, Greg & Rob’s Crowbar vision is the topic for tomorrow’s post.
When Greg Althaus and I first proposed the project that would become Dell’s Crowbar, we had already learned first-hand that there was a significant gap in both the technologies and the processes for scale operations. Our team at Dell saw that the successful cloud data centers were treating their deployments as integrated systems (now called DevOps) in which configuration of many components where coordinated and orchestrated; however, these approaches feel short of the mark in our opinion. We wanted to create a truly integrated operational environment from the bare metal through the networking up to the applications and out to the operations tooling.
Our ultimate technical nirvana is to achieve closed-loop continuous deployments. We want to see applications that constantly optimize new code, deployment changes, quality, revenue and cost of operations. We could find parts but not a complete adequate foundation for this vision.
The business driver for Crowbar is system thinking around improved time to value and flexibility. While our technical vision is a long-term objective, we see very real short-term ROI. It does not matter if you are writing your own software or deploying applications; the faster you can move that code into production the sooner you get value from innovation. It is clear to us that the most successful technology companies have reorganized around speed to market and adapting to pace of change.
System flexibility & acceleration were key values when lean manufacturing revolution gave Dell a competitive advantage and it has proven even more critical in today’s dynamic technology innovation climate.
We hope that this post helps define a vision for Crowbar beyond the upcoming refactoring. We started the project with the idea that new tools meant we could take operations to a new level.
While that’s a great objective, we’re too pragmatic in delivery to rest on a broad objective. Let’s take a look at Crowbar’s concrete strengths and growth areas.
Key strength areas for Crowbar
Late binding – hardware and network configuration is held until software configuration is known. This is a huge system concept.
Dynamic and Integrated Networking – means that we treat networking as a 1st class citizen for ops (sort of like software defined networking but integrated into the application)
System Perspective – no Application is an island. You can’t optimize just the deployment, you need to consider hardware, software, networking and operations all together.
Bootstrapping (bare metal) – while not “rocket science” it takes a lot of careful effort to get this right in a way that is meaningful in a continuous operations environment.
Open Source / Open Development / Modular Design – this problem is simply too complex to solve alone. We need to get a much broader net of environments and thinking involved.
Continuing Areas of Leadership
Open / Lean / Incremental Architecture – these are core aspects of our approach. While we have a vision, we also are very open to ways that solve problems faster and more elegantly than we’d expected.
Continuous deployment – we think the release cycles are getting faster and the only way to survive is the build change into the foundation of operations.
Integrated networking – software defined networking is cool, but not enough. We need to have semantics that link applications, networks and infrastructure together.
Equilivent physical / virtual – we’re not saying that you won’t care if it’s physical or virtual (you should), we think that it should not impact your operations.
Scale / Hybrid - the key element to hybrid is scale and to hybrid is scale. The missing connection is being able to close the loop.
Closed loop deployment – seeking load management, code quality, profit, and cost of operations as factor in managed operations.
Getting the core Crowbar 2.0 changes working is not a major refactoring effort in calendar time; however, it will impact current Crowbar developers by changing improving the programming APIs. The Dell Crowbar team decided to treat this as a focused refactoring effort because several important changes are tightly coupled. We cannot solve them independently without causing a larger disruption.
All of the Crowbar 2.0 changes address issues and concerns raised in the community and are needed to support expanding of our OpenStack and Hadoop application deployments.
Our technical objective for Crowbar 2.0 is to simplify and streamline development efforts as the development and user community grows. We are seeking to:
simplify our use of Chef and eliminate Crowbar requirements in our Opscode Chef recipes.
reduce the initial effort required to leverage Crowbar
provide heterogeneous / multiple operating system deployments. This enables:
multiple versions of the same OS running for upgrades
different operating systems operating simultaneously (and deal with heterogeneous packaging issues)
accommodation of no-agent systems like locked systems (e.g.: virtualization hosts) and switches (aka external entities)
UEFI booting in Sledgehammer
strengthen networking abstractions
allow networking configurations to be created dynamically (so that users are not locked into choices made before Crowbar deployment)
better manage connected operations
enable pull-from-source deployments that are ahead of (or forked from) available packages.
improvements in Crowbar’s core database and state machine to enable
larger scale concerns
controlled production migrations and upgrades
other important items
make documentation more coupled to current features and easier to maintain
upgrade to Rails 3 to simplify code base, security and performance
deepen automated test coverage and capabilities
Beyond these great technical targets, we want Crowbar 2.0 is to address barriers to adoption that have been raised by our community, customers and partners. We have been tracking concerns about the learning curve for adding barclamps, complexity of networking configuration and packaging into a single ISO.
My team at Dell does not take on any refactoring changes lightly because they are disruptive to our community; however, a convergence of requirements has made it necessary to update several core components simultaneously. Specifically, we found that desired changes in networking, operating systems, packaging, configuration management, scale and hardware support all required interlocked changes. We have been bringing many of these changes into the code base in preparation and have reached a point where the next steps require changing Crowbar 1.0 semantics.
We are first and foremost an incremental architecture & lean development team – Crowbar 2.0 will have the smallest footprint needed to begin the transformations that are currently blocking us. There is significant room during and after the refactor for the community to shape Crowbar.