5 things keeping DevOps from playing well with others (Chef, Crowbar and Upstream Patterns)

Sharing can be hardSince my earliest days on the OpenStack project, I’ve wanted to break the cycle on black box operations with open ops. With the rise of community driven DevOps platforms like Opscode Chef and Puppetlabs, we’ve reached a point where it’s both practical and imperative to share operational practices in the form of code and tooling.

Being open and collaborating are not the same thing.

It’s a huge win that we can compare OpenStack cookbooks. The real victory comes when multiple deployments use the same trunk instead of forking.

This has been an objective I’ve helped drive for OpenStack (with Matt Ray) and it has been the Crowbar objective from the start and is the keystone of our Crowbar 2 work.

This has proven to be a formidable challenge for several reasons:

  1. diverging DevOps patterns that can be used between private, public, large, small, and other deployments -> solution: attribute injection pattern is promising
  2. tooling gaps prevent operators from leveraging shared deployments -> solution: this is part of Crowbar’s mission
  3. under investing in community supporting features because they are seen as taking away from getting into production -> solution: need leadership and others with join
  4. drift between target versions creates the need for forking even if the cookbooks are fundamentally the same -> solution: pull from source approaches help create distro independent baselines
  5. missing reference architectures interfere with having a stable baseline to deploy against -> solution: agree to a standard, machine consumable RA format like OpenStack Heat.

Unfortunately, these five challenges are tightly coupled and we have to progress on them simultaneously. The tooling and community requires patterns and RAs.

The good news is that we are making real progress.

Judd Maltin (@newgoliath), a Crowbar team member, has documented the emerging Attribute Injection practice that Crowbar has been leading. That practice has been refined in the open by ATT and Rackspace. It is forming the foundation of the OpenStack cookbooks.

Understanding, discussing and supporting that pattern is an important step toward accelerating open operations. Please engage with us as we make the investments for open operations and help us implement the pattern.

The Atlantic magazine explains why Lean process rocks (and saves companies $$)

GearsI’m certain that the Atlantic‘s Charles Fishman was not thinking software and DevOps when he wrote the excellent article about “The Insourcing Boom.”  However, I strongly recommend reading this report for anyone who is interested in a practical example of the inefficiencies of software lean process (If you are impatient, jump to page 2 and search for toaster).

It’s important to realize that this article is not about software! It’s an article about industrial manufacturing and the impact that lean process has when you are making stuff.  It’s about how US companies are using Lean to make domestic plants more profitable than Asian ones.  It turns out that how you make something really matters – you can’t really optimize the system if you treat major parts like a black box.

When I talk about Agile and Lean, I am talking about proven processes being applied broadly to companies that want to make profit selling stuff. That’s what this article is about

If you are making software then you are making stuff! Your install and deploy process is your assembly line. Your unreleased code is your inventory.

This article does a good job explaining the benefits of being close to your manufacturing (DevOps) and being flexible in deployment (Agile) and being connected to customers (Lean).  The software industry often acts like it’s inventing everything from scratch. When it comes to manufacturing processes, we can learn a lot from industry.

Unlike software, industry has real costs for scrap and lost inventory. Instead of thinking “old school” perhaps we should be thinking of it as the school of hard knocks.

My Dilemma with Folsom – why I want to jump to G

When your ship sailsThese views are my own.  Based on 1×1 discussions I’ve had in the OpenStack community, I am not alone.

If you’ve read my blog then you know I am a vocal and active supporter of OpenStack who is seeking re-election to the OpenStack Board.  I’m personally and professionally committed to the project’s success. And, I’m confident that OpenStack’s collaborative community approach is out innovating other clouds.

A vibrant project requires that we reflect honestly: we have an equal measure of challenges: shadow free fall Dev, API vs implementation, forking risk and others.  As someone helping users deploy OpenStack today, I find my self straddling between a solid release (Essex) and a innovative one (Grizzly). Frankly, I’m finding it very difficult to focus on Folsom.

Grizzly excites me and clearly I’m not alone.  Based on pace of development, I believe we saw a significant developer migration during feature freeze free fall.

In Grizzly, both Cinder and Quantum will have progressed to a point where they are ready for mainstream consumption. That means that OpenStack will have achieved the cloud API trifecta of compute-store-network.

  • Cinder will get beyond the “replace Nova Volume” feature set and expands the list of connectors.
  • Quantum will get to parity with Nova Network, addresses overlapping VM IPs and goes beyond L2 with L3 feature enablement like  load balancing aaS.
  • We are having a real dialog about upgrades while the code is still in progress
  • And new projects like Celio and Heat are poised to address real use problems in billing and application formation.

Everything I hear about Folsom deployment is positive with stable code and significant improvements; however, we’re too late to really influence operability at the code level because the Folsom release is done.  This is not a new dilemma.  As operators, we seem to be forever chasing the tail of the release.

The perpetual cycle of implementing deployment after release is futile, exhausting and demoralizing because we finish just in time for the spotlight to shift to the next release.

I don’t want to slow the pace of releases.  In Agile/Lean, we believe that if something is hard then we do should it more.  Instead, I am looking at Grizzly and seeing an opportunity to break the cycle.  I am looking at Folsom and thinking that most people will be OK with Essex for a little longer.

Maybe I’m a dreamer, but if we can close the deployment time gap then we accelerate adoption, innovation and happy hour.  If that means jilting Folsom at the release altar to elope with Grizzly then I can live with that.

OpenStack Summit: Let’s talk DevOps, Fog, Upgrades, Crowbar & Dell

If you are coming to the OpenStack summit in San Diego next week then please find me at the show! I want to hear from you about the Foundation, community, OpenStack deployments, Crowbar and anything else.  Oh, and I just ordered a handful of Crowbar stickers if you wanted some CB bling.

Matt Ray (Opscode), Jason Cannavale (Rackspace) and I were Ops track co-chairs. If you have suggestions, we want to hear. We managed to get great speakers and also some interesting sessions like DevOps panel and up streaming deploy working sessions. It’s only on Monday and Tuesday, so don’t snooze or you’ll miss it.

My team from Dell has a lot going on, so there are lots of chances to connect with us:

At the Dell booth, Randy Perryman will be sharing field experience about hardware choices. We’ve got a lot of OpenStack battle experience and we want to compare notes with you.

I’m on the board meeting on Monday so likely occupied until the Mirantis party.

See you in San Diego!

PS: My team is hiring for Dev, QA and Marketing. Let me know if you want details.

Crowbar 2.0 Design Summit Notes (+ open weekly meetings starting)

I could not be happier with the results Crowbar collaborators and my team at Dell achieved around the 1st Crowbar design summit. We had great discussions and even better participation.

The attendees represented major operating system vendors, configuration management companies, OpenStack hosting companies, OpenStack cloud software providers, OpenStack consultants, OpenStack private cloud users, and (of course) a major infrastructure provider. That’s a very complete cross-section of the cloud community.

I knew from the start that we had too little time and, thankfully, people were tolerant of my need to stop the discussions. In the end, we were able to cover all the planned topics. This was important because all these features are interlocked so discussions were iterative. I was impressed with the level of knowledge at the table and it drove deep discussion. Even so, there are still parts of Crowbar that are confusing (networking, late binding, orchestration, chef coupling) even to collaborators.

In typing up these notes, it becomes even more blindingly obvious that the core features for Crowbar 2 are highly interconnected. That’s no surprise technically; however, it will make the notes harder to follow because of knowledge bootstrapping. You need take time and grok the gestalt and surf the zeitgeist.

Collaboration Invitation: I wanted to remind readers that this summit was just the kick-off for a series of open weekly design (Tuesdays 10am CDT) and coordination (Thursdays 8am CDT) meetings. Everyone is welcome to join in those meetings – information is posted, recorded, folded, spindled and mutilated on the Crowbar 2 wiki page.

These notes are my reflection of the online etherpad notes that were made live during the meeting. I’ve grouped them by design topic.

Introduction

  • Contributors need to sign CLAs
  • We are refactoring Crowbar at this time because we have a collection of interconnected features that could not be decoupled
  • Some items (Database use, Rails3, documentation, process) are not for debate. They are core needs but require little design.
  • There are 5 key topics for the refactor: online mode, networking flexibility, OpenStack pull from source, heterogeneous/multi operating systems, being CDMB agnostic
  • Due to time limits, we have to stop discussions and continue them online.
  • We are hoping to align Crowbar 2 beta and OpenStack Folsom release.

Online / Connected Mode

  • Online mode is more than simply internet connectivity. It is the foundation of how Crowbar stages dependencies and components for deploy. It’s required for heterogeneous O/S, pull from source and it has dependencies on how we model networking so nodes can access resources.
  • We are thinking to use caching proxies to stage resources. This would allow isolated production environments and preserves the run everything from ISO without a connection (that is still a key requirement to us).
  • Suse’s Crowbar fork does not build an ISO, instead it relies on RPM packages for barclamps and their dependencies.
  • Pulling packages directly from the Internet has proven to be unreliable, this method cannot rely on that alone.

Install From Source

  • This feature is mainly focused on OpenStack, it could be applied more generally. The principals that we are looking at could be applied to any application were the source code is changing quickly (all of them?!). Hadoop is an obvious second candidate.
  • We spent some time reviewing the use-cases for this feature. While this appears to be very dev and pre-release focused, there are important applications for production. Specifically, we expect that scale customers will need to run ahead of or slightly adjacent to trunk due to patches or proprietary code. In both cases, it is important that users can deploy from their repository.
  • We discussed briefly our objective to pull configuration from upstream (not just OpenStack, but potentially any common cookbooks/modules). This topic is central to the CMDB agnostic discussion below.
  • The overall sentiment is that this could be a very powerful capability if we can manage to make it work. There is a substantial challenge in tracking dependencies – current RPMs and Debs do a good job of this and other configuration steps beyond just the bits. Replicating that functionality is the real obstacle.

CMDB agnostic (decoupling Chef)

  • This feature is confusing because we are not eliminating the need for a configuration management database (CMDB) tool like Chef, instead we are decoupling Crowbar from the a single CMDB to a pluggable model using an abstraction layer.
  • It was stressed that Crowbar does orchestration – we do not rely on convergence over multiple passes to get the configuration correct.
  • We had strong agreement that the modules should not be tightly coupled but did need a consistent way (API? Consistent namespace? Pixie dust?) to share data between each other. Our priority is to maintain loose coupling and follow integration by convention and best practices rather than rigid structures.
  • The abstraction layer needs to have both import and export functions
  • Crowbar will use attribute injection so that Cookbooks can leverage Crowbar but will not require Crowbar to operate. Crowbar’s database will provide the links between the nodes instead of having to wedge it into the CMDB.
  • In 1.x, the networking was the most coupled into Chef. This is a major part of the refactor and modeling for Crowbar’s database.
  • There are a lot of notes captured about this on the etherpad – I recommend reviewing them

Heterogeneous OS (bare metal provisioning and beyond)

  • This topic was the most divergent of all our topics because most of the participants were using some variant of their own bare metal provisioning project (check the etherpad for the list).
  • Since we can’t pack an unlimited set of stuff on the ISO, this feature requires online mode.
  • Most of these projects do nothing beyond OS provisioning; however, their simplicity is beneficial. Crowbar needs to consider users who just want a stream-lined OS provisioning experience.
  • We discussed Crowbar’s late binding capability, but did not resolve how to reconcile that with these other projects.
  • Critical use cases to consider:
    • an API for provisioning (not sure if it needs to be more than the current one)
    • pick which Operating Systems go on which nodes (potentially with a rules engine?)
    • inventory capabilities of available nodes (like ohai and factor) into a database
    • inventory available operating systems

Ops “Late Binding” is critical best practice and key to Crowbar differentiation

From OSCON 2012: Portland's light rail understands good dev practice.Late binding is a programming term that I’ve commandeered for Crowbar’s DevOps design objectives.

We believe that late binding is a best practice for CloudOps.

Understanding this concept is turning out to be an important but confusing differentiation for Crowbar. We’ve effectively inverted the typical deploy pattern of building up a cloud from bare metal; instead, Crowbar allows you to build a cloud from the top down.  The difference is critical – we delay hardware decisions until we have the information needed to do the correct configuration.

If Late Binding is still confusing, the concept is really very simple: “we hold off all work until you’ve decided how you want to setup your cloud.”

Late binding arose from our design objectives. We started the project with a few critical operational design objectives:

  1. Treat the nodes and application layers as an interconnected system
  2. Realize that application choices should drive down the entire application stack including BIOS, RAID and networking
  3. Expect the entire system to be in a constantly changing so we must track state and avoid locked configurations.

We’d seen these objectives as core tenets in hyperscale operators who considered bare metal and network configuration to be an integral part of their application deployment. We know it is possible to build the system in layers that only (re)deploy once the application configuration is defined.

We have all this great interconnected automation! Why waste it by having to pre-stage the hardware or networking?

In cloud, late binding is known as “elastic computing” because you wait until you need resources to deploy. But running apps on cloud virtual machines is simple when compared to operating a physical infrastructure. In physical operations, RAID, BIOS and networking matter a lot because there are important and substantial variations between nodes. These differences are what drive late binding as a one of Crowbar’s core design principles.

Late Binding Visualized

Our Vision for Crowbar – taking steps towards closed loop operations

When Greg Althaus and I first proposed the project that would become Dell’s Crowbar, we had already learned first-hand that there was a significant gap in both the technologies and the processes for scale operations. Our team at Dell saw that the successful cloud data centers were treating their deployments as integrated systems (now called DevOps) in which configuration of many components where coordinated and orchestrated; however, these approaches feel short of the mark in our opinion. We wanted to create a truly integrated operational environment from the bare metal through the networking up to the applications and out to the operations tooling.

Our ultimate technical nirvana is to achieve closed-loop continuous deployments. We want to see applications that constantly optimize new code, deployment changes, quality, revenue and cost of operations. We could find parts but not a complete adequate foundation for this vision.

The business driver for Crowbar is system thinking around improved time to value and flexibility. While our technical vision is a long-term objective, we see very real short-term ROI. It does not matter if you are writing your own software or deploying applications; the faster you can move that code into production the sooner you get value from innovation. It is clear to us that the most successful technology companies have reorganized around speed to market and adapting to pace of change.

System flexibility & acceleration were key values when lean manufacturing revolution gave Dell a competitive advantage and it has proven even more critical in today’s dynamic technology innovation climate.

We hope that this post helps define a vision for Crowbar beyond the upcoming refactoring. We started the project with the idea that new tools meant we could take operations to a new level.

While that’s a great objective, we’re too pragmatic in delivery to rest on a broad objective. Let’s take a look at Crowbar’s concrete strengths and growth areas.

Key strength areas for Crowbar

  1. Late binding – hardware and network configuration is held until software configuration is known.  This is a huge system concept.
  2. Dynamic and Integrated Networking – means that we treat networking as a 1st class citizen for ops (sort of like software defined networking but integrated into the application)
  3. System Perspective – no Application is an island.  You can’t optimize just the deployment, you need to consider hardware, software, networking and operations all together.
  4. Bootstrapping (bare metal) – while not “rocket science” it takes a lot of careful effort to get this right in a way that is meaningful in a continuous operations environment.
  5. Open Source / Open Development / Modular Design – this problem is simply too complex to solve alone.  We need to get a much broader net of environments and thinking involved.

Continuing Areas of Leadership

  1. Open / Lean / Incremental Architecture – these are core aspects of our approach.  While we have a vision, we also are very open to ways that solve problems faster and more elegantly than we’d expected.
  2. Continuous deployment – we think the release cycles are getting faster and the only way to survive is the build change into the foundation of operations.
  3. Integrated networking – software defined networking is cool, but not enough.  We need to have semantics that link applications, networks and infrastructure together.
  4. Equilivent physical / virtual – we’re not saying that you won’t care if it’s physical or virtual (you should), we think that it should not impact your operations.
  5. Scale / Hybrid – the key element to hybrid is scale and to hybrid is scale.  The missing connection is being able to close the loop.
  6. Closed loop deployment – seeking load management, code quality, profit, and cost of operations as factor in managed operations.

What does “enable upstream recipes” mean? Not just fishing for community goodness!

One of the major Crowbar 2.0 design targets is to allow you to “upstream” operations scripts more easily.  “Upstream code” means that parts of Crowbar’s source code could be maintained in other open source repositories.  This is beyond a simple dependency (like Rails, Curl, Java or Apache): Upstreaming allows Crowbar can use code managed in the other open source repositories for more general application.  This is important because Crowbar users can leverage DevOps logic that is more broadly targeted than just Crowbar.  Even more importantly, upstreaming means that we can contribute and take advantage of community efforts to improve the upstream source.

Specifically, Crowbar maintains a set of OpenStack cookbooks that make up the core of our OpenStack deployment.  These scripts have been widely cloned (not forked) and deCrowbarized for other deployments.  Unfortunately, that means that we do not benefit from downstream improvements and the cloners cannot easily track our updates.  This happened because Crowbar was not considered a valid upstream OpenStack repository because our deployment scripts required Crowbar.  The consequence of this cloning is that incompatible OpenStack recipes have propagated like cracks in a windshield.

While there are concrete benefits to upstreaming, there are risks too.  We have to evaluate if the upstream code has been adequately tested, operates effectively, implements best practices and leverages Crowbar capabilities.  I believe strongly that untested deployment code is worse than useless; consequently, the Dell Crowbar team provides significant value by validating that our deployments work as an integrated system.  Even more importantly, we will not upstream from unmoderated sources where changes are accepted without regard for downstream impacts.  There is a significant amount of trust required for upstreaming to work.

If upstreaming is so good, why did we not start out with upstream code?  It was simply not an option at the time – Crowbar was the first (and is still!) most complete set of DevOps deployment scripts for OpenStack in a public repository.
By design, Crowbar 1.0 was tightly coupled to Opscode Chef and required users to inject Crowbar dependencies into their Chef Recipes.  This approach allowed us to more quickly integrate capabilities between recipes and with nascent Crowbar features.  Our top design requirement was that our deployment was tightly integrated between hardware, networking, operating system, operations infrastructure and the application.  Figuring out the correct place to separate concerns was impractical; consequently, we injected dependencies into our Chef code.
We have reached a point with Crowbar development that we can correctly decouple Crowbar and Chef.
The benefits to upstreaming go far beyond enabling more collaboration on OpenStack deployments.  These same changes make it easier for Crowbar to leverage community deployment scripts without one-way modifications.  If you have a working Chef Recipe then making it work with Crowbar will no longer require changes that break it outside of Crowbar; therefore, you can leverage Crowbar capabilities without losing community input and without being locked into Crowbar.

OSCON preso graphic about Upstreaming added 7/23:

With Dell ARM-based “Copper” servers, Crowbar footprint grows

One of my team at Dell’s most critical lessons from hyperscale cloud deployments was the DevOps tooling and operations processes are key to success.  Our crowbar project was born out of this realization.

I have been tracking the progress the Copper ARM-based server from design to implementation internally.  Now, I’m excited to see it getting some deserved attention.

The Copper platform is really cool because the cost, power, and density ratios of the nodes are unparalleled.  This makes it an ideal platform for distributed mixed compute/store workloads like Hadoop.  The nodes in the platform have excellent RAM/CPU/Spindle ratios.

While Copper is driving huge density, it also drives forward the same hyperscale challenges that we’ve been trying to address with Crowbar; consequently, we’re already working to ensure that we can deploy and manage Copper with Crowbar at scale.

Copper and Crowbar make a natural team and we’re excited to be part of today’s announcement:

Dell is staging clusters of the Dell “Copper” ARM server within the Dell Solution Centers and with TACC so developers may book time on the platforms. Dell also will deliver an ARM-supported version of Crowbar, Dell’s open-source management infrastructure software, to the industry in the future.

Congratulations to the Copper team!

Asia-Pac Session for OpenStack Essex Global Deploy day

I did not want us to neglect Asia-Pac for the upcoming OpenStack Deploy day, so I was delighted when Mike Pittaro offered to help host the online content for the last session. Mike is an OpenStack contributor who recently joined my team at Dell.

This addresses the concern that our first Essex hack day was America’s daytime only so it was difficult for time zones east of GMT to participate.

We are working with Dell teams in Asia-Pac to setup more information to support Japan, China, Korea and Australia.

This picture, taken by Dan Choquette (my team too!), is from Toyko DevOpsDays.