Continuous Release combats disruptions of “Free Fall” development

Since I posted the “Free Fall” development post, I’ve been thinking a bit about the pros and cons of this type of off-release development.

The OpenStack Swift project does not do free fall because the team stays in a constant “ship ready” state and only loosely follows the broader OpenStack release track.  My team at Dell also has minimal free fall development because we have a more frequent release clock and choose to have the team focus together through dev/integrate/harden cycles as much as possible.

From a Lean/Agile/CI perspective, I would work to avoid hidden development where possible.  New features are introduced by split test (they are in the code, but not active for most users) so that all changes are incremental.  That means that refactoring, rearchitecture and new capabilities appear less disruptively.  While this approach appears to take more effort in the short term, my experience is that it accelerates delivery because we are less likely to over-develop code.
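
For illustration, here’s a minimal sketch of what a split-test gate can look like; the flag table, feature name, and scheduler functions are all hypothetical rather than taken from any real codebase:

    import hashlib

    # Fraction of users routed to the new code path for each feature.
    FLAGS = {"new_scheduler": 0.05}

    def is_enabled(feature, user_id):
        # Hash deterministically so each user sees a stable variant.
        digest = hashlib.md5(("%s:%s" % (feature, user_id)).encode()).hexdigest()
        return int(digest, 16) % 100 < FLAGS.get(feature, 0.0) * 100

    def legacy_scheduler(request):
        return "legacy:" + request

    def new_scheduler(request):
        return "new:" + request

    def handle_request(request, user_id):
        # The new code ships dark inside the release; the flag turns it
        # on incrementally instead of landing as one big block.
        if is_enabled("new_scheduler", user_id):
            return new_scheduler(request)
        return legacy_scheduler(request)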

Unfortunately, free fall development has the opposite effect.  Having code that appears in big blocks is contrary to best practices in my opinion.  Further, it rewards groups that work asynchronously.

While I think that OpenStack benefits from free fall work, I think that it is ultimately counter-productive.

Cloud Dev Laptop

This post is in response to multiple requests I’ve gotten from people outside of Dell.  My apologies if it is too commercial.  I work for Dell and we make hot laptops AND clouds.
When you’re building clouds (as opposed to cloud applications), you need heavy equipment.  So it’s no surprise that I use a Precision M6600 17″ laptop that is capable of running a complete multi-node cloud data center.
IMHO, here are the core requirements for a Cloud Builder laptop:
  1. SSDs (I have two 1/4 TB SSDs):  We are constantly building/installing operating systems.  These are high I/O activities, so SSDs are essential.  I’m constantly on the edge of no free space even with 1/2 TB.
  2. RAM (I have 32 GB): It’s normal for us to run multiple VMs.  If you RAM-starve your VMs (I used to have 16 GB), then they page fault and you’re back to constrained disk I/O.  We assign 4 GB RAM per VM because it’s just faster.
  3. Many Cores: VMs with 1 CPU = thread contention.  Adding RAM and disk can’t fix a threading issue.
  4. Bonus: I like a good keyboard and big display – I code, type & read a lot so the 17″ display helps.
For our devs, a normal cycle is write (desktop) -> build (in a VM) -> deploy (on additional VMs) -> full test; that last step requires >4 VMs (over 16 GB of RAM).  I don’t want to check in code until I complete that cycle.  On small RAM and a spinning HDD, that cycle takes >1 hour.  On my laptop it is <15 minutes!
There are only a few models of laptop that can pack that much power, and they demand a premium; however, the extra oomph translates into at least 3 or 4 more full cycles per day.  That’s a whole lot of extra productivity.

The real workloads begin: Crowbar’s Sophomore Year

Given Crowbar’s frenetic Freshman year, it’s impossible to predict everything that Crowbar could become. I certainly aspire to see the project gain a stronger developer community, and the seeds of this transformation are sprouting. I also see that community-driven work is positioning Crowbar to break beyond being the platform for the OpenStack and Apache Hadoop solutions that pay the bills for my team at Dell to invest in Crowbar development.

I don’t have to look beyond the summer to see important development for Crowbar because of the substantial goals of the Crowbar 2.0 refactor.

Crowbar 2.0 is really just around the corner so I’d like to set some longer range goals for our next year.

  • Growing acceptance of Crowbar as an in-data-center extension for DevOps tools (what I call CloudOps)
  • Deeper integration into more operating environments beyond the core Linux flavors (like virtualization hosts, closed and special-purpose operating systems)
  • Improvements in dynamic networking configuration
  • Enabling more online, network-connected operating modes
  • Taking on production ops challenges of scale, high availability and migration
  • Formalization of our community engagement with summits, user groups, and broader developer contributions.

For example, Crowbar 2.0 will be able to handle downloading packages and applications from the internet. Online content is not a major benefit without being able to stage and control how those new packages are deployed; consequently, our goal remains tightly focused on improvements in orchestration.
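
As a rough sketch of the stage-then-deploy idea (paths and function names are invented for illustration, not Crowbar’s actual code):

    import hashlib
    import os
    import urllib.request

    def fetch_to_staging(url, expected_sha256, staging_dir="/tmp/staging"):
        # Pull online content into a local staging area and verify it;
        # orchestration decides later when and where it gets deployed.
        data = urllib.request.urlopen(url).read()
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            raise ValueError("checksum mismatch; refusing to stage " + url)
        os.makedirs(staging_dir, exist_ok=True)
        path = os.path.join(staging_dir, url.rsplit("/", 1)[-1])
        with open(path, "wb") as f:
            f.write(data)
        return path  # deployment is a separate, controlled step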

These changes create a foundation that enables a more dynamic operating environment. Ultimately, I see Crowbar driving towards a vision of fully integrated continuous operations; however, Greg & Rob’s Crowbar vision is the topic for tomorrow’s post.

Our Vision for Crowbar – taking steps towards closed loop operations

When Greg Althaus and I first proposed the project that would become Dell’s Crowbar, we had already learned first-hand that there was a significant gap in both the technologies and the processes for scale operations. Our team at Dell saw that the successful cloud data centers were treating their deployments as integrated systems (now called DevOps) in which the configuration of many components was coordinated and orchestrated; however, these approaches fell short of the mark in our opinion. We wanted to create a truly integrated operational environment from the bare metal through the networking up to the applications and out to the operations tooling.

Our ultimate technical nirvana is to achieve closed-loop continuous deployments. We want to see applications that constantly optimize new code, deployment changes, quality, revenue and cost of operations. We could find parts, but not a complete and adequate foundation, for this vision.

The business driver for Crowbar is systems thinking around improved time to value and flexibility. While our technical vision is a long-term objective, we see very real short-term ROI. It does not matter if you are writing your own software or deploying applications; the faster you can move that code into production, the sooner you get value from innovation. It is clear to us that the most successful technology companies have reorganized around speed to market and adapting to the pace of change.

System flexibility and acceleration were key values when the lean manufacturing revolution gave Dell a competitive advantage, and they have proven even more critical in today’s dynamic technology innovation climate.

We hope that this post helps define a vision for Crowbar beyond the upcoming refactoring. We started the project with the idea that new tools meant we could take operations to a new level.

While that’s a great objective, we’re too pragmatic in delivery to rest on a broad objective. Let’s take a look at Crowbar’s concrete strengths and growth areas.

Key strength areas for Crowbar

  1. Late binding – hardware and network configuration is held until software configuration is known.  This is a huge system concept (see the sketch after this list).
  2. Dynamic and Integrated Networking – means that we treat networking as a 1st class citizen for ops (sort of like software defined networking but integrated into the application)
  3. System Perspective – no Application is an island.  You can’t optimize just the deployment, you need to consider hardware, software, networking and operations all together.
  4. Bootstrapping (bare metal) – while not “rocket science” it takes a lot of careful effort to get this right in a way that is meaningful in a continuous operations environment.
  5. Open Source / Open Development / Modular Design – this problem is simply too complex to solve alone.  We need to get a much broader net of environments and thinking involved.
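
To make late binding concrete, here’s an illustrative sketch (not Crowbar code) in which a node’s network plan is computed only once a software role is bound:

    class Node:
        def __init__(self, name):
            self.name = name
            self.role = None  # unknown at hardware discovery time

        def bind(self, role):
            self.role = role

        def network_plan(self):
            # Same physical box, different wiring, decided as late as possible.
            if self.role == "storage":
                return {"bond_mode": "balance-rr", "vlans": [200]}
            if self.role == "compute":
                return {"bond_mode": "802.3ad", "vlans": [100, 300]}
            raise RuntimeError("no role bound yet; keep hardware config open")

    node = Node("d00-17")
    node.bind("compute")
    print(node.network_plan())  # the wiring follows the software decision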

Continuing Areas of Leadership

  1. Open / Lean / Incremental Architecture – these are core aspects of our approach.  While we have a vision, we also are very open to ways that solve problems faster and more elegantly than we’d expected.
  2. Continuous deployment – we think the release cycles are getting faster and the only way to survive is to build change into the foundation of operations.
  3. Integrated networking – software defined networking is cool, but not enough.  We need to have semantics that link applications, networks and infrastructure together.
  4. Equivalent physical / virtual – we’re not saying that you won’t care if it’s physical or virtual (you should); we think that it should not impact your operations.
  5. Scale / Hybrid – the key element to hybrid is scale.  The missing connection is being able to close the loop.
  6. Closed loop deployment – treating load management, code quality, profit, and cost of operations as factors in managed operations.

Addressing OpenStack API equality: rethinking API via implementation over specification

This post is an extension of my post about OpenStack’s top 5 challenges from my perspective working on Dell’s OpenStack team.  This issue has been the subject of much public debate on the OpenStack lists, at the conference and online (Wilamuna & Shuttleworth).  So far, I have held back from the community discussion because I think both sides are being well represented by active contributors to trunk.

I have been, and remain, convinced that a key to OpenStack success as a cloud API is having a proven scale implementation; however, we need to find a compromise between implementation and specification.

The community needs code that specifies an API and offers an implementation backing that API.  OpenStack has been building an API by implementing the backing functionality (instead of vice versa).  This ordering is important because the best APIs are based on doing real work, not on trying to anticipate potential needs.  I’m also a fan of implementation-driven API design because it emerges quickly.  That does not mean it’s all wild west and cowboy coding.  OpenStack has the concept of API extensions; consequently, it’s possible to offer new services in a safe way, allowing new features to literally surface overnight.
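
To illustrate the extension idea, here’s a toy dispatch sketch; the route table, /ext prefix, and rate-limits example are invented for illustration and are not OpenStack’s actual extension mechanism:

    # A fixed core API plus namespaced extensions that can be registered
    # without touching the core routes.
    CORE_ROUTES = {"/servers": lambda req: {"servers": []}}
    EXTENSIONS = {}

    def register_extension(alias, path, handler):
        # Extensions live under a discoverable prefix so clients can
        # feature-test for them instead of assuming they exist.
        EXTENSIONS["/ext/" + alias + path] = handler

    def dispatch(path, req):
        handler = CORE_ROUTES.get(path) or EXTENSIONS.get(path)
        return handler(req) if handler else {"error": 404}

    # A new service surfaces "overnight" without changing the core API:
    register_extension("rate-limits", "/limits", lambda req: {"limits": {"rps": 10}})
    print(dispatch("/ext/rate-limits/limits", None))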

Unfortunately, an API defined solely via implementation risks being fragile, unpredictable, and unfair.

It is fragile because the scope reflects what the implementors choose to support.  They often either expose internals or restrict capabilities in ways that leading with a design specification would not.  This same challenge makes the API less predictable because it may change as the implementation progresses, or become locked into exposing internals once downstream consumers depend on the exposed functions.

I am ready to forgive fragile and unpredictable, but unfair may prove to be expensive for the OpenStack community.

Here’s the problem: there is more than one right way to solve a problem.  The benefit of an API is that it allows multiple implementations to emerge and prove their benefits.  We’ve seen many cases where, even with a weak API, a later implementation is the one that carries the standard (I’m thinking browsers here).  When your API is too tightly bound to its implementation, it locks out alternatives that would expand your market.  It makes it unfair to innovators who come to the party in the second wave.

I think that it is possible to find a COMPROMISE between API implementation and specification; however, it takes a lot of maturity to make that work.  First, it requires the implementors to slow down and write definitions outside of their implementation.  Second, it requires implementors to emotionally decouple from their code and accept other API implementations.  Of course, a dose of test-driven development (TDD) and continuous integration (CI) discipline does not hurt either.
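
As a toy sketch of that compromise, imagine declaring the contract as data, outside any implementation, and holding every backend to it (the schema here is invented for illustration):

    # The API contract lives apart from any implementation.
    SERVER_SPEC = {"id": str, "name": str, "status": str}

    def conforms(resource, spec):
        return (set(resource) == set(spec) and
                all(isinstance(resource[key], kind) for key, kind in spec.items()))

    def check_implementation(list_servers):
        # Reference and alternative implementations face the same bar.
        for server in list_servers():
            if not conforms(server, SERVER_SPEC):
                raise AssertionError("implementation drifted from the spec")

    # Any backend that returns spec-shaped resources passes:
    check_implementation(lambda: [{"id": "42", "name": "web1", "status": "ACTIVE"}])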

I’m interested in hearing more opinions about this… come to the 10/27 OpenStack meetup to discuss and/or please comment!

Crowbar’s surprise value proposition: continuous integration (#ci) testing

As part of our Agile/Lean methodologies, our team at Dell is highly invested in automated testing and continuous integration.  We’re running Jenkins to coordinate builds, and EVERY CHECK-IN launches our full integration suite that tests our system end-to-end.  It may not be typical, but I don’t consider that to be particularly noteworthy because it’s best practice.  (Rob’s note: if you write code and don’t think you have the authority, then you need to geek-up and just do it – that’s our MO at Dell)

It’s important to understand that since Crowbar is an installer, every check-in does a FULL CLEAN INSTALL of all the Cactus OpenStack components.  Our verification requires that we test OpenStack because that’s our #1 exit requirement.  Consequently, we have built an automated build system that does a continuous integration test of a full, multi-node Nova/Glance/Swift deployment.
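
In spirit, that gate looks like the sketch below; the script names are illustrative stand-ins for our actual Jenkins build steps:

    import subprocess
    import sys

    STEPS = [
        ["./build_crowbar_iso.sh"],                 # build from this check-in
        ["./install_admin_node.sh"],                # full clean install
        ["./provision_nodes.sh", "--count", "4"],   # PXE-boot the test cluster
        ["./smoke_test.sh", "nova", "glance", "swift"],  # exercise OpenStack
    ]

    for step in STEPS:
        if subprocess.call(step) != 0:
            sys.exit("integration gate failed at: " + " ".join(step))
    print("check-in verified end-to-end")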

Automated end-to-end integration tests of OpenStack are a very handy thing!

In the last few weeks, we’ve heard from Dell internal groups and partners who are contributing to OpenStack Diablo that they want to leverage our work in continuous integration.  This will allow them to make sure that their development work does not regress other functions.  It’s a significant opportunity to ensure that we can collaborate between organizations.  It also promotes early development and distribution of Diablo installation scripts.

To support this in Crowbar, we are already planning to incorporate more sophisticated revision control (likely based on Git) into Crowbar.

Note: YES, we consider our CI scripts to be part of our open source code.