Our Vision for Crowbar – taking steps towards closed loop operations

When Greg Althaus and I first proposed the project that would become Dell's Crowbar, we had already learned first-hand that there was a significant gap in both the technologies and the processes for scale operations. Our team at Dell saw that the successful cloud data centers were treating their deployments as integrated systems (now called DevOps) in which the configuration of many components was coordinated and orchestrated; however, in our opinion these approaches fell short of the mark. We wanted to create a truly integrated operational environment from the bare metal through the networking up to the applications and out to the operations tooling.

Our ultimate technical nirvana is to achieve closed-loop continuous deployments. We want to see applications that constantly optimize new code, deployment changes, quality, revenue and cost of operations. We could find parts of the foundation for this vision, but nothing complete or adequate.

The business driver for Crowbar is systems thinking around improved time to value and flexibility. While our technical vision is a long-term objective, we see very real short-term ROI. It does not matter if you are writing your own software or deploying applications; the faster you can move that code into production, the sooner you get value from innovation. It is clear to us that the most successful technology companies have reorganized around speed to market and adapting to the pace of change.

System flexibility and acceleration were key values when the lean manufacturing revolution gave Dell a competitive advantage, and they have proven even more critical in today's dynamic technology innovation climate.

We hope that this post helps define a vision for Crowbar beyond the upcoming refactoring. We started the project with the idea that new tools meant we could take operations to a new level.

While that’s a great objective, we’re too pragmatic in delivery to rest on a broad objective. Let’s take a look at Crowbar’s concrete strengths and growth areas.

Key strength areas for Crowbar

  1. Late binding – hardware and network configuration is held until software configuration is known.  This is a huge system concept.
  2. Dynamic and Integrated Networking – means that we treat networking as a 1st class citizen for ops (sort of like software defined networking but integrated into the application)
  3. System Perspective – no Application is an island.  You can’t optimize just the deployment, you need to consider hardware, software, networking and operations all together.
  4. Bootstrapping (bare metal) – while not “rocket science” it takes a lot of careful effort to get this right in a way that is meaningful in a continuous operations environment.
  5. Open Source / Open Development / Modular Design – this problem is simply too complex to solve alone.  We need to get a much broader net of environments and thinking involved.

Continuing Areas of Leadership

  1. Open / Lean / Incremental Architecture – these are core aspects of our approach.  While we have a vision, we also are very open to ways that solve problems faster and more elegantly than we’d expected.
  2. Continuous deployment – we think release cycles are getting faster and the only way to survive is to build change into the foundation of operations.
  3. Integrated networking – software defined networking is cool, but not enough.  We need to have semantics that link applications, networks and infrastructure together.
  4. Equivalent physical / virtual – we're not saying that you won't care whether it's physical or virtual (you should); we think that it should not impact your operations.
  5. Scale / Hybrid – the key element to scale is hybrid, and the key to hybrid is scale.  The missing connection is being able to close the loop.
  6. Closed loop deployment – treating load management, code quality, profit, and cost of operations as factors in managed operations.

Crowbar 2.0 Objectives: Scalable, Heterogeneous, Flexible and Connected

The seeds for Crowbar 2.0 have been in the 1.x code base for a while and were recently accelerated by SuSE.  With the Dell | Cloudera 4 Hadoop and Essex OpenStack-powered releases behind us, we will now be totally focused on bringing these seeds to fruition over the next two months.

Getting the core Crowbar 2.0 changes working is not a major refactoring effort in calendar time; however, it will impact current Crowbar developers by changing and improving the programming APIs. The Dell Crowbar team decided to treat this as a focused refactoring effort because several important changes are tightly coupled. We cannot solve them independently without causing a larger disruption.

All of the Crowbar 2.0 changes address issues and concerns raised in the community and are needed to support the expansion of our OpenStack and Hadoop application deployments.

Our technical objective for Crowbar 2.0 is to simplify and streamline development efforts as the development and user community grows. We are seeking to:

  1. simplify our use of Chef and eliminate Crowbar requirements in our Opscode Chef recipes.
    1. reduce the initial effort required to leverage Crowbar
    2. open Crowbar to a broader audience (see Upstreaming)
  2. provide heterogeneous / multiple operating system deployments. This enables:
    1. multiple versions of the same OS running for upgrades
    2. different operating systems running simultaneously (and dealing with heterogeneous packaging issues)
    3. accommodation of no-agent systems like locked systems (e.g.: virtualization hosts) and switches (aka external entities)
    4. UEFI booting in Sledgehammer
  3. strengthen networking abstractions
    1. allow networking configurations to be created dynamically (so that users are not locked into choices made before Crowbar deployment)
    2. better manage connected operations
    3. enable pull-from-source deployments that are ahead of (or forked from) available packages.
  4. improve Crowbar's core database and state machine to enable
    1. larger scale concerns
    2. controlled production migrations and upgrades
  5. other important items
    1. make documentation more closely coupled to current features and easier to maintain
    2. upgrade to Rails 3 to simplify code base, security and performance
    3. deepen automated test coverage and capabilities

Beyond these great technical targets, we want Crowbar 2.0 to address barriers to adoption that have been raised by our community, customers and partners. We have been tracking concerns about the learning curve for adding barclamps, the complexity of networking configuration, and packaging into a single ISO.

We will kick off the community part of this effort with an online review on 7/16 (details).

PS: why a refactoring?

My team at Dell does not take on any refactoring changes lightly because they are disruptive to our community; however, a convergence of requirements has made it necessary to update several core components simultaneously. Specifically, we found that desired changes in networking, operating systems, packaging, configuration management, scale and hardware support all required interlocked changes. We have been bringing many of these changes into the code base in preparation and have reached a point where the next steps require changing Crowbar 1.0 semantics.

We are first and foremost an incremental architecture & lean development team – Crowbar 2.0 will have the smallest footprint needed to begin the transformations that are currently blocking us. There is significant room during and after the refactor for the community to shape Crowbar.

Extending Chef’s reach: “Managed Nodes” for External Entities.

Note: this post is very technical and relates to detailed Chef design patterns used by Crowbar. I apologize in advance for the post’s opacity. Just unleash your inner DevOps geek and read on. I promise you’ll find some gems.

At the Opscode Community Summit, Dell's primary focus was creating an "External Entity" or "Managed Node" model. Matt Ray prefers the term "managed node," so I'll defer to that name for now. This model is needed for Crowbar to manage system components that cannot run an agent, such as a network switch, blade chassis, IP power distribution unit (PDU), or SAN array. The concept for a managed node is that there is an instance of the chef-client agent that can act as a delegate for the external entity. We've been reluctant to call it a "proxy" because that term is so overloaded.

My Crowbar vision is to manage an end-to-end cloud application life-cycle. This starts from power and network connections to hardware RAID and BIOS then up to the services that are installed on the node and ultimately reaches up to applications installed in VMs on those nodes.

Our design goal is that you can control a managed node with the same Chef semantics that we already use. For example, adding a Network proposal role to the Switch managed node will force the agent to update its configuration during the next chef-client run. During the run, the managed node will see that the network proposal has several VLANs configured in its attributes. The node will then update the actual switch entity to match the attributes.
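
To make this concrete, here is a minimal sketch of what the delegate's run could do. Everything in it (the attribute layout and the SwitchSession class) is a hypothetical stand-in, not actual Crowbar or Chef code:

    # Sketch of a delegate chef-client run for a switch managed node.
    # The attribute layout and SwitchSession are invented stand-ins;
    # the real Crowbar network barclamp schema differs.

    # VLANs carried by the network proposal attributes on the managed node.
    proposal_vlans = {
      200 => { name: "storage", tagged_ports: ["1/1", "1/2"] },
      666 => { name: "admin",   tagged_ports: ["1/3"] }
    }

    class SwitchSession
      # Stand-in for the real management channel (serial, SSH, or SNMP).
      def ensure_vlan(id, name, tagged_ports)
        puts "switch: vlan #{id} (#{name}) tagged on #{tagged_ports.join(', ')}"
      end
    end

    session = SwitchSession.new
    proposal_vlans.each do |id, cfg|
      # The agent's job: make the physical switch match the declaration.
      session.ensure_vlan(id, cfg[:name], cfg[:tagged_ports])
    end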

Design Considerations

There are five key aspects of our managed node design. They are configuration, discovery, location, relationships, and sequence. Let’s explore each in detail.

A managed node's configuration is different from a service or actuator pattern. The core concept of a node in Chef is that the node owns the configuration. You make changes to the node's configuration and it's the node's job to manage its state to maintain that configuration. In a service pattern, the consumer manages specific requests directly. At the summit (with apologies to Bill Clinton), I described Chef configuration as telling a node what it "is" while a service provides verbs that change a node. The critical difference is that a node is expected to maintain configuration as its composition changes (e.g.: node is now connected for VLAN 666) while a service responds to specific change requests (node adds tag for VLAN 666). Our goal is to maintain Chef's configuration management concept for the external entities.
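
Here's a minimal sketch of that difference, with every name invented for illustration:

    # Configuration pattern: the node owns a declaration of what it *is*,
    # and every agent run converges reality back toward that declaration.
    class ManagedSwitch
      attr_reader :vlans

      def initialize(vlans)
        @vlans = vlans                 # the switch's current, real state
      end

      def add_vlan(id)
        @vlans << id
        puts "switch: added vlan #{id}"
      end

      def remove_vlan(id)
        @vlans.delete(id)
        puts "switch: removed vlan #{id}"
      end
    end

    declared = [100, 666]              # "this node IS connected for VLAN 666"
    switch   = ManagedSwitch.new([100, 200])

    # Each run repairs drift - the hallmark of configuration management.
    (declared - switch.vlans).each { |id| switch.add_vlan(id) }
    (switch.vlans - declared).each { |id| switch.remove_vlan(id) }

    # A service pattern, by contrast, would be a one-shot verb with no
    # ongoing reconciliation, e.g.: switch_service.add_tag(666)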

Managed nodes also have a resource discovery concept that must align with the current ohai discovery model. Like a regular node, the managed node's data attributes reflect the state of the managed entity; consequently, we'd expect a blade chassis managed node to enumerate the blades that are included. This creates an expectation that the managed node appears to be "root" for the entity that it represents. We are also assuming that the Chef server can be trusted with the sharable discovered data. There may be cases where these assumptions do not have to be true, but we are making them for now.
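
For example, a blade chassis managed node might publish discovery data shaped something like this (the attribute names are illustrative, not a real Crowbar schema):

    # Hypothetical ohai-style discovery payload for a blade chassis
    # managed node; attribute names are invented for illustration.
    require 'json'

    discovered = {
      "chassis" => {
        "slots"  => 16,
        "blades" => [
          { "slot" => 1, "present" => true,  "service_tag" => "ABC123" },
          { "slot" => 2, "present" => false, "service_tag" => nil }
        ]
      }
    }

    puts JSON.pretty_generate(discovered)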

Another essential element of managed nodes is that their agent location matters because the external resource generally has restricted access. There are several examples of this requirement. Switch configuration may require a serial connection from a specific node. Blade SAN and PDU management ports are restricted to specific networks. This means that the managed node agents must run from a specific location. This location is not important to the Chef server or the nodes' actions against the managed node; however, it's critical for the system when starting the managed node agent. While it's possible for managed nodes to run on nodes that are outside the overall Chef infrastructure, our use cases make it more likely that they will run as independent processes on regular nodes. This means that we'll have to add some relationship information for managed nodes and perhaps a barclamp to install and manage managed nodes.

All of our use cases for managed nodes have a direct physical linkage between the managed node and server nodes. For a switch, it’s the ports connected. For a chassis, it’s the blades installed. For a SAN, it’s the LUNs exposed. These links imply a hierarchical graph that is not currently modeled in Chef data – in fact, it’s completely missing and difficult to maintain. At this time, it’s not clear how we or Opscode will address this. My current expectation is that we’ll use yet more roles to capture the relationships and add some hierarchical UI elements into Crowbar to help visualize it. We’ll also need to comprehend node types because “managed nodes” are too generic in our UI context.
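
As a sketch of that role-based workaround, here is one way the link graph could be flattened into role names (every name below is invented):

    # Sketch: flattening the entity-to-node graph into roles, since Chef
    # has no native hierarchy model. All names are invented examples.
    links = [
      { entity: "switch-rack1", attach: "port-1-4", node: "node-d00ffaa" },
      { entity: "chassis-a",    attach: "slot-1",   node: "node-d00ffbb" }
    ]

    links.each do |l|
      # One role per edge lets either side recover the relationship via
      # Chef search, at the cost of (yet more) roles to manage.
      puts "role[link_#{l[:entity]}_#{l[:attach]}_#{l[:node]}]"
    end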

Finally, we have to consider the sequence of actions between managed nodes and nodes.  In all of our use cases, the steps to bring up a node require orchestration with the managed node.  Specifically, there needs to be a hand-off between the managed node and the node.  For example, installing an application that uses VLANs does not work until the switch has created the VLAN.  The same challenges apply to LUNs on a SAN and blades in a chassis.  Crowbar provides orchestration that we can leverage, assuming we can declare the linkages.

For now, a hack to get started…

For now, we’ve started on a workable hack for managed nodes. This involves running multiple chef-clients on the admin server in their own paths & processes. We’ll also have to add yet more roles to comprehend the relationships between the managed nodes and the things that are connected to them. Watch the crowbar listserv for details!
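
As a minimal sketch, each delegate could get its own client.rb so the processes do not collide (all paths and names below are invented):

    # Hypothetical /etc/chef-managed/switch-rack1/client.rb for one
    # delegate chef-client process on the admin server. A separate
    # node_name, key, and set of paths keeps the clients independent.
    node_name        "managed-switch-rack1"
    client_key       "/etc/chef-managed/switch-rack1/client.pem"
    file_cache_path  "/var/cache/chef-managed/switch-rack1"
    file_backup_path "/var/backups/chef-managed/switch-rack1"
    log_location     "/var/log/chef-managed/switch-rack1.log"
    chef_server_url  "http://admin.crowbar.local:4000"

Each delegate would then run as its own process, along the lines of chef-client -c /etc/chef-managed/switch-rack1/client.rb -i 900.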

Extra Credit

Notes on the Opscode wiki from the Crowbar & Managed Node sessions

The 451 Group Cloudscape report strikes a chord but misses harmony (DevOps, Hybrid Cloud, Orchestration)

It’s impossible to resist posting about this month’s  451 Group Cloudscape report when it calls me out by name as a leading cloud innovator:

… ProTier founders Dave McCrory and Rob Hirschfeld. ProTier [note: now part of Quest] was, indeed, the first VMware ecosystem vendor to be tracked by The 451 Group. In the face of a skeptical world, these entrepreneurs argued that virtualization needed automation in order to realize its full potential, and that the test lab was the low-hanging fruit. Subsequent events have more than vindicated their view (pg. 33).

It’s even better when the report is worth reading and offers insights into forces shaping the industry.  It’s nice to be “more than vindicated” on an amazing journey we started over 10 years ago!

Rather than recite 451's points (hybrid cloud = automation + orchestration + devops + pixie dust), I'd rather look at the problem a different way as a counterpoint.

The problem is “how do we deal with applications that are scattered over multiple data centers?”

I do not think orchestration is the complete answer.  Current orchestration is too focused on moving around virtual machines (aka workloads).

Ultimately, the solution lies in application architecture; however, I feel that is also a misdirection because cloud is redefining what an “application architecture” means.

Applications are a dynamic mix of compute, storage, and connectivity.

We’re entering an age when all of these ingredients will be delivered as elastic services that will be managed by the applications themselves.  The concept of self management is an extension of DevOps principles that fuse application function and deployment. There are missing pieces, but I’m seeing the innovation moving to fill those gaps.

If you want to see the future of cloud applications then look at the network and storage services that are emerging.  They will tell you more about the future than orchestration.


Addressing OpenStack API equality: rethinking API via implementation over specification

This post is an extension of my post about OpenStack’s top 5 challenges from my perspective working on Dell’s OpenStack team.  This issue has been the subject of much public debate on the OpenStack lists, at the conference and online (Wilamuna & Shuttleworth).  So far, I have held back from the community discussion because I think both sides are being well represented by active contributors to trunk.

I have been, and remain, convinced that a key to OpenStack success as a cloud API is having a proven scale implementation; however, we need to find a compromise between implementation and specification.

The community needs code that specifies an API and offers an implementation backing that API.  OpenStack has been building an API by implementing the backing functionality (instead of vice versa).  This ordering is important because the best APIs are based on doing real work, not on trying to anticipate potential needs.  I'm also a fan of implementation-driven APIs because they emerge quickly.  That does not mean it's all wild west and cowboy coding.  OpenStack has the concept of API extensions; consequently, it's possible to offer new services in a safe way, allowing new features to literally surface overnight.
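
For instance, a client can check which extensions a deployment actually offers before relying on them. Here's a rough sketch; the host, port, API path and token are placeholders for whatever your deployment exposes:

    # Sketch: listing a deployment's API extensions before using them.
    # Host, port, API path, and token below are placeholders.
    require 'net/http'
    require 'json'

    uri = URI("http://nova.example.com:8774/v1.1/extensions")
    req = Net::HTTP::Get.new(uri)
    req["X-Auth-Token"] = "PLACEHOLDER_TOKEN"

    res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
    JSON.parse(res.body)["extensions"].each do |ext|
      puts "#{ext['alias']}: #{ext['name']}"
    end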

Unfortunately, API solely via implementation risks being fragile, unpredictable, and unfair. 

It is fragile because the scope is limited to what the implementors choose to support.  They often either expose internals or restrict capabilities in ways that leading with a design specification would not.  The same challenge makes the API less predictable, because it may change as the implementation progresses or become locked into exposing internals once downstream consumers depend on the exposed functions.

I am ready to forgive fragile and unpredictable, but unfair may prove to be expensive for the OpenStack community.

Here's the problem: there is more than one right way to solve a problem.  The benefit of an API is that it allows multiple implementations to emerge and prove their benefits.  We've seen many cases where, even with a weak API, a later implementation is the one that carries the standard (I'm thinking browsers here).  When your API is too bound to one implementation, it locks out alternatives that expand your market.  It makes it unfair to innovators who come to the party in the second wave.

I think that it is possible to find a COMPROMISE between API implementation and specification; however, it takes a lot of maturity to make that work.  First, it requires the implementors to slow down and write definitions outside of their implementation.  Second, it requires implementors to emotionally decouple from their code and accept other API implementations.  Of course, a dose of test driven design (TDD) and continuous integration (CI) discipline does not hurt either.

I’m interested in hearing more opinions about this… come to the 10/27 OpenStack meetup to discuss and/or please comment!

PaaS Simplified: an application architecture that responds to load


In addition to attending the great sessions at the OpenStack Design Conference, our Dell team realized that we've been making Platform as a Service (PaaS) much more complex than it needs to be.  Stripping away the detritus is important because "What is a PaaS" seems to change on a daily basis, so boiling it down to the most fundamental concept is essential.

At its core, a PaaS is an application that changes its architecture based on load.  That's it; no further definition is required.

I’ve been playing with this definition since April and am finding that it’s a much more productive definition of PaaS than any that I’ve used so far.  The reason is that it’s

  1. application focused,
  2. not language or services bound and
  3. captures the business use cases

Of course, I'm going to have to provide more backup in future posts.  I want to invite discussion about this perspective on PaaS.  I'm especially interested in seeing how recent offerings from VMware (OpenPaaS/CloudFoundry) or Amazon (Elastic Beanstalk) measure against this concept.

Bad Premise: Cloud Outages are *not* driving IT back to premises


I wrote this responding to Lauren Carlson's (Software Advice) blog post.  Lauren – I'd be more likely to agree with the statement that "SLAs are dead."  Here's why…

<soapbox>

Recent industry buzz about cloud service level agreements (SLAs) and reliability misses the core point about cloud.  Cloud is about agility, business models, consumerization of software and the merciless pursuit of efficiency.

The fact that Amazon EC2 built its base without an “enterprise” SLA is exhibit #1 that the IT world changed and it’s not going back.

Here are my reasons why IT pandoras can’t get cloud back into the box.

#1. Cloud has vastly superior network connectivity

The concept of your users accessing your applications from inside your firewall is so 2005.  Today's reality is that a significant amount of network access is externally routed, which means that applications need to live where they have excellent bandwidth to their users and to other applications.

#2. Cloud has elastic consumption of resources

Cloud is not less expensive infrastructure; it is mainly more flexible infrastructure.  If you're worried about an outage, then cloud is exactly the investment for you, because you can position a backup site at another location without having to pay for online resources.  It's much harder to take down a site that invests the time to design a system that dynamically reallocates load between sites.

#3. Cloud drives more robust architecture

The fact that cloud delivery is more opaque and modular without a five 9s SLA has driven a cloud application architecture revolution (see CAP).  We have shifted the app paradigm from robust scale up hardware to robust scale out software.  Also significant, DevOps innovations have made deployments repeatable and adaptable.

The only "logical" argument for pulling applications back from the cloud is to assert control over more of the delivery chain for your application.  It's the same reason that we think driving is safer than flying – we're the ones sitting behind the wheel when we drive.  News flash: driving is NOT safer than flying.

Cloud applications are not about hardware infrastructure, they are about SOFTWARE.  Perhaps one of the greatest disservices foisted on the market was saying cloud is synonymous with “Infrastructure as a Service” and “Virtualization.”  Cloud applications are powerful because we created ways that circumvent the limitations of IaaS and VMs!

</soapbox>

Not all APIs are equal: the power of API + implementation (OpenStack vs LibCloud vs DeltaCloud)


I’ve been getting a lot of questions about Apache LibCloud and RedHat’s DeltaCloud vs. OpenStack.  While all of these projects offer APIs, only OpenStack is based on an implementation.

Having an implementation means that the API is backed by code that delivers the functionality of the API.  An implementation-based API more closely reflects the actual workings of the system, while a "pure" API must abstract the workings of multiple systems.  The API-only approach ends up becoming a least common denominator instead of a vision of the pure use cases.

LibCloud and DeltaCloud are important and useful.  They provide abstractions that help developers write applications without being tied to a specific cloud vendor.  While lack of lock-in is a concrete benefit, it comes at a price.  The price is that the API shim cannot expose features that differentiate the platforms.  This may represent a significant loss of functionality or performance.

When developers implement directly against an implemented API, they can take advantage of the full feature set of their target cloud.  They can also test and verify more directly.  These are significant benefits that result in richer, more robust and faster to market products.

Both approaches have their place and are needed in the market.  If I needed to write against multiple clouds for portability, then LibCloud is a slam dunk.  If I needed rich features and an ecosystem, then OpenStack or Amazon are better choices.

30-second rule to measure complexity + 6 hyperscale network design rules

If you’ve studied computer science then you know there are algorithms that calculate “complexity.” Unfortunately, these have little practical use for data center operators.  My complexity rule does not require a PhD:

The 30-second rule: if it takes more than 30 seconds to pick out what would be impacted by a device failure, then your design is too complex.

6 Hyperscale Network Design Rules

  1. Cost Matters
  2. Keep Networks Flat
  3. Filter at the Edge
  4. Design Fault Zones
  5. Plan for Local Traffic
  6. Offer load balancers (to your users)

Sorry for the teaser… I’ll be able to release more substance behind this list soon.   Until then comments are (as always) welcome!