7 takeaways from DevOps Days Austin

I spent Tuesday and Wednesday at DevOpsDays Austin and continue to be impressed with the enthusiasm and collaborative nature of the DOD events.  We also managed to have a very robust and engaged Twitter backchannel thanks to an impressive pace set by Gene Kim!

I’ve still got a 5+ post backlog from the OpenStack summit, but wanted to do a quick post while it’s top of mind.

My takeaways from DevOpsDays Austin:

  1. DevOpsDays spends a lot of time talking about culture.  I’m a huge believer in the importance of culture as the foundation for the type of fundamental changes that we’re making in the IT industry; however, it’s also a sign that we’re still in the minority if we have to talk about culture evangelism.
  2. Process and DevOps are tightly coupled.  It’s very clear that Lean/Agile/Kanban are essential for DevOps success (nice job by Dominica DeGrandis).  No one even suggested DevOps+Waterfall as a joke (but Patrick Debois had a picture of a xeroxed butt in his preso which is pretty close).
  3. Still need more dev people to show up!  My feeling is that we’ve got a lot of operators who are engaging with developers and fewer developers who are engaging with operators (the “opsdev” people).
  4. Chef Omnibus installer is very compelling.  This approach addresses packaging issues that were created because we did not have configuration management.  Now that we have good tooling, we can separate the concerns between bits, configuration, services and dependencies.  This is one thing to watch and something I expect to see in Crowbar.
  5. The old mantra still holds: If something is hard, do it more often.
  6. Eli Goldratt’s The Goal is alive again thanks to Gene Kim’s smart new novel, The Phoenix Project, about DevOps and IT (I highly recommend both; start with Kim).
  7. Not DevOps, but 3D printing is awesome.  This is clearly a game-changing technology; however, it takes some effort to get right.  Dell brought a Solidoodle 3D printer to the event to try to print OpenStack & Crowbar logos (watch for this in the future).

I’d be interested in hearing what other people found interesting!  Please comment here and let me know.

Crowbar and our Pivot (or, how we slipped and shipped Grizzly)

My team at Dell uses Lean process because it forces us to be honest about making hard choices. Our recent decision to pivot back to Crowbar 1.x for the OpenStack Grizzly release is a great example of how the pivot process works.

4/24 note: I have a longer post and ISO for Grizzly on Crowbar waiting until we enter QA. The Crowbar community is already very active around this work and you’re encouraged to join.

Like any refactor, there was schedule risk when we started the Crowbar 2.x release. To mitigate this risk, we made two critical choices. First, we chose to advance the OpenStack barclamps on the 1.x code base in parallel with the 2.x work. Second, we set a pivot date by which the team would choose between releasing Grizzly on the 1.x or 2.x trunk.

Choosing to jump back to 1.x was one of the hardest choices I’ve made in my career. I’m proud that we had the foresight to keep that as an option and prouder that our team rallied to make it happen.

I acknowledge that 1.x has gaps; however, getting Grizzly into the field for PoCs and pilots with 1.x provides substantial benefits to the community.  In addition, there are barclamps for HA deployments and other production features that are under development on the 1.x branch and will be available in the community.

The 2.x code base provides important features, but we are building on the 1.x deployment recipes. This means that development, testing and tuning applied to the Grizzly barclamps will translate directly into Crowbar 2.x field readiness. In fact, more complete OpenStack coverage can dramatically simplify Crowbar 2.x testing efforts.  This is especially true for the OpenStack Networking (fka Quantum) barclamps because they are new work.

Delivering solutions is a balance between features, timing and field experience.  The Crowbar team’s preference is to collaborate with operators in the field and that means making workable software available quickly.

I hope that you’ll agree with our approach and help us make Grizzly the most deployable OpenStack yet.

OpenStack’s next hurdle: Interoperability. Why should you care?

SXSW life size Newton’s Cradle

The OpenStack Board spent several hours (yes, hours) discussing interoperability-related topics at the last board meeting.  Fundamentally, the community benefits when users can operate easily across multiple OpenStack deployments (their own and/or public clouds).

Cloud interoperability: the ability to transfer workloads between systems without changes to the deployment operations management infrastructure.

This is NOT hybrid (which I defined as a workload transparently operating in multiple systems); however, it is a prerequisite for achieving scalable hybrid operation.
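
To make the definition concrete, here’s a minimal sketch of the payoff. The CloudClient class and the endpoints are hypothetical stand-ins, not a real SDK: with true interoperability, the same deployment code runs unchanged against multiple clouds.

```python
# Hypothetical minimal client -- illustrative only, not a real OpenStack SDK.
class CloudClient:
    def __init__(self, endpoint, credentials):
        self.endpoint = endpoint
        self.credentials = credentials

    def boot_server(self, name, image, flavor):
        # A real client would POST to the compute API here.
        print(f"{self.endpoint}: booting {name} ({image}/{flavor})")

def deploy_workload(cloud):
    """Identical deployment logic for any interoperable cloud."""
    cloud.boot_server("web-01", image="ubuntu-12.04", flavor="m1.small")

# Interoperability means this function needs NO per-cloud changes:
private = CloudClient("https://openstack.internal:5000", "team-creds")
public = CloudClient("https://public.example.com:5000", "public-creds")
for cloud in (private, public):
    deploy_workload(cloud)
```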

Interoperability matters because the OpenStack value proposition is all about creating a common platform.  IT World does a good job laying out the problem (note, I work for Dell).  To create sites that can interoperate, we have to do some serious lifting.

At the OpenStack Summit, there are multiple chances to engage on this.  I’m moderating a panel about Interop and also sharing a session about the highly related topic of Reference Architectures with Monty Taylor.

The Interop Panel (topic description here) is Tuesday @ 5:20pm.  If you join, you’ll get to see me try to stump our awesome panelists:

  • Jonathan LaCour, DreamHost
  • Troy Toman, Rackspace
  • Bernard Golden, Enstratius
  • Monty Taylor, OpenStack Board (and HP)
  • Peter Pouliot, Microsoft

PS: Oh, and I’m also talking about DevOps Upgrades Patterns during the very first session (see a preview).

DevOps approaches to upgrade: Cube Visualization

I’m working on my OpenStack summit talk about DevOps upgrade patterns and got to a point where there are three major vectors to consider:

  1. Step Size (shown as X axis): do we make upgrades in small frequent steps or queue up changes into larger bundles? Larger steps mean that there are more changes to be accommodated simultaneously.
  2. Change Leader (shown as Y axis): do we upgrade the server or the client first? Regardless of the choice, the followers should be able to handle multiple protocol versions if we are going to have any hope of a reasonable upgrade.
  3. Safeness (shown as Z axis): do the changes preserve the data and productivity of the entity being upgraded? It is simpler to assume that we simply add new components and remove old ones; however, this approach carries significant risks or redundancy requirements.

I’m strongly biased towards continuous deployment because I think it reduces risk and increases agility; however, laying out all the vertices of the upgrade cube helps to visualize where the costs and risks are being added in the traditional upgrade models.

Breaking down each vertex (a short sketch after the list maps each one back to the three axes):

  1. Continuous Deploy – core infrastructure is updated on a regular (usually daily or faster) basis
  2. Protocol Driven – like changing to HTML5, the clients are tolerant to multiple protocols and changes take a long time to roll out
  3. Staged Upgrade – tightly coordinate migration between major versions over a short period of time in which all of the components in the system step from one version to the next together.
  4. Rolling Upgrade – system operates a small band of versions simultaneously where the components with the oldest versions are in process of being removed and their capacity replaced with new nodes using the latest versions.
  5. Parallel Operation – two server systems operate and clients choose when to migrate to the latest version.
  6. Protocol Stepping – rollout of clients that support multiple versions and then upgrade the server infrastructure only after all clients have achieved can support both versions.
  7. Forced Client Migration – change the server infrastructure and then force the clients to upgrade before they can reconnect.
  8. Big Bang – you have to shut down all components of the system to upgrade it
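
To make the cube concrete, here’s a small sketch that enumerates all eight corners. The axis values assigned to each vertex are my own reading of the descriptions above, so treat the table as illustrative rather than definitive.

```python
from collections import namedtuple

# Axis names follow the post; the values per vertex are my own reading
# of the descriptions above, so treat the assignments as illustrative.
Vertex = namedtuple("Vertex", "name step_size change_leader safeness")

CUBE = [
    Vertex("Continuous Deploy",       "small", "server", "preserving"),
    Vertex("Protocol Driven",         "small", "client", "preserving"),
    Vertex("Staged Upgrade",          "large", "server", "preserving"),
    Vertex("Rolling Upgrade",         "small", "server", "replace"),
    Vertex("Parallel Operation",      "large", "server", "replace"),
    Vertex("Protocol Stepping",       "large", "client", "preserving"),
    Vertex("Forced Client Migration", "small", "client", "replace"),
    Vertex("Big Bang",                "large", "client", "replace"),
]

# Sanity check: eight vertices should cover the eight corners exactly once.
assert len({(v.step_size, v.change_leader, v.safeness) for v in CUBE}) == 8

for v in CUBE:
    print(f"{v.name:24} X={v.step_size:5} Y={v.change_leader:6} Z={v.safeness}")
```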

This type of visualization helps me identify costs and options. It’s not likely to get much time in the final presentation so I’m hoping to hear in advance if it resonates with others.

PS: Like this visualization? Check out my “magic 8 cube” for cloud hosting options.

What foo is “contribution” to open source? Mik Kersten & Tasktop @ SXSW

How do we really know who has the most influence in a software project?  We can easily track code commits, but there are more bits to the project than the commits.

I had the good fortune to attend Mik Kersten’s Code Graph presentation at SXSW last week. Mik started the Eclipse Mylyn project and went on to found Tasktop. Both are built on the very intriguing concept that software development production (aka work) is organized around tasks.

His premise is that organizing around tasks provides a more manageable and actionable view of a project than more traditional application life-cycle management (ALM) approaches.  I’m a sucker for any presentation about lean development process that includes references to both DevOps and industrial engineering (I have an MS in IE), but Mik surprised me by taking his code graph concept to a whole ‘nutha level.

The software value chain is much deeper than just the people who write code. Mik’s approach included managers, testers and operators in the interaction graphs for his projects.

By including all of the ALM artifacts in the analysis, you get a much richer picture of the influencers for a project.

For example, the development manager may never show up as a code committer; however, they are hugely influential in which work gets prioritized. If your graph includes who is touching the work assignments and stories, then the manager’s influence jumps out immediately. That knowledge would completely change how, and with whom, you interact on a team. It effectively brings a shadow contributor into the light.

The same is true for QA members who are running tests and opening defects, and operators who are building deployment scripts. Ideally, it should include users who exercise different parts of the application’s capabilities.

Mik’s graphs clearly showed the influence impact of managers because they touched all of the story cards for the project.  The people who own the story cards are the most potent influencers in a project, yet they are invisible in code repositories!
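
As a toy illustration (the people and artifact counts below are made up), compare a commit-only ranking with one that counts every ALM artifact:

```python
from collections import Counter

# Hypothetical ALM events: (person, artifact_type). Commits are only
# one artifact type among many in the project's life-cycle.
events = [
    ("dev_a", "commit"), ("dev_a", "commit"), ("dev_b", "commit"),
    ("manager", "story"), ("manager", "story"), ("manager", "story"),
    ("qa", "defect"), ("qa", "test_run"), ("ops", "deploy_script"),
]

code_only = Counter(p for p, kind in events if kind == "commit")
all_artifacts = Counter(p for p, _ in events)

print("commit-only view: ", code_only.most_common())
print("whole-life-cycle: ", all_artifacts.most_common())
# The manager never commits, so they are invisible in the first view
# but the top influencer in the second -- a "shadow contributor".
```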

I would love to see an impact graph for a software project that equally reflected the wide range of contributions that people make to its life-cycle.  This type of information helps rebalance the power in a project.

Industrial engineering legend W. E. Deming‘s advice is to look at production as a system.  Finding ways to show everyone’s contributions is an important step towards bringing lean processes fully into software manufacturing.

5 things keeping DevOps from playing well with others (Chef, Crowbar and Upstream Patterns)

Since my earliest days on the OpenStack project, I’ve wanted to break the cycle of black box operations with open ops. With the rise of community-driven DevOps platforms like Opscode Chef and Puppet Labs, we’ve reached a point where it’s both practical and imperative to share operational practices in the form of code and tooling.

Being open and collaborating are not the same thing.

It’s a huge win that we can compare OpenStack cookbooks. The real victory comes when multiple deployments use the same trunk instead of forking.

This is an objective I’ve helped drive for OpenStack (with Matt Ray); it has been a Crowbar objective from the start and is the keystone of our Crowbar 2 work.

This has proven to be a formidable challenge for several reasons:

  1. diverging DevOps patterns across private, public, large, small, and other deployments -> solution: the attribute injection pattern is promising
  2. tooling gaps prevent operators from leveraging shared deployments -> solution: this is part of Crowbar’s mission
  3. under-investing in community-supporting features because they are seen as taking away from getting into production -> solution: this needs leadership, and others will join
  4. drift between target versions creates the need for forking even if the cookbooks are fundamentally the same -> solution: pull-from-source approaches help create distro-independent baselines
  5. missing reference architectures interfere with having a stable baseline to deploy against -> solution: agree to a standard, machine-consumable RA format like OpenStack Heat.

Unfortunately, these five challenges are tightly coupled and we have to progress on them simultaneously. The tooling and community require patterns and RAs.

The good news is that we are making real progress.

Judd Maltin (@newgoliath), a Crowbar team member, has documented the emerging Attribute Injection practice that Crowbar has been leading. That practice has been refined in the open by AT&T and Rackspace. It is forming the foundation of the OpenStack cookbooks.
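
I won’t restate Judd’s write-up here, but the rough shape of the pattern, as I understand it, looks something like this sketch (the function and attribute names are hypothetical): the orchestrator computes the topology once and injects concrete attributes into each node’s configuration, instead of every cookbook searching for its peers at converge time.

```python
# Sketch of the attribute-injection idea (hypothetical names; see Judd's
# write-up for the real practice). The orchestrator decides the topology
# centrally and pushes concrete attributes down, rather than cookbooks
# discovering their peers themselves at converge time.

def plan_topology(nodes):
    """Orchestrator picks roles centrally -- one source of truth."""
    return {"db_master": nodes[0], "api_nodes": nodes[1:]}

def inject_attributes(topology):
    """Build the per-node attributes a cookbook will consume verbatim."""
    configs = {}
    for node in topology["api_nodes"]:
        configs[node] = {
            # Injected: the cookbook does not search() for the database.
            "database_host": topology["db_master"],
            "run_list": ["role[openstack-api]"],
        }
    return configs

nodes = ["node1.example.com", "node2.example.com", "node3.example.com"]
for node, attrs in inject_attributes(plan_topology(nodes)).items():
    print(node, attrs)
```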

Understanding, discussing and supporting that pattern is an important step toward accelerating open operations. Please engage with us as we make the investments for open operations and help us implement the pattern.

Behavior Driven Development (BDD) and Crowbar

I’m a huge advocate of both behavior and test driven development (BDD & TDD). For the Crowbar 2 refactor, I’ve insisted (with community support) that new code has test coverage to the largest extent possible. While this added some up-front cost, it dramatically reduces the risk and effort for new developers to contribute.

For open source projects, they are even more important because they allow the community to contribute to the project with confidence.

A core part of this effort has been the Erlang BDD (“bravo delta”) tool that I had started before my team began Crowbar (code link).

I’m a big fan of BDD domain specific languages (DSL) because I think that they are descriptive. Ideally, everyone on the team including marketing and documentation authors should be able to understand and contribute to these tests.

I’ve been training our QA team on how the BDD system works and they are surprised at the clarity of the DSL. By reading the DSL for a feature, they can figure out what the developer had in mind for the system to do. Even better, they can see which use-cases the developer is already testing. Yet the real excitement comes from the potential to collaborate on the feature definitions before the code is written. My blue-sky-with-rainbows hope is that developers and testers will sit down together and review the BDD feature descriptions before code is written (perhaps even during planning). Even short of that nirvana, the BDD feature descriptions provide something that everyone can review and discuss, where code (even with verbose documentation) falls short.
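
To show the flavor of what I mean (sketched in Python rather than Erlang, with made-up steps – the real tool is linked above), the whole idea is that a readable feature maps onto small step functions:

```python
# Toy BDD runner sketch -- a Python stand-in for the Erlang tool, with
# hypothetical steps, just to show why the DSL reads well to non-developers.
feature = [
    "Given there is a node named admin",
    "When I go to the node list",
    "Then I should see admin",
]

world = {"nodes": [], "page": ""}

def step_given_node(name):    world["nodes"].append(name)
def step_when_node_list():    world["page"] = " ".join(world["nodes"])
def step_then_should_see(s):  assert s in world["page"], f"missing {s}"

STEPS = [
    ("Given there is a node named ", step_given_node),
    ("When I go to the node list",   lambda *a: step_when_node_list()),
    ("Then I should see ",           step_then_should_see),
]

for line in feature:
    for prefix, fn in STEPS:
        if line.startswith(prefix):
            arg = line[len(prefix):]
            fn(arg) if arg else fn()
            break

print("feature passed")
```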

Ok, so you already know the benefits of BDD. Why didn’t I do this in Cucumber? It’s the leading tool and a logical fit for a Rails project like Crowbar. Frankly, I have a love-hate relationship with Cucumber.

  1. It’s slow. And that does not scale for testing. I’m of the belief that slow tests destroy developer productivity because they encourage distractions.  Our BDD tool is fast, and it is not even optimized yet.
  2. Too coupled to the app framework – it lets you bypass the UI/API for testing if needed. If I’m doing behavior testing, then I want to make sure that everything I test is accessible to the user too.
  3. While Cucumber has a lot of good “webrat” steps to validate basic web pages, I found that these were very basic and I quickly had to write my own.
  4. Erlang pattern matching made it much easier to define steps in a logical way, with much less regex than Cucumber.
  5. Erlang is designed to let us parallelize the tests.
  6. I like programming in Erlang (and I had started BDD before I started Crowbar)

And it goes beyond just testing our code…

We ultimately want to use the BDD infrastructure to gate Crowbar deployments, not just code check-ins. That means that when Crowbar orchestrates an application deployment, the BDD infrastructure will execute tests to verify that the deployment is exhibiting the expected behaviors. If the installation does not pass, then Crowbar would roll back or hold the deployment.
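
In rough sketch form (the check names here are hypothetical; the real verification would execute the BDD feature suite), the gate logic is simple:

```python
# Sketch of BDD-gated deployment (hypothetical checks and hooks).
def run_behavior_checks(deployment):
    """Stand-in for executing the BDD feature suite against a live deploy."""
    checks = {
        "api_responds": deployment.get("api_up", False),
        "nodes_registered": deployment.get("node_count", 0) > 0,
    }
    return [name for name, ok in checks.items() if not ok]

def gated_deploy(deployment, rollback):
    failures = run_behavior_checks(deployment)
    if failures:
        print(f"gate FAILED ({failures}); rolling back")
        rollback()
    else:
        print("gate passed; deployment promoted")

gated_deploy({"api_up": True, "node_count": 3}, rollback=lambda: None)
gated_deploy({"api_up": False, "node_count": 3},
             rollback=lambda: print("rollback executed"))
```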

This objective is not new or unique – it’s modus operandi at advanced companies who practice continuous deployment. Our position is that this should be an integral part of the orchestration framework.

One side benefit of the BDD system as designed is that it is also a simulator. We are able to take the same core infrastructure and turn it into a load generator and database populator. Unlike more coupled tools, you can run these from anywhere.

Post Script: Here’s the topic that I’m submitting for presentation at OSCON

Don’t complicate my cloud! It’s just infrastructure with an API

I’ve been “in cloud” for over 13 years (@dmcrory and I submitted patents using it starting in 2001) and I’m continually amazed at how complicated people want to make it.

For my role at Dell, I’m continually invited to seasons of meetings to define cloud, cloud architecture and cloud strategy. The reason these meetings go on and on is that everyone wants to make cloud complicated when it’s really very simple.

Cloud is infrastructure with an API.

That’s it. Everything else is just a consequence of having infrastructure with an API, because an API provides remote control.
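
Put another way: if you can do something like the following against your infrastructure, you’ve got a cloud; if you have to file a ticket instead, you don’t. The endpoint and payload here are made up for illustration.

```python
import json
from urllib import request

# Hypothetical provisioning call -- the endpoint and payload are made up,
# but "cloud" reduces to exactly this: infrastructure driven by an API.
def boot_server(endpoint, token, spec):
    req = request.Request(
        f"{endpoint}/servers",
        data=json.dumps(spec).encode(),
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:   # remote control from anywhere
        return json.load(resp)

# boot_server("https://cloud.example.com/v2", "token", {"name": "web-01"})
```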

What else do people try to lump into cloud?  Here are some of my top cloud obfuscators:

  • (inter)network.  Yes, networks make an API interesting.  They are just an essential component but they are not cloud.  Most technologies are interesting because of networks: can we stop turning everything networked into cloud?  Thanks to nonsensical mega-dollar marketing campaigns, I despair this is a moot point.
  • as-a-service.  That’s another way of saying “accessible via an API.”  We have many flavors of Platform, Data, Application, Love or whatever as a Service.  That means they have an API.  Infrastructure as a Service (IaaS) is a cloud.
  • virtualization.  VMs were the first good example of hardware with an API; however, we had virtual containers (on Mainframes!) long before we had “cloud.”    They make cloud easier but they are not cloud.
  • pay-as-you-go (service pricing).  This is a common cloud requirement but it’s really a business model.  If someone builds a private cloud then it is still a cloud.
  • multi-tenant.  Another common requirement where we expect a cloud to be able to isolate users.  I agree that this is a highly desirable attribute of a good API implementation; however, it’s not essential to a cloud.  For example, most public clouds do not have a true network isolation model.
  • elastic demand.  IMHO, another word for API driven provisioning.
  • live migration.  This is a cool feature often implemented on top of virtualization, but it’s not cloud.  We were doing live migration with shared storage and clusters before anyone heard of cloud.  I don’t think this is cloud at all, but someone out there does, so I included it in the list.
  • security.  Totally important consideration and required for deployments large and small, but its presence or lack does not make something cloud.

We start talking about these points and then forget the whole API thing.  These items are important, but they do not make it “a cloud.”  When Dave McCrory and I first discussed API Infrastructure as “cloud,” it was driven by the fact that you could hide the actual infrastructure behind the API.  The critical concept was that the API allowed you to manage a server anywhere from anywhere.

When Amazon offered the first EC2 service, it had to be a cloud because the servers were remote. It was not a cloud because it was on the internet; plenty of other companies were offering hosted servers. It was a cloud because the offering required operators to use an API to interact with the infrastructure.  I remember EC2’s lack of UI (and SLA) causing many to predict it would be a failure; instead, it sparked a revolution.

I’m excited now because we’re entering a new generation of cloud where Infrastructure APIs include networking and storage in addition to compute.  Mix in some of the interesting data and network services and we’re going to have truly dynamic and powerful clouds.  More importantly, we’re going to have some truly amazing applications.

What do you think?  Is API a sufficient definition of cloud in your opinion?

PS: Yes, if you have a physical server/network/storage that is completely controllable by an API then you’ve got a cloud on your hands.

I respectfully disagree – we are totally aligned on your lack of understanding

Occasionally, my journeys into Agile and Lean process force me down to their foundation: cultural fit.  Frankly, there is nothing more central to the success of a team than culture. That’s especially true of Lean because of the humility and honesty required. If your team is not built on a foundation of trust and shared values then it’s impossible to keep having a listening and responsive dialog with our customers.

Successful teams have to be honest about taking negative feedback and you cannot do that without trust.

Trust is built on working out differences. Ideally, it would be as simple as “we agree” or “we disagree.” In an ideal world, every team would be that binary.  Remember that no team always agrees – it’s how we resolve those differences that makes the team successful.  That’s something we know as “diversity” and it’s like annealing steel to increase its strength.

Unfortunately, there are four modes of agreement and two are team poison.

  1. Yes: We agree! Let’s get to work!
  2. No: We disagree! Let’s figure out what’s different so that we’re stronger!
  3. Artificial Warfare:  We disagree!  While we are fundamentally aligned, everyone else thinks that the team does not have consensus and ignores the team’s decisions.  We also waste a lot of time talking instead of acting.
  4. Artificial Harmony: We agree!  But then we don’t support each other in getting the work done or staying aligned on the message.  We never spend time talking about the real issues so we constantly have to redo our actions.

I’ve never seen a team that is as simple as agree/disagree but I’ve been at companies (Surgient) that tried to build a culture to support trust and conflict resolution (based on Lencioni’s excellent 5 dysfunctions book).  However, there’s a major gap between a team that needs to build trust through healthy conflict and one that wraps itself in the dysfunctions of artificial harmony and warfare.

If you find yourself on a team with this problem then you’ll need management buy-in to fix it.  I have not seen it be a self-correcting problem.  I’d love to hear if you’ve gotten yourself healthy from a team with these issues.

Signs of artificial agreement syndrome include:

  1. Lack of broad participation – discussions are dominated by a few voices
  2. Discussions that always seem to run to the meta topic instead of the actual problem
  3. Issues are not resolved and come up over and over
  4. People are still upset after the meeting because issues have not been resolved
  5. People have different versions of events
  6. Lack of trust for some people to speak for the group
  7. Outcomes of decision making meetings are surprises
  8. Lack of results or missed commitments by the team

Lean Process’ strength is being Honest and Humble

Lean process and methodology is important to me because I think it is central to the work that we are doing in the community.  Even more, it’s changing how my team at Dell creates and delivers products for customers.

This post may be long, but my answer to “why Lean” ends up being very simple: Lean process is honest and humble.

I believe Lean process is more honest because it assumes a lack of knowledge.  It’s more “truthy” to admit there are a lot of things that we don’t know (we can’t know!) until we’ve started doing the work.  It’s very hard to admit we don’t have answers for things until we are further along, because we want to feel like experts and we want to lock in deliveries.
The “building software is like building a house” analogy is often used to claim that Lean lacks the design “blue prints” that other processes have.  The argument goes that builders need to understand how the entire house works – structural support, plumbing lines, electrical circuits and things like that.  However, if I was going to build a house I would still leave a lot of things to the last minute.  The process of building a house evolves so that the basic outlines of the structural elements are known early.  In a lot of cases, the position of rooms, the outlets, the air conditioning ducts, and many of the functional components – even windows and doors – are placed in the design but can easily be moved and changed as you go.  You can do a walk-through of a house after it’s been framed out and make all sorts of changes and adjustments.  As the design moves forward, things become more and more difficult to change: once you are building a brick façade, moving the windows within that façade is very difficult, while interior changes are not.  Likewise, I don’t want to order my counter-tops from the blue-prints – it’s much safer to order off actual measurements.
Software projects are also building projects. You build a façade, you build a structure, and within that structure you have a lot of flexibility. As you go, you make more decisions and your choices become more limited. But that is the nature of building.  For that reason, saying “we don’t know everything we want” is not just good practice, it is much more honest.
But honesty is not enough for a strong Lean process.  The need for humility in Lean architects and business people really stands out.  The Lean process is humble because it starts with the assumption that we don’t really understand the value, drivers, interests and features that make our product special.

We need very strong ideas and a vision; however, we need to be motivated by making something that is significant to other people.  They are the ones who give it value.
We have to give up the idea that we can convince someone that our idea will be significant to them – we have to show and collaborate instead.  The most important thing in building any project and taking any product to market is listening to the people who are using your product and understanding what their needs are.  Instead of telling them what they need, show them something interesting, interact with them and get their opinion.

Contrast that to the waterfall methodology, where the assumption is that we can put smart people in a room, have them figure out what the requirements are, build a team, get everything ready to go and then start executing.  That assumption seems highly optimized and very efficient, but it carries a huge amount of hubris.  The idea that we can sit down two years in advance of market need and identify what those features and capabilities should be seems outrageous to me in the current technology market.  It is so much harder to get that information correct up front and then execute on it than it is to start with a directional statement, begin, and then get feedback and interact – there is a world of difference between the two processes.

Ultimately, Lean process is about having requirements that are less defined or well-known.  It’s driven by giving respect to the people consuming the product: we can hear their ideas and their reactions, and their input can be evaluated and taken into account.  It’s about collaboration.
Humility is not just about listening and collecting feedback: it is about interacting and building relationships.

So just as our customers are building a relationship with our product, they are also building a relationship with the people creating that product. That relationship is what drives the product forward, what makes it a great product, and what gives you a strong and loyal customer base – rather than dictating, “This is what you wanted. Here it is. I hope you enjoy it.”

This is a completely different and powerful way of delivering product.  I believe that honesty and humility in a Lean process inherently create stronger products – ones that are both delivered faster and better suited to their markets.