OpenStack automated high-availability deploy reality, SUSE shows off chops with Crowbar

While I’ve been focused on delivering next-generation kick-aaS-i-ness with Crowbar v2 (now called OpenCrowbar) and helping Dell and Red Hat co-engineer an OpenStack Powered Cloud, SUSE has been continuing to expand and polish the OpenStack deployment on Crowbar v1.  I’m always impressed by their commit activity (SUSE is the top committer in the Crowbar project) and was excited to see their Havana launch announcement.

Using Crowbar v1, SUSE is delivering a seriously robust automated OpenStack Havana implementation.  They have taken the time to build high availability (HA) across the framework including for Neutron, Heat and Ceilometer.

As an OpenStack Foundation board member, I hear a lot of hand-wringing in the community about ops practices and people asking “is OpenStack ready for the enterprise?”  While I’m not sure how to really define “enterprise,” I do know that the SUSE Cloud Havana release shows that it’s possible to deliver a repeatable and robust OpenStack deployment.

This effort shows some serious DevOps automation chops and, since Crowbar is open, everyone in the community can benefit from their tuning.   Of course, I’d love to see these great capabilities migrate into the very active StackForge Chef OpenStack cookbooks that OpenCrowbar is designed to leverage.

Creating HA automation is a great achievement and an important milestone in capturing the true golden fleece – automated release-to-release upgrades.  We built the OpenCrowbar annealer with this objective in mind and I feel like it’s within reach.

Can’t Contain(erize) the Hype – is Docker real or a bubble?

The new application portability darling, Docker, was so popular at this week’s Red Hat Summit that I was expecting Miley Cyrus’ flock of paparazzi to abandon her in favor of Ben Golub.

Personally, I find Docker to be a useful tool and we’ve been embedding it into our dev and test processes in useful ways for DefCore TCUP (at Conference), OpenCrowbar Admin and Dev Nodes.  To me, these are concrete and clear use cases.

There are clearly a lot more great use cases for Docker but I can’t help but feel like it’s being thrown into architectural layer cakes and markitectures as a substitute for the non-words “cloud”, “amazing” and “revolutionary.”

How do I distinguish hot from hype?  I look for places where Docker is solving just one problem set instead of being a magic wand solution to a raft of systemic issues.

Places where I think Docker is potent and disruptive

  • Creating a portable and consistent environment for dev, test and delivery
  • Helping Linux distros keep updating the kernel without breaking user space (RHEL 7 anyone?)
  • Reducing the virtualization overhead of tenant isolation (containers are lighter)
  • Reducing the virtualization overhead for DevOps developers testing multi-node deployments

But I’m concerned that we’re expecting too many silver bullets

  • Packaging is still tricky:  Creating a locked box helps solve part of downstream problem (you know what you have) but not the upstream problem (you don’t know what you depend on).
  • Container sprawl: Breaking deployments into more functional discrete parts is smart, but that means we have MORE PARTS to manage.   There’s an inflection point between separation of concerns and sprawl.
  • PaaS Adoption: Docker helps with PaaS but it solves neither the “you have to model your apps for a PaaS” nor the “PaaS needs scalable data services” problems.

Speaking of Miley Cyrus, it’s not the container that matters, but what’s on the inside.  Docker can take a lesson from Miley: attention is great but you’ve still got to be able to sing.    I’m not sure about Miley, but I am digging the tracks that Docker is laying down.  Docker is worth putting on your play list.

Rocking Docker – OpenCrowbar builds solid foundation & life-cycle [VIDEOS]

Docker has been gathering a substantial amount of interest as an additional way to solve application portability and dependency hell.  We’ve been enthusiastic participants in this fledgling community through Docker in OpenStack and my work on DefCore’s Tempest in a Container (TCUP).

In OpenCrowbar, we’ve embedded Docker much deeper to solve a few difficult & critical problems: speeding up development of multi-node deployments and building the environment for the containers.  Check out my OpenCrowbar does Docker video or the community demo!

Bootstrapping Docker into a DevOps management framework turns out to be non-trivial because integrating new nodes into a functioning operating environment is very different on Docker than on physical servers or VMs.  Containers don’t PXE boot and have more limited configuration options.

How did we do this?  Unlike other bare metal provisioning frameworks, we made sure that Crowbar did not require DHCP+PXE as the only node discovery process.  While we default to and fully support PXE with our sledgehammer discovery image, we also allow operators to pre-populate the Crowbar database using our API and make configuration adjustments before the node is discovered/created.

We even went a step farther and enabled the Crowbar dependency graph to take alternate routes (we call it the “provides” role).  This enhancement is essential for dealing with “alike but different” infrastructure like Docker.
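One way to picture the “provides” idea is a resolver in which a requirement can be met by any role that declares it provides that capability.  The sketch below is a toy model under that assumption, not Crowbar’s code: the role names (pxe-discovery, docker-node, os-install, app-deploy), the capability names and the resolve helper are all hypothetical.

```python
# Toy model of a role graph with "provides" alternatives.
# Every role name, capability name and the resolve() helper are
# illustrative assumptions, not OpenCrowbar's data model or API.

ROLES = {
    # role name: (capabilities it requires, capabilities it provides)
    "pxe-discovery": ([], ["managed-node"]),
    "docker-node": ([], ["managed-node", "os"]),
    "os-install": (["managed-node"], ["os"]),
    "app-deploy": (["os"], ["app"]),
}


def resolve(target, available):
    """Return an ordered list of roles that satisfies `target`.

    `available` is the set of roles that can run on this kind of node.
    A requirement is met by the first available role that provides it.
    """
    order = []

    def visit(capability):
        for name in sorted(available):
            requires, provides = ROLES[name]
            if capability in provides:
                for needed in requires:
                    visit(needed)
                if name not in order:
                    order.append(name)
                return
        raise LookupError(f"nothing available provides {capability!r}")

    visit(target)
    return order


# Same goal ("app"), two different routes through the graph.
physical = resolve("app", {"pxe-discovery", "os-install", "app-deploy"})
container = resolve("app", {"docker-node", "app-deploy"})
print(physical)
print(container)
```

The point of the sketch is that app-deploy never names PXE or Docker directly; it only asks for a capability, so the route taken depends on which roles are available for that node.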

The result is that you can request Docker nodes in OpenCrowbar (using the API only for now) and it will automatically create the containers and attach them into Crowbar management.  It’s important to stress that we are not adding existing containers to Crowbar by adding an agent; instead, Crowbar manages the container’s life-cycle and then the work inside the container.

Getting around the PXE cycle using containers as part of Crowbar substantially improves Ops development cycle time because we don’t have to wait for boot > discovery > reboot > install to create a clean environment.  Bringing fresh Docker containers into a dev system takes seconds instead.

The next step is equally powerful: Crowbar should be able to configure the Docker host environment on host nodes (not just the Admin node as we are now demonstrating).  Setting up the host can be very complex: you need to have the correct RAID, BIOS, Operating System and multi-NIC networking configuration.  All of these factors must be handled with a system perspective that matches your Ops environment.  Luckily, this is exactly Crowbar’s sweet spot!

Until we’ve got that pulled together, OpenCrowbar’s ability to use upstream cookbooks and this latest Dev/Test focused step provide remarkable out of the gate advantages for everyone building multi-node DevOps tools.


PS: It’s worth noting that we’ve already been using Docker to run & develop the Crowbar Admin server.  This extra step makes Crowbar even more Dockeriffic.

OpenCrowbar Multi-OS deploy from Docker Admin

Last week I talked about OpenCrowbar reaching a critical milestone and this week I’ve posted two videos demonstrating how the new capabilities work.

The first video highlights the substantial improvements we’ve made in testing and developing OpenCrowbar.  By using Docker containers, OpenCrowbar is fast and reliable to set up and test.  We’ve dramatically streamlined the development environment and consolidated the whole code base into logical groups with logical names.

The second video shows off OpenCrowbar doing its deployment work (including setting up Docker nodes!).  This demonstration goes through the new node discovery and install process.  The new annealing process is very transparent and gives clear and immediate feedback about the entire discovery and provisioning process.  I also show how to configure networks (IPv4 and IPv6) and choose which operating system gets installed.

Note: In the videos, I demonstrate using our Docker install process.  Part of moving from Crowbar v2 (in the original Crowbar repo) to OpenCrowbar was so that we could also organize the code for an RPM install.  In either install process, OpenCrowbar no longer uses bloated ISOs with all components pre-cached so you must be connected to the Internet to complete the installation.

Mayflies and Dinosaurs (extending Puppies and Cattle)

Josh McKenty and I were discussing the common misconception of the “Puppies and Cattle” analogy. His position is not anti-puppy! He believes puppies are sometimes unavoidable and should be isolated into portable containers (VMs) so they can be shuffled around seamlessly. His more provocative point is that we want our underlying infrastructure to be cattle so it remains highly elastic and flexible. More cattle means a more resilient system. To me, this is a fundamental CloudOps design objective.

We realized that the perfect cloud infrastructure would structurally discourage the creation of puppies.

Imagine a cloud in which servers were automatically decommissioned after a week of use. In a sort of anti-SLA, any VM running for more than 168 hours would be (gracefully) terminated. This would force a constant churn of resources within the infrastructure that enables true cattle-like management. This cloud would be able to very gracefully rebalance load and handle disruptive management operations because the workloads are designed for the churn.

We called these servers mayflies due to their limited life span.
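To make the mayfly rule concrete, here is a minimal sketch of the age check such a cloud would need to run.  It assumes only the 168 hour limit from above; the fleet data, the server names and the expired helper are made up for illustration, and a real implementation would drain workloads gracefully before terminating anything.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=168)  # the one-week limit described above


def expired(servers, now):
    """Return the names of servers that have run for more than MAX_AGE.

    `servers` maps a server name to its creation time (hypothetical data).
    """
    return sorted(
        name for name, created in servers.items() if now - created > MAX_AGE
    )


now = datetime(2014, 1, 15, 12, 0)
fleet = {
    "web-1": now - timedelta(hours=12),   # young mayfly, left alone
    "web-2": now - timedelta(hours=169),  # just past the limit
    "db-1": now - timedelta(days=30),     # well on its way to dinosaur
}
to_terminate = expired(fleet, now)
print(to_terminate)
```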

While this approach requires a high degree of automation, the most successful cloud operators I have met are effectively building workloads with this requirement. If we require application workloads to be elastic and fault-resilient then we have a much higher degree of flexibility with the underlying infrastructure. I’ve seen this in practice with several OpenStack clouds: operators who helped applications deploy using automation were able to decommission “old” clouds much more gracefully. They effectively turned their entire cloud into a cow. Sadly, the ones without that investment puppified™ the ops infrastructure and created a much more brittle environment.

The opposite of a mayfly is the dinosaur: a server that is so brittle and locked that the slightest disturbance wipes out everything it touches.

Dinosaurs are puppies grown into a T-Rex with rows of massive razor sharp teeth and tiny manicured hands. These are systems that are so unique and historical that there’s no way to recreate them if there’s a failure. The original maintainer’s exit happy hour was celebrated by people who were laid-off two CEOs ago. The impact of dinosaurs goes beyond their operational risk; they are typically impossible to extend or maintain and, consequently, ossify other servers around them. This type of server drains elasticity from your ops team.

Puppies do not grow up to become dogs, they become dinosaurs.

It’s a classic lean adage to do hard things more frequently. Perhaps it’s time to start creating mayflies in your ops infrastructure.

OpenCrowbar reaches critical milestone – boot, discover and forge on!

We started the Crowbar project because we needed to make OpenStack deployments fast, repeatable and sharable.  We wanted a tool that looked at deployments as a system and integrated with our customers’ operations environment.  Crowbar was born as an MVP and quickly grew into a more dynamic tool that could deploy OpenStack, Hadoop, Ceph and other applications, but most critically we recognized that our knowledge gaps were substantial and we wanted to collaborate with others on the learning.  The result of that learning was a rearchitecture effort that we started at OSCON in 2012.

After nearly two years, I’m proud to show off the framework that we’ve built: OpenCrowbar addresses the limitations of Crowbar 1.x and adds critical new capabilities.

So what’s in OpenCrowbar?  Pretty much what we targeted at the launch and we’ve added some wonderful surprises too:

  • Heterogeneous Operating Systems – choose which operating system you want to install on the target servers.
  • CMDB Flexibility – don’t be locked in to a devops toolset.  Attribute injection allows clean abstraction boundaries so you can use multiple tools (Chef and Puppet, playing together).
  • Ops Annealer – the orchestration at Crowbar’s heart combines the best of directed graphs with late binding and parallel execution.  We believe annealing is the key ingredient for repeatable and OpenOps shared code upgrades.
  • Upstream Friendly – infrastructure as code works best as a community practice and Crowbar uses upstream code without injecting “crowbarisms” that were previously required.  So you can share your learning with the broader DevOps community even if they don’t use Crowbar.
  • Node Discovery (or not) – Crowbar maintains the same proven discovery image based approach that we used before, but we’ve streamlined and expanded it.  You can use Crowbar’s API outside of the PXE discovery system to accommodate Docker containers, existing systems and VMs.
  • Hardware Configuration – Crowbar maintains the same optional hardware neutral approach to RAID and BIOS configuration.  Configuring hardware with repeatability is difficult and requires much iterative testing.  While our approach is open and generic, my team at Dell works hard to validate on a specific set of gear: it’s impossible to make statements beyond that test matrix.
  • Network Abstraction – Crowbar dramatically extended our DevOps network abstraction.  We’ve learned that networking is the key to success for deployment and upgrade so we’ve made Crowbar networking flexible and concise.  Crowbar networking works with attribute injection so that you can avoid hardwiring networking into DevOps scripts.
  • Out of band control – when the Annealer hands off work, Crowbar gives the worker implementation flexibility to do it on the node (using SSH) or remotely (using an API).  Making agents optional allows operators and developers to make the best choices for the actions that they need to take.
  • Technical Debt Paydown – We’ve also updated the Crowbar infrastructure to use the latest libraries like Ruby 2, Rails 4, Chef 11.  Even more importantly, we’ve dramatically simplified the code structure, including in-repo documentation and a Docker-based developer environment that makes building a working Crowbar environment fast and repeatable.
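For readers who want a feel for the “directed graph plus parallel execution” part of the Ops Annealer bullet, here is a small sketch using Python’s standard graphlib module.  It is not the annealer: late binding is not modelled, and the role names and the waves helper are hypothetical.  It only shows how a dependency graph can be walked in batches where everything inside a batch could run at the same time.

```python
from graphlib import TopologicalSorter

# role -> the roles it depends on (hypothetical names, for illustration only)
GRAPH = {
    "network": set(),
    "bios": set(),
    "raid": {"bios"},
    "os-install": {"raid", "network"},
    "app": {"os-install"},
}


def waves(graph):
    """Group roles into batches; the roles inside one batch have no
    unmet dependencies on each other, so they could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # also raises CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # mark the whole batch as finished
    return batches


plan = waves(GRAPH)
for number, batch in enumerate(plan, start=1):
    print(number, batch)
```

In this toy graph the BIOS and network roles land in the first batch together, while RAID has to wait for BIOS and the OS install has to wait for both RAID and network.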

Why change to OpenCrowbar?  This new generation of Crowbar is structurally different from Crowbar 1 and we’ve invested substantially in refactoring the tooling, paying down technical debt and cleaning up documentation.  Since Crowbar 1 is still being actively developed, splitting the repositories allows both versions to progress with less confusion.  The majority of the principles and deployment code are very similar, so I think of Crowbar as a single community.

Interested?  Our new Docker Admin node is quick to set up and can boot and manage both virtual and physical nodes.

Mark Stouse’s “Making Predictions for 14” series

I was invited to be part of Mark Stouse’s 2014 big data & cloud predictions series.  His questions had me thinking deeply about the past year and I’m happy to repost them here with links to the other predictors too (including Robert Scoble, Shel Israel, and David H. Deans).

1.  Describe in one sentence what you do and why you’re good at it.

I specialize in architecture for infrastructure software for scale data center operations (aka “cloud”) and I have 14 years of battle scars that inform my designs.

 2.  Cloud Computing, Big Data or Consumerization: Which trend do you feel is having the most impact on IT today and why?

Cloud, Data & Consumerization are all connected, so there’s no one clear “most impactful” winner except that all three are forcing IT to rethink how we handle operations.   The pace of change for these categories (many of which are open source driven) is so fast that traditional IT governance cannot keep up.  I’m specifically talking about the DevOps and Lean Software Delivery paradigms.  These approaches do not mean that we’re trading speed for quality; in fact, I’ve seen that we’re adopting techniques that deliver both higher quality and speed.

 3.  What do you think is the biggest misconception about Cloud computing/Big Data/Consumerization?

That someone can purchase them as a SKU.  These are really architectural concepts that impact how we solve problems instead of specific products.  My experience is that customers overlook their need to understand how to change their business to take advantage of these technologies.  It’s the same classic challenge for ROI from most new technologies – they don’t pay off unless the business makes matching changes to leverage them.

 4.  Which (Cloud Computing/Big Data/Consumerization) trend has surprised you most in the last five years?

Open source has surprised me because we’ve seen it transform from a cost concern into a supply chain concern.  When I started doing open source work for Dell, customers were very interested in innovation and controlling license costs.  This has really changed over the last few years.  Today, customers are more concerned with community participation and transparency of their product code base.  This surprised me until I realized that they are really seeking to ensure that they had maximum control and visibility into their “IT Supply Chain.”   It may seem like a paradox, but open source software is uniquely positioned to help companies maintain more control of their critical IT because they are not tightly coupled to a single vendor.

 5.  How has Cloud Computing/Big Data/Consumerization had the biggest impact in YOUR life to date?

Beyond it being my career, I believe these technologies have created a new degree of freedom for me.  I’m answering these questions from the SFO airport where I’m carrying all of the tools I need to do my job in a space small enough to fit under the seat in front of me plus a free Wifi connection.  I believe we are only just learning how access to information and portable computing will change our experience.  This learning process will be both liberating and painful as we work out the right balances between access, identity and privacy.

 6.  On a lighter note – If Cloud/Big Data/Consumerization could be personified by a superhero, which superhero would it be and why?

The Hulk.   Looks like a friendly geek but it’s going to crush you if you’re not careful.

 7.  What aspect of (Cloud Computing/Big Data/ Consumerization) are you most excited about in the future, and what excites you about it?

The Internet of Things (even if I hate the term) is very exciting because we’re moving into a place where we have real ways to connect our virtual and physical lives.  That translates into cool technologies like self-driving cars and smart power utilities.  I think it will also motivate a revolution in how people interact with computers and each other.  It’s going to open up a whole new dimension on our personal interaction with our surroundings.   I’m specifically thinking about a book “Rainbows End” by Vernor Vinge that paints this future in vivid detail.

OpenStack Havana provides foundation for XXaaS you need

It’s been a long time, and a lot of summits, since I posted how OpenStack was ready for workloads (back in Cactus!).  We’ve seen remarkable growth of both the platform technology and the community surrounding it.  So much growth that now we’re struggling to define “what is core” for the project and I’m proud to be on the Foundation Board helping to lead that charge.

So what’s exciting in Havana?

There’s a lot I am excited about in the latest OpenStack release.

Complete Split of Compute / Storage / Network services

In the beginning, OpenStack IaaS was one service (Nova).  We’ve been breaking that monolith into distinct concerns (Compute, Network, Storage) for the last several releases and I think Havana is the first release where all three of the services are robust enough to take production workloads.

This is a major milestone for OpenStack because knowledge that the APIs were changing inhibited adoption.


We’ve been hanging out with the Ceph and Docker teams, so you can expect to see some interesting developments.  These two are proof that it is a fallacy that only OpenStack projects are critical to OpenStack, because neither of these technologies is moving under the official OpenStack umbrella.  I am looking forward to seeing both have dramatic impacts on cloud deployments.

Docker promises to make Linux Containers (LXC) more portable and easier to use.  This container-based approach provides near bare metal performance without compromising VM portability.  More importantly, you can oversubscribe LXC much more than VMs.  This allows you to dramatically improve system utilization and unlocks some other interesting quality of service tricks.

Ceph is showing signs of becoming the scale out storage king.  Beyond its solid data dispersion algorithm, a key aspect of its mojo is that it delivers both block and object storage.  I’ve seen a lot of interest in consolidating both types of storage into a single service.  Ceph delivers on that plus performance and cost.  It’s a real winner.

Crowbar Integration & High Availability Configuration!

We’ve been making amazing strides in the Crowbar + OpenStack integration!  As usual, we’re planning our zero day community build (on the “Roxy” branch) to get people started thinking about operationalizing OpenStack.   This is going to be especially interesting because we’re introducing it first on Crowbar 1 with plans to quickly migrate to Crowbar 2 where we can leverage the attribute injection pattern that OpenStack cookbooks also use.  Ultimately, we expect those efforts to converge.  The fact that Dell is putting reference implementations of HA deployment best practices into the open community is a major win for OpenStack.

Tests, Tests, Tests & Continuous Delivery

OpenStack continues to drive higher standards for reviews, integration and testing.  I’m especially excited about the volume and activity around our review system (although backlogs in reviews are challenges).  In addition, the community continues to invest in test suites like the Tempest project.  These are direct benefits to operators beyond simple code quality.  Our team uses Tempest to baseline field deployments.  This means that OpenStack test suites help validate live deployments, not just lab configurations.

We achieve a greater level of quality when we gate code check-ins on tests that matter to real deployments.   In fact, that premise is the basis for our “what is core” process.  It also means that more operators can choose to deploy OpenStack continuously from trunk (which I consider to be a best practice for scale ops).

Where did we fall short?

With growth come challenges: Havana is the most complex release yet.  The number of projects that are part of the OpenStack integrated release family continues to expand.  While these new projects show the powerful innovation engine at work with OpenStack, they also make the project larger and more difficult to comprehend (especially for n00bs).  We continue to invest in Crowbar as a way to serve the community by making OpenStack more accessible and providing open best practices.

We are still struggling to resolve questions about interoperability (defining core should help) and portability.  We spent a lot of time at the last two summits on interoperability, but I don’t feel like we are much closer than before.  Hopefully, progress on Core will break the log jam.

Looking ahead to Icehouse?

I and many leaders from Dell will be at the Icehouse Summit in Hong Kong listening and learning.

The top of my list is the family of XXaaS services (Database aaS, Load Balancer aaS, Firewall aaS, etc) that have appeared.  I’m a firm believer that clouds are more than compute+network+storage.  With a stable core, OpenStack is ready to expand into essential platform services.

If you are at the summit, please join Dell (my employer) and Intel for the OpenStack Summit Welcome Reception (RSVP!) kickoff networking and social event on Tuesday November 5, 2013 from 6:30 – 8:30pm at the SkyBistro in the SkyCity Marriott.   My teammate, Kamesh Pemmaraju, has a complete list of all the Dell panels and events.

In scale-out infrastructure, tools & automation matter

Scale out platforms like Hadoop have different operating rules.  I heard an interesting story today in which the performance of the overall system improved threefold (the run went from 15 mins down to 5 mins) by the removal of a node.

In a distributed system that coordinates work between multiple nodes, it only takes one bad node to dramatically impact the overall performance of the entire system.
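A back-of-the-envelope model shows why a single node matters so much.  The sketch assumes the job waits for every node before it can finish (my simplification, not a claim about how any particular Hadoop job is scheduled), and the per-node timings are made-up numbers picked to match the 15 to 5 minute story.

```python
def job_time(node_times):
    """With a barrier at the end, the run is as slow as its slowest node."""
    return max(node_times)


healthy = [5.0, 5.0, 5.0, 5.0]    # minutes per node (made-up numbers)
with_bad_node = healthy + [15.0]  # the same cluster plus one slow node

before = job_time(with_bad_node)  # run time with the bad node present
after = job_time(healthy)         # run time once it is removed

speedup = before / after                # how many times faster
time_saved = (before - after) / before  # fraction of wall-clock time saved

print(f"{before:.0f} min -> {after:.0f} min")
print(f"{speedup:.1f}x faster, {time_saved:.0%} less wall-clock time")
```

Note that adding more healthy nodes to this model changes nothing: the maximum is still set by the one bad node, which is why finding it is worth more than adding capacity.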

Finding and correcting this type of failure can be difficult.  While natural variability, hardware faults or bugs cause some issues, the human element is by far the most likely cause.   If you can turn down noise injected by human error then you’ve got a chance to find the real system related issues.

Consequently, I’ve found that management tooling and automation are essential for success.  Management tools help diagnose the cause of the issue and automation creates repeatable configurations that reduce the risk of human injected variability.

I’d also like to give a shout out to benchmarks as part of your tooling suite.  Without having a reasonable benchmark it would be impossible to actually know that your changes improved performance.
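Even a very small harness beats guessing.  The sketch below times a stand-in workload several times and keeps the median so one noisy run does not skew the baseline; the benchmark and workload functions are placeholders I made up, and you would swap in a job that represents your real system.

```python
import statistics
import time


def benchmark(fn, repeats=5):
    """Return the median wall-clock seconds over several runs of fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def workload():
    # Placeholder job; replace with something representative of your system.
    return sum(i * i for i in range(100_000))


baseline = benchmark(workload)
print(f"baseline median: {baseline:.4f}s")
```

Record the baseline before a change and rerun the same harness afterwards; comparing the two medians is what tells you whether the change actually helped.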

Teaming Related Post Script: In considering the concept of system performance, I realized that distributed human systems (aka teams) have a very similar characteristic.  A single person can have a disproportionate impact on overall team performance.

Thanks! I’m enjoying my conversation with you

I write because I love to tell stories and to think about how actions we take today will impact tomorrow.  Ultimately, everything here is about a dialog with you because you are my sounding board and my critic.  I appreciate when people engage me about posts here and extend the conversation into other dimensions.  Feel free to call me on points and question my position – that’s what this is all about.

Thank you for being a part of my blog and joining in.  I’m looking forward to hearing more from you.

During the OpenStack Summit, I got to lead and participate in some excellent presentations and panels.  While my theme for this summit was interoperability, there were many other items discussed.

I hope you enjoy them.

Did one of these topics stand out?  Is there something I missed?  Please let me know!