Three reasons why Ops Composition works: Cluster Linking, Services and Configuration (pt 2)

In part pt 1, we reviewed the RackN team’s hard won insights from previous deployment automation. We feel strongly that prioritizing portability in provisioning automation is important. Individual sites may initially succeed building just for their own needs; however, these divergences limit future collaboration and ultimately make it more expensive to maintain operations.

aid1165255-728px-install-pergo-flooring-step-5-version-2If it’s more expensive isolate then why have we failed to create shared underlay? Very simply, it’s hard to encapsulate differences between sites in a consistent way.

What makes cluster construction so hard?

There are a three key things we have to solve together: cross-node dependencies (linking), a lack of service configuration (services) and isolating attribute chains (configuration).  While they all come back to thinking of the whole system as a cluster instead of individual nodes. let’s break them down:

Cross Dependencies (Cluster Linking) – The reason for building a multi-node system, is to create an interconnected system. For example, we want a database cluster with automated fail-over or we want a storage system that predictably distributes redundant copies of our data. Most critically and most overlooked, we also want to make sure that we can trust cluster members before we share secrets with them.

These cluster building actions require that we synchronize configuration so that each step has the information it requires. While it’s possible to repeatedly bang on the configure until it converges, that approach is frustrating to watch, hard to troubleshoot and fraught with timing issues.  Taking this to the next logical steps, doing upgrades, require sequence control with circuit breakers – that’s exactly what Digital Rebar was built to provide.

Service Configuration (Cluster Services) – We’ve been so captivated with node configuration tools (like Ansible) that we overlook the reality that real deployments are intertwined mix of service, node and cross-node configuration.  Even after interacting with a cloud service to get nodes, we still need to configure services for network access, load balancers and certificates.  Once the platform is installed, then we use the platform as a services.  On physical, there are even more including DNS, IPAM and Provisioning.

The challenge with service configurations is that they are not static and generally impossible to predict in advance.  Using a load balancer?  You can’t configure it until you’ve got the node addresses allocated.  And then it needs to be updated as you manage your cluster.  This is what makes platforms awesome – they handle the housekeeping for the apps once they are installed.

Digital Rebar decomposition solves this problem because it is able to mix service and node configuration.  The orchestration engine can use node specific information to update services in the middle of a node configuration workflow sequence.  For example, bringing a NIC online with a new IP address requires multiple trusted DNS entries.  The same applies for PKI, Load Balancer and Networking.

Isolating Attribute Chains (Cluster Configuration) – Clusters have a difficult duality: they are managed as both a single entity and a collection of parts. That means that our configuration attributes are coupled together and often iterative. Typically, we solve this problem by front loading all the configuration. This leads to several problems: first, clusters must be configured in stages and, second, configuration attributes are predetermined and then statically passed into each component making variation and substitution difficult.

Our solution to this problem is to treat configuration more like functional programming where configuration steps are treated as isolated units with fully contained inputs and outputs. This approach allows us to accommodate variation between sites or cluster needs without tightly coupling steps. If we need to change container engines or networking layers then we can insert or remove modules without rewriting or complicating the majority of the chain.

This approach is a critical consideration because it allows us to accommodate both site and time changes. Even if a single site remains consistent, the software being installed will not. We must be resilient both site to site and version to version on a component basis. Any other pattern forces us to into an unmaintainable lock step provisioning model.

To avoid solving these three hard issues in the past, we’ve built provisioning monoliths. Even worse, we’ve seen projects try to solve these cluster building problems within their own context. That leads to confusing boot-strap architectures that distract from making the platforms easy for their intended audiences. It is OK for running a platform to be a different problem than using the platform.
In summary, we want composition because we are totally against ops magic.  No unicorns, no rainbows, no hidden anything.

Basically, we want to avoid all magic in a deployment. For scale operations, there should never be a “push and prey” step where we are counting on timing or unknown configuration for it to succeed. Those systems are impossible to maintain, share and scale.

I hope that this helps you look at the Digital Rebar underlay approach in a holistic why and see how it can help create a more portable and sustainable IT foundation.

Ops is Ops, except when it ain’t. Breaking down the impedance mismatches between physical and cloud ops.

We’ve made great strides in ops automation, but there’s no one-size-fits-all approach to ops because abstractions have limitations.

IMG_20141108_035537967Perhaps it’s my Industrial Engineering background, I’m a huge fan of operational automation and tooling. I can remember my first experience with VMware ESX and thinking that it needed tooling automation.  Since then, I’ve watched as cloud ops has revolutionized application development and deployment.  We are just at the beginning of the cloud automation curve and our continuous deployment tooling and platform services deliver exponential increases in value.

These cloud breakthroughs are fundamental to Ops and uncovered real best practices for operators.  Unfortunately, much of the cloud specific scripts and tools do not translate well to physical ops.  How can we correct that?

Now that I focus on physical ops, I’m in awe of the capabilities being unleashed by cloud ops. Looking at Netflix chaos monkeys pattern alone, we’ve reached a point where it’s practical to inject artificial failures to improve application robustness.  The idea of breaking things on purpose as an optimization is both terrifying and exhilarating.

In the last few years, I’ve watched (and lead) an application of these cloud tool chains down to physical infrastructure.  Fundamentally, there’s a great fit between DevOps configuration management (Chef, Puppet, Salt, Ansible) tooling and physical ops.  Most of the configuration and installation work (post-ready state) is fundamentally the same regardless if the services are physical, virtual or containerized.  Installing PostgreSQL is pretty much the same for any platform.

But pretty much the same is not exactly the same.  The differences between platforms often prevent us from translating useful work between frames.  In physics, we’d call that an impedance mismatch: where similar devices cannot work together dues to minor variations.

An example of this Ops impedance mismatch is networking.  Virtual systems present interfaces and networks that are specific to the desired workload while physical systems present all the available physical interfaces plus additional system interfaces like VLANs, bridges and teams.  On a typical server, there at least 10 available interfaces and you don’t get to choose which ones are connected – you have to discover the topology.  To complicate matters, the interface list will vary depending on both the server model and the site requirements.

It’s trivial in virtual by comparison, you get only the NICs you need and they are ordered consistently based on your network requests.  While the basic script is the same, it’s essential that it identify the correct interface.  That’s simple in cloud scripting and highly variable for physical!

Another example is drive configuration.  Hardware presents limitless options of RAID, JBOD plus SSD vs HDD.  These differences have dramatic performance and density implications that are, by design, completely obfuscated in cloud resources.

The solution is to create functional abstractions between the application configuration and the networking configuration.  The abstraction isolates configuration differences between the scripts.  So the application setup can be reused even if the networking is radically different.

With some of our OpenCrowbar latest work, we’re finally able to create practical abstractions for physical ops that’s repeatable site to site.  For example, we have patterns that allow us to functionally separate the network from the application layer.  Using that separation, we can build network interfaces in one layer and allow the next to assume the networking is correct as if it was a virtual machine.  That’s a very important advance because it allows us to finally share and reuse operational scripts.

We’ll never fully eliminate the physical vs cloud impedance issue, but I think we can make the gaps increasingly small if we continue to 1) isolate automation layers with clear APIs and 2) tune operational abstractions so they can be reused.

OpenStack Deploy Day generates lots of interest, less coding

Last week, my team at Dell led a world-wide OpenStack Essex Deploy event. Kamesh Pemmaraju, our OpenStack-powered solution product manager, did a great summary of the event results (200+ attendees!). What started as a hack-a-thon for deploy scripts morphed into a stunning 14+ hour event with rotating intro content and an ecosystem showcase (videos).  Special kudos to Kamesh, Andi Abes, Judd Maltin, Randy Perryman & Mike Pittaro for leadership at our regional sites.

Clearly, OpenStack is attracting a lot of interest. We’ve been investing time in content to help people who are curious about OpenStack to get started.

While I’m happy to be fueling the OpenStack fervor with an easy on-ramp, our primary objective for the Deploy Day was to collaborate on OpenStack deployments.

On that measure, we have room for improvement. We had some great discussions about how to handle upgrades and market drivers for OpenStack; however, we did not spend the time improving Essex deployments that I was hoping to achieve. I know it’s possible – I’ve talked with developers in the Crowbar community who want this.

If you wanted more expert interaction, here are some of my thoughts for future events.

  • Expert track did not get to deploy coding. I think that we need to simply focus more even tightly on to Crowbar deployments. That means having a Crowbar Hack with an OpenStack focus instead of vice versa.
  • Efforts to serve OpenStack n00bs did not protect time for experts. If we offer expert sessions then we won’t try to have parallel intro sessions. We’ll simply have to direct novices to the homework pages and videos.
  • Combining on-site and on-line is too confusing. As much as I enjoy meeting people face-to-face, I think we’d have a more skilled audience if we kept it online only.
  • Connectivity! Dropped connections, sigh.
  • Better planning for videos (not by the presenters) to make sure that we have good results on the expert track.
  • This event was too long. It’s just not practical to serve Europe, US and Asia in a single event. I think that 2-3 hours is a much more practical maximum. 10-12am Eastern or 6-8pm Pacific would be much more manageable.

Do you have other comments and suggestions? Please let me know!

OpenStack Austin: What we’d like to see at the Design Summit

Last week, the OpenStack Austin user group discussed what we’d like to see at the upcoming OpenStack Design Summit. We had a strong turnout (48?!).

  1. To get the meeting started, Marc Padovani from HP (this month’s sponsor) provided some lessons learned from the HP OpenStack-Powered Cloud. While Marc noted that HP has not been able to share much of their development work on OpenStack; he was able to show performance metrics relating to a fix that HP contributed back to the OpenStack community. The defect related to the scheduler’s ability to handle load. The pre-fix data showed a climb and then a gap where the scheduler simply stopped responding. Post-fix, the performance curve is flat without any “dead zones.” (sharing data like this is what I call “open operations“)
  2. Next, I (Rob Hirschfeld) gave a brief overview of the OpenStack Essex Deploy Day (my summary) that Dell coordinated with world-wide participation. The Austin deploy day location was in the same room as the meetup so several of the OSEDD participants were still around.
  3. The meat of the meetup was a freeform discussion about what the group would like to see discussed at the Design Summit. My objective for the discussion was that the Austin OpenStack community could have a broader voice is we showed consensus for certain topics in advance of the meeting.

At Jim Plamondon‘s suggestion, we captured our brain storming on the OpenStack etherpad. The Etherpad is super cool – it allows simultaneous editing by multiple parties, so the notes below were crowd sourced during the meeting as we discussed topics that we’d like to see highlighted at the conference. The etherpad preserves editors, but I removed the highlights for clarity.

The next step is for me to consolidate the list into a voting page and ask the membership to rank the items (poll online!) below.

Brain storm results (unedited)

Stablity vs. Features

API vs. Code

  • What is the measurable feature set?
  • Is it an API, or an implementation?
  • Is the Foundation a formal-ish standards body?
  • Imagine the late end-game: can Azure/VMWare adopt OPenStack’s APIs and data formats to deliver interop, without running OpenStack’s code? Is this good? Are there conversations on displacing incumbents and spurring new adoption?
  • Logo issues

Documentation Standards

  • Dev docs vs user docs
  • Lag of update/fragmentation (10 blogs, 10 different methods, 2 “work”)
  • Per release getting started guide validated and available prior or at release.

Operations Focus

  • Error messages and codes vs python stack traces
  • Alternatively put, “how can we make error messages more ops-friendly, without making them less developer-friendly?”
  • Upgrade and operations of rolling updates and upgrades. Hot migrations?

If OpenStack was installable on Windows/Hyper-V as a simple MSI/Service installer – would you try it as a node?

  • Yes.

Is Nova too big?  How does it get fixed?

  • libraries?
  • sections?
  • make it smaller sub-projects
  • shorter release cycles?

nova-volume

  • volume split out?
  • volume expansion of backend storage systems
  • Is nova-volume the canonical control plane for storage provisioning?  Regardless of transport? It presently deals in block devices only… is the following blueprint correctly targeted to nova-volume?

https://blueprints.launchpad.net/nova/+spec/filedriver

Orchestration

  • Is the Donabe project dead?

Discussion about invitations to Summit

  • What is a contribution that warrants an invitation
  • Look at Launchpad’s Karma system, which confers karma for many different “contributory” acts, including bug fixes and doc fixes, in addition to code commitments

Summit Discussions

  • Is there a time for an operations summit?
  • How about an operators’ track?
  • Just a note: forums.openstack.org for users/operators to drive/show need and participation.

How can we capture the implicit knowledge (of mailing list and IRC content) in explicit content (documentation, forums, wiki, stackexchange, etc.)?

Hypervisors: room for discussion?

  • Do we want hypervisor featrure parity?
  • From the cloud-app developer’s perspective, I want to “write once, run anywhere,” and if hypervisor features preclude that (by having incompatible VM images, foe example)
  • (RobH: But “write once, run anywhere” [WORA] didn’t work for Java, right?)
  • (JimP: Yeah, but I was one of Microsoft’s anti-Java evangelists, when we were actively preventing it from working — so I know the dirty tricks vendors can use to hurt WORA in OpenStack, and how to prevent those trick from working.)

CDMI

Swift API is an evolving de facto open alternative to S3… CDMI is SNIA standards track.  Should Swift API become CDMI compliant?  Should CDMI exist as a shim… a la the S3 stuff.

OpenStack Essex Deploy Day: First Steps to Production

One March 8th, 70 people from around the world gathered on the Crowbar IM channel to begin building a production grade OpenStack Essex deployment. The event was coordinated as meet-ups by the Dell OpenStack/Crowbar team (my team) in two physical locations: the Nokia offices in Boston and the TechRanch in Austin.

My objective was to enable the community to begin collaboration on Essex Deployment. At that goal, we succeeded beyond my expectations.

IMHO, the top challenge for OpenStack Essex is to build a community of deploying advocates. We have a strong and dynamic development community adding features to the project. Now it is time for us to build a comparable community of deployers. By providing a repeatable, shared and open foundation for OpenStack deployments, we create a baseline that allows collaboration and co-development. Not only must we make deployments easy and predictable, we must also ensure they are scalable and production ready.

Having solid open production deployment infrastructure drives OpenStack adoption.

Our goal on the 8th was not to deliver finished deployments; it was to the start of Essex deployment community collaboration. To ensure that we could focus on getting to an Essex baseline, our team invested substantial time before the event to make sure that participants had a working Essex reference deployment.

By the nature of my team’s event leadership and our approach to OpenStack, the event was decidedly Crowbar focused. I feel like this is an acceptable compromise because Crowbar is open and provides a repeatable foundation. If everyone has the same foundation then we can focus on the truly critical challenges of ensuring consistent OpenStack deployments. Even using Crowbar, we waste a lot of time trying to figure out the differences between configurations. Lack of baseline consistency seriously impedes collaboration.

The fastest way to collaborate on OpenStack deployment is to have a reference deployment as a foundation.

Success By The Numbers

This was a truly international community collaborative event. Here are some of the companies that participated:

Dell (sponsor), Nokia (sponsor), Rackspace, Opscode, Canonical, Fedora, Mirantis, Morphlabs, Nicira, Enstratus, Deutsche Telekom Innovation Laboratories, Purdue University, Orbital Software Solutions, XepCloud and others.

PLEASE COMMENT here if I missed your company and I will add it to the list.

On the day of the event, we collected the following statistics:

  • 70 people on Skype IM channel (it’s not too late to join by pinging DellCrowbar with “Essex barclamps”).
  • 14+ companies
  • 2 physical sites with 10-15 people at each
  • 4 fold increase in traffic on the Crowbar Github to 813 hits.
  • 66 downloads of the Deploy day ISO
  • 8 videos capture from deploy day sessions.
  • World-wide participation

For over 70 people to spend a day together at this early stage in deployment is a truly impressive indication of the excitement that is building around OpenStack.

Improvements for Next Deploy Day

This was a first time that Andi Abes (Boston event lead), Rob Hirschfeld (Austin event lead) or Jean-Marie Martini (Dell event lead) had ever coordinated an event like this. We owe much of the success to efforts by Greg Althaus, Victor Lowther and the Canonical 12.04/Essex team before the event. Also, having physical sites was very helpful.

We are planning to do another event, so we are carefully tracking ways to improve.

Here are some issues we are tracking.

  • Issues with setting up a screen and voice share that could handle 70 people.
  • Lack of test & documentation on Crowbar meant too much time focused on Crowbar
  • Connectivity issues distributed voice
  • Should have started with DevStack as a baseline
  • more welcome in the comments!

Thank you!

I want to thank everyone who participated in making this event a huge success!

OpenStack Essex Deploy Day 3/8 – Get involved and install with us

My team at Dell has been avidly tracking the upsdowns, and breakthroughs of the OpenStack Essex release.  While we still have a few milestones before the release is cut, we felt like the E4 release was a good time to begin the work on Essex deployment.  Of course, the final deployment scripts will need substantial baking time after the final release on April 5th; however, getting deployments working will help influence the quality efforts and expand the base of possible testers.

To rally behind Essex Deployments, we are hosting a public work day on Thursday March 8th.

For this work day, we’ll be hosting all-day community events online and physically in Austin and Boston.  We are getting commitments from other Dell teams, partners and customers around the world to collaborate.  The day is promising to deliver some real Essex excitement.

The purpose of these events is to deliver the core of a working OpenStack Essex deployment.  While my team is primarily focused on deploys via Crowbar/Chef, we are encouraging anyone interested in laying down OpenStack Essex to participate.  We will be actively engaged on the OpenStack IRC and mailing lists too.

We have experts in OpenStack, Chef, Crowbar and Operating Systems (Canonical, SUSE, and RHEL) engaged in these activities.

This is a great time to start learning about OpenStack (or Crowbar) with hands-on work.  We are investing substantial upfront time (checkout out the Crowbar wiki for details) to ensure that there is a working base OpenStack Essex deploy on Ubuntu 12.04 beta.  This deploy includes the Crowbar 1.3 beta with some new features specifically designed to make testing faster and easier than ever before.

In the next few days, I’ll cut a 12.04 ISO and OpenStack Barclamp TARs as the basis for the deploy day event.  I’ll also be creating videos that help you quickly get a test lab up and running.  Visit the wiki or meetup sites to register and stay tuned for details!