SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters

Originally posted on Kubernetes Blog.  I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early.  Please join us in this effort!

openWe think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project by weekly participation and collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

The Cluster Ops Special Interest Group meets weekly at 13:00PT on Thursdays, you can join us via the video hangout and see latest meeting notes for agendas and topics covered.

Operators, they don’t want to swim Upstream

Operators Dinner 11/10

Nov 10, Palo Alto Operators Dinner

Last Tuesday, I had the honor of joining an OpenStack scale operators dinner. Foundation executives, Jonathan Bryce and Lauren Sell, were also on the guest list so talk naturally turned to “how can OpenStack better support operators.” Notably, the session was distinctly not OpenStack bashing.

The conversation was positive, enthusiastic and productive, but one thing was clear: the OpenStack default “we’ll fix it in the upstream” answer does not work for this group of operators.

What is upstreaming?  A sans nuance answer is that OpenStack drives fixes and changes in the next community release (longer description).  The project and community have a tremendous upstream imperative that pervades the culture so deeply that we take it for granted.  Have an issue with OpenStack?  Submit a patch!  Is there any other alternative?

Upstreaming [to trunk] makes perfect sense considering the project vendor structure and governance; however, it is a very frustrating experience for operators.   OpenStack does have robust processes to backport fixes and sustain past releases and documentation; yet, the feeling at the table was that they are not sufficiently operator focused.

Operators want fast, incremental and pragmatic corrections to the code and docs they are deploying (which is often two releases back).  They want it within the community, not from individual vendors.

There are great reasons for focusing on upstream trunk.  It encourages vendors to collaborate and makes it much easier to add and expand the capabilities of the project.  Allowing independent activity on past releases creates a forward integration mess and could make upgrades even harder.  It will create divergence on APIs and implementation choices.

The risk of having a stable, independently sustained release is that operators have less reason to adopt the latest shiny release.  And that is EXACTLY what they are asking for.

Upstreaming is a core value to OpenStack and essential to our collaborative success; however, we need to consider that it is not the right answer to all questions.  Discussions at that dinner reinforced that pushing everything to latest trunk creates a significant barrier for OpenStack operators and users.

What are your experiences?  Is there a way to balance upstreaming with forking?  How can we better serve operators?

Hidden costs of Cloud? No surprises, it’s still about complexity = people cost

Last week, Forbes and ZDnet posted articles discussing the cost of various cloud (451 source material behind wall) full of dollar per hour costs analysis.  Their analysis talks about private infrastructure being an order of magnitude cheaper (yes, cheaper) to own than public cloud; however, the open source price advantages offered by OpenStack are swallowed by added cost of finding skilled operators and its lack of maturity.

At the end of the day, operational concerns are the differential factor.

The Magic 8 Cube

The Magic 8 Cube

These articles get tied down into trying to normalize clouds to $/vm/hour analysis and buried the lead that the operational decisions about what contributes to cloud operational costs.   I explored this a while back in my “magic 8 cube” series about six added management variations between public and private clouds.

In most cases, operations decisions is not just about cost – they factor in flexibility, stability and organizational readiness.  From that perspective, the additional costs of public clouds and well-known stacks (VMware) are easily justified for smaller operations.  Using alternatives means paying higher salaries and finding talent that requires larger scale to justify.

Operational complexity is a material cost that strongly detracts from new platforms (yes, OpenStack – we need to address this!)

Unfortunately, it’s hard for people building platforms to perceive the complexity experienced by people outside their community.  We need to make sure that stability and operability are top line features because complexity adds a very real cost because it comes directly back to cost of operation.

In my thinking, the winners will be solutions that reduce BOTH cost and complexity.  I’ve talked about that in the past and see the trend accelerating as more and more companies invest in ops automation.

Apply, Rinse, Repeat! How do I get that DevOps conditioner out of my hair?

I’ve been trying to explain the pain Tao of physical ops in a way that’s accessible to people without scale ops experience.   It comes down to a yin-yang of two elements: exploding complexity and iterative learning.

Science = Explosions!Exploding complexity is pretty easy to grasp when we stack up the number of control elements inside a single server (OS RAID, 2 SSD cache levels, 20 disk JBOD, and UEFI oh dear), the networks that server is connected to, the multi-layer applications installed on the servers, and the change rate of those applications.  Multiply that times 100s of servers and we’ve got a problem of unbounded scope even before I throw in SDN overlays.

But that’s not the real challenge!  The bigger problem is that it’s impossible to design for all those parameters in advance.

When my team started doing scale installs 5 years ago, we assumed we could ship a preconfigured system.  After a year of trying, we accepted the reality that it’s impossible to plan out a scale deployment; instead, we had to embrace a change tolerant approach that I’ve started calling “Apply, Rinse, Repeat.”

Using Crowbar to embrace the in-field nature of design, we discovered a recurring pattern of installs: we always performed at least three full cycle installs to get to ready state during every deployment.

  1. The first cycle was completely generic to provide a working baseline and validate the physical environment.
  2. The second cycle attempted to integrate to the operational environment and helped identify gaps and needed changes.
  3. The third cycle could usually interconnect with the environment and generally exposed new requirements in the external environment
  4. The subsequent cycles represented additional tuning, patches or redesigns that could only be realized after load was applied to the system in situ.

Every time we tried to shortcut the Apply-Rinse-Repeat cycle, it actually made the total installation longer!  Ultimately, we accepted that the only defense was to focus on reducing A-R-R cycle time so that we could spend more time learning before the next cycle started.

OpenCrowbar Design Principles: Simulated Annealing [Series 4 of 6]

This is part 4 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar.  The content is reposted from the OpenCrowbar docs repo.

Simulated Annealing

simulated_annealingSimulated Annealing is a modeling strategy from Computer Science for seeking optimum or stable outcomes through iterative analysis. The physical analogy is the process of strengthening steel by repeatedly heating, quenching and hammering. In both computer science and metallurgy, the process involves evaluating state, taking action, factoring in new data and then repeating. Each annealing cycle improves the system even though we may not know the final target state.

Annealing is well suited for problems where there is no mathematical solution, there’s an irregular feedback loop or the datasets change over time. We have all three challenges in continuous operations environments. While it’s clear that a deployment can modeled as directed graph (a mathematical solution) at a specific point in time, the reality is that there are too many unknowns to have a reliable graph. The problem is compounded because of unpredictable variance in hardware (NIC enumeration, drive sizes, BIOS revisions, network topology, etc) that’s even more challenging if we factor in adapting to failures. An operating infrastructure is a moving target that is hard to model predictively.

Crowbar implements the simulated annealing algorithm by decomposing the operations infrastructure into atomic units, node-roles, that perform the smallest until of work definable. Some or all of these node-roles are changed whenever the infrastructure changes. Crowbar anneals the environment by exercising the node-roles in a consistent way until system re-stabilizes.

One way to visualize Crowbar annealing is to imagine children who have to cross a field but don’t have a teacher to coordinate. Once student takes a step forward and looks around then another sees the first and takes two steps. Each child advances based on what their peers are doing. None wants to get too far ahead or be left behind. The line progresses irregularly but consistently based on the natural relationships within the group.

To understand the Crowbar Annealer, we have to break it into three distinct components: deployment timeline, annealing and node-role state. The deployment timeline represents externally (user, hardware, etc) initiated changes that propose a new target state. Once that new target is committed, Crowbar anneals by iterating through all the node-roles in a reasonable order. As the Annealer runs the node-roles they update their own state. The aggregate state of all the node-roles determines the state of the deployment.

A deployment is a combination of user and system defined state. Crowbar’s job is to get deployments stable and then maintain over time.

OpenCrowbar Design Principles: Late Binding [Series 3 of 6]

This is part 3 of 6 in a series discussing the principles behind the “ready state” and other concepts implemented in OpenCrowbar.  The content is reposted from the OpenCrowbar docs repo.

2013-09-13_18-56-39_197Ops Late Binding

In terms of computer science languages, late binding describes a class of 4th generation languages that do not require programmers to know all the details of the information they will store until the data is actually stored. Historically, computers required very exact and prescriptive data models, but later generation languages embraced a more flexible binding.

Ops is fluid and situational.

Many DevOps tooling leverages eventual consistency to create stable deployments. This iterative approach assumes that repeated attempts of executing the same idempotent scripts do deliver this result; however, they are do not deliver predictable upgrades in situations where there are circular dependencies to resolve.

It’s not realistic to predict the exact configuration of a system in advance –

  • the operational requirements recursively impact how the infrastructure is configured
  • ops environments must be highly dynamic
  • resilience requires configurations to be change tolerant

Even more complex upgrade where the steps cannot be determined in advanced because the specifics of the deployment direct the upgrade.

Late Binding is a  foundational topic for Crowbar that we’ve been talking about since mid-2012.  I believe that it’s an essential operational consideration to handle resiliency and upgrades.  We’ve baked it deeply into OpenCrowbar design.

Continue Reading > post 4

Substituting Action for Knowledge – adopting “ready, fire, aim” as a strategy (and when to run like hell)

Today my mother-in-law (a practicing psychiatrist) was bemoaning the current medical practice of substituting action for knowledge. In her world, many doctors will make rapid changes to their patients’ therapy. Their goal is to address the issues immediately presented (patient feels sad so Dr prescribes antidepressants) rather than taking time to understand the patients’ history or make changes incrementally and measure impacts. It feels like another example of our cultural compulsion to fix problems as quickly as possible.

Her comments made me question the core way that I evangelize!

Do Lean and Agile substitute action for knowledge? No. We use action to acquire knowledge.

The fundamental assumption that drives poor decision-making is that we have enough information to make a design, solve a problem or define a market. Lean and Agile’s more core tenet is that we must attack this assumption. We must assume that we can’t gather enough information to fully define our objective. The good news, is that even without much analysis we know a lot! We know:

  • roughly what we want to do (road map)
  • the first steps we should take (tactics)
  • who will be working on the problem (team members)
  • generally how much effort it will take (time & team size)
  • who has the problem that we are trying to solve (market)

We also know that we’ll learn a lot more as we get closer to our target. Every delay in starting effectively pushed our “day of clarity” further into the future. For that reason, it is essential that we build a process that constantly reviews and adjusts its targets.

We need to build a process that acquires knowledge as progress is made and makes rapid progress.

In Agile, we translate this need into the decorations of our process: reviews for learning, retrospectives for adjustments, planning for taking action and short iterations to drive the feedback loop.  Agile’s mantra is “ready, fire, aim, fire, aim, fire, aim, …” which is very different from simply jumping out of a plane without a parachute and hoping you’ll find a haystack to land in.

For cloud deployments, this means building operational knowledge in stages.  Technology is simply evolving too quickly and best practices too slowly for anyone to wait for a packaged solution to solve all their cloud infrastructure problems.  We tried this and it does not work: clouds are a mixture hardware, software and operations.  More accurately, clouds are an operational model supported by hardware and software.

Currently, 80% of cloud deployment effort is operations (or “DevOps“).

When I listen to people’s plans about building product or deploying cloud, I get very skeptical when they take a lot of time to aim at objects far off on the horizon.  Perhaps they are worried that they will substitute action for knowledge; however, I think they would be better served to test their knowledge with a little action.

My MIL agrees – she sees her patients frequently and makes small adjustments to their treatment as needed.  Wow, that’s an Rx for Agile!