Problems with the “Give me a Wookiee” hybrid API

Greg Althaus, RackN CTO, creates amazing hybrid DevOps orchestration that spans metal and cloud implementations.  When it comes to knowing the nooks and crannies of data centers, his ops scar tissue has scar tissue.  So, I knew you’d all enjoy this funny story he wrote after previewing my OpenStack API report.  

“APIs are only valuable if the parameters mean the same thing and you get back what you expect.” Greg Althaus

The following is a guest post by Greg:

While building the Digital Rebar OpenStack node provider, Rob Hirschfeld tried to integrate with 7+ OpenStack clouds.  While the APIs matched across instances, there are all sorts of challenges with what comes out of the API calls.  

The discovery made me realize that APIs are not the end of interoperability.  They are the beginning.  

I found I could best describe it with a story.

I found an API on a service and that API creates a Wookiee!

I can tell the API that I want a tall or short Wookiee or young or old Wookiee.  I test against the Kashyyyk service.  I consistently get an 8 ft Brown 300 year old Wookiee when I ask for a Tall Old Wookiee.  

I get a 6ft Brown 50 Year old Wookiee when I ask for a Short Young Wookiee.  Exactly what I want, all the time.  

My pointy-haired emperor boss says I now need to use the Forest Moon of Endor (FME) Service.  He was told it is the exact same thing but cheaper.  Okay, let’s do this.  It consistently gives me a 5 year old, 4 ft tall Brown Ewok (called a Wookiee) when I ask for the Tall Young Wookiee.  

This is a fail.  I mean, yes, they are both furry and brown, but the Ewok can’t reach the top of my bookshelf.  

The next service has to work, right?  About the same price as FME, the Tatooine Service claims to be really good too.  It passes tests.  It hands out things called Wookiees.  The only problem is that, while size is an API field, the service requires the use of petite and big instead of short and tall.  This is just annoying.  This time my tall (well big) young Wookiee is 8 ft tall and 50 years old, but it is green and bald (scales are like that).  

I don’t really know what it is.  I’m sure it isn’t a Wookiee.  

And while she is awesome (better than the male Wookiees), she almost froze to death in the arctic tundra that is Boston.  

My point: APIs are only valuable if the parameters mean the same thing and you get back what you expect.
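
The story can be restated as a small conformance check. The Go sketch below is purely illustrative: the `Provider` type, the three fake services and the `isTallWookiee` check are hypothetical names invented for this example, and the heights, ages and colors come from the story (combining "tall" with "young" on Kashyyyk, and the height of a "petite" result on Tatooine, are assumptions). It shows that a call can succeed on every service while only one of them returns what the caller meant.

```go
package main

import (
	"errors"
	"fmt"
)

// Creature is whatever a service returns when asked for a "Wookiee".
type Creature struct {
	HeightFt int
	AgeYears int
	Color    string
	Furry    bool
}

// Provider is a hypothetical stand-in for one service behind the same API.
type Provider func(size, age string) (Creature, error)

// kashyyyk follows the story: tall is 8 ft, short is 6 ft,
// old is 300 years, young is 50 years.
func kashyyyk(size, age string) (Creature, error) {
	c := Creature{HeightFt: 6, AgeYears: 50, Color: "brown", Furry: true}
	if size == "tall" {
		c.HeightFt = 8
	}
	if age == "old" {
		c.AgeYears = 300
	}
	return c, nil
}

// forestMoon accepts the same parameters but always hands back an Ewok.
func forestMoon(size, age string) (Creature, error) {
	return Creature{HeightFt: 4, AgeYears: 5, Color: "brown", Furry: true}, nil
}

// tatooine only understands "petite" and "big", and returns scales.
func tatooine(size, age string) (Creature, error) {
	c := Creature{AgeYears: 50, Color: "green", Furry: false}
	switch size {
	case "big":
		c.HeightFt = 8
	case "petite":
		c.HeightFt = 6 // assumption: the story only describes "big"
	default:
		return Creature{}, errors.New("unknown size: " + size)
	}
	return c, nil
}

// isTallWookiee checks the result itself, not just that the call succeeded.
func isTallWookiee(c Creature) bool {
	return c.HeightFt >= 8 && c.Furry && c.Color == "brown"
}

func main() {
	services := []struct {
		name string
		call Provider
		size string
	}{
		{"Kashyyyk", kashyyyk, "tall"},
		{"Forest Moon of Endor", forestMoon, "tall"},
		{"Tatooine", tatooine, "big"}, // same API field, different vocabulary
	}
	for _, s := range services {
		c, err := s.call(s.size, "young")
		ok := err == nil && isTallWookiee(c)
		fmt.Printf("%s: err=%v conforms=%t\n", s.name, err, ok)
	}
}
```

The point of the sketch is that the check looks at what came back. A test that only asserts "the call returned without error" would pass on all three services.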

 

Repair Kenmore Dishwasher Error F4E1 needs THIS Reset Code

This post is paying it forward on the SEO for people repairing their own Kenmore Dishwasher.  The repair is VERY manageable but requires an undocumented clear code at the end of the replacement.

If you get an F4-E1 (washer pump bad) code, then you MUST clear the code after you replace the drive motor.  The reset code is pressing any three buttons in a 1-2-3 sequence three times (so 1-2-3, 1-2-3, 1-2-3).  I took a picture of my unit with all lights on after I entered the diagnostic code.

Kudos to Kenmore for making their unit incredibly serviceable.  

Replacing the drive components was EASY because of their thoughtful design. The repair basically replaced all the mechanical parts of the device as a single unit for $250 (a new unit is about $900).  Once I had the drive, it only required removing a few obvious connections.  The parts effectively snap together.

My only moment of panic came when I put the unit back together with the new parts and got the original code again.  Entering the clear code made everything work.  While it’s easy to find the parts, the reset code is NOT easy to find.  Hopefully, this post helps you resolve this final step in the repair!

A year of RackN – 9 lessons from the front lines of evangelizing open physical ops

Let’s avoid this > “We’re heading right at the ground, sir!  Excellent, all engines full power!”

another scale? oars & motors. WWF managing small scale fisheries

RackN is refining our “from start to scale” message and it’s also our 1 year anniversary, so it’s a natural time for reflection. While it’s been a year since our founders made RackN a full time obsession, the team has been working together for over 5 years now with the same vision: improve scale datacenter operations.

As a backdrop, IT-Ops is under tremendous pressure to increase agility and reduce spending.  Even worse, there’s a building pipeline of container driven change that we are still learning how to operate.

Over the year, we learned that:

  1. no one has time to improve ops
  2. everyone thinks their uniqueness is unique
  3. most sites have much more in common than is different
  4. the differences between sites are small
  5. small differences really do break automation
  6. once it breaks, it’s much harder to fix
  7. everyone plans to simplify once they stop changing everything
  8. the pace of change is accelerating
  9. apply, rinse, repeat with lesson #1

Where does that leave us besides stressed out?  Ops is not keeping up.  The solution is not going faster: we have to improve first and then accelerate.

What makes general purpose datacenter automation so difficult?  The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is ordering of interdependencies.  This is especially true on physical systems where you have to really grok* networking.
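
To make “ordering of interdependencies” concrete, here is a minimal Go sketch of a dependency ordering (a plain topological sort). It is an illustration only, not how Digital Rebar works internally: the `order` function and the step names (`switch-config`, `dhcp`, `pxe-boot`, `install-os`) are hypothetical examples chosen for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// order returns the steps so that every step comes after the steps it
// depends on. It reports an error if the dependencies form a cycle.
func order(deps map[string][]string) ([]string, error) {
	remaining := map[string]int{}       // unmet dependency count per step
	dependents := map[string][]string{} // steps waiting on a given step
	for step, needs := range deps {
		if _, ok := remaining[step]; !ok {
			remaining[step] = 0
		}
		for _, n := range needs {
			remaining[step]++
			dependents[n] = append(dependents[n], step)
			if _, ok := remaining[n]; !ok {
				remaining[n] = 0
			}
		}
	}

	// Start with the steps that depend on nothing.
	var ready []string
	for step, n := range remaining {
		if n == 0 {
			ready = append(ready, step)
		}
	}
	sort.Strings(ready) // keep the output stable across runs

	var out []string
	for len(ready) > 0 {
		step := ready[0]
		ready = ready[1:]
		out = append(out, step)
		next := dependents[step]
		sort.Strings(next)
		for _, d := range next {
			remaining[d]--
			if remaining[d] == 0 {
				ready = append(ready, d)
			}
		}
	}
	if len(out) != len(remaining) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return out, nil
}

func main() {
	// Hypothetical physical provisioning steps and what each one needs first.
	deps := map[string][]string{
		"install-os":    {"pxe-boot"},
		"pxe-boot":      {"dhcp", "switch-config"},
		"dhcp":          {"switch-config"},
		"switch-config": nil,
	}
	steps, err := order(deps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(steps)
}
```

Even in this toy version, one site-specific difference (say, a site where DHCP is external and has no dependency on the switch) changes the graph, which is why small variations can break automation that assumes a fixed order.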

The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.

What does that heterogeneous-first automation look like? First, we’ve learned that we have to adapt to customer datacenters. That means using the DNS, DHCP and other services that you already have in place, and dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.

This was hard and tested both our patience and architecture pattern. It would be much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” about forcing uniformity and built software that accepts variation.

So what did we do with a year?  We had to spend a lot of time listening and learning what “real operations” need.   Then we had to create software that accommodated variation without breaking downstream automation.  Now we’ve made it small enough to run on a desktop or cloud for sandboxing and a new learning cycle begins.

We’d love to have you try it out: rebar.digital.

* Grok is the correct word here.  Thinking that you “understand networking” is often more dangerous when it comes to automation.

Transitioning from a Bossy Boss into a Digital Age Leader [Series Conclusion]

NOW THAT WE ARE AT THE END OF OUR 8 POST SERIES, BRAD SZOLLOSE AND ROB HIRSCHFELD INVITE YOU TO SHARE IN OUR DISCUSSION ABOUT FAILURES, FIGHTS AND FRIGHTENING TRANSFORMATIONS GOING ON AROUND US AS DIGITAL WORK CHANGES WORKPLACE DELIVERABLES, PLANNING AND CULTURE.

We hope you’ve enjoyed our discussion about digital management over the last seven posts. This series was born of our frustration with patterns of leadership in digital organizations: overly directing leaders stifle their team while hands-off leaders fail to provide critical direction. Neither culture is leading effectively!

Digital managers have to be two things at once

We felt that our “cultural intuition” is failing us.  That drove us to describe what’s broken and how to fix it.

Digital work and workers operate in a new model where top-down management is neither appropriate nor effective. In fact, many digital workers actively resist being given too much direction, rules or structure. No, we are not throwing out management; on the contrary, we believe management is more important than ever, but changes to both work and workers have made it much harder than before.

That’s especially true when Boomers and Millennials try to work together because of differences in leadership experience and expectation. As Brad is always pointing out in his book Liquid Leadership, “what motivates a Millennial will not motivate a Boomer,” or even a Gen Xer.

Millennials may be so uncomfortable having to set limits and enforce decisions that they avoid exerting the very leadership that digital workers need! Meanwhile, GenX and Boomers may be creating and expecting unrealistic deadlines simply because they truly do not understand the depth of the work involved.

So who’s right and who’s wrong? As we’ve pointed out in previous posts, it’s neither! Why? Because unlike Industrial Age Models, there is no one way to get something done in The Information Age.

We desperately need a management model that works for everyone. How does a digital manager know when it’s time to be directing? If you’ve communicated a shared purpose well then you are always at liberty to 1) ask your team if this is aligned and 2) quickly stop any activity that is not aligned.

The trap we see for digital managers who have not communicated the shared goals is that they lack the team authority to take the lead.

We believe that digital leadership requires finding a middle ground using these three guidelines:

  1. Clearly express your intent and trust (don’t force) that your team will follow it.
  2. Respect your team’s ability to make good decisions around the intent.
  3. Don’t be shy to exercise your authority when your team needs direction.

Digital management is hard: you don’t get the luxury of authority or the comfort of certainty.

If you are used to directing then you have to trust yourself to communicate clearly at an abstract level and then let go of the details. If you are used to being hands-off then you have to get over your reluctance to be specific and assertive when the situation demands it.

Our frustration was that neither Boomer nor Millennial culture is providing effective management. Instead, we realized that elements of both are required. It’s up to the digital manager to learn when each mode is required.

Thank you for following along. It has been an honor.

The Matrix & Surrogates as analogies for VMs, Containers and Metal

Trench coats aside, I used The Matrix as a useful analogy to explain virtualization and containers to a non-technical friend today.  I’m interested in hearing from others if this is a helpful analogy.

Why does anyone care about virtual servers?

Virtual servers (aka virtual machines or VMs) are important because data centers are just like the Matrix.  The real world of data centers is an ugly, messy place fraught with hidden dangers and unpleasant chores.  Most of us would rather take the blue pill and live in a safe computer generated artificial environment where we can ignore those details and live in the convenient abstraction of Mega City.

Do VMs really work to let you ignore the physical stuff?

Pretty much.  For most people, they can live their whole lives within the virtual world.  They can think they are eating the steak and never try bending the spoons.

So why are containers disruptive?  

Well, it’s like the Surrogates movie.  Right now, a lot of people living in the Cloud Matrix are setting up even smaller bubbles.  They are finding that they don’t need a whole city, they can just live inside a single room.  For them, it’s more like Surrogates where people never leave their single room.

But if they never leave the container, do they need the Matrix?

No.  And that’s the disruption.  If you’ve wrapped yourself in a smaller bubble then you really don’t need the larger wrapper.

What about that messy “real world”?

It’s still out there in both cases.  Just once you are inside the inner bubble, you can’t really tell the difference.

Short lived VM (Mayflies) research yields surprising scheduling benefit

Last semester, Alex Hirschfeld (my son) did a simulation to explore the possible efficiency benefits of the Mayflies concept proposed by Josh McKenty and me.

Mayflies swarming from Wikipedia

In the initial phase of the research, he simulated a data center using load curves designed to oversubscribe the resources (he’s still interested in actual load data).  This was sufficient to test the theory and find something surprising: mayflies can really improve scheduling.

Alex found that an unexpected benefit comes when you force mayflies to have a controlled “die off.”  It allows your scheduler to be much smarter.

Let’s assume that you have a high mayfly ratio (70%); that means every day 10% of your resources would turn over.  If you coordinate the time window and feed that information into your scheduler, then it can make much better load distribution decisions.  Alex’s simulation showed that this approach basically eliminated hot spots and server over-crowding.

Here’s a snippet of his report explaining the effect in his own words:

On a system that is more consistent and does not have a massive virtual machine throughput, Mayflies may not help with balancing the system’s load, but with the social engineering aspect, it can increase the stability of the system.

Most of the time, the requests for new virtual machines on a cloud are immutable. They came in at a time and need to be fulfilled in the order of their request. Mayflies has the potential to change that. If a request is made, it has the potential to be added to a queue of mayflies that need to be reinitialized. This creates a queue of virtual machine requests that any load balancing algorithm can work with.

Mayflies can make load balancing a system easier. Knowing the exact size of the virtual machine that is going to be added and knowing when it will die makes load balancing for dynamic systems trivial.
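
As a rough illustration of why a known die-off time helps a scheduler, here is a small Go sketch. It is not Alex’s simulation: the `VM` and `Host` types, the `place` function and the “least loaded on a look-ahead day” rule are all assumptions made up for this example. The idea it shows is that a host that looks busy today can still be the better choice if most of its load is scheduled to die soon.

```go
package main

import "fmt"

// VM is a mayfly: it has a size and a known day on which it will die.
type VM struct {
	Size    int
	DiesDay int
}

// Host is a server with a fixed capacity and the VMs placed on it.
type Host struct {
	Name     string
	Capacity int
	VMs      []VM
}

// usedOn returns the capacity in use on the given day, ignoring VMs
// whose scheduled die-off has already happened by then.
func (h Host) usedOn(day int) int {
	used := 0
	for _, vm := range h.VMs {
		if vm.DiesDay > day {
			used += vm.Size
		}
	}
	return used
}

// place puts the VM on the host that has room today and will be least
// loaded on lookAheadDay. It returns the chosen index, or -1 if none fits.
func place(hosts []Host, vm VM, today, lookAheadDay int) int {
	best := -1
	bestLoad := 0
	for i, h := range hosts {
		if h.usedOn(today)+vm.Size > h.Capacity {
			continue // does not fit right now
		}
		load := h.usedOn(lookAheadDay)
		if best == -1 || load < bestLoad {
			best, bestLoad = i, load
		}
	}
	if best >= 0 {
		hosts[best].VMs = append(hosts[best].VMs, vm)
	}
	return best
}

func main() {
	hosts := []Host{
		// Busier today, but its VM dies on day 1.
		{Name: "a", Capacity: 10, VMs: []VM{{Size: 6, DiesDay: 1}}},
		// Lighter today, but its VM lives until day 7.
		{Name: "b", Capacity: 10, VMs: []VM{{Size: 4, DiesDay: 7}}},
	}
	i := place(hosts, VM{Size: 4, DiesDay: 7}, 0, 2)
	if i < 0 {
		fmt.Println("no host has room")
		return
	}
	fmt.Println("placed on host", hosts[i].Name)
}
```

A scheduler that only looks at current load would pick host b here. With die-off times available, the sketch picks host a, because by the look-ahead day it will be empty apart from the new VM.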

Golang Example JSON REST HTTP Get with Digest Auth

Since I could not find a complete example of a Go REST call that returned JSON and used Digest Auth (for the Digital Rebar API), I wanted to feed the SEO monster for the next person.

My purpose is to illustrate the pattern, not deliver reference code.  Once I got all the pieces in the right place, the code was wonderfully logical.  The basic workflow is:

  1. define a structure with JSON mapping markup
  2. define an alternate HTTP transport that includes digest auth
  3. enable the client
  4. perform the get request
  5. extract the response body into a stream
  6. decode the stream into the mapped data structure (from step 1)
  7. use the information

Here’s the sample:

package main

import (
	"encoding/json"
	"fmt"
	"log"

	// import path as originally published
	digest "code.google.com/p/mlab-ns2/gae/ns/digest"
)

// the struct maps to the JSON automatically with the added metadata
type Deployment struct {
	ID          int    `json:"id"`
	State       int    `json:"state"`
	Name        string `json:"name"`
	Description string `json:"description"`
	System      bool   `json:"system"`
	ParentID    int64  `json:"parent_id"`
	CreatedAt   string `json:"created_at"`
	UpdatedAt   string `json:"updated_at"`
}

func main() {

	// set up a transport to handle digest auth
	transport := digest.NewTransport("crowbar", "password")

	// initialize the client
	client, err := transport.Client()
	if err != nil {
		log.Fatal(err) // main cannot return an error, so exit instead
	}

	// make the call (auth will happen)
	resp, err := client.Get("http://127.0.0.1:3000/api/v2/deployments")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// magic of the structure definition will map automatically
	var d []Deployment // it's an array returned, so we need a slice here.
	if err := json.NewDecoder(resp.Body).Decode(&d); err != nil {
		log.Fatal(err)
	}

	// print results
	fmt.Printf("Header:%s\n", resp.Header.Get("Content-Type"))
	fmt.Printf("Code:%s\n", resp.Status)
	if len(d) > 0 { // guard against an empty result
		fmt.Printf("Name:%s\n", d[0].Name)
	}
}
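
If you want to try steps 1 and 4 through 7 without a Digital Rebar server or the digest package, here is a stdlib-only sketch. It swaps digest auth for a local `httptest` server with a made-up payload, so it only demonstrates the JSON mapping and decode pattern. The `fetchDeployments` and `newFakeServer` helpers and the sample JSON are assumptions for this example, not part of the Digital Rebar API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// Deployment mirrors a subset of the fields from the sample above.
type Deployment struct {
	ID     int    `json:"id"`
	Name   string `json:"name"`
	System bool   `json:"system"`
}

// fetchDeployments (hypothetical helper) performs the GET and decodes
// the JSON array from the response body into a slice of Deployment.
func fetchDeployments(client *http.Client, url string) ([]Deployment, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	var d []Deployment
	if err := json.NewDecoder(resp.Body).Decode(&d); err != nil {
		return nil, err
	}
	return d, nil
}

// newFakeServer returns a local test server that serves an invented payload.
func newFakeServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `[{"id":1,"name":"system","system":true}]`)
	}))
}

func main() {
	srv := newFakeServer()
	defer srv.Close()

	d, err := fetchDeployments(srv.Client(), srv.URL)
	if err != nil {
		log.Fatal(err)
	}
	for _, dep := range d {
		fmt.Printf("ID:%d Name:%s System:%t\n", dep.ID, dep.Name, dep.System)
	}
}
```

Because the helper takes an `*http.Client`, a client built from a digest transport could be passed in place of the test server’s client, assuming it is a standard `*http.Client`.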

PS: I’m doing this for the  Digital Rebar API driver because it uses REST and Digest.  We’re actively maintaining it there if you want the latest.