What does it take to Operate Open Platforms? Answers in Datanauts 72

Did I just let OpenStack ops off the hook….?  Kubernetes production challenges…?  

I had a lot of fun in this wide-ranging Datanauts discussion with unicorn herders Chris Wahl and Ethan Banks. I like the three-section format because it gives us a chance to deep dive into distinct topics and includes some out-of-band analysis by the hosts; however, that means you need to keep listening through the commercial breaks to hear the full podcast.

Three parts?  Yes, Chris and Ethan like to save the best questions for last.

In Part 1, we went deep into the industry operational and business challenges uncovered by the OpenStack project. Particularly, Chris and I go into “platform underlay” issues which I laid out in my “please stop the turtles” post. This was part of the build-up to my SRE series.

In Part 2, we explore my operations-focused view of the latest developments in container schedulers with a focus on Kubernetes. Part of the operational discussion goes into architecture “conceits” (or compromises) that allow developers to get the most from cloud native design patterns. I also make a pitch for using proven tools to run the underlay.

In Part 3, we go deep into the DevOps automation topics of configuration and orchestration. We talk about the design principles that help drive “day 2” automation and why in-place upgrades should be an industry priority. Of course, we do cover some Digital Rebar design too.

Take a listen and let me know what you think!

On Twitter, we’ve already started a discussion about how much developers should care about infrastructure. My opinion (posted here) is that the DevOps idea that developers “own” infrastructure caused a partial rebellion towards containers.

SRE role with DevOps for Enterprise [@HPE podcast]


My focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

Yes, DevOps and SRE are complementary

In this short 16-minute podcast, HPE’s Stephen Spector and I discuss where DevOps and SRE thinking overlap and where they differ. We also discuss how enterprises should evaluate Site Reliability Engineering as a function and where it fits in their organization.

Beyond Expectations: OpenStack via Kubernetes Helm (Fully Automated with Digital Rebar)

RackN revisits OpenStack deployments with an eye on ongoing operations.

I’ve been an outspoken skeptic of a Joint OpenStack Kubernetes Environment (my OpenStack BCN preso, Super User follow-up and BOS Proposal) because I felt that the technical hurdles of cloud native architecture would prove challenging. Issues like stable service positioning and persistent data are requirements for OpenStack and hard problems in Kubernetes.

I was wrong: I underestimated how fast these issues could be addressed.

The Kubernetes Helm work out of the AT&T Comm Dev lab takes on the integration with a “do it the K8s native way” approach that the RackN team finds very effective. In fact, we’ve created a fully integrated Digital Rebar deployment that lays down Kubernetes using Kargo and then adds OpenStack via Helm. The provisioning automation includes a Ceph cluster to provide stateful sets for data persistence.

This joint approach dramatically reduces operational challenges associated with running OpenStack without taking over a general purpose Kubernetes infrastructure for a single task.

Given the rise of SRE thinking, the RackN team believes that this approach changes the landscape for OpenStack deployments and will ultimately dominate the field (which is already mainly containerized). There is still work to be completed: some complex configuration is required to allow both Kubernetes CNI and Neutron to collaborate so that containers and VMs can cross-communicate.

We are looking for companies that want to join in this work and fast-track it into production.  If this is interesting, please contact us at sre@rackn.com.

Why should you sponsor? Current OpenStack operators facing “fork-lift upgrades” should want to find a path like this one that ensures future upgrades are baked into the plan. This approach provides a fast track to a general purpose, enterprise grade, upgradable Kubernetes infrastructure.

Closing note from my past presentations: We’re making progress on the technical aspects of this integration; however, my concerns about market positioning remain.

“Why SRE?” Discussion with Eric @Discoposse Wright

My focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

I was a guest of Eric “@discoposse” Wright on the Green Circle Community Podcast #42 (my previous appearance).

LISTEN NOW: Podcast #42

In this action-packed 30-minute conversation, we discuss the industry forces putting pressure on operations teams. These pressures require operators to invest much more heavily in reusable automation.

That leads us towards why Kubernetes is interesting and what went wrong with OpenStack (I actually use the phrase “dumpster fire”). We ultimately talk about how those lessons are embedded in the Digital Rebar architecture.

Apparently IT death smells like kickstart files. Six Reasons why.

Today, I’m sharing a parable about always being focused on adding value.

Recently, I was on a call with an IT Ops manager who insisted that his team had their on-premises operations under control with “python scripts and manual kickstart files” because they “really don’t change their infrastructure setup.” He explained that he and his team were comfortable with this because it was something they understood and did not require learning new systems. While I understand his position, I was sort of sad for him and his employer because…

No value is created for his company by maintaining custom kickstart, preseeds or boot files.

Maintaining kickstarts is fatal for many reasons. Is there a way to make it less fatal? Yes, and it involves investing in learning tools that let you move up stack.

Contrary to popular IT mythology, managing physical infrastructure is still a reality for many IT teams and will remain a part of best practices until every workload simply runs on Amazon and it becomes their problem.  Since that “Utopian” future is unlikely, let’s deal with some practical realities of hybrid IT.

Here are my six reasons why custom kickstarts (and other site-specific boot provisioning scripts) are dangerous:

1. Creating Site Unique Processes

Every infrastructure is unique and that’s a practical reality that we have to accept because otherwise we would never be able to make improvements and corrections without touching everything that is already deployed. However, we really want to work hard to minimize places where we inject variation into the environment. Server and site specific kickstarts with lots of post-provisioning steps do the opposite: they force operators to maintain additional information about each server.

2. Building Server Specific Configurations

When we create server specific templates, it becomes nearly impossible to recreate server builds. That directly leads to fragile infrastructure because teams cannot quickly redeploy or automate refreshes. Static IT infrastructure is a known fail pattern and makes enterprises vulnerable to staff changes, hacking and inability to manage and patch.

3. Having Opaque Configurations

Kickstart is hard to understand (and even harder to troubleshoot). When teams take actions during the provisioning process, those actions are often not tracked or managed like other operational scripts. Failures or injections can easily go undetected. Even if they are tracked, the number of operators who can read and manage these scripts is limited. That means that critical aspects of your operational environment happen outside of your awareness.

4. Being Less Secure

Kickstart processes generally include injecting SSH keys, certificates and other authentication credentials. These embedded credentials are often hard coded into the process with minimal awareness by the operations team, leaving you vulnerable at the most foundational level. This is not an acceptable security process; however, teams who hack kickstarts often don’t want to consider the implications.
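As a rough illustration of how that exposure could be surfaced, here is a minimal Python sketch that scans kickstart text for a few obvious hard-coded credentials. The pattern list, the function name `scan_kickstart` and the sample file are assumptions made up for this example; they are nowhere near a complete audit.

```python
import re

# Hypothetical patterns for illustration only; extend them for your own site.
SUSPECT_PATTERNS = {
    "plaintext root password": re.compile(r"^\s*rootpw\s+--plaintext\b"),
    "embedded private key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "inline SSH public key": re.compile(r"\bssh-(rsa|ed25519)\s+AAAA"),
}


def scan_kickstart(text):
    """Return (line_number, finding) pairs for lines matching a suspect pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


# Made-up sample kickstart fragment, not taken from any real system.
SAMPLE = """\
lang en_US.UTF-8
rootpw --plaintext changeme
%post
echo "ssh-rsa AAAAB3Nza... ops@example" >> /root/.ssh/authorized_keys
%end
"""

if __name__ == "__main__":
    for number, label in scan_kickstart(SAMPLE):
        print(f"line {number}: {label}")
```

A scan like this only finds what is already baked in; the better fix is to keep credentials out of the boot files entirely.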

Security side note: most teams don’t have the expertise to integrate TPM or HSM into their kickstart processes; consequently, these key security technologies are generally unused and ignored. If you want to talk about this, please contact me!

5. Diverging Provisioning Patterns

Cloud does not use kickstarts. Provisioning variation increases when teams keep/add logic and configuration into server provisioning instead of doing it as post-provision automation. If your physical provisioning team is not rehearsing on cloud then you’re in a serious IT hole because all workloads should be managed as hybrid-ready. Deployment fidelity helps accelerate teams and reduces cost.

6. Not Reusing Community Practice

Finally, managing your own kickstarts makes it impossible to leverage community patterns and practices. Kickstarts are not exactly a hive of innovation so you are not creating any competitive advantage by adding variation there. In cases like that, reusing community tooling is a net benefit to your organization. Why have we not done this already? Until recently, provisioning tools were not API driven or focused on reusable shared practice.

While Kickstart or similar is pretty much required for physical, we have a solution for these issues.

One of the key design elements of Digital Rebar is a templated, API-driven boot provisioner. Our approach uses kickstarts, preseeds and other tools; however, we’ve worked hard to minimize their span and decompose them into reusable components. That allows users to inject site specific code as snippets that are centrally managed and hardware neutral.
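To make the idea of decomposed, reusable snippets concrete, here is a small Python sketch built on the standard library’s `string.Template`. It is not Digital Rebar’s actual template format or API; the snippet names, parameters and the `render_kickstart` helper are all hypothetical and exist only to show one shared base template combined with centrally managed snippets.

```python
from string import Template

# Centrally managed, reusable snippets (hypothetical names and content).
SNIPPETS = {
    "ntp": "echo 'server $ntp_server iburst' >> /etc/chrony.conf",
    "motd": "echo 'Provisioned for $site' > /etc/motd",
}

# One hardware-neutral base template shared by every machine.
BASE = Template("""\
lang en_US.UTF-8
timezone $timezone
%post
$post_snippets
%end
""")


def render_kickstart(params, snippet_names):
    """Compose a kickstart from the shared base plus the named snippets."""
    post = "\n".join(
        Template(SNIPPETS[name]).substitute(params) for name in snippet_names
    )
    return BASE.substitute(params, post_snippets=post)


if __name__ == "__main__":
    # Site-specific values live in parameters, not in a per-server file.
    print(render_kickstart(
        {"timezone": "UTC", "ntp_server": "ntp.example.com", "site": "lab-1"},
        ["ntp", "motd"],
    ))
```

The point of the design is that site variation is expressed as data (parameters and a list of snippet names) while the boot files themselves stay shared and reviewable.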

Critically, our approach allows SRE and Ops teams to get out of the kickstart business and focus on provisioning workflow and automation. Yes, there’s some learning curve but there are a lot of benefits to moving up stack.

It’s not too late to “:q!” those kickstart edits and accelerate your infrastructure.

Spiraling Ops Debt & the SRE coding imperative

This post is part of an SRE series grounded in the ideas inspired by the Google SRE book.

2/13 Update: You can hear an INTERACTIVE DISCUSSION based on this post with Eric Wright on his podcast, GC Online.

Every Ops team I know is underwater and doesn’t have the time to catch their breath.

Why does the load increase and leave Ops behind?  It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors.  Many teams simply step around their struggling compatriots and spin up yet more Ops work adding to the backlog. Dashing off yet another Ansible playbook to install on AWS is empowering but ultimately adds to the Ops sustaining backlog.


Ops Tsunami

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms and those platforms are built on rapidly evolving components. For example, Kubernetes has a 3 month release cycle. That’s really fast; however, it is built on other components like Docker, SDN and operating systems that also have fast release cycles. That means that even your single Kubernetes infrastructure has many moving parts that may not be consistent in your own organization. For example, cloud deploys may use CoreOS while internal ones use a corporate-approved CentOS.
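A quick way to see how fast this multiplies is to count the combinations. The Python sketch below uses made-up option lists for each layer (they are assumptions for illustration, not survey data) and simply multiplies the choices.

```python
from itertools import product
from math import prod

# Illustrative options per layer; the specific entries are assumptions.
LAYERS = {
    "kubernetes": ["1.4", "1.5"],
    "container_runtime": ["docker-1.11", "docker-1.12"],
    "sdn": ["flannel", "calico"],
    "operating_system": ["coreos", "centos-7"],
}


def count_variants(layers):
    """Number of distinct stacks: the product of the options at each layer."""
    return prod(len(options) for options in layers.values())


def enumerate_variants(layers):
    """List every concrete stack as a dict of layer name to chosen option."""
    names = list(layers)
    return [dict(zip(names, combo)) for combo in product(*layers.values())]


if __name__ == "__main__":
    print(count_variants(LAYERS), "possible stacks")
    # One more Kubernetes release multiplies, rather than adds to, the total.
    grown = dict(LAYERS, kubernetes=LAYERS["kubernetes"] + ["1.6"])
    print(count_variants(grown), "possible stacks after one more release")
```

With just two options at each of four layers there are already 16 stacks to reason about, and a single additional release at one layer takes that to 24.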

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

Since then, we’ve seen a container-fueled explosion in developer productivity and an AI-driven rise in new hardware-flavored instances. Both are powerful drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why the Google SRE teams cap operational work at 50% of each team’s time, leaving the rest for automation and engineering.

If Ops work takes more than 50% then the team slowly sinks under growing operational load. If you are not actively decreasing the load via automation then your teams get underwater and basic ops hygiene fails.
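The sinking dynamic can be shown with a toy model. The Python sketch below is my own simplification, not something from the Google SRE book: a team has a fixed weekly capacity, new recurring ops work arrives every week, and whatever time is left after ops goes into automation that permanently removes some recurring load. The parameter values are assumptions chosen so that the break-even point sits at exactly half of capacity; a real team’s break-even depends on its own numbers.

```python
def simulate(ops_load, capacity=100.0, new_work=2.5, payoff=0.05, weeks=52):
    """Return the weekly recurring ops load (hours) under a toy model.

    ops_load: recurring ops hours per week at the start
    capacity: total team hours per week
    new_work: recurring ops hours added each week by new demand
    payoff:   recurring hours removed per hour spent on automation
    """
    history = []
    for _ in range(weeks):
        # Only time not consumed by ops is available for automation.
        automation_hours = max(capacity - ops_load, 0.0)
        ops_load = max(ops_load + new_work - payoff * automation_hours, 0.0)
        history.append(ops_load)
    return history


if __name__ == "__main__":
    for start in (40.0, 60.0):
        final = simulate(start)[-1]
        print(f"start at {start:.0f}h of ops per week -> {final:.1f}h after a year")
```

In this model a team that starts below the break-even keeps freeing up more time, while a team that starts above it loses automation time every week until it has none left, which is the spiral described above.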

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help.  Stop writing automation that you can buy or re-use.  Get help running it.  Don’t waste time solving problems that other people have solved.  That may mean some upfront learning and investment but if you aren’t getting out of your own way then you’ll be run over.


(re)OpenStack for 2017 – board voting week starts this Monday

[1/19 Update: I placed 9th in the results (or 6th if you go only by popular vote instead of total votes).  There are 8 seats, so I was not elected.]

The OpenStack Project needs a course correction and I’m asking for your community vote to put me back on the 2017 Board to help drive it.  As a start-up CEO, I’m neutral, yet I also have the right technical, commercial and community influence to make this a reality.

Your support is critical because OpenStack fills a very real need and should have a solid future; however, it needs to adapt to market realities to achieve that.

I want the Board to acknowledge and adapt to stumbles in ecosystem success including being dropped or re-prioritized by key sponsors. This should include tightening the mission so the project can collaborate more freely with both open and proprietary platforms. In 2016, I’ve been deeply involved in OpenStack alternatives including Kubernetes and hybrid Cloud automation with Amazon and Google.

OpenStack must adjust to being one of several alternatives including AWS, Google and container platforms like Kubernetes.

That means focusing on our IaaS strengths and being unambiguous about core functions like SDN and storage integration. It also means ensuring that commercial members of the ecosystem can both profit and compete. The Board has both the responsibility and authority to make these changes if the members are willing to act.

What’s my background?  I’ve been an active and vocal member of the OpenStack community since the very beginning of the project especially around Operator and Product Management issues.  I was elected to the board four times and played critical roles including launching the DefCore efforts and pushing for more definition of the Big Tent concept (which I believe has hurt the project).

In a great field of candidates! As in other years, there are many very strong candidates whom I have worked with in a number of roles. I always recommend distributing your eight votes to multiple people and limiting “affinity voting” for your own company or geography. While all candidates would serve the board well, this year, I’d like to call attention specifically to Shamail Tahair as a candidate who has invested significant time in helping with Product Management and Enterprise Readiness for OpenStack.