OpenStack is caught in a snowstorm – it’s status quo for ops implementations to be snowflakes

OpenStack got into exactly the place we expected: operations started with fragmented and divergent data centers (aka snowflaked) and OpenStack did nothing to change that. Can we fix that? Yes, but the answer involves relying on Amazon as our benchmark.

In advance of my OpenStack Summit Demo/Presentation (video!) [slides], I’ve spent the last few weeks mapping seven (and counting) OpenStack implementations into the cloud provider subsystem of the Digital Rebar provisioning platform. Before I started working on adding OpenStack integration, RackN already created a hybrid DevOps baseline. We are able to run the same Kubernetes and Docker Swarm provisioning extensions on multiple targets including Amazon, Google, Packet and directly on physical systems (aka metal).

Before we talk about OpenStack challenges, it’s important to understand that data centers and clouds are messy, heterogeneous environments.

These variations are so significant and operationally challenging that they are the fundamental design driver for Digital Rebar. The platform uses a composable operational approach to isolate and then chain automation tasks together. That allows configurations, like networking, from infrastructure specific functions to be passed into common building blocks without user intervention.

Composability is critical because it allows operators to isolate variations into modular pieces and the expose common configuration elements. Since the pattern works successfully for crossing other clouds and metal, I anticipated success with OpenStack.

The challenge is that there is not “one standard OpenStack” implementation.  This issue is well documented under OpenStack as Project Shade.

If you only plan to operate a mono-cloud then these are not concerns; however, everyone I’ve met is using at least AWS and one other cloud. This operational fact means that AWS provides the common service behavior baseline. This is not an API statement – it’s about being able to operate on the systems delivered by the API.

While the OpenStack API worked consistently on each tested cloud (win for DefCore!), it frequently delivered systems that could not be deployed or were unusable for later steps.

While these are not directly OpenStack API concerns, I do believe that additional metadata in the API could help expose material configuration choices. The challenge becomes defining those choices in a reference architecture way. The OpenStack principle of leaving implementation choices open makes it challenging to drive these options to a narrow set of choices. Unfortunately, it means it is difficult to create an intra-OpenStack hybrid automation without hard-coded vendor identities or exploding configuration flags.

As series of individually reasonable options dominoes together to make to these challenges.  These are real issues that I made the integration difficult.

  • No default of externally accessible systems. I have to assign floating IPs (an anti-pattern for individual VMs) or be on the internal networks. No consistent naming pattern for networks, types (flavors) or starting images.  In several cases, the “private” network is the publicly accessible one and the “external” network is visible but unusable.
  • No consistent naming for access user accounts.  If I want to ssh to a system, I have to fail my first login before I learn the right user name.
  • No data to determine which networks provide which functions.  And there’s no metadata about which networks are public or private.  
  • Incomplete post-provisioning processes because they are left open to user customization.

There is a defensible and logical reason for each example above; sadly, those reasons do nothing to make OpenStack more operationally accessible.  While intra-OpenStack interoperability is helpful, I believe that ecosystems and users benefit from Amazon-like behavior.

What should you do?  Help broaden the OpenStack discussions to seek interoperability with the whole cloud ecosystem.

 

At RackN, we will continue to refine and adapt to these variations.  Creating a consistent experience that copes with variability is the raison d’etre for our efforts with Digital Rebar. That means that we ultimately use AWS as the yardstick for configuration of any infrastructure from physical, OpenStack and even Amazon!

 

SIG-ClusterOps: Promote operability and interoperability of Kubernetes clusters

Originally posted on Kubernetes Blog.  I wanted to repost here because it’s part of the RackN ongoing efforts to focus on operational and fidelity gap challenges early.  Please join us in this effort!

openWe think Kubernetes is an awesome way to run applications at scale! Unfortunately, there’s a bootstrapping problem: we need good ways to build secure & reliable scale environments around Kubernetes. While some parts of the platform administration leverage the platform (cool!), there are fundamental operational topics that need to be addressed and questions (like upgrade and conformance) that need to be answered.

Enter Cluster Ops SIG – the community members who work under the platform to keep it running.

Our objective for Cluster Ops is to be a person-to-person community first, and a source of opinions, documentation, tests and scripts second. That means we dedicate significant time and attention to simply comparing notes about what is working and discussing real operations. Those interactions give us data to form opinions. It also means we can use real-world experiences to inform the project.

We aim to become the forum for operational review and feedback about the project. For Kubernetes to succeed, operators need to have a significant voice in the project by weekly participation and collecting survey data. We’re not trying to create a single opinion about ops, but we do want to create a coordinated resource for collecting operational feedback for the project. As a single recognized group, operators are more accessible and have a bigger impact.

What about real world deliverables?

We’ve got plans for tangible results too. We’re already driving toward concrete deliverables like reference architectures, tool catalogs, community deployment notes and conformance testing. Cluster Ops wants to become the clearing house for operational resources. We’re going to do it based on real world experience and battle tested deployments.

Connect with us.

Cluster Ops can be hard work – don’t do it alone. We’re here to listen, to help when we can and escalate when we can’t. Join the conversation at:

The Cluster Ops Special Interest Group meets weekly at 13:00PT on Thursdays, you can join us via the video hangout and see latest meeting notes for agendas and topics covered.

2015 Container Review

It’s been a banner year for container awareness and adoption so we wanted to recap 2015.  For RackN, container acceleration is near to our heart because we both enable and use them in fundamental ways.   Look for Rob’s 2016 predictions on his blog.

The RackN team has truly deep and broad experience with containers in practical use.  In the summer, we delivered multiple container orchestration workloads including Docker Swarm, Kubernetes, Cloud Foundry, StackEngine and others.  In the fall, we refactored Digital Rebar to use Docker Compose with dramatic results.  And we’ve been using Docker since 2013 (yes, “way back”) for ops provisioning and development.

To make it easier to review that experience, we are consolidating a list of our container related posts for 2015.

General Container Commentary

RackN & Digital Rebar Related

Got some change? Build a datacenter ops lab on your coffee break [with Packet.net MaaS]

We’re using Packet.net hosted metal to test automation for private metal (video).  You can use discount code “RACKN100” to get a credit on Packet and try it yourself.

At RackN, we’ve been shrinking our scale deployment platform down to run faithfully on a desktop class system. Since we abstract the network and hardware complexity, you can build automation that scales to physical from as little as 16 Gb of RAM (the same size as Packet’s smaller server). That allows the exact same logic we use for an 80 node Ceph or Kubernetes cluster work on my 14” laptop.

In fact, we’ve been getting a bit obsessed with making a clean restart small and fast using containers, VMs and bootstrapping scripts.

Creating a remote test lab is part of this obsession because many rehearsals make great performances.  We wanted to eliminate the setup time and process for users who just want to experiment with a production grade deployment. Using Packet.net hosted metal and some Ansible scripts, we can build a complete HA Kubernetes cluster in about 15 minutes using VMs. This lets us iterate on Kubernetes best practices virtually since the “setup metal part” is handled abstractly by Digital Rebar.

Yawn. You could do the same in AWS. Why is that exciting?

The process for the lab system we build in Packet.net can then be used to provision a complete private infrastructure on metal including RAID, BIOS and server networking. Even though the lab uses VMs, we still do real networking, storage and configuration. For example, we can iterate building real software defined networking (SDN) overlays in this environment and then scale the work up to physical gear.

The provision and deploy time is so fast (generally, under 15 minutes) that we are using it as a clean environment for Dev and QA cycles on new automation. It’s also a very practical demo environment for these platforms because of the fidelity between this environment and an actual pilot. For me, that means spending $0.40 so I don’t have to sweat losing my work in process, battery life or my wifi connection to crank out a demo.

BTW… Packet.net servers are SUPER FAST. Even the small 16 Gb RAM machine is packed with SSDs and great connectivity.

If you are exploring any of the several workloads that we’ve been building (Docker Swarm, Kubernetes, Mesos, CloudFoundry, Ceph and OpenStack) or just playing around with API driven physical provisioning, we just made that work a little easier and a lot faster.

4 item OSCON report: no buzz winner, OpenStack is DownStack?, Free vs Open & the upstream imperative

Now that my PDX Trimet pass expired, it’s time to reflect on OSCON 2014.   Unfortunately, I did not ride my unicorn home on a rainbow; this year’s event seemed to be about raising red flags.

My four key observations:

  1. No superstar. Past OSCONs had at least one buzzy community super star.  2013 was Docker and 2011 was OpenStack.  This was not just my hallway track perception, I asked around about this specifically.  There was no buzz winner in 2014.
  2. People were down on OpenStack (“DownStack”). Yes, we did have a dedicated “Open Cloud Day” event but there was something missing.  OpenSack did not sponsor and there were no major parties or releases (compared to previous years) and little OpenStack buzz.  Many people I talked to were worried about the direction of the community, fragmentation of the project and operational readiness.  I would be more concerned about “DownStack” except that no open infrastructure was a superstar either (e.g.: Mesos, Kubernetes and CoreOS).  Perhaps, OSCON is simply not a good venue open infrastructure projects compared to GlueCon or Velocity?  Considering the rapid raise of container-friendly OpenStack alternatives; I think the answer may be that the battle lines for open infrastructure are being redrawn.
  3. Free vs. Open. Perhaps my perspective is more nuanced now (many in open source communities don’t distinguish between Free and Open source) but there’s a real tension between Free (do what you want) and Open (shared but governed) source.  Now that open source is a commercial darling, there is a lot of grumbling in the Free community about corporate influence and heavy handedness.   I suspect this will get louder as companies try to find ways to maintain control of their projects.
  4. Corporate upstreaming becomes Imperative. There’s an accelerating upstreaming trend for companies that write lots of code to support their primary business (Google is a primary example) to ensure that code becomes used outside their company.   They open their code and start efforts to ensure its adoption.  This requires a dedicated post to really explain.

There’s a clear theme here: Open source is going mainstream corporate.

We’ve been building amazing software in the open that create real value for companies.  Much of that value has been created organically by well-intentioned individuals; unfortunately, that model will not scale with the arrival for corporate interests.

Open source is thriving not dying: these companies value the transparency, collaboration and innovation of open development.  Instead, open source is transforming to fit within corporate investment and governance needs.  It’s our job to help with that metamorphosis.