Building Kubernetes-based, highly customizable environments on OpenStack with Kubespray

This talk was given on November 8 at the OpenStack Summit Sydney event.

Abstract

Kubespray (formerly Kargo) is a project under the Kubernetes community umbrella. From the technical side, it is a set of tools that makes it easy to deploy a production-ready Kubernetes cluster.

Kubespray supports multiple Linux distributions for hosting Kubernetes clusters (including Ubuntu, Debian, CentOS/RHEL and Container Linux by CoreOS) and multiple cloud providers as an underlay for the cluster deployment (AWS, DigitalOcean, GCE, Azure and OpenStack), together with the ability to use bare metal installations. It can use Docker or rkt as the container runtime for containerized workloads, along with a wide variety of networking plugins (Flannel, Weave, Calico and Canal), or the built-in cloud provider networking instead.
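
As a rough sketch of what driving such a deployment can look like (my illustration, not material from the talk), the Python below writes a minimal static Ansible inventory for a few pre-provisioned OpenStack VMs and then invokes Kubespray's cluster.yml playbook. The node addresses, inventory path and group names are assumptions based on the 2017-era Kubespray sample inventory; adjust them to your Kubespray version.

```python
#!/usr/bin/env python3
"""Sketch: drive a Kubespray deployment onto existing OpenStack VMs.

Assumptions (not from the talk): a checked-out Kubespray repo, three VM
addresses already provisioned on OpenStack, and the 2017-era inventory
group names (kube-master, kube-node, etcd).
"""
import subprocess
from pathlib import Path

NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]   # hypothetical VM addresses

def write_inventory(path: Path) -> None:
    # Minimal static inventory: first node is master + etcd, all nodes run workloads.
    lines = ["[kube-master]", NODES[0], "", "[etcd]", NODES[0], "", "[kube-node]"]
    lines += NODES
    lines += ["", "[k8s-cluster:children]", "kube-master", "kube-node"]
    path.write_text("\n".join(lines) + "\n")

def deploy(inventory: Path) -> None:
    # Run Kubespray's main playbook with the OpenStack cloud provider enabled.
    subprocess.run(
        ["ansible-playbook", "-i", str(inventory), "cluster.yml",
         "-b", "-e", "cloud_provider=openstack"],
        check=True,
    )

if __name__ == "__main__":
    inv = Path("inventory/mycluster/hosts.ini")   # hypothetical path
    inv.parent.mkdir(parents=True, exist_ok=True)
    write_inventory(inv)
    deploy(inv)
```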

In this talk we will describe the options for using Kubespray to build Kubernetes environments on OpenStack and how you can benefit from it.

What can I expect to learn?

Active Kubernetes community members Ihor Dvoretskyi and Rob Hirschfeld will highlight the benefits of running Kubernetes on top of OpenStack and describe how Kubespray can simplify cluster building and management for these use cases.

Complete presentation

Slides
https://www.slideshare.net/RackN/slideshelf

Speakers

Ihor Dvoretskyi

Ihor is a Developer Advocate at the Cloud Native Computing Foundation (CNCF), focused on upstream Kubernetes-related efforts. He acts as a product manager in the Kubernetes community, leading the Product Management Special Interest Group with the goal of growing Kubernetes as the #1 open source container orchestration platform.

Rob Hirschfeld

Rob Hirschfeld has been involved in OpenStack since the earliest days, with a focus on ops and building the infrastructure that powers cloud and storage. He’s also co-chair of the Kubernetes Cluster Ops SIG and a four-term OpenStack board member.


Deep Thinking & Tech + Great Guests – L8ist Sh9y podcast relaunched

I love great conversations about technology – especially ones where the answer is not very neatly settled into winners and losers (which is ALL of them in IT).  I’m excited that RackN has (re)launched the L8ist Sh9y (aka Latest Shiny) podcast around this exact theme.

Please check out the deep and thoughtful discussion I just had with Mark Thiele (notes) of Apcera, where we covered Mark’s thoughts on why public cloud will be under 20% of IT and tackled culture issues head on.

Spoiler: we have David Linthicum coming next, SO SUBSCRIBE.

I’ve been a guest on some great podcasts (Cloudcast, gcOnDemand, Datanauts, IBM Dojo, HPEFoodfight) and have deep respect for the critical work they do in the industry.

We feel there’s still room for deep discussions specifically around automated IT operations in cloud, data center and edge; consequently, we’re branching out to include deep interviews in addition to our initial stable of deeply technical IT Ops topics like Terraform, edge computing, a GartnerSYM review, Kubernetes and, of course, our own Digital Rebar.

Soundcloud Subscription Information


Go CI/CD and Immutable Infrastructure for Edge Computing Management

In our last post, we pretty much tore apart the idea of running mini-clouds on the edge because they are not designed to be managed at scale in resource-constrained environments without deep hardware automation.  While I’m a huge advocate of API-driven infrastructure, I don’t believe in a one-size-fits-all API, because a good API provides purpose-driven abstractions.

The logical extension is that having deep hardware automation means there’s no need for cloud (aka virtual infrastructure) APIs.  This is exactly what container-focused customers have been telling us at RackN in regular data centers so we’d expect the same to apply for edge infrastructure.

If “cloudification” is not the solution, then where should we look for management patterns?

We believe that software development CI/CD and immutable infrastructure patterns are well suited to edge infrastructure use cases.  We discussed this at a session at the OpenStack OpenDev Edge summit.

Continuous Integration / Continuous Delivery (CI/CD) software pipelines help to manage environments where the risk of making changes is significant by breaking the changes into small, verifiable units.  This is essential for edge because lack of physical access makes it very hard to mitigate problems.  Using CI/CD, especially with A/B testing, allows for controlled rolling distribution of new software.  

For example, in a 10,000 site deployment, the CI/CD infrastructure would continuously roll out updates and patches over the entire system.  Small incremental changes reduce the risk of a major flaw being introduced.  The effect is enhanced when changes are rolled slowly over the entire fleet instead of simultaneously rolled out to all sites (known as A/B or blue/green testing).  In the rolling deployment scenario, breaking changes can be detected and stopped before they have significant impacts.
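
A minimal sketch of that rolling pattern, assuming hypothetical push() and healthy() helpers rather than any real RackN or CI/CD tooling, might look like this:

```python
"""Sketch of a staged (A/B-style) rollout over many edge sites.

Illustrative only: push(), healthy() and the site list are hypothetical
stand-ins for whatever transport and health checks a real pipeline uses.
"""
from typing import List

def rollout(sites: List[str], image: str, cohort_size: int = 100,
            max_failure_rate: float = 0.02) -> bool:
    """Roll `image` out cohort by cohort; stop early if a cohort looks bad."""
    for start in range(0, len(sites), cohort_size):
        cohort = sites[start:start + cohort_size]
        for site in cohort:
            push(site, image)                      # deliver the new image
        failures = sum(1 for site in cohort if not healthy(site))
        if failures / len(cohort) > max_failure_rate:
            # A breaking change surfaced on a small slice of the fleet:
            # halt before it reaches the remaining sites.
            return False
    return True

def push(site: str, image: str) -> None:
    print(f"pushing {image} to {site}")            # hypothetical transport

def healthy(site: str) -> bool:
    return True                                    # hypothetical health check

if __name__ == "__main__":
    fleet = [f"site-{i:05d}" for i in range(10_000)]
    ok = rollout(fleet, image="edge-stack-v42")
    print("rollout completed" if ok else "rollout halted early")
```

The exact transport and health checks don't matter here; the pattern is the control flow: small cohorts, a health gate after each one, and an early stop before a breaking change reaches the rest of the fleet.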

These processes and the support software systems are already in place for large scale cloud software deployments.  There are likely gaps around physical proximity and heterogeneity; however, the process is there and initial use-case fit seems to be very good.

Immutable Infrastructure is a catch-all term for deployments based on images instead of configuration.  This concept is popular in cloud deployments where teams produce “golden” VM or container images that contain the exact version of software needed and are then provisioned with minimal secondary configuration.  In most cases, the images only need a small file injected (known as cloud-init) to complete the process.

In this immutable pattern, images are never updated post deployment; instead, instances are destroyed and recreated.  It’s a deploy, destroy, repeat process.  At RackN, we’ve been able to adapt Digital Rebar Provisioning to support this even at the hardware layer where images are delivered directly to disk and re-provisioning happens on a constant basis just like a cloud managing VMs.
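
Here is a minimal sketch of that cycle, assuming a hypothetical provisioning API (Instance, destroy, boot_from_image) rather than Digital Rebar's actual interfaces:

```python
"""Sketch of the immutable 'deploy, destroy, repeat' cycle (illustrative only)."""
from dataclasses import dataclass

CLOUD_INIT = """#cloud-config
hostname: {name}
runcmd:
  - systemctl start my-workload   # hypothetical service name
"""

@dataclass
class Instance:
    name: str
    image: str          # the golden image currently running

def redeploy(inst: Instance, new_image: str) -> Instance:
    # Never patch in place: tear the instance down and rebuild it from
    # the new golden image, injecting only a small cloud-init file.
    destroy(inst)
    return boot_from_image(inst.name, new_image,
                           user_data=CLOUD_INIT.format(name=inst.name))

def destroy(inst: Instance) -> None:
    print(f"destroying {inst.name} ({inst.image})")          # hypothetical API call

def boot_from_image(name: str, image: str, user_data: str) -> Instance:
    print(f"booting {name} from {image} with {len(user_data)} bytes of user-data")
    return Instance(name=name, image=image)

if __name__ == "__main__":
    fleet = [Instance(f"edge-{i:04d}", "golden-v41") for i in range(3)]
    fleet = [redeploy(inst, "golden-v42") for inst in fleet]
```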

The advantage of the immutable pattern is that we create a very repeatable and controlled environment.  Instead of trying to maintain elaborate configurations and bi-directional systems of record, we can simply reset whole environments.  In a CI/CD system, we constantly generate fresh images that are incrementally distributed through the environment.

Immutable Edge Infrastructure would mean building and deploying complete system images for our distributed environment.  Clearly, this requires moving around larger images than just pushing patches; however, these uploads can easily be staged and they provide critical repeatability in management.  The alternative is trying to keep track of which patches have been applied successfully to distributed systems.  Based on personal experience, having an atomic deliverable sounds very attractive.

CI/CD and Immutable patterns are deep and complex subjects that go beyond the scope of a single post; however, they also offer a concrete basis for building manageable data centers.

The takeaway is that we need to be looking first to scale distributed software management patterns to help build robust edge infrastructure platforms. Picking a cloud platform before we’ve figured out these concerns is a waste of time.

Previous 2 Posts on OpenStack Conference:

Post 1 – OpenStack on Edge? 4 Ways Edge is Distinct from Cloud
Post 2 – Edge Infrastructure is Not Just Thousands of Mini Clouds

Podcast: OpenStack OpenDev Highlights Edge vs Cloud Computing Confusion

Rob Hirschfeld provides his thoughts from last week’s OpenStack OpenDev conference focused on Edge Computing. This podcast is part of a three blog series from Rob on the issues surrounding Edge and Cloud computing:

Post 1 – OpenStack on Edge? 4 Ways Edge is Distinct from Cloud
Post 2 – Edge Infrastructure is Not Just Thousands of Mini Clouds

Edge Infrastructure is Not Just Thousands of Mini Clouds

I left the OpenStack OpenDev Edge Infrastructure conference with a lot of concerns relating to how to manage geographically distributed infrastructure at scale.  We’ve been asking similar questions at RackN as we work to build composable automation that can be shared and reused.  The critical need is to dramatically reduce site-specific customization in a way that still accommodates required variation – this is something we’ve made surprising advances on in Digital Rebar v3.1.
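
To make "reduce site-specific customization while accommodating required variation" concrete, here is a hedged sketch of the general idea, with invented parameter names and no relation to Digital Rebar's actual data model: every site consumes the same shared profile, and only a small, explicitly allowed set of values may vary per site.

```python
"""Sketch: shared automation with a narrow, validated per-site override set.

Illustrative only; the parameter names and allowed overrides are invented
for the example and are not Digital Rebar's schema.
"""
BASE_PROFILE = {
    "os_image": "ubuntu-16.04-golden",
    "ntp_servers": ["ntp.example.com"],
    "k8s_version": "1.8.x",
}

# The only knobs a site is allowed to change; everything else stays shared.
ALLOWED_OVERRIDES = {"ntp_servers", "site_vlan"}

def site_profile(site_id: str, overrides: dict) -> dict:
    unexpected = set(overrides) - ALLOWED_OVERRIDES
    if unexpected:
        # Reject drift: customization outside the contract is what makes
        # thousands of sites unmanageable.
        raise ValueError(f"{site_id}: unsupported overrides {sorted(unexpected)}")
    return {**BASE_PROFILE, "site_id": site_id, **overrides}

if __name__ == "__main__":
    print(site_profile("store-0421", {"site_vlan": 204}))
```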

These are very serious issues for companies like AT&T with 1000s of local exchanges, Walmart with 10,000s of in-store server farms or Verizon with 10,000s of coffee shop Wifi zones.  These workloads are not moving into centralized data centers.  In fact, with machine learning and IoT, we are expecting to see more and more distributed computing needs.

Running each site as a mini-cloud is clearly not the right answer.

While we do need the infrastructure to be easily API-addressable, adding cloud without fixing the underlying infrastructure management moves us in the wrong direction.  For example, AT&T‘s initial 100+ OpenStack deployments were not field-upgradable, which led to their efforts to deploy OpenStack on Kubernetes; however, that may have simply moved the upgrade problem to a different platform, because Kubernetes does not address the physical layer either!

There are multiple challenges here.  First, any scale infrastructure problem must be solved at the physical layer first.  Second, we must have tooling that brings repeatable automation processes to that layer.  It’s not sufficient to have deep control of a single site: we must be able to reliably distribute automation over thousands of sites with limited operational support and bandwidth.  These requirements are outside the scope of cloud-focused tools.

Containers and platforms like Kubernetes have a significant part to play in this story.  I was surprised that they were present only in a minor way at the summit.  The portability and light footprint of these platforms make them a natural fit for edge infrastructure.  I believe that lack of focus comes from the audience believing (incorrectly) that edge applications are not ready for container management.

With hardware layer control (which is required for edge), there is no need for a virtualization layer to provide infrastructure management.  In fact, “cloud” only adds complexity and cost for edge infrastructure when the workloads are containerized.  Our current cloud platforms are not designed to run in small environments and not designed to be managed in a repeatable way at thousands of data centers.  This is a deep architectural gap and not easily patched.

OpenStack sponsoring the edge infrastructure event got the right people in the room, but it also got in the way of discussing how we should be solving these operational challenges.  How should we be solving them?  In the next post, we’ll talk about management models that we should be borrowing for the edge…

Read 1st Post of 3 from OpenStack OpenDev: OpenStack on Edge? 4 Ways Edge is Distinct from Cloud

OpenStack on Edge? 4 Ways Edge Is Distinct From Cloud

Last week, I attended a unique OpenDev Edge Infrastructure focused event hosted by the OpenStack Foundation to help RackN understand the challenges customers are facing at the infrastructure edges.  We are exploring how the new lightweight, remote API-driven Digital Rebar Provision can play a unique role in these resource and management constrained environments.

I had also hoped the event would be part of the Foundation’s pivot towards being an “open infrastructure” community, which we’ve seen emerging as the semiannual conferences attract a broader set of open source operations technologies like Kubernetes, Ceph, Docker and SDN platforms.  As a past board member, I believe this is a healthy recognition of how the community uses a growing mix of open technologies in the data center and cloud.

It’s logical for the OpenStack community, especially the telcos, to be leaders in edge infrastructure; unfortunately, that too often seemed to mean trying to “square peg” OpenStack into every round hole at the edge.  For companies with a diverse solution portfolio, like RackN, being too myopic about using OpenStack to solve all problems keeps us from listening to the real use cases.  OpenStack has real utility, but there is no one-size-fits-all solution (and that goes for Kubernetes too).

By far the largest issue of the edge discussion was actually agreeing on what “edge” meant.  It seemed as if every session had a mandatory 50% overhead in defining terms.  I heard some very interesting attempts to define edge in terms of 1) resource constraints of the equipment, 2) proximity to data sources, or 3) bandwidth limitations to the infrastructure.  All of these are helpful ways to describe edge infrastructure.

Putting my usual operations spin on the problem, I choose to define edge infrastructure in data center management terms.  Edge infrastructure has very distinct challenges compared to hyperscale data centers.  

Here is my definition:

1) Edge is inaccessible by operators, so remote lights-out operation is required

2) Edge requires distributed scale management because there are many thousands of instances to be managed

3) Edge is heterogeneous because the breadth of environments and scale imposes variation

4) Edge has a physical location awareness component because proximity matters by design

These four items are hard operational management challenges.  They are also very distinct from traditional hyperscale data center operations issues, where we typically enjoy easy access, consolidated management, homogeneous infrastructure and equal network access.

In our next post, ….

What makes ops hard? SRE/DevOps challenge & imperative [from Cloudcast 301]

TL;DR: Operators (DevOps & SREs) have a hard job; we need to make time and room for them to redefine their jobs in a much more productive way.

The Cloudcast.net by Brian Gracely and Aaron Delp brings deep experience and perspective into their discussions based on their impressive technology careers and understanding of the subject matter.  Their podcasts go deep quickly with substantial questions that get to the heart of the issue.  This was my third time on the show (previous notes).

In episode 301, we go deeply into the meaning and challenges for Site Reliability Engineering (SRE) functions.  We also cover some popular technologies that are of general interest.

Author’s Note: For further information about SREs, listen to my discussion about “SRE vs DevOps vs Cloud Native” on the Datanauts podcast #89.  (transcript pending)

Here are my notes from Cloudcast 301, with bold added for emphasis:

  • 2:00 Rob defines SRE (more resources on RackN.com site).
    • 2:30 Google’s SRE book gave a name, and even changed the definition, to what I’ve been doing my whole career. The name evolved from being just about sites to a full-system perspective.
    • 3:30 SRE and DevOps are aligned at the core.  While DevOps is about process and culture, SRE is more about the function and “factory.”
    • 4:30 Developers don’t want to be shoving coal into the engine, but someone – the SREs – has to make sure that everything keeps running
  • 5:15 Brian asks about impedance mismatch between Dev and Ops.  How do we fix that?
    • 6:30 Rob talks about the crisis brewing from the operations innovation gap (link).  Digital Rebar is designed to create site-to-site automation so Operators can share repeatable best practices.
    • 7:30 OpenStack ran aground because we never created operational practices that could be repeated.  “Managed service as the required pattern is a failure of building good operational software.”
    • 8:00 RackN decomposes operations into isolated units so that individual changes don’t break the software on top

  • 9:20 Brian notes that the increasing rate of releases means that operations doesn’t have the skills to keep up with patching.
    • 10:10 That’s “underlay automation” and it’s even scarier because software is composed of all sorts of parts that have their own release cycles, which are not synchronized.
    • 11:30 We need to get system-level patch/security-update hygiene to be automatic
    • 12:20 This is really hard!

  • 13:00 Brian asks what are the baby steps?
    • 13:20 We have to find baby steps where there are nice clean boundaries at every layer, starting from the very most basic.  For RackN, that’s DHCP and PXE and then up to Kubernetes.
    • 15:15 Rob rants that renaming Ops teams as SRE is a failure because SRE has objectives like job equity that need to be included.
    • 16:00 Org silos have antibodies that get in the way of automation and make it difficult for SREs and DevOps to succeed.
    • 17:10 Those people have to be empowered to make change
    • 17:40 The existing tools must be pluggable or you are hurting operators.  There’s really no true greenfield, so we help people by making things work in existing data centers.
    • 19:00 Scripts may have technical debt, but that does not mean they should just be discarded.
    • 19:20 New and shiny does not equal better.  For example, Container Linux (aka CoreOS) does not solve all problems.
    • 20:10 We need to do better creating bridges between existing and new.
    • 20:40 How do we make Day 2 compelling?

  • 21:15 Brian asks about running OpenStack on Kubernetes.
    • 22:00 Rob is a fan of Kubernetes on metal, but really, we don’t want metal and VMs to be different.  That means that Kubernetes can be a universal underlay, which is threatening to OpenStack.
    • 23:00 This is no longer a JOKE: “Joint OpenStack Kubernetes Environments”
    • 23:30 Running things on Kubernetes (or OpenStack) is great because the abstractions hide complexity of infrastructure; however, at the physical layer you need something that exposes that complexity (which is what RackN does).

  • 25:00 Brian asks at what point do you need to get past the easy abstractions
    • 25:30 You want to never care ever.  But sometimes you need the information for special cases.
    • 26:20 We don’t want to make the core APIs complex just to handle the special cases.
    • 27:00 There’s still a class of people who need to care about hardware.  These needs should not be embedded into the Kubernetes (or OpenStack) API.

  • 28:00 Brian summarizes that we should not turn 1% use cases into complexity for everyone.  We need to foster the skill of coding for operators
    • 28:45 For SREs, getting Operators into coding & automation is essential.  That’s a key point in the 50% programming statement for SREs.
    • In the closing, Rob suggested checking out Digital Rebar Provision as a Cobbler replacement.

We’re very invested in talking about SRE and want to hear from you! How is your company transforming operations work to make it more sustainable, robust and human? We want to hear your stories and questions.

.IO! .IO! It’s off to a Service Mesh you should go [Gluecon 2017 notes]

TL;DR: If you are containerizing your applications, you need to be aware of this “service mesh” architectural pattern to help manage your services.

Gluecon turned out to be all about a microservice concept called a “service mesh,” which was being promoted by Buoyant with Linkerd and by IBM/Google/Lyft with Istio.  This class of services is a natural evolution of the rush to microservices and something that I’ve written about on TheNewStack in the past (on microservice technical architecture).

A service mesh is the result of having a dependency grid of microservices.  Since we’ve decoupled the application internally, we’ve created coupling between the services.  Hard-coding those relationships creates serious failure risks, so we need a service that intermediates the services.  This pattern has been widely socialized with this zipkin graphic (from Srdan Srepfler’s microservice anatomy presentation).

IMHO, it’s healthy to find service mesh architecturally scary.

One of the hardest things about scaling software is managing the dependency graph.  This challenge has been unavoidable from the early days of Windows “DLL Hell” to the mixed joy/terror of working with Ruby Gems, Python Pip and Node.js NPM.  We get tremendous acceleration from using external modules and services, but we also pay a price to manage those dependencies.

For microservice and Cloud Native designs, the service mesh is that dependency management price tag.

A service mesh is not just a service injected between services.  Its simplest function is to provide a reverse proxy so that multiple services can be consolidated under a single end-point.  That quickly leads to needing load balancers, discovery and encrypted back-end communication.  From there, we start thinking about circuit breaker patterns, advanced logging and A/B migrations.  Another important consideration is that service meshes are for internal services and are not end-user facing, which means layers of load balancers.
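
To make one of those functions concrete, here is a minimal circuit-breaker sketch (illustrative only; real meshes like Linkerd and Istio implement this in the proxy layer, not in application code): after a run of failures the breaker opens and fails fast rather than hammering a struggling downstream service.

```python
"""Minimal circuit-breaker sketch (illustrative; not Linkerd/Istio code)."""
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                # Fail fast: don't pile more load onto a struggling service.
                raise RuntimeError("circuit open")
            self.opened_at = None           # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0                   # a success closes the circuit again
        return result
```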

It’s easy to see how a service mesh becomes a very critical infrastructure component.

If you are working your way through containerization then these may seem like very advanced concepts that you can postpone learning.  That blissful state will not last for long and I highly suggest being aware of the pattern before your development teams start writing their own versions of this complex abstraction layer.  Don’t assume this is a development concern: the service mesh is deeply tied to infrastructure and operations.

The service mesh is one of those tricky dev/ops intersections and should be discussed jointly.

Has your team been working with a service mesh?  We’d love to hear your stories about it!


June 2 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content, please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo).

SRE Items of the Week

RackN and our Co-Founder and CEO Rob Hirschfeld openly called for a significant change to the OpenStack and Kubernetes communities in his VMBlog.com post, How is OpenStack so dead AND yet so very alive to SREs?

“We’re going to keep solving problems in and around the OpenStack community.  I’m excited to see the Foundation embracing that mission.  There are still many hard decisions to make.  For example, I believe that Kubernetes as an underlay is compelling for operators and will drive the OpenStack code base into a more limited role as a Kubernetes workload (check out my presentation about that at Boston).  While that may refocus the coding efforts, I believe it expands the relevance of the open infrastructure community we’ve been building.

Building infrastructure software is hard and complex.  It’s better to do it with friends so please join me in helping keep these open operations priorities very much alive.”

To provide more information on this idea, Rob posted a new blog, OpenStack’s Big Pivot: our suggestion to drop everything and focus on being a Kubernetes VM management workload.

“Sometimes paradigm changes demand a rapid response, and I believe unifying OpenStack services under Kubernetes has become such an urgent priority that we must freeze all other work until this effort has been completed.”

This proposal has drawn significantly more readership than a typical RackN blog post, as well as attention on social media, so Rob has published a second post to further the proposal: (re)Finding an Open Infrastructure Plan: Bridging OpenStack & Kubernetes.

“It’s essential to solve these problems in an open way so that we can work together as a community of operators.”

As you would expect, RackN is very interested in your thoughts on this proposal and its impact, not only on the OpenStack and Kubernetes communities but also on how it can transform the ability of IT infrastructure teams to deploy complex technologies in a reliable and scalable manner.

Please contact @zehicle and @rackngo to join the conversation.
_____________

Using Containers and Kubernetes to Enable the Development of Object-Oriented Infrastructure: Brendan Burns GlueCon Presentation

Is SRE a Good Term?
Interview with Rob Hirschfeld (RackN) and Charity Majors (Honeycomb) at Gluecon 2017


_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events, please email info@rackn.com.

Velocity : June 19 – 20 in San Jose, CA

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) Issue #74

(re)Finding an Open Infrastructure Plan: Bridging OpenStack & Kubernetes

TL;DR: infrastructure operations is hard and we need to do a lot more to make these systems widely accessible, easy to sustain and lower risk.  We’re discussing these topics on twitter…please join in.  Themes include “do we really have consensus and will to act” and “already a solved problem” and “this hurts OpenStack in the end.”  

I am always looking for ways to explain (and solve!) the challenges that we face in IT operations and open infrastructure.  I’ve been writing a lot about my concern that data center automation is not keeping pace and is causing technical debt.  That concern led to my recent SRE blogging for RackN.

It’s essential to solve these problems in an open way so that we can work together as a community of operators.

It feels like developers are quick to rally around open platforms and tools, while operators tend to be tightly coupled to vendor solutions because operational work is tightly coupled to infrastructure.  From that perspective, I’ve been very involved in the OpenStack and Kubernetes open source infrastructure platforms because I believe they create communities where we can work together.

This week, I posted connected items on VMblog and RackN that lay out a position where we bring these communities together.

Of course, I do have a vested interest here.  Our open underlay automation platform, Digital Rebar, was designed to address a missing layer of physical and hybrid automation under both of these projects.  We want to help accelerate these technologies by helping deliver shared best practices via software.  The stack is additive – let’s build it together.

I’m very interested in hearing from you about these ideas here or in the context of the individual posts.  Thanks!