Full Metal DevOps: 12 things we needed beyond Cobbler

Almost a manifesto!

Rob Hirschfeld

The RackN team did not plan to replace Cobbler, we just needed something that responded to our need for full-cycle cross-platform DevOps automation.

Provisioning an O/S is never enough!  You need to coordinate a lot of operational activity to deploy a multi-node system, like OpenStack, Kubernetes, Docker Swarm or Ceph.  Since we believe an automated upgrade path is also required, there is a huge gap in provisioning.

So what was needed?  Here’s our (rather long!) list of gaps to fill for full Metal DevOps provisioning:

GapCommentary
1Needs to work with Cobbler!Improve? Yes.  Disrupt?  Hell No!  It has to be OK to leave Cobbler in place while we do something better.  I’d be OK to tweak my Cobber to point it to the new stuff.
2REST API & JSON CLIBeyond the obvious API, we really want a way to write scripts that drive deployment proactively.
3Modular ComponentsIf…

View original post 358 more words

Smaller Nodes? Just the Right Size for Docker!

Container workloads have the potential to redefine how we think about scale and hosted infrastructure.

Last Fall, Ubiquity Hosting and RackN announced a 200 node Docker Swarm cluster as a phase one of our collaboration. Unlike cloud-based container workloads demonstrations, we chose to run this cluster directly on the bare metal.  

Why bare metal instead of virtualized? We believe that metal offers additional performance, availability and control.  

With the cluster automation ready, we’re looking for customers to help us prove those assumptions. While we could simply build on many VMs, our analysis is the a lot of smaller nodes will distribute work more efficiently. Since there is no virtualization overhead, lower RAM systems can still give great performance.

The collaboration with RackN allows us to offer customers a rapid, repeatable cluster capability. Their Digital Rebar automation works on a broad spectrum of infrastructure allow our users to rehearse deployments on cloud, quickly change components and iteratively tune the cluster.

We’re finding that these dedicated metal nodes have much better performance than similar VMs in AWS?  Don’t believe us – you can use Digital Rebar to spin up both and compare.   Since Digital Rebar is an open source platform, you can explore and expand on it.

The Docker Swarm deployment is just a starting point for us. We want to hear your provisioning ideas and work to turn them into reality.

2015 Container Review

It’s been a banner year for container awareness and adoption so we wanted to recap 2015.  For RackN, container acceleration is near to our heart because we both enable and use them in fundamental ways.   Look for Rob’s 2016 predictions on his blog.

The RackN team has truly deep and broad experience with containers in practical use.  In the summer, we delivered multiple container orchestration workloads including Docker Swarm, Kubernetes, Cloud Foundry, StackEngine and others.  In the fall, we refactored Digital Rebar to use Docker Compose with dramatic results.  And we’ve been using Docker since 2013 (yes, “way back”) for ops provisioning and development.

To make it easier to review that experience, we are consolidating a list of our container related posts for 2015.

General Container Commentary

RackN & Digital Rebar Related

From Start to Scale: learn faster with heterogenous deployments

Why mix VMs and Physical? Having a consistent deploy approach can dramatically speed learning cycles that result in better scale ops. I would never deploy production OpenStack on VMs but I strongly recommend rehearsing that deployment on VMs hundreds of times before I touch metal.

Over the last two months, the RackN team redefined “heterogeneous” infrastructure in Digital Rebar from being “just” multi-vendor hardware to include any server resource from containers and Vagrant/Virtualbox to clouds like AWS or Packet. To support this truly diverse range, there were both technical and operational challenges to overcome.

The technical challenge rises from the fundamental control differences between cloud and physical infrastructure. In cloud, infrastructure is much more prescribed – you cannot change most aspects of your system and especially not your network interfaces or IPs. To provision hardware efficiently, we had to establish control over the very things that Cloud systems manage for you. 

That management diversity exercised the full extent of the Digital Rebar “functional ops” architecture.

Over the last year, we’ve been unwinding baked-in control assumptions from earlier versions of Digital Rebar. That added flexibility allows Digital Rebar to mix control APIs for infrastructure ranging from using Cobbler to Docker, Vagrant and AWS. Since we could already cope with heterogeneous control APIs using Digital Rebar’s unique functional ops design, we retained the ability to mix and match container, virtual and physical infrastructure.

The operational challenge was more subtle. We were motivated to make this change by first hand observations of the fidelity gap. I am a strong believer that container platforms will directly target metal in the next two years. The challenge is how do we get there from our current virtualization-focused infrastructure.

It’s easy to look at the completed work as an obvious step forward. Looking over my shoulder, I know that it took years of learning and perseverance to create a platform that was flexible enough to handle both extremes of control. Even more important was understanding why it was so important for a physical scale deployment platform to provide ops fidelity for developers too.

With the infrastructure work behind us, we’re seeing Digital Rebar deliver real operational transformation. We want to help IT embrace containers and immutable infrastructure without having to discard the hard won battles installing cloud and traditional infrastructure. Most critically, we hope that you’ll join our open community and share your operational journey with us.

Faster, Simpler AND Smaller – Immutable Provisioning with Docker Compose!

Nearly 10 TIMES faster system resets – that’s the result of fully enabling an multi-container immutable deployment on Digital Rebar.

Docker ComposeI’ve been having a “containers all the way down” month since we launched Digital Rebar deployment using Docker Compose. I don’t want to imply that we rubbed Docker on the platform and magic happened. The RackN team spent nearly a year building up the Consul integration and service wrappers for our platform before we were ready to fully migrate.

During the Digital Rebar migration, we took our already service-oriented code base and broke it into microservices. Specifically, the Digital Rebar parts (the API and engine) now run in their own container and each service (DNS, DHCP, Provisioning, Logging, NTP, etc) also has a dedicated container. Likewise, supporting items like Consul and PostgreSQL are, surprise, managed in dedicated containers too. All together, that’s over nine containers and we continue to partition out services.

We use Docker Compose to coordinate the start-up and Consul to wire everything together. Both play a role, but Consul is the critical glue that allows Digital Rebar components to find each other. These were not random choices. We’ve been using a Docker package for over two years and using Consul service registration as an architectural choice for over a year.

Service registration plays a major role in the functional ops design because we’ve been wrapping datacenter services like DNS with APIs. Consul is a separation between providing and consuming the service. Our previous design required us to track the running service. This worked until customers asked for pluggable services (and every customer needs pluggable services as they scale).

Besides being a faster to reset the environment, there are several additional wins:

  1. more transparent in how it operates – it’s obvious which containers provide each service and easy to monitor them as individuals.
  2. easier to distribute services in the environment – we can find where the service runs because of the Consul registration, so we don’t have to manage it.
  3. possible to have redundant services – it’s easy to spin up new services even on the same system
  4. make services pluggable – as long as the service registers and there’s an API, we can replace the implementation.
  5. no concern about which distribution is used – all our containers are Ubuntu user space but the host can be anything.
  6. changes to components are more isolated – changing one service does not require a lot of downloading.

Docker and microservices are not magic but the benefits are real. Be prepared to make architectural investments to realize the gains.

Got some change? Build a datacenter ops lab on your coffee break [with Packet.net MaaS]

We’re using Packet.net hosted metal to test automation for private metal (video).  You can use discount code “RACKN100” to get a credit on Packet and try it yourself.

At RackN, we’ve been shrinking our scale deployment platform down to run faithfully on a desktop class system. Since we abstract the network and hardware complexity, you can build automation that scales to physical from as little as 16 Gb of RAM (the same size as Packet’s smaller server). That allows the exact same logic we use for an 80 node Ceph or Kubernetes cluster work on my 14” laptop.

In fact, we’ve been getting a bit obsessed with making a clean restart small and fast using containers, VMs and bootstrapping scripts.

Creating a remote test lab is part of this obsession because many rehearsals make great performances.  We wanted to eliminate the setup time and process for users who just want to experiment with a production grade deployment. Using Packet.net hosted metal and some Ansible scripts, we can build a complete HA Kubernetes cluster in about 15 minutes using VMs. This lets us iterate on Kubernetes best practices virtually since the “setup metal part” is handled abstractly by Digital Rebar.

Yawn. You could do the same in AWS. Why is that exciting?

The process for the lab system we build in Packet.net can then be used to provision a complete private infrastructure on metal including RAID, BIOS and server networking. Even though the lab uses VMs, we still do real networking, storage and configuration. For example, we can iterate building real software defined networking (SDN) overlays in this environment and then scale the work up to physical gear.

The provision and deploy time is so fast (generally, under 15 minutes) that we are using it as a clean environment for Dev and QA cycles on new automation. It’s also a very practical demo environment for these platforms because of the fidelity between this environment and an actual pilot. For me, that means spending $0.40 so I don’t have to sweat losing my work in process, battery life or my wifi connection to crank out a demo.

BTW… Packet.net servers are SUPER FAST. Even the small 16 Gb RAM machine is packed with SSDs and great connectivity.

If you are exploring any of the several workloads that we’ve been building (Docker Swarm, Kubernetes, Mesos, CloudFoundry, Ceph and OpenStack) or just playing around with API driven physical provisioning, we just made that work a little easier and a lot faster.

DNS is critical – getting physical ops integrations right matters

Rob Hirschfeld

Why DNS? Maintaining DNS is essential to scale ops.  It’s not as simple as naming servers because each server will have multiple addresses (IPv4, IPv6, teams, bridges, etc) on multiple NICs depending on the systems function and applications. Plus, Errors in DNS are hard to diagnose.

Names MatterI love talking about the small Ops things that make a huge impact in quality of automation.  Things like automatically building a squid proxy cache infrastructure.

Today, I get to rave about the DNS integration that just surfaced in the OpenCrowbar code base. RackN CTO, Greg Althaus, just completed work that incrementally updates DNS entries as new IPs are added into the system.

Why is that a big deal?  There are a lot of names & IPs to manage.

In physical ops, every time you bring up a physical or virtual network interface, you are assigning at least one IP to that interface. For…

View original post 297 more words