Unknown's avatar

About Rob H

A Baltimore transplant to Austin, Rob thinks about ways of building scale infrastructure for the clouds using Agile processes. He sat on the OpenStack Foundation board for four years. He co-founded RackN enable software that creates hyperscale converged infrastructure.

Operators, they don’t want to swim Upstream

Operators Dinner 11/10

Nov 10, Palo Alto Operators Dinner

Last Tuesday, I had the honor of joining an OpenStack scale operators dinner. Foundation executives, Jonathan Bryce and Lauren Sell, were also on the guest list so talk naturally turned to “how can OpenStack better support operators.” Notably, the session was distinctly not OpenStack bashing.

The conversation was positive, enthusiastic and productive, but one thing was clear: the OpenStack default “we’ll fix it in the upstream” answer does not work for this group of operators.

What is upstreaming?  A sans nuance answer is that OpenStack drives fixes and changes in the next community release (longer description).  The project and community have a tremendous upstream imperative that pervades the culture so deeply that we take it for granted.  Have an issue with OpenStack?  Submit a patch!  Is there any other alternative?

Upstreaming [to trunk] makes perfect sense considering the project vendor structure and governance; however, it is a very frustrating experience for operators.   OpenStack does have robust processes to backport fixes and sustain past releases and documentation; yet, the feeling at the table was that they are not sufficiently operator focused.

Operators want fast, incremental and pragmatic corrections to the code and docs they are deploying (which is often two releases back).  They want it within the community, not from individual vendors.

There are great reasons for focusing on upstream trunk.  It encourages vendors to collaborate and makes it much easier to add and expand the capabilities of the project.  Allowing independent activity on past releases creates a forward integration mess and could make upgrades even harder.  It will create divergence on APIs and implementation choices.

The risk of having a stable, independently sustained release is that operators have less reason to adopt the latest shiny release.  And that is EXACTLY what they are asking for.

Upstreaming is a core value to OpenStack and essential to our collaborative success; however, we need to consider that it is not the right answer to all questions.  Discussions at that dinner reinforced that pushing everything to latest trunk creates a significant barrier for OpenStack operators and users.

What are your experiences?  Is there a way to balance upstreaming with forking?  How can we better serve operators?

More Signal & Less Noise: my OpenStack Tokyo Restrospective

We’re building real business on OpenStack. This seems especially true in Asia where the focus is on using the core not expanding it. At the same time, we’ve entered the “big tent” era where non-core projects are proliferating.

Let’s explore what’s signal and what’s noise … but before we start, here are quick links to my summit videos:

wpid-20151030_100229.jpg

OpenStack summits are really family reunions. While aunt and uncles (Vendors) are busying showing off, all the cousins (Projects) are getting re-acquainted. Like any family it’s fun, competitive, friendly and sometimes dysfunctional.

Signal: Global Users and Providers

There are real deployments of OpenStack and real companies building businesses around the code base. I’ve been pleasantly surprised by the number of people quietly making OpenStack work. Why quietly? It’s still more of a struggle than it should be.

Signal: Demand for Heterogeneous and Interop Environments

There’s no such thing as a mono-lithic cloud. Even within the community, Monty’s Shade API normalizer, is drawing attention. More broadly, everyone is using multiple cloud platforms and the trend accelerates due to container portability.

Signal: Container Workloads

Containers are dominating the cloud discussion for good reasons. They are pushing into OpenStack at the top (Platform), bottom (Deployment) and side (Scheduler). While OpenStack must respond architecturally, it’s not clear yet if it can pivot from virtualization focused to something broader fast enough (Mesos?).

Signal: Ansible

The lightweight DevOps tool seems to be winning the popularity contest. It may not be the answer to all problems, but it’s clearly part of helping solve a lot of them. Warning: Ansible complexity explodes on multi-tiered, scale and upgrade orchestration.

Signal: DefCore and Product Working Group (PWG)

Both efforts have crossed from a concept into decision-making bodies within the community. The work is far from over. DefCore needs users to demand compliance from vendors. Product WG needs developers to demand their management sign on to PWG roadmaps.

Noise: Distro vs Service Argument

There are a lot of ways to consume OpenStack. None of them are wrong but some are more aligned with individual vendor strategies than others. Saying one way to run OpenStack is more right is undermining our overall operability and usability objectives.

Noise: Contributor Metrics

We’ve created a very commit economy and summits are vendors favorite time to brag about their dedication to community via upstreaming. These metrics are incomplete at best and potentially destructive to the health of the project as vendors compete to win the commit race instead of the quality and ecosystem race.

Noise: Big Tent

We’ve officially entered the “big tent” era of OpenStack. This governance change was lead by the Technical Committee to address how we manage projects; however, there are broader user, operator, vendor and ecosystem implications. Unfortunately, even within the community, the platform implications of a loosely governed, highly inclusive community not completely understood.

Overall, I left Tokyo enthusiastic about OpenStack’s future as a platform and community; however, I also see that we have not structured how we mingle platform, community and ecosystem. This is especially true because OpenStack is just a part in the much broader cloud market and work outside OpenStack is continue to disrupt our plans. As a Board member, I’ll hoping to start a discussion about this and want to hear your opinions.

To avoid echo chamber, OpenStack must embrace competitive cloud ecosystem

wpid-20151023_100533.jpg
Japanese Bullet Train View

I was in Japan before the Tokyo summit on a bullet train to Kyoto watching the mix of heavy industry and bucolic mountains pass by. That scene reflects an OpenStack duality: we want to be both a dominant platform delivering core cloud services and an open source values driven collective.

First, I fundamentally believe in the success of OpenStack as the open virtual infrastructure management platform.

I believe that we have solved the virtual compute/storage/network problem sufficiently to become the de facto open IaaS platform. While not perfect, the technologies are sufficient assuming we continue to improve ease of use and operational hardening. Pursing that base capability is my primary motivation for DefCore work.

I don’t believe that the OpenStack community is, or should try to become, the authority on “all things cloud.”

In the presence of Amazon, VMware, Microsoft and Google, we cannot make that claim with any degree of self-respect. Even newcomers like DigitalOcean have an undeniable footprint and influence. Those vendor platforms drive cloud ecosystems and technologies which foster fast innovation because there is no friction to joining their ecosystems and they are sufficiently large and stable enough to represent a target market. We’ve seen clear signs from Rackspace, HP and others that platform diversity improves cloud strength.

I continue to think we (OpenStack) spend too much time evaluating what is “in” or “out” of the project and too little time talking about what’s “on,” “under” and “with” the project like Kubernetes, Mesos, Docker, SDN, Hadoop and Ceph. That type of thinking creates distance between OpenStack efforts and the majority of the market.

What motivates the drive to an all open captive community? It’s the reasonable concern that critical parts of the infrastructure will become pay-to-play. For example, what if a non-OpenStack alternative to Heat Orchestration gained popularity for OpenStack implementers. Perhaps something that ran on Amazon also. That would create external pressure that would drive internal priorities. These “non-OpenStack” products would then have influence without having to contribute back to upstream.

Can we afford to have external entities driving internal priorities? Hell yes, that’s what customer adoption looks like.

OpenStack does not own the market sufficiently to create cloud echo chamber. The next wave of cloud innovation (my money is on container platforms) will follow the path of least resistance and widest adoption. We need to embrace that these innovations will not all be inside our community so that we can welcome them as part of our ecosystem. The community needs to find peace with that.

Composable Infrastructure – is there a balance between configurable and complexity?

Recently, I was part of a Composable Infrastructure briefing hosted by the enterprise side of HP and moderated by Tim Crawford. The event focused on a dynamically (re)configurable hardware. This type of innovation points to new generation of hardware with exciting capabilities; however, my takeaway is that we need more focus on the system level challenges of operations when designing units of infrastructure.  I’d like to see more focus on “composable ops” than “composable hardware.”

What is “composable infrastructure?”

wpid-20151030_170354.jpgBasically, these new servers use a flexible interconnect bus between CPU/RAM/Disk/Network components that allows the operators to connect chassis’ internal commodities together on-the-fly. Conceptually, this allows operators to build fit-to-purpose servers via a chassis API. For example, an operator could build a 4-CPU high RAM system for VMs on even days and then reconfigure it as a single socket four drive big data node on the odd ones. That means we have to block out three extra CPUs and drives for this system even though it only needs them 50% of the time.

I’ve seen designs like this before. They are very cool to review but not practical in scale operations.

My first concern is the inventory packing problem. You cannot make physical resources truly dynamic because they have no option for oversubscription. Useful composable infrastructure will likely have both a lot of idle capacity to service requests and isolated idle resource fragments from building ad hoc servers. It’s a classic inventory design trade-off between capacity and density.

The pro-composable argument is that data centers already have a lot of idle capacity. While composable designs could allow tighter packing of resources; this misses the high cost of complexity in operations.

My second concern is complexity.  What do scale operators want? Looking at Facebook’s open compute models, they operators want efficient, cheap and interchangeable parts. Scale operators are very clear: heterogeneity is expensive inside of a single infrastructure. It’s impossible to create a complete homogeneous data center so operators must be able to cope with variation. As a rule, finding ways to limit variation increases predictability and scale efficiency.

What about small and mid-tier operators? If they need purpose specific hardware then they will buy it. For general purpose tasks, virtual machines already offer composable flexibility and are much more portable. The primary lesson is that operational costs, both hard and soft, are much more significant than fractional improvements from optimized hardware utilization.

I’m a believer that tremendous opportunities are created by hardware innovation. Composable hardware raises interesting questions; however, I get much more excited by open compute (OCP) designs because OCP puts operational concerns first.

From Start to Scale: learn faster with heterogenous deployments

Why mix VMs and Physical? Having a consistent deploy approach can dramatically speed learning cycles that result in better scale ops. I would never deploy production OpenStack on VMs but I strongly recommend rehearsing that deployment on VMs hundreds of times before I touch metal.

Over the last two months, the RackN team redefined “heterogeneous” infrastructure in Digital Rebar from being “just” multi-vendor hardware to include any server resource from containers and Vagrant/Virtualbox to clouds like AWS or Packet. To support this truly diverse range, there were both technical and operational challenges to overcome.

The technical challenge rises from the fundamental control differences between cloud and physical infrastructure. In cloud, infrastructure is much more prescribed – you cannot change most aspects of your system and especially not your network interfaces or IPs. To provision hardware efficiently, we had to establish control over the very things that Cloud systems manage for you. 

That management diversity exercised the full extent of the Digital Rebar “functional ops” architecture.

Over the last year, we’ve been unwinding baked-in control assumptions from earlier versions of Digital Rebar. That added flexibility allows Digital Rebar to mix control APIs for infrastructure ranging from using Cobbler to Docker, Vagrant and AWS. Since we could already cope with heterogeneous control APIs using Digital Rebar’s unique functional ops design, we retained the ability to mix and match container, virtual and physical infrastructure.

The operational challenge was more subtle. We were motivated to make this change by first hand observations of the fidelity gap. I am a strong believer that container platforms will directly target metal in the next two years. The challenge is how do we get there from our current virtualization-focused infrastructure.

It’s easy to look at the completed work as an obvious step forward. Looking over my shoulder, I know that it took years of learning and perseverance to create a platform that was flexible enough to handle both extremes of control. Even more important was understanding why it was so important for a physical scale deployment platform to provide ops fidelity for developers too.

With the infrastructure work behind us, we’re seeing Digital Rebar deliver real operational transformation. We want to help IT embrace containers and immutable infrastructure without having to discard the hard won battles installing cloud and traditional infrastructure. Most critically, we hope that you’ll join our open community and share your operational journey with us.

Faster, Simpler AND Smaller – Immutable Provisioning with Docker Compose!

Nearly 10 TIMES faster system resets – that’s the result of fully enabling an multi-container immutable deployment on Digital Rebar.

Docker ComposeI’ve been having a “containers all the way down” month since we launched Digital Rebar deployment using Docker Compose. I don’t want to imply that we rubbed Docker on the platform and magic happened. The RackN team spent nearly a year building up the Consul integration and service wrappers for our platform before we were ready to fully migrate.

During the Digital Rebar migration, we took our already service-oriented code base and broke it into microservices. Specifically, the Digital Rebar parts (the API and engine) now run in their own container and each service (DNS, DHCP, Provisioning, Logging, NTP, etc) also has a dedicated container. Likewise, supporting items like Consul and PostgreSQL are, surprise, managed in dedicated containers too. All together, that’s over nine containers and we continue to partition out services.

We use Docker Compose to coordinate the start-up and Consul to wire everything together. Both play a role, but Consul is the critical glue that allows Digital Rebar components to find each other. These were not random choices. We’ve been using a Docker package for over two years and using Consul service registration as an architectural choice for over a year.

Service registration plays a major role in the functional ops design because we’ve been wrapping datacenter services like DNS with APIs. Consul is a separation between providing and consuming the service. Our previous design required us to track the running service. This worked until customers asked for pluggable services (and every customer needs pluggable services as they scale).

Besides being a faster to reset the environment, there are several additional wins:

  1. more transparent in how it operates – it’s obvious which containers provide each service and easy to monitor them as individuals.
  2. easier to distribute services in the environment – we can find where the service runs because of the Consul registration, so we don’t have to manage it.
  3. possible to have redundant services – it’s easy to spin up new services even on the same system
  4. make services pluggable – as long as the service registers and there’s an API, we can replace the implementation.
  5. no concern about which distribution is used – all our containers are Ubuntu user space but the host can be anything.
  6. changes to components are more isolated – changing one service does not require a lot of downloading.

Docker and microservices are not magic but the benefits are real. Be prepared to make architectural investments to realize the gains.

A year of RackN – 9 lessons from the front lines of evangalizing open physical ops

Let’s avoid this > “We’re heading right at the ground, sir!  Excellent, all engines full power!

another scale? oars & motors. WWF managing small scale fisheries

RackN is refining our from “start to scale” message and it’s also our 1 year anniversary so it’s natural time for reflection. While it’s been a year since our founders made RackN a full time obsession, the team has been working together for over 5 years now with the same vision: improve scale datacenter operations.

As a backdrop, IT-Ops is under tremendous pressure to increase agility and reduce spending.  Even worse, there’s a building pipeline of container driven change that we are still learning how to operate.

Over the year, we learned that:

  1. no one has time to improve ops
  2. everyone thinks their uniqueness is unique
  3. most sites have much more in common than is different
  4. the differences between sites are small
  5. small differences really do break automation
  6. once it breaks, it’s much harder to fix
  7. everyone plans to simplify once they stop changing everything
  8. the pace of change is accelerating
  9. apply, rinse, repeat with lesson #1

Where does that leave us besides stressed out?  Ops is not keeping up.  The solution is not to going faster: we have to improve first and then accelerate.

What makes general purpose datacenter automation so difficult?  The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is ordering of interdependencies.  This is especially true on physical systems where you have to really grok* networking.

The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.

What does that heterogeneous-first automation look like? First, we’ve learned that to adapt to customer datacenters. That means using the DNS, DHCP and other services that you already have in place. And dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.

This was hard and tested both our patience and architecture pattern. It would be much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” about forcing uniformity and built software that accepts variation.

So what did we do with a year?  We had to spend a lot of time listening and learning what “real operations” need.   Then we had to create software that accommodated variation without breaking downstream automation.  Now we’ve made it small enough to run on a desktop or cloud for sandboxing and a new learning cycle begins.

We’d love to have you try it out: rebar.digital.

* Grok is the correct work here.  Thinking that you “understand networking” is often more dangerous when it comes to automation.

Got some change? Build a datacenter ops lab on your coffee break [with Packet.net MaaS]

We’re using Packet.net hosted metal to test automation for private metal (video).  You can use discount code “RACKN100” to get a credit on Packet and try it yourself.

At RackN, we’ve been shrinking our scale deployment platform down to run faithfully on a desktop class system. Since we abstract the network and hardware complexity, you can build automation that scales to physical from as little as 16 Gb of RAM (the same size as Packet’s smaller server). That allows the exact same logic we use for an 80 node Ceph or Kubernetes cluster work on my 14” laptop.

In fact, we’ve been getting a bit obsessed with making a clean restart small and fast using containers, VMs and bootstrapping scripts.

Creating a remote test lab is part of this obsession because many rehearsals make great performances.  We wanted to eliminate the setup time and process for users who just want to experiment with a production grade deployment. Using Packet.net hosted metal and some Ansible scripts, we can build a complete HA Kubernetes cluster in about 15 minutes using VMs. This lets us iterate on Kubernetes best practices virtually since the “setup metal part” is handled abstractly by Digital Rebar.

Yawn. You could do the same in AWS. Why is that exciting?

The process for the lab system we build in Packet.net can then be used to provision a complete private infrastructure on metal including RAID, BIOS and server networking. Even though the lab uses VMs, we still do real networking, storage and configuration. For example, we can iterate building real software defined networking (SDN) overlays in this environment and then scale the work up to physical gear.

The provision and deploy time is so fast (generally, under 15 minutes) that we are using it as a clean environment for Dev and QA cycles on new automation. It’s also a very practical demo environment for these platforms because of the fidelity between this environment and an actual pilot. For me, that means spending $0.40 so I don’t have to sweat losing my work in process, battery life or my wifi connection to crank out a demo.

BTW… Packet.net servers are SUPER FAST. Even the small 16 Gb RAM machine is packed with SSDs and great connectivity.

If you are exploring any of the several workloads that we’ve been building (Docker Swarm, Kubernetes, Mesos, CloudFoundry, Ceph and OpenStack) or just playing around with API driven physical provisioning, we just made that work a little easier and a lot faster.

How do platforms die? One step at a time [the Fidelity Gap]

The RackN team is working on the “Start to Scale” position for Digital Rebar that targets the IT industry-wide “fidelity gap” problem.  When we started on the Digital Rebar journey back in 2011 with Crowbar, we focused on “last mile” problems in metal and operations.  Only in the last few months did we recognize the importance of automating smaller “first mile” desktop and lab environments.

A fidelityFidelity Gap gap is created when work done on one platform, a developer laptop, does not translate faithfully to the next platform, a QA lab.   Since there are gaps at each stage of deployment, we end up with the ops staircase of despair.

These gaps hide defects until they are expensive to fix and make it hard to share improvements.  Even worse, they keep teams from collaborating.

With everyone trying out Container Orchestration platforms like Kubernetes, Docker Swarm, Mesosphere or Cloud Foundry (all of which we deploy, btw), it’s important that we can gracefully scale operational best practices.

For companies implementing containers, it’s not just about turning their apps into microservice-enabled immutable-rock stars: they also need to figure out how to implement the underlying platforms at scale.

My example of fidelity gap harm is OpenStack’s “all in one, single node” DevStack.  There is no useful single system OpenStack deployment; however, that is the primary system for developers and automated testing.  This design hides production defects and usability issues from developers.  These are issues that would be exposed quickly if the community required multi-instance development.  Even worse, it keeps developers from dealing with operational consequences of their decisions.

What are we doing about fidelity gaps?  We’ve made it possible to run and faithfully provision multi-node systems in Digital Rebar on a relatively light system (16 Gb RAM, 4 cores) using VMs or containers.  That system can then be fully automated with Ansible, Chef, Puppet and Salt.  Because of our abstractions, if deployment works in Digital Rebar then it can scale up to 100s of physical nodes.

My take away?  If you want to get to scale, start with the end in mind.