To avoid an echo chamber, OpenStack must embrace the competitive cloud ecosystem

Japanese Bullet Train View

I was in Japan before the Tokyo summit on a bullet train to Kyoto watching the mix of heavy industry and bucolic mountains pass by. That scene reflects an OpenStack duality: we want to be both a dominant platform delivering core cloud services and an open source, values-driven collective.

First, I fundamentally believe in the success of OpenStack as the open virtual infrastructure management platform.

I believe that we have solved the virtual compute/storage/network problem sufficiently to become the de facto open IaaS platform. While not perfect, the technologies are sufficient assuming we continue to improve ease of use and operational hardening. Pursuing that base capability is my primary motivation for DefCore work.

I don’t believe that the OpenStack community is, or should try to become, the authority on “all things cloud.”

In the presence of Amazon, VMware, Microsoft and Google, we cannot make that claim with any degree of self-respect. Even newcomers like DigitalOcean have an undeniable footprint and influence. Those vendor platforms drive cloud ecosystems and technologies which foster fast innovation because there is no friction to joining their ecosystems and they are large and stable enough to represent a target market. We’ve seen clear signs from Rackspace, HP and others that platform diversity improves cloud strength.

I continue to think we (OpenStack) spend too much time evaluating what is “in” or “out” of the project and too little time talking about what’s “on,” “under” and “with” the project, like Kubernetes, Mesos, Docker, SDN, Hadoop and Ceph. That type of thinking creates distance between OpenStack efforts and the majority of the market.

What motivates the drive to an all-open captive community? It’s the reasonable concern that critical parts of the infrastructure will become pay-to-play. For example, what if a non-OpenStack alternative to Heat orchestration gained popularity with OpenStack implementers, perhaps something that also ran on Amazon? That would create external pressure that drives internal priorities. These “non-OpenStack” products would then have influence without having to contribute back upstream.

Can we afford to have external entities driving internal priorities? Hell yes, that’s what customer adoption looks like.

OpenStack does not own the market sufficiently to create a cloud echo chamber. The next wave of cloud innovation (my money is on container platforms) will follow the path of least resistance and widest adoption. We need to accept that these innovations will not all happen inside our community so that we can welcome them as part of our ecosystem. The community needs to find peace with that.

Composable Infrastructure – is there a balance between configurability and complexity?

Recently, I was part of a Composable Infrastructure briefing hosted by the enterprise side of HP and moderated by Tim Crawford. The event focused on dynamically (re)configurable hardware. This type of innovation points to a new generation of hardware with exciting capabilities; however, my takeaway is that we need more focus on the system-level challenges of operations when designing units of infrastructure.  I’d like to see more focus on “composable ops” than “composable hardware.”

What is “composable infrastructure”?

Basically, these new servers use a flexible interconnect bus between CPU/RAM/disk/network components that allows operators to connect a chassis’ internal commodity components together on the fly. Conceptually, this allows operators to build fit-to-purpose servers via a chassis API. For example, an operator could build a 4-CPU, high-RAM system for VMs on even days and then reconfigure it as a single-socket, four-drive big data node on the odd ones. That means we have to block out three extra CPUs and the drives for this system even though it only needs them 50% of the time.
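
There’s no industry-standard chassis API, so here’s a minimal sketch of the concept in Python; every name in it is hypothetical:

```python
# Hypothetical chassis API sketch -- no real vendor API is implied.
# Composing a node reserves physical components; physical parts cannot be
# oversubscribed, so anything reserved is unavailable to other nodes.

class Chassis:
    def __init__(self, cpus=8, ram_gb=512, disks=16):
        self.free = {"cpus": cpus, "ram_gb": ram_gb, "disks": disks}
        self.nodes = {}

    def compose(self, name, **parts):
        """Build a fit-to-purpose node from free components."""
        if any(qty > self.free[p] for p, qty in parts.items()):
            raise RuntimeError(f"not enough free components for {name}")
        for p, qty in parts.items():
            self.free[p] -= qty
        self.nodes[name] = parts

    def decompose(self, name):
        """Tear a node down, returning its components to the pool."""
        for p, qty in self.nodes.pop(name).items():
            self.free[p] += qty

chassis = Chassis()
# even days: the 4-CPU, high-RAM VM host...
chassis.compose("vm-host", cpus=4, ram_gb=256, disks=1)
# ...odd days: tear it down and rebuild it as a 1-CPU, four-drive data node
chassis.decompose("vm-host")
chassis.compose("bigdata", cpus=1, ram_gb=64, disks=4)
```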

I’ve seen designs like this before. They are very cool to review but not practical in scale operations.

My first concern is the inventory packing problem. You cannot make physical resources truly dynamic because they have no option for oversubscription. Useful composable infrastructure will likely have both a lot of idle capacity to service requests and isolated idle resource fragments from building ad hoc servers. It’s a classic inventory design trade-off between capacity and density.
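
A toy first-fit allocation (numbers invented) makes the fragment problem concrete:

```python
# Illustrative only: first-fit placement of ad hoc server requests against
# a fixed pool of physical components. Because physical parts cannot be
# oversubscribed, mismatched leftovers simply sit idle.
pool = {"cpus": 16, "disks": 24}
requests = [{"cpus": 6, "disks": 2},
            {"cpus": 6, "disks": 2},
            {"cpus": 8, "disks": 4}]

for i, req in enumerate(requests):
    if all(req[p] <= pool[p] for p in pool):
        for p in pool:
            pool[p] -= req[p]
    else:
        print(f"request {i} does not fit")  # needs 8 CPUs, only 4 remain free

print(pool)  # {'cpus': 4, 'disks': 20}: idle fragments too lopsided to use
```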

The pro-composable argument is that data centers already have a lot of idle capacity. While composable designs could allow tighter packing of resources, that argument misses the high cost of complexity in operations.

My second concern is complexity.  What do scale operators want? Looking at Facebook’s Open Compute models, operators want efficient, cheap and interchangeable parts. Scale operators are very clear: heterogeneity is expensive inside of a single infrastructure. It’s impossible to create a completely homogeneous data center, so operators must be able to cope with variation. As a rule, finding ways to limit variation increases predictability and scale efficiency.

What about small and mid-tier operators? If they need purpose specific hardware then they will buy it. For general purpose tasks, virtual machines already offer composable flexibility and are much more portable. The primary lesson is that operational costs, both hard and soft, are much more significant than fractional improvements from optimized hardware utilization.

I’m a believer that tremendous opportunities are created by hardware innovation. Composable hardware raises interesting questions; however, I get much more excited by Open Compute Project (OCP) designs because OCP puts operational concerns first.

From Start to Scale: learn faster with heterogeneous deployments

Why mix VMs and physical? Having a consistent deploy approach can dramatically speed learning cycles that result in better scale ops. I would never deploy production OpenStack on VMs, but I strongly recommend rehearsing that deployment on VMs hundreds of times before touching metal.

Over the last two months, the RackN team redefined “heterogeneous” infrastructure in Digital Rebar from being “just” multi-vendor hardware to including any server resource, from containers and Vagrant/VirtualBox to clouds like AWS or Packet. To support this truly diverse range, there were both technical and operational challenges to overcome.

The technical challenge rises from the fundamental control differences between cloud and physical infrastructure. In cloud, infrastructure is much more prescribed – you cannot change most aspects of your system, especially not your network interfaces or IPs. To provision hardware efficiently, we had to establish control over the very things that cloud systems manage for you.

That management diversity exercised the full extent of the Digital Rebar “functional ops” architecture.

Over the last year, we’ve been unwinding baked-in control assumptions from earlier versions of Digital Rebar. That added flexibility allows Digital Rebar to mix control APIs for infrastructure ranging from Cobbler to Docker, Vagrant and AWS. Since we could already cope with heterogeneous control APIs using Digital Rebar’s unique functional ops design, we retained the ability to mix and match container, virtual and physical infrastructure.

The operational challenge was more subtle. We were motivated to make this change by first-hand observations of the fidelity gap. I am a strong believer that container platforms will directly target metal in the next two years. The challenge is how to get there from our current virtualization-focused infrastructure.

It’s easy to look at the completed work as an obvious step forward. Looking over my shoulder, I know that it took years of learning and perseverance to create a platform flexible enough to handle both extremes of control. Even more important was understanding why a physical scale deployment platform must provide ops fidelity for developers too.

With the infrastructure work behind us, we’re seeing Digital Rebar deliver real operational transformation. We want to help IT embrace containers and immutable infrastructure without discarding the hard-won lessons of installing cloud and traditional infrastructure. Most critically, we hope that you’ll join our open community and share your operational journey with us.

Faster, Simpler AND Smaller – Immutable Provisioning with Docker Compose!

Nearly 10 TIMES faster system resets – that’s the result of fully enabling a multi-container immutable deployment on Digital Rebar.

I’ve been having a “containers all the way down” month since we launched Digital Rebar deployment using Docker Compose. I don’t want to imply that we rubbed Docker on the platform and magic happened. The RackN team spent nearly a year building up the Consul integration and service wrappers for our platform before we were ready to fully migrate.

During the Digital Rebar migration, we took our already service-oriented code base and broke it into microservices. Specifically, the Digital Rebar core (the API and engine) now runs in its own container and each service (DNS, DHCP, provisioning, logging, NTP, etc.) also has a dedicated container. Likewise, supporting items like Consul and PostgreSQL are, surprise, managed in dedicated containers too. Altogether, that’s over nine containers, and we continue to partition out services.

We use Docker Compose to coordinate the start-up and Consul to wire everything together. Both play a role, but Consul is the critical glue that allows Digital Rebar components to find each other. These were not random choices. We’ve been using a Docker package for over two years and using Consul service registration as an architectural choice for over a year.
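
The actual compose file ships with Digital Rebar, but the start-up pattern itself is easy to sketch. Here it is in Python using the Docker SDK, with hypothetical image names: Consul comes up first on a shared network, then each service container is pointed at it.

```python
# Sketch of the start-up pattern (not the real Digital Rebar compose file).
# Consul starts first; every service container joins the same network and
# is told where Consul lives so it can register itself.
import docker

client = docker.from_env()
client.networks.create("rebar", driver="bridge")

# the critical glue: all other components find each other through Consul
client.containers.run("consul", name="consul", network="rebar", detach=True)

# hypothetical image names -- one dedicated container per service
for svc in ("dns", "dhcp", "provisioner", "logging", "ntp"):
    client.containers.run(f"rebar/{svc}", name=svc, network="rebar",
                          detach=True,
                          environment={"CONSUL_HTTP_ADDR": "consul:8500"})
```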

Service registration plays a major role in the functional ops design because we’ve been wrapping datacenter services like DNS with APIs. Consul provides the separation between providing and consuming a service. Our previous design required us to track each running service directly. That worked until customers asked for pluggable services (and every customer needs pluggable services as they scale).
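
The mechanics of that separation are visible in Consul’s HTTP API (the endpoints below are real; the “dns” service definition is just an example):

```python
# The register/discover split via Consul's HTTP API.
import requests

CONSUL = "http://localhost:8500"

# provider side: the DNS service wrapper registers itself with the agent
requests.put(f"{CONSUL}/v1/agent/service/register",
             json={"Name": "dns", "Port": 53})

# consumer side: ask Consul where "dns" lives instead of tracking it
# ourselves -- the implementation behind the name becomes pluggable
for entry in requests.get(f"{CONSUL}/v1/catalog/service/dns").json():
    print(entry["ServiceAddress"] or entry["Address"], entry["ServicePort"])
```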

Besides making environment resets faster, there are several additional wins:

  1. more transparent in how it operates – it’s obvious which containers provide each service and easy to monitor them as individuals.
  2. easier to distribute services in the environment – we can find where a service runs via its Consul registration, so we don’t have to track it.
  3. possible to have redundant services – it’s easy to spin up new services, even on the same system.
  4. services are pluggable – as long as the service registers and there’s an API, we can replace the implementation.
  5. no concern about which distribution is used – all our containers are Ubuntu user space, but the host can be anything.
  6. changes to components are more isolated – changing one service does not require a lot of downloading.

Docker and microservices are not magic but the benefits are real. Be prepared to make architectural investments to realize the gains.

A year of RackN – 9 lessons from the front lines of evangelizing open physical ops

Let’s avoid this > “We’re heading right at the ground, sir!”  “Excellent, all engines full power!”


RackN is refining our “from start to scale” message and it’s also our 1-year anniversary, so it’s a natural time for reflection. While it’s been a year since our founders made RackN a full-time obsession, the team has been working together for over 5 years now with the same vision: improve scale datacenter operations.

As a backdrop, IT ops is under tremendous pressure to increase agility and reduce spending.  Even worse, there’s a building pipeline of container-driven change that we are still learning how to operate.

Over the year, we learned that:

  1. no one has time to improve ops
  2. everyone thinks their uniqueness is unique
  3. most sites have much more in common than is different
  4. the differences between sites are small
  5. small differences really do break automation
  6. once it breaks, it’s much harder to fix
  7. everyone plans to simplify once they stop changing everything
  8. the pace of change is accelerating
  9. apply, rinse, repeat with lesson #1

Where does that leave us besides stressed out?  Ops is not keeping up.  The solution is not to go faster: we have to improve first and then accelerate.

What makes general-purpose datacenter automation so difficult?  The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is the ordering of interdependencies.  This is especially true on physical systems where you have to really grok* networking.
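
One way to see the ordering challenge: treat each provisioning step as a node in a dependency graph and compute a safe bring-up order. The step names below are generic, not Digital Rebar’s:

```python
# Illustrative only: bring-up order on physical gear is a topological sort
# over service interdependencies -- and networking sits under everything.
from graphlib import TopologicalSorter  # Python 3.9+

deps = {                  # each step maps to the steps it depends on
    "dhcp":       {"network"},
    "dns":        {"network"},
    "tftp":       {"network"},
    "pxe-boot":   {"dhcp", "tftp"},
    "os-install": {"pxe-boot", "dns"},
}

print(list(TopologicalSorter(deps).static_order()))
# one valid order: ['network', 'dhcp', 'dns', 'tftp', 'pxe-boot', 'os-install']
```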

The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.

What does that heterogeneous-first automation look like? First, we’ve learned to adapt to customer datacenters. That means using the DNS, DHCP and other services that you already have in place. It means dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.

This was hard and tested both our patience and our architectural pattern. It would be much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” on forcing uniformity and built software that accepts variation.

So what did we do with a year?  We had to spend a lot of time listening and learning what “real operations” need.   Then we had to create software that accommodated variation without breaking downstream automation.  Now we’ve made it small enough to run on a desktop or cloud for sandboxing and a new learning cycle begins.

We’d love to have you try it out: rebar.digital.

* Grok is the correct word here.  Merely thinking that you “understand networking” is often dangerous when it comes to automation.

Got some change? Build a datacenter ops lab on your coffee break [with Packet.net MaaS]

We’re using Packet.net hosted metal to test automation for private metal (video).  You can use discount code “RACKN100” to get a credit on Packet and try it yourself.

At RackN, we’ve been shrinking our scale deployment platform down to run faithfully on a desktop-class system. Since we abstract the network and hardware complexity, you can build automation that scales to physical from as little as 16 GB of RAM (the same size as Packet’s smaller server). That allows the exact same logic we use for an 80-node Ceph or Kubernetes cluster to work on my 14” laptop.

In fact, we’ve been getting a bit obsessed with making a clean restart small and fast using containers, VMs and bootstrapping scripts.

Creating a remote test lab is part of this obsession because many rehearsals make great performances.  We wanted to eliminate the setup time and process for users who just want to experiment with a production grade deployment. Using Packet.net hosted metal and some Ansible scripts, we can build a complete HA Kubernetes cluster in about 15 minutes using VMs. This lets us iterate on Kubernetes best practices virtually since the “setup metal part” is handled abstractly by Digital Rebar.
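
For reference, spinning up one of those Packet machines is a single authenticated API call. A hedged sketch against their REST API; the token, project ID and the plan/OS/facility slugs below are placeholders, so check Packet’s docs for current values:

```python
# Sketch of API-driven metal provisioning on Packet.net.
import requests

API = "https://api.packet.net"
HEADERS = {"X-Auth-Token": "YOUR_PACKET_TOKEN"}   # placeholder token
PROJECT = "YOUR_PROJECT_ID"                       # placeholder project ID

resp = requests.post(
    f"{API}/projects/{PROJECT}/devices", headers=HEADERS,
    json={"hostname": "rebar-lab-1",
          "plan": "baremetal_1",             # placeholder small-server slug
          "operating_system": "ubuntu_14_04",
          "facility": "ewr1"})
resp.raise_for_status()
device = resp.json()
print(device["id"], device["state"])
```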

Yawn. You could do the same in AWS. Why is that exciting?

The process for the lab system we build in Packet.net can then be used to provision a complete private infrastructure on metal including RAID, BIOS and server networking. Even though the lab uses VMs, we still do real networking, storage and configuration. For example, we can iterate building real software defined networking (SDN) overlays in this environment and then scale the work up to physical gear.

The provision and deploy time is so fast (generally, under 15 minutes) that we are using it as a clean environment for dev and QA cycles on new automation. It’s also a very practical demo environment for these platforms because of the fidelity between this environment and an actual pilot. For me, that means spending $0.40 so I don’t have to sweat losing my work in progress, battery life or my wifi connection to crank out a demo.

BTW… Packet.net servers are SUPER FAST. Even the small 16 GB RAM machine is packed with SSDs and great connectivity.

If you are exploring any of the several workloads that we’ve been building (Docker Swarm, Kubernetes, Mesos, CloudFoundry, Ceph and OpenStack) or just playing around with API driven physical provisioning, we just made that work a little easier and a lot faster.

How do platforms die? One step at a time [the Fidelity Gap]

The RackN team is working on the “Start to Scale” position for Digital Rebar that targets the IT industry-wide “fidelity gap” problem.  When we started on the Digital Rebar journey back in 2011 with Crowbar, we focused on “last mile” problems in metal and operations.  Only in the last few months did we recognize the importance of automating smaller “first mile” desktop and lab environments.

A fidelity gap is created when work done on one platform, a developer laptop, does not translate faithfully to the next platform, a QA lab.  Since there are gaps at each stage of deployment, we end up with the ops staircase of despair.

These gaps hide defects until they are expensive to fix and make it hard to share improvements.  Even worse, they keep teams from collaborating.

With everyone trying out Container Orchestration platforms like Kubernetes, Docker Swarm, Mesosphere or Cloud Foundry (all of which we deploy, btw), it’s important that we can gracefully scale operational best practices.

For companies implementing containers, it’s not just about turning their apps into microservice-enabled, immutable rock stars: they also need to figure out how to implement the underlying platforms at scale.

My example of fidelity gap harm is OpenStack’s “all in one, single node” DevStack.  There is no useful single system OpenStack deployment; however, that is the primary system for developers and automated testing.  This design hides production defects and usability issues from developers.  These are issues that would be exposed quickly if the community required multi-instance development.  Even worse, it keeps developers from dealing with operational consequences of their decisions.

What are we doing about fidelity gaps?  We’ve made it possible to run and faithfully provision multi-node systems in Digital Rebar on a relatively light system (16 GB RAM, 4 cores) using VMs or containers.  That system can then be fully automated with Ansible, Chef, Puppet and Salt.  Because of our abstractions, if a deployment works in Digital Rebar, then it can scale up to 100s of physical nodes.

My takeaway?  If you want to get to scale, start with the end in mind.

Introducing Digital Rebar. Building strong foundations for New Stack infrastructure

This week, I have the privilege to showcase the emergence of RackN’s updated approach to data center infrastructure automation that is container-ready and drives “cloud-style” DevOps on physical metal.  While it works at scale, we’ve also ensured it’s light enough to run a production-fidelity deployment on a laptop.

You grow to cloud scale with a ready-state foundation that scales up at every step.  That’s exactly what we’re providing with Digital Rebar.

Over the past two years, the RackN team has been working on microservices operations orchestration in the OpenCrowbar code base.  By embracing these new tools and architecture, Digital Rebar takes that base in a new direction.  Yet we also get to leverage a scalable heterogeneous provisioner and integrations for all major devops tools.  We began with critical data center automation already working.

Why Digital Rebar? Traditional data center ops is being disrupted by container and service architectures, and legacy data centers are challenged with gracefully integrating this new way of managing containers at scale: we felt it was time to start a dialog about the new foundational layer of scale ops.

Both our code and vision have substantially diverged from the groundbreaking “OpenStack Installer” MVP that the RackN team members launched in 2011 from inside Dell and that is still winning prizes for SUSE.

We have not regressed our leading vendor-neutral hardware discovery and configuration features; however, today, our discussions are about service wrappers, heterogeneous tooling, immutable container deployments and next generation platforms.

Over the next few days, I’ll be posting more about how Digital Rebar works (plus video demos).

RackN fills holes with Drill Release


Drill Man! by BruceLowell.com [creative commons] We’re so excited about our in-process release that we’ve been relatively quiet about the last OpenCrowbar Drill release (video tour here).  That’s not a fair reflection of the level of capability and maturity reflected in the code base; yes, Drill’s purpose was to set the stage for truly groundbreaking ops automation work in the next release (“Epoxy”).

So, what’s in Drill?  Scale and Containers on Metal Workloads!  [official release notes]

The primary focus for this release was proving our functional operations architectural pattern against a wide range of workloads and that is exactly what the RackN team has been doing with Ceph, Docker Swarm, Kubernetes, CloudFoundry and StackEngine workloads.

In addition to workloads, we put the platform through its paces in real ops environments at scale.  That resulted in even richer network configurations and options plus performance…
