Full Metal DevOps: 12 things we needed beyond Cobbler

The RackN team did not set out to replace Cobbler; we just needed something that met our need for full-cycle, cross-platform DevOps automation.

Provisioning an O/S is never enough!  You need to coordinate a lot of operational activity to deploy a multi-node system like OpenStack, Kubernetes, Docker Swarm or Ceph.  Since we believe an automated upgrade path is also required, provisioning alone leaves a huge gap.

So what was needed?  Here’s our (rather long!) list of gaps to fill for full Metal DevOps provisioning:

  1. Needs to work with Cobbler! Improve? Yes.  Disrupt? Hell no!  It has to be OK to leave Cobbler in place while we do something better; I’d even be OK tweaking my Cobbler to point it at the new stuff.
  2. REST API & JSON CLI. Beyond the obvious API, we really want a way to write scripts that drive deployment proactively (a sketch of what this could look like follows the list).
  3. Modular components. If I’ve got my own DNS, DHCP, NTP, etc., then let me use those instead (see #1 above).
  4. Control over the discovery (RAM) image. Discovery images are awesome, BUT please let me mess with them too!  Inject my keys and let me control when the image exits.
  5. Configure heterogeneous RAID, BIOS & IPMI. Servers are a mix of in-band (in the O/S) and out-of-band (BMC) configs.  Don’t make me pick; I can’t.
  6. Inject DevOps scripts dynamically based on system inventory or state. Depending on the node’s role, I want to run a set of scripts AFTER the O/S is installed.  And please let me mix Chef, Puppet, Ansible and Bash.  Bash?  Especially Bash.
  7. Portable scripts between cloud and metal. I’m going to practice on VMs and AWS; in fact, my devs only work there.  I need high fidelity between my cloud and metal deploys.
  8. One click to reset and start over. I don’t care if you want to call this “Metal as a Service.”  Deployments are iterative and we need to go faster.
  9. Don’t require PXE or IP control to add nodes to the system. Beyond #2, I want to get control of servers that don’t PXE or are already provisioned.
  10. System inventory, including network topology. Then push it. No surprise that we need inventory to make provisioning decisions.  Can we make that API available?  Maybe push it into a CMDB?
  11. Control SSH keys per system, group and deployment. Darn, security is near the bottom again!  Can we please control keys and access from first boot?  It should be table stakes.
  0. AND NEVER HAVE TO TOUCH KICKSTART or PRESEED TEMPLATES. Well, there are times I have to do it (like soft RAID for O/S drives), so at least create a template system, because Cobbler’s was pretty good.
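
To make gaps #2 and #6 more concrete, here is a minimal sketch of a script driving a provisioning service over a REST/JSON API and choosing post-install automation by node role.  The base URL, endpoint paths, field names and role-to-script mapping are invented placeholders for illustration; they are not the actual Digital Rebar (or Cobbler) API.

    #!/usr/bin/env python3
    # Illustrative sketch only: drive a provisioning service over a REST/JSON API
    # and queue role-specific post-install scripts from inventory data.
    # The base URL, endpoint paths and field names are hypothetical placeholders.
    import json
    import urllib.request

    BASE = "http://provisioner.example.local:8092/api/v1"  # assumed endpoint

    def get_json(path):
        with urllib.request.urlopen(BASE + path) as resp:
            return json.load(resp)

    def post_json(path, payload):
        req = urllib.request.Request(
            BASE + path,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Hypothetical mapping of node role to post-install automation (gap #6):
    # mix Ansible, Chef, Puppet and, especially, Bash.
    ROLE_SCRIPTS = {
        "ceph-osd": ["bash/format-disks.sh", "ansible/ceph-osd.yml"],
        "k8s-worker": ["bash/enable-cgroups.sh", "ansible/kubelet.yml"],
    }

    for node in get_json("/nodes"):  # inventory discovered by the RAM image
        scripts = ROLE_SCRIPTS.get(node.get("role", "unassigned"), [])
        post_json("/nodes/%s/workflow" % node["id"], {
            "os": "ubuntu-14.04",
            "post_install": scripts,
        })
        print("queued", node["id"], "with", scripts)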

We built Digital Rebar to close these gaps and many others (like being transparent in operation, working in containers, and failing fast).  We think it’s time to bring cloud operational practices to metal, and with this type of automation we can make it happen!

What are your biggest challenges with metal ops?  Do they match this list?  I’d love to hear your opinion.

Post-OpenStack DefCore, I’m Chasing “open infrastructure” via cross-platform Interop

Like my previous DefCore interop windmill tilting, this is not something that can be done alone. Open infrastructure is a collaborative effort and I’m looking for your help and support. I believe solving this problem benefits us as an industry and individually as IT professionals.

So, what is open infrastructure?  It’s not about running on open source software. It’s about creating platform choice and control. In my experience, that’s what defines open for users (and developers are not users).

I’ve spent several years helping lead OpenStack interoperability (aka DefCore) efforts to ensure that OpenStack cloud APIs are consistent between vendors. I strongly believe that effort is essential to build an ecosystem around the project; however, in talking to enterprise users, I’ve learned that their real interoperability gap is between the many platforms, AWS, Google, VMware, OpenStack and metal, that they use every day.

Instead of focusing inward to one platform, I believe the bigger enterprise need is to address automation across platforms. It is something I’m starting to call hybrid DevOps because it allows users to mix platforms, service APIs and tools.

Open infrastructure in that context is being able to work across platforms without being tied into one platform choice even when that platform is based on open source software. API duplication is not sufficient: the operational characteristics of each platform are different enough that we need a different abstraction approach.

We have to be able to compose automation in a way that tolerates substitution based on infrastructure characteristics. This is required for metal because of variation between hardware vendors and data center networking and services. It is equally essential for cloud because of variation between IaaS capabilities and service delivery models. Basically, those minor differences between clouds create significant challenges in interoperability at the operational level.

Rationalizing APIs does little to address these more structural differences.
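
To make “compose automation in a way that tolerates substitution” a little more concrete, here is a minimal sketch in which a workload declares the capabilities it needs and a binding step substitutes whatever each infrastructure target actually provides.  All of the names and capability keys are invented for illustration; this shows the idea, not any particular tool’s implementation.

    # Minimal sketch: a workload declares needs; a binding step substitutes the
    # provider each infrastructure target advertises. All names are invented.
    TARGETS = {
        "aws":   {"block_storage": "ebs", "load_balancer": "elb"},
        "metal": {"block_storage": "local-raid", "load_balancer": "haproxy"},
    }

    WORKLOAD = [
        {"component": "database", "needs": ["block_storage"]},
        {"component": "frontend", "needs": ["load_balancer"]},
    ]

    def bind(workload, capabilities):
        """Map each declared need to the target's provider, or fail loudly."""
        plan = {}
        for item in workload:
            for need in item["needs"]:
                if need not in capabilities:
                    raise ValueError("%s needs %s" % (item["component"], need))
                plan.setdefault(item["component"], {})[need] = capabilities[need]
        return plan

    for name, caps in TARGETS.items():
        print(name, bind(WORKLOAD, caps))  # same workload, different bindings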

The problem is compounded because the differences are not nicely segmented behind abstraction layers. If you work to build and sustain a fully integrated application, you must account for site-specific needs throughout your application stack, including networking, storage, access and security. I’ve described this as: all deployments have 80% of the work in common, but the remaining 20% is mixed in with the 80% instead of being nicely layered. So, ops is cookie dough, not vinaigrette.

Getting past this problem for initial provisioning on a single platform is a false victory. The real need is portable and upgrade-ready automation that can be reused and shared. Critically, we also need to build upon the existing foundations instead of requiring a blank slate. There is openness value in heterogeneous infrastructure so we need to embrace variation and design accordingly.

This is the vision the RackN team has been working towards with the open source Digital Rebar project. We are now able to showcase workload deployments (Docker, Kubernetes, Ceph, etc.) on multiple cloud platforms that also translate to full bare metal deployments. Unlike previous generations of this tooling (some will remember Crowbar), we’ve been careful to avoid injecting external dependencies into the DevOps scripts.

While we’re able to demonstrate a high degree of portability (or fidelity) across multiple platforms, this is just the beginning. We are looking for users and collaborators who want to build open infrastructure from an operational perspective.

You are invited to join us in making open cross-platform operations a reality.

OpenStack Shared Community Values? Here’s my seven, let’s compare

The recent discussion about OpenStack API vs Implementation has led to several discussions about “OpenStack Values.”  While entertaining, they ultimately show that we have a lot of conflicting desires and opinions about the project.  In fact, the term “values” is itself hard to define.

Consequently, I wanted to try to capture what I see as OpenStack’s current values (not my personal ones for the project – those are in the post script). I’ve tried to put everything in positive terms, but value choices always have positive and negative impacts.

  1. Upstreaming. Provides: a shared code base and community effort. Possible downsides: “first in” wins; latest over stability; value measured in commits.
  2. Vendors taking initiative. Provides: broad participation; free marketing buzz. Possible downsides: no one wants to say no because of the perception of vendor bias.
  3. End-to-end open source. Provides: no licenses required for scale users and developers. Possible downsides: “build, don’t buy” wastes a lot of effort; does not align with users who want to pay for services.
  4. Developer leadership. Provides: lots of code being created. Possible downsides: not many user requirements being considered.
  5. Figuring out the API via implementation. Provides: fast iterations. Possible downsides: frustrating APIs; API deprecation.
  6. Passionate discussion. Provides: diversity of opinion; drama attracts attention. Possible downsides: an “unfriendly” community; loudest voice wins; cross-culture challenges.
  7. Being able to contribute broadly. Provides: a generally maintainable platform. Possible downsides: less emphasis on deep subject-matter skills or on the best tool for the job.

In my experience, if you don’t align with a community’s values then you’re going to be very unhappy in that community.  I’ve watched this happen to project founders as the community changed around them.  Let’s all RAGE QUIT!

So, this makes me reflect on my own open source values. I’d start with pragmatic utility, transparent action, principle driven decisions, iterative design and data driven decisions.

What do you value most in open communities?

APIs and Implementations collide at OpenStack Interop: The Oracle Zones vs VMs Debate

I strive to stay neutral as OpenStack DefCore co-chair; however, as someone asking for another Board term, it’s important to review my thinking so that you can make an informed voting decision.

DefCore, while always on the edge of controversy, recently became ground zero for the “what is OpenStack” debate [discussion write up]. My preferred small core “it’s an IaaS product” answer is only one side. Others favor “it’s an open cloud community” while another faction champions an “open cloud platform.” I’m struggling to find a way that it can be all of these at the same time.

The TL;DR is that, today, OpenStack vendors are required to implement a system that can run Linux guests. This is an example of an implementation over API bias because there’s nothing in the API that drives that specific requirement.

From a pragmatic “get it done” perspective, OpenStack needs to remain implementation driven for now. That means that we care that “OpenStack” clouds run VMs.

While there are pragmatic reasons for this, I think that long term success will require OpenStack to become an API specification. So today’s “right answer” actually undermines the long term community value. This has been a long standing paradox in OpenStack.

Breaking the API to implementation link allows an ecosystem to grow with truly alternate implementations (not just plug-ins). This is a threat to the community “upstream first” mentality.  OpenStack needs to be confident enough in the quality and utility of the shared code base that it can allow competitive implementations. Open communities should not need walls to win but they do need clear API definition.

What is my posture for this specific issue?  It’s complicated.

First, I think that user and ecosystem expectations are being largely ignored in these discussions. Many of the controversial items here are vendor initiatives, not user needs. Right now, I’ve heard clearly that those expectations are for OpenStack to be an IaaS that runs VMs. OpenStack really needs to focus on delivering a reliably operable VM-based IaaS experience. Until that’s solid, the other efforts are vendor noise.

Second, I think that there are serious test gaps that jeopardize the standard. The fundamental premise of DefCore is that we can use the development tests for API and behavior validation. We chose this path instead of creating an independent test suite. We either need to address tests for interop within the current body of tests or discuss splitting the efforts. Both require more investment than we’ve been willing to make.

We have mechanisms in place to collect data from test results and expand the test base.  Instead of creating new rules or guidelines, I think we can work within the current framework.

The simple answer would be to block non-VM implementations; however, I trust that cloud consumers will make good decisions when given sufficient information.  I think we need to fix the tests and accept non-VM clouds if they pass the corrected tests.

For this and other reasons, I want OpenStack vendors to be specific about the configurations that they test and support. We took steps to address this in DefCore last year but pulled back from being specific about requirements.  In this particular case, I believe we should require official OpenStack vendors to state clear details about their supported implementations.  Customers will continue to vote with their wallets about which configuration details are important.

This is a complex issue and we need community input.  That means that we need to hear from you!  Here’s the TC Position and the DefCore Patch.

12 Predictions for ’16: mono-cloud ambitions die as containers drive more hybrid IT

I expect 2016 to be a confusing year for everyone in IT.  For 2015, I predicted that new uses for containers are going to upset cloud’s apple cart; however, the replacement paradigm is not clear yet.  Consequently, I’m doing a prognostication mix and match: five predictions and seven items on a “container technology watch list.”

TL;DR: In 2016, Hybrid IT arrives on Containers’ wings.

Considering my expectations below, I think it’s time to accept that all IT is heterogeneous and stop trying to box everything into a mono-cloud.  Accepting hybrid as current state unblocks many IT decisions that are waiting for things to settle down.

Here’s the memo: “Stop waiting.  It’s not going to converge.”

2016 Predictions

  1. Container Adoption Seen As Two Stages:  We will finally accept that Containers have strength for both infrastructure (first stage adoption) and application life-cycle (second stage adoption) transformation.  Stage one offers value so we will start talking about legacy migration into containers without shaming teams that are not also rewriting apps as immutable microservice unicorns.
  2. OpenStack continues to bump and grow.  Adoption is up and open alternatives are disappearing.  For dedicated/private IaaS, OpenStack will continue to gain in 2016 for basic VM management.  Both competitive and internal pressures continue to threaten the project, but I believe they will not emerge in 2016.  Here’s my complete OpenStack 2016 post.
  3. Amazon, GCE and Azure make everything else questionable.  These services are so deep and rich that I’d question anyone who is not using them.  At least one of them simply has to be part of everyone’s IT strategy for financial, talent and technical reasons.
  4. Cloud API becomes irrelevant. Cloud API is so 2011!  There are now so many reasonable clients to abstract various infrastructures that cloud APIs are less relevant.  Capability, interoperability and consistency remain critical factors, but the APIs themselves are not interesting (see the sketch after this list).
  5. Metal aaS gets interesting.  I’m a big believer in the power of operating metal via an API and the RackN team delivers it for private infrastructure using Digital Rebar.  Now there are several companies (Packet.net, Ubiquity Hosting and others) that offer hosted metal.
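
Here is the sketch referenced in prediction #4: deployment logic written against a provider-neutral capability interface instead of a specific cloud API.  The classes are invented stand-ins (real libraries in this space include Apache Libcloud and Fog); the point is that the deployment code stays identical while the implementation varies.

    # Sketch of prediction #4: code to a capability, not to a cloud API.
    # The classes below are invented stand-ins, not a real client library.
    from abc import ABC, abstractmethod

    class Compute(ABC):
        @abstractmethod
        def create_node(self, name, size):
            ...
        @abstractmethod
        def list_nodes(self):
            ...

    class FakeCloud(Compute):
        """Stands in for a public cloud driver."""
        def __init__(self):
            self._nodes = []
        def create_node(self, name, size):
            self._nodes.append(name)
            return "i-%04d" % len(self._nodes)
        def list_nodes(self):
            return list(self._nodes)

    class FakeMetal(Compute):
        """Stands in for a bare metal provisioner."""
        def __init__(self):
            self._nodes = []
        def create_node(self, name, size):
            self._nodes.append(name)
            return "rack1-" + name
        def list_nodes(self):
            return list(self._nodes)

    def deploy(provider):
        # Deployment logic never mentions a specific cloud API.
        print(provider.create_node("web-0", "small"), provider.list_nodes())

    for provider in (FakeCloud(), FakeMetal()):
        deploy(provider)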

2016 Container Tech Watch List

I’m planning posts about all these key container ecosystems for 2016.  I think they are all significant contributors to the emerging application life-cycle paradigm.

  1. Service Containers (& VMs): There’s an emerging pattern of infrastructure-managed containers that provide critical host services like networking, logging, and monitoring.  I believe this pattern will provide significant value and generate its own ecosystem.
  2. Networking & Storage Services: Gaps in networking and storage for containers need to get solved in a consistent way.  Expect a lot of thrash and innovation here.
  3. Container Orchestration Services: This is the current battleground for container mind share.  Kubernetes, Mesos and Docker Swarm get headlines but there are other interesting alternatives.
  4. Containers on Metal: Removing the virtualization layer reduces complexity, overhead and cost.  Container workloads are good choices to re-purpose older servers that have too little CPU or RAM to serve as VM hosts.  Who can say no to free infrastructure?!  While an obvious win to many, we’ll need to make progress on standardized scale and upgrade operations first.
  5. Immutable Infrastructure: Even as this term wins the “most confusing” concept in cloud award, it is an important one for container designers to understand.  The unfortunate naming paradox is that immutable infrastructure drives disciplines that allow fast turnover, better security and more dynamic management.
  6. Microservices: The latest generation of service-oriented architecture (SOA) benefits from a new class of distributed service registration platforms (etcd and Consul) that bring new life into SOA (see the sketch after this list).
  7. Paywall Registries: The importance of container registries is easy to overlook because they seem to be version 2.0 of package caches; however, container layering makes these services much more dynamic and central than many realize.  (Bernard Golden and I have already posted about this.)
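
Here is the sketch referenced in watch-list item #6: a microservice registering itself with the local Consul agent over its HTTP API so other services can discover it.  The service name, port and health endpoint are placeholders, and etcd would fill the same role with a different API.

    # Sketch for item #6: register a service with the local Consul agent so it
    # becomes discoverable. Service details below are placeholders.
    import json
    import urllib.request

    CONSUL_AGENT = "http://127.0.0.1:8500"  # default local Consul agent address

    def register(name, port):
        payload = {
            "Name": name,
            "ID": "%s-%d" % (name, port),
            "Port": port,
            "Check": {  # Consul polls this endpoint to decide health
                "HTTP": "http://127.0.0.1:%d/health" % port,
                "Interval": "10s",
            },
        }
        req = urllib.request.Request(
            CONSUL_AGENT + "/v1/agent/service/register",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)  # raises on non-2xx responses
        print("registered", name, "on port", port)

    register("catalog-api", 8080)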

What two items did not make the 2016 cut?  1) Special purpose container-focused operating systems like CoreOS or RancherOS.  While interesting, I don’t think these deployment technologies have architectural level influence.  2) Container security via VMs.  I’m seeing patterns where containers may actually be more secure than VMs; the idea that containers need a VM wrapper to be safe is FUD created by people with a vested interest in virtualization.

Did I miss something? I’d love to know what you think I got right or wrong!

2015 Container Review

It’s been a banner year for container awareness and adoption, so we wanted to recap 2015.  For RackN, container acceleration is near to our heart because we both enable containers and use them in fundamental ways.  Look for Rob’s 2016 predictions on his blog.

The RackN team has truly deep and broad experience with containers in practical use.  In the summer, we delivered multiple container orchestration workloads including Docker Swarm, Kubernetes, Cloud Foundry, StackEngine and others.  In the fall, we refactored Digital Rebar to use Docker Compose with dramatic results.  And we’ve been using Docker since 2013 (yes, “way back”) for ops provisioning and development.

To make it easier to review that experience, we are consolidating a list of our container related posts for 2015.

General Container Commentary


My OpenStack 2016 Analysis: Continue Core, Stop Confusing Ecosystem, Change Hybrid Approach

Note: I’ve served on the OpenStack Foundation board since its formation.  There I’ve led the “define the core” DefCore efforts.  I’m on the 2016 ballot for another term.

I love using end-of-year posts to reflect (2015, I got 6 of 7!) and try to set direction (OpenStack needed to prioritize).  This year, I wanted to use a simple “Continue, Stop, Change” format that I’ve used for employee reviews in the past.  These three items reflect how I think OpenStack needs to respond to the industry in 2016.

Continue: Focus on Core

OpenStack adoption continues around the legacy projects that traditionally define it for most users.  A lot of work and focus is needed around those projects including better representation of user, operator and product interests.

Towards that end, we’ve made amazing progress on DefCore implementation and I’m excited about the discussions that it’s been generating.  It’s driving pragmatic decisions about what is required (running a VM?) and how to verify compliance.  It’s also driving conceptual thinking around OpenStack principles and ecosystem priorities.

DefCore’s focus on using community tests to define OpenStack creates a very concrete and defensible standard.  Ultimately, it comes back to users and operators demanding compliance for the work to remain meaningful.

Overall, to focus on core function, OpenStack needs to empower new groups within the community.  Expanding the role of the Product Group, Operators, and User Committee is key to giving a voice to these constituents.

OpenStack core must transition into a consistent platform or it risks becoming irrelevant.

Stop: Confusing The Ecosystem

I’m concerned that the “big tent” governance change puts OpenStack into conflict with both community vendors and the larger cloud market.  I believe we’re creating an echo chamber of OpenStack-on-OpenStack focus that forces adjacent efforts (like software-defined networking, storage and container orchestration) to be either inside or outside the community circle.  While that artificially grows the apparent contributor base, it creates artificial walls between OpenStack and the dominant cloud platforms.

Let me illustrate using my own company, RackN.  We create cross-platform DevOps orchestration based on an open source project, Digital Rebar.  We consider ourselves to be part of the OpenStack community and have supported deploying the core.  We also provision bare metal and deploy Kubernetes, Docker Swarm and Cloud Foundry.  That creates apparent conflicts with the big tent Ironic and Magnum projects.  Does that make RackN competitive with OpenStack or not?

It hurts OpenStack when competitive alignment is unclear because vendors, users and operators are uncertain about where to make investments.  In the end, users will choose simpler alternatives.

I believe the Board needs to define the OpenStack ecosystem strategy in a clear and actionable way.  If re-elected, that will be my Board priority for 2016.

Change: Hybrid Approach

My top 2016 prediction (post coming) is that we accept “hybrid IT as the new normal.”  That means that we stop driving towards an IT mono-culture and start working towards tools that embrace heterogeneity.  Along those lines, OpenStack needs to evaluate our relative position and strengths in a hybrid cloud landscape.

Interoperability between OpenStack implementations is important because it reduces friction; however, we need to expand our thinking to ensure interoperability with other platforms.  That does not mean simply cloning the AWS APIs!  It means that we need to consider users and operator needs against a spectrum of private and public infrastructures.

A broader hybrid approach also suggests that duplicating cloud-locked adjacent services (e.g. Cloud Formation vs. Heat) does not address user needs.

I am advocating that OpenStack encourage a cloud-neutral ecosystem, outside of the OpenStack tent, that works across a wide range of platforms.  That leads to user choice and creates a truly open platform.

And, of course, more Community Discussion!

I want to thank the many people who participated in a heated twitter discussion in advance of this post.  There are many great ideas and counter-points covered in that lengthy dialog.

Do you have an opinion about what OpenStack should stop, accelerate or change?  I’d love to hear it!