RackN Ends DevOps Gridlock in Data Center [Press Release]

Posted on May 4, 2017 by Rob H

Today we announced the availability of Digital Rebar Provision, the industry’s first cloud-native physical provisioning utility. We’ve had this in the Digital Rebar community for a few weeks before offering support and response has been great!

By releasing their API-driven provisioning tool as a stand-alone component of the larger Digital Rebar suite, RackN helps DevOps teams break automation bottlenecks in their legacy data centers without disrupting current operations. The stand-alone open utility can be deployed in under 5 minutes and fits into any data center design. RackN also announced a $1,000 starter support and consulting package to further accelerate transition from tools like Cobbler, MaaS or Stacki to the new Golang utility.

“We were seeing SREs suffering from high job turnover,” said Rob Hirschfeld, RackN founder and CEO. “When their integration plans get gridlocked by legacy tooling they quickly either lose patience or political capital. Digital Rebar Provision replaces the legacy tools without process disruption so that everyone can find shared wins early in large SRE initiatives.”

The first cloud-native physical provisioning utility

Data center provisioning is surprisingly complex because it’s caught between cutting edge hardware and arcane protocols and firmware requirements that are difficult to disrupt. The heart of the system is a fickle combination of specific DHCP options, a firmware bootstrap environment (known as PXE), a very lightweight file transfer protocol (TFTP) and operating system specific templating tools like preseed and kickstart. Getting all these pieces to work together with updated APIs without breaking legacy support has been elusive.
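
To make the templating piece concrete, here is a minimal sketch of rendering a kickstart fragment from per-machine attributes using Go's standard text/template package. The Machine type, its field names and the RenderKickstart helper are hypothetical illustrations, not Digital Rebar Provision's actual templates or API.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Machine holds the per-host attributes injected into the template.
// The field names are illustrative, not a Digital Rebar schema.
type Machine struct {
	Hostname string
	IP       string
	Gateway  string
	Disk     string
}

// kickstartTmpl is a deliberately tiny kickstart fragment.
const kickstartTmpl = `network --bootproto=static --ip={{.IP}} --gateway={{.Gateway}} --hostname={{.Hostname}}
clearpart --all --drives={{.Disk}}
`

// RenderKickstart fills the template with one machine's attributes.
func RenderKickstart(m Machine) (string, error) {
	t, err := template.New("ks").Parse(kickstartTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, m); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := RenderKickstart(Machine{
		Hostname: "node01",
		IP:       "192.168.124.11",
		Gateway:  "192.168.124.1",
		Disk:     "sda",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

A real provisioner would serve the rendered file to the installer over HTTP or TFTP once the DHCP and PXE handoff has completed; the sketch only covers the rendering step.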

By rethinking physical ops in cloud-native terms, RackN has managed to distill out a powerful provisioning tool for DevOps and SRE minded operators who need robust API/CLI, Day 2 Ops, security and control as primary design requirements. By bootstrapping foundational automation with Digital Rebar Provision, DevOps teams lay a foundation for data center operations that improves collaboration between operators and SRE teams: operators enjoy additional control and reuse and SREs get a doorway into building a fully automated process.

A pragmatic path without burning down the data center

“I’m excited to see RackN providing a pragmatic path from physical boot to provisioning without having to start over and rebuild my data center to get there,” said Dave McCrory, an early cloud and data gravity innovator. “It’s time for the industry to stop splitting physical and cloud IT processes because snowflaked, manual processes slow everyone down. I can’t imagine an easier on-ramp than Digital Rebar Provision.”

RackN is making it easy for Cobbler, Stacki, MaaS and Foreman users to evaluate our RESTful, Golang, template-based PXE provisioning utility. Interested users can evaluate the service in minutes on a laptop or engage with RackN for a more comprehensive trial with expert support. The open Provision service works both independently and as part of Digital Rebar’s full life-cycle hybrid control.

See specific features at http://rackn.com/provision/drsa.

Want help starting on this journey? Contact us and we can help.

How about a CaaPuccino? Krish and Rob discuss containers, platforms, hybrid issues around Kubernetes and OpenStack.

Posted on April 24, 2017 by Rob H

CaaPuccino: A frothy mix of containers and platforms.

Check out Krish Subramanian’s (@krishnan) Modern Enterprise podcast (audio here) today for a surprisingly deep and thoughtful discussion about how frothy new technologies are impacting Modern Enterprise IT. Of course, we also take some time to throw some fire bombs at the end. You can use my notes below to jump to your favorite topics.

The key takeaways are that portability is hard and we’re still working out the impact of container architecture.

The benefit of the longer interview is that we really dig into the reasons why portability is hard and discuss ways to improve it. My personal SRE posts and those on the RackN blog describe operational processes that improve portability. These are real concerns for all IT organizations because mixed and hybrid models are a fact of life.

If you are not actively making automation that works against multiple infrastructures then you are building technical debt.

Of course, if you just want the snark, then jump forward to 24:00, where we talk about the future of Kubernetes, OpenStack and the inverted intersection of the projects.

Krish, thanks for the great discussion!

Rob’s Podcast Notes (39 minutes)

2:37: Rob intros about Digital Rebar & RackN

4:50: Why our Kubernetes is JUST UPSTREAM

5:35: Where are we going in 5 years > why Rob believes in Hybrid

Should not be 1 vendor who owns everything
That’s why we work for portability
Public cloud vision: you should stop caring about infrastructure
Coming to an age when infrastructure can be completely automated
Developer rebellion against infrastructure

8:36: Krish believes that Public cloud will be more decentralized

Public cloud should be part of everyone’s IT plan
It should not be the ONLY thing

9:25: Docker helps create portability, what else creates portability? Will there be a standard?

Containers are a huge change, but it’s not just packaging
Smaller units of work are important for portability
Container schedulers & PaaS are very opinionated, that’s what creates portability
Deeper into infrastructure loses portability (RackN helps)
Rob predicts that Lambda and Serverless creates portability too

11:38: Are new standards emerging?

Some APIs become dominant and create de facto standards
Embedded assumptions break portability – that’s what makes automation fragile
Rob explains why we inject configuration to abstract infrastructure
RackN works to inject attributes instead of allowing scripts to assume settings
For example, networking assumptions break portability
Platforms force people to give up configuration in ways that break portability
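
As a rough illustration of the "inject attributes instead of assuming settings" point above, here is a minimal Go sketch. The Attrs type, the attribute keys and the HostsEntry helper are all hypothetical and are not taken from Digital Rebar. The idea is that the automation refuses to guess: every environment-specific value must be injected, and a missing value is an error rather than a silent default.

```go
package main

import (
	"fmt"
	"strings"
)

// Attrs holds values injected by the provisioning layer for one machine.
// The keys used here are hypothetical examples.
type Attrs map[string]string

// HostsEntry builds an /etc/hosts line strictly from injected attributes.
// It refuses to guess: a missing attribute is an error, not a default.
func HostsEntry(a Attrs) (string, error) {
	var missing []string
	for _, key := range []string{"admin_ip", "hostname"} {
		if a[key] == "" {
			missing = append(missing, key)
		}
	}
	if len(missing) > 0 {
		return "", fmt.Errorf("missing injected attributes: %s", strings.Join(missing, ", "))
	}
	return fmt.Sprintf("%s %s", a["admin_ip"], a["hostname"]), nil
}

func main() {
	// Same code, different environments: only the injected values change.
	for _, a := range []Attrs{
		{"admin_ip": "10.0.0.5", "hostname": "lab-node"},
		{"admin_ip": "192.168.124.11", "hostname": "prod-node"},
		{"hostname": "no-network-info"},
	} {
		line, err := HostsEntry(a)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(line)
	}
}
```

The third machine fails loudly instead of inheriting a network setting that happens to be true in one data center, which is the kind of embedded assumption that makes automation fragile.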

14:50: Why did Platform as a Service not take off?

Rob defends PaaS – thinks that it has accomplished a lot
Challenge of PaaS is that it’s very restrictive by design
Calls out Andrew Clay Shafer’s “don’t call it a PaaS” position
Containers provide a less restrictive approach with more options.

17:00: What’s the impact on Enterprise? How are developers being impacted?

Service Orientation is a very important thing to consider
Encapsulation from services is very valuable
Companies don’t own all their IT services any more – it’s not monolithic
IT Service Orientation aligns with Business Processes
Rob says the API economy is a big deal
In machine learning, a business’ data may be more valuable than their product

19:30: Services impact?

Services have a business imperative
We’re not ready for all the impacts of a service orientation
Challenge is to mix configuration and services
Magic of Digital Rebar is that it can mix orchestration of both

22:00: We are having issues with simple, how are we going to scale up?

Barriers are very low right now

22:30: Will Kubernetes help us solve governance issues?

Kubernetes is doing a good job building an ecosystem
Smart to focus on just being Kubernetes
It will be chaotic as the core is worked out

24:00: Do you think Kubernetes is going in the right direction?

Rob is bullish for Kubernetes to be the dominant platform because it’s narrow and specific
Google has the right balance of control
Kubernetes really is not that complex for what it does
Mesos is also good but harder to understand for users
Swarm is simple but harder to extend for an ecosystem
Kubernetes is a threat to Amazon because it creates portability and ecosystem outside of their platform
Rob thinks that Kubernetes could create platform services that compete with AWS services like RDS.
It’s likely to level the field, not create a Google advantage

27:00: How does Kubernetes fit into the Digital Rebar picture?

We think of Kubernetes as a great infrastructure abstraction that creates portability
We believe there’s a missing underlay that cannot abstract the infrastructure – that’s what we do.
OpenStack deployments are broken because every data center is custom and different – vendors create a lot of consulting without solving the problem
RackN is creating composability UNDER Kubernetes so that those infrastructure differences do not break operation automation
Kubernetes does not have the constructs in the abstraction to solve the infrastructure problem, that’s a different problem that should not be added into the APIs
Digital Rebar can also then use the Kubernetes abstractions?

30:20: Can OpenStack really be managed/run on top of Kubernetes? That seems complex!

There is a MESS in the message of Kubernetes under OpenStack because it sends the message that Kubernetes is better at managing applications than OpenStack
Since OpenStack is just an application and Kubernetes is a good way to manage applications
When OpenStack is already in containers, we can use Kubernetes to do that in a logical way
“I’m super impressed with how it’s working” using OpenStack Helm Packs (still needs work)
Physical environment still has to be injected into the OpenStack on Kubernetes environment

35:05 Does OpenStack have a future?

Yes! But it’s not the big “data center operating system” future that we expected in 2010. Rob thinks it is a good VM management platform.
Rob provides the same caution for Kubernetes. It will work where the abstractions add value but data centers are complex hybrid beasts
Don’t “square peg a data center round hole” – find the best fit
OpenStack should have focused on the things it does well – it has a huge appetite for solving too many problems.

April 21 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on April 21, 2017 by Rob H

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content, please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo).

SRE Items of the Week

DigitalRebar Provision deploy Docker’s LinuxKit Kubernetes

_____________

Install Digital Rebar PXE Provision on a Mac OSX System and Test Boot using Virtual Box

_____________

Packet Pushers 333 Automation & Orchestration in Networking
http://packetpushers.net/podcast/podcasts/show-333-orchestration-vs-automation/

While the discussion is all about NETWORK DevOps, they do a good job of explaining WHY the current state of system orchestration is so sad – in a word: heterogeneity. It’s not going away because the alternative is lock-in. They also do a good job of describing the difference between automation and orchestration; however, I think there’s a middle tier of resource “scheduling” that better describes OpenStack and Kubernetes.

Around 5:00 minutes into the podcast, they effectively describe the composable design of Digital Rebar and the rationale for the way that we’ve abstracted interfaces for automation. If you guys really do want to cash in by consulting with it (at 10 minutes), just contact Rob H.
_____________

Digital Magazine Launch: Increment On-Call
https://increment.com/on-call/

Increment is dedicated to covering how teams build and operate software systems at scale, one issue at a time. In this, our inaugural issue, we focus on industry best practices around on-call and incident response.
_____________

Need PXE? Try out this Cobbler Replacement
https://robhirschfeld.com/2017/04/11/provision-preview/

INTRO
We wanted to make open basic provisioning API-driven, secure, scalable and fast. So we carved out the Provision & DHCP services as a stand alone unit from the larger open Digital Rebar project. While this Golang service lacks orchestration, this complete service is part of Digital Rebar infrastructure and supports the discovery boot process, templating, security and extensive image library (Linux, ESX, Windows, … ) from the main project.

TL;DR: FIVE MINUTES TO REPLACE COBBLER? YES.

The project APIs and CLIs are complete for all provisioning functions with good Swagger definitions and docs. After all, it’s third generation capability from the Digital Rebar project. The integrated UX is still evolving.
_____________

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

DevOpsDays Austin : May 4-5, 2017 in Austin TX

CloudNative vs SRE vs DevOps: The Ultimate Server Cage Match
Not Actually a DevOps Talk with Michael Cote (May 4 at 4:50pm)

OpenStack Summit : May 8 – 11, 2017 in Boston, MA

OpenStack and Kubernetes. Combining the best of both worlds – Kubernetes Day

Interop ITX : May 15 – 19, 2017 in Las Vegas, NV

Open Source IT Summit – Tuesday, May 16, 9:00 – 5:00pm : Rob Hirschfeld to speak

Gluecon : May 24 – 25, 2017 in Denver, CO

Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #68

Starting Weekly SRE Update Posts!

Posted on April 9, 2017 by Rob H

Everyone at RackN is excited to talk about Site Reliability Engineering and we’re spinning up several efforts including an interview series about it (contact me if you want to talk!).

Our first deliverable is a weekly SRE industry roundup blog post (our first one!). You can subscribe to the updates on the RackN site.

Let us know if you find news that we should include by posting comments there or tweeting to @RackNGo.

If you want to hear more of my regular opinionated stuff, never fear! I’ve got some fun “server cage match” material coming together around DevOpsDays Austin talking about how Cloud Native, DevOps and SRE align. You can subscribe to this blog too (button on the left); we generally don’t double post material, so you’ll get fresh insights from both.

Weekly Recap of All Things Site Reliability Engineering (SRE)

Posted on April 7, 2017 by Rob H

Welcome to the first post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content please contact us at info@rackn.com.

SRE Items of the Week

Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites by Ian Miell

INTRO TO POST

For several years I managed the 3rd line site reliability operation for many of the world’s busiest gambling sites, working for a little-known company that built and ran the core backend online software for several businesses that each at peak could take tens of millions of pounds in revenue per hour. I left a couple of years ago, so it’s a good time to reflect on what I learned in the process.

In many ways, what we did was similar to what’s now called an SRE function (I’m going to call us SREs, but the acronym didn’t exist at the time). We were on call, had to respond to incidents, made recommendations for re-engineering, provided robust feedback to developers and customer teams, managed escalations and emergency situations, ran monitoring systems, and so on.

The team I joined was around 5 engineers (all former developers and technical leaders), which grew to around 50 of more mixed experience across multiple locations by the time I left.

I’m going to focus here on process and documentation, since I don’t think they’re talked about usefully enough where I do read about them.
_____

2017 trend lines: When DevOps and hybrid collide by Rob Hirschfeld (@zehicle)
IBM Cloud Computing News

INTRO TO POST

What happens when DevOps methods meet hybrid environments? Following are some emerging trends and my commentary on each.

There are two major casualties as the pace of innovation in IT continues to accelerate: manual processes (non-DevOps) and tightly-coupled software stacks (non-hybrid).

We are changing some things much too quickly for developers and operators to keep up using processes that require human intervention in routine activities like integrated testing or deployment. Furthermore, monolithic platforms—our traditional “duck-and-cover” protection from pace of change—are less attractive for numerous reasons, including slower pace, vendor lock-in and lack of choice.

RECENT SRE AND DEVOPS EVENTS

SRECon17 Americas

Videos of all Sessions from Woodland Hunter

CloudNativeCon + KubeCon 2017 March 29-30, 2017 in Berlin

YouTube Videos of all sessions

IBM Interconnect March in Las Vegas, NV

Christopher Ferris, IBM CTO Open Technology and Rob Hirschfeld “Open Cloud Architecture: Think You Can Out-Innovate the Best of the Rest” – SLIDES

DevOps Summit

“Best Practices in Operating Hybrid Infrastructure that Spans Clouds and the Data Center” – BLOG / SLIDES

UPCOMING MEETUPS & PODCASTS

Continuous Discussions (#c9d9) Episode 66: Scaling Agile and DevOps in the Enterprise – April 11, 2017 at 10am PT. Rob Hirschfeld is a guest on this Electric Cloud podcast.

UPCOMING EVENTS FOR RACKN

DockerCon 2017 : April 17 – 20, 2017 in Austin, TX
DevOpsDays Austin : May 4-5, 2017 in Austin TX
OpenStack Summit : May 8 – 11, 2017 in Boston, MA
Interop ITX : May 15 – 19, 2017 in Las Vegas, NV
Open Source IT Summit – Tuesday, May 16, 9:00 – 5:00pm : Rob Hirschfeld to speak
Gluecon : May 24 – 25, 2017 in Denver, CO

Surviving Day 2 in Open Source Hybrid Automation – May 23, 2017 : Rob Hirschfeld and Greg Althaus

OTHER NEWSLETTERS

SRE Weekly (@SREWeekly) – Issue #66

Don’t Balkanize My Installer, Yo!

Posted on March 28, 2017 by Rob H

Last week, RackN announced our enterprise support for Kubernetes using nothing but upstream Ansible from the project itself. This work represents years of effort by the RackN founders to keep platforms interoperable via open and shareable operations automation.

That’s why our Digital Rebar approach targets underlay challenges and leverages existing automation tools instead of inventing yet another install path.

This week, we added Install Wizard templates to the DC/OS install automation we built in collaboration with Mesosphere last year. That makes it even easier to run DC/OS on physical infrastructure. Like our Kubernetes work, the Digital Rebar automation uses the same community dcos_install.sh that’s used in the community documentation. The difference is that we’re also driving all the underlay prep and configuration automatically.

If this approach appeals to you, contact RackN and join in the open Day 2 revolution.

Interested in seeing the DC/OS install in action? Here’s a demo video:

10x Faster Today but 10x Harder to Maintain Tomorrow: the Cul-De-Sac problem

Posted on March 14, 2017 by Rob H

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way that scales dramatically better. I’m not thinking about scale in number of nodes, but in operator efficiency. The primary way to create that efficiency is to limit site customization and to improve reuse. Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Recently, RackN asked the question “Shouldn’t we have Shared Automation for Commodity Infrastructure?” which talked about the fact that we, as an industry, keep writing custom automation for what should be commodity servers. This “snow flaking” happens because there’s enough variation at the data center system level that it’s very difficult to share and reuse automation on an ongoing basis.

Since variation enables innovation, we need to solve this problem without limiting diversity of choice.

(cc) Kaizer Rangwala

Happily, platforms like Kubernetes are designed to hide these infrastructure variations for developers. That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms. Unfortunately, that does nothing for the platforms or infrastructure bound applications. For this lower level software, we need to accept that operations environments are heterogeneous.

I realized that we’re looking at a multidimensional problem after watching communities like OpenStack struggle to evolve operations practice.

It’s multidimensional because we are building the operations practice simultaneously with the software itself. To make things even harder, the infrastructure and dependencies are also constantly changing. Since this degree of rapid multi-factor innovation is the new normal, we have to plan for our operations automation itself to be just as upgradable.

If we do not upgrade both the software AND the related deployment automation then each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current. These challenges limit the virtuous feedback cycles that help communities grow.

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

This is a hard problem because each deployment is unique and those differences make it hard to absorb community advances without being constantly broken. That is one of the reasons why companies opt out of the community and into vendor distributions. While vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system level concerns. Unlike project generated templates, this approach absorbs heterogeneity and brings in the external information that often complicates project deployment automation.

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

How scared do we need to be for Ops collaboration & investment?

Posted on March 8, 2017 by Rob H

Note: Yesterday RackN posted Are you impatient enough to be an SRE? and then the CIA wikileaks news hit… perhaps the right question is “Are you scared enough to automate deeply yet?”

As an industry, the CIA hacking release yesterday should be driving discussions about how to make our IT infrastructure more robust and fluid. It is simply not enough to harden because both the attacks and the platforms are evolving too quickly.

We must be delivering solutions with continuous delivery and immutability assumptions baked in.

A more fluid IT that assumes constant updates and rebuilding from sources (immutable) is not just a security posture but a proven business benefit. For me, that means actually building from the hardware up where we patch and scrub systems regularly to shorten the half-life of all attack surfaces. It also means enabling existing security features built into our systems that are generally ignored because of configuration complexity. These are hard but solvable automation challenges.

The problem is too big to fix individually: we need to collaborate in the open.

I’ve been really thinking deeply about how we accelerate SRE and DevOps collaboration across organizations and in open communities. The lack of common infrastructure foundations costs companies significant overhead and speed as teams across the globe reimplement automation in divergent ways. It also drags down software platforms that must adapt to each data center as a unique snowflake.

That’s why hybrid automation within AND between companies is an imperative. It enables collaboration.

Making automation portable enough to handle the differences between infrastructure and environments is harder; however, it also enables sharing and reuse that allow us to improve collectively instead of individually.

That’s been a vision driving us at RackN with the open hybrid Digital Rebar project. Curious? Here’s the RackN post that inspired this one:

From RackN’s Are you impatient enough to be an SRE?

“Like the hardware that runs it, the foundation automation layer must be commoditized. That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way. Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on. We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification. That practice and pace should be the norm instead of the exception.”

Are you impatient enough to be an SRE?

Posted on March 7, 2017 by Rob H

Our focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

SRE minded teams are very impatient about eliminating manual, routine and non-differentiated work.

I’ve been talking to a lot of people about SRE lately in the context of helping Ops get out of the way while coping with increasing load and complexity. Why are they so impatient? Because they know that ops demand is constantly increasing, there’s no “good enough” when it comes to finding ways to automate tasks and move up stack. Without consistent improvement in automation, teams will get buried (my post about Ops Debt).

The core SRE mantra needs to be “Own Ops, don’t be owned by Ops.”

Yet, outsourcing ops responsibility to a service is equally problematic for an SRE. They cannot give up responsibility for the integrated system. In fact, that’s one of the basic reasons why Google’s SRE teams went from just “web site reliability” to full system thinking. Every aspect of the infrastructure stack needs to be considered when looking at system performance and reliability. For example, something deep like SSD drive write behavior or GPU BIOS could make a critical difference. SREs need to be able to root cause issues and black box infrastructure (a.k.a. Cloud) can get in the way.

SRE teams must balance owning the full stack versus focusing on what makes their job unique.

That’s why we have been rethinking how SRE teams approach infrastructure. Instead of trying to turn infrastructure into black box services, we’ve designed the Digital Rebar composable Ops platform to embrace and contain heterogeneity with a high degree of transparency and control. This is critical because SREs cannot afford to keep reinventing automation at the bottom of the stack. We must be able to share and leverage best-practices on infrastructure provisioning and platform deployment.

Like the hardware that runs it, the foundation automation layer must be commoditized.

That means that Operators should be able to buy infrastructure (physical and cloud) from any vendor and run it in a consistent way. Instead of days or weeks to get infrastructure running, it should take hours and be fully automated from power-on. We should be able to rehearse on cloud and transfer that automation directly to (and from) physical without modification. That practice and pace should be the norm instead of the exception.

That’s what we are building at RackN. Our primary goal is to reuse automation whenever possible. That was our top design priority for Digital Rebar and it drives our customer engagement models. If you’d like to hear more, download our SRE white paper.

More information:

RackN Home Page
RackN YouTube Page – See our technology in action
Contact Us: sre@RackN.com

SRE role with DevOps for Enterprise [@HPE podcast]

Posted on February 21, 2017 by Rob H

My focus on SRE series continues… At RackN, we see a coming infrastructure explosion in both complexity and scale. Unless our industry radically rethinks operational processes, current backlogs will escalate and stability, security and sharing will suffer.

Yes, DevOps and SRE are complementary

In this short 16 minute podcast, HPE’s Stephen Spector and I discuss how DevOps and SRE thinking overlap and where the differences are. We also discuss how Enterprises should be evaluating Site Reliability Engineering as a function and where it fits in their organization.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Tag Archives: SRE