Deep Thinking & Tech + Great Guests – L8ist Sh9y podcast relaunched

I love great conversations about technology – especially ones where the answer is not very neatly settled into winners and losers (which is ALL of them in IT).  I’m excited that RackN has (re)launched the L8ist Sh9y (aka Latest Shiny) podcast around this exact theme.

Please check out the deep and thoughtful discussion I just had with Mark Thiele (notes) of Aperca where we covered Mark’s thought on why public cloud will be under 20% of IT and culture issues head on.

Spoiler: we have David Linthicum coming next, SO SUBSCRIBE.

I’ve been a guest on some great podcasts (Cloudcast, gcOnDemand, Datanauts, IBM Dojo, HPEFoodfight) and have deep respect for critical work they do in industry.

We feel there’s still room for deep discussions specifically around automated IT Operations in cloud, data center and edge; consequently, we’re branching out to start including deep interviews in addition to our initial stable of IT Ops deep technical topics like Terraform, Edge Computing, GartnerSYM review, Kubernetes and, of course, our own Digital Rebar.

Soundcloud Subscription Information

 

Exploring the Edge Series: “Edge is NOT just Mini-Cloud”

While the RackN team and I have been heads down radically simplifying physical data center automation, I’ve still been tracking some key cloud infrastructure areas.  One of the more interesting ones to me is Edge Infrastructure.

This once obscure topic has come front and center based on coming computing stress from home video, retail machine and distributed IoT.  It’s clear that these are not solved from centralized data centers.

While I’m posting primarily on the RackN.com blog, I like to take time to bring critical items back to my personal blog as a collection.  WARNIING: Some of these statements run counter to other industry.  Please let me know what you think!

Don’t want to read?  Here’s a summary podcast.

Post 1: OpenStack On Edge? 4 Ways Edge Is Distinct From Cloud

By far the largest issue of the Edge discussion was actually agreeing about what “edge” meant.  It seemed as if every session had a 50% mandatory overhead in definitioning.  Putting my usual operations spin on the problem, I choose to define edge infrastructure in data center management terms.  Edge infrastructure has very distinct challenges compared to hyperscale data centers.  Read article for the list...

Post 2: Edge Infrastructure Is Not Just Thousands Of Mini Clouds

Running each site as a mini-cloud is clearly not the right answer.  There are multiple challenges here. First, any scale infrastructure problem must be solved at the physical layer first. Second, we must have tooling that brings repeatable, automation processes to that layer. It’s not sufficient to have deep control of a single site: we must be able to reliably distribute automation over thousands of sites with limited operational support and bandwidth. These requirements are outside the scope of cloud focused tools.

Post 3: Go CI/CD And Immutable Infrastructure For Edge Computing Management

If “cloudification” is not the solution then where should we look for management patterns?  We believe that software development CI/CD and immutable infrastructure patterns are well suited to edge infrastructure use cases.  We discussed this at a session at the OpenStack OpenDev Edge summit.

What do YOU think?  This is an evolving topic and it’s time to engage in a healthy discussion.

OpenStack Boston Day 1 Notes

Contrary to pundit expectations, OpenStack did not roll over and die during the keynotes yesterday.

20170508_093339

In my 2011 Boston Summit shirt.

In fact, I saw the signs of a maturing project seeing real use and adoption. More critically, OpenStack leadership started the event with an acknowledgement of being part of, not owning, the vibrant open infrastructure community.

Continued Growth in Core Areas

Practical reasons for running dedicated infrastructure (compliance, control and cost) make OpenStack relevant for companies and governments with significant budgets. There is also a healthy shared infrastructure (aka public cloud) market living in the shadow of the big 3 players. It’s still unclear how this ecosystem will make money for the vendors.

What do customers buy? Should the Core be free?

My personal experience is that most customers are reluctant to (but grudgingly do) buy distros for the core open technology. They are much more willing to pay for adjacencies like security, storage and networking.

Emerging Challenges from Adjacent Technologies

Containers and Kubernetes are making a significant impact on the OpenStack community. At points, the OpenStack keynote was more about Kubernetes than OpenStack. It’s also clear that customers want to use containers as an abstraction layer to make infrastructure less visible or locked-in. That opens the market for using servers directly (bare metal) or other clouds. That portability is likely to help OpenStack more than hurt it because customers can exit workloads from the Big 3 players.

Friction for adoption remains a critical hurdle.

Containers, which are cloud first platforms, have much less friction than IaaS platforms. IaaS platforms, even managed ones, require physical infrastructure with the matching complexity and investment.

OpenStack: an open infrastructure software community

Overall, the summit remains an amazing community space for open infrastructure software and cloud alternatives to the Big 3 players. The Foundation’s pivot to embrace Kubernetes and foster several other open technologies helps maintain the central enthusiasm for open source infrastructure that gave birth to the platform in the first place.

A healthy pragmatic vibe

The summit may not have the same heady taking-on-the-world feeling as the early days; instead, it has a healthy pragmatic vibe. Considering how frothy this space remains, that may be a welcome relief.

What are your impressions? I’m looking forward to hearing from you!

Cloud Native PHYSICAL PROVISIONING? Come on! Really?!

We believe Cloud Native development disciplines are essential regardless of the infrastructure.

imageToday, RackN announce very low entry level support for Digital Rebar Provisioning – the RESTful Cobbler PXE/DHCP replacement.  Having a company actually standing behind this core data center function with support is a big deal; however…

We’re making two BIG claims with Provision: breaking DevOps bottlenecks and cloud native physical provisioning.  We think both points are critical to SRE and Ops success because our current approaches are not keeping pace with developer productivity and hardware complexity.

I’m going to post more about Provision can help address the political struggles of SRE and DevOps that I’ve been watching in our industry.   A hint is in the release, but the Cloud Native comment needs to be addressed.

First, Cloud Native is an architecture, not an infrastructure statement.

There is no requirement that we use VMs or AWS in Cloud Native.  From that perspective, “Cloud” is a useful but deceptive adjective.  Cloud Native is born from applications that had to succeed in hands-off, lower SLA infrastructure with fast delivery cycles on untrusted systems.  These are very hostile environments compared to “legacy” IT.

What makes Digital Rebar Provision Cloud Native?  A lot!

The following is a list of key attributes I consider essential for Cloud Native design.

Micro-services Enabled: The larger Digital Rebar project is a micro-services design.  Provision reflects a stand-alone bundling of two services: DHCP and Provision.  The new Provision service is designed to both stand alone (with embedded UX) and be part of a larger system.

Swagger RESTful API: We designed the APIs first based on years of experience.  We spent a lot of time making sure that the API conformed to spec and that includes maintaining the Swagger spec so integration is easy.

Remote CLI: We build and test our CLI extensively.  In fact, we expect that to be the primary user interface.

Security Designed In: We are serious about security even in challenging environments like PXE where options are limited by 20 year old protocols.  HTTPS is required and user or bearer token authentication is required.  That means that even API calls from machines can be secured.

12 Factor & API Config: There is no file configuration for Provision.  The system starts with command line flags or environment variables.  Deeper configuration is done via API/CLI.  That ensures that the system can be fully managed by remote and configured securely becausee credentials are required for configuration.

Fast Start / Golang:  Provision is a totally self-contained golang app including the UX.  Even so, it’s very small.  You can run it on a laptop from nothing in about 2 minutes including download.

CI/CD Coverage: We committed to deep test coverage for Provision and have consistently increased coverage with every commit.  It ensures quality and prevents regressions.

Documentation In-project Auto-generated: On-boarding is important since we’re talking about small, API-driven units.  A lot of Provisioning documentation is generated directly from the code into the actual project documentation.  Also, the written documentation is in Restructured Text in the project with good indexes and cross-references.  We regenerate the documentation with every commit.

We believe these development disciplines are essential regardless of the infrastructure.  That’s why we made sure the v3 Provision (and ultimately every component of Digital Rebar as we iterate to v3) was built to these standards.

What do you think?  Is this Cloud Native?  What did we miss?

10x Faster Today but 10x Harder to Maintain Tomorrow: the Cul-De-Sac problem

I’ve been digging into what it means to be a site reliability engineer (SRE) and thinking about my experience trying to automate infrastructure in a way to scales dramatically better.  I’m not thinking about scale in number of nodes, but in operator efficiency.  The primary way to create that efficiency is limit site customization and to improve reuse.  Those changes need to start before the first install.

As an industry, we must address the “day 2” problem in collaboratively developed open software before users’ first install.

Recently, RackN asked the question “Shouldn’t we have Shared Automation for Commodity Infrastructure?” which talked about fact that we, as an industry, keep writing custom automation for what should be commodity servers.  This “snow flaking” happens because there’s enough variation at the data center system level that it’s very difficult to share and reuse automation on an ongoing basis.

Since variation enables innovation, we need to solve this problem without limiting diversity of choice.

 

Happily, platforms like Kubernetes are designed to hide these infrastructure variations for developers.  That means we can expect a productivity explosion for the huge number of applications that can narrowly target platforms.  Unfortunately, that does nothing for the platforms or infrastructure bound applications.  For this lower level software, we need to accept that operations environments are heterogeneous.

 

I realized that we’re looking at a multidimensional problem after watching communities like OpenStack struggle to evolve operations practice.

It’s multidimensional because we are building the operations practice simultaneously with the software itself.  To make things even harder, the infrastructure and dependencies are also constantly changing.  Since this degree of rapid multi-factor innovation is the new normal, we have to plan that our operations automation itself must be as upgradable.

If we upgrade both the software AND the related deployment automation then each deployment will become a cul-de-sac after day 1.

For open communities, that cul-de-sac challenge limits projects’ ability to feed operational improvements back into the user base and makes it harder for early users to stay current.  These challenges limit the virtuous feedback cycles that help communities grow.  

The solution is to approach shared project deployment automation as also being continuously deployed.

This is a deceptively hard problem.

This is a hard problem because each deployment is unique and those differences make it hard to absorb community advances without being constantly broken.  That is one of the reasons why company opt out of the community and into vendor distributions. While Vendors are critical to the ecosystem, the practice ultimately limits the growth and health of the community.

Our approach at RackN, as reflected in open Digital Rebar, is to create management abstractions that isolate deployment variables based on system level concerns.  Unlike project generated templates, this approach absorbs heterogeneity and brings in the external information that often complicate project deployment automation.  

We believe that this is a general way to solve the broader problem and invite you to participate in helping us solve the Day 2 problems that limit our open communities.

Surgical Ansible & Script Injections before, during or after deployment.

I’ve been posting about the unique composable operations approach the RackN team has taken with Digital Rebar to enable hybrid infrastructure and mix-and-match underlay tooling.  The orchestration design (what we call annealing) allows us to dynamically add roles to the environment and execute them as single role/node interactions in operational chains.

ansiblemtaWith our latest patches (short demo videos below), you can now create single role Ansible or Bash scripts dynamically and then incorporate them into the node execution.

That makes it very easy to extend an existing deployment on-the-fly for quick changes or as part of a development process.

You can also run an ad hoc bash script against one or groups of machines.  If that script is something unique to your environment, you can manage it without having to push it back upsteam because Digital Rebar workloads are composable and designed to be safely integrated from multiple sources.

Beyond tweaking running systems, this is fastest script development workflow that I’ve ever seen.  I can make fast, surgical iterative changes to my scripts without having to rerun whole playbooks or runlists.  Even better, I can build multiple operating system environments side-by-side and test changes in parallel.

For secure environments, I don’t have to hand out user SSH access to systems because the actions run in Digital Rebar context.  Digital Rebar can limit control per user or tenant.

I’m very excited about how this capability can be used for dev, test and production systems.  Check it out and let me know what you think.

 

 

 

2016 Infrastructure Revolt makes 2017 the “year of the IT Escape Clause”

Software development technology is so frothy that we’re developing collective immunity to constant churn and hype cycles. Lately, every time someone tells me that they have hot “picked technology Foo” they also explain how they are also planning contingencies for when Foo fails. Not if, when.

13633961301245401193crawfish20boil204-mdRequired contingency? That’s why I believe 2017 is the year of the IT Escape Clause, or, more colorfully, the IT Crawfish.

When I lived in New Orleans, I learned that crawfish are anxious creatures (basically tiny lobsters) with powerful (and delicious) tails that propel them backward at any hint of any danger. Their ability to instantly back out of any situation has turned their name into a common use verb: crawfish means to back out or quickly retreat.

In IT terms, it means that your go-forward plans always include a quick escape hatch if there’s some problem. I like Subbu Allamaraju’s description of this as Change Agility.  I’ve also seen this called lock-in prevention or contingency planning. Both are important; however, we’re reaching new levels for 2017 because we can’t predict which technology stacks are robust and complete.

The fact is the none of them are robust or complete compared to historical platforms. So we go forward with an eye on alternatives.

How did we get to this state? I blame the 2016 Infrastructure Revolt.

Way, way, way back in 2010 (that’s about bronze age in the Cloud era), we started talking about developers helping automate infrastructure as part of deploying their code. We created some great tools for this and co-opted the term DevOps to describe provisioning automation. Compared to the part, it was glorious with glittering self-service rebellions and API-driven enlightenment.

In reality, DevOps was really painful because most developers felt that time fixing infrastructure was a distraction from coding features.

In 2016, we finally reached a sufficient platform capability set in tools like CI/CD pipelines, Docker Containers, Kubernetes, Serverless/Lambda and others that Developers had real alternatives to dealing with infrastructure directly. Once we reached this tipping point, the idea of coding against infrastructure directly become unattractive. In fact, the world’s largest infrastructure company, Amazon, is actively repositioning as a platform services company. Their re:Invent message was very clear: if you want to get the most from AWS, use our services instead of the servers.

For most users, using platform services instead of infrastructure is excellent advice to save cost and time.

The dilemma is that platforms are still evolving rapidly. So rapidly that adopters cannot count of the services to exist in their current form for multiple generations. However, the real benefits drive aggressive adoption. They also drive the rise of Crawfish IT.

As they say in N’Awlins, laissez les bon temps rouler!

Related Reading on the Doppler: Your Cloud Strategy Must Include No-Cloud Options