Cloud Native Surfing at IBM Think 2018

Rob Hirschfeld speaks with Kevin Allen, Content Lead, IBM [@KevJosephAllen] about next week’s IBM Think 2018 conference (March 19-22) in Las Vegas. Contact us if you are interested in setting up a meeting with Rob next week at the event.

Highlights:

What is RackN working on? Physical Infrastructure Automation to manage metal in the data center as you would a VM in the cloud.

Trends in Infrastructure and Cloud space? Getting involved in immutable infrastructure, CI/CD pipelines, and a focus on zero-touch management. We have also been talking about Edge Computing and how it will be managed vs cloud.

The Cloud Native movement is like developers on surfboards who see a huge wave in the distance; where are we now? We are still at the point in open source where the technology is powerful and people are still learning how it works. Layers are forming on top of these container tools and customers are moving up the stack to understand more and more. The tide is coming in and the waves are getting bigger with lots and lots of wavelets still growing out at sea.

The enterprise user base is looking for more integration across projects. It doesn’t have to happen in one project; multiple projects can connect with each other.

Hybrid Cloud conversation has changed? Hybrid Cloud is the way people do business. The focus has moved to Hybrid IT with infrastructure being located at various locations allowing customers to take advantage of best of breed based on needs. The market is hybrid and customers need to integrate data flows between these services. Tools are lacking in this marketplace to manage this.

Looking forward to Think 2018? Interested in new AI and machine learning, but the key focus for the event is talking to real users and seeing real applications. Focus on actual deployments of this technology is more important than what is coming.

Advice for Event? Comfortable shoes. Allow time for unexpected things to happen – attend new talks based on speakers or topics you don’t know much about.

Podcast with Peter Miron talking NATS Service, Edge and Cloud Native Foundation


Joining this week’s L8ist Sh9y Podcast is Peter Miron, General Manager for the NATS project sponsored by Apcera. He provides details on this open source project, how it integrates with modern application architecture, and the team’s participation in the Cloud Native Computing Foundation.

About NATS

NATS is a family of open source products that are tightly integrated but can be deployed independently. NATS is being deployed globally by thousands of companies, spanning innovative use-cases including: Mobile apps, Microservices and Cloud Native, and IoT. NATS is also available as a hosted solution, NATS Cloud.

The core NATS Server acts as a central nervous system for building distributed applications. There are dozens of clients, ranging from Java and .NET to Go. NATS Streaming extends the platform to provide for real-time streaming & big data use-cases.
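To make the “central nervous system” idea concrete, here is a minimal in-process sketch of subject-based publish/subscribe, the messaging pattern NATS is built around. It is not the NATS server or any NATS client API; the `TinyBus` class and the subject names are made up for illustration. The wildcard rules follow NATS conventions, where `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a dot-separated subject matches a NATS-style pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(p_tokens):
        if tok == ">":
            # '>' must be the last pattern token and needs at least one token left.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


class TinyBus:
    """Hypothetical in-memory stand-in for a message bus (not a NATS API)."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[str, str], None]]] = defaultdict(list)

    def subscribe(self, pattern: str, handler: Callable[[str, str], None]) -> None:
        self._subs[pattern].append(handler)

    def publish(self, subject: str, payload: str) -> int:
        """Deliver payload to every matching subscriber; return delivery count."""
        delivered = 0
        for pattern, handlers in self._subs.items():
            if subject_matches(pattern, subject):
                for handler in handlers:
                    handler(subject, payload)
                    delivered += 1
        return delivered


bus = TinyBus()
received = []
bus.subscribe("sensors.*.temp", lambda s, p: received.append(("one-level", s, p)))
bus.subscribe("sensors.>", lambda s, p: received.append(("tail", s, p)))

count_a = bus.publish("sensors.rack1.temp", "21.5")    # matches both patterns
count_b = bus.publish("sensors.rack1.fan.rpm", "900")  # matches only 'sensors.>'
count_c = bus.publish("billing.invoice", "42")         # matches nothing
```

Publishers and subscribers here only share a subject name, never a direct reference to each other, which is the loose coupling the podcast discusses.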

 

Topic                               Time (Minutes.Seconds)

Introduction                        0.00 – 2.07
What is NATS?                       2.07 – 3.36
Built for Containers, Short Term    3.36 – 5.14
Simple Example                      5.14 – 6.51
Container Service Mesh Concept      6.51 – 9.20
Loosely Coupled?                    9.20 – 12.02
Inter-process Communication         12.02 – 15.11
Security                            15.11 – 18.02
Generic Politics Discussion         18.02 – 24.10
Edge Computing & NATS               24.10 – 28.55
Apps to Service Portability         28.55 – 32.37
Open Source Politics – CNCF         32.37 – 39.53
Conclusion                          39.53 – END

Podcast Guest: Peter Miron
General Manager for NATS team

Peter Miron is an architect at Apcera, a highly secure, policy-driven platform for cloud-native applications and microservices. He was previously the director of technology for Pershing.

Before joining Pershing, Miron worked as the SVP of engineering at Bitly and vice president at Vonage. He also worked as the CTO of Knewton.

Miron holds a bachelor’s degree in art history from Syracuse University.

 

Coal or Diamonds? Configuration Management is Under Pressure

Cloud Native thinking is thankfully changing the way we approach traditional IT infrastructure.  These profound changes in how we build applications with 12-factor design and containers have deep implications for how we manage configuration and the tools we use to do it.  These are not cloud-only impacts – the changes impact every corner of IT data centers.

“You still have to do configuration management but… we’re getting to a point we can do a lot less” (8:30)

Configuration Management is both necessary and very hard. I’ve written and spoken about the developer rebellion against Infrastructure (and will again at DOD Dallas!).  The TL;DR on that lightning talk is “infrastructure sucks.”

In this podcast, Eric and I have time to stretch out and really discuss what’s going on in both broad and specific terms.  At the 15 minute mark, we start talking about how “radical simplicity” is coming to provisioning and deployment automation.  We break down how the business needs for repeatable and robust automation are driving IT to rethink huge swaths of their infrastructures.  That transitions into making a whole data center into a CI/CD pipeline.

Podcast: Episode 15: The Death of Configuration Management with Rob Hirschfeld 

“If we have radically better control of the physical infrastructure, then I don’t need anything else to install Kubernetes.” (22:00)

Like always, Eric and I are not shy about taking on IT hot topics.  Dig deep, enjoy and let us know what YOU think about these topics.  We want to hear from you.

Datanauts #89 dives deep on SRE approach and urgency

TL;DR: SRE makes Ops more Dev-like in critical ways like status equity and tooling approaches.

In Datanauts 089, Chris Wahl and Ethan Banks help me break down the concepts from my “DevOps vs SRE vs Cloud Native” presentation from DevOpsDays Austin last spring. They do a great job exploring the tough topics and concepts from the presentation.  It’s almost like an extended Q&A so you may want to review the slides or recording before diving into the podcast.

Advanced Reading: my follow-up discussion on SRE with the Cloudcast team and my previous Datanauts podcast.

Here are my notes from the podcast:

  • 01:00 “Doing infrastructure in a way that the robots can take over”
  • 01:51 Video where Charity & Rob Debated the SRE term
  • 02:00 History of SRE term from Google vs Sys Ops – if site was not up, money was not flowing.  SRE culture fixed pay equity and career ladder, ops would have automation/dev time, devs on the hook for errors
  • 03:00 Google took a systems approach with lots of time for automation and coding
  • 03:20 Finding a 10x improvement in ops.  Go buy the book
  • 04:00 SRE is a new definition of System Op
  • 04:10 The S in SRE could be “system” or physical location (not web site).
  • 05:00 We’re seeing SRE teams showing up in companies of every size.  Replacing DevOps teams (which is a good thing).  Rob is hoping that SRE is replacing DevOps as a job title.  
  • 06:10 Don’t fall for a title change from Sys Op to SRE without actually getting the pay and authority
  • 06:45 Ethan believes that SRE is transforming to have a broad set of responsibilities.  Is it just a new System Admin definition?
  • 07:30 Rob thinks that the SRE expectation is for a much higher level of automation.  There’s a big thinking shift.
  • 08:00 SREs are still operators.  You have to walk the walk to know how to run the system.  Not developers who are writing the platform.
  • 08:30 Chris asks about the Ops technical debt
  • 09:00 We need to make Ops tooling “better enough” – we’re not solving this problem fast enough.  We have to do a better job – Rob talks about the WannaCry event.
  • 10:30 Chris asks how to fix this since complexity is increasing.  Rob plugs Digital Rebar as a way to solve this.
  • 11:00 People are excited about Digital Rebar but don’t have the time to fix the problem.  They are running crisis to crisis so we never get to automation that actually improves things.
  • 12:00 At best, Ops is invisible.  SRE is different because it includes CI/CD with ongoing interactions.  There’s a lot coming with immutable operating systems and constantly term.
  • 13:00 The idea that a Linux system has been up for 10 years is an anti-pattern.  Rob would rather have people say that none of their servers has been up for more than a week (because they are constantly refreshed)
  • 13:19 Chris & Ethan – SECTION 1 REVIEW
    • SRE is not new, it’s about moving into a proactive stance (automatically reacting)
    • The power is the buy in so that Ops has ownership of the stack
  • 15:00 SRE vs DevOps vs Cloud Native – not in conflict, but we love to create opposition
  • 15:40 There is a difference, they are not interchangeable.  SRE is a job title, DevOps is a process and Cloud Native is an architecture.
  • 16:30 We need to resist the idea that Cloud Native is a “new shiny” that replaces DevOps. We don’t have to take things away.
  • 17:00 Lean is a process where we’re trying to shorten the flow from ideation to delivery.  Read The Goal and The Phoenix Project.
  • 18:00 Bottlenecks (where we’ve added work or delays) really break our pipelines.  
  • 19:00 Ethan adds the insight: If you don’t have small steps then you don’t really understand your process
  • 20:00 Platform as a Service is not really reducing complexity, we’re just hiding/abstracting it.  That moves the complexity.  We may hide it from developers but may be passing it to the operators.
  • 21:00 Chris asks if this can be mapped to legacy?  Rob agrees that it’s a legacy architectural choice that was made to reduce incremental risk.  Today, we’re trying to break our risk into smaller steps, which means we will have smaller but more frequent breaks.
  • 22:40 The way we deliver systems is changing to require a much faster pace of taking changes
  • 23:00 SREs are data driven so they can feed information back to devs.  They can’t (shouldn’t) walk away from running systems.  This is an investment requirement so we can create data.
  • 24:00 We let a lot of problems lurk below the surface that eventually surface as a critical issue.  Cannot let toothaches turn into abscesses.  SREs should watch systems over time.
  • 25:20 If you are running under performance in the cloud, then you are wasting money.
  • 26:00 Cloud Native, an architecture?  What is it?  It means a ton of things.  For this preso, Rob made it about 12 factor and API driven infrastructure.
  • 26:50 “If you are not worried about rising debt then we are in trouble.”  We need to root cause!  If not, problems snowball and operators are just running fire to fire.  We need to stop having operators be heroes / grenade divers because it’s an anti-pattern.  Predictable systems do not create a lot of interrupts or crises.  Operators should not be event driven.
  • 28:40 Chris & Ethan – SECTION 2 REVIEW
    • Chris: Being data driven combats complexity
    • Ethan: Breaking down processes into smaller units reduces risk.  
  • 30:00 Cloud First is not Cloud Only.  CNCF projects are not VM specific, they are about abstractions that help developers be more productive.  Ideally, the abstractions remove infrastructure because developers don’t want to do any infrastructure.  We should not care about which type of infrastructure we are using.
  • 31:30 The similarities between the concepts are in their common outcomes/values.  Cloud First wants to be infrastructure agnostic.
  • 32:30 Chris asks how important CI/CD should be.  Are these still important in non-Cloud environments?  Rob thinks that Cloud Native may “cloud wash” architectures that are really just as important in traditional infrastructure.
  • 34:00 Cloud Native was a defensive architecture because early cloud was not very good.  CI/CD pipelines would be considered best practices in regular manufacturing. 
  • 35:00 These ideas are really good manufacturing processes applied back to IT.  Thankfully, there’s really nothing unexpected from repeatable production.
  • 36:30 Lesson: Pay Equity.  Traditionally operators are not paid as well as developers and that means that we’re giving them less respect.  HiPPO (highest paid person in organization) is a very real effect where you can create a respect gap.
  • 38:00 Lesson: Disrupt Less.  We love the idea of disruption but disruptions are very expensive, and disproportionately so for the operators.  Change for Developers may be small but have big impacts on operators.  More disruptive changes actually slow down adoption because that slows down inertia.  SREs should be able to push back to insist on migration paths.
  • 40:00 Rob talks about how RedFish, while good to replace IPMI, will take a long time before it does.  There are pros and cons.
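
The 13:00 note about constantly refreshed servers can be turned into a simple policy check. The sketch below assumes you already have each machine's boot time from your own inventory; the `stale_servers` helper, the host names, and the one-week threshold are illustrative and not part of any tool discussed in the podcast.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_UPTIME = timedelta(days=7)  # assumed policy: rebuild at least weekly


def stale_servers(boot_times: Dict[str, datetime], now: datetime,
                  max_uptime: timedelta = MAX_UPTIME) -> List[str]:
    """Return names of servers whose uptime exceeds max_uptime, oldest boot first."""
    overdue = [(booted, name) for name, booted in boot_times.items()
               if now - booted > max_uptime]
    return [name for booted, name in sorted(overdue)]


# Hypothetical inventory: host name -> last boot time.
now = datetime(2017, 6, 23, 12, 0, tzinfo=timezone.utc)
inventory = {
    "web-01": now - timedelta(days=2),
    "db-01": now - timedelta(days=400),
    "cache-01": now - timedelta(days=9),
}
to_refresh = stale_servers(inventory, now)
print(to_refresh)
```

In this framing a long uptime is a finding to act on rather than a point of pride, which matches the anti-pattern Rob describes.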

 

June 23 – Weekly Recap of All Things Site Reliability Engineering (SRE)

Welcome to the weekly post of the RackN blog recap of all things SRE. If you have any ideas for this recap or would like to include content, please contact us at info@rackn.com or tweet Rob (@zehicle) or RackN (@rackngo).

SRE Items of the Week

Datanauts 089: SRE vs Cloud Native vs DevOps
http://bit.ly/2txPXWV

Rob Hirschfeld joins the Datanauts to talk about the term Site Reliability Engineer (SRE) and what it means for IT operations.

Rob explores how the SRE designation is an effort to put operations teams on a more equal footing with developers within an organization. Rob and the Datanauts also discuss how SREs line up with other industry trends such as the cloud native and DevOps movements. LISTEN HERE

Why Does DevOps Require a New Operating Model? By Mustafa Kapadia @MKapadiaTweets
https://devops.com/why-should-cios-redesign-their-organizations/

For many, redesigning the operating model is table stakes for a successful DevOps transformation. But have you ever wondered why? Popular wisdom will have you believe that the main reasons for operating model redesign are to…

“Improve collaboration between business and IT”
“Realign metrics”
“Take full advantage of the new tools”
“And even jump start culture change”

While these are all good reasons, frankly they miss the point. Experience suggests there is a more practical reason – match ownership with desired output.

What do we mean by that? Well first, let’s look at how the current model works. READ MORE

What can developers learn from being on call? By Julia Evans @b0rk http://jvns.ca/blog/2017/06/18/operate-your-software/

We often talk about being on call as being a bad thing. For example, the night before I wrote this my phone woke me up in the middle of the night because something went wrong on a computer. That’s no fun! I was grumpy.

In this post, though, we’re going to talk about what you can learn from being on call and how it can make you a better software engineer! And to learn from being on call you don’t necessarily need to get woken up in the middle of the night. By “being on call”, here, I mean “being responsible for your code when it breaks”. It could mean waking up to issues that happened overnight and needing to fix them during your workday! READ MORE

Kargo Ansible Playbooks foster Collaborative Kubernetes Ops
http://bit.ly/2qENw3I   

Why Kargo?
Making Kubernetes operationally strong is a widely held priority and I track many deployment efforts around the project. The incubated Kargo project is of particular interest for me because it uses the popular Ansible toolset to build robust, upgradable clusters on both cloud and physical targets. I believe using tools familiar to operators grows our community.

We’re excited to see the breadth of platforms enabled by Kargo and how well it handles a wide range of options like integrating Ceph for StatefulSet persistence and Helm for easier application uploads. Those additions have allowed us to fully integrate the OpenStack Helm charts (demo video). READ MORE

newsletter

Subscribe to our new daily DevOps, SRE, & Operations Newsletter https://paper.li/e-1498071701#/

UPCOMING EVENTS

Rob Hirschfeld and Greg Althaus are preparing for a series of upcoming events where they are speaking or just attending. If you are interested in meeting with them at these events please email info@rackn.com.

  • 2017 New York Venture Summit – LINK

OTHER NEWSLETTERS

Cloud Native PHYSICAL PROVISIONING? Come on! Really?!

We believe Cloud Native development disciplines are essential regardless of the infrastructure.

Today, RackN announced very low entry-level support for Digital Rebar Provisioning – the RESTful Cobbler PXE/DHCP replacement.  Having a company actually standing behind this core data center function with support is a big deal; however…

We’re making two BIG claims with Provision: breaking DevOps bottlenecks and cloud native physical provisioning.  We think both points are critical to SRE and Ops success because our current approaches are not keeping pace with developer productivity and hardware complexity.

I’m going to post more about how Provision can help address the political struggles of SRE and DevOps that I’ve been watching in our industry.   A hint is in the release, but the Cloud Native comment needs to be addressed.

First, Cloud Native is an architecture, not an infrastructure statement.

There is no requirement that we use VMs or AWS in Cloud Native.  From that perspective, “Cloud” is a useful but deceptive adjective.  Cloud Native is born from applications that had to succeed in hands-off, lower SLA infrastructure with fast delivery cycles on untrusted systems.  These are very hostile environments compared to “legacy” IT.

What makes Digital Rebar Provision Cloud Native?  A lot!

The following is a list of key attributes I consider essential for Cloud Native design.

Micro-services Enabled: The larger Digital Rebar project is a micro-services design.  Provision reflects a stand-alone bundling of two services: DHCP and Provision.  The new Provision service is designed to both stand alone (with embedded UX) and be part of a larger system.

Swagger RESTful API: We designed the APIs first based on years of experience.  We spent a lot of time making sure that the API conformed to spec and that includes maintaining the Swagger spec so integration is easy.

Remote CLI: We build and test our CLI extensively.  In fact, we expect that to be the primary user interface.

Security Designed In: We are serious about security even in challenging environments like PXE where options are limited by 20-year-old protocols.  HTTPS is required and user or bearer token authentication is required.  That means that even API calls from machines can be secured.

12 Factor & API Config: There is no file configuration for Provision.  The system starts with command line flags or environment variables.  Deeper configuration is done via API/CLI.  That ensures that the system can be fully managed remotely and configured securely because credentials are required for configuration.

Fast Start / Golang:  Provision is a totally self-contained golang app including the UX.  Even so, it’s very small.  You can run it on a laptop from nothing in about 2 minutes including download.

CI/CD Coverage: We committed to deep test coverage for Provision and have consistently increased coverage with every commit.  It ensures quality and prevents regressions.

Documentation In-project Auto-generated: On-boarding is important since we’re talking about small, API-driven units.  A lot of Provisioning documentation is generated directly from the code into the actual project documentation.  Also, the written documentation is in reStructuredText in the project with good indexes and cross-references.  We regenerate the documentation with every commit.
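
The “12 Factor & API Config” point above (no config file; start from flags or environment variables) can be sketched as follows. This is not Digital Rebar Provision's actual flag handling: the flag names, the `EXAMPLE_` environment prefix, and the defaults are all hypothetical, chosen only to show the pattern where a flag wins over an environment variable, which in turn wins over a built-in default.

```python
import argparse
from typing import Mapping, Sequence


def load_config(argv: Sequence[str], env: Mapping[str, str]) -> argparse.Namespace:
    """Resolve settings from flags first, then environment, then defaults."""
    parser = argparse.ArgumentParser(prog="example-service")
    # Environment values (hypothetical EXAMPLE_* names) become the flag defaults,
    # so an explicit flag on the command line always takes precedence.
    parser.add_argument("--api-port", type=int,
                        default=int(env.get("EXAMPLE_API_PORT", "8443")))
    parser.add_argument("--data-dir",
                        default=env.get("EXAMPLE_DATA_DIR", "/var/lib/example"))
    return parser.parse_args(list(argv))


# In a real entry point this would be load_config(sys.argv[1:], os.environ).
defaults = load_config([], {})
from_env = load_config([], {"EXAMPLE_API_PORT": "9000"})
from_flag = load_config(["--api-port", "9100"], {"EXAMPLE_API_PORT": "9000"})
```

Passing `argv` and `env` in as arguments, rather than reading globals inside the function, keeps the startup logic easy to test without touching the real process environment.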

We believe these development disciplines are essential regardless of the infrastructure.  That’s why we made sure the v3 Provision (and ultimately every component of Digital Rebar as we iterate to v3) was built to these standards.

What do you think?  Is this Cloud Native?  What did we miss?

DevOps vs Cloud Native: Damn, where did all this platform complexity come from?

Complexity has always been part of IT and it’s increasing as we embrace microservices and highly abstracted platforms.  Making everyone cope with this challenge is unsustainable.

We’re just more aware of infrastructure complexity now that DevOps is exposing this cluster configuration to developers and automation tooling. We are also building platforms from more loosely connected open components. The benefit of customization and rapid development has the unfortunate side-effect of adding integration points. Even worse, those integrations generally require operations in a specific sequence.
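
One way to picture the sequencing problem is as a dependency graph that must be resolved into a valid order. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the step names are invented for illustration and do not describe any particular platform's install process.

```python
from graphlib import TopologicalSorter

# Each key lists the steps that must finish before it can start (hypothetical names).
depends_on = {
    "provision-os": set(),
    "configure-network": {"provision-os"},
    "install-container-runtime": {"provision-os"},
    "join-cluster": {"configure-network", "install-container-runtime"},
    "deploy-app": {"join-cluster"},
}

# static_order() yields every step only after all of its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Every loosely connected component added to a platform adds nodes and edges to a graph like this one, which is one reason each new integration point makes the overall sequence harder to get right.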

The result is a developer rebellion against DevOps on low-level (IaaS) platforms towards ones with higher-level abstractions (PaaS) like Kubernetes.

This rebellion is taking the form of “cloud native” being in opposition to “devops” processes. I discussed exactly that point with John Furrier on theCUBE at Kubecon and again in my Messy Underlay presentation at Defrag Conf.

It is very clear that the DevOps mission to share ownership of messy production operations requirements is not widely welcomed. Unfortunately, there is no magic cure for production complexity because systems are inherently complex.

There is a (re)growing expectation that operations will remain operations instead of becoming a shared team responsibility.  While this thinking apparently rolls back core principles of the DevOps movement, we must respect the perceived productivity impact of making operations responsibility overly broad.

What is the right way to share production responsibility between teams?  We can start to leverage platforms like Kubernetes to hide underlay complexity and allow DevOps shared ownership in the right places.  That means that operations still owns the complex underlay and platform jobs.  Overall, I think that’s a workable diversion.