DC2020: Is Exposing Bare Metal Practical or Dangerous?

One of IBM’s major announcements at Think 2018 was Managed Kubernetes on Bare Metal. This new offering combines elements of their existing offerings to add security, attestation and performance isolation. Bare metal has been a hot topic for cloud service providers recently, with AWS adding it to their platform and Oracle using it as their primary IaaS. With these offerings as a backdrop, let’s explore the role of bare metal in the 2020 Data Center (DC2020).

Physical servers (aka bare metal) are the core building block for any data center; however, they are often abstracted out of sight by a virtualization layer such as VMware, KVM, HyperV or many others. These platforms are useful for many reasons. In this post, we’re focused on the fact that they provide a control API for infrastructure that makes it possible to manage compute, storage and network requests. Yet the abstraction comes at a price in cost, complexity and performance.

The historical lack of good API control has made bare metal less attractive, but that is changing quickly due to two forces.

These two forces are Container Platforms and Bare Metal as a Service, or BMaaS (disclosure: RackN offers a private BMaaS platform called Digital Rebar). Container Platforms such as Kubernetes provide an application service abstraction for data center consumers that eliminates the need for users to worry about traditional infrastructure concerns. That means most users no longer rely on APIs for compute, network or storage, allowing the platform to handle those issues. On the other side, BMaaS provides VM-style infrastructure APIs for the actual physical layer of the data center, giving users who do care about compute, network or storage the ability to work without VMs.

The combination of containers and bare metal APIs has the potential to squeeze virtualization into a limited role.

The IBM bare metal Kubernetes announcement illustrates both of these forces working together.  Users of the managed Kubernetes service are working through the container abstraction interface and really don’t worry about the infrastructure; however, IBM is able to leverage their internal bare metal APIs to offer enhanced features to those users without changing the service offering.  These benefits include security (IBM White Paper on Security), isolation, performance and (eventually) access to metal features like GPUs. While the IBM offering still includes VMs as an option, it is easy to anticipate that becoming less attractive for all but smaller clusters.

The impact for DC2020 is that operators need to rethink how they rely on virtualization as a ubiquitous abstraction.  As more applications rely on container service abstractions the platforms will grow in size and virtualization will provide less value.  With the advent of better control of the bare metal infrastructure, operators have real options to get deep control without adding virtualization as a requirement.

Shifting to new platforms creates opportunities to streamline operations in DC2020.

Even with virtualization and containers, having better control of the bare metal is a critical addition to data center operations.  The ideal data center has automation and control APIs for every possible component from the metal up.


DC2020: Mono-clouds are easier! Why do Hybrid?

Background: This post was inspired by a multi-cloud session at IBM Think 2018, which I am attending as a guest of IBM. Providing hybrid solutions is a priority for IBM, and its customers are clearly looking for multi-cloud options. In this way, IBM has made a choice to support competitive platforms. This post explores why they would do that.

There is considerable angst and hype over the terms multi-cloud and hybrid-cloud. While it would be much simpler if companies could silo into a single platform, innovation and economics require a multi-party approach. The problem is NOT that we want choice and multiple suppliers. The problem is that we are moving so quickly that there is minimal interoperability and minimal effort to create interoperability.

To drive interoperability, we need a strong commercial incentive to create a neutral ecosystem.

Even something with a clear ANSI standard like SQL has interoperability challenges. It also seems like the software industry has given up on standards in favor of APIs and rapid innovation. The reality on the ground is that technology is fundamentally heterogeneous and changing. For this reason, mono-anything is a myth and hybrid is really status quo.

If we accept multi-* as a starting point, then we need to invest in portability and avoid platform assumptions when we build automation. Good design assumes change at multiple points in your stack. Automation itself is a key requirement because it enables rapid iterative build, test and deploy cycles. It is not enough to automate for day 1; the key to working multi-* infrastructure is a continuous deployment pipeline.

Pipelines provide insurance for hybrid infrastructure by exposing issues quickly before they accumulate technical debt.

That means the utility of tools like Terraform, Ansible or Docker is limited to how often you exercise them. Ideally, we’d build abstraction automation layers above these primitives; however, this has proven very difficult in practice. The degrees of variation between environments and pace of innovation make it impossible to standardize without becoming very restrictive. This may be possible for a single company but is not practical for a vendor trying to support many customers with a single platform.

This means that hybrid, while required in the market, carries an integration tax that needs to be considered.

My objective for discussing Data Center 2020 topics is to find ways to lower that tax and improve the outcome. I’m interested in hearing your opinion about this challenge and if you’ve found ways to solve it.

Counterpoint Addendum: if you are in a position to avoid multi-* deployments (e.g. a start-up), then you should consider that option. There is measurable overhead in heterogeneous automation; however, I’ve found the tipping point away from a mono-stack can be surprisingly low, and committing to a vertical stack does make applications less resilient to innovation.

Podcast – Jim Plamondon tells history of developer evangelism and so much more

Coming direct from Cambodia is a rare podcast with Jim Plamondon, who created the playbook for building software platforms at Microsoft via APIs and developer evangelism. In this podcast, he talks about the early history of developer evangelism at Apple and Microsoft, the current state of open source, and the coming competition from China with its roots in the developing world.

Highlights

  • Soviet Agriculture and Technology Market Comparison
  • Why NeXT and Apple Failed with Software Industry but iPhone Succeeded
  • China Industry Takeover is Coming: Product Price Points


Note – If you are easily offended by language, please consider skipping this podcast. :)

Topic                                                       Time (Minutes.Seconds)

Introduction                                           0.0 –  0.33
Creator of Developer Evangelism      0.33 – 4.58
Plamondon Files                                   4.58 – 5.53
Working with Hostile Community      5.53 – 7.02
Android vs iOS Platform                       7.02 – 7.46
Study: Apple vs Windows                    7.46 – 9.13
PC Industry – Mostly All Alive            9.13 – 10.00
Open Source has same Struggles     10.00 – 12.21 (Focus on individual not technology)
Cargo Cult & Hype Cycle                     12.21 – 16.11 (VR and AI are on version 3; not new at all)
Security Breach                                     16.11 – 17.01
Back to Hype Cycle                              17.01 – 19.03 (Markets find a solution that makes profits)
Latest thoughts on Open Source       19.03 – 23.25 (Zipf’s Law)
Time Buying Strategy                           23.25 – 25.07 (e.g. IBM Server response to Amazon S3)
Microsoft Anti-Trust & Apple Mgmt    25.07 – 28.45 (NeXT Failure)
iPhone walled Garden Worked           28.45 – 31.10
Android will defeat iPhone                   31.10 – 32.33
Internet Competition dead?                 32.33 – 36.07 (Here comes China)
Alibaba moves West                              36.07 – 39.45 (Take over 3rd world then US/Europe)
Per Capita Income Averages                39.45 – 43.55 (Own tiny consumer market then move up)
China and Open Source                        43.55 –  47.18
Western vs Asian Gov’ts                       47.18 – 49.50 (Go learn Mandarin)
Wrap Up                                                  49.50 – END


Podcast Guest: Jim Plamondon
Jim Plamondon is a retired Technology Evangelist, noted for formalizing Microsoft’s Technology Evangelism practices in the 1990s.


Podcast: Digital Rebar Tech Discussion on Patch APIs, Swagger, and Integrations

In this week’s L8ist Sh9y Podcast, we bring on the Digital Rebar team at RackN to discuss several issues they have been working on over the past few months:

  • Patch REST APIs and CLI: Scaling Challenges Require Patch (see the sketch after the timeline below)
  • Swagger API History and Changes: No CLI Generation
  • Integrations to Existing Tools up the Stack

Topic                                                   Time (Minutes.Seconds)

Introduction                                              0.0 – 0.42
Intro to Digital Rebar Project                 0.42 – 1.58
Patch in REST APIs                                    1.58 – 4.02  (Reference: JsonPatch.com)
Why not use PUT?                                  4.02 – 4.53
CLI use Reference Objects                    4.53 – 6.28
Examples of How Use This                    6.28 – 10.55
Patch Synchronous Question                10.55 – 12.32
Swagger Built into API                            12.32 – 15.44  (Reference: https://swagger.io/)
Operator view of CLI w/out Swagger  15.44 – 18.30
2 Key Points on Swagger Change         18.30 – 20.22
Integration to Other Systems                 20.22 – 28.13   (Grumpy Operators Syndrome)
Learn More About Digital Rebar            28.13 – END      (Digital Rebar Online Meetup Community)
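
For readers who have not followed the JsonPatch.com reference above, here is a minimal sketch (in Go, not taken from the Digital Rebar codebase; the machine fields are invented for illustration) of how an RFC 6902 PATCH body differs from a whole-object PUT, which is the heart of the “Why not use PUT?” discussion:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// PUT replaces the whole object, so the client must resend every field
	// and risks overwriting concurrent changes made by other callers.
	// (Hypothetical machine fields, for illustration only.)
	put := map[string]any{
		"Name":    "node-01",
		"Address": "10.0.0.5",
		"Stage":   "discover",
	}

	// An RFC 6902 JSON Patch sends only the intended change. The "test"
	// op asserts the current value first, so the patch fails cleanly if
	// someone else already modified the object.
	patch := []map[string]any{
		{"op": "test", "path": "/Stage", "value": "discover"},
		{"op": "replace", "path": "/Stage", "value": "install"},
	}

	putBody, _ := json.MarshalIndent(put, "", "  ")
	patchBody, _ := json.MarshalIndent(patch, "", "  ")
	fmt.Println("PUT body:\n" + string(putBody))
	fmt.Println("PATCH body:\n" + string(patchBody))
}
```

The "test" operation is what makes PATCH attractive at scale: the server can reject the change if the object has moved on since the client last read it.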

Guest Podcast Attendees

  • Greg Althaus, Co-Founder, CTO RackN
  • Victor Lowther, Sr. Software Engineer, RackN
  • Shane Gibson, Sr. Architect and Community Evangelist, RackN

Digital Rebar is the open, fast and simple data center provisioning and control scaffolding designed with a cloud native architecture. Sponsored by RackN, this community is building an extensible stand-alone DHCP/PXE/iPXE service with minimal overhead that can take you from zero to provisioning in about five minutes.

Community Mission: Embrace the Heterogeneous Nature of Data Center Operations while Eliminating Complexity and Manual Steps.

Cloud Native PHYSICAL PROVISIONING? Come on! Really?!

We believe Cloud Native development disciplines are essential regardless of the infrastructure.

Today, RackN announced a very low entry-level support tier for Digital Rebar Provision – the RESTful Cobbler PXE/DHCP replacement.  Having a company actually standing behind this core data center function with support is a big deal; however…

We’re making two BIG claims with Provision: breaking DevOps bottlenecks and cloud native physical provisioning.  We think both points are critical to SRE and Ops success because our current approaches are not keeping pace with developer productivity and hardware complexity.

I’m going to post more about how Provision can help address the political struggles of SRE and DevOps that I’ve been watching in our industry.  A hint is in the release, but the Cloud Native comment needs to be addressed.

First, Cloud Native is an architecture, not an infrastructure statement.

There is no requirement that we use VMs or AWS in Cloud Native.  From that perspective, “Cloud” is a useful but deceptive adjective.  Cloud Native is born from applications that had to succeed in hands-off, lower SLA infrastructure with fast delivery cycles on untrusted systems.  These are very hostile environments compared to “legacy” IT.

What makes Digital Rebar Provision Cloud Native?  A lot!

The following is a list of key attributes I consider essential for Cloud Native design.

Micro-services Enabled: The larger Digital Rebar project is a micro-services design.  Provision reflects a stand-alone bundling of two services: DHCP and Provision.  The new Provision service is designed to both stand alone (with embedded UX) and be part of a larger system.

Swagger RESTful API: We designed the APIs first based on years of experience.  We spent a lot of time making sure that the API conformed to spec and that includes maintaining the Swagger spec so integration is easy.
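
As a hedged illustration of why a maintained Swagger spec makes integration easy (the spec URL below is an assumption for the sketch, not the documented Digital Rebar endpoint), a client can discover every advertised API path straight from the published document:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Minimal slice of a Swagger/OpenAPI 2.0 document: only the top-level
// "paths" object matters for this discovery sketch.
type swaggerSpec struct {
	Paths map[string]json.RawMessage `json:"paths"`
}

func main() {
	// Hypothetical spec URL; substitute wherever your service publishes it.
	resp, err := http.Get("https://provision.example.com/swagger.json")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var spec swaggerSpec
	if err := json.NewDecoder(resp.Body).Decode(&spec); err != nil {
		log.Fatal(err)
	}

	// List every API path advertised by the spec.
	for path := range spec.Paths {
		fmt.Println(path)
	}
}
```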

Remote CLI: We build and test our CLI extensively.  In fact, we expect that to be the primary user interface.

Security Designed In: We are serious about security even in challenging environments like PXE, where options are limited by 20-year-old protocols.  HTTPS is required, and user or bearer token authentication is required.  That means that even API calls from machines can be secured.
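
To make that concrete, here is a minimal sketch of a machine-side call secured the way the paragraph describes; the endpoint path and the token environment variable are illustrative assumptions, not the documented API:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Hypothetical endpoint; the point is that the call is HTTPS-only
	// and carries a bearer token rather than relying on network trust.
	req, err := http.NewRequest("GET", "https://provision.example.com/api/v3/machines", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Token injected via environment (illustrative variable name) so it
	// never lands in an image or a script.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("PROVISION_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```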

12 Factor & API Config: There is no file configuration for Provision.  The system starts with command line flags or environment variables.  Deeper configuration is done via API/CLI.  That ensures that the system can be fully managed remotely and configured securely, because credentials are required for configuration.
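
A minimal sketch of the flag-or-environment startup pattern described above (the flag and variable names are illustrative, not the actual Provision options):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the environment variable if set, otherwise the fallback.
// The same binary can then be configured by flags, env vars, or defaults,
// with no configuration file anywhere on disk.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

func main() {
	// Illustrative names only; the real service defines its own flags.
	apiPort := flag.String("api-port", envOr("PROVISION_API_PORT", "8092"), "port for the HTTPS API")
	dhcpIface := flag.String("dhcp-iface", envOr("PROVISION_DHCP_IFACE", "eth0"), "interface for the DHCP service")
	flag.Parse()

	fmt.Printf("starting with api-port=%s dhcp-iface=%s\n", *apiPort, *dhcpIface)
	// Deeper configuration (subnets, boot environments, users) would then
	// be pushed through the API/CLI, keeping the startup surface tiny.
}
```

The point is that everything needed to boot the service travels in the process environment, and everything else arrives later over the authenticated API.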

Fast Start / Golang:  Provision is a totally self-contained golang app including the UX.  Even so, it’s very small.  You can run it on a laptop from nothing in about 2 minutes including download.

CI/CD Coverage: We committed to deep test coverage for Provision and have consistently increased coverage with every commit.  It ensures quality and prevents regressions.

Documentation In-project Auto-generated: On-boarding is important since we’re talking about small, API-driven units.  A lot of Provisioning documentation is generated directly from the code into the actual project documentation.  Also, the written documentation is in Restructured Text in the project with good indexes and cross-references.  We regenerate the documentation with every commit.

We believe these development disciplines are essential regardless of the infrastructure.  That’s why we made sure the v3 Provision (and ultimately every component of Digital Rebar as we iterate to v3) was built to these standards.

What do you think?  Is this Cloud Native?  What did we miss?

APIs and Implementations collide at OpenStack Interop: The Oracle Zones vs VMs Debate

I strive to stay neutral as OpenStack DefCore co-chair; however, as someone asking for another Board term, it’s important to review my thinking so that you can make an informed voting decision.

DefCore, while always on the edge of controversy, recently became ground zero for the “what is OpenStack” debate [discussion write up]. My preferred small core “it’s an IaaS product” answer is only one side. Others favor “it’s an open cloud community” while another faction champions an “open cloud platform.” I’m struggling to find a way for it to be all of these at the same time.

The TL;DR is that, today, OpenStack vendors are required to implement a system that can run Linux guests. This is an example of an implementation over API bias because there’s nothing in the API that drives that specific requirement.

From a pragmatic “get it done” perspective, OpenStack needs to remain implementation driven for now. That means that we care that “OpenStack” clouds run VMs.

While there are pragmatic reasons for this, I think that long term success will require OpenStack to become an API specification. So today’s “right answer” actually undermines the long term community value. This has been a long standing paradox in OpenStack.

Breaking the API to implementation link allows an ecosystem to grow with truly alternate implementations (not just plug-ins). This is a threat to the community “upstream first” mentality.  OpenStack needs to be confident enough in the quality and utility of the shared code base that it can allow competitive implementations. Open communities should not need walls to win but they do need clear API definition.

What is my posture for this specific issue?  It’s complicated.

First, I think that the user and ecosystem expectations are being largely ignored in these discussions. Many of the controversial items here are vendor initiatives, not user needs. Right now, I’ve heard clearly that those expectations are for OpenStack to be an IaaS that runs VMs. OpenStack really needs to focus on delivering a reliably operable VM-based IaaS experience. Until that’s solid, the other efforts are vendor noise.

Second, I think that there are serious test gaps that jeopardize the standard. The fundamental premise of DefCore is that we can use the development tests for API and behavior validation. We chose this path instead of creating an independent test suite. We either need to address tests for interop within the current body of tests or discuss splitting the efforts. Both require more investment than we’ve been willing to make.

We have mechanisms in place to collect data from test results and expand the test base.  Instead of creating new rules or guidelines, I think we can work within the current framework.

The simple answer would be to block non-VM implementations; however, I trust that cloud consumers will make good decisions when given sufficient information.  I think we need to fix the tests and accept non-VM clouds if they pass the corrected tests.

For this and other reasons, I want OpenStack vendors to be specific about the configurations that they test and support. We took steps to address this in DefCore last year but pulled back from being specific about requirements.  In this particular case, I believe we should require the official OpenStack vendor to state clear details about their supported implementation.  Customers will continue to vote with their wallets about which configuration details are important.

This is a complex issue and we need community input.  That means that we need to hear from you!  Here’s the TC Position and the DefCore Patch.

Understanding OpenStack Designated Code Sections – Three critical questions

A collaboration with Michael Still (TC Member from Rackspace) & Joshua McKenty and Cross posted by Rackspace.

After nearly a year of discussion, the OpenStack board launched the DefCore process with 10 principles that set us on path towards a validated interoperability standard.   We created the concept of “designated sections” to address concerns that using API tests to determine core would undermine commercial and community investment in a working, shared upstream implementation.

Designated sections provide the “you must include this” part of the core definition.  Having common code as part of core is a central part of how DefCore is driving OpenStack interoperability.

So, why do we need this?

From our very formation, OpenStack has valued implementation over specification; consequently, there is a fairly strong community bias to ensure contributions are upstreamed. This bias is codified into the very structure of the GNU General Public License (GPL) but intentionally missing in the Apache Public License (APL v2) that OpenStack follows.  The choice of Apache2 was important for OpenStack to attract commercial interests, who often consider GPL a “poison pill” because of the upstream requirements.

Nothing in the Apache license requires consumers of the code to share their changes; however, the OpenStack Foundation does control how the OpenStack™ brand is used.  Thus it’s possible for someone to fork and reuse OpenStack code without permission, but they cannot call it “OpenStack” code.  This restriction only has strength if the OpenStack brand has value (protecting that value is the primary duty of the Foundation).

This intersection between License and Brand is the essence of why the Board has created the DefCore process.

Ok, how are we going to pick the designated code?

Figuring out which code should be designated is highly project specific and ultimately subjective; however, it’s also important to the community that we have a consistent and predictable strategy.  While the work falls to the project technical leads (with ratification by the Technical Committee), the DefCore and Technical committees worked together to define a set of principles to guide the selection.

This Technical Committee resolution formally approves the general selection principles for “designated sections” of code, as part of the DefCore effort.  We’ve taken the liberty to create a graphical representation (above) that visualizes this table using white for designated and black for non-designated sections.  We’ve also included the DefCore principle of having an official “reference implementation.”

Here is the text from the resolution presented as a table:

Should be DESIGNATED:
  • code provides the project external REST API, or
  • code is shared and provides common functionality for all options, or
  • code implements logic that is critical for cross-platform operation

Should NOT be DESIGNATED:
  • code interfaces to vendor-specific functions, or
  • project design explicitly intended this section to be replaceable, or
  • code extends the project external REST API in a new or different way, or
  • code is being deprecated

The resolution includes the expectation that “code that is not clearly designated is assumed to be designated unless determined otherwise. The default assumption will be to consider code designated.”

This definition is a starting point.  Our next step is to apply these rules to projects and make sure that they provide meaningful results.

Wow, isn’t that a lot of code?

Not really.  It’s important to remember that designated sections alone do not define core: the must-pass tests are also a critical component.  Consequently, designated code in projects that do not have must-pass tests is not actually required for an OpenStack licensed implementation.

OpenStack’s next hurdle: Interoperability. Why should you care?

SXSW life size Newton’s Cradle

The OpenStack Board spent several hours (yes, hours) discussing interoperability related topics at the last board meeting.  Fundamentally, the community benefits when users can operate easily across multiple OpenStack deployments (their own and/or public clouds).

Cloud interoperability: the ability to transfer workloads between systems without changes to the deployment operations management infrastructure.

This is NOT hybrid (which I defined as a workload transparently operating in multiple systems); however, it is a prerequisite for achieving scalable hybrid operation.

Interoperability matters because the OpenStack value proposition is all about creating a common platform.  IT World does a good job laying out the problem (note, I work for Dell).  To create sites that can interoperate, we have to do some serious lifting.

At the OpenStack Summit, there are multiple chances to engage on this.  I’m moderating a panel about Interop and also sharing a session about the highly related topic of Reference Architectures with Monty Taylor.

The Interop Panel (topic description here) is Tuesday @ 5:20pm.  If you join, you’ll get to see me try to stump our awesome panelists:

  • Jonathan LaCour, DreamHost
  • Troy Toman, Rackspace
  • Bernard Golden,  Enstratius
  • Monty Taylor, OpenStack Board (and HP)
  • Peter Pouliot, Microsoft

PS: Oh, and I’m also talking about DevOps Upgrades Patterns during the very first session (see a preview).

OpenStack: Five Challenges & Conference Observations

I was part of the Dell contingent at the OpenStack design conference earlier in the month.  The conference opened a new chapter for the project because the number of contributing companies reached critical mass.  That means that the core committers are no longer employed by just one or two entities; instead, there are more moneyed interests rubbing elbows and figuring out how to collaborate.

From my perspective (from an interview with @Cote), this changed the tone of the conference from future-looking to pragmatic.

That does not mean that everything is sunshine and rainbows for OpenStack clouds; there are real issues to be resolved.  IMHO, the top issues for OpenStack are:

  1. API implementation vs specification
  2. Building up coverage on continuous integration
  3. Ensuring that we can deploy consistently in multi-node systems
  4. Getting contributions from new members
  5. Figuring out which projects are core, satellite, missing or junk.  [xref 2014 DefCore]

Of these issues, I’ve been reconsidering my position favoring API via implementation over specification (past position).  This has been a subject of debate on my team at Dell, and I like Greg Althaus’ succinct articulation of the problem with an implementation-driven API: “it is not fair.”  This also ends up being a branding issue for OpenStack because governance needs to figure out what counts as a “real” OpenStack cloud deployment that can use the brand.  Does it have to be 100% of the source?  What about extensions?  What if it uses the API with an alternate implementation?

Of the other issues, most are related to maturity.  I think #2 needs pressure by and commitment from the larger players (Dell very much included).  Crowbar and the deployment blueprint are our answer to #3.  Shouting the “don’t fork it up” chorus from the rooftops addresses contributions, while #5 will require some strong governance and will inevitably create some hurt feelings.

Addressing OpenStack API equality: rethinking API via implementation over specification

This post is an extension of my post about OpenStack’s top 5 challenges from my perspective working on Dell’s OpenStack team.  This issue has been the subject of much public debate on the OpenStack lists, at the conference and online (Wilamuna & Shuttleworth).  So far, I have held back from the community discussion because I think both sides are being well represented by active contributors to trunk.

I have been, and remain, convinced that a key to OpenStack success as a cloud API is having a proven scale implementation; however, we need to find a compromise between implementation and specification.

The community needs code that specifies an API and offers an implementation backing that API.  OpenStack has been building an API by implementing the backing functionality (instead of vice versa).  This ordering is important because the best APIs are based on doing real work, not on trying to anticipate potential needs.  I’m also a fan of implementation-driven API because it emerges quickly.  That does not mean it’s all wild west and cowboy coding.  OpenStack has the concept of API extensions; consequently, it’s possible to offer new services in a safe way, allowing new features to literally surface overnight.

Unfortunately, API solely via implementation risks being fragile, unpredictable, and unfair. 

It is fragile because the scope reflects what the implementors choose to support.  They often either expose internals or restrict capabilities in ways that leading with a design specification would not.  This same challenge makes them less predictable because the API may change as the implementation progresses or become locked into exposing internals when downstream consumers depend on exposed functions.

I am ready to forgive fragile and unpredictable, but unfair may prove to be expensive for the OpenStack community.

Here’s the problem: there is more than one right way to solve a problem.  The benefit of an API is that it allows multiple implementations to emerge and prove their benefits.  We’ve seen many cases where, even with a weak API, a later implementation is the one that carries the standard (I’m thinking browsers here).  When your API is too bound to implementation, it locks out alternatives that expand your market.  It makes it unfair to innovators who come to the party in the second wave.

I think that it is possible to find a COMPROMISE between API implementation and specification; however, it takes a lot of maturity to make that work.  First, it requires the implementors to slow down and write definitions outside of their implementation.  Second, it requires implementors to emotionally decouple from their code and accept other API implementations.  Of course, a dose of test driven design (TDD) and continuous integration (CI) discipline does not hurt either.

I’m interested in hearing more opinions about this… come to the 10/27 OpenStack meetup to discuss and/or please comment!