Infrastructure Masons is building a community around data center practice

IT is subject to seismic shifts right now. Here’s how we cope together.

For a long time, I’ve advocated for open operations (“OpenOps”) as a way to share best practices about running data centers. I’ve worked hard in OpenStack and, recently, Kubernetes communities to have operators collaborate around common architectures and automation tools. I believe the first step in these efforts starts with forming a community forum.

I’m very excited to have the RackN team and technology be part of the newly formed Infrastructure Masons effort because we are taking this exact community first approach.

infrastructure_masons

Here’s how Dean Nelson, IM organizer and head of Uber Compute, describes the initiative:

An Infrastructure Mason Partner is a professional who develop products, build or support infrastructure projects, or operate infrastructure on behalf of end users. Like their end users peers, they are dedicated to the advancement of the Industry, development of their fellow masons, and empowering business and personal use of the infrastructure to better the economy, the environment, and society.

We’re in the midst of tremendous movement in IT infrastructure.  The change to highly automated and scale-out design was enabled by cloud but is not cloud specific.  This requirement is reshaping how IT is practiced at the most fundamental levels.

We (IT Ops) are feeling amazing pressure on operations and operators to accelerate workflow processes and innovate around very complex challenges.

Open operations loses if we respond by creating thousands of isolated silos or moving everything to a vendor specific island like AWS.  The right answer is to fund ways to share practices and tooling that is tolerant of real operational complexity and the legitimate needs for heterogeneity.

Interested in more?  Get involved with the group!  I’ll be sharing more details here too.

 

Modularizing Crowbar via Barclamps – Dell prepares to open source our #OpenStack installer

My team at Dell is working diligently to release Crowbar (Apache 2) to the community.

  • We have ramped up our team size (Andi Abes was spotted recently posting on the Swift list).
  • We are collaborating with partners like Rackspace, Opscode and Citrix
  • We brought in UI expertise (Jon Roberts) to improve usability and polish.
  • We are making sure that the code is integrated with our Dell OpenStack Solution (DOSS).
  • We are lining up customers for real field trials.

The single most critical aspect of Crowbar involves a recent architectural change by Greg Althaus to make Crowbar much more modular.  He dubbed the modules “barclamps” because they are used to attach new capabilities into the system.  For example, we include barclamps for DNS, discovery, Nova, Swift, Nagios, Gangalia, and BIOS config.  Users select which combination to use based on their deployment objectives.

In the Crowbar architecture, nearly every capability of the system is expressed as a barclamp.  This means that the code base can be expanded and updated modularly.  We feel that this pattern is essential to community involvement.

For example, another hardware vendor can add a barclamp that does the BIOS configuration for their specific equipment (yes! that is our intent).  While many barclamps will be included with the open source release to install open source components, we anticipate that other barclamps will be only available with licensed products or in limited distribution.

A barclamp is like a cloud menu planner: it evaluates the whole environment and proposes configurations, roles, and recipes that fit your infrastructure.  If you like the menu, then it tells Chef to start cooking.

Barclamps complement the “PXE state machine” aspect of Crowbar by providing logic Crowbar evaluates as the servers reach deployment states.  These states are completely configurable via the provisioner barclamp; consequently, Crowbar users can choose to change order of operations.  They can also add barclamps and easily incorporate them into their workflow where needed.

Barclamps take the form of a Rails controller that inherits from the barclamp superclass.  The superclass provides the basic REST verbs that each barclamp must service while the child class implements the logic to create a “proposal” leveraging the wealth of information in Chef.  Proposals are JSON collections that include configuration data needed for the deployment recipes and a mapping of nodes into roles.

Users are able to review and edit proposals (which are versioned) before asking Crowbar to implement the proposal in Chef.  The proposal is implemented by assigning the nodes into the proposed roles and allowing Chef to work it’s magic.

Users can operate barcamps in parallel.  In fact, most of our barclamps are designed to operate in conjunction.

Reminder: It is vital to understand that Crowbar is not a stand-alone utility.  It is coupled to Chef Server for deployment and data storage.  Our objective was to leverage the outstanding capabilities and community support for Chef as much as possible.

We’re excited about this architecture addition to Crowbar and encourage you to think about barclamps that would be helpful to your cloud deployment.

OpenStack “Operating Hyperscale Clouds” preso given at Design Summit on 4/27.

Today, I gave the a presentation that is the sequel to the “bootstrapping the hyperscale clouds.”  The topic addresses concerns raised by the BlackOps concept but brings up history and perspectives that drive Dell’s Solution focus for OpenStack.

You can view in slide share or here’s a PDF of the preso: Operating the Hyperscale Cloud 4-27

I tend to favor pictures over text, so you may get more when OpenStack posts the video [link pending].

The talk is about our journey to build an OpenStack solution starting from our drive to build a cloud installer (Crowbar).  The critical learning is that success for clouds is the adoption of an “operational model” (aka Cloud Ops) that creates cloud agility.

Ultimately, the presentation is the basis for another white paper.

Appending:

BlackOps: 7 tenants for infrastructure & operations in hyperscale clouds. #CloudOps #Hyperscale

Traditional IT Ops

In my work queue at Dell, the request for a “cloud taxonomy” keeps turning up on my priority list just behind world dominance peace.  Basically, a cloud taxonomy is layer cake picture that shows all the possible cloud components stacked together like gears in an antique Swiss watch.  Unfortunately, our clock-like layer cake has evolved to into a collaboration between the Swedish Chef and Rube Goldberg as we try to accommodate more and more technologies into the mix.

The key to de-spaghettifying our cloud taxomony was to realize that clouds have two distinct sides: an external well-known API and internal “black box” operations.  Each side has different objectives that together create an elastic, scalable cloud.

The objective of the API side is to provide the smallest usable “surface area” for cloud consumers.  Surface area describes the scope of the interface that is published to the users.  The smaller the area, the easier it is for users to comprehend and harder it is for them to break.  Amazon’s EC2 & S3 APIs set the standards for small surface area design and spawned a huge cloud ecosystem.

Hyperscale Cloud (APIs!)

To understand the cloud taxonomy, it is essential to digest the impact of the cloud ecosystem.  The cloud ecosystem exists primarily beyond the API of the cloud.  It provides users with flexible options that address their specific use cases.  Since the ecosystem provides the user experience on top of the APIs (e.g.: RightScale), it frees the cloud provider to focus on services and economies of scale inside the black box.

The objective of the internal side of clouds is to create a prefect black box to give API users the illusion of a perfectly performing, strictly partitioned and totally elastic resource pool.  To the consumer, it does should not matter how ugly, inefficient, or inelegant the cloud operations really are; except, of course, that it does matter a great deal to the cloud operator. 

Cloud operation cannot succeed at scale without mastering the discipline of operating the black box cloud (BlackOps). 

Cloud APIs spawn Ecosystems

The BlackOps challenge is that clouds cannot wait until all of the answers are known because issues (or solutions) to scale architecture are difficult to predict in advance.  Even worse, taking the time to solve them in advance likely means that you will miss the market.

Since new technologies and approaches are constantly emerging, there is no “design pattern” for hyperscale.  To cope with constant changes, BlackOps live by seven tenants that help manage their infrastructure efficiently in a dynamic environment.

  1. Operational ownership – don’t wait for all the king’s horses and consultants to put your back together again (but asking for help is OK).
  2. Simple APIs – reduce the ways that consumers can stress the system making the scale challenges more predictable.
  3. Efficiency based financial incentives – customers will dramatically modify their consumption if you offer rewards that better match your black box’s capabilities.
  4. Automated processes & verification – ensures that changes and fixes can propagate at scale while errors are self-correcting.
  5. Frequent incremental rolling adjustments – prevents the great from being the enemy of the good so that systems are constantly improving (learn more about “split testing”)
  6. Passion for operational simplicity – at hyperscale, technical debt compounds very quickly.  Debt translates into increased risk and reduced agility and can infect hardware, software, and process aspects of operations.
  7. Hunger for feedback & root-cause knowledge – if you’re building the airplane in flight, it’s worth taking extra time to check your work.  You must catch problems early before they infect your scale infrastructure.  The only thing more frustrating than fixing a problem at scale, if fixing the same problem multiple times.

It’s no surprise that these are exactly the Agile & Lean principles.  The pace of change of cloud is so fast and fluid, that BlackOps must use an operational model that embraces iterative and rolling deployment.

Compared to highly orchestrated traditional IT operations, this approach seems like sending a team of ninjas to battle on quicksand with objectives delivered in a fortune cookie.

I am not advocating fuzzy mysticism or by-the-seat-of-your-pants do-or-die strategies.  BlackOps is a highly disciplined process based on well understood principles from just-in-time (JIT) and lean manufacturing.  Best of all, they are fast to market, able to deliver high quality and capable of responding to change.

Post Script / Plug: My understanding of BlackOps is based on the operational model that Dell has introduced around our OpenStack Crowbar project.  I’m going to be presenting more about this specific topic at the OpenStack Design Conference next week.

Dell to spin bare iron into OpenStack gold

I’m at the CloudConnect conference today supporting my team’s initial OpenStack foray.   Our announcement part of the Rackspace Cloud Builders announcement.

Tonight (3/8), we’re at the Rackspace Launch with a pony rack of servers (6 nodes) where we will run a LIVE DEMO of our cloud installer (codename “Crowbar”).  The initial offer includes my hyperscale white paper and our cloud foundation kit.

Interested in the details?  Here are background posts that talk about the Lean/Agile process we use, what is Crowbar, and my write up about hyperscale (“flat edge”) data centers.

Added 3/9: Links to articles about the release:

Here’s what Dell is saying about OpenStack on Dell.com/openstack:

Dell is one of the original partners in the OpenStack community, which has now grown to more than 50 companies and participants. To accelerate adoption of this powerful platform, Dell has worked to develop an effortless out-of-box OpenStack experience with:
  • Optimized PowerEdge™ C-based hardware configurations
  • A technical whitepaper that details the design of an OpenStack hyperscale cloud on PowerEdge C server technology
  • An OpenStack installer that allows bare metal deployment of OpenStack clouds in a few hours (vs. a manual installation period of several days)

Read more about the steps to design an OpenStack hyperscale cloud in a Dell technical whitepaper entitled “Bootstrapping OpenStack Clouds.”

Interested?  Contact OpenStack@Dell.com.

32nd rule to measure complexity + 6 hyperscale network design rules

If you’ve studied computer science then you know there are algorithms that calculate “complexity.” Unfortunately, these have little practical use for data center operators.  My complexity rule does not require a PhD:

The 32nd rule: If it takes more than 30 seconds to pick out what would be impacted by a device failure then your design is too complex.

6 Hyperscale Network Design Rules

  1. Cost Matters
  2. Keep Networks Flat
  3. Filter at the Edge
  4. Design Fault Zones
  5. Plan for Local Traffic
  6. Offer load balancers (to your users)

Sorry for the teaser… I’ll be able to release more substance behind this list soon.   Until then comments are (as always) welcome!

 

 

 

Bootstrapping Hyperscale OpenStack Clouds – slides from 2/3 OpenStack SJC Meetup

The OpenStack meeting lightening talk is only 5 minutes, so the deck is mostly pictures that support points around a more detailed followup.

Here’s the deck: bootstrapping clouds preso

 and my Hyperscale white paper (links through Dell.com)

The theme of the talk is that hyperscale systems requires a fundamentally different management paradigm because at hyperscale

hardware faults are common,manual steps are impractical and small costs add up quickly.

Included in the preso are concepts I introduced at Flatness at the Edge.

2/10 Update: Now you can watch it Thanks to “@opnstk_com_mgr Stephen Spector lighting talks video of Rob Hirschfeld, Dell at Santa Clara, CA Meetup Feb 3, 2011 http://ow.ly/3U8OA