What does it take to Operate Open Platforms? Answers in Datanauts 72

Did I just let OpenStack ops off the hook….?  Kubernetes production challenges…?  

I had a lot of fun in this wide-ranging Datanauts discussion with unicorn herders Chris Wahl and Ethan Banks.  I like the three section format because it gives us a chance to deep dive into distinct topics and includes some out-of-band analysis by the hosts; however, that means you need to keep listening through the commercial breaks to hear the full podcast.

Three parts?  Yes, Chris and Ethan like to save the best questions for last.

In Part 1, we went deep into the industry operational and business challenges uncovered by the OpenStack project. Particularly, Chris and I go into “platform underlay” issues which I laid out in my “please stop the turtles” post. This was part of the build-up to my SRE series.

In Part 2, we explore my operations-focused view of the latest developments in container schedulers with a focus on Kubernetes. Part of the operational discussion goes into architecture “conceits” (or compromises) that allow developers to get the most from cloud native design patterns. I also make a pitch for using proven tools to run the underlay.

In Part 3, we go deep into DevOps automation topics of configuration and orchestration. We talk about the design principles that help drive “day 2” automation and why getting in-place upgrades right should be an industry priority.  Of course, we do cover some Digital Rebar design too.

Take a listen and let me know what you think!

On Twitter, we’ve already started a discussion about how much developers should care about infrastructure. My opinion (posted here) is that one DevOps idea where developers “own” infrastructure caused a partial rebellion towards containers.

Infrastructure Masons is building a community around data center practice

IT is subject to seismic shifts right now. Here’s how we cope together.

For a long time, I’ve advocated for open operations (“OpenOps”) as a way to share best practices about running data centers. I’ve worked hard in OpenStack and, recently, Kubernetes communities to have operators collaborate around common architectures and automation tools. I believe the first step in these efforts starts with forming a community forum.

I’m very excited to have the RackN team and technology be part of the newly formed Infrastructure Masons effort because we are taking this exact community first approach.

Here’s how Dean Nelson, IM organizer and head of Uber Compute, describes the initiative:

An Infrastructure Mason Partner is a professional who develop products, build or support infrastructure projects, or operate infrastructure on behalf of end users. Like their end users peers, they are dedicated to the advancement of the Industry, development of their fellow masons, and empowering business and personal use of the infrastructure to better the economy, the environment, and society.

We’re in the midst of tremendous movement in IT infrastructure.  The change to highly automated and scale-out design was enabled by cloud but is not cloud specific.  This requirement is reshaping how IT is practiced at the most fundamental levels.

We (IT Ops) are feeling amazing pressure on operations and operators to accelerate workflow processes and innovate around very complex challenges.

Open operations loses if we respond by creating thousands of isolated silos or moving everything to a vendor specific island like AWS.  The right answer is to fund ways to share practices and tooling that is tolerant of real operational complexity and the legitimate needs for heterogeneity.

Interested in more?  Get involved with the group!  I’ll be sharing more details here too.

 

Operators, they don’t want to swim Upstream

Nov 10, Palo Alto Operators Dinner

Last Tuesday, I had the honor of joining an OpenStack scale operators dinner. Foundation executives, Jonathan Bryce and Lauren Sell, were also on the guest list so talk naturally turned to “how can OpenStack better support operators.” Notably, the session was distinctly not OpenStack bashing.

The conversation was positive, enthusiastic and productive, but one thing was clear: the OpenStack default “we’ll fix it in the upstream” answer does not work for this group of operators.

What is upstreaming?  A sans nuance answer is that OpenStack drives fixes and changes in the next community release (longer description).  The project and community have a tremendous upstream imperative that pervades the culture so deeply that we take it for granted.  Have an issue with OpenStack?  Submit a patch!  Is there any other alternative?

Upstreaming [to trunk] makes perfect sense considering the project vendor structure and governance; however, it is a very frustrating experience for operators.   OpenStack does have robust processes to backport fixes and sustain past releases and documentation; yet, the feeling at the table was that they are not sufficiently operator focused.

Operators want fast, incremental and pragmatic corrections to the code and docs they are deploying (which is often two releases back).  They want it within the community, not from individual vendors.

There are great reasons for focusing on upstream trunk.  It encourages vendors to collaborate and makes it much easier to add and expand the capabilities of the project.  Allowing independent activity on past releases creates a forward integration mess and could make upgrades even harder.  It will create divergence on APIs and implementation choices.

The risk of having a stable, independently sustained release is that operators have less reason to adopt the latest shiny release.  And that is EXACTLY what they are asking for.

Upstreaming is a core value to OpenStack and essential to our collaborative success; however, we need to consider that it is not the right answer to all questions.  Discussions at that dinner reinforced that pushing everything to latest trunk creates a significant barrier for OpenStack operators and users.

What are your experiences?  Is there a way to balance upstreaming with forking?  How can we better serve operators?

A year of RackN – 9 lessons from the front lines of evangelizing open physical ops

Let’s avoid this > “We’re heading right at the ground, sir!  Excellent, all engines full power!”

another scale? oars & motors. WWF managing small scale fisheries

RackN is refining our “from start to scale” message and it’s also our 1 year anniversary, so it’s a natural time for reflection. While it’s been a year since our founders made RackN a full time obsession, the team has been working together for over 5 years now with the same vision: improve scale datacenter operations.

As a backdrop, IT-Ops is under tremendous pressure to increase agility and reduce spending.  Even worse, there’s a building pipeline of container driven change that we are still learning how to operate.

Over the year, we learned that:

  1. no one has time to improve ops
  2. everyone thinks their uniqueness is unique
  3. most sites have much more in common than is different
  4. the differences between sites are small
  5. small differences really do break automation
  6. once it breaks, it’s much harder to fix
  7. everyone plans to simplify once they stop changing everything
  8. the pace of change is accelerating
  9. apply, rinse, repeat with lesson #1

Where does that leave us besides stressed out?  Ops is not keeping up.  The solution is not to go faster: we have to improve first and then accelerate.

What makes general purpose datacenter automation so difficult?  The obvious answer, variation, does not sufficiently explain the problem. What we have been learning is that the real challenge is ordering of interdependencies.  This is especially true on physical systems where you have to really grok* networking.
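To make the ordering point concrete, here is a minimal sketch in Python that treats provisioning as a dependency graph and asks the standard library for a valid order. The step names and their dependencies are hypothetical illustrations, not taken from any RackN or Crowbar code.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical provisioning steps, each mapped to the steps it depends on.
# The names are illustrative only; real sites will have different steps.
DEPENDS_ON = {
    "configure_switch_port": set(),
    "set_bios_and_raid": {"configure_switch_port"},
    "install_os": {"set_bios_and_raid"},
    "build_network_bridges": {"install_os"},
    "sync_ntp": {"build_network_bridges"},
    "register_dns": {"build_network_bridges"},
    "install_config_agent": {"sync_ntp", "register_dns"},
    "deploy_workload": {"install_config_agent"},
}


def provisioning_order(depends_on):
    """Return one valid execution order; raises CycleError on circular steps."""
    return list(TopologicalSorter(depends_on).static_order())


order = provisioning_order(DEPENDS_ON)
for position, step in enumerate(order, start=1):
    print(position, step)
```

A small site difference, such as a bridge that needs DNS registered first, changes an edge in this graph. The resulting order shifts, or the graph becomes circular and `provisioning_order` raises `CycleError`, which is one way small differences end up breaking automation.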

The problem would be smaller if we were trying to build something for a bespoke site; however, I see ops snowflaking as one of the most significant barriers for new technologies. At RackN, we are determined to make physical ops repeatable and portable across sites.

What does that heterogeneous-first automation look like? First, we’ve learned that we have to adapt to customer datacenters. That means using the DNS, DHCP and other services that you already have in place. It means dealing with heterogeneous hardware types and a mix of devops tools. It also means coping with arbitrary layer 2 and layer 3 networking topologies.

This was hard and tested both our patience and architecture pattern. It would be much easier to enforce a strict hardware guideline, but we knew that was not practical at scale. Instead, we “declared defeat” about forcing uniformity and built software that accepts variation.

So what did we do with a year?  We had to spend a lot of time listening and learning what “real operations” need.   Then we had to create software that accommodated variation without breaking downstream automation.  Now we’ve made it small enough to run on a desktop or cloud for sandboxing and a new learning cycle begins.

We’d love to have you try it out: rebar.digital.

* Grok is the correct word here.  Thinking that you “understand networking” is often more dangerous when it comes to automation.

Curious about SDN & OpenStack? We discuss at Open Networking Summit Panel (next Thursday)

Next Thursday (6/18), I’m on a panel at the SJC Open Networking Summit with John Zannos (Canonical), Mark Carroll (HP), Mark McClain (VMware).  Our topic is software defined networking (SDN) and OpenStack which could go anywhere in discussion.
OpenStack is clearly driving a lot of open innovation around SDN (and NFV).
I have no idea what the others want to bring in, but I was so excited about the questions I suggested that I decided to post them here with my answers as a teaser.

1) Does OpenStack require an SDN to be successful?

Historically, no.  There were two networking modes.  In the future, expect that some level of SDN will be required via the Neutron part of the project.

More broadly, SDN appears to be a critical component to broader OpenStack success.  Getting it right creates a lock-in for OpenStack.

2) If you have an SDN for OpenStack, does it need to integrate with your whole datacenter or can it be an island around OpenStack?

On the surface, you can create an Island and get away with it.  More broadly, I think that SDN is most interesting if it provides network isolation throughout your data center or your hosting provider’s data center.  You may not run everything on top of OpenStack but you will be connecting everything together with networking.

SDN has the potential to be the common glue.

3) Of the SDN approaches, which ones seem to be working?  Why?

Overall, the overlay networking approaches seem to be leading.  Anything that requires central control and administration will have to demonstrate it can scale.  Anything that actually requires re-configuring the underlay networking quickly is also going to have to make a lot of progress.

Networking is already distributed.  Anything that breaks that design pattern has an uphill battle.

4) Are SDN and NFV co-dependent?  Are they driving each other?

Yes.  The idea of spreading networking functions throughout your data center to manage east-west or individual tenant requirements (my definition of NFV) requires a way to have isolated traffic (one of the uses for SDN).

5) Is SDN relevant outside of OpenStack?  If so, in what?

Yes.  SDN on containers will become increasingly important.  SDN termination to multi-user systems (like a big database) also makes sense.

6) IPv6?  A threat or assistance to SDN?

IPv6 is coming, really.  I think that IPv6 has isolation and encryption capabilities that compete with SDN as an overlay.  Widespread IPv6 adoption could make SDN less relevant.  It also does a better job for multi-cloud networking since it’s neutral and you don’t have to worry about which SDN tech your host is using.

Ready State Foundation for OpenStack now includes Ceph Storage

For the Paris summit, the OpenCrowbar team delivered a PackStack demo that leveraged Crowbar’s ability to create an OpenStack ready state environment.  For the Vancouver summit, we did something even bigger: we updated the OpenCrowbar Ceph workload.

Ceph is the leading open source block storage back-end for OpenStack; however, it’s tricky to install and few vendors invest the effort to hardware optimize their configuration.  Like any foundation layer, configuration or performance errors in the storage layer will impact the entire system.  Further, the Ceph infrastructure needs to be built before OpenStack is installed.

OpenCrowbar was designed to deploy platforms like Ceph.  It has detailed knowledge of the physical infrastructure and sufficient orchestration to synchronize Ceph Mon cluster bring-up.

We are only at the start of the Ceph install journey.  Today, you can use the open source components to bring up a Ceph cluster in a reliable way that works across hardware vendors.  Much remains to optimize and tune this configuration to take advantage of SSDs, non-CentOS environments and more.

We’d love to work with you to tune and extend this workload!  Please join us in the OpenCrowbar community.

Delicious 7 Layer DIP (DevOps Infrastructure Provisioning) model with graphic!

Applying architecture and computer science principles to infrastructure automation helps us build better controls.  In this post, we create an OSI-like model that helps decompose the ops environment.

The RackN team discussions about “what is Ready State” have led to some interesting realizations about physical ops.  One of the most critical has been splitting the operational configuration (DNS, NTP, SSH Keys, Monitoring, Security, etc) from the application configuration.

Interactions between these layers are much more dynamic than developers and operators expect.

In cloud deployments, you can ask for the virtual infrastructure to be configured in advance via the IaaS and/or golden base images.  In hardware, the environment build up needs to be more incremental because variations in physical infrastructure and operations have to be accommodated.

Greg Althaus, Crowbar co-founder, and I put together this 7 layer model (it started as 3 and grew) because we needed to be more specific in discussion about provisioning and upgrade activity.  The system view helps explain how layer 5 and 6 operate at the system layer.

7 Layer DIP

The Seven Layers of our DIP:

  0. shared infrastructure – the base layer is about the interconnects between the nodes.  In this model, we care about the specific linkage to the node: VLAN tags on the switch port, which switch is connected, which PDU ID turns it on.
  1. firmware and management – nodes have substantial driver (RAID/BIOS/IPMI) software below the operating system that must be configured correctly.   In some cases, these configurations have external interfaces (BMC) that require out-of-band access while others can only be configured in pre-install environments (I call that side-band).
  2. operating system – while the operating system is critical, operators are striving to keep this layer as thin as possible to avoid overhead.  Even so, there are critical security, networking and device mapping functions that must be configured.  Critical local resource management items like mapping media or building network teams and bridges are layer 2 functions.
  3. operations clients – this layer connects the node to the logical data center infrastructure in basic ways like time synch (NTP) and name resolution (DNS).  It’s also where more sophisticated operators configure things like distributed cache, centralized logging and system health monitoring.  CMDB agents like Chef, Puppet or Saltstack are installed at the “top” of this layer to complete ready state.
  4. applications – once all the baseline is set up, this is the unique workload.  It can range from platforms for other applications (like OpenStack or Kubernetes) to the software itself like Ceph, Hadoop or anything.
  5. operations management – the external system references for layer 3 must be factored into the operations model because they often require synchronized configuration.  For example, registering a server name and IP addresses in a DNS, updating an inventory database or adding its thresholds to a monitoring infrastructure.  For scale and security, it is critical to keep the node configuration (layer 3) constantly synchronized with the central management systems.
  6. cluster coordination – no application stands alone; consequently, actions from layer 4 nodes must be coordinated with other nodes.  This ranges from database registration and load balancing to complex upgrades with live data migration. Working in layer 4 without layer 6 coordination creates unmanageable infrastructure.
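As a rough illustration of how the layers relate, the model can be sketched as a small Python data structure. It counts layers from 0, as the cross-references in the prose assume (applications are layer 4, cluster coordination is layer 6). The enum, the pairing table and both helper functions are hypothetical names for this sketch and are not part of Crowbar or any RackN tooling.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The seven DIP layers, counted from 0 to match the prose references."""
    SHARED_INFRASTRUCTURE = 0
    FIRMWARE_AND_MANAGEMENT = 1
    OPERATING_SYSTEM = 2
    OPERATIONS_CLIENTS = 3
    APPLICATIONS = 4
    OPERATIONS_MANAGEMENT = 5
    CLUSTER_COORDINATION = 6


# Node-level layers and the system-level layer that has to stay in sync:
# operations clients (3) with operations management (5), and
# applications (4) with cluster coordination (6).
SYSTEM_COUNTERPART = {
    Layer.OPERATIONS_CLIENTS: Layer.OPERATIONS_MANAGEMENT,
    Layer.APPLICATIONS: Layer.CLUSTER_COORDINATION,
}


def is_ready_state(configured):
    """Ready state is reached once layers 0 through 3 are all configured."""
    return all(
        layer in configured
        for layer in Layer
        if layer <= Layer.OPERATIONS_CLIENTS
    )


def missing_coordination(touched):
    """List system layers a change should also touch but does not."""
    return sorted(
        SYSTEM_COUNTERPART[layer]
        for layer in touched
        if layer in SYSTEM_COUNTERPART
        and SYSTEM_COUNTERPART[layer] not in touched
    )


# A change that only touches the application layer is flagged as lacking
# the matching cluster coordination work.
print(missing_coordination({Layer.APPLICATIONS}))
```

The pairing table is the piece worth arguing about: it encodes the claim that node-level changes in layers 3 and 4 are incomplete without matching work in layers 5 and 6.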

This seven layer operations model helps us discuss which actions are required when provisioning a scale infrastructure.  In my experience, many developers want to work exclusively in layer 4 and overlook the need to have a consistent and managed infrastructure in all the other layers.  We enable this thinking in cloud and platform as a service (PaaS) and that helps improve developer productivity.

We cannot overlook the other layers in physical ops; however, working to ready state helps us create more cloud-like boundaries.  Those boundaries are a natural segue to my upcoming post about functional operations (older efforts here).

why is hardware hard? Ready State Physical Ops Meetup on Tuesday 12/2 9am PT

meh.  Compared to cloud, Ops on physical infrastructure sinks.

Unfortunately, the cloud and scale platforms need to run somewhere so someone’s got to deal with it.  In fact, we’ve got to deal with crates of cranky servers and flocks of finicky platforms.  It’s enough to keep a good operator down.

If that’s you, or someone you care about, join us for a physical ops support group meetup on Tuesday 9am PT (11 central).

There is a light at the end of the tunnel!  We can make it repeatable to provision OpenStack, Hadoop and other platforms.

As a community, we’re steadily bringing best practices and proven automation from cloud ops down into the physical space.   On the OpenCrowbar project, we’re accelerating this effort using the ready state concept as a hand off point for “physical-cloud equivalency” and exploring the concept of “functional operations” to make DevOps scripts more portable.

Join me and Rafael Knuth for a discussion about how operators can work together to break the cycle.

OpenCrowbar: ready to fly as OpenOps neutral platform – Dell stepping back

Two of the Crowbar founders: me with Greg Althaus [taken Jan 2013]

With the Anvil release in the bag, Dell announced on the community list yesterday that it has stopped active contribution on the Crowbar project.  This effectively relaunches Crowbar as a truly vendor-neutral physical infrastructure provisioning tool.

While I cannot speak for my employer, Dell, about Crowbar, I continue to serve in my role as a founder of the Crowbar Project.  I agree with Eric S Raymond that founders of open source projects have a responsibility to sustain their community and ensure its longevity.

In the open DevOps bare metal provisioning market, there is nothing that matches the capabilities developed in either Crowbar v1 or OpenCrowbar.  The operations model and system focused approach is truly differentiated because no other open framework has been able to integrate networking, orchestration, discovery, provisioning and configuration management like Crowbar.

It is time for the community to take Crowbar beyond the leadership of a single hardware vendor, OS vendor, workload or CMDB tool.  OpenCrowbar offers operations freedom and flexibility to build upon an abstracted physical infrastructure (what I’ve called “ready state”).

We have the opportunity to make open operations a reality together.

As a Crowbar founder and its acting community leader, I welcome you to contact me directly or through the crowbar list about how to get engaged in the Crowbar community or get connected to like-minded Crowbar resources.

Open Operations [4/4 series on Operating Open Source Infrastructure]

This post is the final in a 4 part series about Success factors for Operating Open Source Infrastructure.

tl;dr Note: This is really TWO tightly related posts: 
  part 1 is OpenOps background. 
  part 2 is about OpenStack, Tempest and DefCore.

One of the substantial challenges of large-scale deployments of open source software is that it is very difficult to come up with a best practice, or a reference implementation that can be widely explained or described by the community.

Having a best practice deployment is essential for the growth of the community because it enables multiple people to deploy the software in a repeatable, stable way. This, in turn, fosters community growth so that more people can adopt software in a consistent way. It does little good if operators have no consistent pattern for deployment, because that undermines the developers’ abilities to extend, the testers’ abilities to ensure quality, and users’ ability to repeat the success of others.

Fundamentally, the goal of an open source project, from a user’s perspective, is that they can quickly achieve and repeat the success of other people in the community.

When we look at these large-scale projects we really try to create a pattern of success that can be repeated over and over again. This ensures growth of the user base, and it also helps the developer reduce time spent troubleshooting problems.

That does not mean that every single deployment should be identical, but there is substantial value in having a limited number of success patterns. Customers can then be assured of quick time to value with these projects. They can also get help without having everybody else in the community attempt to untangle how one person created a site-specific deployment. This is especially problematic if someone created an unnecessarily unique scenario. That simply creates noise and confusion in the environment. Noise is a huge cost for the community, and needs to be eliminated for an open source project to flourish.

This isn’t any different from proprietary software, but there most of these activities are hidden. A proprietary project vendor can make much stronger recommendations and install guidance because they are the only source of truth in that project. In an open source project, there are multiple sources of truth, and there are very few people who are willing to publish their exact reference implementation or test patterns. Consequently, my team has taken a strong position on creating a repeatable reference implementation for OpenStack deployments, based on extensive testing. We have found that our test patterns and practices are grounded in successful customer deployments and actual, physical infrastructure deployments. So, they are very pragmatic, repeatable, and sustained.

We found that this type of testing, while expensive, is also a significant value to our customers, and something that they appreciate and have been willing to pay for.

OpenStack as an Example: Tempest for Reference Validation

The Crowbar project incorporated the OpenStack Tempest project as an essential part of every OpenStack deployment. From the earliest introduction of the Tempest suite, we have understood the value of a baselining test suite for OpenStack. We believe that taking the same tests the developers use as a single-node gate for code acceptance and running them against a multi-node deployment creates significant value both for our customers and the OpenStack project as a whole.  This was part of why I embraced the suggestion of basing DefCore on tests.

While it is important to have developer tests that gate code check-ins, the ultimate goal for OpenStack is to create scale-out multi-node deployments. This is a fundamental design objective for OpenStack.

With developers and operators using the same test suite, we are able to proactively measure the success of the code in the scale deployments in a way that provides quick feedback for the developers. If Tempest tests do not pass in a multi-node environment, they are not providing significant value for developers to ensure that their code is operating against best practice scenarios. Our objective is to continue to extend the Tempest suite of tests so that they are an accurate reflection of the use cases that are encountered in a best practice, reference deployment.

Along these lines, we expect that the community will continue to expand the Tempest test suite to match actual deployment scenarios reflected in scale and multi-node configurations. Having developers be responsible for passing these tests as part of their day-to-day activities ensures that development activities do not disrupt scale operations. Ultimately, making proactive gating tests ensures that we are creating scenarios in which code quality is continually increasing, as is our ability to respond and deploy the OpenStack infrastructure.

I am very excited and optimistic that expanding the Tempest suite holds the key to making OpenStack the most stable, reliable, performant cloud implementation available in the market. The fact that this test suite can be extended in the community, and contributed to by a broad range of implementations, only makes that test suite more valuable and more likely to fully encompass all use cases necessary for reference implementations.