Open Operations [4/4 series on Operating Open Source Infrastructure]

This post is the final in a 4 part series about Success factors for Operating Open Source Infrastructure.

tl;dr Note: This is really TWO tightly related posts: 
  part 1 is OpenOps background. 
  part 2 is about OpenStack, Tempest and DefCore.

One of the substantial challenges of large-scale deployments of open source software is that it is very difficult to come up with a best practice, or a reference implementation that can be widely explained or described by the community.

Having a best practice deployment is essential for the growth of the community because it enables multiple people to deploy the software in a repeatable, stable way. This, in turn, fosters community growth because more people can adopt the software in a consistent way. It does little good if operators have no consistent pattern for deployment, because that undermines developers’ ability to extend the software, testers’ ability to ensure quality, and users’ ability to repeat the success of others.

Fundamentally, the goal of an open source project, from a user’s perspective, is that they can quickly achieve and repeat the success of other people in the community.

When we look at these large-scale projects, we really try to create a pattern of success that can be repeated over and over again. This ensures growth of the user base, and it also helps developers reduce time spent troubleshooting problems.

That does not mean that every single deployment should be identical, but there is substantial value in having a limited number of success patterns. Customers can then be assured not only of quick time to value with these projects; they can also get help without the rest of the community having to untangle how one person created a site-specific, unnecessarily unique scenario. Those one-off configurations simply create noise and confusion in the environment. Noise is a huge cost for the community and needs to be eliminated for an open source project to flourish.

This isn’t any different for proprietary software, but there most of these activities are hidden. A proprietary vendor can make much stronger recommendations and installation guidance because they are the only source of truth for that product. In an open source project, there are multiple sources of truth, and very few people are willing to publish their exact reference implementation or test patterns. Consequently, my team has taken a strong position on creating a repeatable reference implementation for OpenStack deployments, based on extensive testing. Our test patterns and practices are grounded in successful customer deployments on actual, physical infrastructure, so they are pragmatic, repeatable, and sustained.

We found that this type of testing, while expensive, also provides significant value to our customers, and it is something that they appreciate and have been willing to pay for.

OpenStack as an Example: Tempest for Reference Validation

The Crowbar project incorporated the OpenStack Tempest project as an essential part of every OpenStack deployment. From the earliest introduction of the Tempest suite, we have understood the value of a baselining test suite for OpenStack. We believe that running the same tests the developers use to gate code acceptance on a single node against a multi-node deployment creates significant value both for our customers and for the OpenStack project as a whole. This was part of why I embraced the suggestion of basing DefCore on tests.

While it is important to have developer tests that gate code check-ins, the ultimate goal for OpenStack is to create scale-out, multi-node deployments. That is a fundamental design objective.

With developers and operators using the same test suite, we are able to proactively measure the success of the code in scale deployments in a way that provides quick feedback to the developers. If Tempest tests do not pass in a multi-node environment, they are not providing enough value to developers to ensure that their code operates against best practice scenarios. Our objective is to continue to extend the Tempest suite of tests so that they are an accurate reflection of the use cases encountered in a best practice, reference deployment.
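
To make this concrete, here is a minimal sketch (not from the original post) of gating a multi-node deployment on the same Tempest tests that developers run. It assumes a Tempest installation that provides the `tempest run` CLI, already configured against the deployed cloud; the regex filter is a hypothetical subset and should be adapted to whatever tests define your reference deployment.

```python
#!/usr/bin/env python3
"""Sketch: gate a multi-node OpenStack deployment on developer tests.

Assumes the Tempest CLI ("tempest run") is installed and tempest.conf
points at the deployed cloud. The regex below is an illustrative,
hypothetical subset, not a prescribed list.
"""
import subprocess
import sys

TEMPEST_REGEX = r"tempest\.api\.(compute|network)"  # hypothetical subset


def run_tempest_gate(regex: str = TEMPEST_REGEX) -> bool:
    """Run the selected Tempest tests and report pass/fail."""
    result = subprocess.run(
        ["tempest", "run", "--regex", regex],
        capture_output=True,
        text=True,
    )
    # Tempest exits non-zero when any selected test fails.
    if result.returncode != 0:
        print("Deployment gate FAILED; do not promote this environment.")
        print(result.stdout[-2000:])  # tail of the run output for context
        return False
    print("Deployment gate passed: environment matches reference behavior.")
    return True


if __name__ == "__main__":
    sys.exit(0 if run_tempest_gate() else 1)
```

The same script can run in developer gates and in operator validation, which is exactly the shared feedback loop described above.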

Along these lines, we expect that the community will continue to expand the Tempest test suite to match actual deployment scenarios reflected in scale and multi-node configurations. Having developers be responsible for passing these tests as part of their day-to-day activities ensures that development activities do not disrupt scale operations. Ultimately, making proactive gating tests ensures that we are creating scenarios in which code quality is continually increasing, as is our ability to respond and deploy the OpenStack infrastructure.

I am very excited and optimistic that expanding the Tempest suite holds the key to making OpenStack the most stable, reliable, and performant cloud implementation available in the market. The fact that this test suite can be extended in the community, and contributed to by a broad range of implementations, only makes it more valuable and more likely to fully encompass all use cases necessary for reference implementations.

Ops Validation using Development Tests [3/4 series on Operating Open Source Infrastructure]

This post is the third in a 4 part series about Success factors for Operating Open Source Infrastructure.

In an automated configuration deployment scenario, problems surface very quickly. They prevent deployment and force resolution before progress can be made. Unfortunately, this often appears to be a failure within the deployment automation. My personal experience has been exactly the opposite: automation creates a “fail fast” environment in which critical issues are discovered and resolved during provisioning instead of lying dormant until later.

Our ability to detect these issues and stop until they are resolved creates exactly the type of repeatable, successful deployment that is essential to long-term success. When we look at these deployments, the most important success factors are that the deployment is consistent, known and predictable. Our ability to quickly identify and resolve issues that do not match those patterns dramatically improves the long-term stability of the system by creating an environment that has been benchmarked against a known reference.

Benchmarking against a known reference is ultimately the most significant value that we can provide in helping customers bring up complex solutions such as OpenStack and Hadoop. Being successful with these deployments over the long term means that you have established a known configuration, and that you have maintained it in a way that is explainable and referenceable to other places.

Reference Implementation

The concept of a reference implementation provides tremendous value in deployment. Following a pattern that is a reference implementation enables you to compare notes, get help and ultimately upgrade and change your deployment in known, predictable ways. Customers who can follow and implement a vendor’s reference, or the community’s reference implementation, are able to ask for help on the mailing lists, call in for help and work with the community in ways that are consistent and predictable.

Let’s explore what a reference implementation looks like.

In a reference implementation you have a consistent, known state of your physical infrastructure that has been implemented based upon a reference architecture (RA). That implementation follows a known best practice using standard gear in a consistent, known configuration. You can therefore explain your configuration to a community of other developers, or other people who have a similar configuration, and can validate that your problem is not the physical configuration. Fundamentally, everything in a reference implementation is driving towards the elimination of possible failure causes. In this case, we are making sure that the physical infrastructure is not causing problems (getting to a ready state), because other people are using a similar (or identical) physical infrastructure configuration.

The next components of a reference implementation are the underlying software configurations: operating system, management, monitoring, network configuration, and the IP networking stack that pretty much the entire application rides on. There are a lot of moving parts and a lot of complexity in this scenario, with a high likelihood of causing failures. Implementing and deploying the software stacks in an automated way has enabled us to dramatically reduce the potential for problems caused by misconfiguration. Because the number of permutations of software in the reference stack is so high, it is essential that a successful deployment tightly manages exactly what is deployed, in such a way that operators can identify, name, and compare notes with other deployments.
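
As a small, hedged illustration (not from the original post), one way to identify and name a deployed composition is to fingerprint it deterministically, so two deployments can quickly check whether they are running the same stack. The component names and versions below are hypothetical placeholders.

```python
"""Sketch: give a deployed software composition a stable, comparable name.

An identical composition always produces the identical fingerprint, so
two sites can quickly tell whether they are running the same stack.
All component names and versions are hypothetical placeholders.
"""
import hashlib
import json


def stack_fingerprint(components: dict) -> str:
    """Return a short, deterministic identifier for a software composition."""
    canonical = json.dumps(components, sort_keys=True)  # order-independent form
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


deployed = {
    "operating_system": "ubuntu-12.04",
    "openstack_release": "grizzly-2013.1.2",
    "configuration_tool": "chef-11.4",
}
print("stack id:", stack_fingerprint(deployed))
```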

Achieving Repeatable Deployments

In this case, our reference deployment consists of the exact composition of the operating system, infrastructure tooling, and capabilities for the deployment. By having a reference capability, we can ensure that we have the same:

  • Operating system
  • Monitoring
  • Configuration stacks
  • Security tooling
  • Patches
  • Network stack (including bridge, VLAN, and iptables configurations)

Each one of these components is a potential failure point in a deployment. By being able to configure and maintain that configuration automatically, we dramatically increase the opportunities for success by enabling customers to have a consistent configuration between sites.
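
As an illustration only (not from the original post), here is a minimal sketch of recording that reference composition and reporting drift at a site. Every component name and version below is a hypothetical placeholder rather than a recommendation.

```python
"""Sketch: describe a reference deployment and report drift at a site.

The manifest fields mirror the checklist above (operating system,
monitoring, configuration stacks, security tooling, patches, network
stack). All values are illustrative placeholders.
"""
from typing import Dict, Tuple

REFERENCE = {
    "operating_system": "ubuntu-12.04",
    "monitoring": "nagios-3.4",
    "configuration_stack": "chef-11.4",
    "security_tooling": "openssl-1.0.1e",
    "patch_level": "2013-07-baseline",
    "network_stack": "ovs-1.9+vlan",
}


def drift_report(site: Dict[str, str],
                 reference: Dict[str, str] = REFERENCE) -> Dict[str, Tuple[str, str]]:
    """Return each component where a site diverges from the reference."""
    return {
        component: (reference[component], site.get(component, "missing"))
        for component in reference
        if site.get(component) != reference[component]
    }


# Example: a site that applied a newer patch level than the reference.
site_a = dict(REFERENCE, patch_level="2013-08-hotfix")
print(drift_report(site_a))
# -> {'patch_level': ('2013-07-baseline', '2013-08-hotfix')}
```

A report like this is what makes the cross-site comparison described next practical: sites that match the reference can reuse each other's fixes directly.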

Repeatable reference deployments enable customers to compare notes with Dell and with others in the community. It enables us to take what we have learned from one site and apply it to another. For example, if a new patch breaks functionality, then we can quickly determine what caused it. We can then fix the problem, add in the complementary fix, and deploy it at that one site. If we are aware that 90% of our other sites have exactly the same configuration, those other sites can avoid a similar problem. In this way, having both a pattern and a practice for reference deployments enables the community to absorb change and respond much more quickly, and to be successful with a changing code base. We found that it is impractical to expect things not to change.

The only thing that we can do is build resiliency for change into these deployments. Creating an automated and tested referenceable deployment is the best way to cope with change.

Success Factors of Operating Open Source Infrastructure [Series Intro]

Building a best practices platform is essential to helping companies share operations knowledge. In the fast-moving world of open source software, sharing documentation about what to do is not sufficient. We must also share how to do it, because the operations process is tightly coupled to achieving ongoing success.

Further, since change is constant, we need to change our definition of “stability” to reflect a much more iterative and fluid environment.

Baseline testing is an essential part of this platform. It enables customers to ensure not only fast time to value, but also that their systems consistently conform with industry best practices, even as they are upgraded and migrate towards a continuous deployment infrastructure.

The details are too long for a single post so I’m going to explore this as three distinct topics over the next two weeks.

  1. Reference Deployments talks about needing an automated way to repeat configuration between sites.
  2. Ops Validation using Development Tests talks about having a way to verify that everyone uses a common reference platform.
  3. Shared Open Operations / DevOps (pending) talks about putting reference deployment and common validation together to create a true open operations practice.

OpenStack, Hadoop, Ceph, Docker and other open source projects are changing the landscape for information technology. Customers seeking to become successful with these evolving platforms must look beyond the software bits and consider both the culture and the operations. The culture is critical because interacting with the open source project’s community (directly or through a proxy) can help ensure success using the software. Operations are critical because open source projects expect the community to help find and resolve issues, which results in more robust and capable products. Consequently, users of open source software must operate in a more fluid environment.

My team at Dell saw this need as we navigated the early days of OpenStack.  The Crowbar project started because we saw that the community needed a platform that could adapt and evolve with the open source projects that our advanced customers were implementing. Our ability to deliver an open operations platform enables the community to collaborate, and to skip over routine details to refocus on shared best practices.

My recent focus on the OpenStack DefCore work reinforces these original goals.  Using tests to help provide a common baseline is a concrete, open and referenceable way to promote interoperability.  I hope that this in turn drives a dialog around best practices and shared operations because those help mature the community.

refined: 10 OpenStack Core Positions

THIS POST IS #8 IN A SERIES ABOUT “WHAT IS CORE.”

Last week, I posted a streamlined visual of the core discussion that distilled the 12 positions into 10. Here are the reordered and cleaned-up matching positions. This should make it much easier to understand the context.

Note 11/3: The Core Definition is now maintained on the OpenStack Wiki.  This list may not reflect the latest changes.
  1. Implementations that are Core can use the OpenStack trademark (OpenStack™)

    1. This is the legal definition of “core” and why it matters to the community.

    2. We want to make sure that the OpenStack™ mark means something.

    3. The OpenStack™ mark is not the same as the OpenStack brand; however, the Board uses its control of the mark as a proxy to help manage the brand.

  2. Core is a subset of the whole project

    1. The OpenStack project is supposed to be a broad and diverse community with new projects entering incubation and new implementations being constantly added.  This innovation is vital to OpenStack but separate from the definition of Core.

    2. There may be other marks that are managed separately by the foundation, and available for the platform ecosystem as per the Board’s discretion

    3. The “OpenStack API Compatible” mark is not part of this discussion and should not be assumed.

  3. Core definition can be applied equally to all usage models

    1. There should not be multiple definitions of OpenStack depending on the operator (public, private, community, etc)

    2. While each deployment is not expected to be identical, the differences must be quantifiable

  4. Claiming OpenStack requires use of designated upstream code

    1. Implementations claiming the OpenStack™ mark must use the OpenStack upstream code (or be using code submitted to upstream)

    2. You are not OpenStack if you pass all the tests but do not use the API framework

    3. This prevents people from using the API without joining the community

    4. This also surfaces bit-rot in alternate implementations to the larger community

    5. This behavior improves interoperability because there is more shared code between implementations

  5. Projects must have an open reference implementation

    1. OpenStack will require an open source reference base plug-in implementation for projects (if not part of OpenStack, license model for reference plug-in must be compatible).

    2. Definition of a plug-in: alternate backend implementations with a common API framework that uses common _code_ to implement the API

    3. Projects (where technically feasible) are expected to implement a plug-in or extension architecture.

    4. This is already in place for several projects; it addresses concerns around ecosystem support and enables innovation

    5. Reference plug-ins are, by definition, the complete capability set.  It is not acceptable to have “core” features that are not functional in the reference plug-in

    6. This will enable alternate implementations to offer innovative or differentiated features without forcing changes to the reference plug-in implementation

    7. This will enable the reference to expand without forcing other  alternate implementations to match all features and recertify

  6. Vendors may substitute alternate implementations

    1. If a vendor plug-in passes all relevant tests then it can be considered a full substitute for the reference plug-in

    2. If a vendor plug-in does NOT pass all relevant tests, then the vendor is required to include the open source reference in the implementation.

    3. Alternate implementations may pass any tests that make sense

    4. Alternate implementations should add tests to validate new functionality.

    5. They must pass all the must-pass tests (see #10) to claim the OpenStack mark.

  7. OpenStack Implementations are verified by open community tests

    1. Vendor OpenStack implementations must achieve 100% of must-have coverage?

    2. Implemented tests can be flagged for the may-have requirements list [Joshua McKenty]

    3. Certifiers will be required to disclose their testing gaps.

    4. This will put a lot of pressure on the Tempest project

    5. Maintenance of the testing suite will become a core Foundation responsibility. This may require additional resources

    6. Implementations and products are allowed to have variation based on publication of compatibility

    7. Consumers must have a way to determine how the system is different from reference (posted, discovered, etc)

    8. Testing must respond in an appropriate way on BOTH pass and fail (the wrong return rejects the entire suite)

  8. Tests can be remotely or self-administered

    1. Plug-in certification is driven by the Tempest self-certification model

    2. Self-certifiers are required to publish their results

    3. Self-certifiers are required to publish enough information that a 3rd party could build the reference implementation to pass the tests.

    4. Self-certifications must include the operating systems that have been certified

    5. It is preferred for self-certified implementations to reference an OpenStack reference architecture “flavor” instead of defining their own reference. (A way to publish and agree on flavors is needed.)

    6. The Foundation needs to define a mechanism of dispute resolution. (A trust but verify model)

    7. As an ecosystem partner, you have a need to make a “works against OpenStack” statement that is supportable

    8. API consumers can claim to work against the OpenStack API if they work against any implementation passing all the “must-have” tests (YES)

    9. API consumers can state they are working against the OpenStack API with some “may have” items as requirements

    10. API consumers are expected to write tests that validate their required behaviors (submitted as “may have” tests)

  9. A subset of tests are chosen by the Foundation as “must-pass”

    1. An OpenStack body will recommend which tests are elevated from may-have to must-have

    2. The selection of “must-pass” tests should be based on quantifiable information when possible.

    3. Must-pass tests should be selected from the existing body of “may-pass” tests.  This encourages people to write tests for cases they want supported.

    4. We will have a process by which tests are elevated from may to must lists

    5. Potentially, the User Committee will nominate tests to be elevated to the board

  10. OpenStack Core means passing all “must-pass” tests

    1. The OpenStack board owns the responsibility to define ‘core’ – to approve ‘musts’

    2. We are NOT defining which items are on the list in this effort, just taking the position that this is how we will define core (see the sketch after this list)

    3. May-have tests include items that are in the integrated release but are not core.

    4. Must-haves must comply with the Core criteria defined from the IncUp committee results

    5. Projects in Incubation or pre-Incubation are not to be included in the ‘may’ list
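
As an illustration of positions #7 through #10 (not part of the original position list), here is a minimal sketch of how a certifier might check a candidate implementation’s test results against a Foundation-published must-pass list. The test names and data structures are hypothetical.

```python
"""Sketch: evaluate test results against a "must-pass" list (positions #7-#10).

Both the must-pass list and the results are hypothetical; in practice the
Foundation would publish the list and the results would come from the
community test suite (Tempest).
"""
from typing import Dict, List


def check_core_claim(results: Dict[str, bool], must_pass: List[str]) -> bool:
    """An implementation may claim Core only if every must-pass test passed."""
    missing = [t for t in must_pass if t not in results]
    failed = [t for t in must_pass if results.get(t) is False]
    if missing:
        print("Cannot claim Core; no results for:", missing)
    if failed:
        print("Cannot claim Core; failed must-pass tests:", failed)
    return not missing and not failed


# Hypothetical example data.
must_pass = ["compute.servers.create", "compute.servers.list", "identity.tokens.issue"]
results = {
    "compute.servers.create": True,
    "compute.servers.list": True,
    "identity.tokens.issue": True,
    "volume.snapshots.create": False,  # a may-have test; its failure does not block Core
}
print("Core claim valid:", check_core_claim(results, must_pass))
```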