Mark Stouse’s “Making Predictions for 14” series

Posted on December 23, 2013 by Rob H

I was invited to be part of Mark Stouse’s 2014 big data & cloud predictions series. His questions had me thinking deeply about the past year and I’m happy to repost them here with links to the other predictors too including (Robert Scoble, Shel Israel, and David H. Deans).

1. Describe in one sentence what you do and why you’re good at it.

I specialize in architecture for infrastructure software for scale data center operations (aka “cloud”) and I have 14 years of battle scars that inform my designs.

2. Cloud Computing, Big Data or Consumerization: Which trend do you feel is having the most impact on IT today and why?

Cloud, Data & Consumerization are all connected, so there’s no one clear “most impactful” winner except that all three are forcing IT to rethink how we handle operations. The pace of change for these categories (many of which are open source driven) is so fast that traditional IT governance cannot keep up. I’m specifically talking about the DevOps and Lean Software Delivery paradigms. These approaches do not mean that we’re trading speed for quality; in fact, I’ve seen that we’re adopting techniques that deliver both higher quality and speed.

3. What do you think is the biggest misconception about Cloud computing/Big Data/Consumerization?

That someone can purchase them as a SKU. These are really architectural concepts that impact how we solve problems instead of specific products. My experience is that customers overlook their need to understand how to change their business to take advantage of these technologies. It’s the same classic challenge for ROI from most new technologies – they don’t exist apart from the business matching changes to the business to leverage them.

4. Which (Cloud Computing/Big Data/Consumerization) trend has surprised you most in the last five years?

Open source has surprised me because we’ve seen it transform from a cost concern into a supply chain concern. When I started doing open source work for Dell, customers were very interested in innovation and controlling license costs. This has really changed over the last few years. Today, customers are more concerned with community participation and transparency of their product code base. This surprised me until I realized that they are really seeking to ensure that they had maximum control and visibility into their “IT Supply Chain.” It may seem like a paradox, but open source software is uniquely positioned to help companies maintain more control of their critical IT because they are not tightly coupled to a single vendor.

5. How has Cloud Computing/Big Data/Consumerization had the biggest impact in YOUR life to date?

Beyond it being my career, I believe these technologies have created a new degree of freedom for me. I’m answering these questions from the SFO airport where I’m carrying all of the tools I need to do my job in a space small enough to fit under the seat in front of me plus a free Wifi connection. I believe we are only just learning how access to information and portable computing will change our experience. This learning process will be both liberating and painful as we work out the right balances between access, identity and privacy.

6. On a lighter note – If Cloud/Big Data/Consumerization could be personified by a superhero, which superhero would it be and why?

The Hulk. Looks like a friendly geek but it’s going to crush you if you’re not careful.

7. What aspect of (Cloud Computing/Big Data/ Consumerization) are you most excited about in the future, and what excites you about it?

The Internet of Things (even if I hate the term) is very exciting because we’re moving into a place where we have real ways to connect our virtual and physical lives. That translates into cool technologies like self-driving cars and smart power utilities. I think it will also motivate a revolution in how people interact with computers and each other. It’s going to open up a whole new dimension on our personal interaction with our surroundings. I’m specifically thinking about a book “Rainbows End” by Vernor Vinge that paints this future in vivid detail.

OpenStack Havana provides foundation for XXaaS you need

Posted on October 17, 2013 by Rob H

It’s been a long time, and a lot of summits, since I posted how OpenStack was ready for workloads (back in Cactus!). We’ve seen remarkable growth of both the platform technology and the community surrounding it. So much growth that now we’re struggling to define “what is core” for the project and I’m proud be on the Foundation Board helping to lead that charge.

So what’s exciting in Havana?

There’s a lot I am excited about in the latest OpenStack release.

Complete Split of Compute / Storage / Network services

In the beginning, OpenStack IaaS was one service (Nova). We’ve been breaking that monolith into distinct concerns (Compute, Network, Storage) for the last several releases and I think Havana is the first release where all of the three of the services are robust enough to take production workloads.

This is a major milestone for OpenStack because knowledge that the APIs were changing inhibited adoption.

ENABLING TECH INTEGRATION: Docker & Ceph

We’ve been hanging out with the Ceph and Docker teams, so you can expect to see some interesting. These two are proof of the a fallacy that only OpenStack projects are critical to OpenStack because neither of these technologies are moving under the official OpenStack umbrella. I am looking forward to seeing both have dramatic impacts in how cloud deployments.

Docker promises to make Linux Containers (LXC) more portable and easier to use. This paravirtualization approach provides near bear metal performance without compromising VM portability. More importantly, you can oversubscribe LXC much more than VMs. This allows you to dramatically improve system utilization and unlocks some other interesting quality of service tricks.

Ceph is showing signs of becoming the scale out storage king. Beyond its solid data dispersion algorithm, a key aspect of its mojo is that is delivers both block and object storage. I’ve seen a lot of interest in consolidating both types of storage into a single service. Ceph delivers on that plus performance and cost. It’s a real winner.

Crowbar Integration & High Availability Configuration!

We’ve been making amazing strides in the Crowbar + OpenStack integration! As usual, we’re planning our zero day community build (on the “Roxy” branch) to get people started thinking about operationalizing OpenStack. This is going to be especially interesting because we’re introducing it first on Crowbar 1 with plans to quickly migrate to Crowbar 2 where we can leverage the attribute injection pattern that OpenStack cookbooks also use. Ultimately, we expect those efforts to converge. The fact that Dell is putting reference implementations of HA deployment best practices into the open community is a major win for OpenStack.

Tests, Tests, Tests & Continuous Delivery

OpenStack continue to drive higher standards for reviews, integration and testing. I’m especially excited to the volume and activity around our review system (although backlogs in reviews are challenges). In addition, the community continues to invest in the test suites like the Tempest project. These are direct benefits to operators beyond simple code quality. Our team uses Tempest to baseline field deployments. This means that OpenStack test suites help validate live deployments, not just lab configurations.

We achieve a greater level of quality when we gate code check-ins on tests that matter to real deployments. In fact, that premise is the basis for our “what is core” process. It also means that more operators can choose to deploy OpenStack continuously from trunk (which I consider to be a best practice scale ops).

Where did we fall short?

With growth comes challenges, Havana is most complex release yet. The number of projects that are part the OpenStack integrated release family continues to expand. While these new projects show the powerful innovation engine at work with OpenStack, they also make the project larger and more difficult to comprehend (especially for n00bs). We continue to invest in Crowbar as a way to serve the community by making OpenStack more accessible and providing open best practices.

We are still struggling to resolve questions about interoperability (defining core should help) and portability. We spent a lot of time at the last two summits on interoperability, but I don’t feel like we are much closer than before. Hopefully, progress on Core will break the log jam.

Looking ahead to Ice House?

I and many leaders from Dell will be at the Ice House Summit in Hong Kong listening and learning.

The top of my list is the family of XXaaS services (Database aaS, Load Balanacer aaS, Firewall aaS, etc) that have appeared. I’m a firm believer that clouds are more than compute+network+storage. With a stable core, OpenStack is ready to expand into essential platform services.

If you are at the summit, please join Dell (my employer) and Intel for the OpenStack Summit Welcome Reception (RSVP!) kickoff networking and social event on Tuesday November 5, 2013 from 6:30 – 8:30pm at the SkyBistro in the SkyCity Marriott. My teammate, Kamesh Pemmaraju, has a complete list of all Dell the panels and events.

In scale-out infrastructure, tools & automation matter

Posted on July 1, 2013 by Rob H

Scale out platforms like Hadoop have different operating rules. I heard an interesting story today in which the performance of the overall system was improved 300% (run went from 15 mins down to 5 mins) by the removal of a node.

In a distributed system that coordinates work between multiple nodes, it only takes one bad node to dramatically impact the overall performance of the entire system.

Finding and correcting this type of failure can be difficult. While natural variability, hardware faults or bugs cause some issues, the human element is by far the most likely cause. If you can turn down noise injected by human error then you’ve got a chance to find the real system related issues.

Consequently, I’ve found that management tooling and automation are essential for success. Management tools help diagnose the cause of the issue and automation creates repeatable configurations that reduce the risk of human injected variability.

I’d also like to give a shout out to benchmarks as part of your tooling suite. Without having a reasonable benchmark it would be impossible to actually know that your changes improved performance.

Teaming Related Post Script: In considering the concept of system performance, I realized that distributed human systems (aka teams) have a very similar characteristic. A single person can have a disproportionate impact on overall team performance.

Thanks! I’m enjoying my conversation with you

Posted on May 15, 2013 by Rob H

I write because I love to tell stories and to think about how actions we take today will impact tomorrow. Ultimately, everything here is about a dialog with you because you are my sounding board and my critic. I appreciate when people engage me about posts here and extend the conversation into other dimensions. Feel free to call me on points and question my position – that’s what this is all about.

Thank you for being at part of my blog and joining in. I’m looking forward to hearing more from you.

During the OpenStack Summit, I got to lead and participate in some excellent presentations and panels. While my theme for this summit was interoperability, there are many other items discussed.

I hope you enjoy them.

Did one of these topics stand out? Is there something I missed? Please let me know!

OpenStack steps toward Interopability with Temptest, RAs & RefStack.org

Posted on April 26, 2013 by Rob H

I’m a cautious supporter of OpenStack leading with implementation (over API specification); however, it clearly has risks. OpenStack has the benefit of many live sites operating at significant scale. The short term cost is that those sites were not fully interoperable (progress is being made!). Even if they were, we are lack the means to validate that they are.

The interoperability challenge was a major theme of the Havana Summit in Portland last week (panel I moderated) . Solving it creates significant benefits for the OpenStack community. These benefits have significant financial opportunities for the OpenStack ecosystem.

This is a journey that we are on together – it’s not a deliverable from a single company or a release that we will complete and move on.

There were several themes that Monty and I presented during Heat for Reference Architectures (slides). It’s pretty obvious that interop is valuable (I discuss why you should care in this earlier post) and running a cloud means dealing with hardware, software and ops in equal measures. We also identified lots of important items like Open Operations, Upstreaming, Reference Architecture/Implementation and Testing.

During the session, I think we did a good job stating how we can use Heat for an RA to make incremental steps. and I had a session about upgrade (slides).

Even with all this progress, Testing for interoperability was one of the largest gaps.

The challenge is not if we should test, but how to create a set of tests that everyone will accept as adequate. Approach that goal with standardization or specification objective is likely an impossible challenge.

Joshua McKenty & Monty Taylor found a starting point for interoperability FITS testing: “let’s use the Tempest tests we’ve got.”

We should question the assumption that faithful implementation test specifications (FITS) for interoperability are only useful with a matching specification and significant API coverage. Any level of coverage provides useful information and, more importantly, visibility accelerates contributions to the test base.

I can speak from experience that this approach has merit. The Crowbar team at Dell has been including OpenStack Tempest as part of our reference deployment since Essex and it runs as part of our automated test infrastructure against every build. This process does not catch every issue, but passing Tempest is a very good indication that you’ve got the a workable OpenStack deployment.

Crowbar and our Pivot (or, how we slipped and shipped Grizzly)

Posted on April 24, 2013 by Rob H

My team at Dell uses Lean process because it forces us to be honest about making hard choices. Our recent decision to pivot back to Crowbar 1.x for the OpenStack Grizzly release is a great example how the pivot process works.

4/24 note: I have a longer post and ISO for Grizzly on Crowbar waiting until we enter QA. The Crowbar community is already very active around this work and you’re encouraged to join.

Like any refactor, there was schedule risk when we started the Crowbar 2.x release. To mitigate this risk, we made two critical choices. First, we choose to advance the OpenStack barclamps on the 1.x code base in parallel with the 2.x work. Second, we chose a pivot date for the team to choose releasing Grizzly on the 1.x or 2.x trunks.

Choosing to jump back to 1.x was one of the hardest choices I’ve made in my career. I’m proud that we had the foresight to keep that as an option and prouder that our team rallied to make it happen.

I acknowledge that 1.x has gaps; however, getting Grizzly into the field for PoCs and pilots with 1.x provide substantial benefits to the community. That said, there are barclamps for HA deployments and other production features that are under development on the 1.x branch and will be available in the community.

The 2.x code base provides important features but we are building from on the 1.x deployment recipes. This means that development, testing and tuning applied to the Grizzly barclamps will translates directly into Crowbar 2.x field readiness. In fact, more completeness on OpenStack can dramatically simplify Crowbar 2.x testing efforts. This is especially true on the OpenStack Networking (fka Quantum) barclamps because they are new work.

Delivering solutions is a balance between features, timing and field experience. The Crowbar team’s preference is to collaborate with operators in the field and that means making workable software available quickly.

I hope that you’ll agree with our approach and help us make Grizzly the most deployable OpenStack yet.

OpenStack’s next hurdle: Interoperability. Why should you care?

Posted on April 10, 2013 by Rob H

SXSW life size Newton’s Cradle

The OpenStack Board spent several hours (yes, hours) discussing interoperability related topics at the last board meeting. Fundamentally, the community benefits when uses can operate easily across multiple OpenStack deployments (their own and/or public clouds).

Cloud interoperability: the ability to transfer workloads between systems without changes to the deployment operations management infrastructure.

This is NOT hybrid (which I defined as a workload transparently operating in multiple systems); however it is a prereq to achieve scalable hybrid operation.

Interoperability matters because the OpenStack value proposition is all about creating a common platform. IT World does a good job laying out the problem (note, I work for Dell). To create sites that can interoperate, we have to some serious lifting:

Continue to standardize APIs
Create tests that validate the APIs
Agree to common operations practices
Provide clear upgrade processes and tolerances to clients using different API versions

At the OpenStack Summit, there are multiple chances to engage on this. I’m moderating a panel about Interop and also sharing a session about the highly related topic of Reference Architectures with Monty Tayor.

The Interop Panel (topic description here) is Tuesday @ 5:20pm. If you join, you’ll get to see me try to stump our awesome panelists

Jonathan LaCour, DreamHost
Troy Toman, Rackspace
Bernard Golden, Enstratius
Monty Taylor, OpenStack Board (and HP)
Peter Pouliot, Microsoft

PS: Oh, and I’m also talking about DevOps Upgrades Patterns during the very first session (see a preview).

5 things keeping DevOps from playing well with others (Chef, Crowbar and Upstream Patterns)

Posted on March 8, 2013 by Rob H

Since my earliest days on the OpenStack project, I’ve wanted to break the cycle on black box operations with open ops. With the rise of community driven DevOps platforms like Opscode Chef and Puppetlabs, we’ve reached a point where it’s both practical and imperative to share operational practices in the form of code and tooling.

Being open and collaborating are not the same thing.

It’s a huge win that we can compare OpenStack cookbooks. The real victory comes when multiple deployments use the same trunk instead of forking.

This has been an objective I’ve helped drive for OpenStack (with Matt Ray) and it has been the Crowbar objective from the start and is the keystone of our Crowbar 2 work.

This has proven to be a formidable challenge for several reasons:

diverging DevOps patterns that can be used between private, public, large, small, and other deployments -> solution: attribute injection pattern is promising
tooling gaps prevent operators from leveraging shared deployments -> solution: this is part of Crowbar’s mission
under investing in community supporting features because they are seen as taking away from getting into production -> solution: need leadership and others with join
drift between target versions creates the need for forking even if the cookbooks are fundamentally the same -> solution: pull from source approaches help create distro independent baselines
missing reference architectures interfere with having a stable baseline to deploy against -> solution: agree to a standard, machine consumable RA format like OpenStack Heat.

Unfortunately, these five challenges are tightly coupled and we have to progress on them simultaneously. The tooling and community requires patterns and RAs.

The good news is that we are making real progress.

Judd Maltin (@newgoliath), a Crowbar team member, has documented the emerging Attribute Injection practice that Crowbar has been leading. That practice has been refined in the open by ATT and Rackspace. It is forming the foundation of the OpenStack cookbooks.

Understanding, discussing and supporting that pattern is an important step toward accelerating open operations. Please engage with us as we make the investments for open operations and help us implement the pattern.

The Atlantic magazine explains why Lean process rocks (and saves companies $$)

Posted on December 7, 2012 by Rob H

I’m certain that the Atlantic‘s Charles Fishman was not thinking software and DevOps when he wrote the excellent article about “The Insourcing Boom.” However, I strongly recommend reading this report for anyone who is interested in a practical example of the inefficiencies of software lean process (If you are impatient, jump to page 2 and search for toaster).

It’s important to realize that this article is not about software! It’s an article about industrial manufacturing and the impact that lean process has when you are making stuff. It’s about how US companies are using Lean to make domestic plants more profitable than Asian ones. It turns out that how you make something really matters – you can’t really optimize the system if you treat major parts like a black box.

When I talk about Agile and Lean, I am talking about proven processes being applied broadly to companies that want to make profit selling stuff. That’s what this article is about

If you are making software then you are making stuff! Your install and deploy process is your assembly line. Your unreleased code is your inventory.

This article does a good job explaining the benefits of being close to your manufacturing (DevOps) and being flexible in deployment (Agile) and being connected to customers (Lean). The software industry often acts like it’s inventing everything from scratch. When it comes to manufacturing processes, we can learn a lot from industry.

Unlike software, industry has real costs for scrap and lost inventory. Instead of thinking “old school” perhaps we should be thinking of it as the school of hard knocks.

My Dilemma with Folsom – why I want to jump to G

Posted on December 6, 2012 by Rob H

These views are my own. Based on 1×1 discussions I’ve had in the OpenStack community, I am not alone.

If you’ve read my blog then you know I am a vocal and active supporter of OpenStack who is seeking re-election to the OpenStack Board. I’m personally and professionally committed to the project’s success. And, I’m confident that OpenStack’s collaborative community approach is out innovating other clouds.

A vibrant project requires that we reflect honestly: we have an equal measure of challenges: shadow free fall Dev, API vs implementation, forking risk and others. As someone helping users deploy OpenStack today, I find my self straddling between a solid release (Essex) and a innovative one (Grizzly). Frankly, I’m finding it very difficult to focus on Folsom.

Grizzly excites me and clearly I’m not alone. Based on pace of development, I believe we saw a significant developer migration during feature freeze free fall.

In Grizzly, both Cinder and Quantum will have progressed to a point where they are ready for mainstream consumption. That means that OpenStack will have achieved the cloud API trifecta of compute-store-network.

Cinder will get beyond the “replace Nova Volume” feature set and expands the list of connectors.
Quantum will get to parity with Nova Network, addresses overlapping VM IPs and goes beyond L2 with L3 feature enablement like load balancing aaS.
We are having a real dialog about upgrades while the code is still in progress
And new projects like Celio and Heat are poised to address real use problems in billing and application formation.

Everything I hear about Folsom deployment is positive with stable code and significant improvements; however, we’re too late to really influence operability at the code level because the Folsom release is done. This is not a new dilemma. As operators, we seem to be forever chasing the tail of the release.

The perpetual cycle of implementing deployment after release is futile, exhausting and demoralizing because we finish just in time for the spotlight to shift to the next release.

I don’t want to slow the pace of releases. In Agile/Lean, we believe that if something is hard then we do should it more. Instead, I am looking at Grizzly and seeing an opportunity to break the cycle. I am looking at Folsom and thinking that most people will be OK with Essex for a little longer.

Maybe I’m a dreamer, but if we can close the deployment time gap then we accelerate adoption, innovation and happy hour. If that means jilting Folsom at the release altar to elope with Grizzly then I can live with that.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Category Archives: DevOps