Crowbar 2 Status Update > I can feel the rumble of the engines

two

Crowbar Two

While I’ve been more muted on our Crowbar 2 progress since our pivot back to CB1 for Grizzly, it has been going strong and steady.  We took advantage of the extra time to do some real analysis about late-binding, simulated annealing, emergent services and functional operations that are directly reflected in Crowbar’s operational model (yes, I’m working on posted explaining each concept).

We’re planning Crowbar 2 hack-a-thon in Hong Kong before the OpenStack Ice House Summit (11/1-3).  We don’t expect a big crowd on site, but the results will be fun to watch remote and it should be possible to play along (watch the crowbar list for details).

In the mean time, I wanted to pass along this comprehensive status update by Crowbar’s leading committer, Victor Lowther:

It has been a little over a month since my last status report on
Crowbar 2.0, so now that we have hit the next major milestone
(installing the OS on a node and being able to manage it afterwards),
it is time for another status report.

Major changes since the initial status report:

* The Crowbar framework understands node aliveness and availability.
* The Network barclamp is operational, and can manage IPv4 and IPv6 in
  the same network.
* delayed_jobs + a stupidly thin queuing layer handle all our
  long-running tasks.
* We have migrated to postgresql 9.3 for all our database needs.
* DHCP and DNS now utilize the on_node_* role hooks to manage their
  databases.
* We support a 2 layer deployment tree -- system on top, everything
  else in the second layer.
* The provisioner can install Ubuntu 12.04 on other nodes.
* The crowbar framework can manage other nodes that are not in
  Sledgehammer.
* We have a shiny installation wizard now.

In more detail:

Aliveness and availability:

Nodes in the Crowbar framework have two related flags that control
whether the annealer can operate on them.

Aliveness is under the control of the Crowbar framework and
encapsulates the framework's idea of whether any given node is
manageable or not.  If a node is pingable and can be SSH'ed into as
root without a password using the credentials of the root user on
the admin node, then the node is alive, otherwise it is dead.
Aliveness is tested everytime a jig tries to do something on a node
-- if a node cannot be pinged and SSH'ed into from at least one of
its addresses on the admin network, it will be marked as
dead.  When a node is marked as dead, all of the noderoles on that
node will be set to either blocked or todo (depending on the state of
their parent noderoles), and those changes will ripple down the
noderole dependency graph to any child noderoles.

Nodes will also mark themselves as alive and dead in the course of
their startup and shutdown routines.

Availability is under the control of the Crowbar cluster
administrators, and should be used by them to tell Crowbar that it
should stop managing noderoles on the node.  When a node is not
available, the annealer will not try to perform any jig runs on a
node, but it will leave the state of the noderoles alone.

A node must be both alive and available for the annealer to perform
operations on it.

The Network Barclamp:

The network barclamp is operational, with the following list of
features:

* Everything mentioned in Architecture for the Network Barclamp in
  Crowbar 2.0
* IPv6 support.  You can create ranges and routers for IPv6 addresses
  as well as IPv4 addresses, and you can tell a network that it should
  automatically assign IPv6 addresses to every node on that network by
  setting the v6prefix setting for that network to either:
  * a /64 network prefix, or
  * "auto", which will create a globally unique RFC4193 IPv6 network
    prefix from a randomly-chosen 40 bit number (unique per cluster
    installation) followed by a subnet ID based on the ID of the
    Crowbar network.
  Either way, nodes in a Crowbar network that has a v6prefix will get
  an interface ID that maps back to their FQDN via the last 64 bits of
  the md5sum of that FQDN. For now, the admin network will
  automatically create an RFC4193 IPv6 network if it is not passed a
  v6prefix so that we can easily test all the core Crowbar components
  with IPv6 as well as IPv4.  The DNS barclamp has been updated to
  create the appropriate AAAA records for any IPv6 addresses in the
  admin network.

Delayed Jobs and Queuing:

The Crowbar framework runs all jig actions in the background using
delayed_jobs + a thin queuing layer that ensures that only one task is
running on a node at any given time.  For now, we limit ourselves to
having up to 10 tasks running in the background at any given time,
which should be enough for the immediate future until we come up with
proper tuning guidelines or auto-tuning code for significantly larger
clusters.

Postgresql 9.3:

Migrating to delayed_jobs for all our background processing made it
immediatly obvious that sqlite is not at all suited to handling real
concurrency once we started doing multiple jig runs on different nodes
at a time. Postgresql is more than capable of handling our forseeable
concurrency and HA use cases, and gives us lots of scope for future
optimizations and scalability.

DHCP and DNS:

The roles for DHCP and DNS have been refactored to have seperate
database roles, which are resposible for keeping their respective
server roles up to date.  Theys use the on_node_* roles mentioned in
"Roles, nodes, noderoles, lifeycles, and events, oh my!" along with a
new on_node_change event hook create and destroy DNS and DHCP database
entries, and (in the case of DHCP) to control what enviroment a node
will PXE/UEFI boot into.  This gives us back the abiliy to boot into
something besides Sledgehammer.

Deployment tree:

Until now, the only deployment that Crowbar 2.0 knew about was the
system deployment.  The system deployment, however, cannot be placed
into proposed and therefore cannot be used for anything other than
initial bootstrap and discovery.  To do anything besides
bootstrap the admin node and discover other nodes, we need to create
another deployment to host the additional noderoles needed to allow
other workloads to exist on the cluster.  Right now, you can only
create deployments as shildren of the system deployment, limiting the
deployment tree to being 2 layers deep.

Provisioner Installing Ubuntu 12.04:

Now, we get to the first of tqo big things that were added in the last
week -- the provisioner being able to install Ubuntu 12.04 and bring
the resulting node under management by the rest of the CB 2.0
framework.  This bulds on top of the deployment tree and DHCP/DNS
database role work.  To install Ubuntu 12.04 on a node from the web UI:

1: Create a new deployment, and add the provisioner-os-install role to
that deployment.  In the future you will be able to edit the
deployment role information to change what the default OS for a
deployment should be.
2: Drag one of the non-admin nodes onto the provisioner-os-install
role.  This will create a proposed noderole binding the
provisioner-os-install role to that node, and in the future you would
be able to change what OS would be installed on that node by editing
that noderole before committing the deployment.
3: Commit the snapshot.  This will cause several things to happen:
  * The freshly-bound noderoles will transition to TODO, which will
    trigger an annealer pass on the noderoles.
  * The annealer will grab all the provisioner-os-install roles that
    are in TODO, set them in TRANSITION, and hand them off to
    delayed_jobs via the queuing system.
  * The delayed_jobs handlers will use the script jig to schedule a
    reboot of the nodes for 60 seconds in the future and then return,
    which will transition the noderole to ACTIVE.
  * In the crowbar framework, the provisioner-os-install role has an
    on_active hook which will change the boot environment of the node
    passed to it via the noderole to the appropriate os install state
    for the OS we want to install, and mark the node as not alive so
    that the annealer will ignore the node while it is being
    installed.
  * The provisioner-dhcp-database role has an on_node_change handler
    that watches for changes in the boot environment of a node.  It
    will see the bootenv change, update the provisioner-dhcp-database
    noderoles with the new bootenv for the node, and then enqueue a
    run of all of the provisioner-dhcp-database roles.
  * delayed_jobs will see the enqueued runs, and run them in the order
    they were submitted.  All the runs sholuld happen before the 60
    seconds has elapsed.
  * When the nodes finally reboot, the DHCP databases should have been
    updated and the nodes will boot into the Uubntu OS installer,
    install, and then set their bootenv to local, which will tell the
    provisioner (via the provisioner-dhcp-database on_node_change
    hook) to not PXE boot the node anymore.
  * When the nodes reboot off their freshly-installed hard drive, they
    will mark themselves as alive, and the annealer will rerun all of
    the usual discovery roles.
The semi-astute observer will have noticed some obvious bugs and race
conditions in the above sequence of steps.  These have been left in
place in the interest of expediency and as learning oppourtunities for
others who need to get familiar with the Crowbar codebase.

Installation Wizard:

We have a shiny installation that you can use to finish bootstrapping
your admin node.  To use it, pass the --wizard flag after your FQDN to
/opt/dell/bin/install-crowbar when setting up the admin node, and the
install script will not automatically create an admin network or an
entry for the admin node, and logging into the web UI will let you
customize things before creating the initial admin node entry and
committing the system deployment.  

Once we get closer to releasing CB 2.0, --wizard will become the default.

Crowbar lays it all out: RAID & BIOS configs officially open sourced

MediaToday, Dell (my employer) announced a plethora of updates to our open source derived solutions (OpenStack and Hadoop). These solutions include the latest bits (Grizzly and Cloudera) for each project. And there’s another important notice for people tracking the Crowbar project: we’ve opened the remainder of its provisioning capability.

Yes, you can now build the open version of Crowbar and it has the code to configure a bare metal server.

Let me be very specific about this… my team at Dell tests Crowbar on a limited set of hardware configurations. Specifically, Dell server versions R720 + R720XD (using WSMAN and iIDRAC) and C6220 + C8000 (using open tools). Even on those servers, we have a limited RAID and NIC matrix; consequently, we are not positioned to duplicate other field configurations in our lab. So, while we’re excited to work with the community, caveat emptor open source.

Another thing about RAID and BIOS is that it’s REALLY HARD to get right. I know this because our team spends a lot of time testing and tweaking these, now open, parts of Crowbar. I’ve learned that doing hard things creates value; however, it’s also means that contributors to these barclamps need to be prepared to get some silicon under their fingernails.

I’m proud that we’ve reached this critical milestone and I hope that it encourages you to play along.

PS: It’s worth noting is that community activity on Crowbar has really increased. I’m excited to see all the excitement.

Update a pull request from git command line

Hand up

Sometimes we just need to feed the SEO gods… in this case, I could not find the simple git command line to update a pull request that I had in flight.

I was looking for the following:

git push -f personal HEAD:[pull branch]

Github.com happily gave me instructions from the pull branch but not the CLI version of the command.  The trick is that you need to know your remote (git remote) for the command so it’s not perfectly generic.  In the example above, my personal repo is named “personal”.

Deconstructing the command: you are pushing your to your personal clone from the local HEAD commit against the branch created for the pull request.  That’s because the pull request creates a branch from your clone to be pulled into the upstream repo.  That’s why it’s a PULL request not a push.

Ultimately, this is pretty basic git.  My experience with git is that the definition of “pretty basic” is a binary function.  Once you know how git works, everything in git is pretty basic.  Until then it’s completely opaque.

Side note: this is my 301st post on this blog!

8/1/2013 Post Script from Crowbar Contributor Adam Spiers

He noticed that I should include the -f in the git push instruction.  Read more at about that on his blog.

Connecting the dots: Dell stays course on OpenStack private

rob pdx drivingWhen it comes to OpenStack, I don’t just work for Dell: I’m the technical lead for our OpenStack-powered private Cloud Solution and an elected director to the OpenStack Foundation board.

Frankly, the announcement of our change in public cloud strategy overshadowed our increasing level of investment in OpenStack-powered private cloud solutions (we are hiring!).  Sam Greenblatt, Dell Product Group VP and Chief Architect, is very specific that the recent announcements are about increasing investment where Dell is already successful plus accelerating with new features (such as leadership in HyperV enablement).

The fact that we focused on our decision to pivot away from Dell hosted public cloud distracted from the strategic choices that we’ve been making.  In the lean process that we use, pivots are a positive sign of listening and self-honesty.  Sadly, that distraction led to confusion, misleading comments, and implications that Dell was dropping OpenStack or questioning OpenStack sustainability and market success.

For the record, Dell was one of the first companies to support OpenStack with supporting quotes from Forrest Norrod (Dell GM for Servers and my direct boss) way back  in July 2010.  Our private OpenStack based cloud, built on open source Crowbar, was the first to market 2 years ago (deploying Cactus!).  We’ve been investing steadily in both fundamental improvements to OpenStack deployment and being early supporting the Grizzly release.

I am not implying that OpenStack’s future is certain (we have a lot of work to do) or that Dell OpenStack strategy will not change again; however, I know first-hand that both are on much firmer footing than some reports have implied.

Crowbar cuts OpenStack Grizzly (“pebbles”) branch & seeks community testing

Pebbles CutThe Crowbar team (I work for Dell) continues to drive towards “zero day” deployment readiness. Our Hadoop deployments are tracking Dell | Cloudera Hadoop-powered releases within a month and our OpenStack releases harden within three months.

During the OpenStack summit, we cut our Grizzly branch (aka “pebbles”) and switched over to the release packages. Just a reminder, we basically skipped Folsom. While we’re still tuning out issues on OpenStack Networking (OVS+GRE) setup, we’re also looking for community to start testing and tuning the Chef deployment recipes.

We’re just sprints from release; consequently, it’s time for the Crowbar/OpenStack community to come and play! You can learn Grizzly and help tune the open source Ops scripts.

While the Crowbar team has been generating a lot of noise around our Crowbar 2.0 work, we have not neglected progress on OpenStack Grizzly.  We’ve been building Grizzly deploys on the 1.x code base using pull-from-source to ensure that we’d be ready for the release. For continuity, these same cookbooks will be the foundation of our CB2 deployment.

Features of Crowbar’s OpenStack Grizzly Deployments

  • We’ve had Nova Compute, Glance Image, Keystone Identity, Horizon Dashboard, Swift Object and Tempest for a long time. Those, of course, have been updated to Grizzly.
  • Added Block Storage
    • importable Ceph Barclamp & OpenStack Block Plug-in
    • Equalogic OpenStack Block Plug-in
  • Added Quantum OpenStack Network Barclamp
    • Uses OVS + GRE for deployment
  • 10 GB networking configuration
  • Rabbit MQ as its own barclamp
  • Swift Object Barclamps made a lot of progress in Folsom that translates to Grizzly
    • Apache Web Service
    • Rack awareness
    • HA configuration
    • Distribution Report
  • “Under the covers” improvements for Crowbar 1.x
    • Substantial improvements in how we configure host networking
    • Numerous bug fixes and tweaks
  • Pull from Source via the Git barclamp
    • Grizzly branch was switched to use Ubuntu & SUSE packages

We’ve made substantial progress, but there are still gaps. We do not have upgrade paths from Essex or Folsom. While we’ve been adding fault-tolerance features, full automatic HA deployments are not included.

Please build your own Crowbar ISO or check our new SoureForge download site then join the Crowbar List and IRC to collaborate with us on OpenStack (or Hadoop or Crowbar 2). Together, we will make this awesome.

OpenStack steps toward Interopability with Temptest, RAs & RefStack.org

Pipes are interoperableI’m a cautious supporter of OpenStack leading with implementation (over API specification); however, it clearly has risks. OpenStack has the benefit of many live sites operating at significant scale. The short term cost is that those sites were not fully interoperable (progress is being made!). Even if they were, we are lack the means to validate that they are.

The interoperability challenge was a major theme of the Havana Summit in Portland last week (panel I moderated) .  Solving it creates significant benefits for the OpenStack community.  These benefits have significant financial opportunities for the OpenStack ecosystem.

This is a journey that we are on together – it’s not a deliverable from a single company or a release that we will complete and move on.

There were several themes that Monty and I presented during Heat for Reference Architectures (slides).  It’s pretty obvious that interop is valuable (I discuss why you should care in this earlier post) and running a cloud means dealing with hardware, software and ops in equal measures.  We also identified lots of important items like Open OperationsUpstreamingReference Architecture/Implementation and Testing.

During the session, I think we did a good job stating how we can use Heat for an RA to make incremental steps.   and I had a session about upgrade (slides).

Even with all this progress, Testing for interoperability was one of the largest gaps.

The challenge is not if we should test, but how to create a set of tests that everyone will accept as adequate.  Approach that goal with standardization or specification objective is likely an impossible challenge.

Joshua McKenty & Monty Taylor found a starting point for interoperability FITS testing: “let’s use the Tempest tests we’ve got.”

We should question the assumption that faithful implementation test specifications (FITS) for interoperability are only useful with a matching specification and significant API coverage.  Any level of coverage provides useful information and, more importantly, visibility accelerates contributions to the test base.

I can speak from experience that this approach has merit.  The Crowbar team at Dell has been including OpenStack Tempest as part of our reference deployment since Essex and it runs as part of our automated test infrastructure against every build.  This process does not catch every issue, but passing Tempest is a very good indication that you’ve got the a workable OpenStack deployment.

Crowbar and our Pivot (or, how we slipped and shipped Grizzly)

Crowbar Grizzly PostMy team at Dell uses Lean process because it forces us to be honest about making hard choices. Our recent decision to pivot back to Crowbar 1.x for the OpenStack Grizzly release is a great example how the pivot process works.

4/24 note: I have a longer post and ISO for Grizzly on Crowbar waiting until we enter QA. The Crowbar community is already very active around this work and you’re encouraged to join.

Like any refactor, there was schedule risk when we started the Crowbar 2.x release. To mitigate this risk, we made two critical choices. First, we choose to advance the OpenStack barclamps on the 1.x code base in parallel with the 2.x work. Second, we chose a pivot date for the team to choose releasing Grizzly on the 1.x or 2.x trunks.

Choosing to jump back to 1.x was one of the hardest choices I’ve made in my career. I’m proud that we had the foresight to keep that as an option and prouder that our team rallied to make it happen.

I acknowledge that 1.x has gaps; however, getting Grizzly into the field for PoCs and pilots with 1.x provide substantial benefits to the community.  That said, there are barclamps for HA deployments and other production features that are under development on the 1.x branch and will be available in the community.

The 2.x code base provides important features but we are building from on the 1.x deployment recipes. This means that development, testing and tuning applied to the Grizzly barclamps will translates directly into Crowbar 2.x field readiness. In fact, more completeness on OpenStack can dramatically simplify Crowbar 2.x testing efforts.  This is especially true on the OpenStack Networking (fka Quantum) barclamps because they are new work.

Delivering solutions is a balance between features, timing and field experience.  The Crowbar team’s preference is to collaborate with operators in the field and that means making workable software available quickly.

I hope that you’ll agree with our approach and help us make Grizzly the most deployable OpenStack yet.

5 things keeping DevOps from playing well with others (Chef, Crowbar and Upstream Patterns)

Sharing can be hardSince my earliest days on the OpenStack project, I’ve wanted to break the cycle on black box operations with open ops. With the rise of community driven DevOps platforms like Opscode Chef and Puppetlabs, we’ve reached a point where it’s both practical and imperative to share operational practices in the form of code and tooling.

Being open and collaborating are not the same thing.

It’s a huge win that we can compare OpenStack cookbooks. The real victory comes when multiple deployments use the same trunk instead of forking.

This has been an objective I’ve helped drive for OpenStack (with Matt Ray) and it has been the Crowbar objective from the start and is the keystone of our Crowbar 2 work.

This has proven to be a formidable challenge for several reasons:

  1. diverging DevOps patterns that can be used between private, public, large, small, and other deployments -> solution: attribute injection pattern is promising
  2. tooling gaps prevent operators from leveraging shared deployments -> solution: this is part of Crowbar’s mission
  3. under investing in community supporting features because they are seen as taking away from getting into production -> solution: need leadership and others with join
  4. drift between target versions creates the need for forking even if the cookbooks are fundamentally the same -> solution: pull from source approaches help create distro independent baselines
  5. missing reference architectures interfere with having a stable baseline to deploy against -> solution: agree to a standard, machine consumable RA format like OpenStack Heat.

Unfortunately, these five challenges are tightly coupled and we have to progress on them simultaneously. The tooling and community requires patterns and RAs.

The good news is that we are making real progress.

Judd Maltin (@newgoliath), a Crowbar team member, has documented the emerging Attribute Injection practice that Crowbar has been leading. That practice has been refined in the open by ATT and Rackspace. It is forming the foundation of the OpenStack cookbooks.

Understanding, discussing and supporting that pattern is an important step toward accelerating open operations. Please engage with us as we make the investments for open operations and help us implement the pattern.

double Block Head with OpenStack+Equallogic & Crowbar+Ceph

Block Head

Whew….Yesterday, Dell announced TWO OpenStack block storage capabilities (Equallogic & Ceph) for our OpenStack Essex Solution (I’m on the Dell OpenStack/Crowbar team) and community edition.  The addition of block storage effectively fills the “persistent storage” gap in the solution.  I’m quadrupally excited because we now have:

  1. both open source (Ceph) and enterprise (Equallogic) choices
  2. both Nova drivers’ code is in the open at part of our open source Crowbar work

Frankly, I’ve been having trouble sitting on the news until Dell World because both features have been available in Github before the announcement (EQLX and Ceph-Barclamp).  Such is the emerging intersection of corporate marketing and open source.

As you may expect, we are delivering them through Crowbar; however, we’ve already had customers pickup the EQLX code and apply it without Crowbar.

The Equallogic+Nova Connector

block-eqlx

If you are using Crowbar 1.5 (Essex 2) then you already have the code!  Of course, you still need to have the admin information for your SAN – we did not automate the configuration of the storage system, but the Nova Volume integration.

We have it under a split test so you need to do the following to enable the configuration options:

  1. Install OpenStack as normal
  2. Create the Nova proposal
  3. Enter “Raw” Attribute Mode
  4. Change the “volume_type” to “eqlx”
  5. Save
  6. The Equallogic options should be available in the custom attribute editor!  (of course, you can edit in raw mode too)

Want Docs?  Got them!  Check out these > EQLX Driver Install Addendum

Usage note: the integration uses SSH sessions.  It has been performance tested but not been tested at scale.

The Ceph+Nova Connector

block-ceph

The Ceph capability includes a Ceph barclamp!  That means that all the work to setup and configure Ceph is done automatically done by Crowbar.  Even better, their Nova barclamp (Ceph provides it from their site) will automatically find the Ceph proposal and link the components together!

Ceph has provided excellent directions and videos to support this install.

Interview Transcript about OpenStack, Foundation, Dell & Crowbar

Rafael Knuth, Dell Rafael Knuth (@RafaelKnuth with Dell as EMEA Social Media Community Manager) has set an ambitious objective to interview all 24 of the OpenStack Foundation Board members.  I have the privilege of being the first interview posted (Japanese version!).   Here’s the whole series.

This is not puff interview – We spent an hour together and Rafael did not shy away from asking hard questions like “Why did Dell jump into OpenStack?” and “is VMware a threat to OpenStack?”  Rather than posting the whole transcript (it’s posted here), I’m including the questions (as a teaser) below.   There is some real meat in these answers about OpenStack, Dell, Crowbar and challenges facing the project.

WARNING: My job is engineering, not marketing.  You may find my answers (which are MY OWN ANSWERS) to be more direct that you are expecting.  If you find yourself needing additional circumlocution then simply close your browser and move on.

Some highlights…

Dell’s interest in OpenStack has been very pragmatic. OpenStack is something we really see a market need for.

Rackspace …  runs on OpenStack pretty much off trunk …  That’s exactly the type of vibrant community we want to see.  At the same time, there is a growing community that wants to use OpenStack distributions with support, certifications and they are fine with being 6 months behind OpenStack off trunk. That’s good, and we want that shadow, we want that combination of pure minded early adopters and less sophisticated OpenStack users both working together.

We are working with different partners to bring OpenStack to different customers in different ways. It is confusing. Your question about Dell Crowbar was right … it is targeted at a certain class of users, and I don’t want enterprise customers who expect a lot of shiny chrome and zero touch. That’s not the target by now for Dell Crowbar. We definitely need that sort of magic decoder page to help customers understand our commercial offering.

Questions from Interview:

  1. Dell is one of the very early contributors to OpenStack. Why is Dell engaging in this project?
  2. How does Dell contribute to OpenStack?
  3. Let’s talk a bit about Dell Crowbar, your team’s deployment mechanism for OpenStack.
  4. Let’s talk a bit about OpenStack raw vs. OpenStack distributions.
  5. What are the biggest barriers to OpenStack adoption as of now?
  6. What does a customer specifically need to do when moving from OpenStack Essex to Folsom for example?
  7. My next question is around proof of concept versus production, Rob. How are customers using OpenStack and can you give examples for both scenarios?
  8. I hear very often two different statements: “Open Stack is an alternative to Amazon.” The other is: “OpenStack is an alternative to VMware … maybe, hopefully in two or three years from now.”  Which of both statements is true?
  9. How do you view VMware joining OpenStack. Is it a threat to OpenStack or does VMware add value to the project?
  10. Let us speak about market adoption. Who are the early adopters of OpenStack? And when do you expect OpenStack to hit the tipping point for mass market adoption?
  11. Rob, for all those interested in Dell’s commercial offering around OpenStack … can you give a brief overview?
  12. Dell TechCenter that provides customers an overview over our OpenStack offering: Dell Crowbar as our DevOps tool in its various shapes and forms, OpenStack distros we support … cloud services we build around OpenStack … hardware capabilities optimized for OpenStack.
  13. What are the challenges for the OpenStack Board of Directors?