I’ve been relatively quiet about the OpenCrowbar Drill release, and that’s not a fair reflection of the capability and maturity now in the code base; however, it really just sets the stage for truly ground-breaking ops automation work in the next release (“Epoxy”).
The primary focus for this release was proving our functional operations architectural pattern against a wide range of workloads and that is exactly what the RackN team has been doing with Ceph, Docker, Kubernetes, CloudFoundry and StackEngine workloads.
In addition to workloads, we put the platform through its paces in real ops environments at scale. That resulted in even richer network configurations and options plus performance and tuning. The RackN team continues to adapt OpenCrowbar to match real-world ops.
One critical lesson you’ll see more of in the Epoxy release: OpenCrowbar and the team at RackN will keep finding ways to adapt to the ops environment. We believe that tools should adapt to their environments: we’ve encountered some pretty extreme quirks, and our philosophy is to embrace, not force, change.
Cloud Ops is a brutal business: operators are expected to maintain a stable and robust operating environment while also embracing waves of disruptive changes using unproven technologies. While we want to promote these promising new technologies, the unicorns, operators still have to keep the lights on; consequently, most companies turn to outside experts or internal strike teams to get this new stuff working.
Our experience is that doing an on-site deployment by professional services (PS) is often much harder than expected. Why? Because of inherent mission conflict. The PS paratrooper team sent to accomplish the “install Foo!” mission is at odds with the operators’ maintain-and-defend mission. Where the short-term team is willing to blast open a wall for access, the long-term team is highly averse to collateral damage. Both teams are faced with an impossible situation.
I’ve been promoting Open Ops around a common platform (obviously, OpenCrowbar in my opinion) as a way to address cross-site standardization.
Why would a physical automation standard help? Generally, the pros expect to arrive with everything “ready state,” including the OS installed and all the networking ready. Unfortunately, there’s a significant gap between an installed OS and … installs are always a journey of discovery as the teams figure out the real requirements.
Here are some questions that we’ve put together to gauge if the installs are really going the way you think:
How often is the customer site ready for deployment? If not, how long does that take to correct?
How far into a deployment do you get before an error in deployment is detected? How often is that error repeated across all the systems?
How often is an “error” actually an operational requirement at the site that cannot be changed without executive approval and weeks of work?
How often are issues found after deployment is started that cause an install restart?
Can the deployment be recreated on another site? Can the install be recreated in a test environment for troubleshooting?
How often are systems hand- or custom-updated as part of a routine installation?
How often are developers needed to troubleshoot issues that end up being configuration related?
How often are back-doors left in place to help with troubleshooting?
What is the upgrade process like? Is the state of the “as left” system sufficiently documented to plan an upgrade?
What happens if there’s a major OS upgrade or patch required that impacts the installed application?
The RackN team has started designing reference architectures for containers on metal (discussed on TheNewStack.io) with the hope of finding a hardware design that is cost- and performance-optimized for containers instead of simply repurposing premium virtualized cloud infrastructure. That discussion turned up something unexpected…
This container alternative likely escapes the notice of many because it requires hardware capabilities that are not exposed (or only partially exposed) inside cloud virtual machines; however, it could be a very compelling story for operators looking for containers on metal.
Here’s my basic understanding: these technologies offer container-like light-weight & elastic behavior with the isolation provided by virtual machines. This is possible because they use CPU capabilities to isolate environments.
7/3 Update: Feedback about this post has largely been “making it easier for VMs to run docker automatically is not interesting.” What’s your take on it?
Like any scale install, once you’ve got a solid foundation, the actual installation goes pretty quickly. In Kubernetes’ case, that means creating strong networking and etcd configuration.
Here’s a 30 minute video showing the complete process from O/S install to working Kubernetes:
Here are the details:
Clustered etcd – distributed key store
etcd is the central data service that maintains the state for the Kubernetes deployment. The strength of the installation rests on the correctness of etcd. The workload builds an etcd cluster and synchronizes all the instances as nodes are added.
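To make that concrete, here’s a minimal sketch (in Python, not the actual OpenCrowbar workload code) of the per-member flags an etcd cluster build has to keep synchronized as nodes are added; the node names and addresses are hypothetical:

```python
# Sketch: render the etcd flags for each cluster member. Every member must agree
# on the same --initial-cluster list, which is exactly what the workload keeps
# synchronized as nodes join. Hosts and IPs below are hypothetical.
members = {"node-1": "10.0.0.11", "node-2": "10.0.0.12", "node-3": "10.0.0.13"}

initial_cluster = ",".join(
    f"{name}=http://{ip}:2380" for name, ip in sorted(members.items())
)

def etcd_flags(name: str, ip: str) -> list:
    """Flags for one member; --initial-cluster is identical on every member."""
    return [
        f"--name={name}",
        f"--initial-advertise-peer-urls=http://{ip}:2380",
        f"--listen-peer-urls=http://{ip}:2380",
        f"--listen-client-urls=http://{ip}:2379,http://127.0.0.1:2379",
        f"--advertise-client-urls=http://{ip}:2379",
        f"--initial-cluster={initial_cluster}",
        "--initial-cluster-state=new",
    ]

for name, ip in members.items():
    print(name, " ".join(etcd_flags(name, ip)))
```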
Networking with Flannel and Proxy
Flannel is the default overlay network for Kubernetes that handles IP assignment and inter-container communication with UDP encapsulation. The workload configures Flannel for networking, with etcd as the backing store.
An important part of the overall networking setup is the configuration of a proxy so that the nodes can get external access for Docker image repos.
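As a rough sketch of the wiring involved (hypothetical addresses; Flannel’s classic mode reads its subnet configuration from a well-known etcd key), the workload effectively seeds a network definition and proxy settings like this:

```python
# Sketch: seed Flannel's network definition in etcd (v2 keys API) and collect the
# proxy variables Docker needs to reach external image repos. Addresses are
# hypothetical placeholders.
import json
import requests

ETCD = "http://10.0.0.11:2379"
flannel_config = {"Network": "10.244.0.0/16", "Backend": {"Type": "udp"}}

# Flannel (etcd mode) reads its subnet config from this key.
resp = requests.put(
    f"{ETCD}/v2/keys/coreos.com/network/config",
    data={"value": json.dumps(flannel_config)},
)
resp.raise_for_status()

# Nodes behind a firewall also need proxy settings so Docker can pull images.
proxy_env = {
    "HTTP_PROXY": "http://proxy.example.local:3128",   # hypothetical proxy
    "HTTPS_PROXY": "http://proxy.example.local:3128",
    "NO_PROXY": "127.0.0.1,localhost,.example.local",
}
print(proxy_env)
```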
Docker Setup
We install the latest Docker on the system. That may not sound very exciting; however, Docker iterates faster than most Linux distributions package it, so it’s important that we keep you current.
Master & Minion Kubernetes Nodes
Using etcd as a backend, the workload sets up one (or more) master nodes with the API server and other master services. When the minions are configured, they are pointed to the master API server(s). You get to choose how many masters and which systems become masters. If you did not choose correctly, it’s easy to rinse and repeat.
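For illustration only, here’s a minimal sketch of the “point the minions at the masters” step: it renders a kubeconfig-style file aimed at a shared API endpoint. The endpoint name is a hypothetical placeholder, and the real workload templates this per node.

```python
# Sketch: generate a kubeconfig that points a minion's kubelet/kube-proxy at the
# master API endpoint. The DNS name below is a hypothetical placeholder and the
# insecure port is used only to keep the example short.
import yaml  # PyYAML

def minion_kubeconfig(api_endpoint: str) -> str:
    config = {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{"name": "local", "cluster": {"server": api_endpoint}}],
        "contexts": [{"name": "local", "context": {"cluster": "local", "user": ""}}],
        "current-context": "local",
        "users": [],
    }
    return yaml.safe_dump(config, default_flow_style=False)

print(minion_kubeconfig("http://k8s-masters.example.local:8080"))
```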
Highly Available using DNS Round Robin
As the workload configures API servers, it also adds them to a DNS round robin pool (made possible by [new DNS integrations]). Minions are configured to use the shared DNS name so that they automatically round-robin all the available API servers. This ensures both load balancing and high availability. The pool is automatically updated when you add or remove servers.
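To make the round-robin behavior concrete, here’s a small check (with a hypothetical pool name) that shows how a minion resolving the shared name sees every registered API server:

```python
# Sketch: resolve the shared round-robin DNS name and list the pool of API
# servers behind it. The name is a hypothetical placeholder.
import socket

POOL_NAME = "k8s-masters.example.local"

def api_server_pool(name: str, port: int = 8080) -> list:
    """Return every A record behind the round-robin name as an API endpoint."""
    _hostname, _aliases, addresses = socket.gethostbyname_ex(name)
    return [f"http://{ip}:{port}" for ip in addresses]

print(api_server_pool(POOL_NAME))
```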
Installed on Real Metal
It’s worth including that we’ve done cluster deployments of 20 physical nodes (with 80 in process!). Since the OpenCrowbar architecture abstracts the vendor hardware, the configuration is multi-vendor and heterogeneous. That means that this workload (and our others) delivers tangible scale implementations quickly and reliably.
Future Work for Advanced Networking
Flannel is a really basic SDN. We’d like to see additional networking integrations, including OpenContrail as per Pedro Marques’ work.
At this time, we are not securing communication with etcd. That requires key management, which is a more advanced topic.
Why is RackN building this? We are a physical ops automation company.
We are seeking to advance the state of data center operations by helping get complex scale platforms operationalized. We want to work with the relevant communities to deliver repeatable best practices around next-generation platforms like Kubernetes. Our speciality is in creating a general environment for ops success: we work with partners who are experts on using the platforms.
We want to engage with potential users before we turn this into an open community project; however, we’ve chosen to make the code public. Please get us involved (community forum)! You’ll need a working OpenCrowbar or RackN Enterprise install as a pre-req, and we want to help you be successful.
OpenCrowbar Drill release (will likely become v2.3) is wrapping up in the next few weeks and it’s been amazing to watch the RackN team validate our designs by pumping out workloads and API integrations (list below).
I’ve posted about the acceleration from having a ready state operations base and we’re proving that out. Having an automated platform is critical for metal deployment because there is substantial tuning and iteration needed to complete installations in the field.
Getting software setup once is not a victory: that’s called a snowflake
Real success is tearing it down and having it work the second, third and nth times. That’s because scale ops is not about being able to install platforms. It’s about operationalizing them.
Integration: the difference between install and operationalization.
When we build a workload, we are able to build up the environment one layer at a time. For OpenCrowbar, that starts with a hardware inventory and works up through RAID/BIOS and O/S configuration. After the OS is ready, we are able to connect into the operational environment (SSH keys, NTP, DNS, Proxy, etc.) and build real multi-switch/layer-2 topologies. Next we coordinate multi-node actions like creating Ceph, Consul and etcd clusters so that the install is demonstrably correct across nodes and repeatable at every stage. If something has to change, you can repeat the whole install or just the impacted layers. That is what I consider integrated operation.
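As a purely conceptual sketch (not OpenCrowbar’s actual API; the layer names are illustrative), the idea boils down to ordered, idempotent layers where you can converge the whole stack or only the layers from the point of change upward:

```python
# Conceptual sketch: ordered, idempotent layers. Re-run everything, or only the
# impacted layer and the layers that build on top of it. Names are illustrative.
from typing import Optional

LAYERS = [
    ("hardware-inventory", lambda: print("discover and inventory nodes")),
    ("raid-bios",          lambda: print("apply RAID/BIOS profiles")),
    ("os-install",         lambda: print("install the operating system")),
    ("ops-environment",    lambda: print("SSH keys, NTP, DNS, proxy")),
    ("cluster-actions",    lambda: print("build Ceph/Consul/etcd clusters")),
]

def converge(from_layer: Optional[str] = None) -> None:
    """Run every layer, or only the impacted layer and everything above it."""
    started = from_layer is None
    for name, run in LAYERS:
        started = started or name == from_layer
        if started:
            run()

converge()                          # full, repeatable install
converge(from_layer="os-install")   # repeat only the impacted layers upward
```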
It’s not just automating a complex install. We design to be repeatable site-to-site.
Here’s the list of workloads we’ve built on OpenCrowbar and for RackN in the last few weeks:
Ceph (OpenCrowbar) with advanced hardware optimization and networking that synchronizes changes in monitors.
StackEngine (RackN) builds a multi-master cluster and connects all systems together.
Kubernetes (RackN) that includes automatic highly available DNS configuration, flannel networking and etcd cluster building.
CloudFoundry on Metal via BOSH (RackN) uses pools of hardware that are lifecycle-managed by OpenCrowbar, including powering off systems that are idle.
I don’t count the existing RackN OpenStack via Packstack (RackN) workload because it does not directly leverage OpenCrowbar clustering or networking. It could if someone wanted to help build it.
And… we also added a major DNS automation feature and updated the network mapping logic to work in environments where Crowbar does not manage the administrative networks (like inside clouds). We’ve also been integrating deeply with Hashicorp Consul to allow true “ops service discovery.”
After writing pages of notes about the impact of Docker, microservice architectures, mainstreaming of Ops Automation, software defined networking, exponential data growth and the explosion of alternative hardware architecture, I realized that it all boils down to the death of cloud as we know it.
OK, we’re not killing cloud per se this year. It’s more that we’ve put 10 pounds of cloud into a 5 pound bag so it’s just not working in 2015 to call it cloud.
Cloud was happily misunderstood back in 2012 as virtualized infrastructure wrapped in an API beside some platform services (like object storage).
That illusion will be shattered in 2015 as we fully digest the extent of the beautiful and complex mess that we’ve created in the search for better scale economics and faster delivery pipelines. 2015 is going to cause a lot of indigestion for CIOs, analysts and wandering technology executives. No one can pick the winners with Decisive Leadership™ alone because there are simply too many possible right ways to solve problems.
Here’s my list of the seven cloud disrupting technologies and frameworks that will gain even greater momentum in 2015:
Docker – I think that Docker is the face of a larger disruption around containers and packaging. I’m sure Docker alone is not the whole story. There are a fleet of related technologies and Docker replacements; however, there’s no doubt that it’s leading a timely rethinking of application life-cycle delivery.
New languages and frameworks – it’s not just the rapid maturity of Node.js and Go, but the frameworks and services that we’re building (like Cloud Foundry or Apache Spark) that change the way we use traditional languages.
Microservice architectures – this is more than containers, it’s really Functional Programming for Ops (aka FuncOps) that’s a new generation of service oriented architecture that is being empowered by container orchestration systems (like Brooklyn or Fleet). Using microservices well seems to redefine how we use traditional cloud.
Mainstreaming of Ops Automation – We’re past “if DevOps” and into the how. Ops automation, not cloud, is the real puppies vs cattle battle ground. As IT creates automation to better use clouds, we create application portability that makes cloud disappear. This freedom translates into new choices (like PaaS, containers or hardware) for operators.
Software defined networking – SDN means different things but the impacts are all the same: we are automating networking and integrating it into our deployments. The days of networking and compute silos are ending and that’s going to change how we think about cloud and the supporting infrastructure.
Exponential data growth – you cannot build applications or infrastructure without considering how your storage needs will grow as we absorb more data streams and internet of things sources.
Explosion of alternative hardware architecture – In 2010, infrastructure was basically pizza box or blade from a handful of vendors. Today, I’m seeing a rising tide of alternative architectures, including ARM, converged and storage-focused designs, from an increasing cadre of sources including vendors sharing open designs (OCP). With improved automation, these new “non-cloud” options become part of the dynamic infrastructure spectrum.
Today these seven items create complexity and confusion as we work to balance the new concepts and technologies. I can see a path forward that redefines IT to be both more flexible and dynamic while also being stable and performant.
Want more 2015 predictions? Here’s my OpenStack EOY post about limiting/expanding the project scope.
This quick advice for preparing for a college interview is also useful for any interview: identify three key strengths and three activities, then prepare short, insightful stories that show your strengths in each activity. Stories are the strongest way to convey information.
I’ve been doing engineering college interviews since 2013 for my Alma mater, Duke University. I love meeting the upcoming generation of engineers and seeing how their educational experiences will shape their future careers. Sadly, I also find that few students are really prepared to showcase themselves well in these interviews. Since it makes my job simpler if you are prepared, I’m going to post my recommendation for future interviews!
It does not take much to prepare for a college interview: you mainly need to be able to tell some short, detailed stories from your experiences that highlight your strengths.
In my experience, the best interviewees are good at telling short and specific stories that highlight their experiences and strengths. It’s not that they have better experiences; they are just better prepared to showcase them. Being prepared makes you more confident and comfortable, which then helps you control how the interview goes and ensures that you leave the right impression.
1/9/15 Note: Control the interview? Yes! You should be planning to lead the interviewer to your strengths. Don’t passively expect them to dig that information out of you. It’s a two-way conversation, not an interrogation.
Here’s how it works:
Identify three activities that you are passionate about. They do not have to represent the majority of your effort. Select ones that define who you are, or caused you to grow in some way. They could be general items like “reading” or very specific like “summer camp 2016.” You need to get excited when you talk about these items. Put these on the rows/y-axis of a 3×3 grid (see below).
Identify three attributes that describe you (you may want help from friends or parents here). These words should be enough to give a fast snapshot of who you are. In the example below, the person would be something like “an adventure seeking leader who values standing out as an individual.” Put these attributes on the columns/x-axis of your grid as I’ve shown below.
Come up with nine short stories (3-6 sentences each!) for the intersections on the grid where you demonstrated the key attribute during the activity. They cannot just be statements – you must have stories because they provide easy-to-remember examples for your interview. If you don’t have a story for an intersection, then talk about how you plan to work on this in the future.
Note: This might feel repetitive when you construct your grid, but this technique works exceptionally well during an hour-long interview. You should repeat yourself because you need to reinforce your strengths and leave the interviewer with a sure sense of who you are.
Sample Grid – Click to Enlarge
Remember: An admissions, alumni or faculty interview is all about making a strong impression about who you are and, more importantly, what you will bring to the university.
Having a concrete set of experiences and attributes makes sure that you reinforce your strengths. By showing them in stories, you will create a much richer picture about who you are than if you simply assert statements about yourself. Remember the old adage of “show, don’t tell.”
Don’t use this grid as the only basis for your interview! It should be a foundation that you can come back to during your conversations with college representatives. These are your key discussion points – let them help you round out the dialog.
Good luck!
PS: Google your interviewer! I expect the candidates to know me before they meet me. It is perfectly normal and you’d be crazy to not take advantage of that.
Server management interfaces stink. They are inconsistent both between vendors and within their own product suites. Ideally, vendors would agree on a single API; however, it’s not clear if the diversity is a product of competition or actual platform variation. Likely, it’s both.
What is Redfish? It’s a REST API for server configuration that aims to replace both IPMI and vendor specific server interfaces (like WSMAN). Here’s the official text from RedfishSpecification.org.
Redfish is a modern intelligent [server] manageability interface and lightweight data model specification that is scalable, discoverable and extensible. Redfish is suitable for a multitude of end-users, from the datacenter operator to an enterprise management console.
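For a feel of the model, here’s a hedged sketch of walking a Redfish service root with Python; the BMC address and credentials are hypothetical placeholders, and real implementations typically require proper TLS and session authentication:

```python
# Sketch: query a Redfish service root and list the systems it exposes.
# The BMC address and credentials are hypothetical placeholders.
import requests

BMC = "https://10.0.0.50"
AUTH = ("admin", "password")   # placeholder credentials

root = requests.get(f"{BMC}/redfish/v1/", auth=AUTH, verify=False).json()
systems_url = root["Systems"]["@odata.id"]   # e.g. /redfish/v1/Systems

collection = requests.get(f"{BMC}{systems_url}", auth=AUTH, verify=False).json()
for member in collection.get("Members", []):
    system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
    print(system.get("Name"), system.get("PowerState"))
```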
I think that it’s great to see vendors trying to get on the same page and I’m optimistic that we could get something better than IPMI (that’s a very low bar). However, I don’t expect that vendors can converge to a single API; it’s just not practical due to release times and pressures to expose special features. I think the divergence in APIs is due both to competitive pressures and to real variance between platforms.
Even if we manage a grand server management unification, the problem of interface heterogeneity has a long legacy tail.
In the best case reality, we’re going from N versions to N+1 (and likely N*2) versions because the legacy gear is still around for a long time. Adding Redfish means API sprawl is going to get worse until it gets back to being about the same as it is now.
Putting pessimism aside, the sprawl problem is severe enough that it’s worth supporting Redfish on the hope that it makes things better.
That’s easy to say, but expensive to do. If I were making hardware (I left Dell in Oct 2014), I’d consider it an expensive investment for an uncertain return. Even so, several major hardware players are stepping forward to help standardize. I think Redfish would have good ROI for smaller vendors looking to displace a major player, since they can ride on the standard.
Redfish is GREAT NEWS for me since RackN/Crowbar provides hardware abstraction and heterogeneous interface support. More API variation makes my work more valuable.
One final note: if Redfish improves hardware security in a real way, then it could be a game changer; however, embedded firmware web servers can be tricky to secure and patch compared to larger application-focused software stacks. This is one area where I’m hoping to see a lot of vendor collaboration! [note: this should be its own subject – the security issue is more than API; it’s about system-wide configuration. Stay tuned!]
OpenStack has grown dramatically in many ways but we have failed to integrate development, operations and business communities in a balanced way.
My most urgent observation from Paris is that these three critical parts of the community are having vastly different dialogs about OpenStack.
At the Conference, business people were talking about core, stability and utility while the developers were talking about features, reorganizing and expanding projects. The operators, unfortunately segregated in a different location, were trying to figure out how to share best practices and tools.
Much of this structural divergence was intentional and should be (re)evaluated as we grow.
OpenStack events are split into distinct focus areas: the conference for business people, the summit for developers and specialized days for operators. While this design serves a purpose, the community needs to be taking extra steps to ensure communication. Without that communication, corporate sponsors and users may find it easier to solve problems inside their walls than outside in the community.
The risk is clear: vendors may find it easier to work on a fork where they have business and operational control than work within the community.
Inside the community, we are working to help resolve this challenge with several parallel efforts. As a community member, I challenge you to get involved in these efforts to ensure the project balances dev, biz and ops priorities. As a board member, I feel it’s a leadership challenge to make sure these efforts converge and that’s one of the reasons I’ve been working on several of these efforts:
OpenStack Project Managers (was Hidden Influencers) across companies in the ecosystem are getting organized into their own team. Since these managers effectively direct the majority of OpenStack developers, this group will allow cross-company coordination of development priorities.
DefCore Committee works to define a smaller subset of the overall OpenStack Project that will be required for vendors using the OpenStack trademark and logo. This helps the business community focus on interoperability and stability.
The Technical Committee (TC) led “Big Tent” concept aligns with DefCore work and attempts to create a stable base platform while making it easier for new projects to enter the ecosystem. I’ve got a lot to say about this, but frankly, without safeguards, this scares people in the ops and business communities.
The lack of an operations “ready state” baseline keeps the community from being able to share best practices – establishing one has become a pressing need. I’d like to suggest OpenCrowbar, as a project outside of OpenStack, as a way to help provide an ops-neutral common starting point. Having the OpenStack developer community attempting to create an installer using OpenStack has proven a significant distraction and only further distances operators from the community.
We need to get past seeing the project primarily as a technology platform. Infrastructure software has to deliver value as an operational tool for enterprises. For OpenStack to thrive, we must make sure the needs of all constituents (Dev, Biz, Ops) are being addressed.
During this blog series, we’ve explored how important culture is in the work place. The high tech areas are especially sensitive because they disproportionately embrace the millennial culture which often causes conflicts.
Our world has changed, driven by technology, new thinking, and new methodologies, yet we may be using 20th-century management techniques on 21st-century customers and workers. There is an old business axiom that states, “If you can’t measure it, you can’t manage it.” Yet how much of our process, interactions, successes, and failures never winds up on a spreadsheet while still impacting it?
Customers don’t leave bad companies; they leave companies that miss the mark when it comes to customer engagement. To better serve our customers we need to understand and adapt to the psychology of a new customer … one who has been trained to work as a Digital Native.
What would that look like? Tech people who interact with patience, collaboration, deep knowledge, and an openness to input, adapting to a customer’s needs in real-time. Wouldn’t that create a relationship that is second to none and unbreakable? Wouldn’t that be a leg up on the competition?
By understanding that new business culture has been influenced by the gaming experience, we have a deeper understanding of what is important to our customer base. And like a video game, if you cling to hierarchy, you lose. If you get caught up in linear time management, you lose. If you cling to bottlenecks and tradition you lose.
Three key takeaways: speed, adaptation, and collaboration
Those three words sum up today’s business environment. By now, you should not be surprised that those drivers are skills honed in video games.
We’ve explored the radically different ways that Digital Natives approach business opportunities. As the emerging leaders of the technological world, we must shift our operations to be more open, collaborative, iterative, and experience based.
Rob challenges you to get involved in his and other collaborative open source projects. Brad challenges you to try new leadership styles that engage with the Cloud Generation. Together, we challenge our entire industry to embrace a new paradigm that redefines how we interact and innovate. We may as well embrace it, because it is the paradigm that we’ve already trained the rising generation of workers to intuitively understand.
What’s next?
Brad and Rob collaborated on this series with the idea of extending the concepts beyond a discussion of the “digital divide” and really looking at how culture impacts business leadership. Lately, we’ve witnessed that the digital divide is not about your birthday alone. We’ve seen that age alone does not drive all the cultural differences we’ve described here. Our next posts will reflect the foundations for the different ways that we’ve seen people respond to each other, with a focus on answering “can digital age workers deliver?”