Back of the Napkin to Presentation in 30 seconds

Posted on August 8, 2014 by Rob H

I wanted to share a handy new process for creating presentations that I’ve been using lately that involves using cocktail napkins, smart phones and Google presentations.

Here’s the Process:

Sketch an idea out with my colleagues on a napkin, whiteboard or notebook during our discussion.
Snap a picture and upload it to my Google Drive from my phone.
Import the picture into my presentation using my phone.
Tell my team that I’ve updated the presentation using Slack on my phone.

Clearly, this is not a finished presentation; however, it does serve to quickly capture critical content from a discussion without disrupting the flow of ideas. It also alerts everyone that we’re adding content and helps frame what that content will be as we polish it. When we immediately position the napkin into a deck, it creates clear action items and reference points for the team.

While blindingly simple, having a quick feedback loop and visual placeholders translates into improved team communication.

SDN’s got Blind Spots! What are these Projects Ignoring? [Guest Post by Scott Jensen]

Posted on July 8, 2014 by Rob H

Scott Jensen returns as a guest poster about SDN! I’m delighted to share his pointed insights that expand on his previous two-part series about NFV and SDN. I especially like his Rumsfeldian “unknowable workloads.”

In my [Scott’s] last post, I talked about why SDN is important in cloud environments; however, I’d like to challenge the underlying assumption that SDN cures all ops problems.

The SDN implementations I have looked at make the following base assumption about the physical network. From the OpenContrail documentation:

The role of the physical underlay network is to provide an “IP fabric” – its responsibility is to provide unicast IP connectivity from any physical device (server, storage device, router, or switch) to any other physical device. An ideal underlay network provides uniform low-latency, non-blocking, high-bandwidth connectivity from any point in the network to any other point in the network.

The basic idea is to build an overlay network on top of the physical network, using a variety of protocols (Netflow, VLAN, VXLAN, MPLS, etc.), to provide the networking infrastructure the applications need. More importantly, the overlay allows the applications to modify this virtual infrastructure to build the constructs they need to operate correctly.
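
To make the overlay idea concrete, here is a minimal Python sketch. It assumes a VXLAN-style virtual network identifier and maps each virtual machine address to the physical host that carries it. The class, field names and addresses are hypothetical illustrations, not taken from OpenContrail or any other SDN product.

```python
from dataclasses import dataclass, field


@dataclass
class OverlayNetwork:
    """A toy virtual network carried over the physical IP fabric."""
    vni: int  # VXLAN-style virtual network identifier (hypothetical value)
    # Maps a virtual machine's address to the physical host that runs it.
    endpoints: dict = field(default_factory=dict)

    def attach(self, vm_ip: str, host_ip: str) -> None:
        """Record that a VM lives on a given physical host."""
        self.endpoints[vm_ip] = host_ip

    def tunnel_for(self, src_vm: str, dst_vm: str) -> tuple:
        """Return (source host, destination host, vni) the underlay must carry."""
        return (self.endpoints[src_vm], self.endpoints[dst_vm], self.vni)


# Hypothetical tenant network with two VMs on two physical hosts.
web_tier = OverlayNetwork(vni=5001)
web_tier.attach("10.0.0.5", "192.168.1.11")
web_tier.attach("10.0.0.6", "192.168.1.12")
print(web_tier.tunnel_for("10.0.0.5", "10.0.0.6"))
```

Note what the sketch leaves out: the overlay only knows which hosts to connect. It knows nothing about the bandwidth or latency of the links between them.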

All well and good; however, what about the Physical Networks?

That is where you will run into bandwidth issues, QoS issues and latency differences; it is where the rubber really meets the road. Ignoring the physical network’s configuration can (and probably will) cause the entire system to perform poorly.

Does it make sense to just assume that you have uniform low latency connectivity to all points in the network? In many cases, it does not. For example:

Accesses to storage arrays have a different traffic pattern than a distributed storage system.
Compute resources which are used to house VMs which are running web applications are different than those which run database applications.
Some applications are especially sensitive to certain networking issues such as available bandwidth, jitter, latency and so forth.
Others perform actions over the network at certain times of the day but do not require network resources for the rest of the day. Classic examples of this are system backups or replication events.

If you truly do not know how the infrastructure you are implementing will be used, you may have no choice but to over-provision the physical network. In a public cloud, where users run whichever applications they wish, it may not be possible to engineer the appropriate traffic patterns.
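
One common way to put a number on how much a physical network is provisioned is the oversubscription ratio of a switch: server-facing bandwidth divided by uplink bandwidth. The Python sketch below is only an illustration; the port counts and speeds are hypothetical, not a recommendation.

```python
def oversubscription_ratio(server_ports: int, server_port_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth on one switch."""
    downstream = server_ports * server_port_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


# Hypothetical top-of-rack switch: 48 x 10 Gb server ports, 4 x 40 Gb uplinks.
ratio = oversubscription_ratio(48, 10, 4, 40)
print(f"{ratio:.1f}:1")  # 480 Gb down / 160 Gb up
```

A ratio of 1:1 is non-blocking. The less you know about the workload, the closer to that you are pushed, which is exactly the cost of an unknowable traffic pattern.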

This unknowable workload is exactly what these types of SDN projects are trying to target!

Yet when designing these systems you often do have a good idea of how they will be utilized, or at least how specific portions of the system will be utilized, and you need to account for that when building up the physical network under the SDN.

It is my belief that SDN applications should not just create an overlay. That is part of the story, but they should also take into account the physical infrastructure and assist with modifying the configuration of the physical devices. This balance achieves the best use of the network for both the applications running in the environment AND the systems they run on or rely upon for their operations.

We need to reframe our thinking about SDN because we cannot keep assuming that network speeds will follow Moore’s Law or that the network is an unlimited resource.

Networking in Cloud Environments, SDN, NFV, and why it matters [part 2 of 2]

Posted on May 5, 2014 by Rob H

Scott Jensen is an Engineering Director and colleague of mine from Dell with deep networking and operations experience. He has first-hand experience deploying OpenStack and Hadoop and has a critical role in defining Dell’s Reference Architectures in those areas. When I saw this writeup about cloud networking (first post), I asked if it would be OK to post it here and share it with you.

GUEST POST 2 OF 2 BY SCOTT JENSEN:

So what is different about Cloud and how does it impact the network?

In a traditional data center this was not all that difficult (relatively). You knew what was going to be running on what system (physically) and could plan your infrastructure accordingly. The majority of the traffic moved in a North/South direction: from outside the infrastructure (the internet, for example) to inside, with responses going back out. You knew that if you had to design a communication channel from an application server to a database server, it could be isolated from the other traffic because the two did not usually reside on the same system.

Virtualization made this more difficult. In this model you are sharing system resources among different applications. From the network’s point of view there are a large number of systems available behind a couple of links. Live Migration puts another wrinkle in the design, as you now have to deal with a specific system moving from one physical server to another. Network Virtualization helps out a lot with this: you can now move virtual ports from one physical server to another to ensure that when a virtual machine moves between physical servers, its network is still available. In many cases you managed these virtual networks the same way you managed your physical network. As a matter of fact, they were designed to emulate the physical as much as possible. The virtual machines still looked a lot like the physical ones they replaced and could be treated in very much the same way from a traffic flow perspective. The traffic is still primarily a North/South pattern.

Cloud, however, is a different ball of wax. Think about the characteristics of the Cattle described in part 1. A cloud application is smaller and purpose built. The majority of its traffic is between VMs, as tiers which were traditionally on the same system or in the same VM are now spread across multiple VMs. Therefore its traffic patterns are primarily East/West. You cannot forget that there is still a North/South pattern, the same as in the other models, which is typically user interaction. The application is stateless so that many copies of itself can run in tandem, allowing it to elastically scale up and down based on need, and as such instances are constantly appearing on and disappearing from the network. As these VMs are spawned they may be right next to each other, on different servers or potentially in different data centers. But it gets even better.
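
A rough way to see the North/South versus East/West split in your own flow data is to check whether both ends of a flow sit inside the data center. The Python sketch below uses the standard `ipaddress` module; the `10.0.0.0/8` prefix is an assumed internal address space, so substitute your own.

```python
import ipaddress

# Assumed address space for the data center; substitute your own prefixes.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(addr: str) -> bool:
    """True if the address falls inside one of the data center prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL_NETS)


def classify_flow(src: str, dst: str) -> str:
    """East/West if both ends are internal, otherwise North/South."""
    if is_internal(src) and is_internal(dst):
        return "east-west"
    return "north-south"


print(classify_flow("10.1.2.3", "10.4.5.6"))     # VM to VM
print(classify_flow("203.0.113.9", "10.1.2.3"))  # outside user to VM
```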

Cloud architectures are typically multi-tenant. This means that multiple customers will utilize this infrastructure and need to be isolated from each other. And of course Clouds are self-service. Users and developers can design, build and deploy whenever they want, including designing the network interconnects that their applications need to function. All of this will cause overlapping IP address domains, multiple virtual networks at both L2 and L3, and requirements for dynamically configuring QoS, load balancers and firewalls. Last in our list of headaches, but not least: cloud systems tend to breed like rabbits or multiply like coat hangers in the closet. There are more and more systems as 10 servers become 40, which become 100, then 1000 and so on.
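
Overlapping IP address domains are easier to reason about with a small example. In the Python sketch below, an address only becomes unique once it is paired with a tenant identifier. The class and the tenant and VM names are hypothetical.

```python
class TenantAddressTable:
    """Toy lookup table where an IP is only unique within one tenant."""

    def __init__(self):
        self._table = {}

    def register(self, tenant: str, ip: str, vm_name: str) -> None:
        key = (tenant, ip)
        if key in self._table:
            # The same address twice inside ONE tenant is a real conflict.
            raise ValueError(f"{ip} already in use by tenant {tenant}")
        self._table[key] = vm_name

    def lookup(self, tenant: str, ip: str) -> str:
        return self._table[(tenant, ip)]


table = TenantAddressTable()
# Two tenants pick the same private address; both are valid.
table.register("tenant-a", "10.0.0.5", "a-web-1")
table.register("tenant-b", "10.0.0.5", "b-db-1")
print(table.lookup("tenant-a", "10.0.0.5"))
print(table.lookup("tenant-b", "10.0.0.5"))
```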

So what is a poor Network Engineer to do?

First get a handle on what this Cloud thing is supposed to be for. If you are one of the lucky ones who can dictate the use of the infrastructure then rock on! Unfortunately, that does not seem to be the way it goes for many. In the case where you just cannot predict how the infrastructure will be used I am reminded of the phrase “there is no replacement for displacement”. Fast links, non-blocking switches and network fabrics are all necessary for the physical network but will not get you there on their own.

Since as a network administrator you cannot predict the traffic patterns, who can? Well, the developer and the application itself. This is what SDN is all about. It provides a programmatic interface to what is called an overlay network: a series of tunnels/flows which can build virtual networks on top of the physical network, giving that pesky application what it was looking for. In some cases you may want to make changes to the physical infrastructure, for example changing the configuration of the firewall, load balancer or other network equipment. SDN vendors are creating plug-ins that can make those types of configurations.

But if this is not good enough for you, there is NFV. The basic idea here is: why have specialized hardware for your core network infrastructure when we can run those functions virtualized as well? Let’s run them in VMs, hook them into the virtual network and use SDN to configure them, and we can now virtualize the routers, load balancers, firewalls and switches. These technologies are very much in a state of flux right now, but they are promising nonetheless. Now if we could just virtualize the monitoring and troubleshooting of these environments I’d be happy.

Ops Validation using Development Tests [3/4 series on Operating Open Source Infrastructure]

Posted on May 5, 2014 by Rob H

This post is the third in a 4 part series about Success factors for Operating Open Source Infrastructure.

In an automated configuration deployment scenario, problems surface very quickly. They prevent deployment and force resolution before progress can be made. Unfortunately, many times this appears to be a failure within the deployment automation. My personal experience has been exactly the opposite: automation creates a “fail fast” environment in which critical issues are discovered and resolved during provisioning instead of sleeping until later.
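
A minimal Python sketch of that “fail fast” behavior: run the validation checks in order and stop at the first one that fails, so the problem has to be resolved before the deployment can continue. The check names are hypothetical stand-ins, not taken from any particular deployment tool.

```python
class ValidationError(Exception):
    """Raised when a deployment check fails."""


def run_checks(checks):
    """Run each (name, callable) pair in order; stop at the first failure."""
    passed = []
    for name, check in checks:
        if not check():
            raise ValidationError(
                f"check failed: {name}; passed so far: {passed}")
        passed.append(name)
    return passed


# Hypothetical checks; real ones would probe DNS, NTP, disks and so on.
checks = [
    ("dns resolves", lambda: True),
    ("ntp in sync", lambda: False),  # simulated failure
    ("disks present", lambda: True),
]

try:
    run_checks(checks)
except ValidationError as err:
    print(err)
```

The third check never runs. That is the point: nothing is built on top of a node that is already known to be broken.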

Our ability to detect and stop until these issues are resolved creates exactly the type of repeatable, successful deployment that is essential to long-term success. When we look at these deployments, the most important success factors are that the deployment is consistent, known and predictable. Our ability to quickly identify and resolve issues that do not match those patterns dramatically improves the long-term stability of the system by creating an environment that has been benchmarked against a known reference.

Benchmarking against a known reference is ultimately the most significant value that we can provide in helping customers bring up complex solutions such as OpenStack and Hadoop. Being successful with these deployments over the long term means that you have established a known configuration, and that you have maintained it in a way that is explainable and reference-able to other places.

Reference Implementation

The concept of a reference implementation provides tremendous value in deployment. Following a pattern that is a reference implementation enables you to compare notes, get help and ultimately upgrade and change the deployment in known, predictable ways. Customers who can follow and implement a vendor’s reference, or the community’s reference implementation, are able to ask for help on the mailing lists, call in for help and work with the community in ways that are consistent and predictable.

Let’s explore what a reference implementation looks like.

In a reference implementation you have a consistent, known state of your physical infrastructure that has been implemented based upon a reference architecture (RA). That implementation follows a known best practice using standard gear in a consistent, known configuration. You can therefore explain your configuration to a community of other developers, or other people who have a similar configuration, and can validate that your problem is not the physical configuration. Fundamentally, everything in a reference implementation is driving towards the elimination of possible failure causes. In this case, we are making sure that the physical infrastructure is not causing problems (getting to a ready state), because other people are using a similar (or identical) physical infrastructure configuration.

The next components of a reference implementation are the underlying software configurations: operating system, management, monitoring, network configuration and IP networking stacks. This is pretty much everything the application is riding on. There are a lot of moving parts and complexity in this scenario, with a high likelihood of causing failures. Implementing and deploying the software stacks in an automated way has enabled us to dramatically reduce the potential for problems caused by misconfiguration. Because the number of permutations of software in the reference stack is so high, it is essential that a successful deployment tightly manages exactly what is deployed, in such a way that operators can identify, name, and compare notes with other deployments.

Achieving Repeatable Deployments

In this case, our reference deployment consists of the exact composition of the operating system, infrastructure tooling, and capabilities for the deployment. By having a reference capability, we can ensure that we have the same:

Operating system
Monitoring
Configuration stacks
Security tooling
Patches
Network stack (including bridge, VLAN and iptables configurations)

Each one of these components is a potential failure point in a deployment. By being able to configure and maintain that configuration automatically, we dramatically increase the opportunities for success by enabling customers to have a consistent configuration between sites.
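
One way to picture keeping a site consistent with the reference is a simple drift report. The Python sketch below assumes each configuration can be flattened into a dictionary. The keys and values are hypothetical placeholders, not real version numbers.

```python
def config_drift(reference: dict, actual: dict) -> dict:
    """Return {key: (expected, found)} for every setting that differs."""
    drift = {}
    for key in sorted(set(reference) | set(actual)):
        expected = reference.get(key)  # None if missing from the reference
        found = actual.get(key)        # None if missing from the site
        if expected != found:
            drift[key] = (expected, found)
    return drift


# Hypothetical reference and site configurations.
reference = {"os": "os-release-A", "monitoring": "agent-1", "patch_level": "7"}
site = {"os": "os-release-A", "monitoring": "agent-1", "patch_level": "8",
        "extra_repo": "enabled"}

for key, (expected, found) in config_drift(reference, site).items():
    print(f"{key}: expected {expected!r}, found {found!r}")
```

An empty report is what “same configuration” means in practice. Anything else is a named, comparable difference.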

Repeatable reference deployments enable customers to compare notes with Dell and with others in the community. They enable us to take what we have learned from one site and apply it to another. For example, if a new patch breaks functionality, then we can quickly determine how that was caused. We can then fix the solution, add in the complementary fix, and deploy it at that one site. If we are aware that 90% of our other sites have exactly the same configuration, it enables those other sites to avoid a similar problem. In this way, having both a pattern and a practice of reference deployment enables the community to absorb or respond to change much more quickly, and be successful with a changing code base. We have found that it is impractical to expect things not to change.
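
Knowing which sites share exactly the same configuration can be sketched by fingerprinting each configuration and grouping the sites by fingerprint. This Python example uses only the standard library; the site names and settings are hypothetical.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(config: dict) -> str:
    """Stable hash of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_sites(site_configs: dict) -> dict:
    """Map each fingerprint to the list of sites that share it."""
    groups = defaultdict(list)
    for site, config in site_configs.items():
        groups[fingerprint(config)].append(site)
    return dict(groups)


# Hypothetical sites; site-1 and site-2 match, site-3 has drifted.
sites = {
    "site-1": {"os": "os-release-A", "patch_level": "7"},
    "site-2": {"patch_level": "7", "os": "os-release-A"},
    "site-3": {"os": "os-release-A", "patch_level": "8"},
}

for digest, members in group_sites(sites).items():
    print(digest[:12], members)
```

If a patch breaks one site, every other site in the same group is a candidate for the same problem and the same fix.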

The only thing that we can do is build resiliency for change into these deployments. Creating an automated and tested referenceable deployment is the best way to cope with change.

Networking in Cloud Environments, SDN, NFV, and why it matters [part 1 of 2]

Posted on May 1, 2014 by Rob H

Guest Post 1 of 2 by Scott Jensen:

With a background in enterprise data center networking and cloud computing, I have many conversations with customers implementing a cloud infrastructure. The design of the networking infrastructure can and should be different from a classic network configuration, and many do not understand why, either due to a lack of knowledge in networking or due to a lack of understanding as to why cloud computing is different from virtualization. Once you have an understanding of both of these areas you can begin to see why emerging technologies such as SDN (Software Defined Networking) and NFV (Network Function Virtualization) begin to address some of the issues that Cloud Computing can cause with your network.

Networking is all about traffic flows. In order to properly design your infrastructure you need to understand where traffic is originating, where it is going and how much traffic will be following a specific route and at what times.
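
Those questions (where from, where to, how much and when) amount to building a traffic matrix. A minimal Python sketch, assuming flow records arrive as simple tuples; the host names and byte counts are made up for illustration.

```python
from collections import defaultdict


def traffic_matrix(flows):
    """Sum bytes per (source, destination, hour of day)."""
    totals = defaultdict(int)
    for src, dst, hour, nbytes in flows:
        totals[(src, dst, hour)] += nbytes
    return dict(totals)


# Hypothetical flow records: (source, destination, hour of day, bytes).
flows = [
    ("web-1", "db-1", 14, 2_000),
    ("web-1", "db-1", 14, 3_000),
    ("db-1", "backup-1", 2, 900_000),  # nightly backup window
]

for (src, dst, hour), total in sorted(traffic_matrix(flows).items()):
    print(f"{src} -> {dst} at {hour:02d}:00: {total} bytes")
```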

There are many differences between Cloud Computing and virtualization. In many cases the people I talk to think of Cloud as virtualization in a different environment. Of course this will work just fine; however, it does not take advantage of the goodness that a Cloud infrastructure can bring. Some of the major differences between Virtualization and Cloud Computing have profound effects on how the network is utilized. This all has to do with the application. That is really what it is all about anyway. Rob Hirschfeld has a great post on the difference between Pets and Cattle which describes this well.

Pets and Cattle as a workload evolution

In typical virtualized infrastructures, the applications have a fairly common pattern. Many people describe these as Pets, and they are managed largely the same as a physical system. They have a name, they are one of a kind, they are cared for, and when they die it can be traumatic (I know, I have been there).

They run on large stateful VMs
They have a lifecycle which is typically very long such as years
The applications themselves are not designed to tolerate failures. Other technologies are brought in to ensure uptime.
The application is scaled up when demands increase. This is done by adding more memory or CPU to the VM.

Cloud applications are different. Some people describe them as cattle and they are treated like cattle in many ways. They do not necessarily have a name and if one dies it is sad but not a really big deal. We should probably figure out what killed it but life goes on.

They run on smaller stateless VMs
They have a lifecycle measured in hours or months. Sometimes even less than an hour.
The application is designed to expect failures
The application scales out by increasing the number of running instances when the demand increases.
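
The scale-out behavior of cattle can be sketched as a simple calculation: given the current demand and what one small instance can handle, how many instances should be running? The request rates in this Python example are hypothetical.

```python
import math


def instances_needed(demand_rps: float, per_instance_rps: float,
                     minimum: int = 1) -> int:
    """Number of identical stateless instances needed to serve the demand."""
    return max(minimum, math.ceil(demand_rps / per_instance_rps))


# Hypothetical load: each instance handles 200 requests per second.
for demand in (0, 150, 950, 4000):
    print(demand, "rps ->", instances_needed(demand, 200), "instances")
```

A pet would instead get more memory or CPU added to the one VM. The cattle approach means instances come and go with demand, which is what makes the network side harder to plan.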

In his follow-up post next week, Scott discusses how this impacts the network and how SDN and NFV promise to help.

Rob Hirschfeld

On Computing, Containers, Cloud & Tech Culture

Category Archives: Scott Jensen