Understanding application resiliency
When applied to applications, the term "resiliency" refers to how well the application can recover when a vital component or underlying resource goes missing. In virtualization environments, a key consideration is what happens to a business-critical application when the host on which its virtual machine is running suddenly becomes unavailable. What happens to the transactions the application is currently processing? What will a customer or client who is using the application experience when this happens?
These are no trivial matters for a business, and because of this the big players in the virtualization market like Microsoft and VMware have developed various solutions to ensure that virtualized applications can recover quickly and gracefully from host failures without incurring data loss.
But before we examine several of these solutions, let's step back from the technologies for a moment and consider what drives (or should drive) decision-making with regard to virtualization solutions, namely the business requirements of your organization.
Aligning technology with business requirements
To understand the kinds of decisions that may be involved when selecting a technology or solution to meet a business need, we'll consider a simple analogy around a familiar tool: the screwdriver.
Consider a business that needs a screwdriver to keep their operations running smoothly. Company A looks at what's available on Amazon and they decide to purchase Screwdriver Number One (see http://amzn.to/1fjGALC):
Figure 1: A tool that can help ensure operational resiliency for a mature business.
Did they make the right decision? If the company is a mature business that is only concerned with continuing to crank out more of their existing product at the rate they've been doing for many years, then the answer is yes. Not every company is looking for ways to expand or capitalize on the latest market trends. Does a business that makes toothpicks need a private cloud solution? Maybe they only need a Hyper-V host running some custom code on a Windows XP virtual machine disconnected from the Internet for controlling their toothpick making machine.
Then there's Company B, which expects their market share to erode in the near future. As a result, they want to diversify into new products. Since this means they might soon have to purchase some new industrial equipment, they want to ensure they'll have the necessary tools to maintain such new equipment, whatever it might be. So they purchase Screwdriver Number Two instead (see http://amzn.to/1jvfwjP):
Figure 2: A tool that can help ensure operational resiliency for a business thinking of diversifying.
Not only does the above screwdriver include a variety of standard bits, these bits are also attached so they're easier to use and won't get lost. And it has a built-in flashlight so you can fix your machinery when the lights go off! Surely any business that considers diversification a key to their long-term survival will want to buy this screwdriver. But a business that's merely interested in maintaining their current market share with their existing product line would probably consider it a waste of their money to purchase a high-tech solution like this.
Then there are the nimble ones, the agile ones, high tech companies that need to grow fast so they can jump on trends and dominate emerging markets. Funded by angel investors, they're ready and willing to throw cash at a problem to ensure their operations can scale quickly and handle anything that might arise. Company C is an example of such a business, and they have no qualms ordering Screwdriver Number Three (see http://amzn.to/1eQ8QZm):
Figure 3: A tool that can help ensure operational resiliency for a business that needs agility.
Powerful and reliable, with hex bit tips for easy changeover, this highly-rated tool may cost a lot but it will help ensure the resiliency of your industrial line of operations in any situation.
Unfortunately the prevalent line of thinking today is that every business needs this kind of tool, and that simply isn't true. Instead, the tool you buy should align with the needs of your business, and different kinds of businesses have different kinds of needs.
The same kind of thinking applies today to ensuring application resiliency in virtualization environments. There are several different solutions available from different virtualization vendors, and the solution you choose for your organization should be based on the specific needs of your business. Let's briefly look at the pros and cons of two such solutions: VMware Fault Tolerance and Hyper-V guest clustering.
VMware Fault Tolerance
VMware Fault Tolerance (FT) can help ensure application resiliency by keeping a live shadow copy of a virtual machine running in lockstep with the original (primary) virtual machine. VMware FT was introduced in version 4.0 of VMware vSphere and requires that VMware High Availability (HA) be configured for the hosts involved. With FT, any file operations and other processing that happen on the primary virtual machine also take place on the secondary (shadow) virtual machine, which keeps the two virtual machines in an identical state at all times. If something then causes the primary virtual machine's host to go down, the secondary virtual machine (which runs on a different host) can immediately pick up where the primary left off and continue whatever operations were being executed, with no interruption and no loss of data. The resulting solution (HA + FT) can provide continuous availability with essentially zero downtime for mission-critical virtualized applications and services.
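The lockstep idea described above can be illustrated with a small toy model. This is only a conceptual sketch in Python, not how VMware FT is implemented: the names ToyVM and LockstepPair are hypothetical, and a "virtual machine" here is nothing more than a dictionary of state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a virtual machine: its state is a plain dict."""
    name: str
    state: dict = field(default_factory=dict)

    def apply(self, op):
        # An "operation" is just a (key, value) write in this toy model.
        key, value = op
        self.state[key] = value


class LockstepPair:
    """Toy model: every operation is applied to both the primary and the shadow."""

    def __init__(self):
        self.primary = ToyVM("primary")
        self.secondary = ToyVM("secondary")
        self.primary_alive = True

    def execute(self, op):
        # While the primary is up, both copies receive the same operation,
        # so their state stays identical.
        if self.primary_alive:
            self.primary.apply(op)
        self.secondary.apply(op)

    def fail_primary(self):
        # Simulate the primary's host going down.
        self.primary_alive = False

    @property
    def active(self):
        # The shadow takes over as soon as the primary is gone.
        return self.primary if self.primary_alive else self.secondary


pair = LockstepPair()
pair.execute(("order-1", "paid"))
pair.execute(("order-2", "pending"))
pair.fail_primary()
pair.execute(("order-2", "paid"))
print(pair.active.name, pair.active.state)
```

Because the shadow has seen every operation the primary saw, it already holds the full state when it takes over, which is the property that makes the failover appear uninterrupted in this model.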
The advantages of VMware FT include the following:
- Simplicity - The concept of having a live shadow of an application running in the background and ready to take over in the event of a disruption is simple and appealing.
- Easy to implement - Provided you already have VMware HA set up in your environment, configuring FT takes only a couple of clicks of the mouse.
- Broad application support - Basically any application on any operating system that can be virtualized using VMware ESX can be made more resilient by enabling VMware FT. Specifically, FT can be used to increase the resiliency even of applications that aren't cluster-aware.
But there are some significant limitations if you're thinking of using FT:
- Processing limitations - A virtual machine that has FT enabled can only have one virtual CPU. This limits the kinds of applications that can leverage FT for providing application resiliency. For example, you probably won't be able to use it for running a large database application since it might be processor-bound with only one vCPU. Some other advanced processor features like Hyper-Threading and SLAT are also not supported by FT.
- Virtual networking limitations - Single-root I/O virtualization (SR-IOV) allows supported network adapters to be directly assigned to a virtual machine to maximize network throughput while minimizing network latency and the CPU overhead required for processing network traffic. The result is increased virtual networking I/O for the virtual machine. While high-end SKUs of vSphere 5.1 support SR-IOV, it's not supported in conjunction with FT. This can limit network throughput (and therefore transactions per second) for bandwidth-hungry virtualized applications. NIC passthrough is also not supported in conjunction with FT.
- Other functional limitations - FT cannot be used in conjunction with the following VMware features: snapshots, Storage vMotion, linked clones, VMware Consolidated Backup, Virtual SAN, thin provisioning of storage, N_Port ID Virtualization (NPIV), and hot-pluggable devices. You can see the complete list of limitations here.
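One way to make the limitations above concrete is to encode a few of them as a pre-flight checklist. The sketch below is a hypothetical Python helper, not a VMware API: the VMConfig fields and the ft_blockers function are invented for illustration, and the rules cover only some of the limitations listed in this article for the vSphere versions it discusses.

```python
from dataclasses import dataclass


@dataclass
class VMConfig:
    """Hypothetical description of a virtual machine's configuration."""
    vcpus: int = 1
    uses_sriov: bool = False
    uses_nic_passthrough: bool = False
    has_snapshots: bool = False
    thin_provisioned: bool = False
    uses_npiv: bool = False
    has_hot_plug_devices: bool = False


def ft_blockers(cfg):
    """Return the listed FT limitations that this configuration runs into."""
    blockers = []
    if cfg.vcpus > 1:
        blockers.append("more than one vCPU")
    if cfg.uses_sriov:
        blockers.append("SR-IOV")
    if cfg.uses_nic_passthrough:
        blockers.append("NIC passthrough")
    if cfg.has_snapshots:
        blockers.append("snapshots")
    if cfg.thin_provisioned:
        blockers.append("thin provisioning")
    if cfg.uses_npiv:
        blockers.append("NPIV")
    if cfg.has_hot_plug_devices:
        blockers.append("hot-pluggable devices")
    return blockers


# A large database VM (illustrative values only) versus a small legacy app VM.
database_vm = VMConfig(vcpus=4, uses_sriov=True)
legacy_vm = VMConfig()
print(ft_blockers(database_vm))
print(ft_blockers(legacy_vm))
```

An empty list means the configuration avoids the limitations modeled here; it does not guarantee that FT is supported, since the complete list of limitations is longer.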
But do these limitations mean that FT is a poor choice for ensuring application resiliency? Certainly not, provided that one or more of the following is true:
- Your mission critical business application can perform well under the limitations described above.
- Your application can't be made more resilient using another vendor's solution.
- You already have a VMware HA infrastructure in place and know how to use it.
Hyper-V guest clustering
Guest clustering on Microsoft's Windows Server platform involves installing and configuring the Failover Clustering feature within the guest operating system of virtual machines. In a typical guest clustering deployment, the clustered virtual machines are each running on different Hyper-V hosts, and these hosts are themselves nodes in a host cluster. This means you'll have two instances of Failover Clustering configured: inside the operating system of the Hyper-V hosts and inside the guest operating system of the virtual machines running on the hosts.
Why would you want to do that? Basically because it can provide much faster failover should one of your hosts fail or be taken down for maintenance. When you have only host clustering configured and one of the hosts in a host cluster goes down, the virtual machines are restarted on the surviving host. The failover time in this instance includes the time for the guest operating system and applications to restart. But if your virtualized applications are cluster-aware and in addition to host clustering you also have guest clustering configured, then the failover time is greatly reduced because the active application workload itself can be failed over instead of the entire virtual machine being failed over. This makes the combination of guest and host clustering a good choice for a resiliency solution for applications that are cluster-aware.
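The difference in failover time described above comes down to simple arithmetic. The Python sketch below only models that reasoning: the function names are hypothetical and the timings are made-up placeholders, not measurements from any real Hyper-V deployment.

```python
def host_only_failover_seconds(detect_s, guest_boot_s, app_start_s):
    """Host clustering only: the whole virtual machine is restarted on another host,
    so the guest operating system and the application both have to start again."""
    return detect_s + guest_boot_s + app_start_s


def guest_cluster_failover_seconds(detect_s, app_failover_s):
    """Guest clustering: the workload moves to a clustered virtual machine that is
    already running, so there is no guest operating system boot to wait for."""
    return detect_s + app_failover_s


# Placeholder timings in seconds, chosen purely for illustration.
host_only = host_only_failover_seconds(detect_s=10, guest_boot_s=60, app_start_s=30)
guest_clustered = guest_cluster_failover_seconds(detect_s=10, app_failover_s=15)
print(host_only, guest_clustered)
```

Whatever the real numbers are in a given environment, the guest-clustered case drops the guest boot term entirely, which is where the faster failover comes from.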
The main advantage of guest clustering is that it doesn't reduce functionality. In other words, the only limitations on processing, virtual networking and storage are those of the operating system itself. So by using Windows Server 2012 R2 as both the host and the guest operating system, you can take advantage of SLAT and Hyper-Threading, up to 64 virtual processors, SR-IOV, up to 64 nodes per cluster, Live Migration, Hyper-V Replica, and so on. For a comparison of Windows Server 2012 R2 Hyper-V and VMware vSphere 5.5 from a Microsoft perspective, see this link.
That all sounds good, but are there any possible disadvantages to using guest clustering? Consider the following:
- More complexity - While the VMware HA/FT solution requires configuring only one instance of clustering (on the hosts) and then flipping a few switches, Microsoft's guest clustering entails configuring two separate instances of Failover Clustering (one on the hosts and the other within the virtual machines).
- Applications must be cluster-aware - To use guest clustering to provide resiliency for a virtualized application, the application must be cluster-aware. TechNet says that a cluster-aware application "is an application that calls the cluster APIs to determine the context under which it is running...and can failover between nodes for high availability" (see this link). Most Microsoft server applications are cluster-aware, including the latest versions of SQL Server and Exchange, server roles like DHCP server and Scale-out File Server, and so on. If you're running a custom application that is not cluster-aware, you would need to re-architect it if you wanted to use guest clustering to provide resiliency.
For a good explanation of how guest clustering works and how to configure it, see this link.
Getting back to our screwdriver analogy, the best approach from a business perspective is to match the right tool with the job. For example, VMware FT is probably an ideal solution for adding resiliency to a legacy business application that is not cluster-aware and for which re-coding the application to make it cluster-aware would incur unnecessary cost for the company. The analogy here would be Figure 1, which shows a simple but effective tool that can help ensure operational resiliency for a mature business with no plans to diversify or take on the world. On the other hand, businesses that need to diversify or become agile in order to compete in rapidly evolving markets may want to consider deploying a Hyper-V guest clustering solution, even if they are currently a VMware shop, so they can take advantage of the resiliency such a solution can provide for business applications built using SQL Server 2012 Always On, Scale-out File Server and similar cluster-aware applications.