Real World - A SAN Migration Experience
First of all, I’ll lay the groundwork. Our iSCSI-based SAN supported a number of workloads, including:
- A VMware cluster of 50 virtual servers running on four vSphere 4.0/4.1 hosts.
- Our Exchange 2007 mail stores. Our Exchange 2007 system was fully physical while it ran on the iSCSI SAN.
- Our primary database server’s databases. Our primary SQL server is a physical server with all databases stored on the iSCSI SAN.
I’ll continue by saying that I didn’t go into the SAN decision with a requirement that it be a Fibre Channel solution, but the ultimate option we chose moved us down that path. Although I’ve had relatively little experience with Fibre Channel, getting it up and running was really pretty easy. Frankly, my initial desire was to stay on iSCSI, but I wasn’t so locked into that desire that I was unwilling to consider non-iSCSI solutions.
Here’s a look at the infrastructure we had in place before March of 2011:
- iSCSI SAN - two management modules with redundant paths. The storage system is fully redundant, as it should be.
- A single Dell M1000e blade chassis with 6 x M6220 Ethernet blades in the back. Each of these Ethernet blades connects to a NIC port in each server.
- Four VMware ESX hosts, each with 6 Ethernet ports spread across three NICs (a dual-port onboard NIC and two dual-port mezzanine cards). Each port maps to an Ethernet switch module in the back of the chassis. At the beginning of the process, three of the ESX hosts had 32 GB of RAM each and one had 48 GB of RAM.
We faced some challenges, though. First of all, in order to make the move to Fibre Channel, we needed to replace one of the Ethernet cards in each of the existing servers with a Fibre Channel card and also replace two of the six Ethernet switches in the blade chassis with Fibre Channel blades to match up with the new Fibre Channel mezzanine card being installed in each server.
However, although the M1000e chassis does allow for hot swapping of the chassis blades, you can’t install a blade of one type while one or more of the servers still have mezzanine cards of another type. So, as long as we had any blade servers with an Ethernet card in the slot we were targeting for Fibre Channel, we couldn’t install the Fibre Channel module in the chassis. Had we tried, the chassis would have disabled the communications slot in order to prevent damage. So, the first order of business was running through the servers and removing the third Ethernet adapter… right after we added one new ESX host.
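The constraint above boils down to a simple rule, which can be sketched in a few lines of Python. This is only a model of the behavior as I understood it, not anything Dell ships; the function name and the card type strings are made up for illustration.

```python
from typing import List, Optional


def can_install_switch(module_type: str, mezz_cards: List[Optional[str]]) -> bool:
    """Return True if a switch module of module_type may go into a fabric slot.

    mezz_cards holds, for each blade, the type of mezzanine card wired to
    that fabric ("ethernet" or "fc"), or None when the slot is empty.
    The type strings are hypothetical labels, not Dell identifiers.
    """
    return all(card is None or card == module_type for card in mezz_cards)


# All four blades still carry the third Ethernet adapter: the swap is blocked.
before = ["ethernet", "ethernet", "ethernet", "ethernet"]
# Once every adapter has been pulled, the fabric is empty and the swap is allowed.
after = [None, None, None, None]

print(can_install_switch("fc", before))  # False
print(can_install_switch("fc", after))   # True
```

A single leftover Ethernet card is enough to block the new module, which is why every server had to be touched before the chassis could be.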
A new addition
As a part of this migration process, we also made the decision to replace one of our existing ESX servers containing 32 GB of RAM with a new unit containing 96 GB of RAM; our cluster was becoming RAM-bound. Believe it or not, it was less expensive to buy a whole new server than it was to just upgrade the RAM in the old server and add an extended warranty. To add some icing to the cake, the new server has dual six-core processors while the old unit used dual quad-core processors. So, we got 64 GB more RAM and 4 more processing cores for less than the cost of a RAM upgrade and warranty extension.
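For anyone who wants to check the numbers, here is the arithmetic as a short Python sketch. The per-host RAM figures come from the infrastructure list earlier in the article; the variable names are just for illustration, and the old host is assumed to be a dual quad-core (8-core) box.

```python
# RAM per host in GB before the swap: three 32 GB hosts and one 48 GB host.
before = [32, 32, 32, 48]
# One 32 GB host is replaced by the new 96 GB unit.
after = [32, 32, 48, 96]

old_cores = 2 * 4  # assumed dual quad-core (8 cores) in the retired host
new_cores = 2 * 6  # dual six-core in the new host

ram_gain = sum(after) - sum(before)
core_gain = new_cores - old_cores

print(sum(before), sum(after), ram_gain, core_gain)  # 144 208 64 4
```

In other words, the four-host cluster goes from 144 GB to 208 GB of RAM once the old host is retired.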
We kicked off our process by installing vSphere 4.1u1 on the new server, applying a host profile and adding this new server to our vSphere cluster with the 60-day grace period license provided by VMware.
The process begins
Obviously, we needed to accomplish our migration goals with as little downtime as possible. So, we did it by brute force. Starting with the first vSphere server, we moved that host into maintenance mode, which allowed vCenter to evacuate the server for us and automatically vMotion all of its workloads to other hosts in the cluster, including our shiny new 96 GB, 12-core unit. Once all workloads were evacuated, we pulled the server from the chassis, removed the third Ethernet adapter and installed the Fibre Channel adapter.
All was not well.
With this first server, we ran into a problem. Our original intent was to pull the Ethernet card and simply replace it with the new Fibre Channel adapter, slide the server back into the chassis and turn it back on. When we tried this with the first server, the system wouldn’t boot. Upon further investigation, we determined this: If any of the servers in the blade chassis still had Ethernet cards in the slot we were intending for the Fibre Channel adapters, the server would not boot. Period. According to Dell, this is by design in order to prevent system damage. So, we revised our plan a bit. We pulled the newly installed Fibre Channel card, brought the server back up without it and then exited maintenance mode, thus putting that server back into production and able to host workloads. From there, we moved on to the other vSphere hosts and, one by one, moved them to maintenance mode, removed the third Ethernet adapter and then placed them back into production.
Once the Ethernet adapter removal was complete and none of the servers in the chassis had an Ethernet adapter in their third slot, we installed the Fibre Channel switch modules into the back of the server chassis. Voilà! Success! The modules powered up with no issues.
At that point, we started the process of running back through each of the vSphere servers to install the Fibre Channel adapter into each one. Again, we placed each system into maintenance mode and shut it down. We removed the blade from the server chassis and installed the Fibre Channel adapter. Once the adapter was installed, we powered the system back on and placed it back into production, with one exception: we completely decommissioned one of the 32 GB, 8-core vSphere hosts and applied its license to the new 96 GB, 12-core host. This was a part of our replacement strategy.
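The revised plan amounts to two rolling passes over the hosts with the chassis change in between. The Python sketch below only builds that ordered list of steps; it doesn’t talk to vCenter. The host names are hypothetical, and it leaves out the false start on the first server and the host we decommissioned.

```python
from typing import List


def rolling_plan(hosts: List[str]) -> List[str]:
    """Build the ordered steps for the two-pass hardware swap."""
    steps: List[str] = []
    # Pass 1: free the target mezzanine slot on every blade.
    for host in hosts:
        steps.append(f"{host}: enter maintenance mode (vMotion evacuates the VMs)")
        steps.append(f"{host}: shut down and remove the third Ethernet adapter")
        steps.append(f"{host}: power on and exit maintenance mode")
    # Only now will the chassis accept the Fibre Channel switch modules.
    steps.append("chassis: install Fibre Channel switch modules")
    # Pass 2: add the Fibre Channel adapter to every blade.
    for host in hosts:
        steps.append(f"{host}: enter maintenance mode (vMotion evacuates the VMs)")
        steps.append(f"{host}: shut down and install the Fibre Channel adapter")
        steps.append(f"{host}: power on and exit maintenance mode")
    return steps


# Hypothetical host names, for illustration only.
plan = rolling_plan(["esx1", "esx2", "esx3", "esx4"])
for number, step in enumerate(plan, start=1):
    print(number, step)
```

Because only one host is out of the cluster at any moment, the remaining hosts need enough spare RAM to absorb its workloads, which is part of why the new 96 GB host went in first.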
Once all of the servers had their Fibre Channel adapters and I was certain that the Fibre Channel switches in the back of the server chassis could see all of the Fibre Channel ports, I connected the new SAN to the external ports on the new Fibre Channel switches, running one connection from each Fibre Channel switch to a Fibre Channel port on each management module in the storage array.
I’ll admit up front that my first experience with Fibre Channel was this project. It was, fortunately, a piece of cake. I simply created a zone on each of the Fibre Channel switches and added the appropriate switch ports to each of the new zones. Finally, I configured the vSphere hosts to be able to see volumes on the new SAN; I’d created a small test volume for this purpose.
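For readers as new to Fibre Channel as I was, the basic idea of zoning can be modeled in a few lines of Python: with zoning in effect, two ports can talk to each other only if some zone contains both of them. This is a conceptual sketch, not switch configuration syntax, and the zone and port names are invented.

```python
from typing import Dict, Set


def can_communicate(zones: Dict[str, Set[str]], a: str, b: str) -> bool:
    """Return True if at least one zone contains both ports."""
    return any(a in members and b in members for members in zones.values())


# Hypothetical zone on one switch: two host adapters and one array port.
zones = {
    "vsphere_to_san": {"host1_hba", "host2_hba", "array_mm1"},
}

print(can_communicate(zones, "host1_hba", "array_mm1"))  # True
print(can_communicate(zones, "host3_hba", "array_mm1"))  # False
```

A host port that was never added to a zone, like host3_hba here, simply can’t see the array, which is the first thing to check when a volume doesn’t show up.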
The migration begins
Now, the stage was set. All of the vSphere servers could see both the old and the new storage devices. I created new LUNs on the new storage array and, from there, created new VMFS volumes. Next, I started a Storage vMotion operation for each of the 50 or so virtual machines that had to be migrated from the iSCSI SAN to the Fibre Channel unit. Again, this was done with (almost) no downtime, as was the hardware modification process from earlier.
This Storage vMotion process took about two days to complete. I ran into minor problems migrating one virtual machine, which I was able to mitigate by bringing the machine down and moving it during a longer maintenance window.
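The overall pattern here (try each live migration, and set aside anything that fails for a maintenance window) can be sketched in Python as follows. The migrate callable is a stand-in for whatever actually starts the Storage vMotion; the VM names and the failing VM are made up.

```python
from typing import Callable, List, Tuple


def migrate_all(
    vms: List[str], migrate: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Try a live migration for each VM; collect failures for a later window."""
    moved: List[str] = []
    deferred: List[str] = []
    for vm in vms:
        if migrate(vm):
            moved.append(vm)
        else:
            deferred.append(vm)
    return moved, deferred


# Hypothetical inventory of 50 VMs, one of which refuses to move while running.
vms = [f"vm{i:02d}" for i in range(1, 51)]


def fake_migrate(vm: str) -> bool:
    """Stand-in for a real Storage vMotion call; fails for one made-up VM."""
    return vm != "vm17"


moved, deferred = migrate_all(vms, fake_migrate)
print(len(moved), deferred)  # 49 ['vm17']
```

Keeping the deferred list separate makes it easy to schedule the stragglers without holding up everything else.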
The end result: All of the running VMware-based workloads were moved to the new storage with the only downtime being for one non-critical workload. All of the work I described above was done live, during a combination of business hours and off hours. All of the hardware work was done during business hours while the vMotion work was done throughout the day and night.
A few years ago, this kind of project would have been crazy to undertake during business hours. There would have been significant downtime and the business would have been negatively impacted during the migration. However, with the help of more than 150 vMotion operations, we were able to complete our project very, very quickly and with the business noticing, well, nothing. This is exactly the kind of benefit that we should be seeing from our virtual environment and this kind of capability was one of the drivers for the implementation of our virtual infrastructure.
Early in this article, I mentioned that we had some physical servers running SQL and Exchange with workloads running on the iSCSI SAN. In addition, we had some other physical workloads that needed to move to new hardware due to lease expiration on the existing units. Specifically, there were two additional physical servers that support our SharePoint 2007 environment – a single MOSS 2007 server and a dedicated SQL 2005 server. All of the storage for the SharePoint and dedicated SQL environment was local with no data residing on the old SAN.
So, what did we do?
- Exchange 2007 server. Our Exchange 2007 system holds mailboxes for 1,300 students, faculty and staff and all of the mail storage was on the iSCSI SAN, although the Exchange Server itself was a physical device. We used PlateSpin PowerConvert to perform a P2V operation to migrate this physical server to the newly expanded vSphere cluster. This took the better part of a Saturday night during an extended maintenance window, but we have now completely virtualized that workload. This also accomplished the goal of removing the Exchange 2007 databases from the iSCSI SAN. They’re now stored with the virtual machine on the new Fibre Channel unit.
- SharePoint 2007 system. Our institutional web site currently runs on SharePoint 2007, as it has for about three years. The physical server used direct-attached storage. Again, we used a P2V operation to virtualize the workload.
- SQL 2005. This dedicated SQL Server for SharePoint 2007 also used direct-attached storage. Again, a P2V operation handily virtualized this critical workload.
At this point, we have one more physical server that we intend to virtualize. Our primary database server is a blade server running SQL Server 2008 R2. This server stores all of its data on the iSCSI SAN and, unfortunately, still does. It’s the only remaining workload on the SAN. Although I have a very high level of confidence in P2V, we’re at a point in our semester at which we cannot afford extended downtime, even overnight. As such, we’re waiting for another couple of weeks before we finalize this process.
All in all, I’m extremely pleased by the end result. Although there were quite a few steps that had to be taken, none of them were particularly onerous or difficult. We were able to very quickly move our entire VMware inventory to our new storage and further consolidate/virtualize our physical inventory through the use of P2V software. At this point, we’re running an environment that is well over 90% virtual and it shows. We’re able to perform maintenance and stand up new services much more easily than we were before, and our new SAN gives us the performance and capacity that will help us grow.