We recently moved everything – SQL (2008 R2), Exchange (2007 & 2010), SharePoint (2007 & 2010), file servers, etc. – to our VMware environment, which now runs on 12-core Dell blade servers connected to a shiny new Xiotech SAN. Everything is running really well. However, even the most robust infrastructure still fails from time to time, whether the cause is hardware or software. At Westminster, even though we’re all virtual, we still maintain a traditional agent-based backup system. Last fall, we moved to Microsoft Data Protection Manager 2010 and installed a backup agent inside each of our virtual machines.
Today, at around 2 PM, our primary production database went offline and appeared to have been corrupted. We quickly copied the affected database to a different location, copied all of the database server log files to that same location, and then restored the 27 GB database from tape. The restore took 8 minutes. Fortunately, because DPM provides continuous data protection, we take hourly backups of our SQL servers, and we were able to restore a backup that had been taken at 1:10 PM. In total, we lost 50 minutes’ worth of data entry. By 2:25 PM – 25 minutes after the failure – we had the system back in production. Of course, now we need to figure out what happened.
While there was still minor data loss, this incident showed that our processes and backups work exactly as intended. That is supposed to be the case, and we test our processes constantly, but when the time comes, it’s really nice to see that things sometimes go right!
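The timeline above boils down to two numbers: how much data was lost (the time between the last good backup and the failure) and how long the system was down (the time between the failure and the return to production). Here is a minimal Python sketch of that arithmetic. The function name `recovery_metrics` and the helper `at` are made up for illustration, the date is a placeholder, and the failure time is treated as exactly 2:00 PM even though it was only approximately that.

```python
from datetime import datetime


def recovery_metrics(last_backup, failure, restored):
    """Return (data_loss_minutes, downtime_minutes) for one incident."""
    data_loss = (failure - last_backup).total_seconds() / 60
    downtime = (restored - failure).total_seconds() / 60
    return data_loss, downtime


def at(hhmm):
    """Build a datetime on a placeholder date; only the time of day matters."""
    return datetime.strptime("2000-01-01 " + hhmm, "%Y-%m-%d %H:%M")


# Times taken from the incident: backup at 1:10 PM, failure at about 2:00 PM,
# back in production at 2:25 PM.
loss, down = recovery_metrics(at("13:10"), at("14:00"), at("14:25"))
print(f"Data lost: {loss:.0f} minutes, downtime: {down:.0f} minutes")
```

With hourly backups, the worst-case data loss from a failure like this one is just under an hour, so the 50 minutes we lost was close to that limit.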