One of the worst things that can happen as an administrator of a Forefront Threat Management Gateway (TMG) 2010 firewall is hearing about connectivity issues from users or the help desk. Hearing about slowness or a full-blown network outage from others is always bad news, as it makes the administrator appear out of touch. Working in reactive mode is always challenging, especially in large environments where the demands of many thousands or tens of thousands of users (especially the executives!) are high. When deploying TMG in any environment, large or small, it’s a good idea to keep a close eye on it so that when problems arise you can be proactive and, ideally, resolve the issue before it escalates into a service disruption. At the very least you’ll have advance notice of the issue and can inform others that you’re working to resolve it. In this month’s article I’ll share with you some simple strategies along with tips and tricks for monitoring your production TMG deployment.
When preparing a monitoring strategy for your Forefront TMG 2010 firewall, the first thing I recommend is configuring connectivity verifiers to monitor network connectivity and important supporting services. This is a simple, highly effective tool that can be tremendously useful when connectivity or performance issues arise. Forefront TMG is highly dependent on supporting infrastructure services such as DNS and Active Directory, so it’s a good idea to monitor them closely from the TMG firewall’s perspective. In addition, if upstream connectivity is impaired, no amount of troubleshooting on the TMG firewall is going to address the issue until problems with outbound Internet connectivity are resolved. It’s a good idea, then, to monitor this link from the TMG firewall as well.
Begin by configuring connectivity verifiers for internal supporting infrastructure services such as Active Directory and DNS. If you are using TMG to publish internal resources, for example Exchange Outlook Web App (OWA), SharePoint, or any other application or service, configure connectivity verifiers for those resources as well. In addition, configure at least two connectivity verifiers for external web sites to ensure that outbound network connectivity is available and responsive. I recommend using large internet properties like Bing.com and Yahoo.com. Microsoft also maintains an “internet beacon” service at internetbeacon.msedge.net that reliably responds to ICMP, which can be used as an additional connectivity verifier.
Once you’ve configured your connectivity verifiers you can monitor them to ensure that the TMG firewall has connectivity to the supporting infrastructure services necessary to service requests, and that the path to the Internet is active and healthy. You can also configure alert definitions to proactively notify you via e-mail, run a program or script, report the event to the Windows event log, or stop or start services. For detailed information about configuring Forefront TMG connectivity verifiers, click here.
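To make the idea of a connectivity verifier concrete, here is a minimal Python sketch of the same kind of check performed outside of TMG: attempt a TCP connection, time it, and decide whether the result deserves an alert. This is not TMG’s own implementation or API. The names check_tcp and needs_attention, the five-second timeout, and the one-second “slow” threshold are all assumptions chosen for illustration.

```python
import socket
import time
from typing import NamedTuple, Optional


class VerifierResult(NamedTuple):
    name: str                     # label for the monitored service
    reachable: bool               # True if the TCP connection succeeded
    response_ms: Optional[float]  # connection time in ms, None on failure


def check_tcp(name: str, host: str, port: int,
              timeout_s: float = 5.0) -> VerifierResult:
    """Attempt a TCP connection to host:port and time how long it takes."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            return VerifierResult(name, True, elapsed_ms)
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return VerifierResult(name, False, None)


def needs_attention(result: VerifierResult, slow_ms: float = 1000.0) -> bool:
    """Flag a target that is unreachable or slower than the (assumed) threshold."""
    if not result.reachable or result.response_ms is None:
        return True
    return result.response_ms > slow_ms
```

As a usage example, calling check_tcp("dns", "10.0.0.10", 53) against a hypothetical internal DNS server and passing the result to needs_attention would tell you whether that dependency is down or slow from the perspective of the machine running the script.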
Intuitively, the most common tool to monitor the performance of the TMG firewall is the performance monitor. I’ve written in the past about using this tool and others to troubleshoot performance-related issues on the TMG firewall, so I won’t cover that in detail here. However, there are some performance monitor objects and counters that can be watched closely to provide signs of potential trouble on the TMG firewall. Here are a few of the most common ones:
High CPU Utilization – Monitoring CPU utilization on the TMG firewall is a no-brainer. High levels of CPU utilization may indicate that the TMG firewall is overloaded. This could be due to higher than expected usage, which in my experience is often related to increases in user population caused by an acquisition, a new enterprise application deployment, or sometimes major sporting events like the NCAA basketball tournament or World Cup soccer. For enterprise deployments, excessive CPU utilization could also be caused by an array member going offline or an issue with load balancing.
Low CPU Utilization – A very important and often overlooked monitoring strategy for CPU utilization is watching for underutilization. While administrators tend to instinctively watch for high levels of usage, low levels can be strong indicators of potential problems. For example, if a load balancing issue arises, traffic may not get routed to all array members. If a single member isn’t handling enough of the load, it would be a good idea to be alerted to this so you can investigate further. Also, if there’s a network connectivity issue preventing traffic from reaching the TMG server, utilization will be abnormally low. Again, a simple alert for this situation can be extremely helpful.
Network Utilization – Monitoring network interfaces for high levels of utilization is an excellent idea. Watch for signs of saturation and queuing and alert accordingly. As with CPU monitoring, it is a good idea to also watch for signs of underutilization. Again, when a normally busy TMG firewall suddenly shows signs of handling little or no network traffic at all, it’s a good indicator that something else might be wrong and should be investigated. Pay special attention to the number of connections and the number of requests per second and compare them to your baseline. For enterprise arrays, the number of connections and requests per second should be pretty even across all array members. If they are not, look carefully for signs of connectivity or load balancing issues.
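The two checks described above, alerting on both high and low utilization and comparing load across array members, can be sketched in a few lines of Python. This is an illustration only, not a TMG feature. The function names, the ratios (below 25% or above 150% of baseline), and the 30% tolerance are hypothetical values that you would tune against your own baseline.

```python
from statistics import median
from typing import Dict, List


def classify_utilization(current: float, baseline: float,
                         low_ratio: float = 0.25,
                         high_ratio: float = 1.5) -> str:
    """Return 'low', 'high', or 'normal' relative to the baseline value."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = current / baseline
    if ratio < low_ratio:
        return "low"    # suspiciously idle: possible upstream or routing problem
    if ratio > high_ratio:
        return "high"   # possible overload
    return "normal"


def unbalanced_members(requests_per_sec: Dict[str, float],
                       tolerance: float = 0.30) -> List[str]:
    """Return array members whose request rate deviates from the array
    median by more than the given fraction."""
    if not requests_per_sec:
        return []
    mid = median(requests_per_sec.values())
    if mid == 0:
        return []
    return sorted(name for name, rate in requests_per_sec.items()
                  if abs(rate - mid) / mid > tolerance)
```

The median is used rather than the mean so that a single idle member does not drag the reference point down and cause the healthy members to be flagged too. For example, with made-up rates of 500, 480, and 40 requests per second for three members, only the third is reported.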
You can take performance monitoring to much greater levels of detail than just monitoring CPU and network performance, of course. For more details about monitoring performance on the TMG firewall, click here.
An important part of a successful TMG firewall monitoring strategy is understanding what utilization levels look like under normal circumstances. Here, a solid performance baseline should be gathered to determine what specifically “normal” looks like in your environment. Monitor the TMG firewall for extended periods of time to gain insight into the usage patterns and utilization levels unique to your deployment. Be sure to record this information for later review and comparison. In addition, conduct this baseline exercise at regular intervals to ensure that new usage patterns, perhaps caused by company growth or new applications deployed internally, are properly accounted for in the baseline. Another excellent idea is to create a performance monitor “black box” that continuously records performance monitor data on the firewall. In this configuration, a data collector set acts like a flight data recorder that can be reviewed to provide essential performance data leading up to an event. This can be helpful in determining the root cause of an outage. For details on configuring a performance monitor black box flight recorder, click here.
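One simple way to turn recorded performance data into a usable baseline is to group samples by hour of day and keep the average and spread for each hour, then flag new readings that fall well outside that range. The Python sketch below assumes the samples have already been exported as (hour, value) pairs. The function names, the three-sigma rule, and the min_spread floor are illustrative choices, not anything prescribed by TMG or Performance Monitor.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple

Baseline = Dict[int, Tuple[float, float]]  # hour -> (mean, std deviation)


def build_baseline(samples: Iterable[Tuple[int, float]]) -> Baseline:
    """Group (hour_of_day, value) samples by hour and summarize each hour."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def deviates(baseline: Baseline, hour: int, value: float,
             sigmas: float = 3.0, min_spread: float = 1.0) -> bool:
    """True if value is unusually far (high OR low) from the hour's average."""
    if hour not in baseline:
        return False  # no history for this hour, so nothing to compare against
    avg, sd = baseline[hour]
    # min_spread avoids alerting on tiny changes when history is very flat.
    allowed = max(sigmas * sd, min_spread)
    return abs(value - avg) > allowed
```

Because the comparison uses the absolute difference, the same check catches both overutilization and the underutilization case described earlier. Rebuilding the baseline at regular intervals keeps it aligned with new usage patterns.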
Another effective method for enhancing your monitoring strategy is to understand exactly what your TMG firewall deployment is capable of withstanding under the worst conditions. For larger enterprises there are load generation tools such as Spirent Avalanche, Ixia BreakingPoint, HP LoadRunner, and many more that can be used to simulate heavy traffic demands. For smaller environments, free tools such as NTttcp and iPerf can be leveraged, although it may require deploying many virtual machines in order to generate enough load. Hosted public cloud solutions like Microsoft Azure can be helpful for this too.
The point of load testing is to stress the TMG firewall deployment to the point of failure in a controlled situation, with monitoring tools running and recording at the time in order to provide valuable information about exactly where the breaking point of the system occurs. These types of stress tests also provide valuable insight into the events that lead up to a failure, so that those same signs can be monitored closely in production environments.
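Once a load test has been recorded, locating the breaking point can be as simple as walking through the results in order of increasing load. The following Python sketch assumes each test step was summarized as a request rate, an error rate, and a CPU reading. The LoadStep structure, the function names, and the 1% error threshold are hypothetical and would need to match whatever your load generation tool actually reports.

```python
from typing import Iterable, NamedTuple, Optional


class LoadStep(NamedTuple):
    requests_per_sec: int  # offered load during this step
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    cpu_percent: float     # average CPU utilization during this step


def find_breaking_point(steps: Iterable[LoadStep],
                        max_error_rate: float = 0.01) -> Optional[LoadStep]:
    """Return the lowest-load step whose error rate exceeds the threshold."""
    for step in sorted(steps, key=lambda s: s.requests_per_sec):
        if step.error_rate > max_error_rate:
            return step
    return None  # the system never failed within the tested range


def last_healthy_step(steps: Iterable[LoadStep],
                      max_error_rate: float = 0.01) -> Optional[LoadStep]:
    """Return the highest-load step reached before the first failing step."""
    healthy = None
    for step in sorted(steps, key=lambda s: s.requests_per_sec):
        if step.error_rate > max_error_rate:
            break
        healthy = step
    return healthy
```

The readings from the last healthy step are a reasonable starting point for production alert thresholds, since they describe what the system looked like just before it failed under test.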
After successfully implementing a Forefront TMG 2010 firewall, continued monitoring of the solution in production benefits administrators of deployments large and small. Keeping a close eye on the health of TMG’s supporting services can reduce troubleshooting time significantly by illuminating where a problem might lie and allowing the administrator to focus their troubleshooting efforts there. Watching for signs of overutilization is key, but don’t overlook underutilization, as it can provide valuable clues and early indication of potential problems elsewhere. Having a good performance baseline is vital to understanding the health of the TMG deployment, and can serve as an essential reference when performance issues arise. Don’t let your users, the help desk, or worse yet, your boss, be the ones to tell you when trouble is happening. Use the monitoring strategies outlined here to stay ahead of the curve and work proactively to resolve connectivity and performance issues before they get out of hand. You’ll be glad you did!