Performance Anxiety: Monitoring Your Exchange 2000 Server
You probably already know the importance of regularly monitoring your Windows 2000 servers. In fact, you are probably already performing daily and weekly monitoring and baselining on your servers for such critical component areas as hard disks, memory, processors and network adapters. You are in control of your own domain and you have no performance anxiety, my friend. What's that? Maybe you haven't been as diligent at monitoring your servers' performance as you ought to have been? Oh, well then... sit back and relax a bit... no need to get anxious. If you need an overview or a refresher on monitoring system performance, you've come to the right place.
Monitoring: Who Needs It?!?!
If you stop to think about it, no large organization or enterprise exists without doing monitoring of some sort. Let me give you some examples:
The nuclear power plant that I live near has, at any particular time, between eight and sixteen people monitoring all facets of its operations. They have people in the main control room monitoring the reactor and all of its associated control and safety systems. They have people who roam around the electrical switching area monitoring the electrical switchgear. They have people who roam around in the mechanical support buildings monitoring the turbine generators. In short, they have a lot of monitoring going on at that plant. Why? Without adequate monitoring, they have no idea of what is going on at any particular moment in time. Even more important, without adequate monitoring they have no way to perform trend analysis on their equipment. Monitoring is not only a performance issue for them; it's a safety issue as well.
It seems like the roads in my city are always under destruction. It can't be construction, because they never actually get anything constructed. In order for the city planners to know that a road needs to be widened, they have to perform monitoring over time and then perform trend analysis on the collected data. They most likely look at things such as the number of vehicles traveling a stretch of road in a 24-hour period, the peak number of vehicles in a one-hour period, etc. At any rate, they could not arrive at a reasonable idea of what needs to be done to the roads if they did not perform monitoring. Well, that's how it's supposed to work... I can't vouch for the city personally!
So, as you can see, monitoring comes into play in many aspects of our daily routine. It should come as no surprise that it's important and should be done on your servers. Now that we know we should be monitoring our servers, how exactly are we going to get it done? Fortunately for us, Windows 2000 comes with a fairly robust performance monitoring framework that most applications can work within and add to as required. Before we get any further, there are three definitions that Microsoft thinks are important when discussing performance monitoring. I think so too, so here they are: (1)
Throughput is a measure of the work done in a unit of time, typically evaluated from the server side in a client/server environment. Throughput tends to increase as the load increases, up to a peak level. It then begins to fall, and a queue might develop. Throughput in an end-to-end system, such as client/server, is determined by how each component performs. The slowest point in the system sets the throughput rate for the system as a whole. Often this slow point is referred to as a bottleneck. Performance monitoring identifies where bottlenecks occur in your system. The resource that shows the highest use is often the bottleneck, but not always. The bottleneck can also be a resource that successfully handles a great deal of activity. There is no bottleneck if no queues develop.
A queue is a group of jobs that are waiting to run. A queue can form under a variety of circumstances. For example, a queue can develop when requests come in for service by the resource at a faster rate than the resource's throughput, or if requests demand more time from the resource than the system can handle. A queue can also form if the requests occur at random intervals, such as large batches at the same time. When a queue becomes long, work is not handled efficiently and you might experience delays in response time.
Response time is the time required to do work from start to finish. In a client/server environment you typically measure response time on the client side. Response time generally increases as the load increases. You can measure response time by dividing the queue length for the resource by the resource throughput.
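To make these three definitions a little more concrete, here is a minimal Python sketch. The component names and numbers are invented for illustration. Only the two relationships come from the definitions above: the slowest component sets the throughput for the whole system, and response time can be estimated as queue length divided by throughput.

```python
def system_throughput(component_rates):
    """Return (bottleneck_name, rate): the slowest component sets the system rate."""
    name = min(component_rates, key=component_rates.get)
    return name, component_rates[name]


def response_time(queue_length, throughput):
    """Estimate response time in seconds as queue length / throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return queue_length / throughput


# Hypothetical requests-per-second capacity of each component.
rates = {"network": 400.0, "processor": 250.0, "disk": 60.0}
bottleneck, rate = system_throughput(rates)
print("Bottleneck:", bottleneck, "at", rate, "requests/sec")

# Twelve requests waiting on a resource that completes 60 requests/sec.
print("Estimated response time (sec):", response_time(12, rate))
```

Notice that the estimate only means something while a queue actually exists. With an empty queue the formula returns zero, which matches the point above that there is no bottleneck if no queues develop.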
Now that we've seen why we should be monitoring our servers and discussed some basic terminology pertinent to monitoring, let's dive in a bit deeper and get started with monitoring.
Setting the Baseline
As we've already noted, one of the two main purposes of performance monitoring is to be able to conduct trend analysis. In order to do this, it is often helpful to have a baseline, or starting point to compare against. As you collect data over time, you will encounter periods of high, low and average usage. In order to determine what is acceptable on your network, however, you will need something to compare this data against. This is where a baseline data sample comes into play. Using your baseline you can detect bottlenecks and see long-term changes in patterns that may require your intervention to prevent bottlenecks from appearing.
The process of collecting a baseline is not a quick one, nor is it one whose importance should be underestimated. You need to collect performance data over an extended period of time, perhaps several days to weeks. The time period that you use for collecting the baseline data should be fairly representative of normal day-to-day conditions on your network, or the baseline will mean little if anything when it comes time to compare it to newly acquired data. You should establish the baseline by collecting performance and diagnostic data over an extended period during varying, but typical, types of workloads and user connections. If your baseline is done correctly, it will be easier to notice problems before they get serious.
Remember when creating your baselines to make a note of the time period they cover. In order to be able to use your baseline effectively, you will need to know what types of work are being done on the network and how they affect network and resource utilization. An example of this would be in the morning when a large number of users typically log onto the network and download email, thus creating a large load on both network resources (bandwidth) and server resources (Domain Controllers and Global Catalogs to authenticate users, Exchange Servers to handle messaging duties, etc.). Failure to know what your baselines represent can lead to making an analytical mistake later when trying to compare current performance logs to the baseline.
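One way to keep track of what a baseline represents is to summarize it by time of day, so that a busy morning is compared against other busy mornings rather than against a quiet afternoon. The Python sketch below is only an illustration of that idea. The sample values are invented, and the function name baseline_by_hour is my own.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def baseline_by_hour(samples):
    """Group (timestamp, value) samples by hour of day and average each group."""
    buckets = defaultdict(list)
    for stamp, value in samples:
        buckets[stamp.hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical processor-utilization samples (percent) from a baseline log.
samples = [
    (datetime(2002, 3, 4, 8, 0), 70.0),    # morning logon rush
    (datetime(2002, 3, 4, 8, 30), 80.0),
    (datetime(2002, 3, 4, 14, 0), 30.0),   # quieter afternoon
    (datetime(2002, 3, 4, 14, 30), 40.0),
]
hourly = baseline_by_hour(samples)
for hour, average in hourly.items():
    print(f"{hour:02d}:00 average = {average}")
```

With a real baseline covering several days, each hourly bucket would hold many more samples, and you could just as easily group by weekday if your workload follows a weekly pattern.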
Using the Logs
Now that you've got some understanding of what a baseline is, what it is used for and things to watch out for when creating and using a baseline, let's get down to business and create a baseline for a server. After creating a baseline, you can then use it to compare against current network conditions.
Creating a baseline Log
Note that in this example, I have only created a baseline log of about 12 minutes. You will want to let your baseline run for much longer than that, perhaps several days or even a week.
To create a baseline log:
Click Start | Programs | Administrative Tools | Performance. This will open up the Performance Monitoring window as shown in Figure 1.
Figure 1 - The Performance window.
To create a baseline log, expand Performance Logs and Alerts and then right-click on Counter Logs. Select New Log Settings... from the context menu, as shown in Figure 2.
Figure 2 - Creating a new counter log.
Provide the log name and click OK.
The counter log dialog box will open. On the General tab, you will need to add the counters you want to monitor. Adding a counter is shown in Figure 3.
Figure 3 - Adding counters to the counter log.
After you have added all counters you want to track (see Table 1 later in this article for some key Exchange 2000 Server counter objects), configure the options on the Log Files and Schedule tabs. When you are done, click OK to close out the counter log dialog box.
Unless you opted to have the log start and stop automatically (not really the best method for creating baselines), you will now want to manually start the counter log by right-clicking it and selecting Start from the context menu, as shown in Figure 4.
Figure 4 - Starting the counter log.
When you are done collecting the baseline data, right-click the counter log again and select Stop from the context menu.
Now you have a baseline counter log. By default, the log is saved in the X:\PerfLogs folder with the name you gave the counter log when you created it, where X is the volume Windows 2000 is installed on. In the case of this example, the log is called Exchange_Baseline_000002, because we are going to be using the second baseline log I have created from this counter log.
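If you would rather crunch the numbers outside of the Performance window, a counter log saved in a comma-delimited text format can be read with a few lines of Python. The sketch below rests on assumptions: that you chose a CSV text format on the Log Files tab (a binary log cannot be read this way), that the first column holds the timestamp, and that each remaining column is headed by a counter path. The server name EXCH01 and the sample rows are invented, so check the header of your own log before relying on this.

```python
import csv
import io


def load_counter_log(handle):
    """Return {counter_path: [float values]} from a CSV counter log."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = {name: [] for name in header[1:]}  # column 0 is the timestamp
    for row in reader:
        for name, cell in zip(header[1:], row[1:]):
            try:
                columns[name].append(float(cell))
            except ValueError:
                pass  # blank or non-numeric sample: skip it
    return columns


# Made-up sample standing in for a real log file; the layout is an assumption.
sample = io.StringIO(r'''"Time","\\EXCH01\Memory\Pages/sec","\\EXCH01\Memory\Available Bytes"
"03/04/2002 08:00:00.000","12.5","52428800"
"03/04/2002 08:00:15.000"," ","50331648"
"03/04/2002 08:00:30.000","18.5","48234496"
''')
log = load_counter_log(sample)
for name, values in log.items():
    print(name, "samples:", len(values), "average:", sum(values) / len(values))
```

To read a real file, replace the io.StringIO sample with open(path, newline=""), where path points at your own log in the PerfLogs folder.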
Creating a performance log
After you've gotten a baseline log, you will want to compare the baseline values to those on the network from time to time. The process to create a performance log is pretty much the same as that for creating a baseline log, with the main difference being that you will not be saving the data.
To create a performance log:
From the Performance Monitoring window, add counters to the performance log by either right-clicking the log display area on the right side and selecting Add Counters... or by clicking the + icon on the tool bar. The window as shown in Figure 5 will open and you can then add the counters you want to monitor. You should add the same counters as you have in your baseline log.
Figure 5 - Adding performance counters.
After you have added the counters you want, close out the Add Counters window and you can see your counters incrementing in real time, as shown in Figure 6. Note that the standard display for the graph view, as shown, is 1 minute, 40 seconds.
Figure 6 - Performance counters in action.
Viewing the baseline log
From the Performance Monitoring window, click the View Log Data button on the tool bar, as shown in Figure 7.
Figure 7 - Opening a saved counter log.
The Select Log File window opens. Navigate to the location of your baseline counter log, select it and then click Open as shown in Figure 8.
Figure 8 - Selecting a log to open.
You will have to add the counters you want to look at in order to be able to view them. You add the counters in the same way as before, with the exception that only the counters you saved in the baseline log will be available to add. You can add one, two, all of your counters or any number in between as you desire. This makes it easier to view the data. Figure 9 shows the baseline data that I have chosen to show at this time. Note that the time span covered is 11 minutes, 29 seconds. To really be effective, the baseline should have been for a much longer period of time.
Figure 9 - Showing the baseline data.
In Figure 10 you can see that I have highlighted a specific counter to make it easier to see. To highlight a counter, first select it and then click the Highlight icon on the tool bar.
Figure 10 - Highlighted data is easier to see.
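Eyeballing two graphs works, but once you have average values for the same counters from both the baseline and a current log, the comparison is simple arithmetic. Here is a small Python sketch of that step. The counter values are invented and percent_change is just a name I picked for the example.

```python
def percent_change(baseline, current):
    """Return {counter: percent change from baseline} for counters in both sets."""
    changes = {}
    for counter, base in baseline.items():
        if counter in current and base != 0:
            changes[counter] = (current[counter] - base) / base * 100.0
    return changes


# Hypothetical averages taken from a baseline log and from a current log.
baseline_avg = {"Pages/sec": 8.0, "% Disk Time": 40.0}
current_avg = {"Pages/sec": 12.0, "% Disk Time": 30.0}

changes = percent_change(baseline_avg, current_avg)
for counter, change in changes.items():
    print(f"{counter}: {change:+.1f}% versus baseline")
```

A counter whose baseline average is zero is skipped here, because a percent change from zero is undefined. You would want to look at those counters by hand.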
Working with the Logs
Your baseline presents typical values that you should expect to see over time when your system or network is performing satisfactorily. There are some guidelines that should be followed, however, to prevent you from misinterpreting the counters and to eliminate misleading data. When looking at baseline data, you should be attentive to the following things:
Ignore occasional spikes - You do not need to place too much importance on occasional spikes in data. These spikes might be due to the startup of a process and, if so, they are not an accurate reflection of counter values for that process over time. The effect of spikes can remain over time when using counters that average.
Use graphs for reporting - When you monitor performance over an extended period of time, you need to use graphs. Reports and histograms show only last values and averages, and they might not give an accurate picture of values.
Exclude startup events - Unless you specifically want to include startup events in your baseline, you must exclude them because they are temporary high values that tend to skew overall performance results.
Investigate zero values or missing data - Zero values or missing data can impede your ability to establish a meaningful baseline. You should investigate the source of these issues and obtain the missing data, if possible, before you attempt to establish a baseline.
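The guidelines above can be roughed out in Python as follows. This is a sketch under my own assumptions: missing samples are represented as None, the number of startup samples to drop is something you pick by looking at the log, and the values are invented. It drops the startup readings, counts the zero or missing samples that deserve a second look, and shows how a single spike drags an average upward while barely moving the median.

```python
from statistics import mean, median


def clean_samples(values, startup_samples=0):
    """Drop startup samples, then separate usable values from suspect ones."""
    trimmed = values[startup_samples:]
    usable = [v for v in trimmed if v is not None and v != 0]
    suspect = len(trimmed) - len(usable)  # zero or missing samples to investigate
    return usable, suspect


# Hypothetical counter samples: two startup readings, one gap, one zero, one spike.
raw = [95.0, 90.0, 20.0, 22.0, None, 21.0, 0, 85.0, 19.0, 23.0]
usable, suspect = clean_samples(raw, startup_samples=2)

print("Usable samples:", usable)
print("Samples to investigate:", suspect)
print("Mean:", mean(usable), "Median:", median(usable))
```

Keep in mind that zero is a perfectly healthy reading for some counters, so treat the suspect count as a prompt to investigate rather than as proof that something is wrong.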
Counters to Log
Table 1 contains the objects and thresholds that Microsoft recommends for Exchange 2000 Servers. (1) You can use these values as a guideline when attempting to identify a performance issue or bottleneck on your server or network.
Table 1 - Recommended thresholds for object counters.
|Resource||Object\ Counter||Suggested threshold||Comments|
|Hard disk||LogicalDisk\ % Free Space||15 percent||None|
|Hard disk||LogicalDisk\ % Disk Time||90 percent||None|
|Hard disk||PhysicalDisk\ Disk Reads/sec, PhysicalDisk\ Disk Writes/sec||Depends on manufacturer's specification||Check the specified transfer rate for your hard disks to verify that this rate does not exceed the specifications. Some SCSI disks can handle 50 to 70 I/O operations per second.|
|Hard disk||PhysicalDisk\ Current Disk Queue Length||Number of spindles plus 2||This is an instantaneous counter; observe its value over several intervals. For an average over time, use PhysicalDisk\ Avg. Disk Queue Length.|
|Memory||Memory\ Available Bytes||Less than 4 MB||Research memory usage and add memory if needed.|
|Memory||Memory\ Pages/sec||20||Research paging activity.|
|Network||Network Segment\ % Net Utilization||Depends on type of network||You must determine the threshold based on the type of network you use. For example, for Ethernet networks, 30 percent is the recommended threshold.|
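|Paging File||Paging File\ % Usage||More than 70 percent||Review this value in conjunction with Available Bytes and Pages/sec to understand paging activity on your computer.|
|Processor||Processor\ % Processor Time||85 percent||Find the process that is using a high percentage of processor time. Upgrade to a faster processor or install an additional processor.|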
|Processor||Processor\Interrupts/sec||Depends on processor||A dramatic increase in this counter value without a corresponding increase in system activity indicates a hardware problem. Identify the network adapter or hard disk controller card causing the interrupts. You might need to install an additional adapter or controller card. For current CPUs, use a threshold of 1,500 interrupts per second.|
|Server||Server\ Bytes Total/sec||Not specified||If the sum of bytes total/sec for all servers is roughly equal to the maximum transfer rates of your network, you might need to segment the network.|
|Server||Server\ Work Item Shortages||3||If the value reaches this threshold, consider tuning the InitWorkItems or MaxWorkItems entries in the registry (in HKEY_LOCAL_MACHINE\SYSTEM\ CurrentControlSet\Services\lanmanserver\ Parameters).|
|Server||Server Work Queues\ Queue Length||4||If the value reaches this threshold, there might be a processor bottleneck. This is an instantaneous counter; observe its value over several intervals.|
|Multiple Processors||System\ Processor Queue Length||2||This is an instantaneous counter; observe its value over several intervals.|
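If you have average values for some of these counters, checking them against Table 1 is easy to automate. The Python sketch below copies a handful of the fixed thresholds from the table. The dictionary layout, the function names and the sample values are my own, and the counter names are written without the stray space after the backslash. I have also made a simplifying assumption that a value which reaches an upper threshold counts as a breach.

```python
# Thresholds copied from Table 1; the dictionary layout is only for illustration.
# "min" flags values below the limit, "max" flags values at or above it.
THRESHOLDS = {
    r"LogicalDisk\% Free Space": ("min", 15.0),
    r"LogicalDisk\% Disk Time": ("max", 90.0),
    r"Memory\Available Bytes": ("min", 4 * 1024 * 1024),
    r"Memory\Pages/sec": ("max", 20.0),
    r"Paging File\% Usage": ("max", 70.0),
    r"Server\Work Item Shortages": ("max", 3.0),
    r"Server Work Queues\Queue Length": ("max", 4.0),
    r"System\Processor Queue Length": ("max", 2.0),
}


def disk_queue_limit(spindles):
    """Table 1 suggests a disk queue threshold of the number of spindles plus 2."""
    return spindles + 2


def breaches(averages, thresholds=THRESHOLDS):
    """Return the sorted names of counters whose values cross their threshold."""
    found = []
    for counter, value in averages.items():
        if counter not in thresholds:
            continue  # no fixed threshold recorded for this counter
        kind, limit = thresholds[counter]
        if (kind == "min" and value < limit) or (kind == "max" and value >= limit):
            found.append(counter)
    return sorted(found)


# Hypothetical averages from a current log.
averages = {
    r"Memory\Available Bytes": 3_000_000,
    r"Memory\Pages/sec": 35.0,
    r"LogicalDisk\% Disk Time": 40.0,
    r"System\Processor Queue Length": 1.0,
}
print("Over threshold:", breaches(averages))
print("Queue limit for a 4-spindle array:", disk_queue_limit(4))
```

Counters whose threshold depends on your hardware or network type are left out on purpose. You would need to fill those in from your own specifications.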
When dealing with performance, there is no absolute. There is no hard-and-fast rule that you should be working from. The only things you have to rely on are the guidelines provided by Microsoft and the experience you and your peers have with Exchange. From this experience will come good judgment and the ability to see beyond the numbers.
In many cases, there may actually be multiple bottlenecks in a system. This is a case that is going to require many days and much patience to work out fully. The key to tweaking the performance in a case like this is to move slowly, one thing at a time and keep a change log of what you have done. You might correct one bottleneck only to discover another soon thereafter.
Lastly, remember that poor performance in one component may be a result of problems in another component. If you are short on memory, you can expect paging file usage and disk reads and writes to increase. Although you will be able to easily see the changes in the paging file and disk activity, you might miss the real problem of insufficient memory.
The most important thing to remember, however, is that in order to monitor performance accurately on your servers and network you must do it regularly. Playing a hit-and-miss game of randomly looking at performance will get you nowhere fast. Of course, that's just my two cents' worth.