Monitoring Exchange and Finding Common Problems
Make Sure Exchange is Strong
Would you drive a truck over a glass bridge? No. Then why would you run an enterprise-class server operating system hosting a mission-critical application such as e-mail and messaging on an antiquated desktop? Don’t think it happens? It happens more than you think. In the past 5 years alone I have worked with many teams of experts weeding out these exact systems and replacing them with what should have been there before… a system that was thought out and built strong. Now, you don’t have to cluster everything you run, but it would help if your enterprise-level servers were running with RAID, as an example. A redundant RAID level (RAID 1 or RAID 5, for example) can help you in a pinch: when you lose a disk (and, given the MTBF of any drive, you eventually will), you can quickly recover with minimal downtime and no loss of your data.
There are ways around this, of course. When someone wants to save money, there is always a way around doing the right thing… like putting the SMTP server on a desktop, as an example. The truth is, whether you go with an enterprise chassis or an overpowered desktop, do yourself the favor of following the guidelines posted on Microsoft’s Website and, at a minimum, not short-changing yourself (or Exchange) on the posted minimum requirements needed to have Exchange function.
Exchange Server Minimum Requirements
In this article we cover a real world situation where I found an Exchange Server in desperate need of a hardware upgrade. The original problem was thought to be the network, but once some analysis time was spent on the project, it was deemed to be the Exchange Server itself. The clients complained of timing out… getting the dreaded bubble popup … the network was thought to be the culprit. A quick look at the Performance Console on the Exchange Server told me otherwise.
The Performance Console is a Microsoft Management Console (MMC) snap-in that lets you monitor numerous parts of the internal workings of your server. You would be amazed at what you can turn up if you: a) know what you are looking for and b) know how to read the console. In this article, we will take a look at the Performance Console in hopes of finding the problem. To open the console, go to the Administrative Tools folder in the Start menu (or in the Control Panel).
Once you open the Performance Console, you will see that there are a few items that are flagged to be monitored right off the bat… these counters alone will tell you a lot. In this example, I have recreated the problem seen on the production server on a test system called \\shimonski.
The System Monitor component of the Performance Console is what you can use to find problems. Before we look at the actual problem, we should first do a quick refresher on what you are looking at. To learn a whole lot more about performance in general, I suggest reading Mitch Tulloch’s exceptional article on monitoring key performance counters, published on one of our sister sites here at MSExchange.org:
Monitoring Key Performance Counters
As you may find over time, monitoring key performance counters will amaze you.
The Performance Console will help you to monitor key Exchange counters such as MSExchangeIS and a plethora of others. Installing Exchange installs the counters as well. In this article, however, our focus is on the most common: Processor, Memory, and PhysicalDisk counters.
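If you would rather capture the same three objects from a script than watch the graph, here is a minimal sketch in Python. It assumes the typeperf command-line tool is available on the server and simply parses the CSV that typeperf prints. The function names, the 5-second interval and the 12-sample count are my own choices for illustration and are not part of the console.

```python
import csv
import io
import subprocess

# Counter paths for the three objects this article focuses on.
# "_Total" aggregates every processor or every physical disk.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Pages/sec",
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
]


def typeperf_command(counters, interval_s=5, samples=12):
    """Build the command line: -si is the sample interval (seconds), -sc the sample count."""
    return ["typeperf", *counters, "-si", str(interval_s), "-sc", str(samples)]


def parse_typeperf_csv(text):
    """Turn typeperf CSV output into {counter_path: [float, ...]}."""
    rows = [r for r in csv.reader(io.StringIO(text)) if len(r) > 1]
    header, data = rows[0], rows[1:]
    series = {name: [] for name in header[1:]}  # column 0 is the timestamp
    for row in data:
        for name, value in zip(header[1:], row[1:]):
            try:
                series[name].append(float(value))
            except ValueError:
                pass  # skip blank cells and trailing status text
    return series


def sample_counters(counters=COUNTERS, interval_s=5, samples=12):
    """Run typeperf (Windows only) and return the parsed series."""
    result = subprocess.run(
        typeperf_command(counters, interval_s, samples),
        capture_output=True, text=True, check=True,
    )
    return parse_typeperf_csv(result.stdout)
```

On the server itself, calling sample_counters() would return one list of readings per counter, keyed by the full counter path including the machine name.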
In this example, we shall call the company 123 Ltd. 123 Ltd has a problem with its Exchange Server. It was initially thought to be a network-related problem, but that was ruled out with sniffers / protocol analysis software. A quick analysis of bandwidth reports from the WAN showed that there was indeed no network problem beyond the normal occasional outage of service. Once the network was ruled out, the systems were analyzed. Figure 2 shows the real system that was analyzed. As you can see, it looks much like the test system I configured for this example.
In the figure you can see that the Avg. Disk Queue Length and Pages/sec counters are spiking at the same time as the CPU. Believe it or not, this is what is seen by opening the console without even adding a counter! This means that either the IT team did not know what the counters meant or the counters were never examined because nobody knew how to. Further questioning of the IT team showed that this system had been inherited by the current IT group, which replaced the previous one after an acquisition. Nobody had checked it out, ever. Nobody knew how to check it.
I added a few more counters, just to show you what else can be added to gather information. Figure 3 shows the adding of a counter. To add one, click the “+” plus-sign at the top of the Performance Console within System Monitor.
Figure 4 shows you the Exchange counters that can be added to monitor new objects within Exchange. This server had many problems with the Information Store and seemed to be throwing users off all the time.
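The pattern to look for is all three default counters going over the top in the same sample, which usually points to memory pressure driving paging, disk queuing and CPU load together. A minimal sketch of that check follows. The thresholds are common rules of thumb that I am assuming here, not values taken from the figure, and the readings are made-up numbers used only to exercise the code.

```python
def spike_indexes(values, threshold):
    """Return the sample positions where a counter exceeds its threshold."""
    return {i for i, v in enumerate(values) if v > threshold}


def coincident_spikes(series, thresholds):
    """Sample positions where every listed counter is over its threshold at once."""
    hits = [spike_indexes(series[name], limit) for name, limit in thresholds.items()]
    return sorted(set.intersection(*hits)) if hits else []


# Hypothetical thresholds (rules of thumb, tune them for your own hardware).
THRESHOLDS = {
    "% Processor Time": 85.0,
    "Pages/sec": 20.0,
    "Avg. Disk Queue Length": 2.0,
}

# Illustrative readings only, one list entry per sample interval.
readings = {
    "% Processor Time": [22.0, 95.0, 30.0, 99.0],
    "Pages/sec": [3.0, 140.0, 5.0, 210.0],
    "Avg. Disk Queue Length": [0.4, 6.1, 0.3, 2.5],
}

print(coincident_spikes(readings, THRESHOLDS))  # prints [1, 3]
```

Samples 1 and 3 are the ones where all three counters are over their limits together, which is the kind of moment the users would feel as a freeze.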
Figure 5 shows the ‘Explain’ button in use. Take a moment to utilize this button to see what a counter might be used for.
Table 1 shows a list of counters that you can select and what they might tell you. This exact information can be found online and on Microsoft TechNet. If you are in the business of taking care of servers (or in my business… of proving out the network), then learning this tool will definitely save you some time and headaches. The ‘Explain’ button will also help you with this exact information.
Table 1: Available Counters
Displays the amount of physical memory, in bytes, available to processes running on the computer.
The total rate of bytes transferred by the Web service. This counter is the sum of Bytes Sent/sec and Bytes Received/sec.
The latency of MAPI/remote procedure call (RPC) actions measured at the LoadSim/Microsoft Office Outlook client. This counter measures the time it takes for the server to fulfill the client request. It can be used to estimate the time a user would have to wait between initiating individual Outlook actions.
Database\Database Cache Size: The average amount of system memory used by the database cache manager to hold commonly used information from the database files to prevent file operations. If the database cache size seems too small for optimal performance and there is very little available memory on the system (see Memory/Available Mbytes), adding more memory to the system may increase performance. If there is a lot of available memory on the system and the database cache size is not growing beyond a certain point, the database cache size may be restricted to an artificially low limit. Increasing this limit may increase performance.
DB Disk Transfers/sec: The average sum of all random read/write, input/output (I/O) operations to the Microsoft Exchange Database disk volumes (both .edb and .stm files).
The average number of disk bytes written or read per second across all disk volumes.
The number of current Internet Message Access Protocol version 4rev1 (IMAP4) client connections.
The number of unique identifier (UID) commands per second.
ISAPI Extension Requests/sec: The number of requests per second for Outlook Web Access transactions.
The average sum of all sequential write I/O operations to the Exchange log file disk volumes (.log files).
MSExchangeIS Mailbox\Local Delivery Rate: The average rate at which messages are delivered locally to the Exchange store.
The rate at which RPC operations occur. This counter is a good rate counter to measure Exchange workload because all MAPI-based actions use the RPC protocol.
The number of client requests that are currently being processed by the Exchange store.
Network Interface\Bytes Total/sec: The average rate at which bytes are sent and received over each network adapter, including framing characters. Network Interface\Bytes Total/sec is the sum of Network Interface\Bytes Received/sec and Network Interface\Bytes Sent/sec.
Measures network traffic on the server going to and from the server's network adapter.
The number of message delete commands per second.
The number of STAT commands per second. A STAT command is issued once per each user's connection.
Displays the current number of bytes this process has allocated that cannot be shared with other processes.
Processor\% Processor Time: The average percentage of elapsed time that the processor spends to execute a non-idle thread. It is calculated by measuring the duration of time the idle thread is active in the sample interval, and subtracting that time from the interval duration. (Each processor has an idle thread that consumes cycles when no other threads are ready to run.) This counter is the primary indicator of processor activity, and it displays the average percentage of busy time observed during the sample interval.
SMTP Local Queue: The number of messages in the local queue waiting delivery to local users.
SMTP Messages Del/sec: The number of messages being delivered each second to local users.
SMTP Messages Sent/sec: The number of messages being sent each second to a remote server.
Store Virtual Bytes: The average size, in bytes, of the virtual address space that the Store.exe process is using. Use of virtual address space does not necessarily imply corresponding use of either disk or main memory pages. Virtual space is finite, and the process can limit its ability to load libraries.
The combined average rate at which all processors in the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service. This counter is the sum of Thread\Context Switches/sec for all threads running on all processors in the computer, and it is measured in numbers of switches. There are context switch counters on the System and Thread objects. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.
Web ISAPI Extension Requests/sec: The rate at which Internet Server Application Programming Interface (ISAPI) extension requests are received by the Web service. Internet server API requests are used by Outlook Web Access to access the Exchange server.
The set of memory pages (areas of memory allocated to a process) recently used by the threads in a process. If available memory on the server is above a specified threshold, pages remain in the Working Set of a process even if they are not in use. When available memory falls below a specified threshold, pages are removed from the Working Set. If these pages are needed, they will be returned back to the Working Set before they leave main memory and are made available for other processes to use.
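A few of the descriptions in Table 1 spell out the arithmetic behind the number you see on the graph: rate counters are the difference between the last two samples divided by the sample interval, Processor\% Processor Time is the interval minus the idle thread's time, and Network Interface\Bytes Total/sec is received plus sent. The sketch below just restates those three calculations. The function names are mine and the inputs are whatever raw values you happen to have, so treat it as an illustration and not as how the console is implemented.

```python
def rate_per_second(previous, current, interval_s):
    """Rate counter: difference between the last two raw samples over the interval."""
    return (current - previous) / interval_s


def percent_processor_time(idle_s, interval_s):
    """Busy percentage: the share of the interval the idle thread was not running."""
    return 100.0 * (interval_s - idle_s) / interval_s


def bytes_total_per_second(received_per_s, sent_per_s):
    """Bytes Total/sec is the sum of Bytes Received/sec and Bytes Sent/sec."""
    return received_per_s + sent_per_s
```

For example, if the idle thread ran for 1 second out of a 5-second sample interval, percent_processor_time(1.0, 5.0) works out to 80.0.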
What was the Problem?
Now that you are familiar with the Performance Console, System Monitor and things that can be monitored with it, we should think about our original problem. We had an Exchange Server experiencing problems that were thought to be network related. Now that we see that it is not, we need to come back to the base system and see what the problem could be.
Now that you understand what is being monitored, let’s take a look at Figure 6. Figure 6 shows us that we have a server running on sub-par equipment. After careful analysis it was determined that the server was old (5 years old, which is about 100 in computer and dog years). It had 512 MB of RAM, a CPU running at less than 1 GHz, and only 4 GB of free space left on both the system and data drives. The swap file was located on the system drive.
I am cutting to the chase here on purpose with what went wrong… it doesn’t matter what else we could look for, the default settings told us exactly what the problems were. Every time you see a spike in Figure 6, that’s when all the users got frozen or disconnected from the Exchange Server… every time. If the spike was continuous, then all users would lock up. This was nothing more than a server that was low on hardware. Think about it, the network was checked, the systems were checked… to find that everything led back to this:
Exchange Server Minimum Requirements
I hope that this real-world issue showed you how easy a problem can be to fix if you know how to look for it. Many times people just point their finger at the network… it’s the easiest thing to blame because it’s the least understood. Remember, you need hardware in the box, it’s that simple. Think about what you are running on the server. You have your information store (Store.exe), which is one of the biggest individual consumers of memory in Exchange Server 2003. Store.exe processes and manages mailboxes and public information storage… you need disk space, you need swap file space, you need memory to handle Windows Server 2003, Active Directory and Exchange Server 2003. You have other processes such as Inetinfo.exe, which processes and handles Internet protocols and IIS. You have the MTA (Emsmta.exe) and the System Attendant (Mad.exe). Antivirus software and backup software are also common. You can watch it all add up in Task Manager. Make sure you think about your base systems and check out what your servers are telling you. You may find that they just need a little love and attention to get back to running primo again.
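If you want to see that adding up as a number instead of eyeballing Task Manager, here is a minimal sketch. It assumes the tasklist command is available and that its CSV output has the memory usage in the fifth column in kilobytes, which is how it prints on an English-language system. The process list comes straight from the paragraph above, while the function names are my own.

```python
import csv
import io
import subprocess

# Exchange-related processes named in this article.
EXCHANGE_PROCESSES = {"store.exe", "inetinfo.exe", "emsmta.exe", "mad.exe"}


def parse_tasklist_csv(text):
    """Map image name (lowercased) to total memory usage in KB."""
    usage = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        name = row[0].lower()
        # "245,312 K" -> 245312; keep digits only so separators do not matter
        digits = "".join(ch for ch in row[4] if ch.isdigit())
        if digits:
            usage[name] = usage.get(name, 0) + int(digits)
    return usage


def exchange_memory_kb(usage, names=EXCHANGE_PROCESSES):
    """Pick out the Exchange processes that are actually running."""
    return {n: usage[n] for n in sorted(names) if n in usage}


def run_tasklist():
    """Run tasklist (Windows only): /FO CSV selects CSV output, /NH drops the header row."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist_csv(result.stdout)
```

On the server, sum(exchange_memory_kb(run_tasklist()).values()) gives a rough total in KB to compare against the RAM installed in the box.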