My CPU is at 100%, What Can I Do About It?
Processor capacity and Terminal Servers
On a Terminal Server, a considerable number of users work on the same server at the same time. All of those users share the resources available in the system, and because of this sharing they all notice when resources become scarce. At the top of the list of commonly scarce resources, the CPU is definitely number one.
The main reasons for this scarcity are the changeable availability of CPU capacity and the unpredictable usage of the CPU by various applications.
When CPU resources are scarce, users immediately feel the impact. The performance of the Terminal Server decreases, which normally leaves users with the feeling that their application is not responding. Logically, Terminal Server administrators need to ensure that the performance of the servers is acceptable for the user environment. In other words, the administrator needs to control the CPU usage in such a way that the users do not experience decreases in performance.
What if your Terminal Servers are constantly displaying the behavior that CPU resources are (almost) completely exhausted? To solve this behavior the environment needs to be investigated thoroughly. The first step for this investigation is to collect information to analyze why this behavior is occurring.
Regrettably, exhausted CPU resources do not occur immediately after the implementation of the Terminal Server environment but after a while has passed. Therefore it is very important to perform a baseline measurement of your servers. The following baseline measurements are ideal for analyzing your environment:
- Resource usage with no users connected right after installation and configuration of the server(s).
- Resource usage with a normal number of users (performing their normal work) just after the roll-out.
It is also a good idea to collect the same performance counters on a regular basis for trend analysis. I advise monitoring the following performance counters:
- Processor: % Processor Time
- System: Processor queue length
- Memory: Pages/sec
- Memory: % Committed bytes in use
- Network interface: bytes total/sec
- Physical disk: % disk time
- Physical disk: Avg. Disk Queue Length
- Paging file: % Usage
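The baseline counters above can be logged with the Windows typeperf command-line tool. The following is a minimal sketch that only builds the command line. The counter paths are my own mapping of the list to the Windows `\Object(Instance)\Counter` syntax, and the interval, sample count, and output file name are assumptions. Check the paths on your own server with `typeperf -q` before relying on them.

```python
# Counter paths are an assumed mapping of the list above to the
# "\Object(Instance)\Counter" syntax; verify them with "typeperf -q".
BASELINE_COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\System\Processor Queue Length",
    r"\Memory\Pages/sec",
    r"\Memory\% Committed Bytes In Use",
    r"\Network Interface(*)\Bytes Total/sec",
    r"\PhysicalDisk(_Total)\% Disk Time",
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
    r"\Paging File(_Total)\% Usage",
]


def build_typeperf_command(counters, interval_s=15, samples=240,
                           output="baseline.csv"):
    """Return the argument list for one typeperf logging run.

    -si is the sample interval in seconds, -sc the number of samples,
    -f the log format and -o the output file.
    """
    return ["typeperf", *counters,
            "-si", str(interval_s), "-sc", str(samples),
            "-f", "CSV", "-o", output]


if __name__ == "__main__":
    cmd = build_typeperf_command(BASELINE_COUNTERS)
    # Print the command for review; arguments with spaces are quoted.
    print(" ".join(f'"{arg}"' if " " in arg else arg for arg in cmd))
```

With the assumed defaults this describes one hour of samples at 15-second intervals. On the server itself the returned list could be passed to `subprocess.run` to start the logging run.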
As soon as your server actually experiences high CPU usage, you should monitor these performance counters again. You should also extend your monitoring to more detailed performance counters such as:
- Physical disk: % Read time
- Physical disk: % Write time
- Physical disk: Current disk queue length
- Paging file: % Usage peak
- Processor: % Interrupt time
- Processor: % User time
- Network Interface: bytes received/sec
- Network Interface: Bytes Sent/sec
- Memory: Page faults/sec
- Memory: Page reads/sec
- Memory: Page writes/sec
- Process: % Processor Time
- Thread: % Processor Time
It is important to monitor several kinds of resources to analyze your CPU behavior.
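To make the baseline useful, the values collected later have to be compared against it. The sketch below shows one possible way to do that: it flags every counter whose current average exceeds the baseline average by a given factor. The factor of 1.5 and the sample values are assumptions chosen for illustration, not recommended thresholds.

```python
from statistics import mean


def compare_to_baseline(baseline, current, factor=1.5):
    """Return counters whose current mean exceeds `factor` times the baseline mean.

    Both arguments map a counter name to a list of numeric samples.
    The result maps each flagged counter to (baseline mean, current mean).
    """
    flagged = {}
    for name, samples in current.items():
        base = baseline.get(name)
        if not base or not samples:
            continue  # no reference data or no new data for this counter
        base_avg, cur_avg = mean(base), mean(samples)
        if cur_avg > factor * base_avg:
            flagged[name] = (base_avg, cur_avg)
    return flagged


if __name__ == "__main__":
    # Illustrative sample values only.
    baseline = {
        "Processor: % Processor Time": [30, 40, 35],
        "Memory: Pages/sec": [10, 12, 14],
    }
    current = {
        "Processor: % Processor Time": [90, 95, 100],
        "Memory: Pages/sec": [11, 13, 12],
    }
    for name, (base_avg, cur_avg) in compare_to_baseline(baseline, current).items():
        print(f"{name}: baseline {base_avg}, now {cur_avg}")
```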
Analyzing performance counters
Collecting all these performance counters is easy, but analyzing these counters and drawing a conclusion from these data is another story entirely. Every situation is different, so there is no standard manual for analyzing and pointing out a cause. The most common situations are described below.
Single application level
One possible cause is a single application that consumes too many CPU resources. The main reason for this is probably code that was not written with use on a Terminal Server in mind. Often these kinds of applications claim CPU resources and do not release them when the task is finished, or a specific task consumes all of the CPU resources for a long time without giving other threads the chance to use some CPU capacity. If this behavior is present, the counters Processor: % Processor Time, System: Processor queue length, Process: % Processor Time and Thread: % Processor Time will show high values.
Multiple application level
When there are multiple applications on a system it is possible that these applications are fighting to get some CPU capacity. Because of these conflicts the CPU can become overwhelmed with handling those requests with the result that the tasks behind these requests cannot be handled anymore. Since we are talking about CPU capacity the counter Processor: % Processor Time will have a high value. Besides this, the counter Processor: % Interrupt time will have a high value, while the counter Processor: % User time will have a low value.
When analyzing the performance data you may see that counters for other resources also show high values, such as Paging File: % Usage, Memory: Page faults/sec, Physical disk: % disk time and comparable counters. In this situation it is possible that another hardware resource is overwhelmed and that this is driving the CPU usage up. In this case you should work out which component also shows high values and try to figure out why that resource is overwhelmed.
Too many users level
When only Processor: % Processor Time shows a high value (in special cases together with Process: % Processor Time and Thread: % Processor Time) and no other counters do, there is no demonstrable cause. This could mean that the CPU is at its limit under normal usage of the applications. If this behavior occurs, the sizing of the Terminal Servers should be reviewed and scaled again.
Too many applications
This situation is comparable to the too many users level. Again, the sizing of the Terminal Servers should be scaled.
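The situations described in this section can be written down as a small set of rules. The sketch below is only one possible reading: the caller decides which counters count as "high" or "low" (no thresholds are given here), and the order in which the rules are checked is my own assumption. Every situation is different, so treat the result as a starting point for analysis rather than a diagnosis.

```python
CPU = "Processor: % Processor Time"
QUEUE = "System: Processor queue length"
PROCESS = "Process: % Processor Time"
THREAD = "Thread: % Processor Time"
INTERRUPT = "Processor: % Interrupt time"
USER = "Processor: % User time"
OTHER_RESOURCES = {
    "Paging File: % Usage",
    "Memory: Page faults/sec",
    "Physical disk: % disk time",
}


def classify(high, low=frozenset()):
    """Map the sets of high and low counters to one of the described situations.

    The rule order is an assumption: another overwhelmed resource is checked
    first, then the interrupt/user pattern, then the single application pattern.
    """
    high, low = set(high), set(low)
    if CPU not in high:
        return "no CPU bottleneck"
    if high & OTHER_RESOURCES:
        return "resource level"
    if INTERRUPT in high and USER in low:
        return "multiple application level"
    if {QUEUE, PROCESS, THREAD} <= high:
        return "single application level"
    # Only the CPU counter (possibly with Process/Thread) is high.
    return "too many users or too many applications"


if __name__ == "__main__":
    print(classify({CPU, QUEUE, PROCESS, THREAD}))
    print(classify({CPU, INTERRUPT}, low={USER}))
    print(classify({CPU}))
```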
Setting up a solution
After analyzing the data and making some conclusions, the next step is to set up a solution. Logically the solution depends on the conclusions made.
Adding more hardware
If the CPU is constantly at a high level, you should consider upgrading the servers. Remember, though, that the systems cannot be extended endlessly. Best practices show that a Terminal Server performs best with two processors. When more processors are added, other resources become the bottleneck in the system, so the processors are not used fully. Also, lots of applications don’t use additional processors because the software is not written for multi-processor systems. Adding an additional CPU can be considered when the machine currently has only one CPU and the conclusion is the too many users level, too many applications or single application level.
Adding more Terminal Servers
Adding more Terminal Servers to your environment can also add more CPU capacity. Dividing the users over more servers automatically means more resources are available per user. On the other hand, the more Terminal Servers you have in your infrastructure, the more administrative work there is, because every server needs administration and monitoring. The budget must also be available to purchase the additional hardware. This solution can be used for the resource level conclusion (another hardware resource is overwhelmed) and the too many users level conclusion.
Introducing the Silo concept
The silo concept is a well-known way to solve application conflicts. Applications that are known to cause conflicts are placed on dedicated Terminal Servers. On such a Terminal Server only that application (or a small set of applications) is installed. Through this separation the most used (normal) applications do not suffer from the influence of those “bad” applications. This solution also requires additional hardware to be purchased, which adds extra administrative duties for the IT department. This concept can be used for the single application level, multiple application level and too many applications conclusions.
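The three hardware-based solutions and the conclusions they apply to can be summarized in a small lookup table. This sketch only restates the pairings given in the text above. The dictionary layout and the function name are my own, and it does not weigh cost or administrative effort.

```python
# Conclusion names follow the headings used in the analysis section.
SOLUTIONS = {
    # Only worth considering when the machine currently has one CPU.
    "adding an additional CPU": {
        "too many users level",
        "too many applications",
        "single application level",
    },
    "adding more Terminal Servers": {
        "resource level",
        "too many users level",
    },
    "silo concept": {
        "single application level",
        "multiple application level",
        "too many applications",
    },
}


def candidate_solutions(conclusion):
    """Return the solutions the text lists for a conclusion, sorted by name."""
    return sorted(name for name, fits in SOLUTIONS.items() if conclusion in fits)


if __name__ == "__main__":
    for conclusion in ("too many users level", "multiple application level"):
        print(conclusion, "->", candidate_solutions(conclusion))
```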
Performance Management tools
The solutions mentioned above all use additional hardware in some way. You can also use a software solution to address this kind of behavior. These so-called performance management tools can be divided into two groups. The first group of products uses the priority method within Windows. Through several means (algorithms, fixed values) the products determine which processes should get a lower priority. Products which work in this way are RES PowerFuse and Max-IT. The second group uses so-called CPU clamping. With CPU clamping the product detects when a specified CPU usage is exceeded and controls the threads in such a way that the CPU usage is immediately brought back below this value. Another way these products work is by setting a maximum CPU usage for each thread. Well-known products are AppSense Performance Suite, WMSoftware’s Relevos and Threadmaster.
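The difference between the two groups can be illustrated with a toy model. This is not how any of the products named above are implemented, and it does not control the Windows scheduler. It only shows the two decision rules on a table of per-process CPU percentages. The process names, the 50% threshold and the 90% limit are hypothetical values.

```python
def lower_priority_candidates(usage, threshold=50.0):
    """Priority method with a fixed value: select processes above `threshold` % CPU.

    A real tool would then lower the Windows priority of these processes.
    """
    return sorted(name for name, pct in usage.items() if pct > threshold)


def clamp_total(usage, limit=90.0):
    """Clamping: scale all processes down proportionally once the total exceeds `limit`."""
    total = sum(usage.values())
    if total <= limit:
        return dict(usage)  # under the limit, nothing is changed
    scale = limit / total
    return {name: pct * scale for name, pct in usage.items()}


if __name__ == "__main__":
    # Hypothetical processes and CPU percentages.
    usage = {"winword.exe": 20.0, "report.exe": 80.0}
    print(lower_priority_candidates(usage))
    print(clamp_total(usage))
```

The priority method leaves the total CPU usage untouched and only changes which process wins when the CPU is contended, while clamping enforces a ceiling on the usage itself.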
The most important step when dealing with high CPU usage is thoroughly analyzing your infrastructure before drawing a conclusion. When the cause is found there are several solutions. Compare the possible solutions and choose the one which best fits within your organization and infrastructure.