A very quick guide to Monitoring the Performance of your Exchange Server

01/02/2007

Performance monitoring is one of the dark arts which administrators are rarely given the time to master. Performance Monitor on a Windows Server, particularly once Exchange Server is installed, can collect enormous amounts of data, with the result that any understanding of how our particular Exchange Server is performing is invariably lost beneath a sea of pretty but unintelligible peaks and troughs. However, it isn’t actually very difficult to configure Performance Monitor on your Exchange Server, collect some useful data and then interpret the results. It is then often very easy to make changes which can have quite a dramatic effect on the performance of the server and on the experience of Outlook clients.

There is now an enormous amount of information which can help you understand what data to collect and how to interpret the results. The first place to start is the ‘Troubleshooting Microsoft Exchange Server Performance’ whitepaper, which is available to download from go.microsoft.com/fwlink/?LinkId=23454 or to view online. This is a fantastic whitepaper and deserves a couple of hours of any Exchange administrator’s time. For the purposes of this blog I will pick out a few of the main points. I believe that gathering data for only a very few counters can give you a very good starting point in understanding how your Exchange Servers are performing.

RPC Operations

Outlook clients use Remote Procedure Calls (RPC) to interact with the Exchange Server. Any delay by the server in satisfying RPC requests could translate into a client side delay. Therefore collect MSExchangeIS\RPC Averaged Latency. This is the RPC latency in milliseconds, averaged over the past 1024 packets, and it should be below 50 ms at all times. (Make sure you take Outlook in cached mode into account here. If Outlook is in cached mode a lot of operations take place asynchronously. In other words the client operates against a local cache for most operations, so any latency will not be felt directly by the user. Even so, if the majority of my clients were in cached mode and I saw regular RPC latency peaks over 50 ms I would still consider there to be a potential performance bottleneck.) If you do find high RPC latency then there are two areas to look at next. The first is RPC load: are all, or some, of my clients placing an unusually high load on my Exchange Server because of mobile device usage, archive solutions or desktop search engines, for example? (Consider using the ‘Microsoft Exchange Server User Monitor’ to determine the RPC load per user or client version.) The second is bottlenecks elsewhere: is there a delay in satisfying RPC requests because there is a delay committing data to disk, for example?

DSAccess

We all know that Exchange now relies heavily on Active Directory. Under normal operations Exchange will be continuously querying AD to determine the destination for emails, expand distribution groups, determine which DCs and GCs are up and available for use, and so on. If there are delays in satisfying these LDAP requests then we might see client side delays or delays delivering emails, for example. Therefore collect MSExchangeDSAccess Domain Controllers\LDAP Read Time (for all servers) and MSExchangeDSAccess Domain Controllers\LDAP Search Time (for all servers). In both cases the average value should be below 50 ms with spikes no higher than 100 ms. When you view this information for all the DCs listed you might see regular spikes to certain DCs. This is likely to be the topology discovery process, which occurs every 15 minutes. There may indeed be delays here if LDAP queries are being sent to DCs outside the Exchange Server’s local AD site just to determine that they are up and available in the event of problems with the local DCs. Therefore concentrate on the response times from the GCs which are ‘local’ to the Exchange Server. Verify which ones these are using the Directory Access tab in the properties of the Exchange Server in ESM. If we see delays to these servers then there are likely to be insufficient GCs (remember the 4:1 (Exchange processor : GC processor) ratio guideline), or we might want to start some performance monitoring of the DCs themselves. (A symptom of delays here would be maximum values over 10 in SMTP Server\Categorizer Queue Length.)
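As a rough illustration of the check described above, here is a minimal Python sketch which applies the 50 ms average and 100 ms spike limits to the local GCs only. The function name, the server names and the sample figures are all made up for the example, and it assumes you have already extracted the LDAP time samples (in milliseconds) for each DC from your log.

```python
from statistics import mean

# Limits quoted in the text above, in milliseconds.
AVG_LIMIT_MS = 50
SPIKE_LIMIT_MS = 100

def slow_local_gcs(samples_by_dc, local_gcs):
    """Return {gc: (average, maximum)} for local GCs that breach either limit.

    samples_by_dc: hypothetical mapping of DC name -> list of LDAP time samples (ms).
    local_gcs: names of the GCs in the Exchange Server's own AD site.
    """
    breaches = {}
    for gc in local_gcs:
        samples = samples_by_dc.get(gc, [])
        if not samples:
            continue  # nothing was collected for this GC
        avg, peak = mean(samples), max(samples)
        if avg >= AVG_LIMIT_MS or peak > SPIKE_LIMIT_MS:
            breaches[gc] = (avg, peak)
    return breaches

# Illustrative figures only, not real measurements.
ldap_search_ms = {
    "gc1": [12, 18, 25, 40],        # healthy
    "gc2": [30, 45, 160, 35],       # one spike above 100 ms
    "remote-dc": [200, 220, 210],   # outside the local site, so ignored
}
result = slow_local_gcs(ldap_search_ms, ["gc1", "gc2"])
print(result)
```

Leaving the remote DC out of the list of local GCs is deliberate: its slow responses are expected and would otherwise drown out the servers that actually matter to the clients.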

Memory Usage

Exchange Server is a memory intensive application. Administrators need to look at both user (application) mode and kernel mode memory consumption. For user mode memory consumption look at Memory\Available MBytes and Memory\Pages/sec. There should be at least 50 MB available at any one time and Pages/sec should generally not rise beyond 1000. For kernel mode memory consumption look at Memory\Pool Nonpaged Bytes, Memory\Pool Paged Bytes and Memory\Free System Page Table Entries. Values should not exceed 100 MB for Pool Nonpaged Bytes and 200 MB for Pool Paged Bytes. Free System PTEs should not drop below 8,000. I would use Performance Monitor in conjunction with ExBPA here. ExBPA (www.exbpa.com) is a vital tool for all Exchange administrators. Firstly, it will highlight whether your server is configured incorrectly, i.e. whether you have the correct switches in boot.ini (/3GB and /USERVA=3030) and the correct settings in the registry (HeapDeCommitFreeBlockThreshold, for example). Secondly, ExBPA will highlight any kernel memory counters that fall within its warning thresholds and, if so, will offer some possible steps to troubleshoot further.
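The memory limits above point in different directions: some counters must not fall below a floor, others must not rise above a ceiling. A small Python sketch of that rule table follows. The structure, the function name and the observed figures are my own illustration rather than anything produced by Performance Monitor or ExBPA, and it assumes you have already read the lowest and highest value of each counter from your log.

```python
MB = 1024 * 1024

# (counter, kind, limit). "min": the value must stay at or above the limit.
# "max": the value must stay at or below it. Limits are those quoted in the text.
MEMORY_RULES = [
    ("Memory\\Available MBytes", "min", 50),
    ("Memory\\Pages/sec", "max", 1000),
    ("Memory\\Pool Nonpaged Bytes", "max", 100 * MB),
    ("Memory\\Pool Paged Bytes", "max", 200 * MB),
    ("Memory\\Free System Page Table Entries", "min", 8000),
]

def memory_warnings(observed):
    """observed: hypothetical dict of counter -> (lowest, highest) value seen."""
    warnings = []
    for counter, kind, limit in MEMORY_RULES:
        if counter not in observed:
            continue  # counter was not collected
        lowest, highest = observed[counter]
        if kind == "min" and lowest < limit:
            warnings.append(f"{counter} dropped to {lowest} (limit {limit})")
        elif kind == "max" and highest > limit:
            warnings.append(f"{counter} rose to {highest} (limit {limit})")
    return warnings

# Illustrative figures only.
observed = {
    "Memory\\Available MBytes": (35, 400),
    "Memory\\Pool Paged Bytes": (150 * MB, 180 * MB),
    "Memory\\Free System Page Table Entries": (7500, 12000),
}
warnings = memory_warnings(observed)
for line in warnings:
    print(line)
```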

Disks

There have been numerous articles, blogs and whitepapers concerning storage design and the sensitivity of Exchange to poorly designed disk subsystems. If your disks aren’t performing well your Outlook clients will suffer the consequences. Generally you want to isolate transaction logs, databases, SMTP directories, temp directories, page files, the operating system and so on, because when these different processes, which exhibit very different disk I/O patterns, compete with each other for disk I/O, performance suffers. The most basic example is transaction logs and databases. A write operation by a client is written to cache and then to a transaction log on the transaction log drive (therefore delays to writes here will result in client side delay). We would not expect to see reads on the log drive because we do not read from these drives unless we are replaying logs after a restore, for example. The I/O pattern of a transaction log drive is therefore high sequential writes and low reads. In comparison, a database drive, which contains a large file made up of millions of 4 KB pages, will be reading and writing random pages potentially spread right across the disk. A client does not write directly to a database but does read from it. Therefore the client will be much more sensitive to delays in reading from the database disks.

To understand any delays to the disks we firstly need to understand what is located where. ExBPA will gather this information if you do not want to collect it manually. Then we collect PhysicalDisk\Avg. Disk sec/Read and PhysicalDisk\Avg. Disk sec/Write for all drives. For database disks the average value should be below 20 ms with spikes no higher than 50 ms. For transaction log disks the Read values should average below 5 ms with spikes no higher than 50 ms, whereas the Write values should average below 10 ms with spikes no higher than 50 ms.
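One thing to watch when comparing against these figures is that the disk counters report seconds, not milliseconds, so a logged value of 0.020 is 20 ms. The following Python sketch shows the conversion and the per-role limits. The function and the sample values are illustrative only, and I have assumed that the database limit applies to both reads and writes.

```python
# Limits from the text, in milliseconds: (average limit, spike limit).
# Assumption: the database figures apply to reads and writes alike.
DISK_LIMITS_MS = {
    ("database", "read"): (20, 50),
    ("database", "write"): (20, 50),
    ("log", "read"): (5, 50),
    ("log", "write"): (10, 50),
}

def check_disk(role, op, samples_sec):
    """Return (avg_ms, max_ms, ok) for latency samples given in seconds."""
    ms = [s * 1000 for s in samples_sec]   # counters report seconds
    avg_ms = sum(ms) / len(ms)
    max_ms = max(ms)
    avg_limit, spike_limit = DISK_LIMITS_MS[(role, op)]
    ok = avg_ms < avg_limit and max_ms <= spike_limit
    return round(avg_ms, 1), round(max_ms, 1), ok

# Illustrative samples only.
log_write = check_disk("log", "write", [0.004, 0.006, 0.030, 0.012])
db_read = check_disk("database", "read", [0.008, 0.010, 0.030])
print(log_write)
print(db_read)
```

In this made-up example the log drive fails on its average even though no single spike exceeds 50 ms, which is exactly the case that is easy to miss when eyeballing a graph.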

As discussed above, a write request by the client will be written to cache first before being written to a transaction log. This area of cache is reserved by the application and is called the log buffers. If there is a delay in write requests to the transaction log drive then the log buffers may fill before the writes to disk complete. This will result in a stall. To measure this, look at Database\Log Record Stalls/sec. On a healthy server average values should be below 10 per second with spikes no higher than 100 per second. If we see log stalls beyond these values then the number of log buffers can be tuned. This is an attribute of each storage group (msExchESEParamLogBuffers). See the ‘Optimizing Storage for Exchange Server 2003’ whitepaper for more information about disk configurations and Exchange Server.

Processor Utilisation

Lastly we should have a quick look at processor utilisation. In my opinion we just need a snapshot of how the processor is handling its workload. To get this, collect Processor\% Processor Time (_Total). On a healthy server the average value should be below 90% at all times. If processor utilisation is very high then we can offload roles like distribution group expansion to other servers and move backups and similar jobs to quiet times, but generally this means that you need to consider a higher specification server or consider spreading the load across other Exchange Servers.

I recommend configuring Performance Monitor directly on your Exchange Server and collecting data every 30 seconds for a period of 24 hours (set Performance Monitor to stop automatically after 24 hours). Start the collection at the start of the working day. At the end of the 24 hours load the log file into Performance Monitor and drag the time slider across so that you are shown only the average and maximum values for the working day. These are the figures that I would compare with those above.
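If you prefer to work out the working-day figures outside Performance Monitor, the same idea can be sketched in Python against a CSV copy of the log (the relog command with -f CSV is one way to produce one). A 30 second interval over 24 hours gives 2,880 samples per counter. The sketch below rests on several assumptions of mine: that the first column holds timestamps in MM/DD/YYYY HH:MM:SS.mmm form, that the working day runs from 09:00 to 17:30, and that the server name EXCH01 and the sample values stand in for your own.

```python
import csv
import io
from datetime import datetime, time

def working_day_stats(csv_text, start=time(9, 0), end=time(17, 30)):
    """Return {column: (average, maximum)} for samples taken between start and end."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                      # first column is the timestamp
    values = {name: [] for name in header[1:]}
    for row in reader:
        if not row:
            continue
        stamp = datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S.%f")
        if not (start <= stamp.time() <= end):
            continue                           # outside the working day
        for name, cell in zip(header[1:], row[1:]):
            try:
                values[name].append(float(cell))
            except ValueError:
                pass                           # blank cell: sample missing
    return {n: (sum(v) / len(v), max(v)) for n, v in values.items() if v}

# Tiny made-up log: one overnight sample and two during the working day.
sample = r'''"(PDH-CSV 4.0) (GMT Standard Time)(0)","\\EXCH01\MSExchangeIS\RPC Averaged Latency"
"01/02/2007 03:00:00.000","95"
"01/02/2007 09:00:30.000","20"
"01/02/2007 12:00:00.000","60"
'''
stats = working_day_stats(sample)
for counter, (avg, peak) in stats.items():
    print(counter, avg, peak)
```

The overnight sample of 95 is excluded, which mirrors dragging the time slider so that a backup window does not skew the working-day average.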

I do not pretend that this is a definitive guide to performance monitoring but I think it is a good place to start. If you come across any potential bottlenecks you may need to gather more information, and as you become more adept at using Performance Monitor you can add more counters to your performance baselines and still be able to interpret them successfully.