Monitoring SQL Server I/O performance with SAP

The disk I/O performance is often the root cause of performance problems with SAP on SQL Server. There are several ways to monitor the I/O performance: You may use Windows performance counters, SQL Server dm-views or the SAP database monitor. However, caused by the different methods, the measured values may be different and somehow confusing.


We do not discuss here general hardware and Windows configuration or tuning. Instead we want to concentrate on the monitoring features of SAP and SQL Server. The unit of measurement here is a database file, while Windows performance counters measure “logical disks” or “physical disks”. SAP customers typically use I/O systems (consisting of many disks), which can be monitored by 3rd. party tools. Windows considers such an I/O system as a single “physical disk” and therefore cannot measure a real disk within the I/O system.


 


SQL Server I/O measurement
SQL Server tracks the I/O statistics per database file, separately for reads and writes. This includes the total number of I/Os, the total number of bytes transferred and the total stall time (SQL Server wait time on I/O). Based on these numbers you can easily calculate the average response time per I/O. In SQL Server 2000 you can use the function ::fn_virtualfilestats() to query the I/O statistics. In SQL Server 2005 and newer you can also use sys.dm_io_virtual_file_stats(), which shows the same data.


There are several things you should keep in mind when looking at this statistics:


·       First of all, all numbers are simply a sum since start of SQL Server. Calculating an average response time does not give you any idea about the current response time of database I/Os.


·       The way SQL Server 2000 measures the stall time is different from the way SQL Server 2005 and newer does. This results in much lower stall time values on SQL 2000. It does not mean that SQL 2000 is more efficient. The I/O statistics on SQL 2000 is simply not as accurate.


·       SQL Server 2005 used the high-resolution CPU counters (precision: 0.001ms) until Service Pack 2. This may result in faulty I/O statistics when using energy saving technologies, which change the CPU frequency on multi processor servers. Starting with Service Pack 3 a more reliable timer with less precision (1ms) is used. Therefore you should apply at least SP3 if there are any doubts about the reliability of SQL Server I/O statistics. See also http://support.microsoft.com/default.aspx?scid=kb;en-us;931279


·       Seeing high disk read response times does not necessarily point to a hardware issue or a faulty configured I/O system (although this is often the case). It simply means that the I/O system is not able to meet the demands. It does not tell you whether the I/O system is not working properly or the workload is (temporarily) too high. A bad execution plan of a single query may result in full table scans and increased I/O load, if this query is executed very often and in parallel by many users. However, on typical database servers of an SAP system you will rather see increased logical reads than physical reads. This will result in increased CPU load, not in increased I/O load.


 


I/O Performance in SAPs DBA Cockpit
You can easily check the SQL Server I/O statistics using SAP’s transaction DBACOCKPIT (or ST04 for older SAP releases). After starting DBACOCKPIT you have to select “I/O Performance” in the launch pad on the left hand side of the screen.


 clip_image002


The DBA Cockpit uses the SQL Server function fn_virtualfilestats to calculate the read and write response time per file. In the default view, DBA Cockpit shows the response time calculated on the statistics collected since SQL Server start. You can change the default view by pressing one of buttons on the top of the screen. By pressing the button “Since DB start” you will get back to the default view.


clip_image004


Like in many other screens in the DBA cockpit, there is a Reset/Since-reset logic implemented in the I/O Performance screen. Once you press the “Reset” button, SAP stores the current values of fn_virtualfilestats in the context of the logged on SAP user. This has no side effect on other SAP users or SQL Server. When pressing the “Since reset” button, the values of fn_virtualfilestats are retrieved a second time and the current response time based on the delta values are displayed. This gives you a much clearer idea of the current I/O performance.


clip_image006


SAP collects every 20 minutes performance data, including the values of fn_virtualfilestats (independent from the Reset/Since-reset logic). The I/O performance of the last 20 minutes interval can be easily checked by pressing the “Current values” button.


 


SAP’s DB collector
The DB collector consists of an ABAP report and a SQL Agent job. The job runs every 20 minutes and stores SQL Server performance data in a table of the SAP database. An SAP administrator does not need to configure or schedule this. Most of the SAP administrators probably even do not know about it. The great thing with the DB collector is that the data is stored for months and the collector works with almost all available SAP releases.


The ABAP report can be used to display the collected data. The report is part of SAP’s transaction DBACOCKPIT (or ST04). However, the user interface has changed a few times in the past within different SAP releases.


clip_image008


Currently you have to select “Performance” => “Overview” in the launch pad of DBACOCKPIT. Pressing the button “DB Collector” then opens the main screen of the DB Collector. In the future the DB Collector will have its own entry in the launch pad. In older SAP releases there was a button or a menu item in transaction ST04 which started the main screen of the DB Collector.


clip_image010


The counters collected by the DB collector may be technically SQL Performance counters or the result of any SQL Query. Describing all the features of the DB collector or all collected counters goes far beyond a blog entry. We rather want to have a closer look how to check the I/O performance measured by SQL Server from a timeframe in the past. The counters are stored in the SAP database in a highly compressed format. Besides DBA Cockpit, there is currently no other way to read this data. However, you can export the data from DBA Cockpit to an Excel sheet.


When selecting the tab “Time Series” you can query the collected counters. For I/O performance the category “Virtual Filestats (File)” is the most interesting. After entering the start and end date you have to press the “Refresh values” button. As a result you will get a list


clip_image012


As you can see at the size of the scroll bars, SAP collects really a lot of different counters. In the screenshot you can see in the second column the counter “IOStallRead [ms] / read request BSE:2”. It is calculated by dividing the read stall time by the number of read requests within the collector interval (20 minutes) for the file ID 2 of database “BSE”. File ID 2 is always the (first) log file of a database. That’s the reason why you typically do not see many read requests on this file.
You can easily export the whole list to an excel sheet as shown in the screenshot below:


clip_image014
Once the data is in Excel you can concentrate on the most interesting counters and display their values in a diagram.


 


Long-time monitoring using Windows performance counters
The statistics of SQL Server are the preferred source for an DBA to analyze the I/O system. If there is a correlation between bad SQL Server response time and increased I/O response time while the absolute I/O load does not increase, then you should involve the hardware guys. However, they typically do not trust the statistics form SQL Server. At this point in time it is best to run Windows performance counters in order to prove any issue with the disk arrays. The granularity of disk performance counters is a volume, not a file. Therefore the analysis will be much easier if you had separated the different types of SQL Server files to different volumes, even if all volumes reside on the same disk array. You should have different volumes at least for the following groups:


        One group for all data files of the SAP database


        One group for the log file(s) of the SAP database


        One group for the data and log files of tempdb


        One group for the Windows page file


        One group for other files


 


Short-time monitoring using hangman
SAP delivers a VB-script for analyzing a hanging situation of SAP systems running on SQL Server. The script HANGMAN.VBS is described and published in SAP note 948633. A detailed description can also be found here:
http://blogs.msdn.com/saponsqlserver/archive/2008/09/18/new-version-of-hangman-vbs-and-revised-documentation-released.aspx
http://blogs.msdn.com/saponsqlserver/archive/2008/10/24/analyzing-a-hangman-log-file-part-1.aspx
http://blogs.msdn.com/saponsqlserver/archive/2008/11/23/analyzing-a-hangman-log-file-part-2.aspx
The newest version of HANGMAN also collects I/O performance data. The idea of hangman is to get the state of the system from a specific point in time. Therefore you can use HANGMAN to monitor the current I/O response time. It does not tell you anything about the long-time average response time.


The following example is an extract of a HANGMAN log file created on a virtual machine. That’s the reason why you can see here really bad I/O response times. The output contains the number of reads and number of writes within the monitored time frame. Since there was not a single write within the monitored 15 seconds on file SR3DATA1.mdf, you will only see a dot for the response time (ms/write).


clip_image016


The response times, measured by the operating system, are based on volumes, not on files. They typically do not exactly match with the response times measured by SQL Server. Firstly there may be additional I/O on non-database files on the same volume. Secondly the time frame used for the SQL and Windows measurement is similar, but not identical.


clip_image018


 


SQL INSERT time versus I/O write time
INSERTs are typically the most expensive SQL statements in an SAP system, even on high-end database servers with very good write performance. This is often confusing for many customers. The only thing which has to be written for an INSERT is the log record in the databases log file. It results in a single I/O call, which typically simply fills the write cache of the disk array and does not require any physical I/O. All modified data pages are written after the database commit by the lazywriter or checkpoint thread. Therefore the time needed to write the data pages to disk is not included in the time measured for the INSERT. But why does an INSERT typically take so much time?
The answer is a little bit surprising. The most time is spent on physical reads. Almost all tables of an SAP system have a clustered primary key. Therefore the data and index rows are not simply appended at the end of the table. They have to be inserted into well defined pages, which have to be read from disk first. The more indexes a table has, the more time you need for an INSERT. For example, you need 40 logical reads for an INSERT of a single row into a table with a clustered primary key and nine secondary indexes of index depth 4. Dependent on the size and usage of SQL Server’s data cache, this results in about 10 to 15 physical reads. Blocking database locks and possible page splits further increase the runtime of an INSERT.
If INSERTs take very long and you suspect the I/O system as the root cause then you should rather analyze the read performance, not the write performance.