Performance Testing the Microsoft OLE DB Provider for DB2 (Part 1)


There are many factors that can affect pulling data from DB2 into SQL Server (or other applications): network speed, disk speed, CPU speed, memory, number of CPUs, and the load on both the DB2 and SQL Server machines. There are too many factors to account for them all.

One thing the provider offers is the ability to pre-fetch data from the DB2 server before the consumer asks for it. This feature is normally only used with SSIS or SSAS, but it can be used by other consumers. It should ONLY be used when moving large amounts of data (we recommend millions of rows), though there are times when it can be useful in DQP applications (such as SQL Server’s Query Analyzer). Pre-fetching is controlled by a connection string parameter, “Rowset Cache Size”. The default for this setting is 0 (off), but it can be set to almost any number (we do not recommend going over 300 at this point). The parameter lets us fetch data from DB2 before the consumer actually requests it. For example, if SSIS asks for 10,000 rows of data and this parameter is set to 30, we ask DB2 to send us 300,000 rows and store them internally, avoiding the delay of constantly asking DB2 for smaller chunks of data.
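For example, with a cache size of 30, the only change to an otherwise normal connection string is the extra parameter at the end (the values below are placeholders):

Provider=DB2OLEDB;User ID=USERNAME;Password=password;Network Address=DB2400;Network Port=446;Rowset Cache Size=30;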

But how do you know what to set this to? Normally we recommend 30 as a starting point, but finding the right value for your environment can be cumbersome to test.

Recently I worked a case that involved testing multiple queries across multiple linked servers, each with a different setting for the Rowset Cache Size. I did not care about the actual data; I was only concerned with performance at the different settings. Manually creating queries and linked servers and gathering timing numbers was cumbersome at best, and I decided there had to be a better way.

So, I came up with 4 SQL scripts that do all the work for me. They are generic enough to be used by anyone, and only 1 script file needs modification. Since the scripts are generic in nature, I named them GenericPart1.sql, GenericPart2.sql, GenericPart3.sql, and GenericPart4.sql. These scripts assume you have a database called DB2TestDatabases.

So, what do these do?

GenericPart1.sql

This script file creates 3 tables. 2 of the tables are used to build the 3rd.

To begin with, you need to make a few changes.

Ignoring the declare section, skip down to this section:

-- ********************* BEGIN USER DEFINED VARIABLES *********************

-- ***** first, need your connection string, WITHOUT ROWSET CACHE SIZE

set @constr = N'Provider=DB2OLEDB;User ID=USERNAME;Password=password;Initial Catalog=DB2400;Network Transport Library=TCP;Host CCSID=37;PC Code Page=1252;Network Address=DB2400;Network Port=446;Package Collection=SCHEMA;Default Schema=SCHEMA;Process Binary as Character=False;Units of Work=RUW;Default Qualifier=SCHEMA;DBMS Platform=DB2/AS400;Defer Prepare=False;Persist Security Info=True;Connection Pooling=False;Derive Parameters=False;'

-- ***** supply your servername prefix, I use the name of system I'm connecting to

set @srvprefix = N'DB2400_'

-- ***** supply your SQL queries

set @sqlquery1 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'

set @sqlquery2 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'

set @sqlquery3 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'

set @sqlquery4 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'

set @sqlquery5 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'

set @sqlquery6 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'

set @sqlquery7 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'

set @sqlquery8 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'

set @sqlquery9 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'

-- ********************* END USER DEFINED VARIABLES *********************

-- create QueryTable

-- this table is used for looping purposes

-- when creating the trials table later

-- it can contain as many queries as you wish to test with

create table [dbo].[QueryTable] (query nVarChar(2000))

insert into [dbo].[QueryTable] values (@sqlquery1)

insert into [dbo].[QueryTable] values (@sqlquery2)

insert into [dbo].[QueryTable] values (@sqlquery3)

insert into [dbo].[QueryTable] values (@sqlquery4)

insert into [dbo].[QueryTable] values (@sqlquery5)

insert into [dbo].[QueryTable] values (@sqlquery6)

insert into [dbo].[QueryTable] values (@sqlquery7)

insert into [dbo].[QueryTable] values (@sqlquery8)

insert into [dbo].[QueryTable] values (@sqlquery9)

Only a few changes are needed here to make this usable in your environment.

The variable @constr is set to your connection string, without the “Rowset Cache Size=x” parameter. That parameter will be appended later, when the linked servers are created.

The variable @srvprefix can be anything you want; I use the name of the DB2 server I want to connect to. It becomes the base of the linked server names, which are numbered incrementally (more on that later).

Next, enter the SQL Queries you want to run. In my case, I wanted to run 3 queries, with each of them pulling in 10,000, 100,000 and 1,000,000 rows of data. Nice and simple, but you can use any query you want here. I also wanted them to run 3 times each, so I could average the times to get a better idea of performance over time.

Once you have the queries, make sure the insert statements match up; in my case, 9 inserts. This piece of code fills a table that is later read in a loop to build the set of tests to run.

Once the above is completed, you are done with all the modifications you need to make to the script files.

The next part of the script creates another table, which contains the linked server names and the final connection string for each linked server:

-- Create Linked Server names in a table

-- table contains the name of the linked server

-- we also construct the data provider connection string here

-- appending the Rowset Cache Size setting, from 0 to 300 in increments of 10

create table [dbo].[LinkServerNames] (servername Nvarchar(255), constr Nvarchar(2000))

set @i = 0

while @i < 301 -- creating linked server names for Rowset Cache Size 0, 10, 20, ... 300

begin

set @istr = cast(@i as nvarchar(3))

-- set connection string property now

set @realconstr = @constr + 'Rowset Cache Size=' + @istr + ';'

set @j = len(@istr)

-- and now build istr up to 3 characters

-- for linked server name

set @rowstr = '000'

set @rowstr = stuff(@rowstr,4-@j,@j,@istr)

-- insert data

insert into [dbo].[LinkServerNames] values(@srvprefix + @rowstr, @realconstr )

set @i = @i + 10 -- increment value (was 30)

end

As I mentioned, the table contains 2 columns: the linked server name and the connection string. These are built in a loop which, in this case, creates 31 rows. The linked server names use the prefix you entered earlier plus a 3-digit number; in my case, I would have DB2400_000, DB2400_010 … DB2400_300. The final connection string is also set up here, by appending the “Rowset Cache Size” parameter to the connection string entered earlier.

After the above is run, the next part of this script creates the linked servers. I will leave that part of the script out here; it builds and executes SQL commands similar to what you would run if creating the linked servers manually (the commands are similar to what you would get if you scripted a linked server from SSMS to a query window).
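For reference, here is a minimal sketch of what that part does for each row in the LinkServerNames table (this is my sketch, not the actual script; the login handling in particular may differ in your environment):

exec master.dbo.sp_addlinkedserver
     @server = @servername          -- e.g. DB2400_030
    ,@srvproduct = N'Microsoft OLE DB Provider for DB2'
    ,@provider = N'DB2OLEDB'
    ,@provstr = @realconstr         -- the connection string with Rowset Cache Size appended
-- map a login so distributed queries can connect
exec master.dbo.sp_addlinkedsrvlogin
     @rmtsrvname = @servername
    ,@useself = N'False'
    ,@locallogin = NULL
    ,@rmtuser = N'USERNAME'
    ,@rmtpassword = N'password'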

Finally, this script creates the last table. This table will be used by both GenericPart2.sql and GenericPart4.sql.

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TRIALS]') AND type in (N'U'))

DROP TABLE [dbo].[TRIALS]

CREATE table [dbo].[TRIALS]

(trialname Nvarchar(255) -- test name

,servername Nvarchar(255) -- linked server name

,command Nvarchar(2000) -- command to run

,StartTime DateTime -- start time

,EndTime DateTime -- end time

,durationMs int -- run time in milliseconds

)

The table contains a test name (built on loop numbers), the linked server name, and the command which will be run. It also contains columns for the start and end times of the test, and the duration (in milliseconds) of the test.

Here are a few sample rows created in the table:

trialname | servername | command | StartTime | EndTime | durationMs
Trial #: 001-001 | DB2400_000 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0
Trial #: 001-002 | DB2400_010 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0
Trial #: 001-003 | DB2400_020 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0

In the above, you can see that the 1st ‘run’ uses the query that pulls 10,000 rows, against multiple linked servers; the linked servers shown have Rowset Cache Size set to 0, 10 and 20. In my case, the table has 279 rows: 3 separate queries, repeated 3 times, against 31 linked servers (3 × 3 × 31 = 279).
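The part of GenericPart1.sql that fills the TRIALS table is not shown above, but conceptually it is two nested cursor loops over the two helper tables. A minimal sketch (my reconstruction; the actual script may differ):

-- outer loop: one group per row in QueryTable; inner loop: one trial per linked server
declare @trialnum int, @groupnum int
declare @query nvarchar(2000), @servername nvarchar(255)
set @trialnum = 0
set @groupnum = 0
declare query_cursor cursor for select query from [dbo].[QueryTable]
open query_cursor
fetch next from query_cursor into @query
while @@FETCH_STATUS = 0
begin
    set @groupnum = @groupnum + 1
    declare server_cursor cursor for select servername from [dbo].[LinkServerNames] order by servername
    open server_cursor
    fetch next from server_cursor into @servername
    while @@FETCH_STATUS = 0
    begin
        set @trialnum = @trialnum + 1
        -- trial name carries the group number and a running counter, e.g. 'Trial #: 005-125'
        insert into [dbo].[TRIALS]
        values (N'Trial #: ' + right(N'00' + cast(@groupnum as nvarchar(3)), 3)
                + N'-' + right(N'00' + cast(@trialnum as nvarchar(3)), 3)
               , @servername, @query, NULL, NULL, 0)
        fetch next from server_cursor into @servername
    end
    close server_cursor
    deallocate server_cursor
    fetch next from query_cursor into @query
end
close query_cursor
deallocate query_cursor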

GenericPart2.sql

This is the workhorse of the test scripts provided here.

This script opens a cursor against the TRIALS table (created above), reads in the linked server name and query, and builds an exec statement, then executes it as follows:
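-- note: @trial, @server, @command, @starttime, @endtime, and @success are declared in the script's declare section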

DECLARE trial_cursor CURSOR FOR

SELECT trialname, servername, command

FROM [dbo].[TRIALS]

Open trial_cursor

FETCH NEXT FROM trial_cursor into @trial, @server, @command

WHILE @@FETCH_STATUS = 0

BEGIN

PRINT @trial + N' ' + @server

-- append server location to the command

SET @command= N'exec (''' +@command+ ''') at ' + @server

-- run it

SET @starttime = getdate()

exec @success=sp_executesql @command

Once the exec statement returns, the script updates the TRIALS table with the start and end times and the duration of the test, then loops back until all tests in the table are completed.

-- set the end time only if the exec succeeded; leave it NULL if it failed

If @success=0 SET @endtime = getdate()

-- save the times and durations

Update [dbo].[TRIALS]

Set StartTime = @starttime, EndTime=@endtime, DurationMs=DateDiff(MS, @starttime, @endtime)

WHERE trialname =@trial

-- reset the times for the next iteration (in case of failure)

SET @starttime =NULL

SET @endtime =NULL

FETCH NEXT FROM trial_cursor into @trial,@server, @command

END

CLOSE trial_cursor

DEALLOCATE trial_cursor

If an error happens, the script will stop running. There is no error checking, as I could not find a way to catch DB2 errors; it is possible to catch SQL errors using a TRY-CATCH, but that requires a bit more work.
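If you do want basic SQL-side error handling, a minimal sketch (my own addition, not part of these scripts) would wrap the exec like this:

BEGIN TRY
    SET @starttime = getdate()
    exec @success = sp_executesql @command
    SET @endtime = getdate()
END TRY
BEGIN CATCH
    -- log the SQL-side error; @endtime stays NULL so the TRIALS row records the failure
    PRINT N'Trial ' + @trial + N' failed: ' + ERROR_MESSAGE()
    SET @endtime = NULL
END CATCH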

After the script runs, the TRIALS table will contain data similar to the below:

trialname | servername | command | StartTime | EndTime | durationMs
Trial #: 004-123 | DB2400_290 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | 5/11/2009 10:47:17 PM | 5/11/2009 10:47:27 PM | 9736
Trial #: 004-124 | DB2400_300 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | 5/11/2009 10:47:27 PM | 5/11/2009 10:47:37 PM | 9743
Trial #: 005-125 | DB2400_000 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:47:37 PM | 5/11/2009 10:49:10 PM | 92610
Trial #: 005-126 | DB2400_010 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:49:10 PM | 5/11/2009 10:50:58 PM | 108460
Trial #: 005-127 | DB2400_020 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:50:58 PM | 5/11/2009 10:52:45 PM | 107063
Trial #: 005-128 | DB2400_030 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:52:45 PM | 5/11/2009 10:54:29 PM | 104143

In the above table, you can see where test 4 (reading 10,000 rows) switched to test 5 (reading 100,000 rows); these were the 123rd through 128th queries executed overall. The last 3 columns contain the start and stop times, plus the duration in milliseconds. Also, from the linked server names, you can see that these tests used Rowset Cache Size settings of 290, 300, 0, 10, 20, and 30.

I would also highly recommend running this script from a command line, piping the output to nul.
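For example, with sqlcmd (the server and database names here are placeholders for your own):

sqlcmd -S YOURSERVER -d DB2TestDatabases -i GenericPart2.sql > nul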

GenericPart3.sql

This is a simple script that cleans up most of what GenericPart1.sql created. It does not delete the TRIALS table, but it removes the linked servers that were created and drops the LinkServerNames and QueryTable tables.
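A minimal sketch of that cleanup (my reconstruction; the actual script may differ):

-- drop each linked server, then the helper tables
declare @servername nvarchar(255)
declare cleanup_cursor cursor for select servername from [dbo].[LinkServerNames]
open cleanup_cursor
fetch next from cleanup_cursor into @servername
while @@FETCH_STATUS = 0
begin
    exec master.dbo.sp_dropserver @server = @servername, @droplogins = 'droplogins'
    fetch next from cleanup_cursor into @servername
end
close cleanup_cursor
deallocate cleanup_cursor
drop table [dbo].[LinkServerNames]
drop table [dbo].[QueryTable]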

GenericPart4.sql

Now, after running GenericPart2.sql, you have a table full of performance data, but the data is intermingled. You know you should have 3 timings for the 10,000-row query against each of the linked servers, but how do you get that out? And, even better, can you average the 3 test times? Well, yes you can, with this script.

This script builds a PIVOT SQL query against the TRIALS table, using the actual query text. To pick up only the 3 distinct queries, it first opens a cursor against the TRIALS table, reading in the DISTINCT commands that were issued. These are saved in a string used by the pivot query (the command strings become the column headers).
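Something like this builds that string (a sketch of the idea, not the exact script):

declare @cols nvarchar(max), @cmd nvarchar(2000)
set @cols = N''
declare cmd_cursor cursor for select distinct command from [dbo].[TRIALS]
open cmd_cursor
fetch next from cmd_cursor into @cmd
while @@FETCH_STATUS = 0
begin
    if @cols <> N'' set @cols = @cols + N', '
    -- wrap each command in brackets to form a column name (note quotename() handles at most 128 characters)
    set @cols = @cols + quotename(@cmd)
    fetch next from cmd_cursor into @cmd
end
close cmd_cursor
deallocate cmd_cursor
-- @cols is then spliced into the dynamic PIVOT statement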

Next, the script builds the query. The query also takes the average of the durationMS column from the TRIALS table. The final query (in my case) would look similar to this:

Select servername

, [select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY]

from (select servername

, command

, durationms

from dbo.trials) AS rf

PIVOT ( avg(durationms)

FOR command IN (

[select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY])

) AS p

Once the script is executed, you have a nice table that can easily be reviewed to see which setting may work best for a particular query:

servername | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY
DB2400_000 | 10798 | 103306 | 940229
DB2400_010 | 12305 | 113839 | 1175765
DB2400_020 | 11847 | 110398 | 1080601
DB2400_030 | 11903 | 105999 | 1068744
DB2400_040 | 11594 | 103962 | 1053893
DB2400_050 | 11363 | 103201 | 1032705
DB2400_060 | 11171 | 100775 | 1012864
DB2400_070 | 10931 | 98275 | 994035
DB2400_080 | 10623 | 96421 | 981286
DB2400_090 | 10338 | 94936 | 953186
DB2400_100 | 10297 | 93475 | 925285
DB2400_110 | 10418 | 93156 | 941389
DB2400_120 | 10374 | 93995 | 938454
DB2400_130 | 10635 | 95111 | 940769
DB2400_140 | 10527 | 93925 | 936573
DB2400_150 | 10566 | 93308 | 944626
DB2400_160 | 10516 | 94326 | 941008
DB2400_170 | 10760 | 94126 | 938562
DB2400_180 | 10503 | 93899 | 935958
DB2400_190 | 10623 | 93555 | 922817
DB2400_200 | 10894 | 93490 | 904607
DB2400_210 | 10658 | 93555 | 905882
DB2400_220 | 10799 | 93133 | 919812
DB2400_230 | 10687 | 95160 | 923053
DB2400_240 | 10848 | 96003 | 919006
DB2400_250 | 10623 | 96033 | 931469
DB2400_260 | 10486 | 98468 | 927744
DB2400_270 | 10663 | 96406 | 941534
DB2400_280 | 11876 | 96829 | 936327
DB2400_290 | 11311 | 98304 | 931216
DB2400_300 | 11307 | 98747 | 927451

I hope this helps. In part 2 and part 3 I will address performance testing SSIS and C# queries using the Microsoft OLE DB Provider for DB2.