Performance Testing the Microsoft OLE DB Provider for DB2 (Part 1)


There are many factors that can affect pulling data from DB2 into SQL Server (or other applications): network speed, disk speed, CPU speed, memory, number of CPUs, and the load on both the DB2 and SQL Server machines. There are too many factors to account for them all.

One thing the provider offers is the ability to pre-fetch data from the DB2 server before the consumer asks for it. This feature is normally only used with SSIS or SSAS, but it can be used by other consumers. It should ONLY be used when moving large amounts of data (we recommend millions of rows), though there are times when it can be useful in DQP applications (such as SQL Server’s Query Analyzer). Pre-fetching is controlled by a connection string parameter, “Rowset Cache Size”. The default for this setting is 0 (off), but it can be set to almost any number (we do not recommend going over 300 at this point). The parameter lets us fetch data from DB2 before the consumer actually requests it. For example, if SSIS asks for 10,000 rows of data and this parameter is set to 30, we ask DB2 to send us 300,000 rows and store them internally, avoiding the delay of constantly asking DB2 for smaller chunks of data.
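For example, with a cache size of 30, the only change to an otherwise normal connection string is the extra parameter at the end (the values below are placeholders):

Provider=DB2OLEDB;User ID=USERNAME;Password=password;Network Address=DB2400;Network Port=446;Rowset Cache Size=30;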

But how do you know what to set this to? Normally we recommend 30 as a starting point, but finding the right value for your environment can be cumbersome to test.

Recently I worked a case that involved testing multiple queries across multiple linked servers, each with a different setting for the Rowset Cache Size. I did not care about the actual data; I was only concerned with performance at the different settings. Manually creating queries and linked servers and gathering timing numbers was cumbersome at best, and I decided there had to be a better way.

So, I came up with 4 SQL scripts that do all the work for me. They are generic enough to be used by anyone, and only 1 script file needs modification. Since the scripts are generic in nature, I named them GenericPart1.sql, GenericPart2.sql, GenericPart3.sql, and GenericPart4.sql. These scripts assume you have a database called DB2TestDatabases.

So, what do these do?

GenericPart1.sql

This script file creates 3 tables. 2 of the tables are used to build the 3rd.

To begin with, you need to make a few changes.

Ignoring the declare section, skip down to this section:

-- ********************* BEGIN USER DEFINED VARIABLES *********************

-- ***** first, need your connection string, WITHOUT ROWSET CACHE SIZE

set @constr = N'Provider=DB2OLEDB;User ID=USERNAME;Password=password;Initial Catalog=DB2400;Network Transport Library=TCP;Host CCSID=37;PC Code Page=1252;Network Address=DB2400;Network Port=446;Package Collection=SCHEMA;Default Schema=SCHEMA;Process Binary as Character=False;Units of Work=RUW;Default Qualifier=SCHEMA;DBMS Platform=DB2/AS400;Defer Prepare=False;Persist Security Info=True;Connection Pooling=False;Derive Parameters=False;'

-- ***** supply your servername prefix, I use the name of system I'm connecting to

set @srvprefix = N'DB2400_'

-- ***** supply your SQL queries

set @sqlquery1 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'

set @sqlquery2 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'

set @sqlquery3 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'

set @sqlquery4 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'

set @sqlquery5 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'

set @sqlquery6 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'

set @sqlquery7 = N'select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY'

set @sqlquery8 = N'select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY'

set @sqlquery9 = N'select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY'

-- ********************* END USER DEFINED VARIABLES *********************

-- create QueryTable

-- this table is used for looping purposes

-- when creating the trials table later

-- it can contain as many queries as you wish to test with

create table [dbo].[QueryTable] (query nVarChar(2000))

insert into [dbo].[QueryTable] values (@sqlquery1)

insert into [dbo].[QueryTable] values (@sqlquery2)

insert into [dbo].[QueryTable] values (@sqlquery3)

insert into [dbo].[QueryTable] values (@sqlquery4)

insert into [dbo].[QueryTable] values (@sqlquery5)

insert into [dbo].[QueryTable] values (@sqlquery6)

insert into [dbo].[QueryTable] values (@sqlquery7)

insert into [dbo].[QueryTable] values (@sqlquery8)

insert into [dbo].[QueryTable] values (@sqlquery9)

Only a few changes are needed here to make this usable in your environment.

The variable @constr is set to your connection string, without the “Rowset Cache Size=x” parameter. That parameter will be appended later, when the linked servers are created.

The variable @srvprefix can be anything you want; I use the name of the DB2 server I want to connect to. It becomes the base of the linked server names, which are numbered incrementally (more on that later).

Next, enter the SQL Queries you want to run. In my case, I wanted to run 3 queries, with each of them pulling in 10,000, 100,000 and 1,000,000 rows of data. Nice and simple, but you can use any query you want here. I also wanted them to run 3 times each, so I could average the times to get a better idea of performance over time.

Once you have the queries, make sure the insert statements match up; in my case, 9 inserts. This piece of code fills a table that is later read in a loop to build the set of tests to run.

Once the above is completed, you are done with all the modifications you need to make to the script files.

The next part of the script creates another table, which contains the linked server names and the final connection string for each linked server:

-- Create Linked Server names in a table

-- table contains the name of the linked server

-- we also construct the data provider connection string here

-- appending the Rowset Cache Size setting, from 0 to 300 in increments of 10

create table [dbo].[LinkServerNames] (servername Nvarchar(255), constr Nvarchar(2000))

set @i = 0

while @i < 301 -- creating linked server names for Rowset Cache Size 0, 10, 20, ... 300

begin

set @istr = cast(@i as nvarchar(3))

-- set connection string property now

set @realconstr = @constr + 'Rowset Cache Size=' + @istr + ';'

set @j = len(@istr)

-- and now build istr up to 3 characters

-- for linked server name

set @rowstr = '000'

set @rowstr = stuff(@rowstr,4-@j,@j,@istr)

-- insert data

insert into [dbo].[LinkServerNames] values(@srvprefix + @rowstr, @realconstr )

set @i = @i + 10 -- increment value (was 30)

end

As I mentioned, the table contains 2 columns: the linked server name and the connection string. These are built in a loop which, in this case, creates 31 rows. The linked server names use the prefix you entered earlier plus a 3-digit number; in my case, I would have DB2400_000, DB2400_010 … DB2400_300. The final connection string is also set up here, by appending the “Rowset Cache Size” parameter to the connection string entered earlier.

After the above is run, the next part of this script creates the linked servers. I will leave that part of the script out here; it builds and executes SQL commands similar to what you would run if creating the linked servers manually (the commands are similar to what you would get if you scripted a linked server from SSMS to a query window).
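For reference, here is a minimal sketch of what that part does for each row in the LinkServerNames table (this is my sketch, not the actual script; the login handling in particular may differ in your environment):

exec master.dbo.sp_addlinkedserver
     @server = @servername          -- e.g. DB2400_030
    ,@srvproduct = N'Microsoft OLE DB Provider for DB2'
    ,@provider = N'DB2OLEDB'
    ,@provstr = @realconstr         -- the connection string with Rowset Cache Size appended
-- map a login so distributed queries can connect
exec master.dbo.sp_addlinkedsrvlogin
     @rmtsrvname = @servername
    ,@useself = N'False'
    ,@locallogin = NULL
    ,@rmtuser = N'USERNAME'
    ,@rmtpassword = N'password'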

Finally, this script creates the last table. This table will be used by both GenericPart2.sql and GenericPart4.sql.

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[TRIALS]') AND type in (N'U'))

DROP TABLE [dbo].[TRIALS]

CREATE table [dbo].[TRIALS]

(trialname Nvarchar(255) -- test name

,servername Nvarchar(255) -- linked server name

,command Nvarchar(2000) -- command to run

,StartTime DateTime -- start time

,EndTime DateTime -- end time

,durationMs int -- run time in milliseconds

)

The table contains a test name (built on loop numbers), the linked server name, and the command which will be run. It also contains columns for the start and end times of the test, and the duration (in milliseconds) of the test.

Here are a few sample rows created in the table:

trialname | servername | command | StartTime | EndTime | durationMs
Trial #: 001-001 | DB2400_000 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0
Trial #: 001-002 | DB2400_010 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0
Trial #: 001-003 | DB2400_020 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | NULL | NULL | 0

In the above, you can see that the 1st ‘run’ uses the query that pulls 10,000 rows, against multiple linked servers; the linked servers shown have Rowset Cache Size set to 0, 10 and 20. In my case, the table has 279 rows: 3 separate queries, repeated 3 times, against 31 linked servers (3 × 3 × 31 = 279).
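The part of GenericPart1.sql that fills the TRIALS table is not shown above, but conceptually it is two nested cursor loops over the two helper tables. A minimal sketch (my reconstruction; the actual script may differ):

-- outer loop: one group per row in QueryTable; inner loop: one trial per linked server
declare @trialnum int, @groupnum int
declare @query nvarchar(2000), @servername nvarchar(255)
set @trialnum = 0
set @groupnum = 0
declare query_cursor cursor for select query from [dbo].[QueryTable]
open query_cursor
fetch next from query_cursor into @query
while @@FETCH_STATUS = 0
begin
    set @groupnum = @groupnum + 1
    declare server_cursor cursor for select servername from [dbo].[LinkServerNames] order by servername
    open server_cursor
    fetch next from server_cursor into @servername
    while @@FETCH_STATUS = 0
    begin
        set @trialnum = @trialnum + 1
        -- trial name carries the group number and a running counter, e.g. 'Trial #: 005-125'
        insert into [dbo].[TRIALS]
        values (N'Trial #: ' + right(N'00' + cast(@groupnum as nvarchar(3)), 3)
                + N'-' + right(N'00' + cast(@trialnum as nvarchar(3)), 3)
               , @servername, @query, NULL, NULL, 0)
        fetch next from server_cursor into @servername
    end
    close server_cursor
    deallocate server_cursor
    fetch next from query_cursor into @query
end
close query_cursor
deallocate query_cursor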

GenericPart2.sql

This is the workhorse of the test scripts provided here.

This script opens a cursor against the TRIALS table (created above), reads in the linked server name and query, and builds an exec statement, then executes it as follows:
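-- note: @trial, @server, @command, @starttime, @endtime, and @success are declared in the script's declare section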

DECLARE trial_cursor CURSOR FOR

SELECT trialname, servername, command

FROM [dbo].[TRIALS]

Open trial_cursor

FETCH NEXT FROM trial_cursor into @trial, @server, @command

WHILE @@FETCH_STATUS = 0

BEGIN

PRINT @trial + N' ' + @server

-- append server location to the command

SET @command= N'exec (''' +@command+ ''') at ' + @server

-- run it

SET @starttime = getdate()

exec @success=sp_executesql @command

Once the exec statement returns, the script updates the TRIALS table with the start and end times and the duration of the test, then loops back until all tests in the table are completed.

-- set the end time only if the exec succeeded; leave it NULL if it failed

If @success=0 SET @endtime = getdate()

-- save the times and durations

Update [dbo].[TRIALS]

Set StartTime = @starttime, EndTime=@endtime, DurationMs=DateDiff(MS, @starttime, @endtime)

WHERE trialname =@trial

-- reset the times for the next iteration (in case of failure)

SET @starttime =NULL

SET @endtime =NULL

FETCH NEXT FROM trial_cursor into @trial,@server, @command

END

CLOSE trial_cursor

DEALLOCATE trial_cursor

If an error happens, the script will stop running. There is no error checking, as I could not find a way to catch DB2 errors; it is possible to catch SQL errors using a TRY-CATCH, but that requires a bit more work.
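If you do want basic SQL-side error handling, a minimal sketch (my own addition, not part of these scripts) would wrap the exec like this:

BEGIN TRY
    SET @starttime = getdate()
    exec @success = sp_executesql @command
    SET @endtime = getdate()
END TRY
BEGIN CATCH
    -- log the SQL-side error; @endtime stays NULL so the TRIALS row records the failure
    PRINT N'Trial ' + @trial + N' failed: ' + ERROR_MESSAGE()
    SET @endtime = NULL
END CATCH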

After the script runs, the TRIALS table will contain data similar to the below:

trialname | servername | command | StartTime | EndTime | durationMs
Trial #: 004-123 | DB2400_290 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | 5/11/2009 10:47:17 PM | 5/11/2009 10:47:27 PM | 9736
Trial #: 004-124 | DB2400_300 | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | 5/11/2009 10:47:27 PM | 5/11/2009 10:47:37 PM | 9743
Trial #: 005-125 | DB2400_000 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:47:37 PM | 5/11/2009 10:49:10 PM | 92610
Trial #: 005-126 | DB2400_010 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:49:10 PM | 5/11/2009 10:50:58 PM | 108460
Trial #: 005-127 | DB2400_020 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:50:58 PM | 5/11/2009 10:52:45 PM | 107063
Trial #: 005-128 | DB2400_030 | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | 5/11/2009 10:52:45 PM | 5/11/2009 10:54:29 PM | 104143

In the above table, you can see where test 4 (reading 10,000 rows) switched to test 5 (reading 100,000 rows); these were the 123rd through 128th queries executed overall. The last 3 columns contain the start and stop times, plus the duration in milliseconds. Also, from the linked server names, you can see that these tests used Rowset Cache Size settings of 290, 300, 0, 10, 20, and 30.

I would also highly recommend running this script from a command line, piping the output to nul.
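For example, with sqlcmd (the server and database names here are placeholders for your own):

sqlcmd -S YOURSERVER -d DB2TestDatabases -i GenericPart2.sql > nul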

GenericPart3.sql

This is a simple script that cleans up most of what GenericPart1.sql created. It does not delete the TRIALS table, but it removes the linked servers that were created and drops the LinkServerNames and QueryTable tables.
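A minimal sketch of that cleanup (my reconstruction; the actual script may differ):

-- drop each linked server, then the helper tables
declare @servername nvarchar(255)
declare cleanup_cursor cursor for select servername from [dbo].[LinkServerNames]
open cleanup_cursor
fetch next from cleanup_cursor into @servername
while @@FETCH_STATUS = 0
begin
    exec master.dbo.sp_dropserver @server = @servername, @droplogins = 'droplogins'
    fetch next from cleanup_cursor into @servername
end
close cleanup_cursor
deallocate cleanup_cursor
drop table [dbo].[LinkServerNames]
drop table [dbo].[QueryTable]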

GenericPart4.sql

Now, after running GenericPart2.sql, you have a table full of performance data, but the data is intermingled. You know you should have 3 timings for the 10,000-row query against each of the linked servers, but how do you get that out? And, even better, can you average the 3 test times? Well, yes you can, with this script.

This script builds a PIVOT SQL query against the TRIALS table, using the actual query text. To pick up only the 3 distinct queries, it first opens a cursor against the TRIALS table, reading in the DISTINCT commands that were issued. These are saved in a string used by the pivot query (the command strings become the column headers).
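Something like this builds that string (a sketch of the idea, not the exact script):

declare @cols nvarchar(max), @cmd nvarchar(2000)
set @cols = N''
declare cmd_cursor cursor for select distinct command from [dbo].[TRIALS]
open cmd_cursor
fetch next from cmd_cursor into @cmd
while @@FETCH_STATUS = 0
begin
    if @cols <> N'' set @cols = @cols + N', '
    -- wrap each command in brackets to form a column name (note quotename() handles at most 128 characters)
    set @cols = @cols + quotename(@cmd)
    fetch next from cmd_cursor into @cmd
end
close cmd_cursor
deallocate cmd_cursor
-- @cols is then spliced into the dynamic PIVOT statement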

Next, the script builds the query. The query also takes the average of the durationMS column from the TRIALS table. The final query (in my case) would look similar to this:

Select servername

, [select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY]

from (select servername

, command

, durationms

from dbo.trials) AS rf

PIVOT ( avg(durationms)

FOR command IN (

[select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY]

, [select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY])

) AS p

Once the script is executed, you have a nice table that can easily be reviewed to see which setting may work best for a particular query:

servername | select * from SCHEMA.TABLE FETCH FIRST 10000 ROWS ONLY | select * from SCHEMA.TABLE FETCH FIRST 100000 ROWS ONLY | select * from SCHEMA.TABLE FETCH FIRST 1000000 ROWS ONLY
DB2400_000 | 10798 | 103306 | 940229
DB2400_010 | 12305 | 113839 | 1175765
DB2400_020 | 11847 | 110398 | 1080601
DB2400_030 | 11903 | 105999 | 1068744
DB2400_040 | 11594 | 103962 | 1053893
DB2400_050 | 11363 | 103201 | 1032705
DB2400_060 | 11171 | 100775 | 1012864
DB2400_070 | 10931 | 98275 | 994035
DB2400_080 | 10623 | 96421 | 981286
DB2400_090 | 10338 | 94936 | 953186
DB2400_100 | 10297 | 93475 | 925285
DB2400_110 | 10418 | 93156 | 941389
DB2400_120 | 10374 | 93995 | 938454
DB2400_130 | 10635 | 95111 | 940769
DB2400_140 | 10527 | 93925 | 936573
DB2400_150 | 10566 | 93308 | 944626
DB2400_160 | 10516 | 94326 | 941008
DB2400_170 | 10760 | 94126 | 938562
DB2400_180 | 10503 | 93899 | 935958
DB2400_190 | 10623 | 93555 | 922817
DB2400_200 | 10894 | 93490 | 904607
DB2400_210 | 10658 | 93555 | 905882
DB2400_220 | 10799 | 93133 | 919812
DB2400_230 | 10687 | 95160 | 923053
DB2400_240 | 10848 | 96003 | 919006
DB2400_250 | 10623 | 96033 | 931469
DB2400_260 | 10486 | 98468 | 927744
DB2400_270 | 10663 | 96406 | 941534
DB2400_280 | 11876 | 96829 | 936327
DB2400_290 | 11311 | 98304 | 931216
DB2400_300 | 11307 | 98747 | 927451

I hope this helps. In part 2 and part 3 I will address performance testing SSIS and C# queries using the Microsoft OLE DB Provider for DB2.