Linked Server Performance (Heterogeneous Databases)

Recently I was working on a problem where the scenario was as below. There are various applications connecting to SQL Server to fetch data. But the actual data was not resident on the SQL Server rather on Oracle 10g and IBM DB2 Databases. The applications were supposed to use Windows active directory for security and hence they couldn’t connect to the data sources directly to fetch data as IBM DB2 and Oracle 10g don’t have built in support for Windows authentication. 

image

 

 

The current implementation was to have Linked Server queries across IBM DB2 and ORACLE databases. Linked server queries were being used were four part queries https://msdn.microsoft.com/en-us/library/ms188279.aspx

This worked fine when the applications had to select all the rows from a table either in IBM DB2 or ORACLE database. The issue was when the applications had to choose only a few rows based on a condition from the remote data source, the performance was very slow. It took around 2 hours to complete.

SELECT a.Cl1, CL2, Cl3, Cl4, Cl5,Cl6

FROM    DB2.WareHouse.dbo.Table1 a, Oracle.WareHouse.dbo. Table2 b

WHERE   a.UID = b.UID

A quick SQL Server Profilertrace showed that even if the where clause evaluates to just one row all the 8 million rows were brought over from IBM DB2 database. To compound the issue there was a join to be formed with a single row from a table in Oracle. The next fetch too brought 10 millions of rows while we just required one row to form the second part of the join.

This happens mostly when the column on which the predicate is executed is a string or character type. Since the collations, data type length etc may vary across the database as per the customization on standards SQL Server will bring whole of the data to itself locally and then execute the where clause on the rows.

The alternative is to use OPENQUERYalso know as pass through query to pass the actual SQL (this needs to be in the format the remote data source expects) to the remote data source. The where clause gets executed at the remote data source and the resultant single row traverses over the wire instead of millions of rows.

But the story doesn’t end here. The OPENQUERY syntax is limited in that it does not accept variables for arguments.

The Linked Server query below works fine.

SELECT * FROM OPENQUERY(Server2,'SELECT * FROM Db1.dbo.Course WHERE Title = ''Calculus''')

But try adding a parameter to it and execute it.

DECLARE @Predicate as nvarchar(20)

SET @Predicate = 'Calculus'

SELECT * FROM OPENQUERY(Server2,'SELECT * FROM Db1.dbo.Course WHERE Title =' + @Predicate )

 SQL Server Management Studio reports the following error.

Msg 102, Level 15, State 1, Line 3

Incorrect syntax near '+'.

 

We required a way to use OPENQUERY for the sake of performance and yet be able to make it accept variables that users may choose dynamically. After a bit of research we were able to arrive at this solution. To create a stored procedure to create OPENQUERY dynamically accepting variables form the user. So the user chooses the table, column and where predicate and we the stored procedure creates the query on the fly executes it and returns the result.

USEDatabase1

--Drop the stored procedure if it exists

IF EXISTS (SELECT name FROMsysobjects

         WHERE name = 'StoredProc_LinkedServer' AND type = 'P')

   DROP PROCEDUREStoredProc_LinkedServer

GO

--Create the stores procedure

USEDatabase1

GO

CREATE PROCEDUREStoredProc_LinkedServer

   @LinkedServer varchar(50),

   @CatalogName varchar(100),

   @ColumnName varchar(50),

   @Predicate varchar(50)

AS

Declare @OPENQUERY as nvarchar(4000),@TSQL as nvarchar(4000),@QUERY as nvarchar(4000)

Declare @RemoteTableName varchar(200)

SET @RemoteTableName =@CatalogName

SET @OPENQUERY = 'SELECT * FROM OPENQUERY('+ @LinkedServer + ','''

SET @TSQL = 'SELECT * FROM ' + @RemoteTableName + ' WHERE ' + @ColumnName + ' = ' + ''''''+ @Predicate+ ''''''');'

SET @QUERY = @OPENQUERY +@TSQL

--Exceute the TSQL created dynamically. Since it is a Linked Server OPENQUERY the actual query gets executed on remote database.

EXEC sp_executesql@QUERY

GO

 

This stored procedure can have wrapper stored procedures to hide the name of Linked Server and other values in case the DBA wishes so. So the applications call the stored procedure provided to them, which in turn calls this internal stored procedure. You may ask what about performance? Viola it came down to below a second and I have a happy customer.

I would like to delve a bit in depth into the reason behind FOURPART LINKED Server query versus OPENQUERY LINKED Server query performance. There are a few other intricacies about Linked Server that have not been documented well. So I will post that in another blog along with my colleague Runeet.

Author : Angshuman(MSFT), SQL Developer Engineer, Microsoft 

Reviewed by : Jason(MSFT) SQL Escalation Services, Microsoft and

                       Snehadeep(MSFT), SQL Developer Engineer, Microsoft