Distributed Partitioned Views / Federated Databases: Lessons Learned

Introduction

This article contains information about the things we have learned while working with Federated Databases. Before beginning it is necessary to define the terms being used. Included in this article is one solution in production that is using federated servers and Distributed Partitioned Views.

Definitions

Definition 1: Local Partitioned View – A single table is horizontally split into multiple tables, usually all have the same structure.

Example 1:

Original table is named FACT.

New tables are named FACT2005, FACT2006, and FACT2007. (I chose to partition by date, you can choose anything else that makes sense for your application, like branch office).

CREATE VIEW FACT AS

SELECT <select list> FROM dbo.FACT2005

UNION ALL

SELECT <select list> FROM dbo.FACT2006

UNION ALL

SELECT <select list> FROM dbo.FACT2007

You will notice that the view is created with the same name as the original table. This way the applications don’t have to change.

You also have to create a check constraint on the same column in each table to create a partitioning “key”. Read more about this and other very important restrictions in Books on Line at: ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/tsqlref9/html/aecc2f73-2ab5-4db9-b1e6-2f9e3c601fb9.htm. There are several good places in Books on Line to read about partitioned views; this link will give you a launching point for the other ones.

Note: Local partitioned views are popular in both OLTP and data warehouse projects. We rarely hear about problems with these, but there are known problems with very complex queries when the optimizer does not do partition elimination. I will not spend much time on local partitioned views in this article as I would like to concentrate on Distributed Partitioned Views.

Definition 2: Cross Database Partitioned View – tables are split among different databases on the same server instance. Notice the three part name using the database in Example 2 below.

 

Example 2:

CREATE VIEW FACT AS

SELECT <select list> FROM DB2005.dbo.FACT2005

UNION ALL

SELECT <select list> FROM DB2006.dbo.FACT2006

UNION ALL

SELECT <select list> FROM DB2007.dbo.FACT2007

Note: The most frequent question I get from people attempting a view like this that has tables in multiple databases on the same instance is about joins. You don’t lose too much performance with cross database joins. This is something to think about because you will normally join this view to some reference tables for the application. If you carry the reference tables in a database called COMMON for example, then you will most likely see something like SELECT * FROM dbo.FACT JOIN COMMON.dbo.Customer ON ….. WHERE …

You will also notice in this example that I put each fact table in its own database to facilitate easier scale out across servers or instances if you need to do this at a later time.

Definition 3: Distributed (across server or instance) Partitioned View. Tables participating in the view reside in different databases which reside on different servers or different instances. Note the four part name which includes the actual server name (or the cluster name if this is in a Windows Failover Cluster).

Example 3:

CREATE VIEW FACT AS

SELECT <select list> FROM DB2005.dbo.FACT2005

UNION ALL

SELECT <select list> FROM server2.DB2006.dbo.FACT2006

UNION ALL

SELECT <select list> FROM server3.DB2007.dbo.FACT2007

You will notice that the server name is missing from the first server. This view definition in Example 3 exists on server1. You cannot use a linked server to refer to the local server. You might immediately recognize a potential problem and try to create the view in Example 4 on server 2. However, the trick is to change the linked server definitions so that the same view code in Example 3 is deployed to every server.

BAD Example 4: (don’t do this)

CREATE VIEW FACT AS

SELECT <select list> FROM server1.DB2005.dbo.FACT2005

UNION ALL

SELECT <select list> FROM DB2006.dbo.FACT2006

UNION ALL

SELECT <select list> FROM server3.DB2007.dbo.FACT2007

Instead of doing the BAD Example 4, use the style in Example 3 and change the linked server definitions on every server.

Infrastructure architecture for one DPV project

Here is one production project to be used as a reference for DPV. The customer had 3,000 branch offices to automate and determined that the total size representing 3 months worth of data was going to be approximately 3.6TB. They didn’t want to risk putting this all on one server, so their basic approach was to break this into manageable size pieces. There are 6 servers in 2 geographically separated data centers. Each server has 2 instances so that if one server gets too busy they can easily move the second instance to another server. The data is spread out over 12 instances. The servers are 8 socket dual core machines with 16GB RAM. Each instance is responsible for approximately 300GB. Growth is expected to double soon which is why they selected machines this big.

Another reason they bought bigger machines than they need for a normal load is to have a high availability strategy. The three machines in each data center are in one cluster and if one machine goes down another machine can pick up the load. If two machines go down, one machine will do the work of all 3 and they expect that performance will be slow until the problem is fixed. If an entire data center goes down there is no solution in place yet – this is a later phase of the project.

The most important point in the success of this project and makes this project work so well is that they are not using load balancing. The users at each branch are connected directly to a server that contains their data. So even though most of the inserts, updates and deletes are done through the partitioned views, the work is mostly local to one server. There are some corporate users issuing queries that need data from multiple instances and it is expected that most of these queries will touch multiple servers.

Reasons for Distributing:

The main reason for distributing the 3.6TB over 12 servers was to break the data into manageable size pieces. This makes the hard maintenance problem in a VLDB much easier to deal with; i.e. backups, index and statistics maintenance.

If you use table partitioning as well as the partitioned views, then you have even smaller pieces to manage for most of your administrative processes.

WARNING:

Although some federated server projects are successful, we have found several problems during these projects. Fortunately we found solutions for each problem. The warning is that we don’t know if we have uncovered all the problems. Your workload may be different enough that other problems will be discovered. So test, test, test if you decide to distribute a single database over multiple servers.

Two main problem patterns

The two main problems we have found so far are:

1) When a command gets sent to every server when you think it should only go to one server. This happens when the query optimizer thinks it has to check the schema on every server, as in the case when the same collations are not used on all the servers (see note below in the lessons learned section).

2) When cross-server join copies the records from the remote server to the local server and then performs the join. This is called a non-remotable query. The optimizer is pretty good at copying the smaller table (or result set) to the right server before performing the join. Still, it is a situation that should be avoided in order to get the most consistent performance. Try to make all the joins happen on one server (either all on the remote or all on the local) without copying records across the network. See the notes below on advice on how to avoid this.

 

Here is a list of the things we have learned to do and not do on this project.

Lessons Learned on Distributed Partitioned Views: (multiple servers involved)

  1. Follow the guidelines in Books On Line very carefully. There are many links from this main one. I recommend reading and re-reading these until you know the subject very well before you start. ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/udb9/html/6e44b9c2-035e-4c88-907f-eef880c5540e.htm. If you are on-line and can get to this site: https://msdn2.microsoft.com/en-us/library/ms175129(SQL.90).aspx, read it and any of the links it has on this page.
  2. Avoid cross server joins whenever possible. One solution is to replicate all your reference/dimension tables to every server. In a cross server join, the necessary records are copied from the remote server to the local server, then the join is performed.
  3. Use the same collations in all databases. Otherwise the startup filters are not applied and the queries are always sent to servers with different collation.
  4. Use the same session settings in all connections. Otherwise startup filters are not applied and the queries are always sent to servers with different session settings.
  5. Create an index with the partitioned column as the leading column on the index because most of your queries will contain the partitioned column in the where clause. The optimizer uses this index and the associated statistics to do more efficient queries.
  6. Plan guides do not work for distributed queries.
  7. After restarting SQL Server, create some startup stored procedures that will run the queries that you need. Otherwise the first user running each query will pay a heavier penalty because it has to touch every server. These startup procedures will also create a local connection pool to each server. Even though creating a connection is fast it will still be better if the first user doesn’t have to wait for this too.
  8. Don’t forget to do the tip in Books-On-Line: Turn on Lazy Schema Validation. This will give you better performance. It helps to avoid