Authors: Kevin Cox & Michael Thomassy
Contributors: Lubor Kollar
Technical Reviewers: Shaun Tinline-Jones, Chuck Heinzelman, Steve Howard, Kun Cheng, Jimmy May
Connecting directly to a SQL Azure database from a reporting tool (like Microsoft Excel and PowerPivot, or SQL Server Reporting Services) from your desktop or local data center is possible using a data source that looks like:
This is useful if you have a single database on SQL Azure that you need to query. What if you have reached the 50GB limit that is currently the top size of a SQL Azure database? The solution is to split your database into many databases containing the same schema. See http://msdn.microsoft.com/en-us/library/azure/dn495641.aspx on database sharding with SQL Azure.
Expecting the users to always define multiple data sources to connect to multiple databases is cumbersome and may not perform well. For example, a PowerPivot user would have to open a separate link to every SQL Azure database individually. This paper describes a solution using a local SQL Server that can ease the access to all these databases and will usually give better performance. The solution is called Partitioned Views and is implemented using Linked Servers. Since customers don’t usually like to expose their databases to the internet, we are proposing a separate server that contains no actual data and only contains the views necessary for applications to get an internet link to SQL Azure databases. This is not the only solution because you can implement stored procedures using OPENROWSET or OPENQUERY for solutions that need more logic.
A read-only solution is the easiest to implement. If modifications must be made to the SQL Azure tables via the Distributed Partitioned Views, you must read the restrictions in SQL Server Books Online. It also helps to read about how to implement a partitioned view. Using the check constraints on the tables as shown in the examples in this link is essential if you want the optimizer to only touch the right tables when you use the constrained columns in a WHERE clause in your queries.
How does this work?
This concept relies on the linked server feature of SQL Server. Since ADO.NET, ODBC, PHP and JDBC are the only providers currently supported by SQL Azure, ODBC data source names (DSN) are required.
Setting up an ODBC DSN
Run odbcad32.exe to setup a system DSN using SQL Server Native Client. Or go into Control PanelSystem and SecurityAdministrative Tools and click on Data Sources (ODBC). Each database you create on SQL Azure needs a separate DSN and a separate linked server definition. Use TCP, not named pipes because you will be communicating to the SQL Azure servers through the internet.
Scale the creation of ODBC connections to multiple databases by modifying & importing a .REG file into the registry (regedit.exe) patterned after this sample:
[HKEY_LOCAL_MACHINESOFTWAREODBCODBC.INIODBC Data Sources]
“Azure_ODBC1″=”SQL Server Native Client 10.0”
SQL Server Linked Server
An ODBC DSN and an associated Linked Server need to be created for each Database that will be queried. Here are two sample T-SQL commands that will create a linked server and associated login:
EXEC master.dbo.sp_addlinkedserver@server = N’Azure_ODBC1′,@srvproduct=N’Any’, @provider=N’MSDASQL’, @datasrc=N’Azure_ODBC1′
/* For security reasons the linked server remote logins password is changed to ######## */
SQL Azure Limits
By default, SQL Azure allows 149 databases per server. To exceed this threshold, either 1) create databases in other servers, or 2) contact the Azure business desk and request that your limit be increased from the default threshold. Each database can be a maximum of 50GB, giving a default total 7.45TB of storage per Server. Remember to factor in index sizes when planning your total data needs. The 50GB limit has to cover both data and index space. Log space is considered separate from the 50GB limit.
Distributed Partitioned View
In order to make all your SQL Azure databases appear as one data source, you need to create a distributed partitioned view (DPV) in your local, on-premise SQL Server database. This database must be at least Enterprise edition because DPVs are not supported in Standard Edition or less.
This example statement creates the view with two SQL Azure databases on a local SQL Server:
CREATE VIEW dbo.dpv_Test AS
SELECT * FROM Azure_ODBC0.test0.dbo.fact
SELECT * FROM Azure_ODBC1.test1.dbo.fact;
This is an example of a simple select using a four-part name. More complex statements are allowed, such as adding joins and where clauses. Joins should use four part names in all table references to keep the join work on the same machine. The worst performance occurs when cross server joins are implemented.
The simple example above will return all rows from these tables in all databases. This may be your intention if, for example, you want to bring all the data into PowerPivot for aggregations and pivoting. Be mindful of the cost for bandwidth of returning all the rows. To estimate cost, go to the main Windows Azure pricing site. The appendix provides a query that can be used at any time to get the current billing estimate. Be mindful that the billing values reflect current pricing which you are likely to be modified as Azure matures.
Notes: It is not the view that is partitioned in a distributed partitioned view, it is the data the view points to that is partitioned. The data does not have to be partitioned, this view can point to one database and can contain one select. The tables being referenced in the view can exist in both SQL Azure and local databases as long as the correct permissions are granted for the appropriate users.
Use Linked Server to SQL Azure
There is a simple method for running a query against a single table directly on a SQL Azure server from your application (without using the DPV). As an example, if you open a query window on your local SQL Server Management Studio, while connected to a local instance, you can follow the sample:
SELECT * FROM [YourAzureServer@database.windows.net].[YourDatabase].[YourSchema].[YourTable]
To query all the tables on your Azure server from a local server, simply use the view created above:
SELECT * FROM dpv_Test
Write to a table through the view:
INSERT INTO dpv_Test VALUES (1, ‘Test’, 1)
The first time the view is invoked, either through a select or insert/update/delete command, it will cache the table constraints on the local server. This gives the view some information about where the insert should be directed and it will only touch that database. And if the constraints are used in the WHERE clause of a query, the optimizer will do partition elimination and only touch the databases it needs to touch.
Note: Beware that distributed partitioned views have several restrictions that are well documented in B Online. (This link is one of many to read, use it as a launch point for the other related pages.)
Linked Server Properties
Linked Servers are only available in SQL Server Enterprise Edition or higher.
If your view spans multiple Azure data centers, you need to be aware of one more thing before you get started. On your local server you need to set the Lazy Schema Validation option to TRUE so that the query is not sent to every server for every command. The schema check will defer to run time and may fail if the result set formats are incompatible. :
EXEC sp_serveroption ‘LocalServerName’, ‘lazy schema validation’, true
EXEC sp_serveroption ‘RemoteServerName’, ‘lazy schema validation’, true
Another server level setting that can help performance is to ensure that the collation settings are compatible on all servers. This is an optimization option and is not required. What can happen if the collations are not compatible is the filters supplied by the WHERE clauses do not get applied on the remote server. Instead, all data is returned by the query and the WHERE clause gets applied on the local machine.
EXEC sp_serveroption ‘LocalServerName’, ‘collation compatible’, true
EXEC sp_serveroption ‘RemoteServerName’, ‘collation compatible’, true
Downsides and potential pitfalls
This solution is not perfect and at times it can be a struggle. Most of the difficulties have been mentioned in this blog, but the two main ones are summarized here for convenience. The blog contains the solutions/workarounds to these problems so it is worth reading.
1. Optimizer sometimes ignores your WHERE clause and sends the query to every server, every database. The solution is to put constraints on your SQL Azure tables so the optimizer can determine which shard to touch and which to eliminate.
2. Non-remotable query where all data comes back to the local server without being filtered by the remote database.
What if one database is not available?
If the view needs to touch all the databases and one is not available, the query will fail. If the view has determined that it can do partition elimination and does not need the unavailable database, then the query will succeed. The only exception to this is the first time a DPV is executed after a restart of your local SQL Server, it needs to touch every destination database to get the constraints and store them locally (in memory). Remember to turn on lazy schema validation so that every query after the first one does not get sent to each database just to check the schema.
Leveraging Constraints for Partition Elimination
How does SQL Server know which databases to touch? To test this scenario, create the different databases on SQL Azure so that your main table contains data from each month of a certain year. Perhaps create the databases called AppDB2010_01 and AppDB2010_02 to make it easier for you to remember what the database contains. These names are meaningless and have no bearing on what you actually store in the database. Create a table, perhaps call it FACT and make sure it has a DATE or DATETIME (or any other date oriented data type). The key to making this solution work efficiently is to create a constraint on this DATE field so that SQL Server knows it only contains data within a certain range. In this example, it will only contain data for a certain month of a certain year; i.e. January 2010. This constraint is what gets cached on your local server after the first use of the DPV so that SQL Server knows which tables to touch during a query. Now test to confirm using a query such as:
SELECT * FROM dpv_Test WHERE datefield = ‘2010-01-15’.
How do you know a query did not touch a certain database? Since SQL Profiler does not work in SQL Azure, the only way to know which queries ran on a certain server is to use certain DPVs to see what plans have been generated. The following query will return a list of recently generated statistics and plans for a server:
SELECT * FROM sys.dm_exec_query_stats qs CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) ORDER BY last_execution_time DESC;
It is possible to use a local SQL Server to simplify access to a multiple SQL Azure database. The performance may be better than individual links from a front end tool when partition elimination is used.
Appendix: SQL Azure Billing Summary
Thanks to Lubor Kollar for creating these queries.
SELECT * FROM sys.Database_Usage
SELECT * FROM sys.Bandwidth_Usage