SQL Server CLR Hosting

Draft v0.1 - 30th June 2004

In this white paper, we'll discuss the integration of .NET Framework development support into SQL Server 2005, achieved by hosting the Common Language Runtime directly within the relational engine. We'll examine the advantages and disadvantages of writing stored procedures and other database objects in managed code, and talk about the theory and practice of using this functionality. Lastly, we'll look at some tips and tricks for developers and systems administrators wanting to maximise performance and improve maintenance of managed code in the database.

Why Introduce Managed Code into the Database?

In the currently released SQL Server 2000, there are in effect three ways to embed custom logic into your database:

  1. Writing stored procedures, user-defined functions and triggers using Transact-SQL;
  2. Creating extended stored procedures using a C-style DLL written to a special API;
  3. Building your logic into an external component or data access layer (perhaps .NET or COM) and ensuring that all database manipulation occurs via your own component.

Traditionally developers have opted for a mixture of options 1 and 3. Option 1 is perfect from the point of view of a database purist - everything that impacts the integrity of the database can be kept there; nothing can bypass the code, as it's embedded in the database. However, whilst T-SQL is great for set-based data manipulation, it doesn't have the same degree of structure and elegance as most component-orientated languages, offers limited support for string handling and many other constructs, and doesn't perform particularly well when you're not dealing directly with the database.

Option 3 allows you to build access logic that abstracts data from a relational form into more business-centric idioms such as customer, purchase order etc., as well as offloading this work from the database server to other application servers. But because the data tier is separate from the database itself, it's not always easy to integrate any semantic validation from here into other data services such as replication or reporting.

Most people steer away from option 2 (extended stored procedures), because they're quite fragile due to their in-process nature - a memory leak in one of these can bring the whole database server down, for instance. They're also rather awkward to write. The guarded wording used in this security note is certainly enough to scare me off recommending them, at any rate.

The big change in SQL Server 2005 is that the Common Language Runtime (CLR) is hosted directly in the database engine, allowing .NET managed code to be used for stored procedures, functions or triggers. On the surface, this is the best of all worlds - your code resides in the database, but can take advantage of the full richness of the .NET Framework Base Class Libraries. It performs better than T-SQL for computationally-intensive tasks, can be secured using both the SQL security architecture and .NET Code Access Security, and allows you to use any .NET language to develop everything from the front-end user interface to the back-end database logic itself.

Here are a few examples of how you might use this capability to build a richer database:

  • Utilise the .NET cryptography classes to encrypt sensitive data in the database for added security;
  • Use regular expressions in a trigger to validate structured string data such as email addresses, phone numbers and postal codes prior to insertion in the database (see the sketch that follows this list);
  • Create table-valued functions that combine data from a SQL table with data from an external source or other values derived through computation and lookups based on the existing data.
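
Taking the second of those bullets as an example, here's a minimal sketch of the kind of validation function you might call from such a trigger. Everything here - the class name, the method name and the (deliberately simplistic) pattern - is invented for illustration; a real system would use a pattern tuned to its own data.

using System.Text.RegularExpressions;

public class Validation
{
   // Hypothetical helper: returns true if the input looks
   // roughly like an email address. The pattern is deliberately
   // simplistic - tighten it for production use.
   public static bool IsValidEmail(string input)
   {
      return Regex.IsMatch(input,
         @"^[\w\.\-]+@[\w\-]+(\.[\w\-]+)+$");
   }
}

Catalogued as a scalar function in the same way as the encryption functions shown later in this paper, a helper like this could then be called from an INSERT or UPDATE trigger, or even from a CHECK constraint.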

On top of all this, you can also create custom user-defined data types and aggregates using the .NET integration; for example, you could have a custom type that represents a financial instrument and then use a custom aggregation to ensure that it was rolled up appropriately for reports.
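
To give a flavour of what the aggregate side of that looks like, here's a rough sketch of a trivial user-defined aggregate that simply totals the values passed to it. Treat it as illustrative only: the Init/Accumulate/Merge/Terminate contract is the important part, but the attribute names and namespaces have been shifting between builds, so check the documentation for the build you're running against.

using System;
using System.Data.Sql;        // attribute location varies between builds
using System.Data.SqlTypes;

[Serializable]
[SqlUserDefinedAggregate(Format.Native)]
public struct TotalExposure
{
   private SqlDouble total;

   // Called once per group before any values are accumulated
   public void Init()
   {
      total = 0;
   }

   // Called once per row in the group
   public void Accumulate(SqlDouble value)
   {
      total += value;
   }

   // Called when the engine processes the group on multiple threads
   public void Merge(TotalExposure other)
   {
      total += other.total;
   }

   // Called once per group to produce the final result
   public SqlDouble Terminate()
   {
      return total;
   }
}

Once catalogued (via CREATE AGGREGATE in recent builds), an aggregate like this can be used in a GROUP BY query just as you'd use SUM() or AVG().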

One particularly striking aspect of CLR integration is how SQL Server 2005 takes advantage of the extended CLR hosting APIs in .NET Framework 2.0 (the new version of the Framework that will ship simultaneously with Visual Studio 2005 and SQL Server 2005). Traditionally the CLR takes care of memory management, garbage collection and thread support itself, but when it runs in SQL Server 2005 these services are delegated to the database engine. The engine has a much better understanding of the current system load as a whole and can therefore manage memory and threads appropriately for the entire execution environment. This results in a more robust and scalable solution than if these services were left within the CLR itself.

Is T-SQL Dead? Certainly Not!

At this point, you might be wondering whether these new capabilities are the death knell for writing these objects using Transact-SQL. By no means! Both managed code and T-SQL will continue to play an important part in database applications; managed code support is intended to offer additional flexibility rather than replace what already exists. The received wisdom thus far is that managed code should be used for computationally-intensive activities or tasks that can take advantage of the base class libraries in the .NET Framework, whereas T-SQL should be maintained for set-based select and update operations.

But rather than being some complex design decision that will involve whole armies of architects debating the merits of each technique until the early hours, the choice in practice is reasonably intuitive. There are certain stored procedures I've written in the past which felt like they took T-SQL some place that it wasn't intended for: most of them have involved loops or similar iterative constructs. In the future, I'll be writing these same tasks in C# and I can imagine them taking half the code and a quarter of the effort. Similarly, if all you're doing in a stored procedure is wrapping a complex multi-table joined SELECT statement, you'll wind up writing twice the code if you build it in managed code, because you'll still need the same statement but you'll have to wrap it up with objects such as SqlContext and SqlCommand that will just add overhead and make debugging and deployment harder. (We'll talk about the concept of SqlContext later in this paper.)

Creating a User-Defined Function in C#

That's enough general discussion. Let's see how this works out in practice. I've written a very simple pair of functions that use the Triple DES cryptography class in the Framework to encrypt and decrypt strings. As you look at the code below, note that there's nothing specific to Yukon.

using System.IO;
using System.Security.Cryptography;
using System.Text;
public class Encrypt
{
   // This is obviously not the right place to store
   // cryptographic secrets as these values could be
   // easily retrieved. In a real-world scenario, you'd
   // use a Cryptographic Service Provider (CSP) or 
   // similar to generate these values. It does however
   // make this demo easier to distribute!
   // 192-bit key and 64-bit initialization vector
   private static readonly byte[] ENCRYPTION_KEY = new byte[]
      { 0x00, 0x3A, 0x40, 0x30, 0x00, 0x3A, 0x40, 0x30, 
        0xFA, 0xE0, 0x01, 0x39, 0x8A, 0x21, 0x17, 0x97, 
        0x14, 0x85, 0xED, 0xCA, 0xFF, 0x44, 0x48, 0x9F };
   private static readonly byte[] ENCRYPTION_IV = new byte[] 
      { 0x10, 0xA9, 0x44, 0x24, 0x96, 0xD4, 0xFA, 0xFE };
   public static byte[] EncryptData(string input)
   {
      byte[] byteInput = new UnicodeEncoding().GetBytes(input);
      // Instantiate a Triple DES provider (key and IV are supplied below)
      TripleDESCryptoServiceProvider desServiceProvider = 
         new TripleDESCryptoServiceProvider();
      desServiceProvider.KeySize = 192;
      // Create DES Encryptor from this instance
      ICryptoTransform desEncryptor = 
         desServiceProvider.CreateEncryptor(
            ENCRYPTION_KEY, ENCRYPTION_IV);
      // Create stream that transforms input using 
      // DES encryption
      MemoryStream ms = new MemoryStream();
      CryptoStream encryptStream = 
         new CryptoStream(ms, desEncryptor, 
            CryptoStreamMode.Write);
      // Write the encrypted data to the stream
      encryptStream.Write(byteInput, 0, byteInput.Length);
      encryptStream.FlushFinalBlock();
      // Now write out MemoryStream to output buffer
      return ms.ToArray();
   }
   public static string DecryptData(byte[] input)
   {
      // Instantiate a DES instance
      TripleDESCryptoServiceProvider desServiceProvider = 
         new TripleDESCryptoServiceProvider();
      // Create DES Decryptor from our des instance
      ICryptoTransform desDecryptor = 
         desServiceProvider.CreateDecryptor(
            ENCRYPTION_KEY, ENCRYPTION_IV);
      // Create stream that transforms input using 
      // DES decryption
      MemoryStream msDecrypt = new MemoryStream(input);
      CryptoStream csDecrypt = 
         new CryptoStream(
            msDecrypt, desDecryptor, CryptoStreamMode.Read);
      byte[] byteOutput = new byte[input.Length];
      csDecrypt.Read(byteOutput, 0, byteOutput.Length);
      string stringOutput = 
         new UnicodeEncoding().GetString(byteOutput);
      return stringOutput.TrimEnd('\0');
   }
};

So what do you need to do to use these functions from Yukon? The first step, of course, is to compile this class into a .NET assembly, using a command like:

    csc /t:library funcEncrypt.cs

Take a note of the location of the compiled DLL; you'll need it later.

Cataloguing an Assembly in T-SQL

To use the functions above from SQL Server 2005, there are two steps: cataloguing the assembly itself and cataloguing any entry points that you'll use. Here are the Transact-SQL commands that perform these steps:

    CREATE ASSEMBLY YukonCLR
       FROM 'c:\code\yukon\yukonclr.dll'
       WITH PERMISSION_SET = SAFE
    GO
    CREATE FUNCTION dbo.fn_encrypt_data
    (@input nvarchar(max))
       RETURNS varbinary(max)
       AS EXTERNAL NAME YukonCLR:Encrypt::EncryptData
    GO
    CREATE FUNCTION dbo.fn_decrypt_data
    (@input varbinary(max))
       RETURNS nvarchar(max)
       AS EXTERNAL NAME YukonCLR.Encrypt.DecryptData
    GO

Notice the new data types of nvarchar(max) and varbinary(max) being used: these allow (almost) unlimited binary and string data lengths to be supported without having to deal with CLOB or BLOB manipulation pointers. These types weren't implemented in the Beta 1 release, but are supported in Beta 2. Also notice the two forms of the EXTERNAL NAME clause shown above: the first function uses the Beta 1 assembly:class::method syntax, which is still supported in this release but is deprecated and will likely not be supported by the time SQL Server 2005 finally releases; the second uses the newer assembly.class.method syntax that replaces it. Lastly, be aware that you can give the functions an alias in the SQL environment rather than their name as defined in managed code, but the assembly itself must be named identically to the actual assembly name.

For each CREATE statement, there is an analogue for ALTER and DROP; here's the standard way to drop the assembly and associated objects (including a check for their prior existence):

    IF EXISTS (SELECT * FROM sys.assembly_modules
       WHERE name = N'fn_encrypt_data' AND [type] = N'FS')
    DROP FUNCTION dbo.fn_encrypt_data

    IF EXISTS (SELECT * FROM sys.assembly_modules
       WHERE name = N'fn_decrypt_data' AND [type] = N'FS')
    DROP FUNCTION dbo.fn_decrypt_data

    IF EXISTS (SELECT * FROM sys.assemblies
       WHERE name = N'YukonCLR')
    DROP ASSEMBLY YukonCLR
    GO

To test that these functions work successfully, try executing the following:

    SELECT dbo.fn_encrypt_data('Yukon demo')

As a further test, let's use both functions together to prove the roundtrip works:

    SELECT dbo.fn_decrypt_data(
      dbo.fn_encrypt_data('Hello world!'))

So far, we've only seen how you build user-defined functions. Actually, stored procedures work in exactly the same way, except that you use CREATE PROCEDURE rather than CREATE FUNCTION.

If you examine the DROP statements above closely, you'll notice that they reference a few system views. You can investigate much of the underlying storage of managed code in Yukon by browsing these views. Firstly,

    SELECT * FROM sys.assemblies

This shows the .NET assemblies that are catalogued in a particular database, with their full strong names and some other metadata (if you want to use the same assembly from two different databases, you'll have to catalogue it twice). Now try the following statement:

    SELECT * FROM sys.assembly_files

You'll see here the assembly ID, along with the actual binary code of the assembly itself. Lastly, the following view shows the functions that are catalogued in the database:

    SELECT * FROM sys.assembly_modules
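
You can also combine these views for quick housekeeping queries. For example, the following statement - which uses only the name and [type] columns that the DROP statements earlier relied on - lists the CLR scalar functions catalogued in the current database:

    SELECT name
    FROM sys.assembly_modules
    WHERE [type] = N'FS'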

Using the SQL Server In-proc Managed Provider

So far, we've created a simple user-defined function in C# and used the CREATE ASSEMBLY and CREATE FUNCTION statements to catalogue the assembly and function, but we've really only scratched the surface of the capabilities of managed code in SQL Server 2005. To reach closer functional parity with Transact-SQL, we need to be able to access existing data in the database from managed code. In this section, we'll therefore look at the new in-proc .NET Data Provider for SQL Server.

Since the inception of the .NET Framework, the recommended strategy for accessing relational data has been ADO.NET. To recap quickly, the classes pertaining to ADO.NET reside largely within the System.Data namespace, and provide a mechanism for connecting to databases and working with data either using a disconnected DataSet object or using a fire-hose forward-only cursor through a DataReader instance. To provide greater database neutrality, ADO.NET implements a provider model in which connected objects such as SqlConnection and OracleConnection implement a common interface IDbConnection. Most of the methods and properties you're likely to call are exposed via these common interfaces. (You'll find a far more exhaustive treatment of the subject in this book, which I highly recommend.)

Much of what you probably already know about ADO.NET also applies when writing managed code for Yukon. You'll find yourself working with common classes such as SqlConnection, SqlDataReader, and SqlTransaction, but there are a few differences that reflect the fact that the code now executes inside the database engine. Let's cover some of the main points of significance:

  • Since any data access now occurs in the same process as the database engine itself, there's a separate in-proc data provider. Instead of using classes in the System.Data.SqlClient namespace, managed code for Yukon uses the System.Data.SqlServer namespace.
  • There's no support in the SqlServer namespace for creating disconnected DataSet objects using a data adapter. This makes sense: it would be pointless to create a disconnected object within a permanently connected context.
  • On the other hand, there is support for the new SqlResultSet object that is introduced in the Whidbey release of ADO.NET. This provides a connected object rather like the old ADO Recordset that can be used for scrollable cursor-based navigation. Although set-based data manipulation is still desirable, the cost of a server-side cursor is greatly reduced in this context. One helpful attribute of a SqlResultSet is that it derives from SqlDataReader, allowing you to use it for data-binding even if the container hasn't been updated for Whidbey.
  • Instead of spinning up your own objects, the SqlServer namespace provides a class factory called SqlContext that creates connection, command, result set, error and transaction objects.
  • You can return information to the outside world by means of a SqlPipe object (also created by the SqlContext factory), which allows you to send strings, data readers and rows, as well as errors.

Let's have a look at a very simple stored procedure written in C#:

using System.Data.SqlServer;
public class PubsSprocs
{
   public static void GetBookInfo()
   {
      // Create a command against the context connection we're running on
      using (SqlCommand cmd = SqlContext.GetCommand())
      {
         cmd.CommandText = "SELECT * FROM authors";
         // Stream the results straight back to the caller
         SqlPipe pipe = SqlContext.GetPipe();
         pipe.Send(cmd.ExecuteReader());
      }
   }
}

To catalogue this against the pubs database, you'd use something like:

    CREATE PROCEDURE dbo.usp_get_author_info
       AS EXTERNAL NAME YukonCLR:[PubsSprocs]::GetBookInfo
    GO

Calling EXEC usp_get_author_info would of course give you the same results as executing the SELECT statement directly. This procedure isn't exactly a great candidate for managed code, of course, but it demonstrates some of the key concepts I've mentioned above. Notice the use of the SqlContext class to create the SqlCommand and SqlPipe instances; in fact, SqlPipe has a private constructor to prevent instantiation in any other way. You can also see how the pipe is used to output a DataReader object without deserialising it explicitly.

Using a Command object isn't actually the most scalable way to work with data if you're going to be calling the same command multiple times or from multiple users. The preferred route is to separate out the command definition itself from any run-time parameters that will be given to that command. To do this, you create a SqlDefinition object which contains a parameterised statement, and then instantiate a SqlExecutionContext object from the connection which supplies the parameter values and executes the statement. Here's a cut-down code fragment that shows this in practice:

using System.Data;
using System.Data.Sql;
using System.Data.SqlServer;
using System.Data.SqlTypes;
...
public class YukonManagedClass
{
   static readonly SqlDefinition insertStmt;
   ...
   static YukonManagedClass()
   {
      // Create the definition
      SqlCommand cmd = new SqlCommand();
      SqlParameter param = new SqlParameter();
      // Parameterised statement goes here...
      cmd.CommandText = "INSERT INTO MyTable " + 
         " VALUES (@Param1)";
      param = cmd.Parameters.Add("@Param1", 
         SqlDbType.NVarChar, 50);
      // Additional parameters here...
      insertStmt = new SqlDefinition(cmd);
   }
   public static void MyStoredProc()
   {
      SqlConnection conn = SqlContext.GetConnection();
      using (SqlExecutionContext ex = 
         conn.CreateExecutionContext(insertStmt))
      {
         ex.SetString(0, "some value");
         ex.ExecuteNonQuery();
      }
   }
}

This fragment comes from a stored procedure I wrote to create some dummy data for a sample I put together. You can see the full stored procedure here if you're interested, along with some sample SQL to catalogue the procedure and test it.
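
For completeness, cataloguing a procedure like this follows exactly the pattern we saw earlier; something along these lines would do (the procedure name here is simply invented for the example):

    CREATE PROCEDURE dbo.usp_insert_dummy_data
       AS EXTERNAL NAME YukonCLR:[YukonManagedClass]::MyStoredProc
    GO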

Table-Valued Functions

In the previous sections, we saw how Yukon allows you to build stored procedures and functions in managed code and how to catalogue .NET assemblies in the database. We then looked at how you can use the in-proc data provider to access data efficiently and return results through the SqlPipe object.

Until now we've only looked at writing scalar functions (i.e. those that return a single value, such as a number or a string). SQL Server also supports table-valued functions (TVFs), which as their name suggests return tables. In Transact-SQL, you can use a table-valued function in many places where you'd use a regular table, for example in the FROM clause of a SELECT statement:

    SELECT *
    FROM fn_calculate_premiums(30000)

The managed function prototype for TVFs is as follows:

    public static System.Data.Sql.ISqlReader Foo();

ISqlReader is an interface implemented by the SqlDataReader classes in both System.Data.SqlServer and System.Data.SqlClient.

So how do you return your own data through this interface? If your function is just doing a bit of data access itself, you might be able to simply call something like SqlExecutionContext.ExecuteReader() and return the result directly. But most of the time you'll want to either edit data (perhaps adding a column or changing some values) or even create some brand new data to return. In this case, your only option is to build a new class that implements ISqlReader, provide custom implementations of its methods and properties, and return an instance of this class to the caller.

ISqlReader itself isn't too bad - it only has a few properties and methods to override, but it in turn inherits from ISqlRecord and ISqlGetTypedData, both of which require you to implement a shed-load of stuff (I counted sixty separate methods and properties). Suddenly this seems like a load of work, doesn't it? (Little gripe: I'm hoping there will be a helper class by the time of Yukon RTM that makes this whole process a good deal simpler. If we don't write one officially by Beta 2, I'm going to be putting my own one together since I'm getting tired of implementing this code over and over again.)

For now, you'll be pleased to know that you can get away with only implementing 7 of the 60 methods and properties (plus one or two more if you want your table to contain multiple data types) in order for a TVF to work both within the Yukon environment and for simple reader.Read() style iterations elsewhere. If all your table contains is string columns, then the following will do:

  1. MySqlReader (constructor)
  2. ISqlRecord.FieldCount (number of columns in the returned table)
  3. ISqlGetTypedData.GetSqlMetaData(int FieldNo) (the schema of the returned table)
  4. ISqlReader.Read() (for use in constructs such as reader.Read())
  5. IGetTypedData.GetString(int FieldNo) (returns the string in the current row and specified column)
  6. ISqlGetTypedData.GetSqlCharsRef(int FieldNo) (ditto, but returns a SqlChars)
  7. ISqlGetTypedData.GetSqlChars(int FieldNo) (ditto)

Make sure you implement both 6 and 7 - the MSDN documentation sample is wrong here at present. If you're also returning ints or other data types, you'll have the appropriate GetXXX() methods to implement as well before your work is done. For the other methods, you can simply throw a NotImplementedException. (Incidentally, don't make the mistake I did the first time round and simply return null from the default prototypes created by VS.NET when you hit Tab after ISqlRecord - you'll get spurious problems if any of the other methods are called for some reason, and you won't be able to pin them down.)
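
To make that a little more concrete, here's a heavily abbreviated sketch of what such a reader might look like for a table consisting of a single string column. Only the members from the list above are shown, and the constructor, column name and metadata values are all invented for the example; the real class would declare itself as implementing ISqlReader and stub out every remaining member with throw new NotImplementedException(). Bear in mind too that these are beta interfaces, so the exact signatures (particularly the SqlMetaData type and the GetSqlCharsRef return type) may differ in the build you're using.

using System;
using System.Data;             // SqlDbType
using System.Data.Sql;         // ISqlReader, SqlMetaData (beta namespaces)
using System.Data.SqlTypes;    // SqlChars

// Abbreviated sketch: a forward-only reader over a one-column table of strings.
// The real class adds ": ISqlReader" and throws NotImplementedException
// from every member not shown here.
public class MySqlReader /* : ISqlReader */
{
   private readonly string[] rows;   // one element per row to return
   private int current = -1;         // current row; -1 = before the first row

   public MySqlReader(string[] rows)
   {
      this.rows = rows;
   }

   // 2. Number of columns in the returned table
   public int FieldCount
   {
      get { return 1; }
   }

   // 3. Schema of the returned table
   public SqlMetaData GetSqlMetaData(int fieldNo)
   {
      return new SqlMetaData("Title", SqlDbType.NVarChar, 200);
   }

   // 4. Advance to the next row
   public bool Read()
   {
      current++;
      return current < rows.Length;
   }

   // 5. Return the value in the current row as a CLR string
   public string GetString(int fieldNo)
   {
      return rows[current];
   }

   // 6 and 7. Return the value as a SqlChars
   public SqlChars GetSqlChars(int fieldNo)
   {
      return new SqlChars(rows[current].ToCharArray());
   }

   public SqlChars GetSqlCharsRef(int fieldNo)
   {
      return GetSqlChars(fieldNo);
   }
}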

For fun, I wrote a TVF that used the Amazon search service to find books matching a given ISBN and then return the results in the form of a table. The code is too long to include here inline, but you can view it here. If you want to use it, you'll need to modify the developer token required by Amazon to one you've registered. Once compiled, you can catalogue this as follows:

    CREATE FUNCTION dbo.fn_get_book_info(@ISBN nvarchar(10))
    RETURNS @BookInfo TABLE
       (Title nvarchar(200), Author nvarchar(200), Price nvarchar(30),
        Rating nvarchar(10), Rank nvarchar(10))
    AS EXTERNAL NAME YukonCLR:[BookInfo]::GetBookInfo
    GO

Calling the function with a statement such as:

    SELECT * FROM fn_get_book_info('073560505X')

should return a result set.

Playing around with early bits like this is sometimes frustrating and unrewarding, but it certainly means you get to understand how the product works and get ahead so that there's not so much to learn when it finally ships!

Restricting the CLR

Not all of the Base Class Libraries in the .NET Framework are available in the database. Functionality that isn't applicable to the database environment is disallowed: most of System.* is available, but assemblies such as System.Windows.Forms, System.Drawing and System.Web are not supported. This is achieved through a "fusion loader hook" in the CLR hosting layer. How does that work? When an assembly is looked up today, the CLR finds it using Fusion; SQL Server 2005 intercepts that search for the hosted environment and loads the assembly from the database itself instead.

Even within supported assemblies, some APIs are not available when running inside SQL Server. This is achieved through a new HostProtection attribute in the CLR (which is extensible by third-party libraries). You can reflect over an assembly to list these methods, and a full list will be available in the SQL Server 2005 documentation. Potentially unreliable constructs such as thread creation, socket listening and finalizers are disabled.

Tips for using the In-Proc Provider

  • Use SqlPipe.Execute() instead of cmd.ExecuteReader() where you can: this lets you stream data straight back to the client without marshalling the results into the managed environment (a sketch follows this list).
  • Use SqlDataReader instead of T-SQL cursors: it's the fastest way to enumerate rows.
  • Use SqlExecutionContext for static SQL patterns to improve performance.
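
As a sketch of the first of those tips, the GetBookInfo procedure shown earlier could hand its command straight to the pipe rather than executing a reader itself. The class and method names below are invented for the example, and it assumes the SqlPipe.Execute(SqlCommand) overload referred to above:

using System.Data.SqlServer;

public class PubsSprocsStreamed
{
   public static void GetAuthorsStreamed()
   {
      using (SqlCommand cmd = SqlContext.GetCommand())
      {
         cmd.CommandText = "SELECT * FROM authors";
         // Hand the command to the pipe so the engine streams the
         // results back without marshalling rows into managed code
         SqlContext.GetPipe().Execute(cmd);
      }
   }
}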

Security Tips for DBAs

  • Use sp_configure 'clr enabled' to enable CLR hosting in the engine - it's off by default (see the snippet after this list)
  • Use the CREATE ASSEMBLY permission to control the creation of assemblies
  • Use the REFERENCES permission to protect schema binding on assemblies
  • Use the EXECUTE permission to regulate who can execute routines
  • Catalog view security works as expected
  • Use the EXTERNAL ACCESS permission to regulate who can create external access code
  • Recommendations: use safe assemblies - this is the default and is closest to the T-SQL model. If your code needs to go off the box (i.e. it requires external access), you need to think through your impersonation strategy: by default, SQL Server does not impersonate the caller, but you can use EXECUTE AS to specify a particular SQL context for impersonation.
  • Avoid unsafe assemblies - they can compromise system reliability and correctness.
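
Putting the first two of those bullets into practice looks something like the following (the user name is a placeholder, and the exact option and permission names may vary slightly between builds):

    EXEC sp_configure 'clr enabled', 1
    RECONFIGURE
    GO
    -- Allow a particular database user to catalogue assemblies
    GRANT CREATE ASSEMBLY TO SampleUser
    GO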