The amazing flexibility of DMX Table Valued Parameters

When most people think of “Table Valued Parameters” they think of “possibly the most anticipated T-SQL feature of SQL Server 2008.”  However, what you may not know is that the Data Mining team added table valued parameters and table valued functions to DMX in SQL Server 2005.  Not only has this feature been in shipping code for more than four years as of this writing, but the DMX implementation is also considerably easier to use.

This blog entry is one of those that has been requested so many times that I look back on my past posts and am simply amazed I haven’t written it yet.  The thing is, we made the feature so easy to use that, as the development team, we didn’t think it needed more explanation.  Of course, as the development team we think all our features are easy to use, so that may not say much.  The question came up yet again last week, and since I’m working from home this morning prior to a parent-teacher conference, I decided to buckle down and knock this topic out.

SQL Server Data Mining – use data from anywhere

When we set out to make data mining accessible from anywhere, on any data, we really meant it.  Independent of where the data resides, you should always be able to use that data with SQL Server Data Mining to train models or perform predictions on existing models.  And it shouldn’t be hard.  One part of this functionality is XML for Analysis (XMLA), which allows you to connect to and communicate with a SQL Server Analysis Services server using SOAP.  I’m not going to get into that now, but it means that everything that can be done using any API can be done from any platform using raw XML, if need be.

But it needn’t be, because we thought it should be easier, so we added Table Valued Parameters to DMX in SQL Server 2005 – we didn’t call it that, but it is what it is.  It means that you can write a query like this:

 INSERT INTO <model>(<columns>) @InputRowset

And pass the training data as the rowset parameter.  So this raises the question: how do you send a rowset as a parameter?  Well, unlike other implementations, we tried to make this as easy as possible with the APIs you are already using.  One thing we made sure of is that there is no “preparatory” work required on your part.  For example, you do not need to create any user-defined types or do any setup on the server to get this to work.  We also made sure that if the object you have in your application is logically a rowset, you can send it to us the way you have it.  This means that if you use our OLE DB provider and you have a pointer to an IRowset, you can pass the IRowset as the query parameter.  If you use our ADOMD.Net provider, you can pass a DataTable or an IDataReader as the parameter.  Our providers do the work of converting the contents of those objects into the formats required by the protocol so you don’t have to.  Anyway, that probably sounds simpler to me than it does to you, so let’s just go ahead with an example.

The Example

First, let’s just create an arbitrary DataTable in memory like this:

// Build a simple in-memory DataTable: three columns, four rows
DataTable table = new DataTable();
DataColumn column;
column = table.Columns.Add("Column1");
column.DataType = System.Type.GetType("System.Int32");
column = table.Columns.Add("Column2");
column.DataType = System.Type.GetType("System.String");
column = table.Columns.Add("Column3");
column.DataType = System.Type.GetType("System.String");

table.Rows.Add(1, "A", "B");
table.Rows.Add(2, "A", "B");
table.Rows.Add(3, "B", "C");
table.Rows.Add(4, "B", "C");

See – nothing special.  Of course, in your case you would populate the table with meaningful values, but that’s beside the point of the example.  Next I’ll use ADOMD.Net to create and train a model using DMX.  I’m using CREATE MINING MODEL just to keep the number of steps down.  Of course you can use CREATE MINING STRUCTURE followed by an ALTER MINING STRUCTURE statement (there’s a sketch of that approach below), but, again, the point of the example is the table valued parameter.  Anyway, this is the code for creating the model.

AdomdConnection conn = new AdomdConnection("Data Source=localhost");
AdomdCommand cmd = new AdomdCommand();
conn.Open();
cmd.Connection = conn;

// Create the mining model
String createMiningModel =
    "CREATE MINING MODEL foo " +
    "( " +
    "   [rowid] LONG KEY, " +
    "   [A] TEXT DISCRETE, " +
    "   [B] TEXT DISCRETE " +
    ") USING Microsoft_Clustering " +
    "WITH DRILLTHROUGH";
cmd.CommandText = createMiningModel;
cmd.ExecuteNonQuery();

This creates a simple cluster model with three columns that match those in my table.  I added drillthrough so I can retrieve the cases for later, but other than that – totally straightforward.
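
For completeness, here is roughly what the two-step CREATE MINING STRUCTURE / ALTER MINING STRUCTURE approach mentioned above might look like, used instead of the CREATE MINING MODEL statement.  This is just a sketch: the structure name fooStructure is an arbitrary illustrative choice, and the columns mirror the model definition above.

// Alternative DDL (sketch) – create the structure first, then add the model to it
cmd.CommandText =
    "CREATE MINING STRUCTURE fooStructure " +
    "( " +
    "   [rowid] LONG KEY, " +
    "   [A] TEXT DISCRETE, " +
    "   [B] TEXT DISCRETE " +
    ")";
cmd.ExecuteNonQuery();

cmd.CommandText =
    "ALTER MINING STRUCTURE fooStructure " +
    "ADD MINING MODEL foo " +
    "( " +
    "   [rowid], " +
    "   [A], " +
    "   [B] " +
    ") USING Microsoft_Clustering " +
    "WITH DRILLTHROUGH";
cmd.ExecuteNonQuery();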

Next, let’s look at the code that lets me take the table and use that data to train the model.

String insertInto =
    "INSERT INTO foo([rowid],[A],[B]) " +
    "@TrainingRowset";
cmd.CommandText = insertInto;
cmd.Parameters.Add("TrainingRowset", table);
cmd.ExecuteNonQuery();

See!  It really is that simple.  Just create your query, load your data into a table, add the table as a parameter and voila!  Table valued parameters sent to the server to process a model.

So I wanted to show an example of how to do this with a prediction query, but it’s just so easy it didn’t make sense.  However, you may ask “What if my data is external?  Do I have to load it into a data table before I can send it to the server?”  The answer is a profound NO.  If you can get any sort of IDataReader on the data, the ADOMD.Net provider will consume the data reader and stream the data to the server.
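
For instance, here is a rough sketch of streaming rows straight out of a relational table into the model using System.Data.SqlClient.  The connection string and the dbo.TrainingData table are purely hypothetical stand-ins for whatever external data you actually have; the point is simply that the reader is handed to the AdomdCommand parameter as-is.

// Sketch only: the SQL Server connection string and dbo.TrainingData
// (with rowid, A and B columns) are hypothetical placeholders.
using (SqlConnection sqlConn = new SqlConnection(
    "Data Source=.;Initial Catalog=MyDatabase;Integrated Security=true"))
{
    sqlConn.Open();
    SqlCommand sqlCmd = new SqlCommand(
        "SELECT [rowid], [A], [B] FROM dbo.TrainingData", sqlConn);
    using (IDataReader externalReader = sqlCmd.ExecuteReader())
    {
        cmd.CommandText = "INSERT INTO foo([rowid],[A],[B]) @TrainingRowset";
        cmd.Parameters.Clear();
        cmd.Parameters.Add("TrainingRowset", externalReader);
        cmd.ExecuteNonQuery();  // the reader's rows are streamed to the server
    }
}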

Let’s take another example using a data reader.  In this example I’m going to pull the cases out of the model using <model>.CASES, since it’s convenient (and I added DRILLTHROUGH to the model definition), but any implementation of IDataReader will do.  In fact, this is kind of silly, as DMX allows me to join directly against the model cases, but hey, it’s just for example’s sake.

Here’s the code to get the reader.  I put it in a function just to emphasize that, from the calling code’s point of view, it is simply getting an IDataReader; where that reader comes from (in this case, ADOMD.Net itself) doesn’t matter:

private IDataReader GetReader()
{
    AdomdConnection conn = new AdomdConnection("Data Source=localhost");
    AdomdCommand cmd = new AdomdCommand();
    conn.Open();
    cmd.Connection = conn;
    cmd.CommandText = "SELECT [A],[B] FROM foo.CASES";
    // CloseConnection ties the connection's lifetime to the returned reader
    return cmd.ExecuteReader(CommandBehavior.CloseConnection);
}

And here’s how you would take that reader and perform an arbitrary prediction join against the model.

cmd.Parameters.Clear();  // clear parameters from previous usage
String queryModel =
    "SELECT Cluster(), t.[A], t.[B] " +
    "FROM foo " +
    "NATURAL PREDICTION JOIN " +
    "@InputRowset AS t";
cmd.CommandText = queryModel;
cmd.Parameters.Add("InputRowset", GetReader());
AdomdDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
    Console.WriteLine("Cluster {0}, Input Values {1}, {2}",
        reader[0], reader[1], reader[2]);
}

Again, virtually no work on your part to push data from an arbitrary data provider on the client up to the server and, in this case, perform predictions on that data.

Summary

Table Valued Parameters in SQL Server Data Mining give you pretty much unlimited flexibility in the data sources you can use to perform data mining.  Even if appropriate data providers are not available on the server to use traditional data access methods such as OPENQUERY or model development in BI Dev Studio – or even if appropriate data providers don’t exist at all – you can still perform data mining operations on any data that you yourself can access or generate.
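
For contrast, here is a sketch of the traditional server-side route alluded to above.  It assumes a named data source, hypothetically called MyDataSource, has already been defined in the Analysis Services database and that a suitable provider is installed on the server; with table valued parameters, none of that setup is needed.

// Traditional approach: the server pulls the data itself through a named
// data source.  "MyDataSource" and "dbo.TrainingData" are hypothetical.
cmd.CommandText =
    "INSERT INTO foo([rowid],[A],[B]) " +
    "OPENQUERY(MyDataSource, 'SELECT rowid, A, B FROM dbo.TrainingData')";
cmd.ExecuteNonQuery();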

I hope this article opens some doors to some great new data mining applications!