Querying like the Data Mining viewers do

It happens all the time.  You see some cool user interface trick and think “how can I do that?”  Or, maybe more likely, you think “gee, that’s useful, where’s that documented?”  In any case, if you have ever wondered about how the DM viewers get the data to display on the screen, then this is the post for you.

In many cases, what is displayed in the data mining viewers is the result of built-in stored procedures which allow the processing required for the view to be done on the server without requiring all of the model content to be brought to the server.  In this post, I’ll give one example of those stored procedures, and then I’ll add some information on how you con figure out the rest.  Not everything in the viewers is calculated on the server, but many of them are, and getting access directly to the data is useful (I presume, since everyone always asks…. :) )

Let’s start with an easy one – the Naive Bayes attribute discrimination view.  This view shows how the differences in the input attributes across the states of an output attribute.  It generally looks like this:

image

The viewer doesn’t download all of the correlations in the Naive Bayes content, rather it calls the stored procedure GetAttributeDiscrimination like this:

 

 CALL System.GetAttributeDiscrimination
  ('Classify CollegePlans NB', 
   '100000005', 
   'Plans to attend', 
   1, 
   'All other states', 
   2, 
   0.0005, 
   true)

The not-so-obvious parameters are, in order strModel, strPredictableNode, strValue1, iValType1, strValue2, iValType2, dThreshold, and bNormalize.  Let’s go through these parameters…

strModel – the name of the model, of course!

strPredictableNode – this one is a bit difficult, as it takes the Node Unique Name of the target attribute instead of just the string you see in the viewer.  The Node Unique Name identifies the attribute in the content rowset generated by the model.  You can get the list of predictable attributes and their Node Unique Names by calling another stored procedure – like thisCALL System.GetPredictableAttributes('ModelName').   This stored procedure returns two columns – one for the attribute name and one for the Node Unique Name.

strValue1 – The name of the value you want to compare on the left hand side.  The usage of this parameter depends on the value of the next parameter, which is….

iValType1 – This parameter indicates how to treat strValue1.  It can have values 0,1, or 2.  If this parameter is a 1, the value in strValue1 is the actual state of the attribute.  However, if this parameter is a 0 or 2, the value in strValue1 is ignored.  If the value is 0, the left-hand value is considered to be the “missing state”.  If the value is 2, the left hand value is considered to be “all other states.”  In the example above, “All other states” is specified only because it looks nice (and it’s easier to just drop the combo box value into the function call even if it will just be ignored….)

strValue2 – Like strValue1, but for the right hand side.

iValType2 – Like iValType2, but for the right hand side.

dThreshold – A threshold value used to filter results, such that small correlations don’t come back in the results.    Usually you set it to a really small number like 0.0005 in the example above.

bNormalize  - Whether or not the result is normalized.  If this value is true, the results are normalized to a maximum absolute value of 100, giving a possible range of –100 to 100.  All this does is take the largest absolute value in the result and divide that into 100, and then multiple all the other numbers by that amount.  If set to false, the numbers are whatever they are and you can figure it out yourself – it’s up to you, but we always set this to true.

The results

Calling this routine returns a row for every differentiating attribute/value pair with a score higher than the specified threshold.  The row contains the differentiating pair along with the score and some other columns and looks somewhat like this:

image

The score column is the “important” one and is best explained as if you did something like a c language compare routine e.g int Compare(int v1,int v2) { return v1-v2; } .  That is, if the value is positive it favors value1 and if the value is negative it favors value2.  I’m not going to go in depth on the other columns other than to say that they are the actual counts of the correlations of the discriminator against the inputs.  The best way to understand them is to look at the Mining Legend as you browse the model and click on rows.  For example if you clicked on the first row of the result above (in either picture), the Mining Legend would look like this:

image

Of course, once you have the result set you can use it wherever you want – in Reporting Services, Integration Services, or in you custom program.

How to get the function calls

So, how do you find this laundry list of undocumented stored procedures.  Well, some are documented in my book, but you can get them all for free just by looking in the right place.  You can run the SQL Server Profiler to see what functions are being called by the viewers.  Here’s how you do it.

First, run SQL Server Profiler:

image

Then, start a New Trace from the File Menu and connect to Analysis Services

image

You can leave all the defaults on for the Trace Properties dialog that appears

image

Then go to any data mining viewer and browse a model!  That’s it!  You will get trace output that looks like this:

image

You can get the query text in the bottom pane of the trace window by finding and selecting rows with event types of ‘Query Begin’ or ‘Query End’.

Oh, and make sure you stop the trace before you shut down SQL Profiler - ‘cuz it will keep going.

Let me know if you find any particularly tricky stored procedures that you need help with – best place to ask is in the MSDN forums!