Standard and Enterprise Data Mining

This post could be a theoretical piece about the true meaining of "Enterprise Data Mining", but no, it's much more practical than that.  As many of you know, Microsoft has released a grid indicating the differences between the various editions of SQL Server for SQL Server 2005.  Unfortunately for the data mining part, the grid is a bit mistaken.  Fortunately for you data miners out there, it turns out that a lot more functionality is available in Standard Edition than is advertised there.  So you don't have to wait until the official grid is updated, I'll just give you a run-down of the differences here, and in more detail as well.

The first cause for celebration is that all algorithms are available in Standard Edition.  The main noticeable difference between Standard Edition (SE) and Enterprise Edition (EE) is in the algorithm parameters.  In SE, most of the algorithm parameters are not available.  Off the top of my head, I believe that all of the parameters except those available for Decision Trees and Clustering in SQL Server 2000, plus the basic MINIMUM_SUPPORT and MINIMUM_PROBABILITY for Association Rules are only available in EE.  What does this mean for you?  This means that for novice users that don't need to fine tune the algorithm results, you can get by using the algorithms with default settings from SE.  More advanced users that require complete control over the algorithm suite to extract the best possible results will need EE.  In future CTP builds, each parameter that is available only in Enterprise Edition will state as such in its description text.

The next difference are some slightly more subtle differences in the Association Rules and Time Series algorithms.  Association Rules in SE is limited to 5,000 attributes.  That is, if you are creating a market-basket analysis/cross-sales model and you have less than 5,000 product, you can use SE.  If you have more, use EE.  Enterprise Time Series has the ability to cross-predict across different series, whereas Standard Time Series can use only the values of the same series in its prediction calculation.  If you want to see the interplay between your sell-in and sell-through numbers, EE is your choice.  If you need a simpler forecast only taking into account the historical values of a single series, you can get by with SE.  (Note that this is implemented by only allowing PREDICT ONLY on the series in SE, while allowing INPUT and PREDICT on series in EE).

The remaining differences revolve around performance, scalability, and extensibility.  As you would expect, you get more from EE than from SE.  In EE you can do parallel batch processing.  This means that when you create a Mining Structure containing several Mining Models, the models will be processed in parallel in EE, and processed serially in SE.  Additional in-processing parallelism, such as that for the Clustering algorithm, is curtailed in SE as well.  If processing performance of many candidate models is key, go for EE. 

For prediction, EE provides for unlimited concurrent prediction queries, whereas SE is limited to five.  This difference comes into play in middle tier scenarios or if your data mining server responds to many simultaneous ad-hoc queries.  If you are running a large web site with live predictions and SE starts to curtail your site performance and you will need to move to EE on your data mining server.

One of the key features of SQL Server 2005 data mining is it's extensibility.  For extending the tools with new visualations, this is available in all editions.  For extending the server with new algorithms, you will need Enterprise Edition.  Note that to develop new algorithms to plug into the server, there is a Developer Edition with full functionality.  This limitation comes into play only for production.