Solving Business Problems with SQL Server 2005 Data Mining

Data Mining is about exploring your data, finding patterns and performing predictions. Where querying and analysis tells you what, data mining tells you why and how. In SQL Server 2005, data mining is made accessible and easy to use through an integrated user interface, cross-product integration and familiar, standard APIs. From the two algorithms that existed in SQL Server 2000, there are now a total of seven algorithms available out of the box, and it's possible to extend this further using an SDK.

Data Organisation
How can we organise data to provide the most pertinent information? OLAP provides fast queries with aggregations, but the hierarchies are schema-driven. That makes it hard to know what hierarchies are meaningful or provide actionable information. "Intelligent" OLAP provides self-organising cubes with data-driven hierarchies. You can achieve this with a data mining dimension: from a cube, you generate a mining model that is used in turn to create a new dimension.

For example, the clustering algorithm takes the input data and attempts to segment it into groups. You could now rename those groups, before creating a data mining dimension and using them as dimension members for slicing data. Similarly, a decision trees algorithm could be used to generate a parent-child dimension. This enables us to determine (for example) which groups of customers are the most profitable, in a far more intelligent way than if we'd used OLAP alone without integrating data mining algorithms.

Data Cleansing
How do we clean data as we load it into the data warehouse? We can use DTS pipeline integration in combination with a DM clustering algorithm to separate out clean and bad data. This can help us detect spurious data members (for example, a parent younger than 10 years old).

The steps are as follows:

  • Create a clustering model for the data you want to cleanse
  • Create a DTS project
  • Add a data source
  • Add a Mining Query Transform using the PredictCaseLikelihood() function
  • Do a conditional split based on the output.