Data Mining with SQL Server 2008

Article
10/02/2008

Last week Bogdan and I went through the final steps in completing the SQL (sic) to the wildly popular "Data Mining with SQL Server 2005", cleverly titled "Data Mining with SQL Server 2008". We went through the final PDF's and signed off that the companion web content is good to go. I am definitely pleased with how this book turned out. I listened to the feedback on the first book and pressed the authors and the editors to ensure that this book has much higher "initial code quality" as you might call it.

Writing a book isn't really all that much different than writing software, particularly when the book is about software, I suppose. The authors are the developers and the editors are the testers, and I have to say that the editors took us to task ensuring that every phrase turned correctly and no word was used wastefully. It was great help that, unlike the first edition, we were able to have the same editors throughout, and even though at times it seemed they were being a bit picky :), they were able to catch similar phrases used differently across multiple chapters and made a huge impact on the final text. Also, Shuvro, from the DM team here, manually verified each and every query and line of code that is included on the companion site. (I suppose that it helps that the book was written after the product was completed, so that API's and language constructs didn't change in flight)

So, readers of the first book will probably want to know what's new and different about this one. Overall, the book has generally the same structure as the first - so it's evolutionary rather than revolutionary. That being said, I would guess that at over half the content is new or heavily modified. For example, I rewrote most of the introduction, there are two new chapters covering the Data Mining Addins for Office, and the "OLE DB for Data Mining" chapter was completely reimagined and rewritten as "Data Mining Concepts and DMX." Each of the algorithm chapters was updated with respect to new features and comments from forums and newsgroups. Even where there were no significant changes to the algorithm between 2005 and 2008, the chapters were reorganized to focus on the practical application first and leave the technical implementation details to the latter portion of the chapter. Also, Bogdan, borrowing from his 50-page epic whitepaper SQL Server Data Mining Programmability, took over and greatly enhanced the Architecture and API chapters.

According to Amazon, the book will be released on November 10th and you can pre-order now. Since the "read inside" feature isn't available yet (likely since the book is being printed as I write this), I've taken the liberty and pasted the table of contents to make this probably the longest blog post I've ever had the pleasure of pasting. Anyway, I hope this gives you a good idea of what's coming in the book so you can preorder with confidence! :)

Foreword xxix

Introduction xxxi

Chapter 1 Introduction to Data Mining in SQL Server 2008 1

Business Problems for Data Mining 4

Data Mining Tasks 6

Classification 6

Clustering 6

Association 7

Regression 8

Forecasting 8

Sequence Analysis 9

Deviation Analysis 9

Data Mining Project Cycle 9

Business Problem Formation 10

Data Collection 10

Data Cleaning and Transformation 10

Model Building 12

Model Assessment 12

Reporting and Prediction 12

Application Integration 13

Model Management 13

Summary 13

Chapter 2 Applied Data Mining Using Microsoft Excel 2007 15

Setting Up the Table Analysis Tools 16

Configuring Analysis Services with Administrative Privileges 17

Configuring Analysis Services without Administrative

Privileges 18

What the Add-Ins Expect 19

What to Do If You Need Help 22

The Analyze Key Influencers Tool 22

The Main Influencers Report 24

The Discrimination Report 26

Summary of the Analyze Key Influencers Task 28

The Detect Categories Tool 28

Launching the Tool 29

The Categories Report 30

Categories and the Number of Rows in Each 30

Characteristics of Each Category 31

The Category Profiles Chart 32

Summary of the Detect Categories Tool 34

The Fill From Example Tool 35

Running the Tool and Interpreting the Results 36

Refining the Results 38

Summary of the Fill From Example Tool 39

The Forecasting Tool 39

Launching the Tool and Specifying Options 40

Interpreting the Results 42

Summary of the Forecast Tool 44

The Highlight Exceptions Tool 44

Using the Tool 45

More Complex Interactions 48

Limitations and Troubleshooting 50

Summary of the Highlight Exceptions Tool 51

The Scenario Analysis Tool 51

The Goal Seek Tool 53

Using Goal Seek for a Numeric Goal 56

Using Goal Seek for the Whole Table 57

TheWhat-If Tool 58

UsingWhat-If for the Whole Table 61

Summary of the Scenario Analysis Tool 62

The Prediction Calculator Tool 62

Running the Tool 64

The Prediction Calculator Spreadsheet 65

The Printable Calculator Spreadsheet 67

Refining the Results 68

Using the Results 73

Summary of the Prediction Calculator Tool 73

The Shopping Basket Analysis Tool 74

Using the Tool 75

The Bundled Item Report 76

The Recommendations Report 77

Tweaking the Tool 79

Summary of the Shopping Basket Analysis Tool 81

Technical Overview of the Table Analysis Tools 81

Summary 82

Chapter 3 Data Mining Concepts and DMX 83

History of DMX 83

Why DMX? 84

The Data Mining Process 85

Key Concepts 86

Attribute 86

State 87

Case 88

Keys 89

Inputs and Outputs 91

DMX Objects 93

Mining Structure 93

Mining Model 94

DMX Query Syntax 95

Creating Mining Structures 96

Discretized Columns 97

Nested Tables 98

Partitioning into Testing and Training Sets 99

Creating Mining Models 100

Nested Tables 101

Complex Nesting Scenarios 104

Filters 107

Populating Mining Structures 108

Populating Nested Tables 110

Querying Structure Data 112

Querying Model Data 112

Prediction 115

Prediction Join 116

Prediction Query Syntax 116

Nested Source Data 117

Real-Time Prediction 118

Degenerate Predictions 119

Prediction Functions 120

PredictNodeID 122

External and User-Defined Functions 123

Predictions on Nested Tables 123

Predicting Nested Value Columns 124

Summary 125

Chapter 4 Using SQL Server Data Mining 127

Introducing the Business Intelligence Development Studio 128

Understanding the User Interface 128

Offline Mode and Immediate Mode 130

Immediate Mode 131

Getting Started in Immediate Mode 131

Offline Mode 132

Getting Started in Offline Mode 133

Switching Project Modes 135

Creating Data Mining Objects 135

Setting Up Your Data Sources 135

Understanding Data Sources 136

Creating the MovieClick Data Source 137

Using the Data Source View 137

Creating the MovieClick Data Source View 138

Working with Named Calculations 140

Creating a Named Calculation on the Customers Table 142

Working with Named Queries 142

Creating a Named Query Based on the Customers Table 143

Organizing the DSV 144

Exploring Data 145

Creating and Editing Models 148

Structures and Models 148

Using the Data Mining Wizard 148

Creating the MovieClick Mining Structure and Model 155

Using Data Mining Designer 157

Working with the Mining Structure Editor 157

Adding the Genre Column to the Movies Nested Table 159

Working with the Mining Models Editor 160

Creating and Modifying Additional Models 163

Processing 164

Processing the MovieClick Mining Structure 165

Using Your Models 166

Understanding the Model Viewers 166

Using the Mining Accuracy Chart 167

Selecting Test Data 168

Understanding the Accuracy Charts 169

Using the Profit Chart 172

Multiple Target Accuracy Charts 172

Using the Classification Matrix 173

Scatter Accuracy Charts 173

Creating a Lift Chart on MovieClick 174

Using CrossValidation 174

Using the Mining Model Prediction Builder 178

Executing a Query on the MovieClick Model 179

Creating Data Mining Reports 180

Using SQL Server Management Studio 181

Understanding the Management Studio User Interface 182

Using Server Explorer 182

Using Object Explorer 183

Using the Query Editor 184

Summary 185

Chapter 5 Implementing a Data Mining Process Using Office 2007 187

Importing Data using the Data Mining Client 189

Data Exploration and Preparation 190

Discretizing Data with the Explore Data Tool 191

Chopping Off the Long Tail 191

Consolidating Meaning 192

Eliminating Spurious Values 194

Rebalancing Data 195

Modeling 196

Task-Based Modeling 196

Introduction 198

Select Data 198

Select Columns and Options 198

Split Data 200

Finishing the Task 200

Advanced Modeling in the Data Mining Client 200

Accuracy and Validation 203

Model Usage 204

Browsing Models 204

Viewing Models with Visio 205

Querying Models 208

QueryWizard 208

Data Mining Cell Functions 211

DMPREDICT 211

DMPREDICTTABLEROW 212

DMCONTENTQUERY 212

Model Management 213

Trace 213

Summary 213

Chapter 6 Microsoft Naıve Bayes 215

Introducing the Naıve Bayes Algorithm 216

Using the Naıve Bayes Algorithm 216

Creating a Predictive Model 217

Data Exploration 219

Analysis of Key Influencers 219

Document Classification 220

DMX 222

Drill-Through 222

Understanding Naıve Bayes Content 223

Exploring a Naıve Bayes Model 225

Dependency Network 225

Attribute Profiles 226

Attribute Characteristics 227

Attribute Discrimination 228

Understanding Naıve Bayes Principles 229

Limitations of the Naıve Bayes Algorithm 231

Naıve Bayes Parameters 233

MAXIMUM INPUT ATTRIBUTES 233

MAXIMUM OUTPUT ATTRIBUTES 233

MAXIMUM STATES 233

MINIMUM DEPENDENCY PROBABILITY 234

Summary 234

Chapter 7 Microsoft Decision Trees Algorithm 235

Introducing Decision Trees 236

Using Decision Trees 237

Creating a Decision Tree Model 237

DMX Queries 237

Classification Model 237

Regression Model 239

Association 241

Model Content 244

Interpreting the Model 244

Decision Tree Principles 248

Basic Concepts of Tree Growth 248

Working with Many States in an Attribute 251

Avoiding Overtraining 252

Incorporating Prior Knowledge 252

Feature Selection 253

Using Continuous Inputs 253

Regression 254

Association Analysis with Microsoft Decision Trees 255

Parameters 256

COMPLEXITY PENALTY 257

MINIMUM SUPPORT 257

SCORE METHOD 257

SPLIT METHOD 258

MAXIMUM INPUT ATTRIBUTES 258

MAXIMUM OUTPUT ATTRIBUTES 258

FORCE REGRESSOR 258

Stored Procedures 259

Summary 260

Chapter 8 Microsoft Time Series Algorithm 263

Overview 264

Usage 265

Time Series Scenarios 267

Performing a Simple Forecast 267

Predicting Interdependent Series 268

Understanding Your Time Series 268

What-If Scenarios 269

Predicting New Series 269

DMX 270

Model Creation 270

Model Processing 272

Forecasting 274

Returning Supplemental Statistics 275

Changing the Future —Executing a What-If Forecast 276

Forecasting with Little Data —Applying Models to New

Data 277

Drill-Through 280

Principles of Time Series 280

Autoregression 281

Periodicity 281

Autoregression Trees 282

Prediction 284

Parameters 285

MISSING VALUE SUBSTITUTION 285

PERIODICITY HINT 286

AUTO DETECT PERIODICITY 286

MINIMUM and MAXIMUM SERIES VALUE 286

FORECAST METHOD 286

PREDICTION SMOOTHING 287

INSTABILITY SENSITIVITY 287

HISTORIC MODEL COUNT and HISTORIC MODEL GAP 287

COMPLEXITY PENALTY and MINIMUM SUPPORT 288

Model Content 289

Summary 289

Chapter 9 Microsoft Clustering 291

Overview 292

Usage of Clustering 294

Performing a Clustering 295

Clustering as an Analytical Step 297

Anomaly Detection Using Clustering 297

DMX 299

Model Creation 300

Drill-Through 301

Cluster 301

ClusterProbability 301

PredictHistogram 302

PredictCaseLikelihood 302

Model Content 303

Understanding Your Cluster Models 304

Get a High-Level Overview 305

Pick a Cluster and Determine How It Is Different from the

General Population 307

Determine How the Cluster Is Different from Nearby

Clusters 308

Verify that Your Assertions Are True 309

Label the Cluster 309

Principles of Clustering 309

Hard Clustering versus Soft Clustering 311

Discrete Clustering 312

Scalable Clustering 313

Clustering Prediction 314

Parameters 314

CLUSTERING METHOD 314

CLUSTER COUNT 315

MINIMUM CLUSTER CASES 315

MODELLING CARDINALITY 316

STOPPING TOLERANCE 316

SAMPLE SIZE 316

CLUSTER SEED 317

MAXIMUM INPUT ATTRIBUTES 317

MAXIMUM STATES 318

Summary 318

Chapter 10 Microsoft Sequence Clustering 319

Introducing the Microsoft Sequence Clustering Algorithm 320

Using the Microsoft Sequence Clustering Algorithm 320

Creating a Sequence Clustering Model 321

DMX Queries 322

Executing Cluster Predictions 323

Executing Sequence Predictions 323

Extracting the Probability for the Sequence Predictions 325

Using the Histogram of the Sequence Predictions 326

Detecting Unusual Sequence Patterns 329

Interpreting the Model 329

Cluster Diagram 330

Cluster Profiles 331

Cluster Characteristics 331

Cluster Discrimination 333

State Transitions 333

Microsoft Sequence Clustering Algorithm Principles 334

Understanding a Markov Chain 334

Order of a Markov Chain 335

State Transition Matrix 336

Clustering with a Markov Chain 337

Cluster Decomposition 339

Model Content 339

Algorithm Parameters 340

CLUSTER COUNT 340

MINIMUM SUPPORT 340

MAXIMUM STATES 341

MAXIMUM SEQUENCE STATES 341

Summary 341

Chapter 11 Microsoft Association Rules 343

Introducing Microsoft Association Rules 344

Using the Association Rules Algorithm 344

Data Exploration Models 345

A Simple Recommendation Engine 346

Advanced Cross-Sales Analysis 349

DMX 351

Model Content 355

Interpreting the Model 357

Association Algorithm Principles 359

Understanding Basic Association Algorithm Terms and

Concepts 359

Itemset 360

Support 360

Probability (Confidence) 361

Importance 361

Finding Frequent Itemsets 363

Generating Association Rules 366

Prediction 367

Algorithm Parameters 368

MINIMUM SUPPORT 368

MAXIMUM SUPPORT 368

MINIMUM PROBABILITY 368

MINIMUM IMPORTANCE 368

MAXIMUM ITEMSET SIZE 369

MINIMUM ITEMSET SIZE 369

MAXIMUM ITEMSET COUNT 369

OPTIMIZED PREDICTION COUNT 369

AUTODETECT MINIMUM SUPPORT 369

Summary 370

Chapter 12 Microsoft Neural Network and Logistic Regression 371

Same Principle, Two Algorithms 372

Using the Microsoft Neural Network 373

Text Classification Models 373

Utility Models 378

DMX Queries 378

Model Content 381

Interpreting the Model 382

Principles of the Microsoft Neural Network Algorithm 384

What Is a Neural Network? 385

Combination and Activation 387

Backpropagation, Error Function, and Conjugate Gradient 389

A Simple Example of Processing a Neural Network 390

Normalization and Mapping 392

Topology of the Network 393

Training the Ending Condition 394

Nonlinearly Separable Classes 395

Algorithm Parameters 396

MAXIMUM INPUT ATTRIBUTES 396

MAXIMUM OUTPUT ATTRIBUTES 396

MAXIMUM STATES 396

HOLDOUT PERCENTAGE 397

HOLDOUT SEED 397

HIDDEN NODE RATIO 397

SAMPLE SIZE 397

Summary 397

Chapter 13 Mining OLAP Cubes 399

Introducing OLAP 400

Understanding Star and Snowflake Schemas 401

Understanding Dimension and Hierarchy 402

Understanding Measures and Measure Groups 404

Understanding Cube Processing and Storage 404

Using Proactive Caching 405

Querying a Cube 406

Performing Calculations 407

Browsing a Cube 408

Understanding Unified Dimension Modeling 408

Understanding the Relationship between OLAP and Data

Mining 413

Mining Aggregated Data 414

OLAP Pattern Discovery Needs 415

OLAP Mining versus Relational Mining 415

Building OLAP Mining Models Using Wizards and Editors 417

Using the Data Mining Wizard 417

Building the Customer Segmentation Model 417

Creating a Market Basket Model 420

Creating a Sales Forecast Model 424

Using the Data Mining Designer 428

Understanding Data Mining Dimensions 429

Using MDX within DMX Queries 432

Using Analysis Management Objects for the OLAP Mining

Model 434

Summary 438

Chapter 14 Data Mining with SQL Server Integration Services 439

An Overview of SSIS 440

Understanding SSIS Packages 442

Task Flow 442

Standard Tasks in SSIS 442

Containers 443

Debugging 444

Exploring a Control Flow Example 444

Data Flow 444

Transformations 445

Viewers 446

Exploring a Data Flow Example 447

Working with SSIS in Data Mining 447

Data Mining Tasks 448

Data Mining Query Task 449

Analysis Services Processing Task 452

Analysis Services Execute DDL Task 453

Data Mining Transformations 455

Data Mining Model Training Destination 455

Data Mining Query Transformation 458

Example Data Flows 462

Using Non-Predictive Data Mining Queries in an

Integration Services Pipeline 463

Text Mining Transformations 464

Term Extraction Transformation 465

Term Lookup Transformation 467

More Details on the Text Mining Process 470

Summary 472

Chapter 15 SQL Server Data Mining Architecture 475

Introducing Analysis Services Architecture 476

XML for Analysis 476

XMLA APIs 477

Discover 478

Execute 479

XMLA and Analysis Services 480

Processing Architecture 482

Predictions 486

Data Mining Administration 487

Server Configuration 488

Data Mining Security 489

Security Requirements for Creating and Training Mining

Objects 491

Security for Various Deployment Scenarios 491

Local Database and Analysis Services 492

Local Analysis Services and a Remote Database 493

Intranet Analysis Services and Databases on the Same

Server 493

Analysis Services and Databases behind an HTTP

Endpoint in an Internet Deployment 494

Configuring Analysis Services for Use with Data Mining

Excel Add-Ins over HTTP 495

Summary 496

Chapter 16 Programming Sql Server Data Mining 497

Data Mining APIs 498

ADO 498

ADO.NET 500

ADOMD.NET 501

Server ADOMD.NET 501

AMO 501

Using Analysis Services APIs 502

Using Microsoft.AnalysisServices to Create and Manage

Mining Models 502

AMO Basics 503

AMO Applications and Security 505

Object Creation 506

Creating Data Access Objects 507

Creating the Mining Structure 510

Creating the Mining Models 512

Processing Mining Models 513

Deploying Mining Models 515

Setting Mining Permissions 516

Browsing and Querying Mining Models 517

Predicting with ADOMD.NET 517

More on Table-Valued Parameters in ADOMD.NET 522

Browsing Models 525

Stored Procedures 527

Writing Stored Procedures 529

Stored Procedures and Prepare Invocations 530

A Stored Procedure Example 530

Executing Queries inside Stored Procedures 533

Returning Data Sets from Stored Procedures 534

Deploying and Debugging Stored Procedure Assemblies 537

Summary 538

Chapter 17 Extending SQL Server Data Mining 541

Plug-in Algorithms 542

Plug-in Algorithm Framework 543

Lifetime of a Plug-in Algorithm Instance 543

Conceptual Overview 545

Model Creation and Processing 547

Prediction 553

Content Navigation 554

Custom Functions 555

PMML 557

Managed vs. Native Plug-ins 557

Installing Plug-in Algorithms 558

Where to Find Out More about Plug-in Algorithms 558

Data Mining Viewers 558

Interfaces to Be Implemented 559

Rendering the Information 559

Retrieving Information from Analysis Services 560

Registering the Viewer 561

Where to Find Out Mode about Plug-in Viewers 561

Summary 562

Chapter 18 Implementing a Web Cross-Selling Application 563

Source Data Description 564

Building Your Model 564

Identifying the Data Mining Task 564

Using Decision Trees for Association 565

Using the Association Rules Algorithm 567

Comparing the Two Models 568

Making Predictions 570

Making Batch Prediction Queries 570

Using Singleton Prediction Queries 572

Integrating Predictions with Web Applications 573

UnderstandingWeb Application Architecture 573

Setting the Permissions 574

Examining Sample Code for the Web Recommendation

Application 575

Summary 578

Chapter 19 Conclusion and Additional Resources 581

Recapping the Highlights of SQL Server 2008 Data Mining 581

State-of-the-Art Algorithms 582

Easy-to-Use Tools 583

Simple-Yet-Powerful API 584

Integration with Sibling BI Technologies 584

Exploring New Data Mining Frontiers and Opportunities 585

Further Reference 586

Microsoft Data Mining 586

General Data Mining 586

Appendix A Data Sets 589

Appendix B Supported Functions 595

Index 607

Data Mining with SQL Server 2008

Additional resources