Rob Caron pointed me to David Lempher’s recent blog post about possibly doing impact analysis with Team Foundation Server. It’s a funny coincidence, because Bogdan Crivat and I were brainstorming about how we can take better advantage of the data mining in SQL Server 2005.
Data mining with SQL Server 2005 involves applying one of several algorithms to your data to pull out information that one cannot deduce from casual observation. There are several algorithms that I think would lend themselves to software development.
SQL Server 2005 provides the following algorithms that you can use for data mining:
- Microsoft Clustering
- Microsoft Association Rules
- Microsoft Decision Trees
- Microsoft Linear Regression
- Microsoft Logistic Regression
- Microsoft Naive Bayes
- Microsoft Neural Network
- Microsoft Sequence Clustering
- Microsoft Time Series
The Microsoft Association Rules algorithm is used for what is known as market basket analysis. The canonical example is a grocery shopping cart: market basket analysis provides insight into questions like ‘when people buy diapers, what else are they buying?’ One answer to that question, surprisingly, is beer; I guess dads must get sent out a lot to buy diapers, so they pick up a little treat for themselves. Grocery stores use this information to coordinate their promotions: when diapers go on sale, don’t put beer on sale as well, since it is likely to get bought anyway.
So how does this relate to software development? Generally speaking, the Microsoft Association Rules algorithm builds rules that can be used to predict the presence of an item based on the other items in a transaction. A transaction in this case could be the check-in that fixes a bug. What files are being included? If file foo.cs is being changed as part of a bug fix, are there other files that are likely to be included? This could help with impact analysis. Imagine a check-in policy that looks at the files in your change set and tries to predict what else might be there, or should be there. That could help you answer questions like ‘what else gets affected if I touch this file?’
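To make the check-in idea a bit more concrete, here is a minimal Python sketch of the counting behind association rules. It is not the Microsoft Association Rules algorithm itself, just pairwise support and confidence over a made-up check-in history. The file names, the thresholds, and the helpers `pair_rules` and `suggest_missing` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(checkins, min_support=2, min_confidence=0.6):
    """Build 'if lhs is checked in, rhs usually is too' rules for file pairs."""
    file_counts = Counter()
    pair_counts = Counter()
    for files in checkins:
        unique = sorted(set(files))
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), together in pair_counts.items():
        if together < min_support:
            continue  # the pair was seen together too rarely to trust
        for lhs, rhs in ((a, b), (b, a)):
            confidence = together / file_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, confidence))
    return sorted(rules, key=lambda r: (-r[2], r[0], r[1]))


def suggest_missing(change_set, rules):
    """Files that the rules expect in this change set but that are absent."""
    present = set(change_set)
    return sorted({rhs for lhs, rhs, _ in rules
                   if lhs in present and rhs not in present})


# Hypothetical history: each entry is the set of files in one bug-fix check-in.
CHECKINS = [
    ["foo.cs", "foo_tests.cs", "bar.cs"],
    ["foo.cs", "foo_tests.cs"],
    ["foo.cs", "foo_tests.cs", "baz.cs"],
    ["bar.cs", "baz.cs"],
]

rules = pair_rules(CHECKINS)
print(suggest_missing(["foo.cs"], rules))  # ['foo_tests.cs']
```

A check-in policy along these lines would only warn, not block: a missing file is a hint that something may have been forgotten, not proof.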
Another interesting algorithm to apply is the Microsoft Clustering algorithm. This algorithm creates clusters of artifacts that share similar characteristics. The canonical example of this algorithm in use is in the retail industry. The question to answer might be, ‘in my population of customers, who is buying bicycles?’ Applying a clustering algorithm to this data might show that customers who buy bicycles tend to live in the northwest, are male and between the ages of 25–30; this is just an example of course.
Applied to software development, clustering might be used to pull out source files that happen to share similar characteristics. Maybe it is because they are the ones always being fixed? One could implement impact analysis by looking at whether a file that is about to be changed is part of one of these clusters.
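As a rough illustration of what clustering source files could look like, here is a plain k-means sketch in Python. It is a stand-in for the general technique, not the SQL Server implementation, and everything in it is an assumption: the two features (number of check-ins and number of bug fixes per file), the sample numbers, and the choice of starting centroids.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: returns one cluster index per point."""
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        centroids = new_centroids
    return labels


# Hypothetical per-file statistics: (check-ins, bug fixes).
FILE_STATS = {
    "foo.cs": (40, 25),
    "bar.cs": (35, 30),
    "util.cs": (5, 1),
    "readme.txt": (2, 0),
}

names = list(FILE_STATS)
points = [FILE_STATS[name] for name in names]
labels = kmeans(points, centroids=[points[0], points[-1]])
clusters = dict(zip(names, labels))
print(clusters)
```

With this toy data the frequently fixed files (foo.cs and bar.cs) land in one cluster and the quiet ones in the other, so a file about to be changed could be looked up in `clusters` to see whether it belongs to the trouble-prone group.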
Data mining is also commonly used to produce Profit Charts to help direct-marketers. The canonical scenario is a direct mail campaign. It costs something to send a mailer out to a customer, so businesses want to set a limit on how many they send to reduce their cost. On the other hand, the more mailers they send, the more customers they contact, and the more profit they stand to make.
The crux of the problem is how to find a happy balance. Data mining helps by creating profit charts. Profit charts work hand in hand with lift charts. Lift charts are used to illustrate the accuracy of a prediction. A prediction in a direct mail campaign might be whether the customer responds or not. The value of the prediction will be ‘yes they will respond’, or ‘no they will not respond’.
The accuracy of the prediction depends on the attributes of the customers in the population. Maybe there are some customers who have attributes that strongly favor a prediction, one way or another. Maybe there are other customers who have attributes that make an accurate prediction very hard to come by.
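One way to picture a lift chart is as a cumulative gains curve: rank the customers by predicted probability, then track what share of the actual responders you have reached as you work down the list. The Python sketch below shows that calculation under my own assumptions; the scored cases are invented and `cumulative_gains` is a hypothetical helper, not anything taken from SQL Server.

```python
def cumulative_gains(cases):
    """cases: (predicted_probability, responded) pairs.

    Returns (fraction_of_population_contacted, fraction_of_responders_reached)
    for each cut-off, working down from the most likely responder.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    total_responders = sum(1 for _, responded in ranked if responded)
    reached = 0
    gains = []
    for contacted, (_, responded) in enumerate(ranked, start=1):
        reached += int(responded)
        gains.append((contacted / len(ranked), reached / total_responders))
    return gains


# Hypothetical scored customers: (predicted probability, actually responded).
CASES = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False),
    (0.4, True), (0.3, False), (0.2, False), (0.1, False), (0.05, False),
]

gains = cumulative_gains(CASES)
for population, responders in gains:
    print(f"{population:.0%} contacted -> {responders:.0%} of responders reached")
```

In this toy data, contacting the top 20 percent of customers reaches half of the responders, where a random mailing would be expected to reach about 20 percent of them. That gap is the lift.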
Quoting from the MSDN help, the inputs to a profit chart are:
- Population
“The number of cases in the dataset that is being used to create the lift chart. For example, the number of potential customers.
- Fixed Cost
The fixed cost that is associated with the business problem. If this were for a targeted mailing solution, the cost would not depend on variables such as the number of telephone calls made or the number of promotional mailings sent.
- Individual Cost
Costs that are in addition to the fixed cost, that can be associated with each customer contact. For example, promotional mailings or telephone calls.
- Revenue Per Individual
The amount of revenue that is associated with each successful sale.”
A typical profit chart shows an increase in profits up to a point, after which profits decrease as more of the population is contacted.
For example, if the peak of the profit curve is at 55 percent of the population and the associated predict probability is 20 percent, this indicates that to achieve maximum profits you should only contact those customers whose response is predicted with a 20 percent or greater chance.
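Here is a simplified Python sketch of how such a profit curve could be computed from scored cases and where its peak falls. The formula is my own simplification (revenue from responders reached, minus the fixed cost, minus the per-contact cost), not necessarily the exact calculation SQL Server performs, and the cases, costs and function names are hypothetical.

```python
def profit_curve(cases, fixed_cost, individual_cost, revenue_per_individual):
    """cases: (predicted_probability, responded) pairs.

    Returns (fraction_contacted, probability_at_cutoff, profit) per cut-off.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    curve = []
    responders = 0
    for contacted, (probability, responded) in enumerate(ranked, start=1):
        responders += int(responded)
        profit = (responders * revenue_per_individual
                  - fixed_cost
                  - contacted * individual_cost)
        curve.append((contacted / len(ranked), probability, profit))
    return curve


def peak(curve):
    """The cut-off with the highest profit."""
    return max(curve, key=lambda row: row[2])


# Hypothetical scored customers: (predicted probability, actually responded).
CASES = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False),
    (0.4, True), (0.3, False), (0.2, False), (0.1, False), (0.05, False),
]

curve = profit_curve(CASES, fixed_cost=10, individual_cost=3,
                     revenue_per_individual=10)
population, threshold, profit = peak(curve)
print(f"Contact the top {population:.0%} (probability >= {threshold}): profit {profit}")
```

With these made-up numbers the peak sits at 60 percent of the population and a predicted probability of 0.4, so the rule of thumb would be to contact only customers scored at 0.4 or higher.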
There may be an application for profit charts in software development. Bogdan and his team have done some experiments on predicting the outcome of a bug. For example, will the outcome be actionable (i.e. fixed) or non-actionable (i.e. duplicate, not-repro)? When a schedule gets tight, maybe it makes sense to use a profit chart to decide which bugs to work on, for instance only those predicted to be actionable with a probability of 20% or greater.
Anyways, just some thoughts on data mining and Team Foundation Server. I think we are just scratching the surface of what can be done here. I would suggest poking around on the SQL Server Data Mining community site to learn more.
Over the next few weeks, I’ll be doing a few more posts on data mining and Team Foundation Server. Exploring ways of doing impact analysis might be a great question to start with.
David, in case you’re reading, I’ll see you in Australia in a few weeks, looking forward to it!