Fun with Nested Tables and Sampling

Last week Jason Carlson from Reporting Services asked me to help him with a model he was building to analyze some survey results. He was trying to see how company-related questions, such as number of PCs, relate to product-related questions, such as which features of Reporting Services are interesting. When we created the model we got some weird results - attributes showed up in the model that we didn't expect, and we couldn't figure out why (even though I implemented the code - twice!). Eventually, going back to the code, we figured it out and corrected the model. I told Jason I'd write up a tip or trick, and I did - although it turned into a mini-whitepaper.

Peter told me a kind of funny story today. An internal group was doing some data mining and needed to split their data into testing and training sets. For some reason they weren't using the wonderful sampling transforms in Integration Services but instead came up with their own ad-hoc method using timestamps and modulo of some columns and maybe even some scales of newt. In any case, Peter asserted that this method of splitting wasn't random and they said "prove it." So he did. He created a decision tree model to predict whether a case came from the training set or the testing set, and it produced a very deep tree explicitly listing the rules that would put a case into the training set or the testing set. In other words, set membership was predictable from the data itself, which a truly random split would not allow. He ran the same experiment on data split by the Integration Services transforms, and the algorithm was not able to detect anything that would indicate which set each case came from. I think they will be using the transforms from now on...
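
If you want to try Peter's trick yourself, here is a rough sketch of the idea in Python. It is not what Peter ran: instead of a full decision tree it searches for the best single split (a "decision stump"), and the data, the `hour` and `amount` columns, and the "late in the day goes to testing" rule are all made up for illustration. The point is only that a learner can recover a non-random split rule, while on a random split it does no better than guessing the majority set.

```python
import random

def majority_baseline(labels):
    """Accuracy of always guessing the larger of the two sets."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)

def best_stump_accuracy(rows, labels):
    """Best accuracy of any one-split rule of the form `feature <= threshold`.

    This is a stand-in for a decision tree: a single split, tried on every
    feature and every threshold, predicting one set on each side.
    """
    n = len(rows)
    total_pos = sum(labels)
    best = majority_baseline(labels)
    for f in range(len(rows[0])):
        pairs = sorted(zip((r[f] for r in rows), labels))
        left_pos = 0
        for i, (value, label) in enumerate(pairs):
            left_pos += label
            # Only split between distinct feature values.
            if i + 1 < n and pairs[i + 1][0] == value:
                continue
            left_n, right_n = i + 1, n - (i + 1)
            right_pos = total_pos - left_pos
            left_is_test = left_pos + (right_n - right_pos)   # left=1, right=0
            right_is_test = (left_n - left_pos) + right_pos   # left=0, right=1
            best = max(best, left_is_test / n, right_is_test / n)
    return best

rng = random.Random(0)
# Hypothetical cases: (hour of day from a timestamp, some numeric column).
rows = [(rng.randrange(24), rng.random() * 100) for _ in range(2000)]

# Assumed ad-hoc split: anything stamped 17:00 or later goes to testing (1).
adhoc_labels = [1 if hour >= 17 else 0 for hour, _ in rows]
# Random split: roughly 30% of cases go to testing, regardless of content.
random_labels = [1 if rng.random() < 0.3 else 0 for _ in rows]

adhoc_acc = best_stump_accuracy(rows, adhoc_labels)
random_acc = best_stump_accuracy(rows, random_labels)

print("ad-hoc split: stump", adhoc_acc, "vs baseline", majority_baseline(adhoc_labels))
print("random split: stump", random_acc, "vs baseline", majority_baseline(random_labels))
```

On the ad-hoc split the stump finds the hour cutoff and predicts set membership perfectly; on the random split it should land within a hair of the majority baseline. A real split rule built from timestamps and modulo arithmetic would likely need a deeper tree to uncover, which matches the very deep tree Peter got.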