Occam’s Razor and the Data Science Project


Microsoft’s Business Analytics and AI offering is not a single platform, but a group of related products and features. Why so many? Couldn’t someone just use Microsoft R Server, or Azure ML, or Hadoop to create a solution? Isn’t the simplest solution always the best? Well, yes, but only inasmuch as it is as simple as it should be.

In many data projects, it’s common to use a single process to get the answers you’re looking for. For instance, in a reporting-based application, you have a data source (or perhaps more than one) and you run a query over that data. The query might have multiple components, such as aggregations, combinations and filters, but in essence a single technology, or a single type of technology, derives the answer from the data. The simplest answer is to apply the query language to the data. And simple is best, right?
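
To make that concrete, here is a minimal sketch of that single-process pattern in Python with pandas (the data and column names are made up for illustration): one engine handles both the filtering and the aggregation in a single query.

```python
import pandas as pd

# Hypothetical sales data: a single source, queried by a single engine.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["gym",  "app",  "gym",  "app"],
    "revenue": [120.0,  80.0,   95.0,   60.0],
})

# One query does all the work: filter, then aggregate.
report = (
    sales[sales["revenue"] > 70]          # filter
    .groupby("region")["revenue"].sum()   # aggregation
)
print(report)
```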

In a Data Science project, more processes are involved. You start with not one but several data sources, and you use that data in more than one process. This is the primary reason you have an Extract, Load and Transform (ELT) component rather than an Extract, Transform and Load (ETL) component: the raw data is landed first, and each downstream process transforms it in its own way. Within a data flow in Data Science, it’s common to have different algorithms, processes and even tools in a given solution.
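
Here is a rough sketch of that ELT shape, again in Python with pandas and entirely hypothetical tables: the raw data is loaded as-is, and each consuming process applies its own transform afterward.

```python
import pandas as pd

# --- Extract and Load: land the raw data first (the "EL" in ELT) ---
# These frames stand in for raw tables loaded, untransformed, from
# hypothetical sources such as purchase logs and gym check-in records.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "item":        ["tracker", "clothing", "tracker", "clothing"],
    "amount":      [99.0, 45.0, 99.0, 45.0],
})
gym_visits = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "visit_date":  ["2016-01-04", "2016-01-11", "2016-01-05"],
})

# --- Transform: each downstream process reshapes the raw data its own way ---
spend  = purchases.groupby("customer_id")["amount"].sum()   # feeds a pricing model
visits = gym_visits.groupby("customer_id").size()           # feeds a usage analysis
```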

Let’s look at an example. Consider a company that has a loyalty program of some sort – perhaps they offer a discounted gym membership to their customers for buying a certain amount of subscription time to a fitness-tracking program for a device they sell. Assume that an analysis shows they are losing money by offering the gym discounts. Should they cancel the benefit? According to a single analysis process, the answer is yes. But perhaps we should dig a little deeper…

Accessing multiple sources of data about not only the customers but their extended buying practices, we use a classification method to learn more about their habits – with so many variables, perhaps a multi-class decision forest – and feed those results into yet another process to find the sequences in time for what the customers do next. A sequential pattern or basket analysis could yield this result. Once we know more about those habits, we can use a clustering algorithm, such as K-Means, to group these customers by similar attributes. In all, we’ve used the Azure platform, Storage, Hadoop, Azure ML with an R script, and Power BI to arrive at and display our results.
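
The real pipeline ran on the Azure stack, but the three-step idea can be sketched in Python with scikit-learn. Everything here is synthetic: the features, labels, purchase histories and parameters are invented for illustration, and the simple next-purchase pair count merely stands in for a full sequential pattern or basket analysis.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Step 1, classification: a random forest stands in for the
# multi-class decision forest; features and habit labels are synthetic.
X = rng.normal(size=(200, 5))        # 200 customers, 5 behavioral features
y = rng.integers(0, 3, size=200)     # 3 made-up habit classes
habits = RandomForestClassifier(n_estimators=100).fit(X, y).predict(X)

# Step 2, "what do they do next?": count which purchase follows which
# in each (invented) customer history, a toy proxy for basket analysis.
histories = [["tracker", "gym", "clothing"], ["tracker", "clothing"]]
next_pairs = Counter((a, b) for h in histories for a, b in zip(h, h[1:]))
print(next_pairs.most_common(3))

# Step 3, clustering: K-Means groups customers with similar attributes,
# here the raw features plus the predicted habit class.
segments = KMeans(n_clusters=4, n_init=10).fit_predict(
    np.column_stack([X, habits])
)
print(np.bincount(segments))         # how many customers per segment
```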

From all this analysis, we find that the customers taking advantage of the “money-losing” benefit actually purchase more of the company’s associated clothing line – something not originally factored in. The profit level on the clothing is high enough to more than offset the benefit – so the advice is to continue the program, and in fact expand it to feature in-gym promotions of the clothing line. Far from viewing the benefit as a loss-leader, we’ve turned it into a revenue opportunity.

The point of this (mostly) fictional exercise is to show that multiple techniques, algorithms and even tools are often required to complete a given investigation. It’s true that the simplest process that arrives at the correct answer is usually best – but things should be as simple as possible, and no simpler.

You can learn more about this process by starting here: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-create-experiment/ 
