To obtain the data a Data Scientist needs for an analysis, there are two options: you can get all the data (called a population, or “X”) or a subset of the data (called a sample, or “x”). Most of the time the data you need is too large to collect in full – especially when one of its dimensions is time.
So it seems a simple matter to look at a large group of data, pull some out at random, and use that to estimate what the rest of the data will look like. Ah, but therein lies the rub.
You see, humans aren’t really random, and computers are definitely not random. It might seem that things are randomly selected, but if you’re not extremely careful, a pattern emerges – statisticians call these patterns “biases”. The word bias comes to us through Latin and French, and essentially means “to slope or angle”. If we want to trust the base data we use for an estimation or assumption, we have to eliminate as many of these biases as we can. There are a few we can examine to see if they show up in our sample gathering. Although there are several, I’ll bunch them up into two major groups: Bias in Design, and Bias in Collection.
Bias in Design
In these types of bias, the researcher isn’t paying enough attention to how they design the study or data gathering.
The first and most common error is Selection Bias. This is where you pick the wrong group of data to begin with.
For example, if you’re testing to see the most popular kinds of food in the UK, creating a poll on the types of Grits they eat is a bad design. People in the UK don’t often eat Grits (although they do eat Polenta, which is essentially the same thing, but don’t tell them that). Or perhaps your test is designed to be administered in another country, so of course that wouldn’t tell you much about the UK. How do you stop this? Think it through – and involve as many people as you can in the design.
Also included in this area are population parameter biases, sensitivity biases and specificity biases. If you focus on Selection Bias first, you’ll correct most of your errors.
Another primary bias is more of an error – it’s not collecting enough parameters. If you only ask what type of meat people in the UK like, you’ll miss the people who eat only vegetables, and so on.
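To make Selection Bias a little more concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the two regions “A” and “B”, the 0-to-10 ratings and the 20% split are assumptions, not real survey data. It compares an estimate drawn from the wrong group with one drawn at random from the whole population.

```python
import random
import statistics

def build_population(rng, size=10_000):
    """Hypothetical population: each person has a region and a 0-10 rating."""
    population = []
    for _ in range(size):
        # Assumption for illustration: about 20% live in region "A"
        # and rate the food much higher than everyone else.
        if rng.random() < 0.2:
            population.append(("A", rng.gauss(8.0, 1.0)))
        else:
            population.append(("B", rng.gauss(4.0, 1.0)))
    return population

def mean_rating(rows):
    """Average rating over a list of (region, rating) pairs."""
    return statistics.fmean(rating for _, rating in rows)

rng = random.Random(42)  # fixed seed so the run is repeatable
population = build_population(rng)

# Biased design: the sampling frame only contains region "A".
biased_frame = [row for row in population if row[0] == "A"]
biased_sample = rng.sample(biased_frame, 200)

# Simple random sample drawn from the whole population.
random_sample = rng.sample(population, 200)

true_mean = mean_rating(population)
biased_error = abs(mean_rating(biased_sample) - true_mean)
random_error = abs(mean_rating(random_sample) - true_mean)

print(f"true mean:     {true_mean:.2f}")
print(f"biased error:  {biased_error:.2f}")
print(f"random error:  {random_error:.2f}")
```

Both samples have 200 rows, so the gap between the two errors comes from who was eligible to be picked, not from how many were picked.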
Bias in Collection
These types of biases happen when you’re getting the data, or running an experiment or test. Here the most common biases I see are designing a study that is too small (perhaps because you don’t have a lot of money), interference from the observer (like asking leading questions) and reading meaning into the data that the collected data doesn’t actually support.
By far the most common error is selecting too few samples from a population. More data is (almost) always a good thing, especially if the data is likely to have a high degree of variability. For instance, if you’re collecting data from a web log on a server, collecting data at one-day intervals is far less useful than collecting it each hour or minute. (Although there are some tricks around that – here’s an interesting example: http://www.eetimes.com/document.asp?doc_id=1275354 )
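The effect of sample size on a highly variable data set can be sketched in a few lines of Python. The population here is hypothetical (made-up response times drawn from an exponential distribution with a mean of 200 ms), and `spread_of_estimates` is just a helper name I chose for this example. It repeats the sampling many times and measures how much the estimated mean jumps around from one sample to the next.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 50,000 response times in milliseconds.
# An exponential distribution is assumed because it is highly variable.
population = [rng.expovariate(1 / 200) for _ in range(50_000)]

def spread_of_estimates(sample_size, repeats=300):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

small = spread_of_estimates(10)
large = spread_of_estimates(1000)

print(f"spread with   10 rows per sample: {small:.1f} ms")
print(f"spread with 1000 rows per sample: {large:.1f} ms")
```

The estimates from the small samples should scatter far more widely than those from the large ones, which is why a single small sample is so easy to be fooled by.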
You can avoid these types of errors by evaluating multiple collections to see if they show too much variability with each other, using statistical tests that will help you ferret out bias.
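As one possible way to compare collections, here is a sketch of a permutation test on the difference in means. This is my choice of test for the example, not the only option, and the three collections are simulated numbers rather than real measurements. The idea: if two collections really come from the same source, shuffling their rows together and re-splitting them should produce differences as large as the observed one fairly often.

```python
import random
import statistics

def permutation_p_value(a, b, rng, rounds=2000):
    """Approximate two-sided p-value for a difference in means."""
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        # Shuffle the pooled rows and split them into two new groups.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical collections: the first two share a source, the third is shifted.
collection_a = [rng.gauss(100, 15) for _ in range(80)]
collection_b = [rng.gauss(100, 15) for _ in range(80)]
collection_shifted = [rng.gauss(115, 15) for _ in range(80)]

p_similar = permutation_p_value(collection_a, collection_b, rng)
p_different = permutation_p_value(collection_a, collection_shifted, rng)

print(f"a vs b:       p = {p_similar:.4f}")
print(f"a vs shifted: p = {p_different:.4f}")
```

A very small p-value suggests the two collections differ by more than chance would explain, which is a hint to go back and look for a bias in how one of them was gathered.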