# Databas(ics)

Data science begins with data. Data are things that you know about, well, other things, so it makes sense to ensure you have a firm grasp on handling that data.

Note: I know this seems really basic, but stick with me – it gets deep quick, and it’s essential to understand this well. I (recently) had to go back and qualify some statements a client made when we started down an analysis path, only to learn that one of these fundamental concepts was misunderstood, and it changed the whole project!

First, a couple of terms. Data is defined by the Oxford English Dictionary as anything that qualifies as a Noun. According to my favorite source of education, that means a “Person, Place or Thing”. The next term we care about is Metadata, which is defined as data about data. So let’s take a quick look at that in action:

Data: Buck Woody

Metadata: Name, Characters (number=10, counting the space), English, Noun, Sentence, ASCII (Binary Conversion: 01000010011101010110001101101011 0101011101101111011011110110010001111001)… ad infinitum

So you can see that the metadata about a datum is actually larger than the datum itself. (I’ve often wondered if there really is data at all, but rather just a group of metadata describing the datum which actually forms the datum, but I digress…)
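To make that concrete, here is a minimal Python sketch that derives a few pieces of metadata from the datum above. The dictionary keys are names I made up for illustration, and this is only a handful of the things you could say about the datum.

```python
# A sketch only: the keys below are hypothetical labels, not a standard schema.
datum = "Buck Woody"

metadata = {
    "type": type(datum).__name__,                   # how Python stores it
    "characters": len(datum),                       # counts the space too
    "letters": sum(ch.isalpha() for ch in datum),   # alphabetic characters only
    "words": len(datum.split()),
    "is_ascii": datum.isascii(),
    # One 8-bit group per character, including the space
    "ascii_bits": " ".join(format(ord(ch), "08b") for ch in datum),
}

for key, value in metadata.items():
    print(f"{key}: {value}")
```

Even this short list is already longer than the ten characters it describes.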

There are generally two types of data: qualitative (information about something) and quantitative (numerical information that can be calculated). So “Buck” is qualitative (my name) and “42” was my age (at one time anyway) which is quantitative – the number of years I’ve been taking up space on the planet. These distinctions are VERY important to the Data Scientist, since we’ve developed lots of methods to handle and present each of these kinds of information. You’ll use various quantitative techniques in R and Python, and other methods for qualitative data. You’ll also need to fundamentally understand these differences when you embark on learning Machine Learning techniques.
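As a rough illustration of why the distinction matters, the sketch below uses a few made-up records (the names and ages other than “Buck” and “42” are invented for the example). Arithmetic makes sense on the quantitative column, while counting is about all you can sensibly do with the qualitative one.

```python
from collections import Counter
from statistics import mean

# Made-up records for illustration only.
people = [
    {"name": "Buck", "age": 42},
    {"name": "Ada", "age": 36},
    {"name": "Buck", "age": 27},
]

ages = [person["age"] for person in people]     # quantitative
names = [person["name"] for person in people]   # qualitative

average_age = mean(ages)           # arithmetic is meaningful for ages
name_counts = Counter(names)       # counting is meaningful for names
most_common_name = name_counts.most_common(1)[0][0]

print("Average age:", average_age)
print("Name counts:", dict(name_counts))
print("Most common name:", most_common_name)
```

Asking for the “average name” has no answer, which is the point: the type of data decides which operations are available to you.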

By the way – you can turn a qualitative datum into a quantitative one. We do this all the time – “On a scale from one to ten, how handsome is Buck Woody?” (don’t answer that). This is an important skill, and one I’ll cover in another Notebook entry.

The next distinction is whether the data is Discrete or Continuous. Continuous is probably easier to understand – it’s data that can take any value along a scale, with no gaps between one value and the next. Think about the temperature – it can be 0.000000001 Celsius, or .00001, or .01, or 0, or 1, depending on what level of precision you need for measuring it. The value ranges over a scale. In point of fact, the number of possible values is infinite – which even has its own branch of math to deal with.

There are a couple of other data basics you need to understand. Does the data take one of a fixed set of distinct values? We call this a categorical datum – things like Male/Female, Night/Day, 1/0, Red/Green/Blue. Note that categorical data can be recorded as numbers (like 1/0) or as labels (like Red/Green/Blue).
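Here is a small sketch of categorical data in Python, using made-up observations. The numeric codes are an arbitrary assumption on my part; they only stand in for the labels, so the counts come out the same either way.

```python
from collections import Counter

# Made-up observations for illustration only.
colors = ["Red", "Green", "Blue", "Red", "Red", "Green"]

# Frequencies are the natural summary for categorical data.
color_counts = Counter(colors)

# The same categories written as numbers. The codes are arbitrary labels.
codes = {"Red": 1, "Green": 2, "Blue": 3}
coded_colors = [codes[color] for color in colors]
coded_counts = Counter(coded_colors)

print(dict(color_counts))
print(dict(coded_counts))
```

Swapping the codes around (say Red=3, Blue=1) would change nothing about the data, which is a good hint that adding or averaging those numbers would not mean anything.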

Discrete data has gaps. Read that again. That means there are not 1.0000001 bananas on my desk, just 1. It’s a discrete thing. This gets a little trickier than you might imagine, especially when we start thinking about that conversion from qualitative to quantitative I mentioned earlier (which is actually the error behind the project I talked about at the start).
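One way to see the “gaps” is to ask whether the value halfway between two valid values is itself valid. The sketch below is only an illustration; `midpoint` and `is_whole` are helper names I made up, and I use exact fractions so that floating-point rounding does not muddy the picture.

```python
from fractions import Fraction


def midpoint(low, high):
    """Return the value exactly halfway between low and high."""
    return (Fraction(low) + Fraction(high)) / 2


def is_whole(value):
    """True when the value is a whole number, i.e. a valid count."""
    return Fraction(value).denominator == 1


# Continuous: halfway between 0 and 0.01 degrees is still a temperature.
temperature_mid = midpoint(Fraction(0), Fraction(1, 100))

# Discrete: halfway between 1 and 2 bananas is not a valid count of bananas.
banana_mid = midpoint(1, 2)

print("Temperature midpoint:", temperature_mid)
print("Banana midpoint:", banana_mid, "valid count?", is_whole(banana_mid))
```

You can keep splitting the temperature interval forever and always land on a legitimate reading. The banana count falls into a gap on the very first split.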

Let’s say you’ve lined up me and several Hollywood actors. I’ve asked you “Put us in groups of attractiveness”. You’ve done that, arranging us into “Scary”, “Guy Next Door” and “Wow”. Are these groups evenly spaced? How much better is “Wow” than “Scary”? Is there a “Partly-Wow” we should have used? “Amazingly-Scary”? The point here is that a fundamental error I see quite often is treating the result of a Classification as if it were a number you can compare and calculate with. You’ll see this all the time in rankings – rate your teacher from 1-10, or this meal from 1-7. You can’t treat discrete data like continuous data, especially when you are choosing an algorithm to work with.
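Here is a hedged sketch of that error in Python. The ratings and both scoring schemes are invented for illustration. Each scheme respects the order of the groups, yet the average changes completely depending on how far apart you pretend the groups are, while the middle group stays put.

```python
from statistics import mean

# Made-up ratings using the three groups from the example above.
ratings = ["Scary", "Guy Next Door", "Guy Next Door", "Wow", "Wow"]

# Two hypothetical scorings. Both keep the order Scary < Guy Next Door < Wow.
evenly_spaced = {"Scary": 1, "Guy Next Door": 2, "Wow": 3}
stretched = {"Scary": 1, "Guy Next Door": 2, "Wow": 10}


def mean_score(labels, scores):
    """Average the numeric codes (this silently assumes the spacing is real)."""
    return mean([scores[label] for label in labels])


def median_label(labels, scores):
    """Middle label after sorting by rank (this relies only on the order)."""
    ranked = sorted(labels, key=lambda label: scores[label])
    return ranked[(len(ranked) - 1) // 2]


print("Mean, evenly spaced:", mean_score(ratings, evenly_spaced))
print("Mean, stretched:    ", mean_score(ratings, stretched))
print("Median, evenly spaced:", median_label(ratings, evenly_spaced))
print("Median, stretched:    ", median_label(ratings, stretched))
```

The mean depends on a spacing nobody actually measured. A summary that uses only the order, like the median, gives the same answer under both scorings.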

I’ll refer back to this post as I cover algorithms in the future. For now, your homework is to use your newfound skill and start looking at the world using these terms – where do they work, and do they break down anywhere? Why?
