Data Science is compute and labor intensive
In the previous blogs, we showed you how to find a dataset, clean it and run simple mapReduce, sort on the dataset. It was meant to give you a flavor of what data science is all about, and I also wanted to expose Big Data’s rather labor intensive nature. It does take some processing power and thinking to work on even just 10 Gigabytes of data. Now you can imagine why it takes a large team at large web 2.0 companies to deal with terabytes of data. To maintain good service levels on those live services, it takes some serious processing and man power. Back to our word count examples, one of the most common questions I get is that what exactly do people do after the word count is done, and why would you do word count in the first place. There are two parts of this blog, first explains how things work in detail, the second blog goes through an example.
Meaning of word count
Word count is the foundation of (NLP) or natural language processing, which is fundamental to applications such as speech recognition, search engines, machine translation, and just about any applications that needs to derive meaning from human input. By counting words and putting that into a mathematical model, you can derive the meaning of a sentence based on the words appear in it, at the same time, the context which words appear also determine the meaning of words. The simplest example is a word cloud.
We will go through a relatively complex example, even if you do not understand the mathematics behind it, please follow along to understand how words of English are turned into a mathematical model, and how relationships can be derived between words based on your model. These mathematical techniques can be applied to just about anything that can be turned into such a model, that includes your Facebook post behaviors, your online purchases, and even your travel patterns.
Python: the Data Modeling Language
If you are working with mathematical models and scientific computing, you are likely to encounter tools such as R, Python, Matlab, SAS, etc. Python is becoming more popular by the day due to its user base and the availability of packages. Another important reason we are using Python here is the fact that most data scientists use Python to get a “feel” for their data, any one of them will tell you that they use “Scripting” to “feel” their data, and that is here to stay. We’ll have a discussion on how Python and Hadoop fit together in a later blog. In a short sentence is that you need to use the right tool for the right job. Hadoop is an amazing and proven tool for scaling out data analysis at Peta-scale, while Python is great for prototyping and analytics for small amounts of data. In production, big data scientists may port, or modify their Python code to run on Hadoop.
It should take you just a couple of day and perhaps even a few hours to get acquainted with it. For this example, we encourage you to install Python with a few numerical packages. Luckily for Windows users, you can simply download EPD from Enthought, our friends in Austin with batteries included. Their distribution provides scientists with a set of tools including over 100 libraries to perform data analysis and visualization. Some of the most common ones include: SciPy and NumPy and matplotlib. Link to the binary download.
Once you have it installed, run a simple test with PyLab. If both import statements ran fine, you have it installed correctly, and those are the packages we need.
Semantic Matrix Model
The basic idea of this example is that the meaning of a document is determined by the words that appear in it, and the meaning of a word is determined by the documents in which it appears. Consider the following piece of text, each sentence consists of an array of words, and the words themselves appear in multiple sentences. Within different context (sentences), words may have different meanings. In this case, the word “tree” in m2 refers to a computer science tree. Graph and Trees are very much the same thing in sentence m2. You’ll also notice that while minors are human in our every day context, the word minor appears in m3 has nothing to do with the word human, while the word “user” refers to human in these sentences. Will a a mathematical model, or a semantic matrix analysis be able to tell them apart? Let us explore.
The first step is, you guessed it, word count. In the semantic matrix below, you can find a word count for each of the words. “User” appears in sentence c2, c3 and c5 1 time each, and we mark all the other context (sentences) that it didn’t appear in as 0. The Word system appeared in c4, 2 times. As you can see, we are simply counting up all the words for each of the sentences and each of the words. In the case of the entire internet, this could become a really really large Matrix. Perhaps, many billions of documents vs. millions of words. The concept of word counting is also known as N-Gram, in the case of counting just 1 word, we call it the 1-Gram, while we could also count 2-grams…. Ngrams. A couple of examples of 2-grams above are “response time”, and “human system.” Then, we will end up with even larger matrix. There are tools in Python that let’s you deal with this such as the NLTK.
The import thing to note here is that we turned the original corpus into a matrix, each document into a document vector, and each word into a word vector. You can do the same with just anything you want to compare relationships with, that was also explained in the “meaning of word count” section above.
To express these vectors in equations:
The meaning of a sentence is equal to adding together of meaning of all the words times a coefficient, or a linear combination of meaning of words. The coefficients are something we need to solve for in this matrix to determine their actual values, these are the a[i, j]. The word & document vectors themselves don’t get changed, but we are trying to find the coefficients.
SVD: Single Value decomposition
Single value decomposition is widely used technique in working with matrix transformation. It is the basis of a number of vector-based methods like Latent Semantic Analysis (LSA) and Principal Component Analysis (PCA), Independent Analysis (ICA). When you work with big data, you will encounter these methods rather frequently especially PCA. In recent context, SVD is used in advanced recommendation engines among other practical uses. By transforming the matrix in different ways using SVD, it allows us to look at meaning of words from different angles, or tuning the nobs to reveal different properties of the matrix.
Let’s go through the process of running SVD on a matrix. If you would like to know more details, consult your linear algebra book. Basically a matrix can be decomposed into 3 matrices. A is the original matrix, and U and V are the new basis. Same vectors, but diff coefficient. U, rows are document vectors but now with diff coefficient. W (the NxN matrix in the middle), only has diagonal Eigen values. V: columns of VT(transposed) are word vectors, since it is transposed.
You do need a sparse matrix solver to do a SOLVE, that is a topic that people have been working on for 50 years in HPC. There are efficient math solver libraries that you can use to compute these 3 matrices. You can find these solvers from Petsc, Trilinos, Python, or Matlab. Your best bet is the Numpy, SciPy library in Python that you just installed. There are also higher level tools that you can use such as the NLTK, natural language processing tool kit’s clustering algorithm API. We will have a lecture on using NLPTK with Hadoop in the future. Another package on Hadoop is the Mahout Machine learning package, it also has a SVD solver that you can use.
Dimension reduction: Smaller matrix but retain the most important part
As we mentioned earlier, the matrix can get rather large when you go to the internet scale. Perhaps SVD isn’t really the right solution, but for modest amounts of Data, SVD can easily handle the computation of a few hundred million cells in a large sparse matrix. Even so, we will need to do a dimension reduction by shrinking the size of the matrix. Eigen values are in decreasing order in the W matrix. We don’t have to keep all of them. We can keep a much smaller N, and it’s a good approximation of the original matrix. In language applications keeping about 300 for documents in the “millions.“
In other words, you are reducing the dimensionality by many orders of magnitude. Visually it is represented by a skinnier U matrix that’s as tall as the original one & flatter Vt Matrix that is as wide as the original one. As we recall from matrix multiplication, when you multiply them together, you still get the M x N dimension back! Now, that is Math Magic by approximation.
SVD only needs to be done once
If you are adding new documents, there are no need to redo this entire process again. That the red dots on the right side of the equation (U matrix’s last row) can be calculated easily by using the Au vector or the last row of the original matrix, AKA the added new document.
Uu = Au * V * W inverse
Au = new document and it is a vector. V, the smaller matrix, and W (inverse).
Distances and Relations
As we recall,
Dot products: They are simply a projected value from 1 vector onto the other; imagine you have two pencils at an angle to each other, the similarity can be computed by projecting a shadow from first pencil onto the second one.
Cosine: normalized dot product, you don’t care about the length, just the direction.
Euclidean distance: compute distance for all the points and add them together,
In any case, we are looking for distance/similarity of vectors, we can calculate meaning of words by comparing their vectors in the new Matrix. The calculation examples will continue into the second half of this blog.
One note is that Euclidean distance and the angle (the inverse-cosine) are metrics. The dot product and the cosine are NOT metrics. A metric should be zero when the distance is zero, it should always be non-negative, and it should obey the triangle inequality, so that the distance A-to-B + B-to-C is always greater or equal to the distance A-to-C (meaning that a roundabout path cannot be shorter than the direct route). That's not true for cosines; cosines will tell you that some roundabout routes are shorter than the direct one.
SVD is one of the more advanced topics and you may not have fully understood it in this blog without a math background. However, the things to remember here are:
1. Data that we collect can be turned into a mathematical model.
2. Documents, or individual behavior data can be turned into a feature vector.
3. These vectors can be magically transformed, factorized to reveal important characters of the matrix and the relations between the various feature vectors.
The example using Python will be through through in detail in the 2nd part of this blog, we will demonstrate to you mathematically that how this computer model can understand that USER = HUMAN.