Getting Linear with it

Hello World!

This is a pull from my primary dev blog: www.dacrook.com.  This article covers Linear Regression Cost Functions with Linear Algebra :).

So I wrote an article earlier, "Linear Regression From Scratch".  Many folks have pointed out that this is in fact not the optimal approach, so being the perfectionist I am, I decided to re-implement it.  Not to mention it works great in my own libraries.  The following article discusses converting the original code into code that uses linear algebra.  Beyond this, it still works in PCL for Xamarin.  Hoo-Rah Xamarin!

Before we get started, note that you will need to install the NuGet packages MathNet.Numerics and MathNet.Numerics.FSharp.  If you are not compiling to PCL for Xamarin, you should probably just install the entire FsLab package, as it has a ton of great stuff for data science and machine learning.
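
If you are working from a script or project that already references those packages, a single open is enough to get everything used in the rest of this post (a minimal setup sketch):

    // Minimal setup sketch, assuming the packages above are already referenced.
    // This brings Matrix<float>, Vector<float> and the F#-friendly matrix/vector
    // builder functions used below into scope.
    open MathNet.Numerics.LinearAlgebra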

So let's take another look at the original Sum of Squares function...

    let SumSquaredError (predictionModel:IPredictionModel) (trainingData:UnivariateDataPoint seq) =
        trainingData
        |> Seq.map(fun point -> point.X
                                |> predictionModel.PredictValue
                                |> subtract point.Y
                                |> square)
        |> Seq.sumBy(fun point -> point)
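
For anyone who hasn't read the original article: the supporting pieces aren't shown here.  Based on how they are used above, they looked roughly like this; treat it as a reconstruction for context, not the exact original code.

    // Rough reconstruction of the supporting types and scalar helpers from the
    // earlier article, inferred from how they are used above.
    type UnivariateDataPoint = { X : float; Y : float }

    type IPredictionModel =
        abstract member PredictValue : float -> float

    // Scalar pipeline helpers: x |> subtract y gives x - y, square just squares.
    let subtract (amount : float) (value : float) = value - amount
    let square (x : float) = x * x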

Here are the issues with this solution:
1. We need this prediction model thing. (more code)
2. Our training data is this data point type. (more code)
3. Seq.map is essentially a for loop going through every object in the sequence and performing the manipulation (slow)
4. We still do this sum thing. (slow)
5. I forgot to divide by the length of the sequence.
6. We can't take advantage of a GPU or massive parallelization

Let's see what I have transformed this into.

    let SumSquares (theta:Vector<float>) (y:Vector<float>) (trainingData:Matrix<float>) =
        let m = trainingData.RowCount |> float
        (trainingData * theta)
        |> subtract y
        |> square
        |> divideBy m

Here are the immediate advantages
1. Cleaner and easier to read
2. Maps more closely to mathematical notation
3. Uses the MathNet.Numerics optimizations for linear algebra.
4. Deals with polynomials fine without extra code overhead (see the sketch just below).
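
To make that last point concrete (jumping ahead a little to the usage shown in the next section): nothing in SumSquares assumes a straight line.  Fitting, say, a quadratic just means adding an x^2 column to the data and a third weight to theta; the cost function itself does not change.  A quick sketch with made-up numbers:

    // Fitting y = b + m1*x + m2*x^2 with the very same SumSquares function.
    // The only change is the shape of the data: a third column holding x^2
    // and a third weight in theta. The values below are purely illustrative.
    let tDataQuad = matrix [[ 1.0; 2.0;  4.0]
                            [ 1.0; 3.0;  9.0]
                            [ 1.0; 3.0;  9.0]
                            [ 1.0; 4.0; 16.0]]
    let yQuad     = vector [ 5.0; 6.0; 7.0; 11.0]
    let thetaQuad = vector [ 1.0; 0.2; 0.05]
    tDataQuad |> SumSquares thetaQuad yQuad   // a 4x3 matrix and a 3-element theta still work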

Let's take a look at the test data.

    let tData = matrix [[ 1.0; 2.0]
                        [ 1.0; 3.0]
                        [ 1.0; 3.0]
                        [ 1.0; 4.0]]
    let yVals = vector [ 5.0; 6.0; 7.0; 11.0]
    let theta = vector [ 1.0; 0.2]
    tData |> SumSquares theta yVals

Ok, so this is different from the other example's usage for sure.  We have to extract our features and transform them into these matrix and vector things that most folks probably aren't familiar with.

So, how in the world did we essentially remove 2 for loops by doing this?  Let's dive in line by line, starting with our example data.

The Setup Data

tData is what is considered a 4 x 2 matrix.  It has 4 rows and 2 columns.  This is indicative of a polynomial of the form y = mx + b (or the function of a line).  This is a standard univariate polynomial.  Or in layman's terms, you have one predictor variable, such as the square footage of a house predicting its price.

Now wait a second, what do you mean one variable?  There are two there!  Yes: Price = m*squarefootage + 1*yaxismodifier, where m is how much the square footage affects the price and yaxismodifier is a constant boost up to the minimum home price (the intercept).  In all linear regressions, you will at a minimum have this extra 1.  So when you extract your data points from the data frame, you will need to append a column of ones.
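
If your raw features come without that column of ones, appending it is a one-liner.  Here is a minimal sketch; the addBias name is mine, not something from MathNet.Numerics.

    // Hypothetical helper: prepend a column of ones to a matrix of raw feature
    // values so the intercept (the y-axis modifier above) gets its own weight.
    let addBias (features : Matrix<float>) =
        let ones = vector [ for _ in 1 .. features.RowCount -> 1.0 ]
        features.InsertColumn(0, ones)

    // e.g. addBias (matrix [[2.0]; [3.0]; [3.0]; [4.0]]) gives the same 4x2 matrix as tData above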

Ok, that's cool, now what about yVals?  This is a Vector, which can be thought of as a 1 x n matrix.  There is 1 row and n columns.  yVals is simply the known values of our observed data set.  In this instance, if you oriented (or transposed) yVals to be a column with 5, 6, 7, 11 and appended it as a column to the end of tData, you would have the actual value for each observation.

Finally, theta is another vector, or 1 x n matrix.  This is our list of weights, i.e. how much each term affects the prediction.  These are the m and b in the equation y = mx + b (here b = 1.0 and m = 0.2).

The Equation

    let SumSquares (theta:Vector<float>) (y:Vector<float>) (trainingData:Matrix<float>) =
        let m = trainingData.RowCount |> float
        (trainingData * theta)
        |> subtract y
        |> square
        |> divideBy m
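
One thing worth calling out: subtract, square and divideBy are not MathNet.Numerics functions; they are small pipeline-friendly helpers, the vector-level counterparts of the scalar ones in the original version.  They aren't shown in this post, but given how they are used here and described below, plausible definitions look like this:

    // Plausible definitions of the vector-level helpers used above -- a sketch,
    // not necessarily the exact code. Arguments are ordered so the piped value
    // arrives last.
    let subtract (y : Vector<float>) (predictions : Vector<float>) = predictions - y
    let square (v : Vector<float>) = v * v                // vector * vector is the dot product (a float)
    let divideBy (m : float) (total : float) = total / m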

Let's step through this.  First, what is the goal?  Well, previously we multiplied every sample within the training data by the corresponding theta values, subtracted y, squared the result and divided by m.  Here we are taking advantage of how matrix multiplication and subtraction work.  To multiply 2 matrices, their sizes must be A x B and B x C, and the result is A x C.  trainingData is 4 x 2, and theta, because it is defined as a Vector, is treated here as a special case of a column matrix, which is in fact 2 x 1.  So 4 x 2 times 2 x 1 produces a 4 x 1 result.

Ok Ok, that's too much flipping math for my simple brain, what in the world are we really doing here?  I tried to explain it, but that didn't work, so here is a picture.

[Figure: vectorMatrixMult, theta applied to each row of tData]

As you can see above, you can think of it as applying theta to each row of tData, which produces the equation to the right and executes it.  This does all of that in one shot.  This is the type of thing GPUs are amazing at, by the way.  So as you can see, we easily replaced a "for i in 0 .. 3" loop with that single mathematical operation.  The resulting data type is a vector.  The library we are using takes care of converting vectors into column matrices and then back to vectors again for us when multiplying a matrix by a vector.  MathNet.Numerics is built quite well for this construction and deconstruction of matrices.
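
If it helps to see that equivalence in code, here is the loop that the single multiplication replaces, written against the test data from above (purely illustrative):

    // What (tData * theta) computes, spelled out as the explicit loop it replaces.
    // Each prediction is theta.[0] * 1.0 + theta.[1] * x, i.e. b + m*x for that row.
    let predictionsByLoop =
        vector [ for i in 0 .. tData.RowCount - 1 ->
                     tData.[i, 0] * theta.[0] + tData.[i, 1] * theta.[1] ]

    let predictionsByMatrix = tData * theta   // same four numbers: [1.4; 1.6; 1.6; 1.8]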

Next up, subtract y.  Now remember, y is a vector and the output of tData * theta is also a vector.  It simply subtracts from each cell the corresponding y value.  Sweet, that's another part of the for loop replaced.

What about square?  Well, we are giving it a 1 x 4 and a 1 x 4.  This is another special case of matrix multiplication, vector (dot product) multiplication, and our library takes care of it for us.  It does the same thing as before, giving us cell1 * cell1 + cell2 * cell2 + cell3 * cell3 + cell4 * cell4.  Hrm, looks like we are squaring, then summing.  What do you know?  That's pretty nifty.  That completes the whole summation portion of the equation without a single for loop.  Finally, we just divide by the total number of data points to get an average error, producing our sum of squares as a total cost function.
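
To tie it all together, here is the whole pipeline traced by hand on the test data from earlier (assuming the helper definitions sketched above):

    // Tracing tData |> SumSquares theta yVals step by step:
    //   predictions = tData * theta         = [  1.4;  1.6;  1.6;  1.8 ]
    //   errors      = predictions - yVals   = [ -3.6; -4.4; -5.4; -9.2 ]
    //   errors * errors (dot product)       = 12.96 + 19.36 + 29.16 + 84.64 = 146.12
    //   divided by m = 4.0                  = 36.53
    tData |> SumSquares theta yVals   // evaluates to 36.53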

Next post I will cover gradient descent, and we will be using this linear algebra approach from here on out.  At some other time I'll address the big data problem; it should be understood that this approach will only work for data up to a certain size.  In most cases, though, you won't have to worry about it.