A KMeans example for Spark MLlib on HDInsight

Article
09/24/2015

Today we will take a look at Sparks's module for MLlib or its built-in machine learning library Sparks MLlib Guide . KMeans is a popular clustering method. Clustering methods are used when there is no class to be predicted but instances are divided into groups or clusters. The clusters hopefully will represent some mechanism at play that draws the instance to a particular cluster. The instances assigned to the cluster should have a strong resemblance to each other. A typical use case for KMeans is segmentation of data. For example suppose you are studying heart disease and you have a theory that individuals with heart disease are overweight. You have collected data from individuals with and without heart disease and measurements of their weight like their body mass index, waist-to-hip ratio, skinfold thickness, and actual weight. KMeans is used to cluster the data into groups for further analysis and to test the theory. You can find out more about KMeans on Wikipedia Wikipedia KMeans .

The data that we are going to use in today's example is stock market data with the ConnorsRSI indicator. You can learn more about ConnorsRSI at ConnorsRSI. Below is a sample of the data. ConnorsRSI is a composite indicator made up from RSI_CLOSE_3, PERCENT_RANK_100, and RSI_STREAK_2. We will use these attributes as well as the actual ConnorsRSI (CRSI) and RSI2 to pass into our KMeans algorithm. The calculation of this data is already normalized from 0 to 100. The other columns like ID, LABEL, RTN5, FIVE_DAY_GL, and CLOSE we will use to do further analysis once we cluster the instances. They will not be passed into the KMeans algorithm.

Sample Data (CSV): 1988 instances of SPY SPY

Column	Description
ID	Used to uniquely identify the instance. date:symbol
Label	If the close was up or down from the previous days close.
RTN5	The return from the past 5 days.
FIVE_DAY_GL	The return from the next 5 days.
Close	Closing price.
RSI2	Relative Strength Index (2 days).
RSI_CLOSE_3	Relative Strength Index (3 days).
RSI_STREAK_2	Relative Strength Index (2 days) for streak durations based on the closing price.
PERCENT_RANK_100	The percentage rank value over the last 100 days. This is a rank that compares todays return to the last 100 returns.
CRSI	The ConnorsRSI Indicator. (RSI_CLOSE_3 + RSI_STREAK_2 + PERCENT_RANK) / 3

ID	LABEL	RTN5	FIVE_DAY_GL	CLOSE	RSI2	RSI_CLOSE_3	PERCENT_RANK_100	RSI_STREAK_2	CRSI
2015-09-16:SPY	UP	2.76708	-3.28704	200.18	91.5775	81.572	84	73.2035	79.5918
2015-09-15:SPY	UP	0.521704	-2.29265	198.46	83.4467	72.9477	92	60.6273	75.1917
2015-09-14:SPY	DN	1.77579	0.22958	196.01	47.0239	51.3076	31	25.807	36.0382
2015-09-11:SPY	UP	0.60854	-0.65569	196.74	69.9559	61.0005	76	76.643	71.2145
2015-09-10:SPY	UP	0.225168	1.98111	195.85	57.2462	53.9258	79	65.2266	66.0508
2015-09-09:SPY	DN	1.5748	2.76708	194.79	42.8488	46.1728	7	31.9797	28.3842
2015-09-08:SPY	UP	-0.12141	0.521704	197.43	73.7949	64.0751	98	61.2696	74.4483
2015-09-04:SPY	DN	-3.35709	1.77579	192.59	22.4626	31.7166	6	28.549	22.0886

The KMeans algorithm needs to be told how many clusters (K) the instances should be grouped into. For our example let's start with two clusters to see if they have a relationship to the label, "UP" or "DN". The Apache Spark scala documentation has the details on all the methods for KMeans and KMeansModel at KMeansModel

Below is the scala code which you can run in a zeppelin notebook or spark-shell on your HDInsight cluster with Spark. HDInsight

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.clustering.KMeans

import org.apache.spark.sql.functions._

// load file and remove header

val data = sc.textFile("wasb:///data/spykmeans.csv")

val header = data.first

val rows = data.filter(l => l != header)

// define case class

case class CC1(ID: String, LABEL: String, RTN5: Double, FIVE_DAY_GL: Double, CLOSE: Double, RSI2: Double, RSI_CLOSE_3: Double, PERCENT_RANK_100: Double, RSI_STREAK_2: Double, CRSI: Double)

// comma separator split

val allSplit = rows.map(line => line.split(","))

// map parts to case class

val allData = allSplit.map( p => CC1( p(0).toString, p(1).toString, p(2).trim.toDouble, p(3).trim.toDouble, p(4).trim.toDouble, p(5).trim.toDouble, p(6).trim.toDouble, p(7).trim.toDouble, p(8).trim.toDouble, p(9).trim.toDouble))

// convert rdd to dataframe

val allDF = allData.toDF()

// convert back to rdd and cache the data

val rowsRDD = allDF.rdd.map(r => (r.getString(0), r.getString(1), r.getDouble(2), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6), r.getDouble(7), r.getDouble(8), r.getDouble(9) ))

rowsRDD.cache()

// convert data to RDD which will be passed to KMeans and cache the data. We are passing in RSI2, RSI_CLOSE_3, PERCENT_RANK_100, RSI_STREAK_2 and CRSI to KMeans. These are the attributes we want to use to assign the instance to a cluster

val vectors = allDF.rdd.map(r => Vectors.dense( r.getDouble(5), r.getDouble(6), r.getDouble(7), r.getDouble(8), r.getDouble(9) ))

vectors.cache()

//KMeans model with 2 clusters and 20 iterations

val kMeansModel = KMeans.train(vectors, 2, 20)

//Print the center of each cluster

kMeansModel.clusterCenters.foreach(println)

// Get the prediction from the model with the ID so we can link them back to other information

val predictions = rowsRDD.map{r => (r._1, kMeansModel.predict(Vectors.dense(r._6, r._7, r._8, r._9, r._10) ))}

// convert the rdd to a dataframe

val predDF = predictions.toDF("ID", "CLUSTER")

The code imports some methods for Vector, KMeans and SQL that we need. It then loads the .csv file from disk and removes the header that have our column descriptions. We then define a case class, split the columns by comma and map the data into the case class. We then convert the RDD into a dataframe. Next we map the dataframe back to an RDD and cache the data. We then create an RDD for the 5 columns we want to pass to the KMeans algorithm and cache the data. We want the RDD cached because KMeans is a very iterative algorithm. The caching helps speed up performance. We then create the kMeansModel passing in the vector RDD that has our attributes and specifying we want two clusters and 20 iterations. We then print out the centers for all the clusters. Now that the model is created, we get our predictions for the clusters with an ID so that we can uniquely identify each instance with the cluster it was assigned to. We then convert this back to a dataframe to analyze.

Below is a subset of the allDF dataframe with our data.

Below is a subset of our predDF dataframe with the ID and the CLUSTER. We now have a unique identifier and which cluster the KMeans algoritm assigned it to. Also displayed is the mean for each of the attributes passed into the KMeans algorithm for each cluster. Cluster 0 and Cluster 1. You can see that the means are very close in each cluster. For Cluster 0 it is around 27 and for cluster 1 it is around 71.

Because the allDF and predDF dataframes have a common column we can join them and do more analysis.

// join the dataframes on ID (spark 1.4.1)

val t = allDF.join(prdDF, "ID")

Now we have all of our data combined with the CLUSTER that the KMeans algorithm assigned each instance to and we can continue our investigation.

Let's display a subset of each cluster. It looks like cluster 0 is mostly DN labels and has attributes averaging around 27 like the centers of the clusters indicated. Cluster 1 is mostly UP labels and the attributes average is around 71.

// review a subset of each cluster

t.filter("CLUSTER = 0").show()

t.filter("CLUSTER = 1").show()

Let's get descriptive statistics on each of our clusters. This is for all the instances in each cluster and not just a subset. This gives us the count, mean, stddev, min, max for all numeric values in the dataframe. We filter each by CLUSTER.

// get descriptive statistics for each cluster

t.filter("CLUSTER = 0").describe().show()

t.filter("CLUSTER = 1").describe().show()

So what can we infer from the output of our KMeans clusters?

Cluster 0 has lower ConnorsRSI (CRSI), with a mean of 27. Cluster 1 has higher CRSI, with a mean of 71. Could these be areas to initiate buy and sells signals?
Cluster 0 has mostly DN labels, and Cluster 1 has mostly UP labels.
Cluster 0 has a mean of .28 % gain five days later, while cluster 1 has a loss of .03 five days later.
Cluster 0 has a mean loss of 1.22% five days before and cluster 1 has a gain of 1.15% five days before. Does this suggest markets revert to their mean?
Both clusters have min\max of 5 day returns between positive 19.40% to a loss of 19.79%.

This is just the tip of the iceberg with further questions, but gives an example of using HDInsight and spark to start your own KMeans analysis. Spark MLlib has many algorithms to explore including SVMs, logistic regression, linear regression, naïve bayes, decision trees, random forests, basic statistics, and more. The implementation of these algorithms in spark MLlib is for distributed clusters so you can do machine learning on big data. Next I think I'll run the analysis on all data for AMEX, NASDAQ and NYSE stock exchanges and see if the pattern holds up!

Bill

spykmeans.csv

A KMeans example for Spark MLlib on HDInsight

Additional resources