In my previous post, I described text featurization using MicrosoftML.
In this post, I give a brief introduction to anomaly detection with MicrosoftML.
Note: As I mentioned in the previous post, MicrosoftML is currently available on Windows only (not on Linux, including Spark clusters). Sorry, but please wait for an update.
MicrosoftML provides a one-class support vector machine (OC-SVM) function named rxOneClassSvm, which is used for unbalanced binary classification. This function is an unsupervised learner, i.e., it doesn't need to know about the possible anomalies in the training phase. (Only normal data is used for training, and it is separated by the optimal hyperplane after being mapped into a high-dimensional space.)
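For reference, the hyperplane idea above corresponds to the standard one-class SVM formulation, sketched below. Here \Phi is the feature map induced by the kernel, n is the number of training points, and \nu in (0, 1] bounds the fraction of training points allowed to fall on the wrong side. That rxOneClassSvm solves exactly this problem is my assumption and is not confirmed here.

```latex
\min_{w,\ \xi,\ \rho}\quad
  \frac{1}{2}\lVert w \rVert^{2} + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\qquad \text{subject to}\quad
  \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n

f(x) = \operatorname{sgn}\bigl(\langle w, \Phi(x)\rangle - \rho\bigr)
```

A point x with f(x) = -1 lies on the far side of the hyperplane from the training data and is treated as an anomaly.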
First, here is a brief example of this function to help you understand it.
    library(MicrosoftML)

    # train data with normal data
    train_count <- 500
    ndivall <- rnorm(train_count)
    ndivnorm <- (ndivall - min(ndivall))/(max(ndivall) - min(ndivall))
    traindata <- data.frame(AvailableMemory = round(200 * ndivnorm, digits = 2))
    ndivall <- rnorm(train_count)
    ndivnorm <- (ndivall - min(ndivall))/(max(ndivall) - min(ndivall))
    traindata$DiskIO <- round(100 * ndivnorm, digits = 2)

    # test data with some anomaly data
    test_count <- 10
    ndivall <- rnorm(test_count)
    ndivnorm <- (ndivall - min(ndivall))/(max(ndivall) - min(ndivall))
    testdata <- data.frame(AvailableMemory = round(200 * ndivnorm, digits = 2))
    ndivall <- rnorm(test_count)
    ndivnorm <- (ndivall - min(ndivall))/(max(ndivall) - min(ndivall))
    testdata$DiskIO <- round(100 * ndivnorm, digits = 2)
    # overwrite rows 3 and 7 with values outside the normal DiskIO range (0 to 100)
    testdata$AvailableMemory[c(3,7)] <- c(100, 0)
    testdata$DiskIO[c(3,7)] <- c(150, 120)

    # train by OC-SVM with normal data
    model <- rxOneClassSvm(
      formula = ~AvailableMemory + DiskIO,
      data = traindata)

    # predict
    result <- rxPredict(
      model,
      data = testdata,
      extraVarsToWrite = c("AvailableMemory", "DiskIO"))
As you can see, rows #3 and #7 in the test data are outliers.
The following illustrates the data map, with the normal data shown as blue dots and the outlier data as red dots.
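A map like this can be drawn with base R graphics. The snippet below is a minimal sketch: it regenerates data of the same shape as the example above, and the helper name `scale_to`, the seed, and the axis limits are my own choices for illustration.

```r
# Regenerate data shaped like the example above (the seed is an arbitrary choice).
set.seed(1)
scale_to <- function(x, upper) {  # hypothetical helper: min-max scale to [0, upper]
  round(upper * (x - min(x)) / (max(x) - min(x)), digits = 2)
}
traindata <- data.frame(
  AvailableMemory = scale_to(rnorm(500), 200),
  DiskIO          = scale_to(rnorm(500), 100)
)
# The two hand-made outliers placed in rows #3 and #7 of the test data
outliers <- data.frame(AvailableMemory = c(100, 0), DiskIO = c(150, 120))

# Widen the axes so that the outliers fall inside the plotting region
plot(traindata$AvailableMemory, traindata$DiskIO,
     col = "blue", pch = 16, cex = 0.6,
     xlim = c(0, 200), ylim = c(0, 160),
     xlab = "AvailableMemory", ylab = "DiskIO")
points(outliers$AvailableMemory, outliers$DiskIO, col = "red", pch = 16, cex = 1.2)
legend("topright", legend = c("normal", "outlier"), col = c("blue", "red"), pch = 16)
```

Because DiskIO is scaled to at most 100 in the normal data, both red points sit clearly above the blue cloud.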
The following is the result. The outlier data in rows #3 and #7 is scored as follows.
Let's look at a real scenario.
Here I use the "Breast Cancer Wisconsin Data Set" (see here). This data includes the patient id, the diagnosis result (M = malignant, B = benign), and many attributes computed from a digitized image of a breast mass (radius, texture, perimeter, etc.). This sample is high-dimensional.
This dataset is already well-formed for analysis, but in a real application you must do some work before training, such as selecting appropriate attributes, vectorizing, cleaning the data, and eliminating dependencies.
    8510426, B, 13.54, 14.36, 87.46, ...
    8510653, B, 13.08, 15.71, 85.63, ...
    8510824, B, 9.504, 12.44, 60.34, ...
    ...
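As one illustration of the preprocessing mentioned above, "eliminating dependencies" could mean dropping attributes that are almost perfectly correlated with another attribute. The following is a minimal sketch under that assumption; the helper `drop_correlated`, the 0.95 cutoff, and the toy data are all invented for illustration and are not part of MicrosoftML.

```r
# Hypothetical helper: greedily keep a column only if its absolute correlation
# with every column kept so far is at most `cutoff`.
drop_correlated <- function(df, cutoff = 0.95) {
  cm <- abs(cor(df))
  keep <- character(0)
  for (name in colnames(df)) {
    if (length(keep) == 0 || all(cm[name, keep] <= cutoff)) {
      keep <- c(keep, name)
    }
  }
  df[, keep, drop = FALSE]
}

# Toy data: perimeter is almost a linear function of radius, texture is independent.
set.seed(7)
radius <- rnorm(100, mean = 14, sd = 3)
toy <- data.frame(
  radius    = radius,
  perimeter = 2 * pi * radius + rnorm(100, sd = 0.1),
  texture   = rnorm(100, mean = 19, sd = 4)
)
reduced <- drop_correlated(toy)
names(reduced)  # perimeter is dropped because it duplicates radius
```

Which attributes are actually redundant in a real dataset is a judgment call; a correlation cutoff is only one simple heuristic.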
Here I train and predict with the following steps.
- Split the original data into training data and test data.
- Create the trained model by rxOneClassSvm with the training data. We use all the attributes except for the patient id and the result ('M' or 'B') for training.
- Predict with the generated model on the test data, and evaluate the results. (Here I use the ROCR package.)
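The first step can be sketched in isolation. The snippet below uses a random hold-out instead of a split by row position; the toy data frame, the seed, and the 80/20 ratio are assumptions made only for illustration.

```r
# Toy stand-in for the real dataset: an id, a label, and one made-up numeric attribute.
set.seed(123)  # arbitrary seed so the split is repeatable
alldata <- data.frame(
  patientid   = 1:100,
  outcome     = rep(c("B", "M"), times = c(63, 37)),
  radius_mean = runif(100, 7, 28)
)

# Hold out a random 20% of the rows for testing.
test_idx  <- sample(nrow(alldata), size = round(0.2 * nrow(alldata)))
testdata  <- alldata[test_idx, ]
trainpool <- alldata[-test_idx, ]

# OC-SVM is trained on normal data only: keep benign rows, drop id and label.
traindata <- trainpool[trainpool$outcome == "B",
                       !(names(trainpool) %in% c("patientid", "outcome")),
                       drop = FALSE]
```

The test set keeps both benign and malignant rows, because the labels are needed later to evaluate the scores.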
The complete example is as follows:
    library("MicrosoftML")
    library("ROCR")

    # read data
    # (wdbc.data has no header row, so header = FALSE keeps the first record)
    alldata <- read.csv(
      "C:\\tmp\\wdbc.data",
      header = FALSE,
      col.names=c(
        "patientid", "outcome",
        "radius_mean", "texture_mean", "perimeter_mean", "area_mean",
        "smoothness_mean", "compactness_mean", "concavity_mean",
        "concavepoints_mean", "symmetry_mean", "fractaldimension_mean",
        "radius_error", "texture_error", "perimeter_error", "area_error",
        "smoothness_error", "compactness_error", "concavity_error",
        "concavepoints_error", "symmetry_error", "fractaldimension_error",
        "radius_worst", "texture_worst", "perimeter_worst", "area_worst",
        "smoothness_worst", "compactness_worst", "concavity_worst",
        "concavepoints_worst", "symmetry_worst", "fractaldimension_worst"))

    # split data
    # (Note that all training data must be normal data)
    traindata <- alldata[1:449,]
    traindata <- traindata[traindata$outcome=="B",]
    traindata <- traindata[,!(names(traindata) %in% c("patientid", "outcome"))]
    testdata <- alldata[450:nrow(alldata),]

    # train by OC-SVM with normal data
    model <- rxOneClassSvm(
      formula = ~ .,
      data = traindata)

    # predict using the trained model
    result <- rxPredict(
      model,
      data = testdata,
      extraVarsToWrite = c("outcome"))

    # evaluate results (compare with the diagnosis results) and plot
    pred <- prediction(
      predictions = result$Score,
      labels = result$outcome,
      label.ordering = c('B', 'M'))
    roc.perf <- performance(
      pred,
      measure = "tpr",
      x.measure = "fpr")
    plot(roc.perf)
The following is the result plotted by ROCR. The result seems to match the diagnosis results fairly well.
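To put a number on how well the scores match the diagnosis, the ROC curve is often summarized by the area under it (AUC). The following is a minimal base R sketch that computes AUC from ranks; the helper name `auc_from_scores` and the toy labels and scores are invented for illustration and are not output of rxPredict.

```r
# Hypothetical helper: AUC is the probability that a random positive scores
# higher than a random negative (ties count as 1/2), computed from ranks.
auc_from_scores <- function(scores, labels, positive = "M") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks handle ties
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy input, invented for illustration only.
labels <- c("B", "B", "B", "M", "B", "M", "M")
scores <- c(0.1, 0.3, 0.2, 0.8, 0.6, 0.5, 0.9)
auc <- auc_from_scores(scores, labels)
auc  # 11 of the 12 (M, B) pairs are ordered correctly, so 11/12
```

With ROCR itself, `performance(pred, measure = "auc")` reports the same quantity in its `y.values` slot.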
rxOneClassSvm uses the radial basis function (RBF) as the SVM kernel by default. For more complex cases, you can specify another kernel function (linear, polynomial, or sigmoid) with appropriate parameters.
    model <- rxOneClassSvm(
      formula = ~TestAttr1 + TestAttr2,
      kernel = polynomialKernel(a = .2, deg = 2),
      data = traindata)
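To give a feel for what the kernel parameters control, here are the textbook forms of the four kernels written as plain R functions. This is only a sketch: the function names are mine, `a` and `deg` follow the call above, `bias`, `gamma` and `coef0` are assumed parameter names, and none of this is MicrosoftML's internal implementation.

```r
# Textbook kernel forms; each returns a similarity value for two numeric vectors.
linear_k  <- function(x, y) sum(x * y)
poly_k    <- function(x, y, a = 1, bias = 0, deg = 3) (a * sum(x * y) + bias)^deg
rbf_k     <- function(x, y, gamma = 1) exp(-gamma * sum((x - y)^2))
sigmoid_k <- function(x, y, gamma = 1, coef0 = 0) tanh(gamma * sum(x * y) + coef0)

x <- c(1, 2)
y <- c(3, 4)
linear_k(x, y)                  # inner product: 1*3 + 2*4 = 11
poly_k(x, y, a = 0.2, deg = 2)  # (0.2 * 11)^2 = 4.84
rbf_k(x, x)                     # 1: a point is maximally similar to itself
rbf_k(x, y)                     # shrinks toward 0 as the points move apart
```

In the polynomial case, `deg` sets how strongly curved the decision boundary can be, while `a` rescales the inner product before the power is applied.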