Analyze your text in R (MicrosoftML)

The “MicrosoftML” package is powerful and is available in recent versions of Microsoft R Server and R Client. Here I describe how to use its text featurization capabilities, using a simple sentiment example.

Note : Sorry, but for now it is available on Windows only (not Linux). Please wait for an update.

You can apply text featurization to sentiment analysis, spam filtering (classification), social analytics, and so on. Note that the currently supported languages are English, French, German, Dutch, Italian, Spanish and Japanese (great!). (See MSDN for the latest update.)

Sample DataSet

Here I use a sample dataset of Amazon book reviews, which has 975,194 records, each with a rating and a free-text comment, as follows. (see here for this dataset)

2.0     This book has its good ...
2.0     The fatalistic view of ...
1.0     I was intrigued by the ...
2.0     I admit, I haven't finished ...

Train with text featurization

Now let’s look at sample code that uses text featurization. The explanation comes first, followed by the code.

In MicrosoftML (MML), text featurization is one of the transforms that can be applied during training. rxFastLinear, which is new in the MML package, is a fast linear learner that supports both regression and binary classification.
Before running the regression, the transforms featurize the REVIEW_TEXT string and store the result in a column named “Features”. (Not every word is kept. In this example, words that are used fewer than 500 times are ignored.)
The rxFastLinear call then takes this featurized data as input, applies the regression algorithm, and produces a model that maps a REVIEW_TEXT string to a numeric RATING.
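To make the featurization step more concrete, here is a minimal sketch in base R of what counting word n-grams and dropping rare terms can look like. It is not the MicrosoftML implementation. The toy reviews are made up, the helper names (tokenize, ngrams_1_2) are hypothetical, and min_docs = 2 stands in for the threshold of 500 used in this post. Counting per review rather than per occurrence is also an assumption.

```r
# Toy reviews, made up for illustration only
reviews <- c(
  "not disappointed at all",
  "really disappointed",
  "not bad not disappointed"
)

# Hypothetical helper: split a review into lower-case word tokens
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  tokens[nzchar(tokens)]
}

# Hypothetical helper: unigrams plus adjacent bigrams joined with "|"
ngrams_1_2 <- function(tokens) {
  bigrams <- if (length(tokens) > 1) {
    paste(head(tokens, -1), tail(tokens, -1), sep = "|")
  } else {
    character(0)
  }
  c(tokens, bigrams)
}

terms_per_review <- lapply(reviews, function(r) ngrams_1_2(tokenize(r)))

# Document frequency: in how many reviews does each term appear?
doc_freq <- table(unlist(lapply(terms_per_review, unique)))

# Keep only the terms seen in at least `min_docs` reviews
min_docs <- 2
kept <- names(doc_freq)[as.vector(doc_freq) >= min_docs]
print(kept)
```

Only the terms that survive this filter would become columns of the feature vector, which keeps the model small even when the vocabulary is large.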

Note that this is a deliberately straightforward example to aid understanding, and that real modeling and tuning involve further considerations. (Production use is not this simple.)
For example, consider a reviewer whose opinion is extremely positive (beyond what 5.0 can express). The rating will still be just 5.0, because 5.0 is the maximum value. That is, the numeric rating is not strictly linear in human sentiment, and the model is essentially making a binary positive/negative decision. One option is to transform the data and use logistic regression for the binary decision, so that the confidence of the result relates to the actual rating value. Otherwise you must apply some other adjustment.
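As a rough illustration of that binary reframing, the sketch below maps numeric ratings to a positive/negative label in base R. The helper name to_binary_label and the 4.0 cut-off are assumptions made for illustration, not something taken from the dataset, and the ratings are made-up values.

```r
# Hypothetical helper: TRUE for positive reviews, FALSE otherwise.
# The 4.0 cut-off is an assumption; choose one that suits your data.
to_binary_label <- function(rating, threshold = 4.0) {
  rating >= threshold
}

ratings <- c(1.0, 2.0, 3.0, 4.0, 5.0)  # made-up values
labels <- to_binary_label(ratings)
print(labels)
```

With such a label column in the training data, rxFastLinear can be called with type = "binary" instead of type = "regression".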


# read data (you can also use RxTextData)
data <- read.table(
  # hypothetical path: point this at your copy of the dataset
  file = "amazon_reviews.tsv",
  col.names = c("RATING", "REVIEW_TEXT"),
  sep = "\t",
  header = TRUE,
  quote = "\"",
  stringsAsFactors = FALSE)

# split data (total 975194 records)
train <- data[1:900000,]

# create transforms
transformRule = list(
  # featurize REVIEW_TEXT and store the result as "Features"
  featurizeText(
    vars = c(Features = "REVIEW_TEXT"),
    language = "English",
    wordFeatureExtractor = ngramCount(
      # ex: include not only "Azure", "AD", but also "Azure AD"
      ngramLength = 2,
      # ex: allow up to 1 token to be skipped inside an n-gram
      skipLength = 1,
      weighting = "tfidf")),
  # ignore the features used less than 500 times
  selectFeatures(
    vars = c("Features"),
    mode = minCount(500)))

# train using transforms !
model <- rxFastLinear(
  RATING ~ Features,
  data = train,
  mlTransforms = transformRule,
  type = "regression") # not binary (numeric regression)

summary(model,top = 50)

The summary function shows the coefficients, and you can see many negative and positive words in the output (the ones I surrounded with green rectangles).
In this example, we’re using 2 as the n-gram length for the text featurization (see the source code above), so word pairs like “not|disappointed” are also scored correctly.
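The skipLength setting matters for pairs like this when another word sits in between. Here is a minimal base R sketch of bigrams with skips, assuming that a skip length of k lets up to k tokens be skipped between the two words. The helper name skip_bigrams and the "|" separator are illustrative choices, and how MicrosoftML names such features internally is not shown in this post.

```r
# Hypothetical helper: bigrams where up to `max_skip` tokens may be
# skipped between the first and the second word.
skip_bigrams <- function(tokens, max_skip = 1) {
  n <- length(tokens)
  out <- character(0)
  for (gap in 0:max_skip) {
    step <- gap + 1  # distance between the two words of the pair
    if (n > step) {
      out <- c(out, paste(tokens[1:(n - step)], tokens[(1 + step):n], sep = "|"))
    }
  }
  out
}

tokens <- c("not", "really", "disappointed")
print(skip_bigrams(tokens, max_skip = 0))  # adjacent pairs only
print(skip_bigrams(tokens, max_skip = 1))  # also pairs with one word skipped
```

With a skip of 1, “not really disappointed” also yields the pair “not|disappointed”, so the negation stays attached to the word it modifies.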

Predict (Scoring)

Now let’s predict ratings for some input text using the generated model.
The following call scores three input texts and returns the scored (predicted) rating for each.

pred1 <- rxPredict(
  model,
  data = data.frame(
    "REVIEW_TEXT" = c(
      "It's very easy to understand and useful for me.",
      "It's boring.",
      "Humm, it seems to be a lot of pages..."),
    # dummy values (the column must exist, but it is not used for scoring)
    "RATING" = c(0.0, 0.0, 0.0),
    stringsAsFactors = FALSE),
  extraVarsToWrite = c("REVIEW_TEXT"))

The next sample predicts ratings for 700 new records and plots the actual ratings against the predicted ratings.

test <- data[920001:920700,]
pred2 <- rxPredict(
  model,
  data = test,
  extraVarsToWrite = c("RATING", "REVIEW_TEXT"))
plot(x = pred2$RATING, y = pred2$Score)
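Besides the plot, a couple of summary numbers can help judge how well the predictions track the actual ratings. The following is a small base R sketch. The rmse helper is hypothetical, and the two vectors are made-up stand-ins for pred2$RATING and pred2$Score.

```r
# Hypothetical helper: root-mean-square error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up numbers standing in for pred2$RATING and pred2$Score
actual    <- c(5, 4, 1, 2, 5)
predicted <- c(4.5, 4.0, 2.0, 2.5, 4.0)

cat("RMSE:", rmse(actual, predicted), "\n")
cat("Correlation:", cor(actual, predicted), "\n")
```

A lower RMSE and a correlation closer to 1 indicate predictions that follow the actual ratings more closely.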


I will show a social analytics example (with real data) using the MicrosoftML package at an upcoming event in Japan! (Please enjoy.)


MSDN : Introduction to MicrosoftML

Revolution Analytics Blog : Building a machine learning model with the MicrosoftML package

