Pattern Matching on xdf files in Microsoft R Server

Pattern Matching:

R uses regular expressions for pattern matching. For data that is not stored in xdf files, finding patterns is straightforward with R's grep function.

Example:
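
A minimal sketch of such a call on an ordinary character vector follows. The vector `products` and both patterns are assumptions made for illustration, chosen so that the calls print values like those listed under Output; they are not taken from the original data.

```r
# Hypothetical input: a plain character vector (no xdf file involved).
products <- c("Microsoft01", "Contoso02", "Microsoft03", "Contoso04", "Microsoft05")

# value = TRUE returns the matching elements instead of their positions.
allMicrosoft <- grep("^Microsoft", products, value = TRUE)
print(allMicrosoft)

# A narrower regular expression: starts with "Microsoft" and ends in "01".
firstOnly <- grep("^Microsoft.*01$", products, value = TRUE)
print(firstOnly)
```

Without value = TRUE, grep would return the integer positions of the matches rather than the strings themselves.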

Output:

[1] "Microsoft01" "Microsoft03" "Microsoft05"
[1] "Microsoft01"

With xdf files (the compressed binary file format used by RevoScaleR, which is part of Microsoft R Server), we can perform the same kind of search through the rxDataStep function. rxDataStep takes a parameter called transformFunc, which accepts a user-defined function.

In our example, we will use rxDataStep with a transformFunc to perform grep-like operations on one of RevoScaleR's default datasets.

Output:

Var 1: Q1
 7 factor levels: Completely satisfied Mostly satisfied Somewhat satisfied Neither satisfied nor dissatisfied Somewhat dissatisfied Mostly dissatisfied Completely dissatisfied
Var 2: Q2
 7 factor levels: Completely satisfied Mostly satisfied Somewhat satisfied Neither satisfied nor dissatisfied Somewhat dissatisfied Mostly dissatisfied Completely dissatisfied
Var 3: Q3
 7 factor levels: Completely satisfied Mostly satisfied Somewhat satisfied Neither satisfied nor dissatisfied Somewhat dissatisfied Mostly dissatisfied Completely dissatisfied
Var 4: Q4
 7 factor levels: Completely satisfied Mostly satisfied Somewhat satisfied Neither satisfied nor dissatisfied Somewhat dissatisfied Mostly dissatisfied Completely dissatisfied

grepFunc() is a user-defined function that is assigned to transformFunc. It holds the pattern matching logic that you want to apply to the data; in this case, a simple call to grep. One thing to note is that the first argument and the return value of a transformFunc are always named R lists whose elements are vectors of equal length. In our case, however, the output of grep might not match the input in size, because only the rows that match the pattern are returned. So we store the output in a variable that is defined in the transformObjects parameter of rxDataStep and set returnTransformObjects to TRUE. To explore these options in depth, check out the rxDataStep function documentation.

Output:

 $matched
 [1] "Somewhat satisfied" "Somewhat satisfied" "Somewhat dissatisfied" "Somewhat satisfied" "Somewhat dissatisfied" "Somewhat satisfied" "Somewhat satisfied" "Somewhat dissatisfied"
 [9] "Somewhat dissatisfied" "Somewhat satisfied"
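
The mechanism described above, a function that returns its input unchanged while collecting variable-length matches in a separate variable, can be tried in plain R without an xdf file. The sketch below is only an approximation under stated assumptions: the `survey` data frame is hypothetical, the `for` loop stands in for the chunk-by-chunk processing that rxDataStep performs, and the global `matched` variable plays the role of an entry in transformObjects. It does not call RevoScaleR.

```r
# Hypothetical stand-in for one survey column; not the RevoScaleR sample data.
survey <- data.frame(
  Q1 = c("Somewhat satisfied", "Mostly satisfied", "Somewhat dissatisfied",
         "Completely satisfied", "Somewhat satisfied", "Mostly dissatisfied"),
  stringsAsFactors = FALSE
)

# Plays the role of a transformObjects entry: state that outlives one chunk.
matched <- character(0)

# Same contract as a transformFunc: a named list comes in and the same named
# list goes out, so every element keeps its original length.
grepFunc <- function(dataList) {
  hits <- grep("Somewhat", dataList$Q1, value = TRUE)
  matched <<- c(matched, hits)  # side channel for the variable-length result
  dataList
}

# Stand-in for chunked processing: feed the rows to grepFunc two at a time.
chunkIds <- ceiling(seq_len(nrow(survey)) / 2)
for (chunk in split(survey, chunkIds)) {
  out <- grepFunc(as.list(chunk))
  stopifnot(length(out$Q1) == nrow(chunk))  # element lengths are unchanged
}

# Comparable in shape to what returnTransformObjects = TRUE hands back.
result <- list(matched = matched)
print(result)
```

Because the matches are appended to `matched` instead of being returned, the function can satisfy the equal-length rule even though grep keeps only some of the rows. How transform objects are exposed to the function inside rxDataStep is covered in the rxDataStep documentation; the `<<-` assignment here is simply how this sketch updates its accumulator.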