Data Transformation in Function rxDataStep

01/20/2017

Microsoft R Server supports four kinds of R transformations: transformFunc (a transformation function), transforms (a list of transform expressions), rowSelection (a logical expression), and in-line expressions in formulas. This article focuses on how to use transforms and transformFunc for variable transformation. All of the following examples use an RxSqlServerData source from the test database of Microsoft R Server as the input data of rxDataStep, and they run in the SQL compute context. (Note: transforms, transformFunc, and their related parameters can be used in all compute contexts, including Teradata, Hadoop, and Spark.) To understand how data transformation works in RevoScaleR, first review a few concepts:

Lexical scoping: when R evaluates an expression, it first looks for each object by name in the local environment. If the object is not found there, R searches the enclosing environment of the local environment, then the enclosing environment of that environment, and so on.
Dynamic scoping: looking up variables in the calling environment rather than in the enclosing environment.
Calling environment: the environment from which the function was called. Use parent.frame() to get the calling environment.
Enclosing environment: the environment in which the function was created; it is the one used for lexical scoping. Every function has exactly one enclosing environment. Use environment(f) to get the enclosing environment of a function f, or parent.env(e) to get the enclosing environment of an environment e.
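
The difference between the two lookup rules can be seen in a few lines of base R. This is a minimal sketch that does not use RevoScaleR; the names constant, lexicalLookup, dynamicLookup, and caller are hypothetical and chosen only for illustration.

```r
# Lexical vs. dynamic lookup in base R (no RevoScaleR required).
constant <- 1   # defined where the functions below are created

lexicalLookup <- function() {
  # Free variable: resolved through the enclosing environment.
  constant
}

dynamicLookup <- function() {
  # Explicit dynamic lookup: start the search in the calling environment.
  callingEnv <- parent.frame()
  get("constant", envir = callingEnv)
}

caller <- function() {
  constant <- 100   # local to caller(); only the dynamic lookup sees it
  c(lexical = lexicalLookup(), dynamic = dynamicLookup())
}

result <- caller()
print(result)
```

The lexical lookup ignores the value set inside caller() because lexicalLookup was created outside of it, while the dynamic lookup picks up the caller's local value.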

For more information about scoping and environments in R, see the references at the end of this article.

1. Using transforms

In the rx functions of RevoScaleR, the transforms argument takes an expression of the form list(name = expression, ...) that represents the first round of variable transformations. Each expression returns a vector. Within an expression you can change the content or data type of a variable, or remove the variable.
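
As a rough illustration of the list(name = expression, ...) form, the sketch below applies two transforms to a small in-memory data frame. The data frame df and its columns are hypothetical stand-ins for the RxSqlServerData source used in this article, and the call is guarded because RevoScaleR is only available with Microsoft R Server or Microsoft R Client. A base R transform() call is included only for comparison.

```r
# Hypothetical in-memory data; the article itself reads from SQL Server.
df <- data.frame(x = c(1, 2, 3), z = c(10, 20, 30))

# Comparable result in base R, shown only for orientation.
baseEquivalent <- transform(df, x2 = x * 2, x = as.integer(x))

# RevoScaleR is not on CRAN, so run this part only where it is installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  out <- RevoScaleR::rxDataStep(
    inData = df,
    transforms = list(
      x2 = x * 2,          # new variable computed from an existing one
      x  = as.integer(x)   # change the data type of an existing variable
    )
  )
  print(out)
}
```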

The Original data:

OUTPUT:

2. Using transformFunc

The transformFunc argument differs from transforms: it is an R function whose first argument and return value are named R lists with vector elements of equal length. The output list of transformFunc can contain modified elements or newly named elements. It is the recommended way to do variable transformation.
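
Because a transformFunc is an ordinary R function from a named list to a named list, it can be tried on a plain list before it is handed to rxDataStep. The following is a sketch under that assumption; addRatio, df, and the column names are hypothetical, and the rxDataStep call is guarded because RevoScaleR is only available with Microsoft R Server or Microsoft R Client.

```r
# Hypothetical in-memory data; the article itself reads from SQL Server.
df <- data.frame(x = c(1, 2, 3), z = c(10, 20, 30))

# Takes a named list of equal-length vectors and returns one.
addRatio <- function(dataList) {
  dataList$ratio <- dataList$z / dataList$x   # newly named element
  dataList$x <- dataList$x * 2                # modified element
  dataList
}

# Check the function on a plain list, without RevoScaleR.
chunk <- addRatio(list(x = c(1, 2, 3), z = c(10, 20, 30)))
print(chunk)

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  out <- RevoScaleR::rxDataStep(
    inData = df,
    transformFunc = addRatio,
    transformVars = c("x", "z")   # variables the function needs as input
  )
  print(out)
}
```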

OUTPUT:

3. Using transforms with UDF and "unknown" variable in UDF

transforms can also be defined in the function call using the expression function. When you use a user-defined function (UDF) in transforms, you need to pass the UDF to the remote side with the transformObjects argument, because the transforms expression is evaluated on the server side. If the UDF refers to an "unknown" variable, you also need to specify that variable in transformObjects, which passes the object into the calling environment. To access the "unknown" variable in the UDF, you have to use dynamic scoping so that R looks up the variable in the calling environment; otherwise, following lexical scoping, R looks it up in the enclosing environment of the UDF.

Note: here, the R expression constant <- get("constant", parent.frame()) performs the dynamic lookup: it retrieves constant from the calling environment.
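
The pattern described in this section might look like the sketch below. The UDF scaleByConstant, the helper simulateCaller, and the data frame df are hypothetical. simulateCaller only imitates a caller that holds constant locally so that the lookup can be checked in base R; the guarded rxDataStep call follows the description in this section and has not been verified against a SQL Server compute context.

```r
# Hypothetical in-memory data; the article itself reads from SQL Server.
df <- data.frame(x = c(1, 2, 3))

# UDF that finds "constant" in the calling environment, not the enclosing one.
scaleByConstant <- function(x) {
  constant <- get("constant", parent.frame())
  x * constant
}

# Base R check: the caller owns "constant", and the UDF finds it dynamically.
simulateCaller <- function() {
  constant <- 10
  scaleByConstant(c(1, 2, 3))
}
scaled <- simulateCaller()
print(scaled)

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  out <- RevoScaleR::rxDataStep(
    inData = df,
    transforms = list(scaled = scaleByConstant(x)),
    # Pass both the UDF and the variable it looks up.
    transformObjects = list(scaleByConstant = scaleByConstant, constant = 10)
  )
  print(out)
}
```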

OUTPUT:

4. Using transformFunc with UDF and "unknown" variable in UDF

transformFunc differs from transforms: it is evaluated on the client side, so you do not need to pass the UDF name to the server side. In addition, the objects specified in transformObjects are passed to the enclosing environment of the transformation function, so you do not need dynamic scoping when you use transformFunc for variable transformation.
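
A sketch of this case follows. The function scaleChunk refers to constant as a plain free variable, with no call to get(). The names scaleChunk, emulated, and df are hypothetical. The base R part only emulates the behavior described above by placing the transformObjects list in the function's enclosing environment; it is not RevoScaleR's actual mechanism. The rxDataStep call is guarded because RevoScaleR is only available with Microsoft R Server or Microsoft R Client.

```r
# Hypothetical in-memory data; the article itself reads from SQL Server.
df <- data.frame(x = c(1, 2, 3))

# "constant" is a free variable here; no dynamic lookup is written.
scaleChunk <- function(dataList) {
  dataList$scaled <- dataList$x * constant
  dataList
}

objects <- list(constant = 10)

# Base R emulation (an assumption, for illustration only): put the objects
# in an environment and make it the function's enclosing environment.
emulated <- scaleChunk
environment(emulated) <- list2env(objects, parent = environment(scaleChunk))
emulatedResult <- emulated(list(x = c(1, 2, 3)))
print(emulatedResult$scaled)

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  out <- RevoScaleR::rxDataStep(
    inData = df,
    transformFunc = scaleChunk,
    transformVars = "x",
    transformObjects = objects
  )
  print(out)
}
```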

OUTPUT:

5. Using transformEnvir with transforms

transformEnvir is a user-defined environment. It is used as the parent environment of the transformation functions and contains the data specified by transformObjects. If the transform functions reference multiple objects, you can bind those objects to a user-defined environment and pass that environment to the remote side in transformEnvir, instead of listing every object in transformObjects.

However, when you use transforms for variable transformation, you must set the user-defined environment as the enclosing environment of the transformation function. Otherwise, R cannot find the "unknown" variable or the function in either the calling or the enclosing environment.
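
Setting the enclosing environment of a function is plain base R, so that step can be sketched and checked without RevoScaleR. In the sketch below, myEnv, constant, scaleByConstant, and df are hypothetical names, and the guarded rxDataStep call follows the description in this section rather than a tested SQL Server setup.

```r
# Hypothetical in-memory data; the article itself reads from SQL Server.
df <- data.frame(x = c(1, 2, 3))

# User-defined environment that holds everything the transform needs.
myEnv <- new.env()
assign("constant", 10, envir = myEnv)

scaleByConstant <- function(x) x * constant   # "constant" is a free variable

# Make myEnv the enclosing environment so lexical scoping finds "constant".
environment(scaleByConstant) <- myEnv
assign("scaleByConstant", scaleByConstant, envir = myEnv)

# Base R check: the free variable now resolves through myEnv.
scaled <- scaleByConstant(c(1, 2, 3))
print(scaled)

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  out <- RevoScaleR::rxDataStep(
    inData = df,
    transforms = list(scaled = scaleByConstant(x)),
    transformEnvir = myEnv
  )
  print(out)
}
```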

OUTPUT:

6. Using transformEnvir with transformFunc

When you use transformEnvir with transformFunc, the user-defined environment specified in transformEnvir is passed to the remote side. All variables and functions bound to this environment are available in both the calling and enclosing environments, so you do not need to set the enclosing environment of the R transformation function in your script.

OUTPUT:

Summary

transformFunc is the recommended way to do variable transformation; for details on how to use it, see rxTransform. transforms can also be used for variable transformation, but the two arguments differ in how R scoping and environments apply and in where they are evaluated.

REFERENCE

Lexical Scope and Function Closures in R
Environments
rxDataStep
rxTransforms
