rxExecBy is designed to solve the problem where a user has a very big data set, wants to partition it into many small partitions, and train a model on each partition. This is what we call "small data, many models". rxExecBy has many features and can run in many different compute contexts, e.g. RxSpark, RxInSqlServer and local. In this blog, I'm going to cover the RxSpark compute context and help you understand the details.
In RxSpark, rxExecBy supports RxTextData, RxXdfData (composite set), RxHiveData, RxParquetData and RxOrcData. I'm going to use AirlineDemoSmall.csv from the RevoScaleR package and convert it into all of these supported data sources.
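The snippet below is a minimal sketch of setting up these data sources, assuming a Spark compute context and that AirlineDemoSmall.csv has already been copied to HDFS; the /share paths and the Hive table name are placeholders for your own cluster.

```r
library(RevoScaleR)

# Connect to Spark and make it the current compute context.
cc <- rxSparkConnect(reset = TRUE)

hdfsFS <- RxHdfsFileSystem()

# Source CSV (path is an assumption, adjust to your cluster).
textData <- RxTextData("/share/AirlineDemoSmall/AirlineDemoSmall.csv",
                       firstRowIsColNames = TRUE, fileSystem = hdfsFS)

# Target data sources for the conversions.
xdfData     <- RxXdfData("/share/AirlineDemoSmallXdf", fileSystem = hdfsFS)       # composite XDF
parquetData <- RxParquetData("/share/AirlineDemoSmallParquet", fileSystem = hdfsFS)
orcData     <- RxOrcData("/share/AirlineDemoSmallOrc", fileSystem = hdfsFS)
hiveData    <- RxHiveData(table = "airlinedemosmall")                             # assumes the Hive table exists

# One-time conversions from the CSV into the other formats.
rxDataStep(inData = textData, outFile = xdfData, overwrite = TRUE)
rxDataStep(inData = textData, outFile = parquetData, overwrite = TRUE)
rxDataStep(inData = textData, outFile = orcData, overwrite = TRUE)
```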
The first and most important feature of rxExecBy is partitioning. Given a data source, rxExecBy can partition the data by a single key as well as by multiple keys. To do this, simply put the variable names into a vector and pass it to the rxExecBy argument "keys".
rxExecBy returns a list of results from the partitions; the length of the list equals the number of partitions. Each partition result is itself a list of "keys", "result" and "status".
Take the single key c("DayOfWeek") as an example. The key has 7 values, so the returned value is a list of 7 objects, with each object being a list of 3.
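A minimal sketch of the single-key case, using a hypothetical UDF .Counts that simply counts the rows of each partition (textData is the RxTextData source created above):

```r
# Hypothetical UDF: count the rows of one partition.
.Counts <- function(keys, data) {
    nrow(rxImport(data))
}

res <- rxExecBy(inData = textData, keys = c("DayOfWeek"), func = .Counts)

length(res)        # 7 partitions, one per DayOfWeek value
names(res[[1]])    # the three components: keys, result and status
res[[1]]$keys      # the DayOfWeek value of this partition
res[[1]]$result    # the row count returned by the UDF
```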
In the multi-key c("DayOfWeek", "ArrDelay") example, the length of the returned value shows a total of 3233 partitions. From the result of the first partition, we can see that "keys" is a list of 2.
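The same sketch with two keys; each partition is now identified by a (DayOfWeek, ArrDelay) pair:

```r
res2 <- rxExecBy(inData = textData, keys = c("DayOfWeek", "ArrDelay"), func = .Counts)

length(res2)       # number of (DayOfWeek, ArrDelay) combinations, 3233 here
res2[[1]]$keys     # a list of 2: the DayOfWeek value and the ArrDelay value
```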
Another very handy feature of the partition result is "status", which is a list of 3.
The example below shows how "status" collects warning and error messages.
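A sketch with a hypothetical UDF .WarnFunc that deliberately raises a warning, so the message shows up in the partition's "status":

```r
# Hypothetical UDF that raises a warning before returning its result.
.WarnFunc <- function(keys, data) {
    warning("this partition looks suspicious")
    nrow(rxImport(data))
}

res3 <- rxExecBy(inData = textData, keys = c("DayOfWeek"), func = .WarnFunc)

res3[[1]]$status   # a list of 3; warning and error messages are collected here
```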
The UDF is the R function provided by the user to apply on each partition. It takes "keys" and "data" as two required input parameters, where "keys" contains the partitioning values and "data" is a data source object for the corresponding partition.
In the RxSpark compute context, "data" is an RxXdfData data source.
The example below shows how easily rx functions can consume the RxXdfData object directly.
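A minimal sketch, assuming the AirlineDemoSmall columns ArrDelay and CRSDepTime, where the UDF hands the partition's RxXdfData object straight to rxLinMod:

```r
# Fit a linear model per partition by passing the RxXdfData object directly to rxLinMod.
.LinModPerKey <- function(keys, data) {
    rxLinMod(ArrDelay ~ CRSDepTime, data = data)
}

res4 <- rxExecBy(inData = xdfData, keys = c("DayOfWeek"), func = .LinModPerKey)

res4[[1]]$result$coefficients   # coefficients of the model fitted on the first partition
```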
Besides the two required parameters, the UDF can also take additional parameters, which allows the user to pass arbitrary values into the UDF. This is done through funcParams; here is an example.
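A sketch with a hypothetical extra parameter minRows, supplied through funcParams:

```r
# Hypothetical UDF with an extra parameter supplied via funcParams.
.CountIfEnough <- function(keys, data, minRows) {
    n <- nrow(rxImport(data))
    if (n < minRows) NA else n
}

res5 <- rxExecBy(inData = textData,
                 keys = c("DayOfWeek"),
                 func = .CountIfEnough,
                 funcParams = list(minRows = 100))
```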
Factor and colInfo are also well supported by rxExecBy; however, you need to reference the R help documents to check what is supported for each data source.
The example below shows how to read a Hive data source and convert a string column into a factor. It also defines the factor levels, so any string value of that column that doesn't appear in the level list will be treated as a missing value.
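A sketch of that Hive example, assuming a Hive table named airlinedemosmall with a string DayOfWeek column (table and column names are placeholders); it reuses the .Counts UDF from above:

```r
# Declare DayOfWeek as a factor with explicit levels; values outside the list become missing.
hiveColInfo <- list(
    DayOfWeek = list(
        type   = "factor",
        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday")))

hiveData <- RxHiveData(table = "airlinedemosmall", colInfo = hiveColInfo)

res6 <- rxExecBy(inData = hiveData, keys = c("DayOfWeek"), func = .Counts)
```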