rxExecBy – Productivity and scale with partitioned data

There is often a need to train “many small models” instead of a “single big model”. Specifically, users may want to train separate models, such as logistic regressions or boosted trees, within groups (partitions) like “state”, “country”, or “device id”, or they may want to compute summary statistics such as mean, min, max,…
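
As a minimal sketch of this pattern, the call below partitions the built-in airquality data set by Month and computes one summary statistic per partition with rxExecBy; the key column and user function are placeholders for whatever grouping and model a real workload uses.

```r
library(RevoScaleR)

# One summary per partition: mean Ozone by Month (illustrative data and key).
meanOzone <- function(keys, data) {
    df <- rxImport(inData = data)   # materialize this partition as a data frame
    mean(df$Ozone, na.rm = TRUE)
}

results <- rxExecBy(inData = airquality, keys = c("Month"), func = meanOzone)
```

Each element of the returned list carries the partition's keys, the result of the user function, and a status.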


Performance: rxExecBy vs gapply on Spark

rxExecBy is a new API in R Server release 9.1: it partitions the input data source by keys and applies a user-defined function to each partition. gapply is a SparkR API that provides similar functionality: it groups a SparkDataFrame by the specified columns and applies an R function to each group. Prepare Environment: The performance…
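
The rxExecBy side of this comparison looks like the sketch shown under the previous entry; a roughly equivalent SparkR gapply call, assuming a running Spark session and the same toy data, might look like the following (the schema and statistic are illustrative only):

```r
library(SparkR)
sparkR.session()                         # assumes Spark is available locally or on a cluster

sparkDf <- createDataFrame(airquality)   # same toy data as the rxExecBy sketch above

out <- gapply(
    sparkDf, "Month",
    function(key, x) data.frame(key, mean(x$Ozone, na.rm = TRUE)),
    structType(structField("Month", "integer"),
               structField("meanOzone", "double"))
)
head(collect(out))
```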


rxExecBy Insights on RxSpark Compute Context

rxExecBy is designed to solve the problem where a user has a very large data set, wants to partition it into many small partitions, and train a model on each partition. This is what we call “small data, many models”. rxExecBy has many features and can run in many different compute contexts, e.g. RxSpark, RxInSqlServer, local. In this blog, I’m going…
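
A rough outline of running rxExecBy in an RxSpark compute context; the HDFS path, key column, and model formula below are all hypothetical placeholders:

```r
library(RevoScaleR)

# Switch to a Spark compute context; defaults assume we are on the cluster's edge node.
rxSetComputeContext(RxSpark())

# Input data stored in HDFS (path and columns are illustrative).
hdfs  <- RxHdfsFileSystem()
inXdf <- RxXdfData("/user/RevoShare/demo/partitionedData", fileSystem = hdfs)

# Train one "small model" per partition on the cluster.
trainOne <- function(keys, data) {
    rxLogit(churned ~ usage + tenure, data = data)   # hypothetical formula
}
models <- rxExecBy(inData = inXdf, keys = c("deviceId"), func = trainOne)

rxSetComputeContext("local")   # switch back when finished
```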


Stratified Splitting using rxExecBy

In Microsoft R Server 9.1, we have a new function called rxExecBy(), which can be used to partition an input data source by keys and apply a user-defined function to individual partitions. You can read more about rxExecBy() here: Pleasingly Parallel using rxExecBy. In this article, we will look at how to use rxExecBy to…
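
A minimal sketch of stratified splitting with rxExecBy, assuming a 75/25 train/test split within each stratum; the data set, key column, and ratio are illustrative only:

```r
library(RevoScaleR)

# Split each stratum (here: Species) with the same proportion, so the overall
# split preserves the class distribution across strata.
splitOne <- function(keys, data) {
    df  <- rxImport(inData = data)
    idx <- sample(nrow(df), size = floor(0.75 * nrow(df)))
    list(train = df[idx, ], test = df[-idx, ])
}

splits <- rxExecBy(inData = iris, keys = c("Species"), func = splitOne)
```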


Running Pleasingly Parallel workloads using rxExecBy on Spark, SQL, Local and Localpar compute contexts

The RevoScaleR function rxExec() allows you to run arbitrary R functions in a distributed fashion, using available nodes (computers) or available cores (at most, the sum over all available nodes of the processing cores on each node). The rxExec approach exemplifies the traditional high-performance computing approach: when using rxExec, you largely control how…
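
For contrast with rxExecBy, a minimal rxExec sketch: the same function is simply run many times in parallel rather than once per data partition; the dice simulation is a placeholder workload.

```r
library(RevoScaleR)

# Pleasingly parallel: run an arbitrary R function many times,
# spreading the runs across available nodes/cores.
playDice <- function() {
    roll <- sum(sample(1:6, 2, replace = TRUE))
    if (roll %in% c(7, 11)) "Win" else "Loss"   # toy rule for illustration
}

outcomes <- rxExec(playDice, timesToRun = 10000)
table(unlist(outcomes))
```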