Troubleshooting “Error 1000: RPackage library exception: Failed to convert RObject to DataSet” when running R scripts in Azure ML

I’ve recently started experimenting with Azure ML and while I find this technology – which currently is still in preview – amazing, it’s also clear that the standard modules don’t natively support a lot of scenarios. Fortunately Azure ML lets you embed R scripts as part of experiments. This way you can overcome most limitations. In fact, R is so powerful that one way of looking at Azure ML is as a way to easily deploy an R solution in the cloud!

As I started experimenting with R scripts, I frequently encountered scripts which ran well outside of Azure ML – in RStudio – but failed inside Azure ML with the following error:

Error 1000: RPackage library exception: Failed to convert RObject to DataSet

As an Azure ML and R newbie, it took me some time to get to the bottom of this. The error occurs when you pass data from R back to Azure ML. If you run into this error, try commenting out the line that calls the maml.mapOutputPort function, which is usually the last line in the R script, and run the experiment again. If the error no longer occurs, this confirms the suspicion. Of course, without calling maml.mapOutputPort you won’t get any data back to Azure ML, so you need to add that line back and fix the actual problem.
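Before adding the maml.mapOutputPort call back, it can help to look at what the data frame actually contains. The following is a minimal sketch in base R; describe_columns is a hypothetical helper of my own naming, not part of Azure ML, and the example data is made up.

```r
# Hypothetical helper (not an Azure ML function): summarise each column of a
# data frame so that suspicious columns stand out before the data is returned.
describe_columns <- function(df) {
  data.frame(
    column = names(df),
    # First class of each column, e.g. "POSIXct" or "numeric"
    class = unname(vapply(df, function(col) class(col)[1], character(1))),
    # List columns hold heterogeneous objects rather than typed values
    is_list = unname(vapply(df, is.list, logical(1))),
    # TRUE for every repeated column name
    duplicated_name = duplicated(names(df)),
    stringsAsFactors = FALSE
  )
}

# Made-up example data with one date column and one numeric column
example <- data.frame(Date = as.POSIXct("2014-01-01", tz = "UTC"), Value = 1.5)
report <- describe_columns(example)
print(report)
```

Running this on the data frame you intend to return gives a quick view of column classes, list columns and repeated names, which are the things discussed in the rest of this post.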

I encountered at least three situations which can cause this error:

  1. Limited date/time support in Azure ML
  2. Duplicate column names
  3. You have a bug in your code which does not surface in RStudio

I’ll describe them in more detail below.

Limited date/time support in Azure ML

Because Azure ML does not natively support learning models that deal with time series data, this is an important scenario where R is very useful. For example, if you want to build a model which predicts the gold price based on historic gold prices, you need to resort to R, which has powerful time series analysis and forecasting capabilities. R has several classes which deal with dates and times, including Date, POSIXct and POSIXlt. One of the problems, according to Mike Lanzetta, is that Azure ML does not yet support parsing of POSIXlt values in the result set. I have not yet been able to verify this myself, but it seems a good idea to stay away from POSIXlt dates and use Date and POSIXct instead.
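As an illustration of staying away from POSIXlt, here is a minimal base R sketch. strptime() returns POSIXlt, so the result is converted with as.POSIXct() before it goes into the data frame. The dates and prices are made up, and whether this is needed in Azure ML is the unverified assumption described above.

```r
# strptime() returns a POSIXlt object
parsed <- strptime(c("2014-06-01", "2014-06-02"), format = "%Y-%m-%d", tz = "UTC")
print(class(parsed))

# Convert explicitly to POSIXct before building the data frame to return
dates <- as.POSIXct(parsed)

# Made-up values, for illustration only
data.set <- data.frame(Date = dates, Value = c(1250.5, 1244.0))
print(class(data.set$Date))
```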

Another problem is that date and time values don’t always seem to convert well between different types of variables. I could only reproduce this with date/time values which were passed to R through an input port in Azure ML. Therefore I suspect there may be an issue in the way Azure ML initializes date/time values going into the R script component.

For example, let’s assume that your data contains two columns, one called Date which is a DateTime field and gets passed as a vector of POSIXct into the script, and the other called Value which is a vector of numeric values (see Loading daily gold prices into Azure ML for a quick way to implement this data source).

Let’s say we want to perform time series analysis using the zoo class. We’ll leave out all the code that does the actual time series analysis and just focus on the conversions. The R script is shown below:

When we create the zoo time series variable called z in line 5, we provide the prices and the dates. The dates variable is a vector of POSIXct. In line 6 we load the dates from the z variable into a new vector called t, using the time() function. You’d expect t to be a vector of POSIXct, since that was the type of the original dates vector, and when you run the code in RStudio this is the case. However, when running this code in Azure ML, t becomes a numeric vector. Somehow the time() function no longer sees this data as POSIXct and instead returns numeric values representing the number of seconds since 1/1/1970.
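If you do end up with numeric seconds since 1/1/1970, base R can turn them back into POSIXct. The sketch below simulates the situation with as.numeric() instead of zoo, so it runs anywhere; it is one possible approach and not necessarily the work-around described further down. The UTC time zone is an assumption.

```r
# Made-up dates standing in for the Date column
dates <- as.POSIXct(c("2014-06-01", "2014-06-02"), tz = "UTC")

# Simulate the numeric seconds-since-epoch values described above
t_numeric <- as.numeric(dates)

# Convert the numeric seconds back to POSIXct (time zone assumed to be UTC)
t_restored <- as.POSIXct(t_numeric, origin = "1970-01-01", tz = "UTC")
print(t_restored)
```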

Note: this sample does not generate Error 1000, but it does lead to a state where this could easily happen if you change the code a bit. However, I no longer have the version of the code which generated that error message.

The work-around I found was to add two extra lines of code after line 3. The updated script below now also works in Azure ML:

If you know a better solution, let me know!

Duplicate column names

Another case I encountered which caused Error 1000 was when multiple columns have the same name. It’s easy to understand that Azure ML does not like duplicate column names and throws an error.
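A minimal base R sketch for detecting and repairing duplicate column names before returning the data: anyDuplicated() reports whether a name is repeated, and make.unique() appends a numeric suffix to the repeats. The column names and values are made up.

```r
# Made-up data frame that ends up with two columns called "Value"
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("Value", "Value")

# anyDuplicated() returns 0 when all names are unique,
# otherwise the index of the first repeated name
first_duplicate <- anyDuplicated(names(df))

# make.unique() renames repeats by appending ".1", ".2", ...
names(df) <- make.unique(names(df))
print(names(df))
```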

You have a bug in your code which does not surface in RStudio

Consider the following R script:

This R script does not require any input data. It creates a vector called test and then two variables called v1 and v2 which will hold the classes of all elements in the vector. Then it concatenates v1 and v2 using the rbind() function and returns that to the output port. If you run this in RStudio (except for the last line) and then inspect the data.set variable, all looks as expected and this is the result:

         X1      X2      X3
v1 numeric numeric numeric
v2 numeric numeric numeric

However, when you run this in Azure ML, you’ll get “Error 1000: RPackage library exception: Failed to convert RObject to DataSet”.

The code has a bug. The rbind() function is intended to concatenate vector and matrix variables, but the output of the lapply() function is a list variable which can hold heterogeneous objects. Azure ML does not like heterogeneous objects because you can’t map the values to columns with a specific type. (I would have liked R to throw an error when I used rbind(); I’m not sure why it didn’t.)
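The problem can be made visible in plain R. This is a minimal sketch consistent with the description above, not the original listing: rbind() on two lists silently produces a matrix whose cells are list elements, and that is what ends up in the data frame. The maml.mapOutputPort call is left out so the snippet runs outside Azure ML.

```r
test <- c(1, 2, 3)

# lapply() always returns a list, here one class name per element
v1 <- lapply(test, class)
v2 <- lapply(test, class)

# rbind() on lists does not fail; it builds a matrix of mode "list"
combined <- rbind(v1, v2)
print(is.matrix(combined))
print(typeof(combined))

# The data frame prints normally, but str() shows what the columns really hold
data.set <- data.frame(combined)
str(data.set)
```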

One solution is to cast the output of lapply() to a text string, as follows:
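A minimal sketch of such a cast, assuming as.character() is applied to the lapply() output (the exact call in the original listing may differ). With character vectors, rbind() produces an ordinary character matrix and every data frame column has a single type. The maml.mapOutputPort call is again left out so the snippet runs outside Azure ML.

```r
test <- c(1, 2, 3)

# Cast the list returned by lapply() to a plain character vector
v1 <- as.character(lapply(test, class))
v2 <- as.character(lapply(test, class))

# rbind() now yields a character matrix instead of a list matrix
combined <- rbind(v1, v2)

# stringsAsFactors = FALSE keeps the columns as character in older R versions too
data.set <- data.frame(combined, stringsAsFactors = FALSE)
str(data.set)
```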

Now it also works in Azure ML!
