Using MRS with Teradata: Limitations of Teradata's API and Temporary Constant Columns

Overview

  • Microsoft R Server is capable of In-Database Analytics with Teradata, this is achieved by leveraging the Teradata Stored Procedure and Table Operator APIs. Because of limitations in the APIs, when using Microsoft R Server with Teradata, special considerations must be made regarding formatting and encoding.
  • One such consideration is naming temporary constant columns. When running queries in Teradata, there is a mechanism which allows users to select a column and assign an alias to it, or to fill all rows in a column with static data. This is achieved using the “AS” keyword.

The "AS" keyword in Teradata

  • For example, assume we have a table “Employees” in the Database “HRInfo”, with Unique Primary Index “EmployeeID”, and Employee information as follows:

  •  EmployeeID NUMERIC(5) NOT NULL,
    EmployeeName VARCHAR(10),
    EmployeeManagerEmployeeID NUMERIC(5),
    DepartmentID NUMERIC(5),
    Gendar CHAR(1),
    Birthday DATE,
    JobGrade NUMERIC(1),
    Salary NUMERIC(10,2)
    
  • Then, assume that we want to extract a table with EmployeeName and Birth Month and Day to create a “Happy Birthday” mailing list, but obscure the birth year for privacy reasons, We would run this query:

  •  DATABASE HRInfo;
    SELECT EmployeeName,
           EXTRACT(DAY FROM Birthday) AS “Day”,
           EXTRACT(MONTH FROM Birthday) AS “Month”
           FROM Employees;
    
  • Which will return a table with columns: “EmployeeName”, “Day”, “Month”, where “Day” and “Month” are Temporary Constant Columns, that is, if the returned table is not written to a table in the database, the columns are not stored anywhere.

  • Another use of this mechanism would be to change the name of a column mid-query. If we wanted to return a table with salaries, for example, but obscure that what we are showing is dollar values, we could write a query like:

  •  DATABASE HRInfo;
    SELECT Salary as “Value”
           FROM Employees;
    
  • This will return a one column table with the name “Value” and the values from the Salary column.

Using "AS" in Microsoft R Server

  • In Microsoft R Server, data can be extracted directly from a database using Teradata queries wrapped by a call to RxTeradata(). Let’s assume we wish to use the results of the above birthday query as input data to some MRS, we would do something like:

  •  inputData <- RxTeradata( sqlQuery = “SELECT EmployeeName, EXTRACT(DAY FROM Birthday) AS “Day”, EXTRACT(MONTH FROM Birthday) AS “Month” FROM Employees;” )
    
  • if we then use “inputData” within rxSummary() to return the data to the R client, we would observe that the returned data frame has variable names “EmployeeName”, “D”, “M”. This is obviously incorrect.

Why are our names getting truncated?

  • This is because of how Teradata APIs treat column names. Internally to Teradata, column names are a VARCHAR datatype, which then internally is a null terminated character array.
  • The issue occurs when a column name is returned from a query, it is 1 byte per character with a null terminator at the end of the name; alternatively, when a temporary constant column name exists in a query, such as our aliases above, the database reserves 2 bytes per character for special character support.
  • If the name assigned in the query uses only a ASCII/UTF-8 character set, every other byte returned will be NULL. The result is early termination of the string and only the first character of each alias being assigned to the final temporary constant column name.

How Do We Fix It?

  • There are two workarounds for this limitation in the Teradata API, which can be chosen depending on user need, each with Pros and Cons:

    • The simple and clean workaround is modifying the connection string. in MRS, to connect to Teradata, a valid Teradata connection string must be supplied to the compute context cluster specifications. A standard Teradata connection string would be:

      connectionString = "Driver=Teradata;DBCNAME=<Database Network Hostname>;Uid=<Database Username>;pwd=<Database User’s Password>;"
      

      Another accepted parameter to connection strings is “CHARSET”. If we add “CHARSET=ASCII” to our connection string, it would look like:

      connectionString = "Driver=Teradata;DBCNAME=<Database Network Hostname>;Uid=<Database Username>;pwd=<Database User’s Password>;CHARSET=ASCII; "
      

      The result being that all character data passed through the connection will be encoded as ASCII, and forcing temporary constant column names to be 1 byte per character.

      The drawback of this is if queries or data in tables use UTF or other special character encodings, the data encoding will be incorrect on all queries, as it is forced to ASCII. This could possibly cause queries to fail or data returned to be incorrect/malformed.

      This workaround is a good option if users use the aliasing mechanism often, and do not use special characters in queries or table data.

    • The second work around works on a single query, and would be cumbersome to apply if the aliasing mechanism is used often, but required if data or queries use special characters.Taking our example query above:

      inputData <- RxTeradata( sqlQuery = “SELECT EmployeeName, EXTRACT(DAY FROM Birthday) AS “Day”, EXTRACT(MONTH FROM Birthday) AS “Month” FROM Employees;” )
      

      To fix this query specifically, we can use Teradata’s “TRANSLATE” keyword, which will change the encoding for a string literal, or other VARCHAR data, in a query, and fix the double byte issue. Applying TRANSLATE to our example, the resulting query would be:

      inputData <- RxTeradata( sqlQuery = “SELECT EmployeeName, EXTRACT(DAY FROM Birthday) AS TRANSLATE( “Day” USING UNICODE TO LATIN) , EXTRACT(MONTH FROM Birthday) AS TRANSLATE( “Month” USING UNICODE TO LATIN)  FROM Employees;” )
      

      The resultant table will have the appropriate column names “Day” and “Month”, and no modifications will be made to any other character encodings during this session.

Summary

  • Workaround #1 would be preferred in most cases, it simplifies queries and keeps standard data encoding across all data passed to and from the database.
  • Workaround #2 is useful is aliasing is done rarely, or if the data which is being operated on contains special characters.