DB2 Code Page Fun

A while back I had a customer who was importing data from AS/400 DB2 tables into SQL Server, using SSIS, and we came upon an interesting error. The below describes how to actually reproduce the issue, and the fix. The fix is currently in all versions of our DB2 Data Provider at this point (assuming you have the latest of the stand-alone versions, or latest QFEs for HIS 2004). In the customers case, they were using the standalone provider that is part of the SQL Server Feature Pack.

The customer has very large tables, with all their CHAR() columns using CCSID 65535. A CCSID of 65535 means the data is stored on the AS400 in binary format, and requires ‘no conversion’. Almost true, if you only want the actual hex values.

 

Also, about half of the customer’s data was binary English, but the rest was binary Japanese, which complicates things a bit.

 

So, let’s begin. First, we need to create a table to show how things behave. To do so, you need 2 columns using CCSID 939 (Japanese), 2 columns using CCSID 65535 (binary), and one column using CCSID 37 (English).

 

Create a linked server to the AS400 using English (CCSID 37, PC Code page 1252). Call this linked server CPENGLISH.

 

Now, we are going to create the table. Run the following 2 scripts (be sure to change the schema name to yours).

 

exec('

CREATE TABLE <schema>.CODEPAGEFUN (

JCHAR939 CHAR(40) CCSID 939 DEFAULT NULL ,

JCHAR65535 CHAR(40) CCSID 65535 DEFAULT NULL ,

ECHAR37 CHAR(40) CCSID 37 DEFAULT NULL ,

ECHAR939 CHAR(40) CCSID 939 DEFAULT NULL ,

ECHAR65535 CHAR(40) CCSID 65535 DEFAULT NULL )

') at CPENGLISH

Exec('

LABEL ON COLUMN <schema>.CODEPAGEFUN

( JCHAR939 IS ''Japanese 939 CP '' ,

JCHAR65535 IS ''Japanese 65535 CP '' ,

ECHAR37 IS ''English 37 CP '' ,

ECHAR939 IS ''English 939 CP '' ,

ECHAR65535 IS ''English 65535 CP '' )

') at CPENGLISH

 

Now we need to insert some data into the database. Since we are dealing with Japanese data, you can not use English to insert Japanese data (humor me) so create a new linked server using CCSID 939, PC Code Page 932. Call this one CPJAPANESE.

 

Then, run the following script:

 

DECLARE @var1 nchar(40);

DECLARE @var2 nchar(40);

set @var1 = N'チャールズ';

set @var2 = N'Charles';

EXEC('Insert into <schema>.CODEPAGEFUN

(JCHAR939, JCHAR65535, ECHAR37, ECHAR939, ECHAR65535)

Values (?,?,?,?,?)',@var1,@var1,@var2,@var2,@var2)

at CPJAPANESE

 

The above script inserted Japanese into the first 2 columns, then English into the last 3 columns.

The JCHAR65535 column contains the hex code for “チャールズ”, which translated to English is Charles (hopefully :) ).

Now, what is the issue you ask? This table contains multiple columns, with each column using different character sets. This is not common, but I’ve seen this in a few customer tables, and 2 of the columns are using CCSID 65535.

So, what happens in SQL if you try to run a select statement against this table?

First, let’s use the English code page (the host is English, the Windows system is English, so why not, right?). Since we already have our CPENGLISH linked server setup, let’s start there.

EXEC('SELECT * FROM <schema>.CODEPAGEFUN') at CPENGLISH

This query returns:

JCHAR939 JCHAR65535 ECHAR37 ECHAR939 ECHAR65535

       0x0E43924353435843AF43C70F40404040404040404040404040404040404040404040404040404040 Charles 0xC38881999385A2404040404040404040404040404040404040404040404040404040404040404040

Well it appears only 3 columns seem to have come back with data – JCHAR65535, ECHAR37, ECHAR65535. What’s up with that?

Well, the codepages for 1252/37 do not contain the Japanese character set, so those can’t be translated. This seems simple enough, right. So, let’s change the provider string to use the Japanese linked server we setup earlier:

EXEC('SELECT * FROM <schema>.CODEPAGEFUN') at CPJAPANESE

This query returns:

JCHAR939 JCHAR65535 ECHAR37 ECHAR939 ECHAR65535

ƒ`ƒƒ?[ƒ‹ƒY 0x0E43924353435843AF43C70F40404040404040404040404040404040404040404040404040404040 Charles Charles 0xC38881999385A2404040404040404040404040404040404040404040404040404040404040404040

<Note that the Japanese character set does not display properly in SQL Query Analyzer’s result window>

Well, better, all the columns came back this time. Also note that columns ECHAR37 and ECHAR939 both return “Charles ”. It just so happens that the 939/932 code pages are a MBCS (multi-byte character set), and contain both DBCS for Japanese, and the SBCS code page for English. Thus, you can use this combination to convert Code Page 37 to Code Page 932 and not lose any data.

But, we still can’t actually ‘read’ the 65535 data. Our data provider has an option to convert Binary to Character, so let us try this and see what happens. Create a new linked server, with “Process Binary as Character=True” set (default is false). Call this one CPJAPBIN.

New Query using the new linked server would be:

EXEC('SELECT * FROM <schema>.CODEPAGEFUN') at CPJAPBIN

And the output:

JCHAR939 JCHAR65535 ECHAR37 ECHAR939 ECHAR65535

ƒ`ƒƒ?[ƒ‹ƒY Charles Charles                                

Well. That seems to have made things worse. The binary columns did not come out. You can actually verify this by running the following query, which would return 2 ‘empty’ columns:

EXEC('SELECT JCHAR65535, ECHAR65535 FROM <schema>.CODEPAGEFUN') at CPJAPBIN

JCHAR65535 ECHAR65535

      

Nothing appears to have come down. Those 2 columns ‘should’ have been converted to character data but were not. To make matters worse no errors are reported back.

So, let’s try something different.

If you were attempting to import this table via SSIS into SQL with settings similar to the CPJAPBIN linked server, you WOULD get errors with Binary as Character set to true.

[Source - CODEPAGEFUN [1]] Error: There was an error with output column "JCHAR65535" (23) on output "OLE DB Source Output" (11). The column status returned was: "The value could not be converted because of a potential loss of data.”

[Source - CODEPAGEFUN [1]] Error: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "output column "JCHAR65535" (23)" failed because error code 0xC0209072 occurred, and the error row disposition on "output column "JCHAR65535" (23)" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.

Basically, SQL can not transform the column properly, because it is DBCS data. Apparently after the transformed data is converted from Binary to Char, and from EBCDIC to ASCII, there is no guarantee that the you would have 40 characters, so it gives up.

There is a ‘reason’ for this, in a sense, and you have to be very careful when using some DBCS languages. In some character sets, an EBCDIC DBCS value may be represented by one ‘character’. When converted from EBCDIC to ASCII, the ASCII representation may be 2 characters in length. The reverse is also true. I’ve seen this with Arabic, and it can happen for other BIDI languages:

· The Arabic Lam Alef is one character in 20420 while it is two in 1256

· Final Seen, Sheen, Sadh and Dhadh are two characters in 20420 while they are one in 1256

With MBCS, you also have SO/SI characters that are part of the data that will be transferred.

So, after talking with the development team about this, we came up with a solution to help convert the columns as character. A new parameter was added to the data source string which ‘forces’ the conversion to take place in a separate manner than previous done (this is mainly a problem for DBCS languages). This additinoal parameter is not available via the “Data Access Tool”, but can be accessed via the ALL tab on the “Data Link Properties”. The property is “Binary Codepage”, and only works when “Process Binary as Character” is set to “True”.

In the data provider string, we need to add this new parameter in. The easiest way to do this is as follows.

1) Using the DAT (Data Access Tool), create a new UDL similar to the linked server CPJAPBIN, and save this as CPJAPBINSET.

2) The DAT‘s default directory for storing UDLS is “C:\Documents and Settings\<username>\My Documents\Host Integration Projects\Data Sources”, so open this directory.

3) Double click the CPJAPBINSET.udl file. This will bring up the “Data Link Properties” window.

4) Under the “Advanced” tab, verify that the Host CCSID is 939, the PC code page is 932, and Process Binary as character is checked. Then go to the “All” tab.

5) Find the “Binary Codepage” setting. Double click this and set this to 939. Click “OK” and save the UDL file.

6) Now open the UDL file up in notepad, and copy the init string. It should be similar to this:

Provider=DB2OLEDB;Password=password;Persist Security Info=True;User ID=USERID;Initial Catalog=S105FRDM;Defer Prepare=False;Binary Codepage=939;Derive Parameters=False;Rowset Cache Size=0;Network Transport Library=TCP;Host CCSID=939;PC Code Page=932;Network Address=S105FRDM;Network Port=446;Package Collection=CHARLIEE2;Default Schema=CHARLIEE2;DBMS Platform=DB2/AS400;Process Binary as Character=True;Connection Pooling=False;Units of Work=RUW

Now, create a new linked server using the init string from above, and name this linked server CPJAPBINTRUE.

Now, using the new linked server, run the same query again:

EXEC('SELECT * FROM <schema>.CODEPAGEFUN') at CPJAPBINTRUE

This should return:

JCHAR939 JCHAR65535 ECHAR37 ECHAR939 ECHAR65535

ƒ`ƒƒ?[ƒ‹ƒY ƒ`ƒƒ?[ƒ‹ƒY Charles Charles Charles

Success!!!

And SSIS should now work without errors when “Binary Codepage =939” is set.

In this particular customer’s case, their tables columns contained mixed Japanese and English.

For an example, run the following script:

declare @counter int

DECLARE @var1 nchar(40)

DECLARE @var2 nchar(40)

set @counter = 0

while @counter < 5

begin

      set @counter = @counter + 1

      print @counter

      if (@counter = 1)

      begin

            set @var1 = N'マイクロソフト'

            set @var2 = N'Microsoft'

      end

      if @counter = 2

      begin

            set @var1 = N'ホスト統合サーバー'

            set @var2 = N'Host Integration Server'

      end

      if @counter = 3

      begin

            set @var1 = N'ノースカロライナ州'

            set @var2 = N'North Carolina'

      end

      if @counter = 4

      begin

            set @var1 = N'Washington'

            set @var2 = N'Washington'

      end

      if @counter = 5

      begin

            set @var1 = N'Texas'

            set @var2 = N'Texas'

      end

      begin

            print @var1

            print @var2

            EXEC('Insert into <schema>.CODEPAGEFUN

                  (JCHAR939, JCHAR65535, ECHAR37, ECHAR939, ECHAR65535)

                  Values (?,?,?,?,?)',@var1,@var1,@var2,@var2,@var2)

                  at CPJAPANESE

      end

     

end

 

Now, after running the following select query:

EXEC('SELECT JCHAR939, JCHAR65535 from <schema>.CODEPAGEFUN') at CPJAPBINTRUE

You should see the following returned (again note that Query Analyzer's output does not display the Japanese Character set properly here, but the binary bits are correct):

JCHAR939 JCHAR65535

ƒ`ƒƒ?[ƒ‹ƒY ƒ`ƒƒ?[ƒ‹ƒY

ƒ}ƒCƒNƒ?ƒ\ƒtƒg ƒ}ƒCƒNƒ?ƒ\ƒtƒg

ƒzƒXƒg“??‡ƒT?[ƒo?[ ƒzƒXƒg“??‡ƒT?[ƒo?[

ƒm?[ƒXƒJƒ?ƒ‰ƒCƒi?B ƒm?[ƒXƒJƒ?ƒ‰ƒCƒi?B

Washington Washington

Texas Texas

 

This also shows that it is possible for a customer to have mixed languages in the same column, depending on the actual codepages they are using.

 

Note: Another solution to ‘force’ conversion of the binary strings is by telling DB2 to do the conversion by doing a CAST against the column. So, using the ‘Japanese” linked server we created originally, the following query would return everything as character data:

EXEC('SELECT

JCHAR939,

cast(JCHAR65535 as CHAR(40) CCSID 939) as JCHAR65535,

ECHAR37,

ECHAR939,

cast (ECHAR65535 as CHAR(40) CCSID 939) as ECHAR65535

from <schema>.CODEPAGEFUN') at CPJAPANESE

JCHAR939 JCHAR65535 ECHAR37 ECHAR939 ECHAR65535

ƒ`ƒƒ?[ƒ‹ƒY ƒ`ƒƒ?[ƒ‹ƒY Charles Charles Charles

ƒ}ƒCƒNƒ?ƒ\ƒtƒg ƒ}ƒCƒNƒ?ƒ\ƒtƒg Microsoft Microsoft Microsoft

ƒzƒXƒg“??‡ƒT?[ƒo?[ ƒzƒXƒg“??‡ƒT?[ƒo?[ Host Integration Server Host Integration Server Host Integration Server

ƒm?[ƒXƒJƒ?ƒ‰ƒCƒi?B ƒm?[ƒXƒJƒ?ƒ‰ƒCƒi?B North Carolina North Carolina North Carolina

Washington Washington Washington Washington Washington

Texas Texas Texas Texas Texas