What's special with the BULK provider when loading LOB character columns using OPENROWSET?

This is another problem I've been working on recently, whose resolution I felt could be interesting to share here.

Let's first reproduce the errors, before trying to explain anything. So, please, connect to your instance of SQL Server 2005 and run this:

CREATE DATABASE TestDatabase
GO
USE TestDatabase
GO
CREATE TABLE [dbo].[TestTable] ([TestColumn] nvarchar(max) NULL)
GO

Now, using the DataFile.dat and FormatFile.fmt files compressed in a single zip and attached to this post, run this statement and notice how it works with no error, and it inserts the data available in the .dat file, without any truncation or any error.

BULK INSERT dbo.TestTable FROM 'c:\DataFile.dat'
WITH
(
FORMATFILE = 'c:\FormatFile.fmt'
)

Finally, run the following statement and see how it doesn't work:

INSERT INTO dbo.TestTable
SELECT [TestColumn]
FROM OPENROWSET (BULK 'c:\DataFile.dat',
FORMATFILE = 'c:\FormatFile.fmt')
AS c([TestColumn])

instead of completing successfully, it throws the following errors:

Server: Msg 4866, Level 16, State 7, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.
Server: Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Server: Msg 7330, Level 16, State 1, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

You might also try to import the data using the bcp utility (%programfiles%\Microsoft SQL Server\90\Tools\Binn\bcp.exe) and you will also see that it works fine, as you would expect:

bcp TestDatabase.dbo.TestTable in c:\DataFile.dat -S.\yukon -T -f c:\FormatFile.fmt

My FormatFile.fmt looks like this:

9.0
1
1 SQLNVARCHAR 8 0 "~R~" 1 TestColumn SQL_Latin1_General_CP1_CI_AS

So, what's the reason why it only fails when using the OPENROWSET clause?

Well, it happens to be caused by the way our syntactic parser is implemented (and its relationship with the relational engine). Let me give you more details about it.

When you issue a BULK INSERT, at parse time, we know what's the name of the table/view where you plan to bulk insert your data into. So, we pass the name of that table to into the constructor of the class which takes care of all the bulk load functionality (i.e. parsing the format file, reading the data from the data file, converting the data if required, etc.) With that information (i.e. the name of the target object), the relational engine can infer the data type, length, scale, precission and the like, of the target columns (all that information is stored by the relational engine - sys.columns). This way, in our example, SQL can infer our target column TestTable.dbo.TestColumn is a wide char BLOB column (nvarchar(max)), and therefore, it assumes that what is being loaded from the data file is a blob and can exceed the limit of 8000 bytes.

With the bcp utility, it's a similar story, but it uses the bcp APIs implemented in the ODBC native driver available in our SQL Native Client library. But, in the end, a similar thing happens: it retrieves the metadata of the target object and based on that information, it accepts source strings that exceed the 8KB limit.

On the other hand, when you issue the INSERT INTO ... SELECT ... FROM OPENROWSET, and the syntactic parser is parsing the OPENROWSET, it doesn't even know, in that context, that the results of the OPENROWSET will be used as the input source of an INSERT, so it cannot infer any metadata from any target object, and it ends up assuming the max length is 8000 bytes (4000 characters for wide char data types like SQLNCHAR and 8000 characters for single byte characters like SQLCHAR).

So... Is there any way around this? Of course there is. :-)

You can specify in your format file max length column any value bigger than 8000 bytes, like this:

9.0
1
1 SQLNVARCHAR 8 8001 "~R~" 1 TestColumn SQL_Latin1_General_CP1_CI_AS

It's also worth reminding you that, in case you are still using the old-style format files, you should start familiarizing with the new XML format. Sooner than later, the old-style format files will be deprecated and will be replaced by this new format introduced with SQL Server 2005. Using this new format, the format file I used in my previous examples, would look like this:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="NCharPrefix" PREFIX_LENGTH="8" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="TestColumn" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>

And this one would be the variation which explicitly defines the max length attribute so that it works when used with OPENROWSET:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
<FIELD ID="1" xsi:type="NCharPrefix" MAX_LENGTH="8001" PREFIX_LENGTH="8" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
</RECORD>
<ROW>
<COLUMN SOURCE="1" NAME="TestColumn" xsi:type="SQLNVARCHAR"/>
</ROW>
</BCPFORMAT>

Hopes this saved you some time.

SampleFiles.zip