SQLSweet16!, Episode 10: "I can eat glass …", but can I load it into a database?

Sanjay Mishra

Reviewed By: Dimitri Furman, Murshed Zaman, Kun Cheng

 

If you have tried to use BULK INSERT or bcp utilities to load UTF-8 data into a table in SQL Server 2014 or in an earlier release (SQL Server 2008 or later), you have likely received the following error message:

 Msg 2775, Level 16, State 13, Line 14
The code page 65001 is not supported by the server.

The requirement to support UTF-8 data for these utilities has been extensively discussed on various forums, most notably on Connect.

This requirement has been addressed in SQL Server 2016 (and backported to SQL Server 2014 SP2). To test this, I obtained a UTF-8 dataset from https://www.columbia.edu/~fdc/utf8/. The dataset is translation of the sentence “I can eat glass and it doesn’t hurt me” in several languages. A few lines of sample data are shown here:

data1_utf8

(As an aside, it is entirely possible to load Unicode text such as above into SQL Server even without this improvement, as long as the source text file uses a Unicode encoding other than UTF-8.)
-- SQL Server 2014 SP1 or earlier

CREATE DATABASE DemoUTF8_2014
GO

USE DemoUTF8_2014
GO

CREATE TABLE Newdata
(
lang VARCHAR(200),
txt NVARCHAR(1000)
)
GO

BULK INSERT Newdata
FROM 'C:\UTF8_Test\i_can_eat_glass.txt'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR='\t', CODEPAGE='65001')
GO

Msg 2775, Level 16, State 13, Line 14
The code page 65001 is not supported by the server.

-- SQL Server 2016 RTM or SQL Server 2014 SP2 or later

CREATE DATABASE DemoUTF8_2016
GO

USE DemoUTF8_2016
GO

CREATE TABLE Newdata
(
lang VARCHAR(200),
txt NVARCHAR(1000)
)
GO

BULK INSERT Newdata
FROM 'C:\UTF8_Test\i_can_eat_glass.txt'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR='\t', CODEPAGE='65001')
GO

(150 row(s) affected)
SELECT * FROM Newdata
GO

data2_utf8
You can now use CODEPAGE='65001' with BULK INSERT, bcp and OPENROWSET utilities.

Note that this improvement is only scoped to input processing by bulk load utilities. Internally, SQL Server still uses the UCS-2 encoding when storing Unicode strings.