My recommendation of SQL Server’s Globalization Development

Article
02/24/2009

In this blog, I will give some my recommendation of SQL
Server's Globalization Development.

1.
Use nchar/nvarchar instead of char/varchar
type. Nchar/nvarchar type is SQL
Server's Unicode data type, which can store any characters defined by Unicode
Org. char/varchar type always associates
with a code page, so the number of supported characters is limited. I give you some fact to show why use Unicode
data type is recommendation:

o
Starting from Windows 2000, Windows have full
Unicode support implemented by using UTF-16 Encoding. Windows APIs take WCHAR* as input which
represent a Unicode String.

o
Virtual C++ has WCHAR, which is Unicode char
type.

o
.Net String is Unicode String only encoded in
UTF-16.

o
Java String is Unicode String only encoded in
UTF-16.

o
Char type might have data corruption issue if
your client locale is different with server locale.

So it is time to deprecated the old
code page technique, and use the Unicode in your application, and also in your
SQL Server database. The only benefit
of using char type is the space saving for single byte code page. You can always use the Data Compression
feature in SQL Server 2008 if you care about disk space, which can save more
space than using char type.

2.
Also remember to put N' for your string literal in T-SQL. String literal N'abc' is a Unicode nvarchar
string literal. Literal ‘ab©' is a varchar literal, it always associates
with a codepage (string literal always use current database's collation).
Suppose char © is not in the code
page, you will get a question mark (?) when inserting the literal into a
table, even the column is nvarchar
type.

3.
Use nvarchar instead of nchar. The different between nvarchar and
nchar is
the storage. A column with nchar(30) type
always take 60 bytes to store on the disk, even the value is a single
character. The data size for a nvarchar(30) type column
is not fixed, it varies row by row or
value by value. A single character value
takes 2 bytes to store, and a value with 30 characters long takes 60
bytes to be stored. Another different between nvarchar and nchar
type is the performance. Nchar types
always stored in fixed location in every row, which can be retrieved
faster
than nvarchar type which is stored in different location for different
row. However, I believe the benefit of
less stored space for nvarchar types usually overcomes the cost of
locating the
value in a row.

4. If possible, avoid using a column collation which is different with the database's collation. You will have
less collation conflict. If you want
a query use a special collation's sorting rule, use explicit collate clause in that query.

5. Use Windows Collation
instead of SQL Collation. The only
exception here is the default collation for en_US locale which is
sql_latin1_general_cp1_ci_as.

6.
Never store UTF-8 string in varchar types, you
will get data corruption.

7.
Keep in mind, string comparison always ignore
trailing spaces (Unicode U+0020), no matter what collation you are using.

8.
Keep in mind, LEN function always return the
number of characters exclude the trailing spaces. DataLength function returns the storage size in term of bytes.

9.
using _BIN2 collation, instead of _BIN collation if you want binary, code page
base string comparison behavior.

My recommendation of SQL Server’s Globalization Development

Additional resources