Collation Basic


In SQL Server, each string type, such as varchar, nvarchar, text, ntext
associate with a collation no matter whether you specified or not during
creation time.   If your app just works
fine and you never deal with collation explicitly or you even don’t know collation,
Congratulation! You are really lucky as most of our customers do.  However, as soon as you hit collation issues
in your app, you will find the collation in SQL Server is really messy.   I helped several customers to solving their
collation issues, and I like to share some knowledge to your guys in a serie
of blogs. In this article, I will try to cover the basic concept of
collations.  Before I start this, I
recommend your guys read a great article: International
Features in Microsoft SQL Server 2000
by Michael Kaplan. 

In this post, I will use varchar type to represent SQL Server’s non-Unicode
character data types, which are char, varchar, varchar(max) and text, and use
nvarchar type to represent SQL Server’s Unicode character data types, which are
nchar, nvarchar, nvarchar(max) and ntext type. So Let us start on what is
collation? In SQL Server, a collation is an attribute/property associate with a
string type.  It specifies  

  • How to encode strings into byte sequence and
    store into a varchartype? Each
    collation has a codepage, which
    was used for this purpose.
  • The language of the strings stored in the character
    type, such as us-English, Chinese, etc. Each
    collation has a Language
    Identifiers
    to identify the language in a country.
  • How to sort and compare character data.

For example, there is a collation Chinese_prc_stroke_90_ci_as introduced in
SQL Server 2005.   It use code page 936 to encode
varchar types, and the language is Simplified Chinese (LCID 0x0804, which is mainly
used by Mainland China).  The Chinese have two different sorting
methods, one is based on the pronunciation, and another is based the stroke
order (A stroke is a movement of a writing instrument, and the stroke order
refers to the correct order in which the strokes of a Chinese character are
written.

Wait a minute, what is the “90”,
and “ci_as” mean?  The 90 is the collation version.  We introduce new collation in each release of
SQL Server, the 90 level collations are introduced in SQL Server 2005 (the
internal version is 9.0).  If you see a
collation has name of _100_, which means that it was introduced in SQL Server
2008 (internal version is 10.0).   If the
version is missing, which means that it was introduced in SQL Server 2000 or
prior release.  “ci_as” is called
comparison flag.  Sometime, we want
slightly different comparison behavior, such as case insensitive, etc, that is
the “ci_as” part comes into play.  I am
not going to discuss details about the comparison flags, you can find the
information from  Windows Collation Sorting Styles 

Let us end today’s topic by answering following question: how can I get a
list of collations supported by SQL Server and the properties of a collation.  You can find the list of collations from BOL “Collation
Settings in Setup”
or function fn_helpcollations.  You can get the details about one collation
by calling function CollationProperty.  Tomorrow, I will discuss different types of
collations, it will be a very interesting topic since your guys will be
confused by these types.

Comments (3)

  1. AReader says:

    It is a nice blog!

    Could you please tell me if Collation affects datetime datatype or only for string type? If it is, how does it affect the sql server store the datetime value? For example, store the time value in the different format (12 hours vs. 24 hours).

    Thanks!

  2. Qingsong says:

    Collation only have impact on string column. There is another setting called Language, which will have impact on how we convert a string to datetime.

    For storing datetime value, we always use interal format to storing, which is language independent.   Please sende me more question, and I can explain you more.