Why do I write so many collation topics?


   You may wonder why I keep writing about collation in my blogs.  I have found that our Books Online topics related to collation are not clear enough, and they sometimes confuse our readers.  I intend to present my ideas about collation in a series of blog posts, and eventually I want to group all these topics together to draw a clear picture of collation. My posts will try to answer the following questions in detail:



  • How should I choose a collation?

  • What is the difference between collation A and collation B?

  • How does collation impact query behavior, such as string functions and variable assignment? (See the short example after this list.)

  • What are the best practices for using SQL Server collations?
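
 As a taste of the third question, here is a quick illustrative sketch (not from any particular topic in the series) showing how the collation in effect changes the result of a comparison:

      -- Illustrative only: the same comparison succeeds or fails
      -- depending solely on the collation in effect.
      DECLARE @s varchar(10);
      SET @s = 'SQL';

      -- Case-insensitive collation: the strings compare as equal.
      SELECT CASE WHEN @s = 'sql' COLLATE Latin1_General_CI_AS
                  THEN 'equal' ELSE 'not equal' END AS ci_result;

      -- Case-sensitive collation: the strings compare as different.
      SELECT CASE WHEN @s = 'sql' COLLATE Latin1_General_CS_AS
                  THEN 'equal' ELSE 'not equal' END AS cs_result;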

 Again, I strongly suggest you guys send your comments or collation-related questions to me.  If I cannot answer them, I will find a language expert who can, and I will share the answer here as well.

Comments (9)

  1. Jiggle says:

    SQL Server 2005 (90) versus 2008 (100) Binary Collation BIN2 etc.

    I cannot find any information detailing the internals of collations and the actual difference between versions of collations.

    I would like to know what the actual internals are, and what the difference is between Latin1_General_BIN2 and Latin1_General_100_BIN2.

    Can you help?  Also, if you can, where did you get this information? I have trawled the net and cannot find what I need.

    I noticed something odd: with Latin1_General_BIN, an NVARCHAR column ordered characters as NULL, 32, 0, 1, 2, 3, … (SPACE 32 straight after NULL), but a VARCHAR column ordered them NULL, 0, 1, 2, etc.

    Do you have any explanation for this?

    The BIN2 collation orders as I expected for both NVARCHAR and VARCHAR, i.e. NULL, 0, 1, 2, 3, etc.  Is the BIN2 collation merely surrogate-character friendly (treating a UTF-16 surrogate pair as two UCS-2 characters outside the valid range), or is it fully surrogate-aware?

    The Microsoft description of the differences between BIN and BIN2 appears to mask the fact that, in my tests, BIN comparison of multi-character NVARCHAR data for ordering is broken. Hidden away is a reference to comparing the first character as 16 bits and then each following byte of the characters, which on little-endian (Intel) storage will order incorrectly.  Is my understanding correct?

    I think that my application should have its database created with a BIN2 collation.  I feel that my application's equality tests will then match SQL Server's equality tests on Unicode data by default.

    I can then set a separate collation for any special case.  I will perform any localised ordering within the application.  The only thing I cannot decide is the default collation for the database server instance.  Are there any issues with setting this to BIN2, apart from compatibility with existing databases and the case sensitivity of object names?

    Thanks,

    Justin…

  2. QingsongYao says:

    Hello, Justin

     One quick answer is that all BIN2 collations are the same in terms of sorting and comparison, no matter which version they were introduced in.  The same is true for all BIN collations as well.  I will follow up on your questions later, and I will also post a blog entry about the binary collations.
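
     In the meantime, here is a small repro sketch of the BIN versus BIN2 difference you observed (my own illustration; the two-character strings are chosen so the second character exposes the byte-by-byte comparison):

         -- Illustrative only: two NVARCHAR values whose second
         -- characters reverse order between BIN and BIN2.
         CREATE TABLE #t (s nvarchar(10));
         INSERT INTO #t VALUES (N'a' + NCHAR(0x0100));  -- U+0100, UTF-16LE bytes 00 01
         INSERT INTO #t VALUES (N'a' + NCHAR(0x00FF));  -- U+00FF, UTF-16LE bytes FF 00

         -- BIN: first character compared as a 16-bit unit, the rest byte
         -- by byte in little-endian storage order, so U+0100 (00 01)
         -- sorts before U+00FF (FF 00).
         SELECT s, UNICODE(SUBSTRING(s, 2, 1)) AS second_char
         FROM #t ORDER BY s COLLATE Latin1_General_BIN;

         -- BIN2: every character compared by its 16-bit code point,
         -- so U+00FF (255) sorts before U+0100 (256).
         SELECT s, UNICODE(SUBSTRING(s, 2, 1)) AS second_char
         FROM #t ORDER BY s COLLATE Latin1_General_BIN2;

         DROP TABLE #t;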

  3. intangible says:

    If the default collation of a Unicode column is Latin1_General (for the English language), what sorting order is applied to other languages in the same column, e.g., Chinese or Spanish? Are they sorted by Unicode code points (binary order)? Thanks.

  4. QingsongYao says:

    Hello,  Intangible

      It is a common misunderstanding that the Latin1_General collation can only sort Latin-1 characters correctly.   Unicode contains characters from a wide range of languages.  If the sorting rules of two languages do not conflict, then it is possible to sort characters from the two languages mixed together.  For example, Latin1_General can also sort the German language correctly.  As for Chinese and Spanish, they are sorted linguistically.   However, since Chinese and Spanish each have two sorting algorithms, we had to pick one of them for the Latin1_General collation.  The Chinese language has an additional issue: its characters are reused by Japanese and Korean. In other words, you may not know whether a character is Chinese or Korean, which makes choosing the correct sorting order for these characters much harder.
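
      Here is a small illustration (the sample words are my own) of Latin1_General sorting German linguistically rather than by code point:

         -- Illustrative only: the umlaut ü (U+00FC) sorts with u under
         -- Latin1_General, not after z as its code point would suggest.
         CREATE TABLE #names (name nvarchar(20));
         INSERT INTO #names VALUES (N'Mude');
         INSERT INTO #names VALUES (N'Muffe');
         INSERT INTO #names VALUES (N'Müller');
         INSERT INTO #names VALUES (N'Mz');

         -- Linguistic order: Mude, Muffe, Müller, Mz.
         SELECT name FROM #names ORDER BY name COLLATE Latin1_General_CI_AS;

         -- Code-point order: Mude, Muffe, Mz, Müller (U+00FC > 'z').
         SELECT name FROM #names ORDER BY name COLLATE Latin1_General_BIN2;

         DROP TABLE #names;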

  5. intangible says:

    Hi Yao, I read all your posts about collation and they are excellent. SQL Books Online sometimes confuses me. Regarding your answer, I have some further questions. Basically I have a table with two columns, for example,

         ctrycd char(2)

         data nvarchar(50)

    I will store text of multiple languages in data. But each query will only filter and sort on one language. Because I have the ctrycd (country code), I will not mix Chinese with Korean. Now the problem is I want to sort each language linguistically. If the column collation is a Chinese Windows collation, I have a lot of choices, like Chinese_Simplified_Pinyin_100_CI_AS, Chinese_Simplified_Stroke_Order_100_CI_AS, Chinese_PRC_90_CI_AS. But if the collation is Latin1_General, how do I know which sorting order is used for Chinese, if it is not sorted by Unicode code point? Since many more languages can be stored in this column, can each language be sorted according to its own linguistic order under the default Latin1_General collation?

    The reason I do not put text of different languages into different columns or different tables is that we might store such text in a lot of places. Separate storage would give me the freedom to apply collations separately, but the database schema could become extremely difficult to maintain, not to mention that we might need to add support for additional languages.

    However, if I stick with the one-column Unicode design, I might need to add an explicit COLLATE clause to my queries. Will it affect performance? Do you have any data showing how big the performance difference would be?

  6. QingsongYao says:

    Hello, intangible

      When using an explicit COLLATE clause, you will see no performance difference if there is no index defined on that column; otherwise, you will not be able to use the index defined on the column.
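
      For example (a sketch based on your table; the index and object names are hypothetical):

         -- Hypothetical schema from your question.
         CREATE TABLE dbo.LocalizedText (
             ctrycd char(2)      NOT NULL,
             data   nvarchar(50) COLLATE Latin1_General_100_CI_AS NOT NULL
         );
         CREATE INDEX IX_data ON dbo.LocalizedText (data);

         -- Sorting in the column's own collation can use IX_data for ordering.
         SELECT data FROM dbo.LocalizedText
         WHERE ctrycd = 'de'
         ORDER BY data;

         -- An explicit COLLATE clause gives correct Chinese ordering, but the
         -- plan must add a sort operator instead of reading IX_data in order.
         SELECT data FROM dbo.LocalizedText
         WHERE ctrycd = 'cn'
         ORDER BY data COLLATE Chinese_Simplified_Pinyin_100_CI_AS;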

      Latin1_General_CI_AS or Latin1_General_100_CI_AS can sort more than 50 languages. I will describe the reason behind this, and the languages supported by this collation, in a later post.  You may have noticed that we have French_CI_AS and Turkish_CI_AS collations.  The reason is that these languages have sorting conflicts with Latin1_General, which had to be resolved by introducing new collations.

     Please let me know if you have further questions.

  7. Leif says:

    Good blog posts about Unicode and collations.

    However, I cannot find an answer to the following question.

    I have the following table:

    CompanyID varchar(20) COLLATE Latin1_General_CI_AS

    CompanyName varchar(60) COLLATE Latin1_General_CI_AS

    This table structure is used in an English database, as well as in Russian and Chinese ones.

    So the data in the table is not saved "correctly", but it can be displayed correctly in Chinese.

    But now we are going to change the table to the following:

    CompanyID nvarchar(20) COLLATE Latin1_General_CI_AS

    CompanyName nvarchar(60) COLLATE Latin1_General_CI_AS

    How can we convert the Russian and Chinese characters saved in the databases to correct Unicode (nvarchar) characters?

    We have experimented with the MultiByteToWideChar function and it seems to work for Chinese.

    Are there any other options?

    Performance?

  8. QingsongYao says:

    Hello, Leif

     The varchar data type is always associated with a code page, and it can only store the characters defined in that code page.  For example, the code page of the Latin1_General_CI_AS collation is 1252, which is Western European. You can look at the code point table for it here: http://en.wikipedia.org/wiki/Windows-1252

     However, you say you stored Chinese and Russian characters in the varchar column.  That seems impossible, but it really is possible.  The reason is that varchar(10) is just 10 bytes, and you can store any byte sequence in it, no matter what the encoding is.

     I believe this was achieved by binding the column as the varbinary data type in the client program and sending the byte sequences to the server directly.  Note that if the real encoding of the data (encoding and code page are the same concept) is different from the code page associated with the column, you will have trouble when you issue queries against the data: SQL Server assumes the code page is 1252, but you stored the data in another encoding.  You will observe SQL Server returning characters you cannot read, or sometimes a question mark (?), which indicates the data is corrupted.

     So in your case, since you have already inserted data with a different encoding into the varchar column, you first need to identify whether the data was corrupted. Second, find the real encoding of the column; from your description, it looks like an encoding that can store both Russian and Chinese characters, which might be either UTF-8 or UTF-16. Then call the MultiByteToWideChar function with the correct code page.  Note that if your data is already corrupted in the database, you might have trouble converting it to the nvarchar type.
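
     If the real encoding turns out to be one for which SQL Server has a matching code page (say code page 936, Simplified Chinese), one possible server-side sketch is to relabel the bytes through varbinary, so that no code-page translation touches them, and then convert to nvarchar. Treat this as an illustration and verify it on a copy of your data first:

         -- Sketch only: assumes the stored bytes are really code page 936.
         -- Casting through varbinary copies the bytes verbatim; the COLLATE
         -- clause relabels the expression without translating it; the final
         -- convert to nvarchar then decodes the bytes using code page 936.
         SELECT CompanyID,
                CONVERT(nvarchar(60),
                    CAST(CAST(CompanyName AS varbinary(120)) AS varchar(120))
                        COLLATE Chinese_PRC_CI_AS) AS CompanyNameUnicode
         FROM dbo.Company;  -- table name guessed from your description

     If the real encoding is UTF-8 or UTF-16, no varchar code page matches it, and the client-side MultiByteToWideChar route you tried is the right one.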

  9. Román says:

    I have a question that I have not been able to get answered yet, and the Microsoft documentation seems confusing, at least to me.

    Obviously I have googled it but I cannot find a real answer to this:

    Why should I choose Windows collations instead of SQL collations?

    What are the real advantages of using Windows collations instead of SQL collations?

    I know that they handle Unicode and non-Unicode data differently, but in real life, what is the difference?
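
    For example, here is one concrete difference I have read about but not verified myself: with varchar data, a SQL collation compares the hyphen as an ordinary character (string sort), while a Windows collation uses word-sort rules that give it minimal weight, so the same ORDER BY can return rows in a different order:

         -- Same varchar data, two collations, different order.
         CREATE TABLE #words (w varchar(10));
         INSERT INTO #words VALUES ('a-c');
         INSERT INTO #words VALUES ('ab');

         -- SQL collation (string sort): '-' (0x2D) < 'b', so 'a-c' comes first.
         SELECT w FROM #words ORDER BY w COLLATE SQL_Latin1_General_CP1_CI_AS;

         -- Windows collation (word sort): the hyphen gets minimal weight,
         -- so it compares roughly as 'ac' vs 'ab' and 'ab' comes first.
         SELECT w FROM #words ORDER BY w COLLATE Latin1_General_CI_AS;

         DROP TABLE #words;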

    Thank you in advance!
