Today, I will
discuss SQL Server’s binary collations.
In SQL Server, we have two kinds of binary collations: BIN collations, whose names end with _BIN, and BIN2 collations, whose names end with _BIN2. In SQL Server, a string value (either a varchar or an nvarchar value) is encoded and stored as a sequence of bytes. For nvarchar values, we always use the UCS-2 encoding. For varchar values, we use the code page of the collation associated with the value.
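As a rough sketch (in Python rather than T-SQL) of the two storage encodings; cp1252 here is just an assumed code page standing in for a Latin1 collation:

```python
# Rough illustration of the two storage encodings (Python, not SQL Server).
# UCS-2 matches UTF-16LE for characters in the Basic Multilingual Plane.
text = "abc"
nvarchar_bytes = text.encode("utf-16-le")  # nvarchar: two bytes per character
varchar_bytes = text.encode("cp1252")      # varchar: the collation's code page (assumed cp1252)
print(nvarchar_bytes.hex())  # 610062006300
print(varchar_bytes.hex())   # 616263
```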
For binary collations, we sort string data with a sorting algorithm based on the string’s binary sequence. Now, let us look at the binary sorting algorithm.
For the varchar type, the algorithm is quite simple: sorting a set of varchar values with a binary collation gives the same result as sorting them by their binary sequences. In T-SQL, the following two queries return rows in the same order:

select * from T order by varcharC
select * from T order by convert(varbinary(8000), varcharC)
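The same idea can be sketched outside T-SQL. Here is a small Python illustration (cp1252 is an assumed single-byte code page) showing that a binary sort of varchar-style data is simply a sort of the encoded bytes:

```python
# Sorting single-byte strings by their encoded bytes, as a binary
# varchar collation would. cp1252 stands in for the collation's code page.
values = ["b", "A", "a", "B"]
binary_order = sorted(values, key=lambda s: s.encode("cp1252"))
print(binary_order)  # ['A', 'B', 'a', 'b'] -- 0x41, 0x42 sort before 0x61, 0x62
```

Note that this order is case sensitive by construction: uppercase letters have smaller byte values, so they all sort first.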
Note that this holds for both BIN and BIN2 collations.
Now, let us look at sorting nvarchar values. In this case, we cannot sort purely by the binary sequence. The reason is that the binary sequence represents a sequence of UCS-2 characters, which take two bytes per character, and SQL Server internally stores UCS-2 characters in little-endian byte order, so a binary sort can produce undesired results. For example, the character a (usually written as U+0061, its Unicode code point) is encoded as 0x6100 in SQL Server, so if we sort characters in their binary-sequence order, we will not get a result ordered by Unicode code point, which is what many applications want. In SQL Server’s BIN2 collations, we sort the nvarchar type according to Unicode code points instead of the binary sequence (internally, the comparison is done per WCHAR, which is two bytes). The BIN collations use a weirder algorithm: it first compares two strings based on the WCHAR values of their first characters; if those values are equal, it compares the remaining binary sequences byte by byte.
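To make the difference concrete, here is a minimal Python sketch of the two comparisons as described above (the function names and sample strings are mine, not SQL Server’s):

```python
from functools import cmp_to_key

def bin2_key(s):
    # BIN2-style: compare WCHAR by WCHAR, i.e. code point order for BMP characters.
    return [ord(c) for c in s]

def bin_cmp(a, b):
    # BIN-style: compare the first WCHAR numerically; if equal, compare the
    # remaining UTF-16LE (little-endian) byte sequences byte by byte.
    first_a = ord(a[0]) if a else -1
    first_b = ord(b[0]) if b else -1
    if first_a != first_b:
        return -1 if first_a < first_b else 1
    rest_a = a[1:].encode("utf-16-le")
    rest_b = b[1:].encode("utf-16-le")
    return (rest_a > rest_b) - (rest_a < rest_b)

# U+0001 is stored as bytes 01 00 and U+0100 as bytes 00 01, so a raw byte
# comparison of the tails reverses the code point order.
data = ["a\u0100", "a\u0001"]
print(sorted(data, key=bin2_key))             # BIN2: 'a'+U+0001 first (code point order)
print(sorted(data, key=cmp_to_key(bin_cmp)))  # BIN:  'a'+U+0100 first (byte order of the tail)
```

The two sorts disagree exactly when the strings share a first character and a later character’s two bytes compare differently in little-endian order than its code point would.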
I hope you guys can follow my writing. If not, you can simply assume that BIN2 is a Unicode code point based sorting algorithm, which is what many customers want, and that BIN is not.
From the sorting algorithm’s point of view, all BIN collations use the same algorithm, and all BIN2 collations use another. Then why do we have Latin1_General_BIN and Japanese_BIN collations? The reason is that they use different code pages to encode the characters stored in the varchar type. A Latin1_General_BIN collation cannot hold any Japanese characters, so even though these collations use the same algorithm to sort the data, people still need to choose the language in which the data will be stored, such as English, Chinese, Japanese, etc. Note that this only applies to the varchar data type. For the nvarchar data type, these collations behave exactly the same.
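A quick way to see the code-page limitation (again a Python sketch; cp1252 and cp932 are my assumed stand-ins for the Latin1_General and Japanese code pages):

```python
# Hiragana 'a' (U+3042) fits in the Japanese code page but not in Latin1.
jp = "\u3042"
print(jp.encode("cp932").hex())  # representable under the Japanese code page
try:
    jp.encode("cp1252")
except UnicodeEncodeError:
    print("not representable in the Latin1 code page")
```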
On the other hand, SQL Server has many linguistic collations, which sort strings linguistically using the language associated with the collation. All collations that are not binary collations are linguistic collations. For example, Latin1_General_CI_AS is a linguistic collation; it uses a sorting algorithm compatible with English and many Western European languages. Don’t be misled by the Latin1_General name: it can actually sort all characters defined in Unicode 3.2, and it sorts many other languages correctly as well (as long as the language has no sorting conflict with the Latin1_General sorting rules). I will write another article describing linguistic collations in depth.
In conclusion, binary collations have better performance than linguistic collations, and that is the main advantage of using a binary collation. A binary collation is always case sensitive and accent sensitive. Users should prefer BIN2 collations over BIN collations.