Team Foundation Internationalization (Part 3) – Collation

 

In the last couple of posts (here and here) I wrote some considerations related to the UI Language of Team Foundation (both Server and Client) and some around Locale, and how those affect the user experience.

This time I am going to talk about Collation in TFS.

“what is a Collation?”

Collation is a concept that SQL Administrators and developers targeting SQL should already be quite familiar with, so I am not going to bore you with details (you can find some here and elsewhere on the web), but in summary it is a SQL property that affects (a) the codepage used to store non-Unicode data in SQL, and (b) the rules of characters/string sorting.

The TFS databases are Unicode enabled, as all text-type columns are of type nText, nChar or nVarChar, so point (a) above does not affect us.

So how much can (b) affect a TFS user? Depending on which characters you are using, it can range from wrong sorting to wrong search results to inability to check-in a file. Let’s see why and how to avoid that.

When you install SQL Server, you are prompted to select the preferred/default collation for the whole SQL instance. The SQL Setup will prompt you with a default collation that “fits” with the System Locale settings of the machine where you are installing it on. In fact, at setup time, you can select either a different one and/or select additional qualifiers (e.g. Kana sensitive, Width sensitive, etc). Once you have selected it, all databases, columns and metadata (e.g. Tables’ names, Stored Procedure’s names, variables’ names, etc) will respect that and behave accordingly.

Just to give you an idea of what could happen already: if you had the following simple statement << SELECT itemID FROM MyTable >> on a table whose field was actually named “ItemID” - it would run correctly in most cases, but not if you were to install it on a SQL server instance with a Case Sensitive collation or with a Turkish collation (even if Case Insensitive). So please make sure you keep all of your metadata case-sensitive, especially if containing any “i” letter.

We found and fixed several such issues in our Stored Procedures – these are breaking issues, but they are relatively easy to find and fix.

“Is that it? If not, what else?”

OK - that was about metadata - now let’s talk about data.

Sorting: given that Collation setting affects sorting, it is quite obvious why using a specific Collation will affect how strings coming from the TFS Server are sorted. One could argue that there are ways to affect that behavior from the client or at each request, so that the sorting matches the preferred sorting on the client (using the User Locale info). That is very true but comes at a price - performance. Especially if you are dealing with a lot of data and if you don’t have a specific Index for that on each field.

Unexpected search results: we said that Collation affects sorting, therefore it affects search. Usually, each character has a “sort weight” associated and each Collation defines what that is. Unfortunately, some characters do not have a sort weight assigned on some/all Collations, so they are treated almost as “non-existing” or “equal” in comparisons or, in search, as wildcards.

That’s not specific to SQL Collations, but even in the .Net Framework if you were to compare e.g. character 'ᢨ' (u18A8 - MONGOLIAN LETTER MANCHU ALI GALI BHA) with 'ᢤ' (u18A4 - MONGOLIAN LETTER MANCHU ALI GALI ZHA) you would get that they are “equal” across most Cultures and Comparers (except for the Binary comparers and the Mongolian traditional culture “mn-Mong-cn”) – as already discussed here by Michael Kaplan (based on my question Smile)

image

(as you can tell from the Culture name, I’m running this on an Italian Windows 7 machine, but it does not matter for the overall result)

So what happens if you search for such a character (one with a sort weight == 0 in a specific SQL Collation, that is) in a SQL field? If you were to run a <<Select * from MyTable where Title='ᢨ'>> statement against a SQL Instance/DB/Field with, say, a Latin1_General Collation, that would return all rows in that table!

There’s really not much you can do about it, except selecting the most appropriate Collation for your needs (both in terms of data and metadata) before you install TFS. Please see some guidance on this article for more info - SQL Server Collation Requirements for Team Foundation Server.

Inability to Check-In a file: given the description of the issues above, it should be fairly easy now to understand how that would happen. TFS stores file names that are being checked-in/out of the Source Control repository in a SQL column – hence it may happen that SQL “tells” TFS that a file name is already in the list (as if it had already been checked in), according to what SQL thinks “equal” means.

In the image below, I was able to check-in both files “Web01test.cs” and “Web01test.txt” (notice how “Web” in the second filename is made by Full Width characters – uFF37, uFF45, uFF42) when using a Latin1_General_100_CI_AS_KS_WS (WS = Width-sensitive)

image 

while I was not able to do that with the Japanese default SQL Collation (Japanese_CI_AS), because it is not Width-sensitive. Files could be added to the Source Control Explorer, but one of the two files always “disappeared” form the Pending Changes dialog.

image

The sample above uses the Width sensitivity, but that could be true for other properties, like Kana-sensitivity, and for special range of characters that may not have a sort weight defined in a collation. An example of this would be Surrogate pairs (Unicode Extension B), or some Chinese Minority Scripts (Mongolian, Yi, etc) and other ranges more recently defined.

To make sure you can handle these characters, you should select the appropriate SQL Collation that supports them. Usually, more recent collations do that (like the ones ending with _100 – if you are using SQL 2008)

In Summary

I hope the topic was of some interest/relevance to you, and I hope it will help you prevent some issues when dealing with international characters and SQL. It is not meant to be a complete guide, but just a collection of random thoughts based on our “real life” experience with TFS. You can certainly find more articles on this in the MSDN Library.

Stay tuned for more TFS-related topics and please let me know if you have any comment/question/request.

Thanks! Aldo

Technorati Tags: Software Globalization,Globalization,TFS,SQL Server,Collation