Hello Word

Hello, I'm Shital Shah. I currently work with NGIM team at Microsoft as SDE. I intend to use my MSDN blog for mostly technology/programming related stuff with occasional detours :).

Let me put out standard big disclaimer:

All views expressed in this blog are mine and not my employer or any teams at Microsoft. The information, data and opinions being presented here are neither validated nor endorsed by Microsoft and should not be considered as an official statement of the company.

So let's get started. One of the classic confusion that new Linq users come across is how to write composite key (or multi-key) joins in Linq to Objects (especially in C#). Unfortunately the join keyword in C# Linq is pretty limited compared to SQL, however, there are things that you can do in Linq that you can’t do with SQL joins. Hopefully this short entry will demonstrate some possibilities.

There are at least 4 ways to do composite key joins in Linq to Objects. Let's say you had two sequences defined like this:

    1: class MainRow
    2: {
    3:     public int PK1; public int PK2;
    4: }
    5:  
    6: class RelatedRow
    7: {
    8:     public int PK1; public int PK2;
    9:     public string RelatedText;
   10: }
   11:  
   12: //Somewhere else
   13: MainRow[] mainTable; 
   14: RelatedRow[] relatedTable;

And now let's say you want to produce a "de-normalized" sequence that combines rows from mainTable and relatedTable by joining on PK1 and PK2.

Option 1

Simply join in where clause:

    1: var join1 = from mainRow in mainTable
    2:             from relatedRow in relatedTable
    3:             where mainRow.PK1 == relatedRow.PK1 && mainRow.PK2 == relatedRow.PK2
    4:             select new JoinReturn
    5:             {
    6:                 MainRowPK1 = mainRow.PK1,
    7:                 MainRowPK2 = mainRow.PK2,
    8:                 RelatedRowPK1 = relatedRow.PK1,
    9:                 RelatedRowPK2 = relatedRow.PK2,
   10:                 RelatedText = relatedRow.RelatedText
   11:             };

Here, we are essentially doing cross join and filtering resulting sequence on two keys. Yes, very inefficient if you had large sequences to join but its probably the most efficient for small sequences!

Option 2

Join on temp object that has composite keys:

    1: var join2 = from mainRow in mainTable
    2:             join relatedRow in relatedTable
    3:             on new { KeyPart1 = mainRow.PK1, KeyPart2 = mainRow.PK2 } equals new { KeyPart1 = relatedRow.PK1, KeyPart2 = relatedRow.PK2 }
    4:             select new JoinReturn
    5:             {
    6:                 MainRowPK1 = mainRow.PK1,
    7:                 MainRowPK2 = mainRow.PK2,
    8:                 RelatedRowPK1 = relatedRow.PK1,
    9:                 RelatedRowPK2 = relatedRow.PK2,
   10:                 RelatedText = relatedRow.RelatedText
   11:             };

This you would see a lot and is the most efficient among all options described here for large data sets. Most SQL enthusiasts however will wonder about the time it takes to create all those temporary objects and if GC will kill itself looking at code like this. Believe it or not, in all my experiments, this seems to perform well for sequences as large as 10,000 elements. Also notice that you can go far beyond simple comparison. For instance, you can use custom IComparer along with a class to implement complex joins that uses all kind of operators other than equals or custom logic that accesses external data stores, Internet in real time and so on. You can go far beyond SQL here.

Option 3

Use powerful SelectMany:

    1: var join4 = mainTable
    2:             .SelectMany(mainRow => 
    3:                 relatedTable.Where(relatedRow => mainRow.PK1 == relatedRow.PK1 && mainRow.PK2 == relatedRow.PK2)
    4:                 .Select(relatedRow =>
    5:                     new JoinReturn
    6:                     {
    7:                         MainRowPK1 = mainRow.PK1,
    8:                         MainRowPK2 = mainRow.PK2,
    9:                         RelatedRowPK1 = relatedRow.PK1,
   10:                         RelatedRowPK2 = relatedRow.PK2,
   11:                         RelatedText = relatedRow.RelatedText
   12:                     }
   13:             ));

SelectMany is very powerful feature and above fragment shows how you can leverage it to perform multi-key joins. Notice that here too you can use operators such as <, >, <=, >= if your join involves them. However this method isn't as efficient as Option2 because it won't do internal hash table mapping to do its job in one fell swoop instead it will take each row in mainTable separately to get related rows. Well, still to keep in mind this option if your mainTable has only few rows.

Option 4

Using partial joins:

    1: var join5 = from mainRow in mainTable
    2:             join relatedRow in relatedTable
    3:             on mainRow.PK1 equals relatedRow.PK1 into subJoinTable
    4:             from subJoinRow in subJoinTable
    5:             join  mainRow2 in mainTable 
    6:             on subJoinRow.PK2 equals mainRow2.PK2
    7:             select new JoinReturn
    8:             {
    9:                 MainRowPK1 = mainRow.PK1,
   10:                 MainRowPK2 = mainRow.PK2,
   11:                 RelatedRowPK1 = subJoinRow.PK1,
   12:                 RelatedRowPK2 = subJoinRow.PK2,
   13:                 RelatedText = subJoinRow.RelatedText
   14:             };

Here we are doing slightly better than Option1 by getting intermediate table which would be pretty large if your input sequences are large (but not as bad as full cross join in Option1). This is an useful option if you knew PK1 is a major filter on sequences. On average, Option 3 and Option 4 might perform comparably.

In general, if you are doing complex joins on very large sequences using Linq to Objects then there is something wrong and you are bound to loose on performance compared to relational databases. You might give this a consideration in your designs.