The DLinq Dilemma: SQL shall be SQL, and C# shall be C#, and never the twain shall meet


With a tip of a hat to Rudyard and his ballad, I would say


Oh SQL is SQL and C# is C#, and never the twain shall meet
Till rows and objects manually a dev has both to beat
But there is neither SQL nor C#, nor join nor dot
When LINQ stands to link the twin what magic have I got


Ballads aside, this is a real problem that we are grappling with. There is really nothing new about this problem – it is the problem of Object Relational Mapping (ORM). The two worlds are rules by languages and run-times that have different semantics. The problem is less acute when you are providing a window to the other through classic API and string query language based ORM components. It is more important when you are truly bridging the two with language integration – that is what LINQ does – with its LINQ to SQL (aka DLinq) component. Here is the first in the series and the most controversial one – null comparison.


In SQL, a = b is not true when a and b are both null (special settings aside).
In C#, a == b evaluates to true when a and b are both null.


(Asides:
1. I am talking about the behavior of shipped products here – not LINQ design choice so far. And I am not going to reopen the debate about null semantics in C# here. You can see posts in my blog and my colleagues’ blog for background.
2. VB has a different semantics so this post is C#-specific)


So in the LINQ project, we have to pick our translation from C# to SQL. Consider a simple query:


from c in db.Customers
from s in db.Suppliers
where c.City == s.City
select new {c.CustomerID, s.SupplierID};


If we go with the C# behavior, SQL users moving to LINQ will be surprised.
If we pick SQL behavior then C# users may be surprised. Worse still, those who run the same query with data moved to in-memory collections will get different results if c.City and s.City happen to have null values.


In previous CTPs, we chose the SQL way. So the generated query looks like.


SELECT [t0].[CustomerID], [t1].[SupplierID]
FROM [Customers] AS [t0], [Suppliers] AS [t1]
WHERE [t0].[City] = [t1].[City]


Of course, we could have done a different translation for higher semantic fidelity with C# by changing the comparison to
WHERE t0.City = t1.City
OR (t0.City is null and t1.City is null)


Under this proposal, a user can still get SQL semantics by changing predicate in LINQ query to
c.City != null && c.City == s.City


So here are my questions:
1. Did you even notice the difference between C# and SQL semantics?
2. Which one do you prefer and why? (FWIW, we got no questions/comments about this as far as I remember – zip, zilch, nada)


There are more manifestations of null and other semantics but more about them and our reasons for the current choices in subsequent posts.


Comments (49)

  1. Paul Wilson says:

    I prefer the SQL semantics.  Why?  Because when I’m using Linq for Sql, its Sql that is my target, and the "conversion" between Linq expressed in C# and Sql should be as close to what I express as possible.  This becomes even more true if you consider the differences between C# and VB that you’ve said to ignore for now — don’t ignore the difference, instead realize that its pointing you to the fact that the object language is not the real point here and never should be — its all about Sql instead.

  2. Gabe says:

    I would want the semantics to be the same for in-memory objects as they would be for SQL-based objects. That said, the most important thing is for the behavior to be transparent and well-documented.

  3. Edgardo says:

    I’d go with the C# way, if a is null and b is null, then they’re both null, if == stands for equality, then all you need to do is look into the dictionary:

    "of the same measure, quantity, amount, or number as another"

    "like in quality, nature, or status"

    The problem here is what do we do if we also want to get the rows where c.City and s.City are null?, well I’d use "?", just like with nullable types in c#

    from c in db.Customers

    from s in db.Suppliers

    where (c.City == s.City)?

    select new {c.CustomerID, s.SupplierID};

    Just like we use ? to indicate that value types can be null….

    Cool ugh? 🙂

  4. wagahai says:

    When using Linq targeting C#, I would expect it to honor the C# language. That includes how it handles == for nulls. How Linq does its magic is not important to me as long as it gets it done while hiding the details. When I write C# code, I expect C# behavior. If I expected SQL behavior, then I would write it directly in SQL, not Linq over C#.

  5. marlint says:

    I’d expect it to follow the SQL example. Just to take the example supplied, if you’re doing a join between a customers table and a suppliers table do you really want to join all the customers with an unknown city to all the suppliers with an unknown city? Of course not- this is just going to produce a load of meaningless pieces of data…

  6. Slobodan Filipovic says:

    Add two operation relational equal

    where c.City = s.City

    and C# equal

    where c.City == s.City

    so all people are happy

  7. Slobodan,

    We did consider some syntactic sugar in the past (the assignment won’t work since it is already taken in C# language) like

    c.City ?== s.City

    but decided against it at the language design level.

    But you suggestion points to that approach. DLinq is just a library component and can make decisions on translation one way or the other. Adding language features is a much bigger deal and such sugar does not meet the very high bar.

  8. Henk Kok says:

    Having started development in database environments, my answer should be no surprise: I prefer the Sql way.

    As I have always understood, in Sql null stands for "unknown", so actually the outcome of the comparison null = null should be null too, because it is unknown if unknown equals unknown. As this is very impractical in use (imagine having to write "WHERE ISNULL/NVL(a = b, TRUE/FALSE)"), it was decided to let null = null be false.

    In practice however, null values are often used to indicate "not (yet) applicable" for a value, often used in modelling subtypes ("never the twain shall meet…") or state. I have to credit Dr. Rene Veldwijk, a dutch "Sql guru" (had to use the quotes as one of his rules of thumb is "never trust a so-called guru") for arguing that sql should introduce a second null("-ish") value to indicate just that.

    In that semantic context it would make sense to have null = null return true.

    In short: I second Edgardo in his wish to let the programmer decide upon the null-comparison behavior and the ?== operator looks very usefull to me, sorry to hear that it didn’t make it. How about statement attributes (I personally prefer the ?== operator!):

    [StatementAttributes(NullBehavior=NullBehavior.CSharp/AnsiSql)]

    {

    <statement>

    }

  9. Frank Buckley says:

    I would vote for C# semantics. I thought that Linq was aiming to be a consistent abstraction over multiple data stores. To pick the semantics of one data store in one case and the semantics of another in another case breaks that abstraction and is likely to create confusion. If a Linq query (in C#) did one thing against a Linq to Sql provider and a different thing against a (say) Linq to Xml provider, this would be a mess. If the demand for an operator to support alternative semantics is overwhelming, then I can’t see ?== doing too much damage to C#.

  10. nlhowell says:

    I’d prefer consistency with all the other queries; since this is C#, use C# syntax. The whole point of LINQ is to get away from datasource-specific conventions, and to stick with programming language conventions.

  11. Mike Kidder says:

    C# semantics…. abstract what is below….

  12. Leon says:

    C# semantics. I like things feeling consistent in a language.

  13. Sven says:

    Why not ask Chris Date for his opinion … if ‘someone’ has been confronted in practice with this dilemna it must be him (with all due respect for the Microsoft gurus expertise ;-o, of course)

  14. jods says:

    Linq is all about writing queries in C#. What is there to add ? I am writing in _C#_. So why should I expect a non-C# behavior ? And Linq is supposed to unify data access. If each Linq flavor starts to behave differently the beautiful promise is broken, and I can’t switch my data source from a SQL DB to an XML file.

    This is what I want conceptually. In practice I’m a bit worried about things like joins, as a previous poster already mentionned. Unfortunately, no good solution comes to mind. Maybe the Linq compiler should be optimized to transform this join:

    C#: Person1.BossId == Person2.PersonId && Person1.BossId != null

    to simply

    SQL: Person1.BossId = Person2.PersonId

    if this optimization can be made, I think it’s quite ok. It is natural for someone using C# to exclude the null explicitely.

    The fact that the SQL guys are going to be surprised doesn’t bother me: if they want to code in C#, they should learn it first, right ?

    The C# guys being surprised by their own language is much more disturbing.

  15. Chris Salt says:

    I don’t think this is a matter of SQL versus C# semantics. Null means different things in C# and SQL. In SQL, null means unknown. You can’t compare two unknown values against each other, and say they are equal.

  16. ken ambrose says:

    1. Did you even notice the difference between C# and SQL semantics?

    Not really, because I keep database access and user interface logic completely seperate in my head, and completely seperate in my programming.

    2. Which one do you prefer and why? (FWIW, we got no questions/comments about this as far as I remember – zip, zilch, nada)

    I prefer the current Sql Server treatement of null comparisons. Try and evaluate this function: (a > b).

    Obviously you can’t because I haven’t told you what the values of a or b are! And that is exactly the situation when you compare two nulls, so why should such a comparison provide either true or false?  Either one is a guess.

  17. anil says:

    My vote is for the behaviour to stay inline with SQL.

    As posted in multiple posts two unkown value cannot return true when checked for equality.

  18. Alexey Lavnikov says:

    C# behavior for the win!

    Programmer has to care about null values himself (like we all already did in past before Linq).

    If you make Sql-favorable breach in C# today, tomorrow people will ask about default null dereferencing. Like Person.Father.Mother.BirthDay has to return null if Father or Mother references are not set.

    BTW, in my experience I have never ever thought about NULL is DB as unknown value. Unknown values mean you have a bad DB schema design. NULL for me is known ABSENCE of the value.

  19. Jon says:

    I understand the dilemma. I read through the comments and agreed at first with those desiring SQL ("behavior should follow what I target"). But I tend to agree with the school of thought that the whole idea of Linq was *supposed* to be working with engine-neutral data objects in the C# language, not SQL, despite SQL-like grammar being added to C# when accessing the data via Linq.

    Incidentally, SQL behaviors existed before C#; in my opinion, that makes them old and obsolete.

    However, the very specific example you bring up about comparing a against b while both are null is one that should be addressed differently than "C# behavior vs. SQL behavior"; there should be a syntactical differentiator (i.e. alternate operator) between the two so that both options are available, because I can see both comparisons being useful at different times. What syntactical differentiator? I don’t know, you guys are designing the thing. You went through something fairly similar to this when comparing values vs. instances with Object.Equals().

  20. Jean-Claude Manoli says:

    If I write LINQ queries in C#, then I expect C# semantics.  

  21. Kris Williams says:

    If ?== looks wrong in C# for the SQL unknown (null = null == true) comparison, then perhaps use ~= (the tilde looks like an "n").  So leave == alone, and let it do the default C# behavior.

    (null == null) > true

    (null ~= null) > false

  22. Kris,

    The issue is not what token we use (?== or ~=) although we do agonize a lot over such things. The main issue was whether it was worth adding more things to the language. I was merely pointing out that language is not going to add features for one library component like DLinq – even if it comes from the same design team.

    The tilde also reminds me of approximate equality.

    Thanks for your suggestion though.

    Dinesh

  23. Alexey wrote

    > If you make Sql-favorable breach in C# today, tomorrow people will ask about default null dereferencing.

    Interesting that you mentioned this. I should write about this in the context of LINQ join. Then it might not seem that far-fetched a thing. (No, we are not doing such a thing – just that there is a place where people look for such things.)

  24. Alexander Kozlovsky says:

    NULL value can potentially have two absolutely different meanings: "the value is unapplicable" and "the value is unknown". The "unknown" meaning was (unfortunately) selected when Codd introduce concept of NULL values in extended relational algebra, and this is the reason of three-valued logic in SQL.

    But in reality NULL values uses primarily in foreign keys and with meaning "the value is unapplicable". Because of this usefullnes of three-valued logic in SQL is somewhat questionable. For this reason it is better in LINQ to stay with the classic two-valued logic and treat null values as equal.

  25. I don’t care what semantics are used, as long as it’s consistant with it’s context (which here is C#).  I don’t see how you could use SQL semantics therefore; as that would most existing apps.

    Therefore my stylistic preferece is pretty much moot.

    It would be extremely annoying to anyone implementing or extending their own LINQ-able collections if the semantics were to differ from the norm.

  26. Aiax says:

    The expresion x ==y in C# is not actually checking to see if x and y are equal, if they are refrence types.

    If x and y are reference types the expression checks to see if x and y point to the same object. It does NOT check to see if the values in the object are equal.  Which is why if x and y are null the expression returns true, they both point to nothing.

    This discussion is silly however, because a Value type can never be null.

  27. Zerg says:

    Попал мне в руки сей дакумент. Я читал, ахуел. Афтар дакумента – охранник в школе!

    Кароче, он ночю пизданул из сейфа в кабинети директара, лаве, двуху зелени. Утром его приняли, и заставили написать обиснительную и расписку. Да в гору ему еще 800$? Суки!

    Такой аффтар пропадает, сцука.

    Без наибалова, риальная расписка

  28. sadek.drobi says:

    i vote for the c# way, for the reason of abstraction.

    but not to have a limitation from ignoring the sql way, and not to change the c# syntax for one library, the solution i c is to add an extention method (helper method) to the library that we use for linq,so we will have

    if(a.DbEquals(b)) will be the sql way, and the good news is that it aplies to both c# and vb

    and of course the implementation of that method will check for null 🙂

    i beleive that the solution of a problem of a library should be in the library itself 🙂 thanks to c#3 🙂

    sadek.drobi@gmail.com

  29. Jafar says:

    My vote goes for C# as well with the understanding the DLinq would recognize the pattern value != null && … and translate it into the SQL way.  Otherwise it would be an awfully leaky abstraction.  If SQL syntax is required DLinq doesn’t preclude writing T-SQL.

  30. Chris Trelawny-Ross says:

    SQL Syntax should _absolutely_ be followed:  as others have said: NULL means ‘unknown’ in SQL, and (unknown == unknown) yields unknown.

    C# is an inherently scalar language – we deal with single objects when we do comparisons (even when we’re iterating over collections, we’re doing repeated single-object comparisons) and we don’t have operators that correlate one collection with another (which is, after all, what LINQ is adding to the language).

    SQL, on the other hand, deals with vectors (sets) – and comparisons are done on the basis of attribute values. Comparisons are used to determine inclusion in, or exclusion from, a set – or correlation between elements of related sets.

    In C# a null value means "I haven’t allocated, or didn’t find, that object" (or variations of that theme, like ‘… an object with that value of that attribute’); in SQL, a null value is an indication of ‘unknown’ for an attribute – NOT – an indication of ‘not found’ for an object (where ‘object’ correlates with ‘element in the set’, or ‘row in the table’ in the SQL world).

    So null in C# is in reality a different creature from null in SQL.

    If I truly want the set of rows from table T1 where attribute T1.X is unknown, joined with the set of rows from T2 where T2.X is also unknown, then I should deliberately request it … and I’ll get a cartesian product of _all_ pairings of the relevant T1 rows (i.e. T1 rows where X is null) with the relevant T2 rows.

    Notice that that result is a far cry from a join of T1 to T2 where T1.X == T2.X and neither is null – I’ll get an inner join (i.e. the join you’d expect – T1 rows are paired only with T2 rows for which the X values match, yielding a result set with distinct elements for each distinct value of X).

    These two results – the inner join and the cartesian join – are significantly different.

    If the "null == null" collection of results is claimed to be ‘just another variant of "thisValue == thisValue"’ (i.e. null is as valid and as concrete a value as is ‘London’) then null is not really being used as null (unknown), but is being (incorrectly and dangerously) used as just another discrete value in the set of values for the attribute – and should be implemented as such (perhaps with an "Unknown City" value).

  31. Chris TR says:

    SQL Syntax should _absolutely_ be followed:  as others have said: NULL means ‘unknown’ in SQL, and (unknown == unknown) yields unknown.

    C# is an inherently scalar language – we deal with single objects when we do comparisons (even when we’re iterating over collections, we’re doing repeated single-object comparisons) and we don’t have operators that correlate one collection with another (which is, after all, what LINQ is adding to the language).

    SQL, on the other hand, deals with vectors (sets) – and comparisons are done on the basis of attribute values. Comparisons are used to determine inclusion in, or exclusion from, a set – or correlation between elements of related sets.

    In C# a null value means "I haven’t allocated, or didn’t find, that object" (or variations of that theme, like ‘… an object with that value of that attribute’); in SQL, a null value is an indication of ‘unknown’ for an attribute – NOT – an indication of ‘not found’ for an object (where ‘object’ correlates with ‘element in the set’, or ‘row in the table’ in the SQL world).

    So null in C# is in reality a different creature from null in SQL.

    If I truly want the set of rows from table T1 where attribute T1.X is unknown, joined with the set of rows from T2 where T2.X is also unknown, then I should deliberately request it … and I’ll get a cartesian product of _all_ pairings of the relevant T1 rows (i.e. T1 rows where X is null) with the relevant T2 rows.

    Notice that that result is a far cry from a join of T1 to T2 where T1.X == T2.X and neither is null – I’ll get an inner join (i.e. the join you’d expect – T1 rows are paired only with T2 rows for which the X values match, yielding a result set with distinct elements for each distinct value of X).

    These two results – the inner join and the cartesian join – are significantly different.

    If the "null == null" collection of results is claimed to be ‘just another variant of "thisValue == thisValue"’ (i.e. null is as valid and as concrete a value as is ‘London’) then null is not really being used as null (unknown), but is being (incorrectly and dangerously) used as just another discrete value in the set of values for the attribute – and should be implemented as such (perhaps with an "Unknown City" value).

  32. Simon says:

    I think when I write C#, as I do in DLinq, I expect C#.

    I admittedly had problems with the SQL rules, but found spots where  they shine as well as where they are just in the way.

    Whether unknown or not allocated, it is all about ‘no thing here’.

    IMHO problems arise from implicitly assuming "not equal is unequal", which is not applicable in face of the unknown.

    I would like to vote for a more diverse set of operators reflecting different expectations for null-behaviour.

    BTW, I don’t expect a rush of half-baked SQL-to-DLinq translations just for having it.

  33. This week you can watch the first in a series of videos featuring members of the Microsoft C# team. A video of Raj Pai, the Group Program Manager for the C# team, leads off the series.

  34. Wesner Moise says:

    At the conceptual level, where LINQ expression trees are at, we shouldn’t strive to match the precise behavior of programming language.

    Keep in mind, expression trees are shared among many different programming languages, which have different equality behavior from C#–VB for example. VB treats nulls differently from C#–more like SQL, yet emits the same expression tree for =.

    As for language consistency, I think that we have a precedent where = can behave differently depending on the operands, due to operator overloading. Value types also behave differently from reference types, and all of SQLs types are value types anyway.

    We should use data-dependent semantics. The behavior of = really depends on the meaning of nulls in the underlying data. Nulls are essentially overloaded. If we go against the underlying assumptions of the data, then most queries to the database involving nulls will require a complicated syntax.

    The default behavior would not be the desirable one, which could be a major problem if IQueryable was being used as pluggable processor. One can say that to use IQueryable as a pluggable processor, we need precise behavior, but I think we need behavior appropriate for the data, since nulls in C# and SQL really refers to 2 different concepts. With memory objects, null signifies non-existent values, whereas in SQL they refer to unknown values, so it would be more appropriate to use SQL.

    If precise behavior is necessary, having a separate function for DBEqual and EqualIncludingNulls for both usages is better.

  35. Evan says:

    I think you are asking the wrong questions.  The real question is do you want to support the expected PL/SQL behavior when using Oracle as the datastore?  What if I use db2 as the datastore? Do I get db2 style behavior in LINQ?  Don’t forget to give me c# style expressions when running queries against collections and arraylists in memory.

  36. mbu says:

    Well, SQLs way is usefull for handling large sets (try joining on a column for which half of the values are null and you will receive a VERY large result).

    So why not give the user best of both worlds? Let the == stay the C# way and implement (let’s say) =? be the new "is mathmatically equal" operator.

  37. mbu says:

    Well, should have read more of the comments.

    If there is no way to add the syntactic sugar (which it is not with respect to lambda expressions and translation to SQL!), you are stuck with:

    C#s way and doing things like

    where(a.x == b.x && a.x != null) which would translate to

    where ((a.x = b.x OR (a.x is null and b.x is null)) AND a.x is not null for which you may hope that the SQL processor is capable of reducing this back to a.x = b.x (to use some indexes!).

    or using SQLs way and setting up a complete breach of logic inside the language.

    So you just go ahead and choose, both choices are lousy.

  38. Bruno Guardia says:

    LINQ by its definition is language query, so it should have a behaviour consistent with the language… So I, as most comments, agree with C# behaviour. If it is "embedded SQL", even in behaviour, then LINQ loses its purpose.

    The case for SQL syntax, handling large sets, it is precisely some that LINQ hasn’t solved yet; so maybe the special sugar or different semantics should apply, once the large set/batch solution is on LINQ.

    Have you looked the possibilities of adapting language constructions like the ones available xBase (Clipper and dBase), to C# and VB syntax? That language was procedural, still very clear for database operations; it was not SQL, neither similar, but easily mapped the equivalent operations. Something like that would be a better match for keeping semantics without making LINQ more complex than needed.

  39. JD says:

    I think the best way is to write out some common examples and explain the behavior.

    I prefer C# semantics as a default (it is C#). That is because I imagine taking a given query (written in a function that essentially returns the expression tree) and reusing it for queries over whatever data source I happen to be using.

    One comment above got me thinking though, really how often do you want out data stores expecting no changes? From this point of view, you want the query to be as readable and efficient as possible against the store you are actually using.

    So again, if you can give examples of queries that would help the most. As much as I want the purity of the C# language, it is already going to be broken by the expression tree: < > == null all these operators may be different than C#. The underlying data types may be different than C# as well.

    So maybe portability of expression trees across stores is too high of a goal and unrealistic given the tuning that will likely be required due to differing performance characteristics anyway.

    When it comes down to it, if I have to write ugly code to do normal operations then I’m not going to like it. So what are the normal operations? Examples would help.

  40. steve says:

    I prefer the Criteria used in NHibernate with the NHibernate.Expressions

    This is not all new by the way, perhaps you should learn NHibernate and then you can be informed.  NHibernate provides options, which is good, rather than force everyone into the same query language, ie. Criteria, HQL, SQL, LINQ, etc…

    Just some advice

  41. Skup says:

    I truely prefere the C# way because I’m not writing SQL in my C# but C# queries. The == operator should not change it’s meaning when used in this context. But it’s true that the SQL = operator is usefull when joining tables, it would be interesting to add a new operator that would just be a shortcut (as a += b is for a = a + b) to get this kind of equality. The execution of the operator would be based on ==.

  42. damien morton says:

    I think you need to introduce ‘NULL’ to the c# language, and have it be completely different from ‘null’.

    Of course, then youd have to decide whether NULL == null.

Skip to main content