Introduction to Joins

Joins are one of the most important operations performed by a relational database system.  An RDBMS uses joins to match rows from one table with rows from another table.  For example, we can use joins to match sales with customers or books with authors.  Without joins, we might have a list of sales and customers or books and authors, but we would have no way to determine which customers bought which items or which authors wrote which books.

We can join two tables explicitly by writing a query that lists both tables in the FROM clause.  We can also join two tables by using a variety of different sub-queries.  Finally, SQL Server may introduce joins for a variety of purposes into a query plan during optimization.

This is the first of several posts that I am planning for joins.  In this post, I’m going to start by introducing the logical join operators that SQL Server supports.  These are:

  • Inner join
  • Outer join
  • Cross join
  • Cross apply
  • Semi-join
  • Anti-semi-join

Here is a simple schema and data set that I will use to illustrate each join type:

create table Customers (Cust_Id int, Cust_Name varchar(10))

insert Customers values (1, 'Craig')

insert Customers values (2, 'John Doe')

insert Customers values (3, 'Jane Doe')

create table Sales (Cust_Id int, Item varchar(10))

insert Sales values (2, 'Camera')

insert Sales values (3, 'Computer')

insert Sales values (3, 'Monitor')

insert Sales values (4, 'Printer')

Inner joins

Inner joins are the most common join type.  An inner join simply looks for two rows that put together satisfy a join predicate.  For example, this query uses the join predicate “S.Cust_Id = C.Cust_Id” to find all Sales and Customer rows with the same Cust_Id:

select *

from Sales S inner join Customers C

on S.Cust_Id = C.Cust_Id

 

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 2 John Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

Notes:

  • Cust_Id 3 bought two items so this customer row appears twice in the result.
  • Cust_Id 1 did not purchase anything and so does not appear in the result.
  • We sold a ‘Printer’ to Cust_Id 4.  There is no such customer so this sale does not appear in the result.

Inner joins are fully commutative.  “A inner join B” and “B inner join A” are equivalent.

Outer joins

Suppose that we would like to see a list of all sales; even those that do not have a matching customer.  We can write this query using an outer join.  An outer join preserves all rows in one or both of the input tables even if we cannot find a matching row per the join predicate.  For example:

select *

from Sales S left outer join Customers C

on S.Cust_Id = C.Cust_Id

 

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 2 John Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

4 Printer NULL NULL

Note that the server returns NULLs for the customer data associated with the ‘Printer’ sale since there is no matching customer.  We refer to this row as “NULL extended.”

Using a full outer join, we can find all customers regardless of whether they purchased anything and all sales regardless of whether they have a valid customer:

select *

from Sales S full outer join Customers C

on S.Cust_Id = C.Cust_Id

 

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 2 John Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

4 Printer NULL NULL

NULL NULL 1 Craig

The following table shows which rows will be preserved or NULL extended for each outer join variation:

Join

Preserve …

A left outer join B

all A rows

A right outer join B

all B rows

A full outer join B

all A and B rows

Full outer joins are commutative.  In addition, “A left outer join B” and “B right outer join A” are equivalent.

Cross joins

A cross join performs a full Cartesian product of two tables.  That is, it matches every row of one table with every row of another table.  You cannot specify a join predicate for a cross join using the ON clause though you can use a WHERE clause to achieve essentially the same result as an inner join.

Cross joins are fairly uncommon.  Two large tables should never be cross joined as this will result in a very expensive operation and a very large result set.

select *

from Sales S cross join Customers C

 

Cust_Id Item Cust_Id Cust_Name

----------- ---------- ----------- ----------

2 Camera 1 Craig

3 Computer 1 Craig

3 Monitor 1 Craig

4 Printer 1 Craig

2 Camera 2 John Doe

3 Computer 2 John Doe

3 Monitor 2 John Doe

4 Printer 2 John Doe

2 Camera 3 Jane Doe

3 Computer 3 Jane Doe

3 Monitor 3 Jane Doe

4 Printer 3 Jane Doe

Cross apply

We introduced cross apply in SQL Server 2005 to enable joins with a table valued function (TVF) where the TVF has a parameter that changes for each execution.  For example, the following query returns the same result as the above inner join using a TVF and cross apply:

create function dbo.fn_Sales(@Cust_Id int)

returns @Sales table (Item varchar(10))

as

begin

  insert @Sales select Item from Sales where Cust_Id = @Cust_Id

  return

end

 

select *

from Customers cross apply dbo.fn_Sales(Cust_Id)

 

Cust_Id Cust_Name Item

----------- ---------- ----------

2 John Doe Camera

3 Jane Doe Computer

3 Jane Doe Monitor

We can also use outer apply to find all Customers regardless of whether they purchased anything.  This is similar to an outer join.

select *

from Customers outer apply dbo.fn_Sales(Cust_Id)

 

Cust_Id Cust_Name Item

----------- ---------- ----------

1 Craig NULL

2 John Doe Camera

3 Jane Doe Computer

3 Jane Doe Monitor

Semi-join and Anti-semi-join

A semi-join returns rows from one table that would join with another table without performing a complete join.  An anti-semi-join returns rows from one table that would not join with another table; these are the rows that would be NULL extended if we performed an outer join.

Unlike the other join operators, there is no explicit syntax to write “semi-join,” but SQL Server uses semi-joins in a variety of circumstances.  For example, we may use a semi-join to evaluate an EXISTS sub-query:

select *

from Customers C

where exists (

    select *

    from Sales S

    where S.Cust_Id = C.Cust_Id

)

 

Cust_Id Cust_Name

----------- ----------

2 John Doe

3 Jane Doe

Unlike the previous examples, the semi-join only returns each customer one time.

The query plan shows that SQL Server indeed uses a semi-join:

  |--Nested Loops(Left Semi Join, WHERE:([S].[Cust_Id]=[C].[Cust_Id]))
|--Table Scan(OBJECT:([Customers] AS [C]))
|--Table Scan(OBJECT:([Sales] AS [S]))

There are left and right semi-joins.  A left semi-join returns rows from the left (first) input that match rows from the right (second) input while a right semi-join returns rows from the right input that match rows from the left input.

We might similarly use an anti-semi-join to evaluate a NOT EXISTS sub-query.

Miscellaneous notes

In all of the above examples, I used a join predicate that compares whether two columns, one from each table, are equal.  This type of join predicate is called an “equijoin.”  Other join predicates (such as inequalities) are possible, but equijoins are especially common and SQL Server has many more alternatives when optimizing equijoins than when optimizing joins with more complex predicates.

SQL Server has more flexibility over join order and algorithms when optimizing inner joins than when optimizing outer joins and cross applies.  Thus, given two queries that differ only in that one strictly uses inner joins while the other uses outer joins and/or cross applies, SQL Server may be able to find a better plan for the query that uses inner joins only.

Next time …

In my next post, I’ll continue with joins by introducing the physical join operators (or algorithms) that SQL Server uses to implement these logical join operators.