Database Design – Many to Many Join Tables


I was asked why my previous post suggested to “not use an Identity Column as a Key for Many to Many Join Tables”. This post outlines some of the pitfalls I’ve seen.

UPDATED: 17 May 09: To clarify a question about how to do this with higher normal forms <see Blue section at end>. I thought it may be worth sharing.

 

What are we talking about?

Using a Normalised design approach to your database schema, you will find that Many to Many Relationships between 2 objects are commonly expressed as another table, the Join Table. eg: Consider a Web site which allows you to customise your experience by setting your preference. eg: Smoking/Non-Smoking, Aisle/Window Seat, Rock/Classical/Jazz/Rap, Gay/Straight/Both/Neither. You may create the following tables.

Table 1: Customers (usual Login & demographic fields)

Table 2: Preferences (Pref_ID & Pref_Label)

Join Table: CustomerPreference (CustomerKey, PreferenceKey, & maybe other details about the relationship)

So the question is how could the Join Table be defined & what are the trade-off’s?

Design Alternative 1: Use the Primary Keys of the MANY tables as a Composite Key.

Sample code: Create the Primary Key as a composite key containing the 2 keys of the MANY tables & maybe other info about the relationship this row describes.

-- Define Table with the PRIMARY KEY as the Keys of the two MANY tables
CREATE TABLE [dbo].[CustPref](
    [CustID] [int] NOT NULL,
    [PrefID] [int] NOT NULL,
    [Details] [varchar](50) NULL,
 CONSTRAINT [PK_CustPref] PRIMARY KEY CLUSTERED ([CustID],[PrefID])
) 
GO

-- Define Foreign Keys to the two "MANY" tables.
ALTER TABLE [dbo].[CustPref]  WITH CHECK ADD CONSTRAINT [FK_CustPref_Customer] FOREIGN KEY([CustID])
REFERENCES [dbo].[Customer] ([CustID])
GO

ALTER TABLE [dbo].[CustPref] CHECK CONSTRAINT [FK_CustPref_Customer]
GO

ALTER TABLE [dbo].[CustPref]  WITH CHECK ADD CONSTRAINT [FK_CustPref_Preference] FOREIGN KEY([PrefID])
REFERENCES [dbo].[Preference] ([PrefID])
GO

ALTER TABLE [dbo].[CustPref] CHECK CONSTRAINT [FK_CustPref_Preference]
GO

Design Alternative 2: Give it its own Primary Key. (most likely machine generated)

Sample code: Create Key for the Join Table perhaps Identity Field, the 2 Primary Key of the MANY tables & maybe other info about the relationship this row describes. Optionally put a Unique Constraint on the two MANY keys.

-- Define Table with its own PRIMARY KEY.
CREATE TABLE [dbo].[CustPrefWithID](
    [CustPrefID] [int] PRIMARY KEY CLUSTERED IDENTITY(1,1) NOT NULL,
    [CustID] [int] NOT NULL,
    [PrefID] [int] NOT NULL,
    [Details] [varchar](50) NULL,
 CONSTRAINT [IX_CustPrefWithID] UNIQUE NONCLUSTERED ([CustID],[PrefID])
)
GO

-- Define Foreign Keys to the two "MANY" tables.
ALTER TABLE [dbo].[CustPrefWithID]  WITH CHECK ADD  CONSTRAINT [FK_CustPrefWithID_Customer] FOREIGN KEY([CustID])
REFERENCES [dbo].[Customer] ([CustID])
GO

ALTER TABLE [dbo].[CustPrefWithID] CHECK CONSTRAINT [FK_CustPrefWithID_Customer]
GO

ALTER TABLE [dbo].[CustPrefWithID]  WITH CHECK ADD  CONSTRAINT [FK_CustPrefWithID_Preference] FOREIGN KEY([PrefID])
REFERENCES [dbo].[Preference] ([PrefID])
GO

ALTER TABLE [dbo].[CustPrefWithID] CHECK CONSTRAINT [FK_CustPrefWithID_Preference]
GO

 

The challenges I see with Option 2 are:-

  1. You need an extra INT column. 
    Worst case you’ve just increased the row length by 50%. Which increases the number of pages in the table by ~50%. This increases the table scan reads by ~50% etc. Clearly if you needed other columns to describe this M:M relationship the % increase would be less significant. Join tables are often heavily read, so maximising the number of rows in each page can reduce IO.
  2. If you use the GUI designers, it will make the index on the PK, clustered. It is highly likely that you will get better perf if you cluster on one of the Foreign Keys, in this example the CustID.
  3. You need an extra Index, to hold the Unique Constraint. That causes extra maintenance, disk storage & Operations staff may need to defrag it.
  4. Worst still. I commonly see databases where they’ve forgotten to add the Unique Constraint. Because of this they never notice the error when they add the same combination multiple times. This usually means they don’t delete old or redundant relationships. I once saw a 2GB Table that only contained ~400K of Unique rows, the rest were duplicates & other garbage they didn’t need (LastUpdatedBy = “AnonymousWebUser”). This extra baggage had a massive adverse Perf impact.

The arguments for this approach are:-

  • Everything can be joined with everything else & so eventually you have these really long composite keys.

For Example: An Education system has Tables for Course, Location, Teacher, Student.

A Class             is a M:M join of a Course, Location, Teacher & Time. ie: Many Year 10 Maths Classes at same School in 2nd Period.

An Enrolment   is a M:M join of Student & Class.

An Assignment if a few students collaborate on a project it could be a M:M join of a few enrolments & a time. (need to ensure that only students enrolled in the course can submit work)

Clearly the composite key approach can become unwieldy, so at some point you might find it worthwhile creating a Unique Key for a M:M table. But don’t overlook the advantage to carrying all these FK’s around in your design. It gives the optimiser many more alternatives & can save a huge number joins. ie: If I want to create a report listing all the assignments grouped by Teacher, I could just join the Teacher & Assignments table, saving the 4 Table Join; Teacher, Class, Enrolment, Assignment.

 

Last thoughts

I realise that design is personal & not everyone agrees with my opinions above. I’d love to hear your experience; either for, against or different to the points raised in this post

Hope this helps

Dave

= = = = = = = = < UPDATE 17 May 09 > = = = = = = = =

The QUESTION: Many to Many conundrum – Using DRI to enforce complex constraints / business relationships

Consider this:

Table A has 2 child tables ( B & C )

The PK of B and C is an identity, and they both have a FK back to Table A.

I have a table D, which is a join table on B and C

Table D has PK of an identity and FK’s back to B and C.

Here’s some example Data

Table A:  Two rows …. Org1 and Org2

Table B:  Two rows …. 1 linked to Org 1 and 1 linked to Org 2  ( PK ids are 44(Org1) and 45(Org2) )

Table C:  Two rows …. 1 linked to Org 1 and 1 linked to Org 2  ( PK ids are 106(Org1) and 107(Org2) )

I can add a row to Table D which has an ‘B’ id of "44" and a ‘C’ id of "107"

This is invalid … the B id in Table D refers back to Org1 and the C id in Table D refers back to Org2

If I use composite primary keys I can stop this happening, but I cannot see how to stop this if I use identities …

Alan

– – – –

I’m not clear on what you are saying.  I think that you agreeing with me, in that creating an Identity col in Table D is a bad idea, Correct?

(Actually is not a bad idea, but by itself, it is adding overhead & leaving you exposed to data corruption issues.)

Below is a diagram of what I think you described. And a way to use Declarative Referential Integrity (DRI) constraints to enforce the correct relationship with Table A. These sorts of problems are why 4th & 5th Normal Form were introduced.

clip_image002

Dave

– – – –

Thank you very much for your reply, that’s cleared up loads for me.

Obviously, you could still have an identity on table D, and then enforce a unique constraint on the the other 3 fields, but you wouldn’t be gaining much by this.

Alan

– – – –

True there is no one correct way (there are plenty of incorrect ways or those that will yield potential for bad data)

  1. If your data model has some kind of sub-domain ie: Product Design vs Sales OR (Courses & Classes) vs (Students & Enrolments). Then sometimes having an Identity key on TableD, means you don’t have to carry the 3 cols with all their constraints through all the other tables in your data model.

    Eg: In an educational institution: TableA = Classes, TableB = Teachers who teach that Class, TableC = Venues suitable for that class (think Science Labs vs PE/Sport). TableD could be a Class. (An instance of that course/teacher/venue & date). Giving TableD an uniqueID (ie: Class ID), makes it simpler with the next set of tables (Students, Class Enrolments, Assignments, Payments, Exam Results etc) to have just the one FK to TableD.

    Clearly you’d have to weigh up the benefits (eg: Less overhead in FK lookups to enforce the DRI, narrower rows) vs disadvantage (eg: need join thru TableD if you want to Group or filter on TableA,B or C)

  2. I’m unclear if you will achieve much with a unique key over the 3 cols unless you also have the 2 FK constraints. But that really depends on your use of data. Good idea if you’ve designed it well with strong normalisation. But if you had other FK’s (ie a  Time column) it may be OK to have duplicates, as the uniqueness is in the extra columns.  Eg: This car drove over this bridge & the time, is sufficient to do billing. But I agree I’d prefer the unique constraint on all 4 cols.

Thanks for the feedback

Dave


Comments (4)

  1. Some additional thoughts:

    1) An (additional) artificial key in an association table is only reasonable if it is meaningful to reference an individual entry. In your Customer/Preferences sample for me a case is hard to imagine (though not impossible). The "meaning" here lies on the LIST of preferences, not on individual entries.

    2) Not having an additional artificial key retains the "content addressing" idea. If you look on the data in a table referencing the association table you directly see the codes of the associated tables  (in expression the referenced rows) without need to access the association table.

    3) A sample where referencing the individual association could be an order system where the "order item" table can be seen as a many to many relation between orders and items. If you consider substituting items on delivery on the delivery note you might want to reference the specific order line not the order#/item combination.

    4) An artifical key allows content independent referencing. If not the actual values in the association but the association itself is meaningful (e.g. the "line") the artificial key allows changing the content of the references association without updating all references.

    If the key of the association is often used as foreign key this is a hint that the association is meaningful on its own.

    5) The decision on using an artificial key should be done more on the relationships in the datamodel not so much on join considerations.

    6) Composite keys with five or more elements become unhandy. This excludes elements used to partitition the database (e.g. in hosted solutions where the same database and instance is used for multiple clients).

    7) I fully aggree on "I realise that design is personal …"

    Thanks for your inspiring post.

  2. dysos says:

    I just missed one:

    8) Not having an artificial key allows fine tuning of referential integrity: If you maintain a list of items on stock with e.g. item# and color you would not want to allow to change the color while it is referenced in an order line. If it is not referenced then it would be fine to allow the user to change it. Without the artificial key this can be implemented by referential integrity. With an artificial key you would need to implement this constraint yourself.

  3. Justin A says:

    Another great post, David 🙂

    I’ve been following the second option: an extra INT column that is the IDENTITY PK. Looks like I’m off to re-evaluate my M:M relationships 😛

    I’m glad you’ve compared both of them – it’s very helpful indead 🙂

    I wonder how many other people don’t know about this? To me, this should be basic stuff yet I wonder if this has slipped through the gaps .. ??

    cheer David!

    /me tries to think of some more design questions 🙂

  4. davele says:

    Thanks for the value add, you highlight that there is a lot more we could expand on.

    I agree with you. There is a fine line between an Entity Table & a Relationship (Join) Table. In my Education example above I expect that "Assignments" are the nouns in my business & would’ve been given their own ID. But doing so then destroyed my example.

    In fact the Education example I choose is a pretty poor example. Class is also likely to be an Entity in its own right with a number of foreign keys to the Course & Teacher tables.

     So if I had to give a simple rule it would be this. Use the Entity-Relationship-Attribute (ERA) approach. Look for all the entities relevant to your task. THen look at for the Relationships. Where you have a M:M relationship & not much else. Those are the tables you will invent that don’t need their own key. (In the early ’90’s I used to give a 2 day course on DB Design, so challenging to distill that level of analysis into a few rule that work in all circumstances.