Editor’s Note: The following MVP Monday post is by SQL Server MVP Jeff Moden and is part of our special series on SQL Server 2012.
Hierarchies: Convert Adjacency List to Nested Sets
If you search the Internet for articles on hierarchical data in SQL Server, you’ll find that there are many types of hierarchies and hierarchical structures to store them in database. For most of us, a “Hierarchy” is usually something that looks like an Organizational Chart and is usually stored in one of three different basic hierarchical structures. Even at that level of “simplicity”, hierarchies and how to use them in T-SQL are still huge subjects that I could never hope to do justice to in just one relatively short article.
This article covers a lot of different and sometimes fairly advanced techniques in a very short time and I have to assume certain prior knowledge on your part. For example, I have to assume you know a bit about recursive CTEs (rCTEs), what a numbers or Tally table is and how it can be used to replace certain loops, and even a fair bit about the 3 hierarchical structures included in this article.
This article isn’t about how to use each of the three hierarchical structures. It’s really about how to convert an Adjacency List to Nested Sets.
With that thought in mind, we’re going to blast through a quick reminder of what three common basic types of hierarchies are and then go right into a very high speed method for converting an Adjacency List to the hierarchical heaven of Nested Sets.
The Adjacency List
This is probably the most common type of hierarchical structure there is in databases. Part of the reason for its popularity is that it’s easy for humans to relate to and maintain. This is also where we start with data for this article.
One picture is worth a thousand words. Here’s the organizational chart of the data that we’ll be working with.
Each box (known as a “Node”) contains several different pieces of information.
- Line 1 in each box contains the name of an employee.
- Line 2 contains just what it says, the EmployeeID which is also the “ChildID”.
- Line 3 contains the ManagerID which is also the “ParentID” for the given EmployeeID.
- Line 4 contains 3 items. From left to right, they are the “LeftBower, a Sales amount, and the “RightBower”.
The “LeftBower” and “RightBower” are the “coordinates”, “addresses”, or “anchors” for the Nested Set hierarchy structure. I just couldn’t bring myself to calling them “lft” and “rgt” like many people do. The term “Bower”, in this case, is a nautical term. On large ships, the two large anchors on the bow are called “Bowers” and seem to be quite appropriate here.
The Example Table
Just to keep everyone safe during experiments, we’ll create the example table in TempDB. This table will contain the data from the organizational chart above stored as an Adjacency List..
What makes this an “Adjacency List” is the fact that the EmployeeID (ChildID) and the ManagerID (ParentID) for any given employee are both in the same row or “adjacent” to each other.
Here’s the code to build the example table in TempDB. It doesn’t contain all of the possible constraints or other checks you might want to put on such a table because I wanted to keep it simple for this article. It also doesn’t contain anything to prevent a “cycle” (interpret that as “never ending cycle”). It does contain some fairly useful indexes, though.
The Example Data
Here’s the code to populate the above Adjacency List table using the same data that we have represented in the organizational chart from the beginning of this section.
The greatest advantage to Adjacency Lists is that they’re both easy and intuitive for humans to understand and maintain. For example, to move “Bob” and his entire sub-tree of 5 other people to all fall under Marge, all that needs to be done is to change Bob’s ManagerID from 1 to 40 and we’re done. To replace Bob, all that we need to do is change the Name and EmployeeID in Bob’s node to the Name and EmployeeID of the new person and change the ManagerId of the direct-reports to the new person. Adding new employees in new positions is usually just as easy or easier.
The greatest disadvantage to using Adjacency Lists is that traversing or “expanding the hierarchy” (as it is also known) is comparatively very slow because it requires some form of looping or recursion to do the traversals. That’s seems fine for the normal single employee sub-tree lookups that a GUI might do but can become quite painful and slow when aggregates and batch runs are involved. Even without aggregations, when many connections are each expanding their own hierarchies, the server can become quite busy in very short order because of all the looping and I/O required.
To summarize, the Adjacency List is great for maintenance but mostly horrible to use both from a code standpoint and a server impact standpoint. In contrast, other hierarchical structures are a lot faster and easier to use but are much more difficult for humans to understand for maintenance. That’s onme of the reasons why converting Adjacency Lists to Nested Sets in as high a performance manner as possible is a common request.
Also widely known as a “Materialized Path”, the Hierarchical Path type of structure is less common than the Adjacency List but is gaining some favor because the HierarchyID data type (available as of SQL Server 2008) can be used to support this type of hierarchical structure. Coverage of how to use the HierarchyID is beyond the scope of this article.
There are actually 2 different types of Hierarchical Path structures. One stores the actual information, such as the EmployeeID, and the other stores positions for each relative to the root (which the HierarchyID data-type supports). We’ll use the type that stores the actual information which again means that we won’t be covering the HierarchyID data-type.
To be brief, the Hierarchical Path isn’t much more than the concatenation of all of the ancestors for a given node. Using a marked up version of the previous organizational chart, here’s what the Hierarchical Path would be for each node. Remember, this path example consists of EmployeeIDs (in Blue in the graphic below).
The next code block below is the code to convert the adjacency list to a Hierarchical Path using a recursive CTE that processes one level of the hierarchy for each iteration. In preparation for what’s coming up, I store the Hierarchical Path as the concatenation of the EmployeeIDs converted to 4 bytes (like an Integer) as a binary “string”. Notice that we not only build the Hierarchical Path, but we calculate the level, we number the nodes in the correct sorted order, and we also create some “place holders” for when we do the conversion to Nested Sets. As you can see by the timings listed in the header, this code is quite fast (although an equivalent While Loop will actually beat it).
For simplicity in typing, I’ve simply called the Hierarchical Path the “SortPath” in this example because sorting on the path will, indeed, list the nodes in the correct expected order.
The big advantage of the Hierarchical Path structure is that you can sort by the path and have it come out in an order that you would reasonably expect of a hierarchy without much work at all. Another advantage is that you can avoid the use loops and recursion to do traversals using a bit of clever code but it’s still slower than a Nested Set.
The big disadvantage is that Hierarchical Path structures are not so easy or intuitive for humans to maintain.
If you don’t already know, many consider Nested Sets to be the holy grail of hierarchical structures because they’re so very fast during usage.
Quick Review, How Nested Sets Work
Take a look at the following.
If you look at “Bob”, we can see that we can find everyone in Bob’s down line simply by finding everyone who has left and right bowers that are between Bob’s. You can just imagine how fast all of this can be especially when the Bowers are properly indexed. Because it’s such straight forward code, you can also see how easily you can do aggregates and the like and all in a 100% super high speed, set based fashion.
Here’s the code to do a simple “lookup” of Bob’s down line.
In a similar fashion, it’s also very easy to calculate “up lines”. Here’s the code to do a simple “lookup” of Megan’s up line. Notice that the only real code changes are the direction the comparison operators are pointed in. Everything else is the same as the down line code.
Here’s the graphic for the code above. Notice that we can find someone’s up line simply by finding everyone who has a Left Bower that’s less than Megan’s Left Bower and who also has a Right Bower that’s greater than Megan’s Right Bower.
Since the Bowers can be queried in a set based fashion on indexed columns, there’s no much that can beat Nested Sets for performance during usage and that’s their single big advantage.
They can be, however, horribly slow to create from an Adjacency List especially when using the traditional “push stack” method of creation which basically looks like the following. Note that each Red arrow is a “push” or a “pop” which involves at least one select and at least one insert or delete in a highly RBAR1 fashion.
To avoid such agony, many people will use some extraordinary code to do sub-tree moves and other changes. The code to replace a single employee in a single position is actually easier than it is in an Adjacency list. The code to move a sub-tree requires some pretty fancy code to double the value of some Bowers and cut others in half except for the first node where you must also either add or subtract 1 to the top node of the sub-tree being moved (whew!). Although computers certainly have no problem with all of that, if anything ever goes haywire, humans are going to have one heck of a time figuring out how to straighten the mess out.
Again, it’s much easier for humans to realize, understand, and maintain Adjacency Lists but they really want the speed of usage that Nested Sets provide. The key to it all is to have some method of quickly converting an Adjacency List to Nested Sets.
Converting the Adjacency List to Nested Sets
There are a lot of different methods out there for converting an Adjacency List to Nested Sets… and most of them involve a “push stack” or other form of RBAR1 which makes most of the them much too slow to be practical to use on a regular basis.
Itzik Ben-Gan came up with a brilliant method for doing this type of conversion and you can find it in his book “Inside Microsoft SQL Server 2008: T-SQL Programming” (Microsoft Press). To summarize his method, he doubles the number of rows and then numbers them in a particular fashion.
I wanted to take a different approach. I thought there might be some relatively easy way to number the Bowers of Nested Sets and was fortunate enough to not even have to think about what those formulas might be. I ran into a great article by Adam Machanic.
The formulas he came up with were absolutely brilliant in their simplicity. There was just one catch. One of the formulas needs the number of nodes in the down line for each node for it to work correctly. Adam’s code did that mathematically fine but it also took more than 16 minutes on a pretty good laptop just to convert 20,000 rows because of an accidental Cartesian product in the code. Itzik’s code would convert a million node hierarchy in just a couple of minutes. I was darned sure that Adam’s formulas would do the trick but we had the “Catch 22” thing going here. We needed a hierarchical structure to quickly calculate the down line count of nodes so that we could make a hierarchical structure to count the down line count of nodes (the Nested Sets).
Quickly Counting Down line Nodes
It turns out that we actually have a major part of the solution in the code that we have so far. Let’s take a look at it again.
Yeah… I know. It took me a while to see it, as well. The key is in the SortPath column itself. The number of people in the down line of each node is easily determined by the SortPath because that EmployeeID will appear that many times in the SortPath column. Take a look at the final table we’re going to end up with to see what I mean.
Look at the Red squares first. That’s EmployeeID #1 (Jim) and he’s at the very top of the 14 node hierarchy. His node count is “14” and, if you look at the “SortPath”, he appears in the same relative position (happens to match the Level, BTW) of all the SortPaths 14 times including himself.
Look at Bob. Bob is EmployeeID #3 and I’ve marked his information with Blue squares. If you’ll recall, Bob had a total of 5 people in his sub-tree and, if you include Bob, Bob has a total node count of 6. Looking at the SortPath, Bob appears in the SortPath (don’t forget this is a binary sort path) a total of 6 times which is the same as his node count.
Splitting Out the Down line Node Count
How are we going to split out that information? Enter my friend and the proverbial Swiss Army Knife of T-SQL, the Tally Table.
The Tally Table (and certain non-recursive CTEs that mimic it) can be used to very effectively replace certain While Loops for things like splitting sort paths. For more information on how to build a Tally Table and how it works, please see the following article.
Once we have a Tally Table built, we can use it to simultaneously split and count the number of times each EmployeeID appears in the SortPath.
If you look at the SortPath in the previous figure, the SortPath is made up of 4 bytes per Level from left to right. The first level starts at byte #1 and goes for 4 bytes. The second level starts at byte #5 and goes for 4 bytes. Each Level progresses in a similar manner.
We can have the Tally Table “count” through the Levels. The “starting position” of each Level in the SortPath can be calculated as (t.N*4)-3 where “t.N” is the number from the Tally Table and is equal to the value of each Level number. Of course, the length for each Level will always be “4”.
Once we have the split of the EmployeeIDs done, all that’s left is to count the number of times each appears. The following code will do that bit of Tally Table magic for us.
All that’s left now is to apply Adam’s formulas and we’ll have our Nested Sets. Using the “node count” code from above, here’s the code to convert the binary coded Hierarchical Path into Nested Sets.
Like I said, Adam’s formulas are absolutely brilliant in their simplicity.
The Left Bower is a simple calculation which makes the realization that each node (Numbered as “NodeNumber” in order by the SortOrder column) has two Bowers and you need to subtract the number of incomplete nodes in the upline (including the current node) for the node. That’s easily done simply by subtracting the Level of the current node.
The Right Bower calculation is a bit similar. It starts off with the downline node count not including the current node (that’s where the -1 comes in), mulitplies that by 2 because each node has 2 Bowers, adds all of that to the Left Bower and adds 1 because it’s a new Bower not previously accounted for.
All we needed to get the speed out of all of this was to use the Tally Table to get the downline node counts already available in the SortPath in a high performance manner.
Wrapping It All Up
We covered an awful lot of information in a very short period of time. Let’s quickly review what we got for our efforts.
What we end up with is the best of all 3 worlds. Here’s the final table, again.
We end up with our original Adjacency List (EmployeeID and ManagerID columns) which is easy for us mere humans to understand and maintain.
We end up with our Hierarchical Path information (HLevel, NodeNumber, NodeCount, and SortPath columns) which allowed us to do many things including sort the final output
Last but not least, we also ended up with the LeftBower and RightBower columns of the Nested Sets hierarchical structure. Now we have ease of maintenance and super high performance capabilities for hierarches and we can do it all on a million rows in about a minute and a half and in about 6 seconds on 100,000 rows. That’s fast enough to make it worth simply recalculating everything with some remarkably simple code instead of trying to maintain Nested Sets.
To make life a little easier, here’s all the code we used to do the conversion in one place.
At this point, the Left Bower provides the same sortability features as the NodeNumber and SortPath columns. You could delete them but why bother?
A Real Teaser
Here’s a real teaser for you… if you really think about how we built the NodeCount column to build the Bowers, I’m sure that you can see another article’s worth of information on how to make it so we might not even need Nested Sets for speed, anymore. We could preaggregate everything we need to know about any hierarchy and store it all in our own version of what a hierarchical datamart could look like. Like I said… future article. 😉
Thanks for listening folks.
RBAR1 is pronounced “ree-bar” like the steel rods stuck in cement forever and is a “Modenism” for “Row By Agonizing Row”. It has come to mean some “slow code” generally attributed to various forms of procedural programming when compared to other methods.
About Jeff Moden
SQL Server MVP Jeff Moden is mostly self taught and is a long time member of and one of the leading posters on SQLServerCentral.com. His articles have made “advanced” concepts such as the use and understanding of the Numbers or Tally Table more like child’s play. He coined the acronym of “RBAR” which is pronounced “ree-bar” and is a “Modenism” for Row-By-Agonizing-Row. It has come be known world round as an easy way to label performance challenged code. He presented at PASS 2010, a couple of SQLSaturdays, and at several local and remote PASS Chapters and User Groups. His passion for the art of SQL and teaching is apparent in his articles and his lectures. Jeff also won the Red Gate sponsored Worldwide Exceptional DBA Award for 2011 and continues to serve the SQL Server Community on a daily basis.
About MVP Mondays
The MVP Monday Series is created by Melissa Travers. In this series we work to provide readers with a guest post from an MVP every Monday. Melissa is a Community Program Manager for Dynamics, Excel, Office 365, Platforms and SharePoint in the United States. She has been working with MVPs since her early days as Microsoft Exchange Support Engineer when MVPs would answer all the questions in the old newsgroups before she could get to them.