CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
AS
(
SELECT
@Start AS theStart,
IntersectionId_End AS theEnd
FROM dbo.StreetSegments
WHERE
IntersectionId_Start = @Start
UNION ALL
SELECT
p.theEnd,
ss.IntersectionId_End
FROM Paths p
JOIN dbo.StreetSegments ss ON ss.IntersectionId_Start = p.theEnd
WHERE p.theEnd <> @End
)
SELECT *
FROM Paths;
GO
The anchor part of the CTE finds all nodes to which the starting intersection is connected—in this
case, given the data we’ve already input, there is only one. The recursive part uses the anchor’s output as
its input, finding all connected nodes from there, and continuing only if the endpoint of the next
intersection is not equal to the end intersection. The output for this query is as follows:
theStart theEnd
1
2
2 3
3 4
Note that although intersections E and G have been created, their corresponding north/south
segments have not yet been inserted. This is on purpose, as I’m going to use those segments to illustrate
yet another complication. Figure 12-6. A slightly more complete version of the street map
Once the new data is inserted, we can try the same CTE as before, this time traveling from Madison
and 1st Avenue to Lexington and 1st Avenue. To change the destination, modify the DECLARE statement
that assigns the @Start and @End variables to be as follows:
DECLARE
@Start int = dbo.GetIntersectionId('Madison', '1st Ave'),
@End int = dbo.GetIntersectionId('Lexington', '1st Ave');
Having made these changes, the output of the CTE query is now as follows:
382
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
theStart theEnd
1 2
2 3
2 6
6 5
3 4
4 8
8 7
7 6
6 5
There are now two paths from the starting point to the ending point, but it’s impossible to tell what
they are; the intersections involved in each path are mixed up in the output.
To solve this problem, the CTE will have to “remember” on each iteration where it’s been on
previous iterations. Since each iteration of a CTE can only access the data from the previous iteration—
AS varchar(255)
)
FROM Paths p
JOIN dbo.StreetSegments ss ON ss.IntersectionId_Start = p.theEnd
WHERE p.theEnd <> @End
)
SELECT *
FROM Paths;
GO
This code will start to form a list of visited nodes. If node A (IntersectionId 1) is specified as the
start point, the output for this column for the anchor member will be /1/2/, since node B
(IntersectionId 2) is the only node that participates in a street segment starting at node A.
As new nodes are visited, their IDs will be appended to the list, producing a “breadcrumb” trail of all
visited nodes. Note that the columns in both the anchor and recursive members are CAST to make sure
their data types are identical. This is required because the varchar size changes due to concatenation,
and all columns exposed by the anchor and recursive members must have identical types. The output of
the CTE after making these modifications is as follows:
theStart theEnd thePath
1 2 /1/2/
2 3 /1/2/3/
2 6 /1/2/6/
6 5 /1/2/6/5/
3 4 /1/2/3/4/
4 8 /1/2/3/4/8/
8 7 /1/2/3/4/8/7/
7 6 /1/2/3/4/8/7/6/
6 5 /1/2/3/4/8/7/6/5/
The output now includes the complete paths to the endpoints, but it still includes all subpaths
visited along the way. To finish, add the following to the outermost query:
WHERE theEnd = @End
Msg 530, Level 16, State 1, Line 9
The statement terminated.
The maximum recursion 100 has been exhausted before statement completion.
The issue is that these new intersections create cycles in the graph. The problem can be seen to start
at the fourth line of the output, when the recursion first visits node G (IntersectionId 7). From there,
one can go one of two ways: west to node F (IntersectionId 6) or north to node C (IntersectionId 3).
Following the first route, the recursion eventually completes. But following the second route, the
recursion will keep coming back to node G again and again, following the same two branches.
Eventually, the default recursive limit of 100 is reached and execution ends with an error. Note that this
default limit can be overridden using the OPTION (MAXRECURSION N) query hint, where N is the maximum
recursive depth you’d like to use. In this case, 100 is a good limit because it quickly tells us that there is a
major problem!
Fixing this issue, luckily, is quite simple: check the path to find out whether the next node has
already been visited, and if so, do not visit it again. Since the path is a string, this can be accomplished
using a LIKE predicate by adding the following argument to the recursive member’s WHERE clause:
AND p.thePath NOT LIKE '%/' + CONVERT(varchar, ss.IntersectionId_End) + '/%'
This predicate checks to make sure that the ending IntersectionId, delimited by / on both sides, does
not yet appear in the path—in other words, has not yet been visited. This will make it impossible for the
recursion to fall into a cycle.
Running the CTE after adding this fix eliminates the cycle issue. The full code for the fixed CTE
follows:
DECLARE
@Start int = dbo.GetIntersectionId('Madison', '1st Ave'),
@End int = dbo.GetIntersectionId('Lexington', '1st Ave');
WITH Paths
AS
(
SELECT
@Start AS theStart,
consider slightly different issues when modeling them.
Advanced routing
The example shown in this section is highly simplified, and it is designed to teach the basics of querying
graphs rather than serve as a complete routing solution. I have had the pleasure of working fairly
extensively with a production system designed to traverse actual street routes and will briefly share some
of the insights I have gained in case you are interested in these kinds of problems.
The first issue with the solution shown here is that of scalability. A big city has tens of thousands of street
segments, and determining a route from one end of the city to another using this method will create a
combinatorial explosion of possibilities. In order to reduce the number of combinations, a few things can
be done.
First of all, each segment can be weighted, and a score tallied along the way as you recurse over the
possible paths. If the score gets too high, you can terminate the recursion. For example, in the system I
worked on, weighting was done based on distance traveled. The algorithm used was fairly complex, but
essentially, if a destination was 2 miles away and the route went over 3 miles, recursion would be
terminated for that branch. This scoring also lets the system determine the shortest possible routes.
Another method used to greatly decrease the number of combinations was an analysis of the input set of
streets, and a determination made of major routes between certain locations. For instance, traveling from
one end of the city to another is usually most direct on a freeway. If the system determines that a freeway
route is appropriate, it breaks the routing problem down into two sections: first, find the shortest route
from the starting point to a freeway on-ramp, and then find the shortest route from the endpoint to a
freeway exit. Put these routes together, including the freeway travel, and you have an optimized path from
the starting point to the ending point. Major routes—like freeways—can be underweighted in order to
make them appear higher in the scoring rank.
If you’d like to try working with real street data, you can download US geographical shape files (including
streets as well as various natural formations) for free from the US Census Bureau. The data, called
TIGER/Line, is available from
www.census.gov/geo/www/tiger/index.html
. Be warned: this data is not
easy to work with and requires a lot of cleanup to get it to the point where it can be easily queried.
387
the output easier to read.
You can use the following T-SQL to create a table based on these columns:
USE AdventureWorks;
GO
CREATE TABLE Employee_Temp
(
EmployeeID int NOT NULL
CONSTRAINT PK_Employee PRIMARY KEY,
ManagerID int NULL
CONSTRAINT FK_Manager REFERENCES Employee_Temp (EmployeeID),
Title nvarchar(100)
);
GO
INSERT INTO Employee_Temp
(
EmployeeID,
388
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
ManagerID,
Title
)
SELECT
EmployeeID,
ManagerID,
Title
FROM HumanResources.Employee;
GO
represented by its upper management team—exactly the results that we expected.
389
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
EmployeeID ManagerID Title
6 109 Marketing Manager
12 109 Vice President of Engineering
42 109 Information Services Manager
140 109 Chief Financial Officer
148 109 Vice President of Production
273 109 Vice President of Sales
However, this query has a hidden problem: traversing from node to node in the Employee_Temp table
means searching based on the ManagerID column. Considering that this column is not indexed, it should
come as no surprise that the query plan for the preceding query involves a scan, as shown in Figure 12-8. Figure 12-8. Querying on the ManagerID causes a table scan.
To eliminate this issue, an index on the ManagerID column must be created. However, choosing
exactly how best to index a table such as this one can be difficult. In the case of this small example, a
clustered index on ManagerID would yield the best overall mix of performance for both querying and data
updates, by covering all queries that involve traversing the table. However, in an actual production
system, there might be a much higher percentage of queries based on the EmployeeID—for instance,
queries to get a single employee’s data—and there would probably be a lot more columns in the table
than the three used here for example purposes, meaning that clustered key lookups could be expensive.
In such a case, it is important to test carefully which combination of indexes delivers the best balance of
query and data modification performance for your particular workload.
In order to show the best possible performance in this case, change the primary key to use a
nonclustered index and create a clustered index on ManagerID, as shown in the following T-SQL:
ALTER TABLE Employee_Temp
DROP CONSTRAINT FK_Manager, PK_Employee;
Title
FROM Employee_Temp
WHERE ManagerID IS NULL
UNION ALL
SELECT
e.EmployeeID,
e.ManagerID,
e.Title
FROM Employee_Temp e
JOIN n ON n.EmployeeID = e.ManagerID
)
SELECT
n.EmployeeID,
n.ManagerID,
n.Title
FROM n;
GO
Note that this CTE returns all columns to be used by the outer query—but this is not the only way to
write this query. The query could also be written such that the CTE uses and returns only the EmployeeID
column, necessitating an additional JOIN in the outer query to get the other columns:
WITH n AS
(
391
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
SELECT
EmployeeID
FROM Employee_Temp
(
SELECT
EmployeeID,
ManagerID,
Title,
CONVERT(varchar(900),
RIGHT(REPLICATE('0', 10) + CONVERT(varchar, EmployeeID), 10) + '/'
) AS thePath
FROM Employee_Temp
WHERE ManagerID IS NULL
UNION ALL
392
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
SELECT
e.EmployeeID,
e.ManagerID,
e.Title,
CONVERT(varchar(900),
n.thePath +
RIGHT(REPLICATE('0', 10) + CONVERT(varchar, e.EmployeeID), 10) + '/'
) AS thePath
FROM Employee_Temp e
JOIN n ON n.EmployeeID = e.ManagerID
)
SELECT
n.EmployeeID,
n.ManagerID,
WITH n AS
(
SELECT
EmployeeID,
ManagerID,
Title,
CONVERT(varchar(900),
'0000000001/'
) AS thePath
FROM Employee_Temp
WHERE ManagerID IS NULL
UNION ALL
SELECT
e.EmployeeID,
e.ManagerID,
e.Title,
CONVERT(varchar(900),
n.thePath +
RIGHT(
REPLICATE('0', 10) +
CONVERT(varchar, ROW_NUMBER() OVER (ORDER BY e.Title)),
10
) + '/'
) AS thePath
FROM Employee_Temp e
JOIN n ON n.EmployeeID = e.ManagerID
)
SELECT
and convert the IDs to
binary(4)
. This would have the same net effect for the purpose of sorting and at the same
time take up less space—so you will see an efficiency benefit, and in addition you’ll be able to hold more node IDs
in each row’s path. The downside is that this makes the IDs more difficult to visualize, so for the purposes of this
chapter—where visual cues are important—I use the left-padding method instead.
The downside of including an enumerated path instead of a materialized path is that the
enumerated version cannot be easily deconstructed to determine the keys that were followed. For
instance, simply looking at the thePath column in the results of the first query in this section, we can see
that the path to the Engineering Manager (EmployeeID 3) starts with EmployeeID 109 and continues to
EmployeeID 12 before getting to the Engineering Manager. Looking at the same column using the
enumerated path, it is not possible to discover the actual IDs that make up a given path without
following it back up the hierarchy in the output.
395
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
Are CTEs the Best Choice?
While CTEs are possibly the most convenient way to traverse adjacency list hierarchies in SQL Server
2008, they do not necessarily deliver the best possible performance. Iterative methods involving
temporary tables or table variables may well outperform recursive CTEs, especially as the hierarchy
grows in size.
To highlight the performance difference between CTEs and iterative methods, a larger sample
hierarchy is necessary. To begin with, we can add width to the Employee_Temp hierarchy. This means that
the hierarchy will maintain the same depth, but each level will have more siblings. To accomplish this,
for each row below a given subtree, both the employee IDs and manager IDs can be incremented by the
same known amount, thereby producing a duplicate subtree in place. The following T-SQL
accomplishes this, running in a loop five times and doubling the width of the hierarchy on each
iteration:
DECLARE @CEO int;
SELECT
look at the CASE expression in the SELECT list, which increments all IDs except that of the CEO. This
ensures that the duplicate subtrees will be appended to the tree as a whole, by virtue of the roots of those
subtrees being subordinates of the CEO’s node, rather than the node at the top of each subtree
becoming an additional root node.
396
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CHAPTER 12 TREES, HIERARCHIES, AND GRAPHS
Once this code has been run, the Employee_Temp hierarchy will have 9,249 nodes, instead of the 290
that we started with. However, the hierarchy still has only five levels. To increase the depth, a slightly
different algorithm is required. To add levels, find all managers except the CEO, and insert new duplicate
nodes, incrementing their employee IDs similar to before. Next, update the preexisting managers in the
table to report to the new managers. The following T-SQL does this in a loop four times, producing a
hierarchy with a depth of 50 levels and 31,329 nodes:
DECLARE @CEO int;
SELECT
@CEO = EmployeeID
FROM Employee_Temp
WHERE ManagerID IS NULL;
DECLARE @depth int = 32;
WHILE @depth <= 256
BEGIN
DECLARE @OldManagers table
(
EmployeeID int
);
--Insert intermediate managers
--Find all managers except the CEO, and increment their EmployeeID by 1000
SELECT EmployeeID
FROM @OldManagers
);
SET @depth = @depth * 2;
END;
GO
Be careful when adding additional depth to an experimental hierarchy. I’ve found that depth has a
much greater performance impact than width, and extremely deep hierarchies are not especially
common—for instance, even the largest companies do not normally have more than 20 or 30 levels of
management.
To iteratively traverse the hierarchy using a table variable, think about what recursion does: at each
level, the employees for the previous level’s managers are found, and then that level becomes the
current level. Applying this logic iteratively requires the following table variable:
DECLARE @n table
(
EmployeeID int,
ManagerID int,
Title nvarchar(100),
Depth int,
thePath varchar(900),
PRIMARY KEY (Depth, EmployeeID)
);
The Depth column maintains the level for nodes as they are inserted. The table is clustered on the
combination of Depth and EmployeeID; at each level, the depth will be queried first, and we know that
EmployeeID will be unique so we can exploit it as a method of ensuring that the key itself is unique.
To start things off, prime the table variable with the node you wish to use as the root for traversal. In
this case, the CEO’s node will be used, and the path is started with 1/, as I’ll be implementing the
enumerated path output shown in the previous example:
DECLARE @depth int = 1;
) + '/'
FROM Employee_Temp e
JOIN @n n on n.EmployeeID = e.ManagerID
WHERE n.Depth = @depth;
IF @@ROWCOUNT = 0
BREAK;
SET @depth = @depth + 1;
END
Finally, the output can be queried from the table variable. Like before, an ORDER BY clause is
necessary:
SELECT
EmployeeID,
ManagerID,
Title,
thePath
FROM @n
ORDER BY
thePath;
This method uses over 50 percent more code than the CTE, is quite a bit less intuitive, and has many
more potential areas in which you might introduce logic bugs. However, its performance is quite a bit
better than the CTE. The enumerated path CTE performs 347,282 reads and runs in 27.6 seconds on my
laptop against the enhanced Employee_Temp table. The iterative method, on the other hand, requires only
173,536 reads and runs in 13.2 seconds.
Despite the clear performance improvement in this case, I do not recommend this method for the
majority of situations. I feel that the maintainability issues overshadow the performance benefits in all
but the most extreme cases (such as that demonstrated here). For that reason, the remaining examples
in this chapter will use CTEs. However, you should be able to convert any of the examples so that they
use iterative logic. Should you decide to use this technique on a project, you might find it beneficial to
RIGHT(
REPLICATE('0', 10) +
CONVERT(varchar, e.EmployeeID),
10) + '/'
) AS thePath
FROM Employee_Temp e
JOIN n ON n.ManagerID = e.EmployeeID
)
SELECT *
FROM n
WHERE ManagerID IS NULL;
This query returns the path from the selected node to the CEO as a materialized path of employee
IDs. However, you might instead want to get the results back as a table of employee IDs. In order to do
that, change the outer query to the following:
SELECT
COALESCE(ManagerID, 217) AS EmployeeID
FROM n
ORDER BY
400
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.