As part of the blog series TSQL - Solve it YOUR Way, today's topic will cover a common programming interview question where we are asked to find the longest repeated substring in a given string, followed by different solutions and explanations from three of the more helpful and creative contributors in the TSQL MSDN forums, Steven Wang, Kent Waldrop, and Jingyang Li.
Topic: Given a string, find the longest repeated substring
I was asked this question many years ago in a programming interview and I solved it in C# as was expected in the context of the role I was interviewing for. A few days ago, a friend of mine was discussing a recent programming interview where he was asked this same question. While discussing solutions with him, I realized that this could make for a great topic for the Solve it YOUR Way series. While T-SQL may be an unlikely method of solving this problem, the solutions below highlight the power of SQL Server through the creative solutions of the authors.
As with any good interview question, the initial phrasing of the question does not provide all of the necessary requirements to provide the best or only correct solution. In this case, in order to provide a correct solution, we would need details such as 1) can substrings overlap, and 2) what do we return if multiple substrings of the same length exist (first occurrence, all occurrences?), etc. I intentionally left these details out of the original topic to allow the answers a bit more freedom in their solutions. The solutions reflect these subtle differences.
Solution #1 - Provided by Steven Wang
Explanation of Steven's solution:
To get substrings of a string, the number of substrings is determined by the length of the substring. For a string with length equals to n, it has 1 substring if the length of the substring equals to n – 1 + 1. If the length of the substring equals to n – 2 + 1, then it can have 2 substrings with overlaps. With the same fashion, if the length of a substring equals to n – m + 1 (m<=n), a string will have m number of substrings with overlaps.
As our task is to find out the longest repeating substring with NO overlaps, the number m should be only starting from n/2 + 1 (remember in T-SQL integer n divided by 2 will always return back integer). However, this doesn’t guarantee the substrings without overlaps.
Since each m number has m number of substrings which is calculated by the RepeatingNumber virtual table Cross Applied with a substring function, we can use the window function Count () by partitioning the m number and the substring to find out the repeating substrings with overlaps. If the window function Count() Over returns back the value is greater than 1, then we find a repeating substring. But at this stage, we don’t know if the repeating substrings are overlapped or not.
I used another window function First_value to identify the repeating substrings without overlaps by using the code below:
A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position) - B.StringLen
The idea of above code is to get the offset value of the starting point of the remaining substrings with same value as the first substring within the same Number group. If the offset value is negative, then we can conclude that the two substrings have overlaps. Otherwise, the two substrings are non-overlapped repeating substrings.
At the last we use the TOP 1 clause with order by to get the longest repeating substring.
I have to mention that the white spaces will be treated as characters in this solution. we can modify the solution to take care of it if it is needed.
Solution #2 - Provided by Kent Waldrop
Explanation of Kent's solution:
The purpose of this query is to find the longest substring that is repeated within a given string. Note that this query depends on a table of numbers. This query is divided into three major parts:
• The "findRepeatChars" CTE
• The "primaryWork" CTE
• The outer query
"findRepeatChars" locates all characters in the input string that are repeated at some point later in the string. This CTE parses the string by using a table of numbers to spin through each character position except the last position, which cannot have a repeated character that comes later in the string.
"primaryWork" processes the locations found by "findRepeatChars" and determines the longest substring beginning at the given location. This CTE includes four CROSS APPLY operators to provide reusable calculations:
calcGroupOffsets uses the built in IIF function to calculate offsets that can be used as a base locations for dividing a string into smaller segments that can be quickly processed with a linear search.
calcGroupIndices determines if a string at the given location has a substring of a length determined by the group offset that repeats later in the given string.
calcLoopParms uses COALESCE functions to select the largest group offset that defines a substring that is later repeated in the given string based on the computed group indices. In addition, the built in IIF function is used to determine an appropriate search range based on the selected group offset and group index.
calcMaxN uses the computed loop parameters and a table of numbers to determine the largest number that can be added to the offset. This largest number when added to the largest group offset determines the largest substring that is repeated at a given location in the string.
After the primaryWork CTE processes the four apply operators, a RANK function is used such that the rank is partitioned according to the input' string and ordered by the descending string lengths of all of the repeating substrings for a given input string to provide a row number (rn) column.
The outer query calls the primaryWork CTE to process a list of all repeating substrings associated with a given input string. Records are filtered with the outer query such that only records that have a row number (rn) of 1 are retained.
I sort of used odd squares to segment out the data; I didn't investigate whether some kind of log method might be better.
Solution #3 - Provided by Jingyang Li
Explanation of Jingyang's solution:
Step 1: Create a numbers table based on the string length but it would be better to use an auxiliary numbers table;
Step 2: Cross join Numbers table to create positions and steps for all substring combinations;
Step 3: Calculate the occurrence of substrings in various lengths, the length for each substring and a sequence within a substring based on the step. The WHERE condition is to eliminate double count for any substrings after the last full step increments;
Step 4: Get all substrings with at least two occurrences;
Step 5: Remove repeating substring with overlapping and get the ranking order for the longest substrings;
Step 6: Retrieve the longest repeating substring(s) with two options.
The option1 is to get longest string with tie breaker (first occurrence) by using rnLen=1 and the second option is to get longest strings without tie breaker by using rnkLen=1;
In order to change the default recursion limit from 100 to no limit for CTEs, the query hints MAXRECURSION is set to 0.
Note: I include the rank order to choose the first occurrence or all longest repeating substrings in the query.
As you can see, all three of the above solutions provide variations of the result we were looking for, but do so in creatively different styles. Each of these solutions leverages different SQL Server language constructs and includes different considerations in the final solutions, including one solution that leverage new constructs from SQL Server 2012. I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.
Special thanks to Jingyang, Steven, and Kent for their valuable forums contribution and for contributing to this series!
Hope that helps,
Sam Lester (MSFT)
Steven Wang has worked with SQL server for more than 10 years. Steven is an active SQL server community participant and speaks at events such as TechEd, CodeCamp, SQL User Group etc.
Kent Waldrop started working with Sybase Transact SQL in 1989 as an application developer and continued working with Sybase until 1995 when SQL Server 6 was released. At that time, he became a Microsoft SQL Server database administrator and has continued to work with Microsoft SQL Server ever since. Currently he is a database architect working with Microsoft SQL Server, Oracle and UDB/DB2.
Jingyang Li has been working with SQL Server since he began his IT career as an ASP.NET developer in 2001. He enjoys working with T-SQL and recently took a full time job as a DBA. He has been participating in the Microsoft forums with alias Limno.