TSQL – Solve it YOUR Way – Finding the Longest Repeated Substring

Introduction:

As part of the blog series TSQL - Solve it YOUR Way, today's topic will cover a common programming interview question where we are asked to find the longest repeated substring in a given string, followed by different solutions and explanations from three of the more helpful and creative contributors in the TSQL MSDN forums, Steven Wang, Kent Waldrop, and Jingyang Li.

Topic:  Given a string, find the longest repeated substring

I was asked this question many years ago in a programming interview and I solved it in C# as was expected in the context of the role I was interviewing for.  A few days ago, a friend of mine was discussing a recent programming interview where he was asked this same question.  While discussing solutions with him, I realized that this could make for a great topic for the Solve it YOUR Way series.  While T-SQL may be an unlikely method of solving this problem, the solutions below highlight the power of SQL Server through the creative solutions of the authors.

As with any good interview question, the initial phrasing of the question does not provide all of the necessary requirements to provide the best or only correct solution.  In this case, in order to provide a correct solution, we would need details such as 1) can substrings overlap, and 2) what do we return if multiple substrings of the same length exist (first occurrence, all occurrences?), etc.  I intentionally left these details out of the original topic to allow the answers a bit more freedom in their solutions.  The solutions reflect these subtle differences.

Solution #1 - Provided by Steven Wang

Code Snippet
1. --First of all, we need a generic number table. I know that we can use the common table expression to create
2. --a virtual number table. But I still prefer to have a physical number table in place.
3. --As SQL server can only store up to 8000 characters for a string value, for simplicity I will create
4. -- a number table with maximum value equals to 8000.
5. If Object_ID('dbo.LongestStr_Number', 'U') Is Null
6.     Begin
7.         SELECT TOP 8000 IDENTITY(INT,1,1) AS Number
8.         INTO    dbo.LongestStr_Number
9.         FROM    master.sys.all_columns A
10.                     CROSS JOIN
11.                 master.sys.all_columns B;
12.
13.         ALTER TABLE dbo.LongestStr_Number ADD CONSTRAINT PK_LongestStr_Number PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100;
14.     End
15.
16.
17. --Solution Starts from here. The solution works only for SQL server 2012, by some change can be easily used for SQL server 2005/08
18. Declare @MyString varchar(8000);
19. Declare @Length int;
20.
21. Set @MyString = 'ABAB CDCDCDCD ABAB';
22. Set    @Length = Len(@MyString);
23.
24. ;With RepeatingNumber
25. AS
26. (
27. Select    A.Number, Row_Number() Over(Partition By A.Number Order By (Select 0)) AS Position
28. From    LongestStr_Number A
29.             Cross Apply
30.         (
31.             Select    1 As Col
32.             From    LongestStr_Number B
33.             Where    Number <= A.Number
34.         ) C
35. Where    A.Number between @Length/2 + 1 and @Length
36. )
37. , CTE
38. As
39. (
40. Select    A.Number, B.MySubString, B.StringLen
41.         , Count(B.MySubString) Over(Partition By A.Number, B.MySubString) As MyCount
42.         , A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position Rows Unbounded Preceding) - B.StringLen As PositionGap
43. From    RepeatingNumber A
44.             CROSS APPly
45.         (    Select        Substring(@MyString, A.Position, @Length - A.Number + 1) As MySubstring
46.                         ,  @Length - A.Number + 1 As StringLen
47.         ) B
48. )
49.
50. Select        Top 1 with Ties MySubstring As longest_repeating_String
51. From        CTE
52. Where        MyCount > 1 and PositionGap >= 0
53. Order by    StringLen Desc

Explanation of Steven's solution:

To get substrings of a string, the number of substrings is determined by the length of the substring. For a string with length equals to n, it has 1 substring if the length of the substring equals to n – 1 + 1. If the length of the substring equals to n – 2 + 1, then it can have 2 substrings with overlaps. With the same fashion, if the length of a substring equals to n – m + 1 (m<=n), a string will have m number of substrings with overlaps.

As our task is to find out the longest repeating substring with NO overlaps, the number m should be only starting from n/2 + 1 (remember in T-SQL integer n divided by 2 will always return back integer). However, this doesn’t guarantee the substrings without overlaps.

Since each m number has m number of substrings which is calculated by the RepeatingNumber virtual table Cross Applied with a substring function, we can use the window function Count () by partitioning the m number and the substring to find out the repeating substrings with overlaps. If the window function Count() Over returns back the value is greater than 1, then we find a repeating substring. But at this stage, we don’t know if the repeating substrings are overlapped or not.

I used another window function First_value to identify the repeating substrings without overlaps by using the code below:

A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position) - B.StringLen

The idea of above code is to get the offset value of the starting point of the remaining substrings with same value as the first substring within the same Number group. If the offset value is negative, then we can conclude that the two substrings have overlaps. Otherwise, the two substrings are non-overlapped repeating substrings.

At the last we use the TOP 1 clause with order by to get the longest repeating substring.

I have to mention that the white spaces will be treated as characters in this solution. we can modify the solution to take care of it if it is needed.

Solution #2 - Provided by Kent Waldrop

Code Snippet
1. -- Create a numbers table for use in the solution
2. create table numbers(n int)
3. go
4.
5. declare @counter int
6. set @counter = 0
7. while @counter < 400
8. begin
9.     set @counter = @counter + 1
10.     insert into numbers
11.     select @counter
12. end
13.
14. -- Create a temporary table containing the strings for which we will find the longest substring
15. declare @test table(theString varchar(400));
16. insert into @test
17. select 'ABABABABABABABABABABABABABAB CDCDCDCDCDCDCDCDCDCDCDCDCDCD'
18. union all
19. select 'SQL Server 2012 has introduced several new window functions and enhanced support for window aggregate functions by introducing window order and frame clauses, support for offset functions. In this session, the presenter will apply these new functions to solve some most frequently asked questions in MSDN T-SQL forum.'
20. union all
21. select 'bananas'
22. union all
23. select '3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233'
24. union all
25. select 'Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation.'
26. ;
27.
28. -- Solution starts here
29. --;WITH Numbers AS
30. --(
31. --     SELECT 1 AS n
32. --     UNION ALL
33. --     SELECT n + 1 AS n FROM   Numbers  WHERE  n < 400
34. --)
35. --,
36. with findRepeatChars as
37. ( select
38.     1 as stringLength,
39.     n as location,
40.       (len(theString)-n)/2 as diff,
41.     cast(substring(theString, n, 1) as varchar(200)) as MatchString,
42.       theString
43.   from @test
44.     outer apply
45.   ( select n from numbers
46.     where n < 400
47.      and n < len(theString)
48.       and charindex(substring(theString, n, 1), theString, n+1) > 0
49.   ) oa1
50. ), primaryWork as
51. (
52.   select
53.     substring(theString, location, loop_Offset + max_N) as matchString,
54.     theString,
55.     rank() over
56.     ( partition by theString
57.       order by loop_Offset + max_N desc
58.     ) as Rn
59.   from findRepeatChars
60.   cross apply
61.   ( select
62.       iif(diff>=9,     8, null) as group2_Offset,
63.         iif(diff>=25,   24, null) as group3_Offset,
64.         iif(diff>=49,   48, null) as group4_Offset,
65.         iif(diff>=90,   89, null) as group5_Offset,
66.         iif(diff>=144, 143, null) as group6_Offset
67.   ) calcGroupOffsets
68.   cross apply
69.   ( select
70.       nullif(sign(charindex(substring(theString, location, group6_offset + 1),
71.             theString, location + group6_Offset + 1)), 0) as group6_Index,
72.       nullif(sign(charindex(substring(theString, location, group5_offset + 1),
73.             theString, location + group5_Offset + 1)), 0) as group5_Index,
74.       nullif(sign(charindex(substring(theString, location, group4_offset + 1),
75.             theString, location + group4_Offset + 1)), 0) as group4_Index,
76.       nullif(sign(charindex(substring(theString, location, group3_offset + 1),
77.             theString, location + group3_Offset + 1)), 0) as group3_Index,
78.       nullif(sign(charindex(substring(theString, location, group2_offset + 1),
79.             theString, location + group2_Offset + 1)), 0) as group2_Index
80.   ) calcGroupIndices
81.   cross apply
82.   ( select
83.       coalesce( group6_Index * group6_Offset, group5_Index * group5_Offset,
84.             group4_Index * group4_Offset, group3_Index * group3_Offset,
85.               group2_Index * group2_Offset, 0
86.         ) as loop_Offset,
87.       coalesce
88.           ( group6_Index * iif(diff>=144, 57, null), group5_Index * iif(diff>=90,55, null),
89.           group4_Index * iif(diff>=49,  42, null), group3_Index * iif(diff>=25,  25, null),
90.             group2_Index * iif(diff>=9,   17, null), 9
91.         ) as loop_Range
92.   ) calcLoopParms
93.   cross apply
94.   ( select
95.       max(n) as max_N
96.     from numbers
97.     where n <= 57
98.       and n <= loop_Range
99.         and charindex(substring(theString, location, loop_Offset + n), theString,
100.                location + loop_Offset + n)
101.         > 0
102.   ) calcMaxN
103. )
104.
105. select
106.   matchString as longest_Substring,
107.   theString as input_String
108. from primaryWork
109. where Rn = 1
110.
111. OPTION (maxrecursion 0)

Explanation of Kent's solution:

The purpose of this query is to find the longest substring that is repeated within a given string.  Note that this query depends on a table of numbers.  This query is divided into three major parts:

•  The "findRepeatChars" CTE
•  The "primaryWork" CTE
•  The outer query

"findRepeatChars" locates all characters in the input string that are repeated at some point later in the string.  This CTE parses the string by using a table of numbers to spin through each character position except the last position, which cannot have a repeated character that comes later in the string.

"primaryWork" processes the locations found by "findRepeatChars" and determines the longest substring beginning at the given location.  This CTE includes four CROSS APPLY operators to provide reusable calculations:

•  calcGroupOffsets
•  calcGroupIndices
•  calcLoopParms
•  calcMaxN

calcGroupOffsets uses the built in IIF function to calculate offsets that can be used as a base locations for dividing a string into smaller segments that can be quickly processed with a linear search.

calcGroupIndices determines if a string at the given location has a substring of a length determined by the group offset that repeats later in the given string.

calcLoopParms uses COALESCE functions to select the largest group offset that defines a substring that is later repeated in the given string based on the computed group indices.  In addition, the built in IIF function is used to determine an appropriate search range based on the selected group offset and group index.

calcMaxN uses the computed loop parameters and a table of numbers to determine the largest number that can be added to the offset.  This largest number when added to the largest group offset determines the largest substring that is repeated at a given location in the string.

After the primaryWork CTE processes the four apply operators, a RANK function is used such that the rank is partitioned according to the input' string and ordered by the descending string lengths of all of the repeating substrings for a given input string to provide a row number (rn) column.

The outer query calls the primaryWork CTE to process a list of all repeating substrings associated with a given input string.  Records are filtered with the outer query such that only records that have a row number (rn) of 1 are retained.

Notes:
I sort of used odd squares to segment out the data; I didn't investigate whether some kind of log method might be better.

Solution #3 - Provided by Jingyang Li

Code Snippet
1. declare @MyString varchar(max)
2. Set @MyString = 'Server 2012 has introduced several new window functions and enhanced support for window aggregate functions by introducing window order and frame clauses, support for offset functions. In this session, the presenter will apply these new functions to solve some most frequently asked questions in MSDN T-SQL forum.';
3.
4. ;WITH Numbers AS
5. (
6.      SELECT 1 AS i
7.      UNION ALL
8.      SELECT i + 1 AS i FROM   Numbers  WHERE  i < LEN(@MyString)
9. )
10.
11. ,mycte1 AS
12.   (
13.   SELECT a.i AS Step,b.i AS StartPos, SUBSTRING(@MyString, b.i,  a.i) AS SubSeq FROM Numbers a CROSS JOIN Numbers b
14.   )
15.
16. ,mycte2 AS (
17.      SELECT  StartPos, Step, SubSeq
18.      ,Count(SubSeq)over(Partition by SubSeq) cntSubSeq
19.      ,ROW_NUMBER() Over(Partition by SubSeq Order by Step ) rnSubSeq
20.      ,MAX(Len(SubSeq)) OVER(Partition by SubSeq) maxLen
21.   FROM  mycte1  WHERE  LEN(SubSeq) >= Step
22. )
23.
24. ,mycte3 as
25. (
26.     SELECT Step,StartPos, SubSeq,  maxLen, rnSubSeq
27.     FROM   mycte2
28.     WHERE cntSubSeq>1
29. )
30.
31. ,mycte4 as
32. (
33.     SELECT  a.SubSeq, a.StartPos, a.maxLen
34.            ,RANK() OVER (order by a.maxLen DESC) AS rnkLen
35.            ,Row_Number() OVER (order by a.maxLen DESC,a.Step) AS rnLen
36.     FROM mycte3 a
37.     LEFT JOIN mycte3 b ON a.SubSeq=b.SubSeq and (a.rnSubSeq=b.rnSubSeq-1 OR b.rnSubSeq-1 = 0)
38.     WHERE (a.StartPos+a.Step-1) < b.StartPos
39. )
40.
41.     SELECT SubSeq as [LognestRepeatingSubstring] FROM mycte4
42.     WHERE rnkLen=1
43.     Order by maxLen DESC, StartPos
44.
45. OPTION (maxrecursion 0)

Explanation of Jingyang's solution:

Step 1: Create a numbers table based on the string length but it would be better to use an auxiliary numbers table;
Step 2: Cross join Numbers table to create positions and steps for all substring combinations;
Step 3: Calculate the occurrence of substrings in various lengths, the length for each substring and a sequence within a substring based on the step. The WHERE condition is to eliminate double count for any substrings after the last full step increments;
Step 4: Get all substrings with at least two occurrences;
Step 5: Remove repeating substring with overlapping and get the ranking order for the longest substrings;
Step 6: Retrieve the longest repeating substring(s) with two options.

The option1 is to get longest string with tie breaker (first occurrence) by using rnLen=1 and  the second option is to get longest strings without tie breaker by using rnkLen=1;

In order to change the default recursion limit from 100 to no limit for CTEs, the query hints MAXRECURSION is set to 0.

Note:  I include the rank order to choose the first occurrence or all longest repeating substrings in the query.

Conclusion:

As you can see, all three of the above solutions provide variations of the result we were looking for, but do so in creatively different styles.  Each of these solutions leverages different SQL Server language constructs and includes different considerations in the final solutions, including one solution that leverage new constructs from SQL Server 2012. I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.

Special thanks to Jingyang, Steven, and Kent for their valuable forums contribution and for contributing to this series!

Hope that helps,
Sam Lester (MSFT)

Contributor Bios:

Steven Wang has worked with SQL server for more than 10 years. Steven is an active SQL server community participant and speaks at events such as TechEd, CodeCamp, SQL User Group etc.

Blog: www.MSBICOE.com | LinkedIn: Steven Wang | MSDN Profile: Steven Wang - Shangzhou

Kent Waldrop started working with Sybase Transact SQL in 1989 as an application developer and continued working with Sybase until 1995 when SQL Server 6 was released.  At that time, he became a Microsoft SQL Server database administrator and has continued to work with Microsoft SQL Server ever since.  Currently he is a database architect working with Microsoft SQL Server, Oracle and UDB/DB2.

Jingyang Li has been working with SQL Server since he began his IT career as an ASP.NET developer in 2001. He enjoys working with T-SQL and recently took a full time job as a DBA.  He has been participating in the Microsoft forums with alias Limno.

1. Steven Wang - Shangzhou says:

For the first solution, if we need to find our the all occurrences for the longest repeating strings, then tweak the code a little bit to use TOP with TIE clause as below:

If Object_ID('dbo.LongestStr_Number', 'U') Is Null

Begin

SELECT TOP 8000 IDENTITY(INT,1,1) AS Number

INTO dbo.LongestStr_Number

FROM master.sys.all_columns A

CROSS JOIN

master.sys.all_columns B;

ALTER TABLE dbo.LongestStr_Number ADD CONSTRAINT PK_LongestStr_Number PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100;

End

–Solution Starts from here. The solution works only for SQL server 2012, by some change can be easily used for SQL server 2005/08

Declare @MyString varchar(8000);

Declare @Length int;

Set @MyString = 'ABAB CDCDCDCD ABAB';

Set @Length = Len(@MyString);

;With RepeatingNumber

AS

(

Select A.Number, Row_Number() Over(Partition By A.Number Order By (Select 0)) AS Position

From LongestStr_Number A

Cross Apply

(

Select 1 As Col

From LongestStr_Number B

Where Number <= A.Number

) C

Where A.Number between @Length/2 + 1 and @Length

)

, CTE

As

(

Select A.Number, B.MySubString, B.StringLen

, Count(B.MySubString) Over(Partition By A.Number, B.MySubString) As MyCount

, A.Position – First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position Rows Unbounded Preceding) – B.StringLen As PositionGap

From RepeatingNumber A

CROSS APPly

( Select Substring(@MyString, A.Position, @Length – A.Number + 1) As MySubstring

,  @Length – A.Number + 1 As StringLen

) B

)

Select Top 1 with Ties MySubstring As longest_repeating_String

From CTE

Where MyCount > 1 and PositionGap >= 0

Order by StringLen Desc

Thanks. –Steven

2. Kent Waldrop says:

The solution I provided works better if you have a table of numbers or use a better numbers CTE.  Probably better to comment out this stuff:

— Solution starts here

–;WITH Numbers AS

–(

—     SELECT 1 AS n

—     UNION ALL

—     SELECT n + 1 AS n FROM   Numbers  WHERE  n < 400

–)

–,findRepeatChars as

with findRepeatChars as

Sorry for the confusion.

Kent

3. Samuel Lester - MSFT says:

Thanks Steven, I updated your solution above.

4. Samuel Lester - MSFT says:

Thanks Kent, I made the modification to yours as well.

5. SathyanarrayananS says:

Good one !!!