TSQL - Solve it YOUR Way - Finding the Longest Repeated Substring

 

Introduction:

As part of the blog series TSQL - Solve it YOUR Way, today's topic will cover a common programming interview question where we are asked to find the longest repeated substring in a given string, followed by different solutions and explanations from three of the more helpful and creative contributors in the TSQL MSDN forums, Steven Wang, Kent Waldrop, and Jingyang Li.

Topic: Given a string, find the longest repeated substring

I was asked this question many years ago in a programming interview and I solved it in C# as was expected in the context of the role I was interviewing for.  A few days ago, a friend of mine was discussing a recent programming interview where he was asked this same question.  While discussing solutions with him, I realized that this could make for a great topic for the Solve it YOUR Way series.  While T-SQL may be an unlikely method of solving this problem, the solutions below highlight the power of SQL Server through the creative solutions of the authors. 

As with any good interview question, the initial phrasing of the question does not provide all of the necessary requirements to provide the best or only correct solution.  In this case, in order to provide a correct solution, we would need details such as 1) can substrings overlap, and 2) what do we return if multiple substrings of the same length exist (first occurrence, all occurrences?), etc.  I intentionally left these details out of the original topic to allow the answers a bit more freedom in their solutions.  The solutions reflect these subtle differences.

Solution #1 - Provided by Steven Wang

 

Code Snippet

  1. --First of all, we need a generic number table. I know that we can use the common table expression to create
  2. --a virtual number table. But I still prefer to have a physical number table in place.
  3. --As SQL server can only store up to 8000 characters for a string value, for simplicity I will create
  4. -- a number table with maximum value equals to 8000.
  5. If Object_ID('dbo.LongestStr_Number', 'U') Is Null
  6.     Begin
  7.         SELECT TOP 8000 IDENTITY(INT,1,1) AS Number
  8.         INTO    dbo.LongestStr_Number
  9.         FROM    master.sys.all_columns A
  10.                     CROSS JOIN
  11.                 master.sys.all_columns B;
  12.  
  13.         ALTER TABLE dbo.LongestStr_Number ADD CONSTRAINT PK_LongestStr_Number PRIMARY KEY CLUSTERED (Number) WITH FILLFACTOR = 100;
  14.     End
  15.  
  16.  
  17. --Solution Starts from here. The solution works only for SQL server 2012, by some change can be easily used for SQL server 2005/08
  18. Declare @MyString varchar(8000);
  19. Declare @Length int;
  20.  
  21. Set @MyString = 'ABAB CDCDCDCD ABAB';
  22. Set    @Length = Len(@MyString);
  23.  
  24. ;With RepeatingNumber
  25. AS
  26. (
  27. Select    A.Number, Row_Number() Over(Partition By A.Number Order By (Select 0)) AS Position
  28. From    LongestStr_Number A
  29.             Cross Apply
  30.         (
  31.             Select    1 As Col
  32.             From    LongestStr_Number B
  33.             Where    Number <= A.Number
  34.         ) C
  35. Where    A.Number between @Length/2 + 1 and @Length
  36. )
  37. , CTE
  38. As
  39. (
  40. Select    A.Number, B.MySubString, B.StringLen
  41.         , Count(B.MySubString) Over(Partition By A.Number, B.MySubString) As MyCount
  42.         , A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position Rows Unbounded Preceding) - B.StringLen As PositionGap
  43. From    RepeatingNumber A
  44.             CROSS APPly
  45.         (    Select        Substring(@MyString, A.Position, @Length - A.Number + 1) As MySubstring
  46.                         ,  @Length - A.Number + 1 As StringLen
  47.         ) B
  48. )
  49.  
  50. Select        Top 1 with Ties MySubstring As longest_repeating_String
  51. From        CTE
  52. Where        MyCount > 1 and PositionGap >= 0
  53. Order by    StringLen Desc

 

Explanation of Steven's solution:

To get substrings of a string, the number of substrings is determined by the length of the substring. For a string with length equals to n, it has 1 substring if the length of the substring equals to n – 1 + 1. If the length of the substring equals to n – 2 + 1, then it can have 2 substrings with overlaps. With the same fashion, if the length of a substring equals to n – m + 1 (m<=n), a string will have m number of substrings with overlaps.

As our task is to find out the longest repeating substring with NO overlaps, the number m should be only starting from n/2 + 1 (remember in T-SQL integer n divided by 2 will always return back integer). However, this doesn’t guarantee the substrings without overlaps.

Since each m number has m number of substrings which is calculated by the RepeatingNumber virtual table Cross Applied with a substring function, we can use the window function Count () by partitioning the m number and the substring to find out the repeating substrings with overlaps. If the window function Count() Over returns back the value is greater than 1, then we find a repeating substring. But at this stage, we don’t know if the repeating substrings are overlapped or not.

I used another window function First_value to identify the repeating substrings without overlaps by using the code below:

A.Position - First_Value(A.Position) Over(partition By A.Number, B.MySubstring Order by A.Position) - B.StringLen

The idea of above code is to get the offset value of the starting point of the remaining substrings with same value as the first substring within the same Number group. If the offset value is negative, then we can conclude that the two substrings have overlaps. Otherwise, the two substrings are non-overlapped repeating substrings.

At the last we use the TOP 1 clause with order by to get the longest repeating substring.
 
I have to mention that the white spaces will be treated as characters in this solution. we can modify the solution to take care of it if it is needed.

Solution #2 - Provided by Kent Waldrop

 

Code Snippet

  1. -- Create a numbers table for use in the solution
  2. create table numbers(n int)
  3. go
  4.  
  5. declare @counter int
  6. set @counter = 0
  7. while @counter < 400
  8. begin
  9.     set @counter = @counter + 1
  10.     insert into numbers
  11.     select @counter
  12. end
  13.  
  14. -- Create a temporary table containing the strings for which we will find the longest substring
  15. declare @test table(theString varchar(400));
  16. insert into @test
  17. select 'ABABABABABABABABABABABABABAB CDCDCDCDCDCDCDCDCDCDCDCDCDCD'
  18. union all
  19. select 'SQL Server 2012 has introduced several new window functions and enhanced support for window aggregate functions by introducing window order and frame clauses, support for offset functions. In this session, the presenter will apply these new functions to solve some most frequently asked questions in MSDN T-SQL forum.'
  20. union all
  21. select 'bananas'
  22. union all
  23. select '3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233'
  24. union all
  25. select 'Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation.'
  26. ;
  27.  
  28. -- Solution starts here
  29. --;WITH Numbers AS
  30. --(
  31. --     SELECT 1 AS n
  32. --     UNION ALL
  33. --     SELECT n + 1 AS n FROM   Numbers  WHERE  n < 400
  34. --)
  35. --,
  36. with findRepeatChars as
  37. ( select
  38.     1 as stringLength,
  39.     n as location,
  40.       (len(theString)-n)/2 as diff,
  41.     cast(substring(theString, n, 1) as varchar(200)) as MatchString,
  42.       theString
  43.   from @test
  44.     outer apply
  45.   ( select n from numbers
  46.     where n < 400
  47.      and n < len(theString)
  48.       and charindex(substring(theString, n, 1), theString, n+1) > 0
  49.   ) oa1
  50. ), primaryWork as
  51. (
  52.   select
  53.     substring(theString, location, loop_Offset + max_N) as matchString,
  54.     theString,
  55.     rank() over
  56.     ( partition by theString
  57.       order by loop_Offset + max_N desc
  58.     ) as Rn
  59.   from findRepeatChars
  60.   cross apply
  61.   ( select
  62.       iif(diff>=9,     8, null) as group2_Offset,
  63.         iif(diff>=25,   24, null) as group3_Offset,
  64.         iif(diff>=49,   48, null) as group4_Offset,
  65.         iif(diff>=90,   89, null) as group5_Offset,
  66.         iif(diff>=144, 143, null) as group6_Offset
  67.   ) calcGroupOffsets
  68.   cross apply
  69.   ( select
  70.       nullif(sign(charindex(substring(theString, location, group6_offset + 1),
  71.             theString, location + group6_Offset + 1)), 0) as group6_Index,
  72.       nullif(sign(charindex(substring(theString, location, group5_offset + 1),
  73.             theString, location + group5_Offset + 1)), 0) as group5_Index,
  74.       nullif(sign(charindex(substring(theString, location, group4_offset + 1),
  75.             theString, location + group4_Offset + 1)), 0) as group4_Index,
  76.       nullif(sign(charindex(substring(theString, location, group3_offset + 1),
  77.             theString, location + group3_Offset + 1)), 0) as group3_Index,
  78.       nullif(sign(charindex(substring(theString, location, group2_offset + 1),
  79.             theString, location + group2_Offset + 1)), 0) as group2_Index
  80.   ) calcGroupIndices
  81.   cross apply
  82.   ( select
  83.       coalesce( group6_Index * group6_Offset, group5_Index * group5_Offset,
  84.             group4_Index * group4_Offset, group3_Index * group3_Offset,
  85.               group2_Index * group2_Offset, 0
  86.         ) as loop_Offset,
  87.       coalesce
  88.           ( group6_Index * iif(diff>=144, 57, null), group5_Index * iif(diff>=90,55, null),
  89.           group4_Index * iif(diff>=49,  42, null), group3_Index * iif(diff>=25,  25, null),
  90.             group2_Index * iif(diff>=9,   17, null), 9
  91.         ) as loop_Range
  92.   ) calcLoopParms
  93.   cross apply
  94.   ( select
  95.       max(n) as max_N
  96.     from numbers
  97.     where n <= 57
  98.       and n <= loop_Range
  99.         and charindex(substring(theString, location, loop_Offset + n), theString,
  100.                location + loop_Offset + n)
  101.         > 0
  102.   ) calcMaxN
  103. )
  104.  
  105. select
  106.   matchString as longest_Substring,
  107.   theString as input_String
  108. from primaryWork
  109. where Rn = 1
  110.  
  111. OPTION (maxrecursion 0)

 

Explanation of Kent's solution:

The purpose of this query is to find the longest substring that is repeated within a given string.  Note that this query depends on a table of numbers.  This query is divided into three major parts:

  •  The "findRepeatChars" CTE
  •  The "primaryWork" CTE
  •  The outer query

"findRepeatChars" locates all characters in the input string that are repeated at some point later in the string.  This CTE parses the string by using a table of numbers to spin through each character position except the last position, which cannot have a repeated character that comes later in the string.

"primaryWork" processes the locations found by "findRepeatChars" and determines the longest substring beginning at the given location.  This CTE includes four CROSS APPLY operators to provide reusable calculations:

  •  calcGroupOffsets
  •  calcGroupIndices
  •  calcLoopParms
  •  calcMaxN

calcGroupOffsets uses the built in IIF function to calculate offsets that can be used as a base locations for dividing a string into smaller segments that can be quickly processed with a linear search.

calcGroupIndices determines if a string at the given location has a substring of a length determined by the group offset that repeats later in the given string.

calcLoopParms uses COALESCE functions to select the largest group offset that defines a substring that is later repeated in the given string based on the computed group indices.  In addition, the built in IIF function is used to determine an appropriate search range based on the selected group offset and group index.

calcMaxN uses the computed loop parameters and a table of numbers to determine the largest number that can be added to the offset.  This largest number when added to the largest group offset determines the largest substring that is repeated at a given location in the string.

After the primaryWork CTE processes the four apply operators, a RANK function is used such that the rank is partitioned according to the input' string and ordered by the descending string lengths of all of the repeating substrings for a given input string to provide a row number (rn) column.

The outer query calls the primaryWork CTE to process a list of all repeating substrings associated with a given input string.  Records are filtered with the outer query such that only records that have a row number (rn) of 1 are retained.

Notes:
I sort of used odd squares to segment out the data; I didn't investigate whether some kind of log method might be better.

Solution #3 - Provided by Jingyang Li 

 

Code Snippet

  1. declare @MyString varchar(max)
  2. Set @MyString = 'Server 2012 has introduced several new window functions and enhanced support for window aggregate functions by introducing window order and frame clauses, support for offset functions. In this session, the presenter will apply these new functions to solve some most frequently asked questions in MSDN T-SQL forum.';
  3.  
  4. ;WITH Numbers AS
  5. (
  6.      SELECT 1 AS i
  7.      UNION ALL
  8.      SELECT i + 1 AS i FROM   Numbers  WHERE  i < LEN(@MyString)
  9. )
  10.  
  11. ,mycte1 AS
  12.   (
  13.   SELECT a.i AS Step,b.i AS StartPos, SUBSTRING(@MyString, b.i,  a.i) AS SubSeq FROM Numbers a CROSS JOIN Numbers b
  14.   )
  15.  
  16. ,mycte2 AS (
  17.      SELECT  StartPos, Step, SubSeq
  18.      ,Count(SubSeq)over(Partition by SubSeq) cntSubSeq
  19.      ,ROW_NUMBER() Over(Partition by SubSeq Order by Step ) rnSubSeq
  20.      ,MAX(Len(SubSeq)) OVER(Partition by SubSeq) maxLen
  21.   FROM  mycte1  WHERE  LEN(SubSeq) >= Step
  22. )
  23.  
  24. ,mycte3 as
  25. (
  26.     SELECT Step,StartPos, SubSeq,  maxLen, rnSubSeq
  27.     FROM   mycte2
  28.     WHERE cntSubSeq>1
  29. )
  30.  
  31. ,mycte4 as
  32. (
  33.     SELECT  a.SubSeq, a.StartPos, a.maxLen
  34.            ,RANK() OVER (order by a.maxLen DESC) AS rnkLen
  35.            ,Row_Number() OVER (order by a.maxLen DESC,a.Step) AS rnLen
  36.     FROM mycte3 a
  37.     LEFT JOIN mycte3 b ON a.SubSeq=b.SubSeq and (a.rnSubSeq=b.rnSubSeq-1 OR b.rnSubSeq-1 = 0)
  38.     WHERE (a.StartPos+a.Step-1) < b.StartPos
  39. )
  40.  
  41.     SELECT SubSeq as [LognestRepeatingSubstring] FROM mycte4
  42.     WHERE rnkLen=1
  43.     Order by maxLen DESC, StartPos
  44.  
  45. OPTION (maxrecursion 0)

 

Explanation of Jingyang's solution:

Step 1: Create a numbers table based on the string length but it would be better to use an auxiliary numbers table;
Step 2: Cross join Numbers table to create positions and steps for all substring combinations;
Step 3: Calculate the occurrence of substrings in various lengths, the length for each substring and a sequence within a substring based on the step. The WHERE condition is to eliminate double count for any substrings after the last full step increments;
Step 4: Get all substrings with at least two occurrences;
Step 5: Remove repeating substring with overlapping and get the ranking order for the longest substrings;
Step 6: Retrieve the longest repeating substring(s) with two options.

The option1 is to get longest string with tie breaker (first occurrence) by using rnLen=1 and  the second option is to get longest strings without tie breaker by using rnkLen=1;

In order to change the default recursion limit from 100 to no limit for CTEs, the query hints MAXRECURSION is set to 0.

Note:  I include the rank order to choose the first occurrence or all longest repeating substrings in the query.

Conclusion:

As you can see, all three of the above solutions provide variations of the result we were looking for, but do so in creatively different styles.  Each of these solutions leverages different SQL Server language constructs and includes different considerations in the final solutions, including one solution that leverage new constructs from SQL Server 2012. I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.

Special thanks to Jingyang, Steven, and Kent for their valuable forums contribution and for contributing to this series!

Hope that helps,
Sam Lester (MSFT)

Contributor Bios:

Steven Wang has worked with SQL server for more than 10 years. Steven is an active SQL server community participant and speaks at events such as TechEd, CodeCamp, SQL User Group etc.

Blog: www.MSBICOE.com | LinkedIn: Steven Wang | MSDN Profile: Steven Wang - Shangzhou

Kent Waldrop started working with Sybase Transact SQL in 1989 as an application developer and continued working with Sybase until 1995 when SQL Server 6 was released.  At that time, he became a Microsoft SQL Server database administrator and has continued to work with Microsoft SQL Server ever since.  Currently he is a database architect working with Microsoft SQL Server, Oracle and UDB/DB2.

Jingyang Li has been working with SQL Server since he began his IT career as an ASP.NET developer in 2001. He enjoys working with T-SQL and recently took a full time job as a DBA.  He has been participating in the Microsoft forums with alias Limno.