String search sample with C++ AMP


In this blog post I am going to share with you a string matching sample and describe the inner workings of the algorithm. On my machine with NVIDIA GTX580 the C++ AMP algorithm that I present below achieves up to 35X speedup over the std::strstr algorithm from standard C++ library.

In the first section I will clarify the string search problem that this algorithm was designed to solve. In the second, I will briefly describe the baseline algorithm, which is the CPU version that I used to check the correctness of the results and to compare the performance against the C++ AMP version. Next, I will describe the approach taken for the C++ AMP algorithm. I will focus on the most interesting design challenges that needed to be solved, like finding the right algorithm or working with the “char” type in C++ AMP. Finally I will end with a link to zip file that contains the sample’s source code for you to play with.

The problem

The input to the algorithm is M number of strings called “patterns” and N number of strings called “records”. The length of the record is larger than the length of the pattern. Each pattern is searched within each record to find an exact match of the pattern within the record.

Note: Working with a set of patterns and a set of records rather than one pattern and one record makes this problem adequately sized for data parallel accelerators.

If the pattern is not found within the record, then the result for that string search is -1. Otherwise the result is the index of the first occurrence of the pattern within a record. The output of the algorithm is therefore defined as an array of M*N elements. First M elements in the output array represent the results for each pattern against first record. Next M elements contain the results for second record and so on.

For simplicity all the patterns have the same length. Similarly all the records have the same length. Finally each pattern and record in the input array ends with 4 null terminating characters (I will explain later in the post why).

Here is an example:

Patterns: kitty\0\0\0\0puppy\0\0\0\0
Records: kitty and puppy\0\0\0\0puppy, elephant\0\0\0\0

Output: 0, 10, -1, 0

First pattern “kitty” can be found in first record at index 0. Second pattern “puppy” can be found at index 10 within first record. Pattern “kitty” cannot be found in second record, which gives us result -1, then the first occurrence of “puppy” in second record is located at index 0.

CPU Solution with strstr

The std::strstr algorithm from standard C++ library was picked as performance baseline and for verifying correctness of the C++ AMP algorithm. The std::strstr has been carefully optimized (including acceleration with SSE instructions) and rigorously tested which makes it a solid baseline.

Since we are dealing with a set of patterns and set of records all we need to do is to introduce two nested loops. The outer one would loop for all records and the inner one would loop for all patterns. In the most inner loop we will check if the record from the outer loop matches the pattern from the inner loop. The pseudo code for that would be:

 1: for (int r = 0; r < num_records; ++r)
 2:     for (int p = 0; p < num_patterns; ++p)
 3:         result[r][p] = strstr(&records[r], &patterns[p]);

Solution with C++ AMP

Selecting the right algorithm

The C++ AMP solution is based on the Rabin-Karp algorithm. We compute the hash value for the pattern and then compute hashes for all possible positions within the record. If the hash of the pattern matches the hash of the record at some position, then we have potential match. We cannot be certain, because different strings can produce same hash value. That is why all potential matches have to be fully verified by character to character comparison.

There are two reasons of why Rabin-Karp is a good algorithm for the C++ AMP version:

1) String patterns converted to hash values (32bit integers) have smaller memory footprint, so if we have less than 4096 of them, then we can fit them to constant memory and benefit from fast on-chip constant caches. The occupancy will be improved too, as read-only data in constant memory can be accessed directly without occupying registers. As an added bonus we also save a tiny bit on data transfer, since we have less data to copy-in.

2) The computation in Rabin-Karp is compute bound rather than memory bound as in naïve string search. Let me explain. In the most inner loop of the naive string search algorithm, we read a character of a record and a character from a pattern then we compare them. Since records would be stored in global memory most of our time would be spent on read accesses. In Rabin-Karp on the other hand the most inner loop compares a hash value of a record against all hash values of patterns. Doing so the latency of reading a record from global memory is hidden by many comparison operations that happen after it.

Working with the “char” type in C++ AMP

Next important decision to make is to decide on how to transfer the records to the accelerator for processing. We have two basic approaches:

1) Use textures of “char” types, and then extract a character inside the kernel as integer value.

2) Use of array or array_view of “unsigned int” types, and then performing bit manipulation to extract individual characters inside the kernel. One of our earlier blog posts discussed the technique of supporting character types by emulating them with integers.

Both approaches would do well it terms of kernel’s computational performance. I have decided to go with latter approach as this will allows us to generate initial records for our string problem in the staging array and then easily adapt such staging array with array_view. By using a staging array for the records, we cut down the data transfer time as the copy operation can be done directly from the staging array to accelerator’s memory. As a consequence for taking this approach, I decided to use 4 null terminating characters after each record, which allows me to simplify the unpacking logic (no two records occupy same integer). For consistency I kept the same format for patterns, although it was not necessary since we compute the hash values for them and they are never copied to the accelerator’s memory.

Dealing with false-positives

Due to the nature of Rabin-Karp algorithm, we need an extra post processing stage at which we will verify all tentatively identified matches. Assuming a small percentage of matches, this step should be a snap; therefore, the algorithm makes this computation on the CPU.

Notice that as the number of tentative matches increase (most of the patterns tentatively match most of the records), the Rabin-Karp stage on the accelerator becomes impractical as all tentatively identified matches would need to be rechecked in post processing stage on the CPU. Carefully crafting rolling hash function that is used inside the Rabin-Karp algorithm can reduce the number of false-positives. If you however expect high percentage of positive matches, then the presented algorithm in this blog post would be ineffective.

Building on the assumption that there is low number of matches, we can slightly change the output of Rabin-Karp algorithm. Instead of getting the results for each pattern, we can just store if a record matches any pattern. Doing so, there is a smaller output array to be transferred back from the accelerator (the array with a single integer for every record). More importantly the most inner loop of Rabin-Karp algorithm would make a store to a register when a hash value of a pattern is equal to the hash value of the record. Otherwise we would be introducing an array in shared or global memory to hold the result for each pattern. The only downside of change is that our host side post processing stage would now have to do little more work as if a record matches any pattern then we would have to check that record against all the patterns. I have decided to make this change as in testing I got performance gains from it for my tested inputs.

Putting it all together

Let’s repeat all steps of C++ AMP solution:

  • The C++ AMP algorithm computes the hashes for all patterns and places them in constant memory of the accelerator.
  • Records are copied as array_view of integers.
  • Inside the kernel each thread is responsible for string matching a single record against all patterns.
  • Bit manipulation is necessary when unpacking the record from the integer.
  • The outer loop walks on every index inside the record and computes a rolling hash of successive substrings of the length of the pattern.
  • In the innermost loop a hash for that index is checked against all patterns.
  • The outer loop continues until all indices of the record are checked.
  • The output of Rabin-Karp computation on the accelerator returns an array filled with 1 and 0 for each record. 1 signifies potential match for at least one pattern and 0 gives us a certainty that the record contains no possible matches to any pattern.
  • For each potential match detected on the accelerator we run multicore std::strstr on the host side to double check identified match and to expand our results array with the match/no match for individual pattern.

Most critical steps that boost the performance of C++ AMP:

  • Algorithmic efficiency of the Rabin-Karp approach and its compute bound nature.
  • Use of constant memory for hash patterns.
  • Large problem size that kept the accelerator busy (many records and many patterns).
  • Use of staging array for the records.
  • Data transfer of the records in packed form, rather than each ‘char’ represented as 4-byte integer.

 

Download the sample

I encourage you to download the Visual Studio project sample and study the algorithm that I described in this blog post. Please remember to use “Release” configuration when running a benchmark. Any feedback or comments are welcome either below or at the forum.

amp_string_search.zip

Comments (6)

  1. Out of interest, what CPU were you using when you measured a 35x speedup on the 580? My Radeon HD 7970 only shows a 10-14x speedup against my Core i5-660, which is a mid-range CPU from several generations ago. I'd expected more – could you paste the output from a typical run on your machine? I'd be interested to see the absolute times.

    I also noticed something else with this solution – every time I press F5, it recompiles all files before running. I'm still using VS2012 RC rather than RTM though, is this a known bug which was fixed?

  2. Hi Andrew!

    Thanks for trying out the sample. My CPU is Intel Core i7-930, here are my results for last test in benchmark:

    Size of records: 128MB, size of patterns: 8KB

    Running C++ AMP algorithm… done!

    C++ AMP string search took: 68.43 ms

    Running CPU algorithm… done!

    CPU string search took: 2653 ms

    Results matched, total number of patterns found: 2615

    Speedup (CPU/C++ AMP): 38.77X

    I have also second machine with Intel Core 2 Quad Q9550 and ATI 5870, here results are:

    Size of records: 128MB, size of patterns: 8KB

    Running C++ AMP algorithm… done!

    C++ AMP string search took: 188.2 ms

    Running CPU algorithm… done!

    CPU string search took: 1.22e+004 ms

    Results matched, total number of patterns found: 2615

    Speedup (CPU/C++ AMP): 64.82X

    Notice that although the speedup is better the absolute time of this card is about 3X slower than the one on first machine, which will put it around 14X when compared to the Core-i7.

    As for your second question, yes the F5 issue appears to be resolved. I have both RC and RTM of Visual Studio and I can repro on RC and not on RTM.

    Best regards,

    Simon

  3. Interesting. Here's a typical run on my machine (i5-660, HD 7970 @ 1010MHz):

    Size of records: 128MB, size of patterns: 8KB

    Running C++ AMP algorithm… done!

    C++ AMP string search took: 164.2 ms

    Running CPU algorithm… done!

    CPU string search took: 2378 ms

    Results matched, total number of patterns found: 2615

    Speedup (CPU/C++ AMP): 14.48X

    Note that the CPU time is slightly faster than your i7-930, which makes sense as they're from the same architectural family but the 660 is actually clocked slightly higher (its fewer cores aren't a problem in this single-threaded task – but for a fairer comparison against the GPUs, the CPU test should also be multithreaded!)

    However the AMP execution time is over twice as long as the 580's. The 7970 should stomp all over it, but I guess the problem is the a purely integer workload. It must have much lower integer performance than the 580.

  4. st3748 says:

    Wow really nice blog. I'm impressed. I've just started to write C++ Tutorials so if you are interested please check my blog. Thanks in advance. megacplusplustutorials.blogspot.com

  5. JP212 says:

    Many thanks for this impressive work. However, I tried to perform this and I noticed that the gain of GPU with CPU over the CPU witnesses regressions every few improvements although the general trend is increasing. Would you explain what is the reason of these regressions? is it related to the memory size ?

  6. AlGorr says:

    What I do not like in a present day exemples of code sources that it's not easy to understand during the first 5 minutes.

    Why do you need to complicate your code? You think you will look like more clever?

    Ex.:

       // Read patterns from the file to memory

       vector<char> patterns(num_patterns * pattern_length);

    Instead of:

       char* patterns = new char [num_patterns * pattern_length+1];

    Or

       std::string

    Are you sure that in 10 years  "vector<char>" will still be used?

    And there are plenty of other exemples like this in your code.