Regex 101 Discussion I2 – Find two words in a string


I2 – Find two words in a string


Find any string that has the following two words in it: “dog” and “vet”


******

This is an interesting one, since it’s not something that regex is particularly suited for. The test strings that I’m using are:


I took my dog to the vet
The vet fixed my dog
My dog likes to visit veterans
dog dog
The vet is great
He continued with dogged determination


The first two should match; all the others should fail.


In the comments to the original post, Maurits said that you should use two regexes. I think that it may be the best solution (clearest and easiest), though it may be less performant. But I’m going to talk about the single-regex solution.
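For comparison, the two-regex approach might look something like the sketch below. It is written in Python rather than .NET, and the function name `has_both_words` is made up for illustration. It relies on the \b word-boundary anchor, which is explained further down.

```python
import re

# Two separate patterns, one per word. \b keeps "dogged" and "veterans"
# from counting as matches.
DOG = re.compile(r"\bdog\b", re.IGNORECASE)
VET = re.compile(r"\bvet\b", re.IGNORECASE)


def has_both_words(text: str) -> bool:
    """Return True only if both whole words appear somewhere in text."""
    # `and` short-circuits: the second search is skipped when the first fails.
    return DOG.search(text) is not None and VET.search(text) is not None
```

Each pattern stays trivial to read, and the order of the two words in the string does not matter.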


The only tricky thing about this is that we need to match words rather than characters. To do that, we can write:


\sdog\s


to find a dog surrounded by whitespace (please spend two minutes, think up the best joke you can having to do with “dog surrounded by whitespace”, and post it as a comment). Unfortunately, if I try to match that to:


I am going to walk my dog


it fails, because there’s no whitespace after “dog”. What we need is a way to match the boundary between a word and a non-word. We can do that with “\b”, so if we write:


\bdog\b


we will get the behavior that we want. Two quick notes:



  1. Like the $ and ^ anchors, \b doesn’t consume any characters; it just asserts a condition that must be true at that position for the match to succeed.

  2. The boundary is really between word characters (letters, digits, and the underscore) and everything else, including the start and end of the string.
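The difference between \s and \b can be checked with a few lines of code. This is a minimal sketch in Python rather than .NET; the variable names are only illustrative, and the "dog_house" string is an extra example that is not one of the test strings above.

```python
import re

sentence = "I am going to walk my dog"

# \s needs an actual whitespace character on both sides of "dog",
# so a "dog" at the very end of the string does not match.
whitespace_match = re.search(r"\sdog\s", sentence)

# \b is zero-width: it matches at the transition between a word character
# and a non-word character, including the start or end of the string.
boundary_match = re.search(r"\bdog\b", sentence)

# \b does not match in the middle of a longer word such as "dogged".
inside_word = re.search(r"\bdog\b", "He continued with dogged determination")

# The underscore counts as a word character, so there is no boundary here.
underscore = re.search(r"\bdog\b", "dog_house")
```

Here `whitespace_match`, `inside_word`, and `underscore` are all `None`, while `boundary_match` finds the word.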

So, time to string things together. We can match a sentence with dog followed by vet with the following:


\bdog\b.*?\bvet\b


That handles one case, and to handle the other case, we’ll just switch the order. Finally, we get:


\bdog\b.*?\bvet\b
|
\bvet\b.*?\bdog\b


which does what we want it to do, assuming we use RegexOptions.IgnoreCase when we use it (and RegexOptions.IgnorePatternWhitespace, if the pattern is split across lines as shown).
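To check the combined pattern against the test strings, something like the following sketch could be used. It is Python rather than .NET, so `re.IGNORECASE` and `re.VERBOSE` stand in for the .NET options; `BOTH_WORDS` and `results` are names chosen just for this example.

```python
import re

# VERBOSE lets the pattern span several lines (whitespace in it is ignored);
# IGNORECASE lets "Dog" or "VET" match as well.
BOTH_WORDS = re.compile(
    r"""
    \bdog\b.*?\bvet\b
    |
    \bvet\b.*?\bdog\b
    """,
    re.IGNORECASE | re.VERBOSE,
)

test_strings = [
    "I took my dog to the vet",
    "The vet fixed my dog",
    "My dog likes to visit veterans",
    "dog dog",
    "The vet is great",
    "He continued with dogged determination",
]

# Map each test string to True (matched) or False (did not match).
results = {s: BOTH_WORDS.search(s) is not None for s in test_strings}

for text, matched in results.items():
    print("match" if matched else "fail ", "-", text)
```

Only the first two strings should be reported as matches.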


That’s all for now. The next one is a nice one, but it will have to wait until next year…


 

Comments (2)

  1. Maurits says:

    You mentioned that the two-regexes case may be less performant… in Perl my experience has been that simpler regexes tend to run faster, but I thought I’d test it.

    Test script:

    http://www.geocities.com/mvaneerde/regex-speed-test.pl.txt

    Test results:

    Building strings…

    Done in 23.4671127796173 seconds

    Two regexes: /\bdog\b/i, /\bvet\b/i (no short-circuit)

    0.593423128128052 seconds (91 matches)

    Two regexes: /\bdog\b/i and /\bvet\b/i (short-circuit)

    0.343111991882324 seconds (91 matches)

    One maximal regex: /(?:\bdog\b.*\bvet\b)|(?:\bvet\b.*\bdog\b)/i

    1.89077997207642 seconds (91 matches)

    One minimal regex: /(?:\bdog\b.*?\bvet\b)|(?:\bvet\b.*?\bdog\b)/i

    1.84350800514221 seconds (91 matches)

    The .NET regex engine probably handles regexes differently than Perl’s, of course.

  2. Maurits says:

    I added Anonymous’ lookahead regex from the comments to the other post

    /^(?=.*?\bdog\b)(?=.*?\bvet\b)/i

    and it ran at exactly the same speed as the "two regexes without short-circuit"

    But, of course, it requires knowledge of lookahead subclauses… which is more of an "advanced" feature