Regex 101 Exercise S1 - Discussion

Welcome to Regex 101 discussion section 1.

My goals for these discussions are twofold. First, I'd like to give a reasonable answer (or set of answers, more likely) to the exercise. Second, I'd like to impart some understanding of how the regex works, both in the "this is what this construct does" and the "this is how this works under the covers" sense. This first one is going to cover a lot of basics, so if you're already familiar with regex, you may want to "read for flavor"...

Our challenge is the following:

*****

S1 - Match a Social Security Number

Verify that a string is a social security number of the format ddd-dd-dddd.

*****

So, we need to come up with a regex pattern that will match a valid string, and not match an invalid string. Consider "999-55-1827" as our sample string.

Our first task is to match a digit. We do that with a character class, written as follows:

[0123456789]

which means "match any one of the characters inside the []". So, this matches a single digit. It's also really ugly to write. We can write this more compactly using a range:

[0-9]

meaning "any one character in the range 0-9". It turns out matching digits is a common operation, so the regex language provides a shorthand (table of common shorthands):

\d

So, to match three digits in a row, we'd write:

\d\d\d
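To check that the three spellings of "one digit" behave the same way, here is a small sketch. It uses Python's re module rather than the .NET classes (that substitution is my assumption for illustration; these particular constructs are written the same in both), and the helper name fully_matches is made up for this example.

```python
import re

# The three ways of writing "one digit" discussed above.
DIGIT_FORMS = ["[0123456789]", "[0-9]", r"\d"]


def fully_matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


for form in DIGIT_FORMS:
    # Each form should accept a digit and reject a letter.
    print(form, fully_matches(form, "7"), fully_matches(form, "x"))

# Three digits in a row, written longhand.
print(fully_matches(r"\d\d\d", "999"), fully_matches(r"\d\d\d", "99"))
```

Each form accepts "7" and rejects "x", so choosing between them is about readability rather than behavior for plain ASCII digits.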

Characters that don't have a meaning in the regex pattern language can be used as literals, and "-" doesn't have a meaning at this level, so we can also use that as a literal, giving us the pattern:

\d\d\d-\d\d-\d\d\d\d

That matches our sample correctly. But there is nothing to stop the engine from finding a valid match in the middle of a string - our pattern will also match:

My father's SSN was 999-55-1827, and he loved that number
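A quick way to see this for yourself is the sketch below. It uses Python's re module as a stand-in for the .NET classes (my assumption, for illustration only); the point is just that an unanchored search is free to start anywhere in the input.

```python
import re

# The unanchored pattern from above.
SSN_UNANCHORED = re.compile(r"\d\d\d-\d\d-\d\d\d\d")

sentence = "My father's SSN was 999-55-1827, and he loved that number"

# search() scans forward through the string until the pattern matches somewhere.
found = SSN_UNANCHORED.search(sentence)

print(found is not None)   # a match exists even though the string is not an SSN
print(found.group(0))      # the substring that was matched
```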

To limit it to work only with the SSN string, we use anchors. Technically, they're called atomic zero-width assertions, which makes it clear why I prefer "anchor". Anchors set limitations on where a match must take place. The two that we want are "^", which means "anchor to the beginning of the string", and "$", which means "anchor to the end of the string" (strictly speaking, in .NET "$" also matches just before a final newline). So, our pattern becomes:

^\d\d\d-\d\d-\d\d\d\d$

and I would consider that a valid answer to the exercise. But not a good one - it's not as readable as it could be (and I'm not going where you think I'm going... yet...). C# lets a verbatim string (@"...") span multiple lines, and the .NET regex classes support whitespace and comments within the pattern (use RegexOptions.IgnorePatternWhitespace), so I'm able to write this as:

^ # beginning of string

\d\d\d # three digits

- # literal '-'

\d\d # two digits

- # literal '-'

\d\d\d\d # four digits

$ # end of string

When it is possible, I think all regex patterns should be written like this.
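The same commented layout can be tried outside of .NET. The sketch below assumes Python's re module, where the re.VERBOSE flag plays a role similar to RegexOptions.IgnorePatternWhitespace; the helper name is_ssn is made up for this example. It also shows the anchors doing their job on the sentence from earlier.

```python
import re

# Anchored pattern, one piece per line, with comments inside the pattern itself.
SSN_COMMENTED = re.compile(
    r"""
    ^           # beginning of string
    \d\d\d      # three digits
    -           # literal '-'
    \d\d        # two digits
    -           # literal '-'
    \d\d\d\d    # four digits
    $           # end of string
    """,
    re.VERBOSE,  # ignore pattern whitespace and allow '#' comments
)


def is_ssn(text: str) -> bool:
    """Return True if text looks like ddd-dd-dddd (hypothetical helper)."""
    return SSN_COMMENTED.search(text) is not None


print(is_ssn("999-55-1827"))
print(is_ssn("My father's SSN was 999-55-1827, and he loved that number"))
```

The first call reports a match; the second does not, because the anchors no longer let the engine start in the middle of the sentence.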

Now, that's a fine pattern, but "\d\d\d\d" is a bit hard to parse visually. We can make that simpler by using a quantifier, which says how many times the preceding item must match. To match four digits in a row, we can write:

\d{4}

and our whole pattern becomes:

^ # beginning of string

\d{3} # three digits

- # literal '-'

\d{2} # two digits

- # literal '-'

\d{4} # four digits

$ # end of string

Whether that is preferable to the previous choice is a matter of aesthetics. I think that "\d{3}" is slightly nicer than "\d\d\d", but I also think that "ooo" is better than "o{3}", so it's not an obvious choice.
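If you want to convince yourself that the two spellings really are interchangeable, a small comparison helps. This sketch again assumes Python's re module in place of the .NET classes, and the sample strings and the helper name agree_on are made up for illustration.

```python
import re

# The same anchored pattern, written longhand and with quantifiers.
LONGHAND = re.compile(r"^\d\d\d-\d\d-\d\d\d\d$")
QUANTIFIED = re.compile(r"^\d{3}-\d{2}-\d{4}$")

# A few valid and invalid inputs (illustrative only).
SAMPLES = ["999-55-1827", "99-555-1827", "999-55-182", "999551827", "abc-de-fghi"]


def agree_on(samples) -> bool:
    """Return True if both patterns give the same verdict on every sample."""
    return all(
        (LONGHAND.search(s) is not None) == (QUANTIFIED.search(s) is not None)
        for s in samples
    )


print(agree_on(SAMPLES))
```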

To use this in a C# program, I use "Copy as C#" from Regex Workbench, which gives:

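using System;
using System.Text.RegularExpressions;

// The verbatim string keeps the pattern's line breaks and comments intact.
Regex regex = new Regex(@"
^ # beginning of string
\d{3} # three digits
- # literal '-'
\d{2} # two digits
- # literal '-'
\d{4} # four digits
$ # end of string
",
RegexOptions.IgnorePatternWhitespace);

Match match = regex.Match("999-55-1827");

// Success is true when the whole string has the ddd-dd-dddd shape.
Console.WriteLine(match.Success);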

I think that's a manageable chunk for the first exercise. If you're the kind who likes spoilers, you might want to read the quantifiers page.

Careful readers may have noticed that my solution doesn't cover some of the cases discussed in the comments to the original question. I will get to those parts later, but they're too complex for early posts.

I'm also considering disabling comments on the exercise posts so that people aren't distracted by them.

My goal is to do one of these a week - with the exercise on Monday, and the discussion on Thursday/Friday. Something like that.