Regular Expressions From Scratch, Part Five: The Regular Expression Language

Now things start to get really weird.

Definition 9: Take any alphabet S. The regular expression alphabet of S is S plus a bunch of extra symbols; it’s S ∪ {(, ), *, , Ø} (I assume that none of those symbols are already in S.)

I’m doing something that I said earlier that I would try not to do. I’m using symbols in an alphabet that I also use in expressions that talk about strings in that alphabet. (Of course, I also said that this would all fall apart when we got to regular expressions, and sure enough, it did. Foreshadowing: the sign of a quality blog.)

Again, keep a careful eye on when I’m using fixed-width blue, because those are the “meaningless” symbols, not expression notation.

Definition 10: Take any alphabet S. The regular expression language R of an alphabet S is a language formed from strings of the regular expression language of S, and is defined as follows:

  • Ø is in R.
  • Every member of S is in R
  • If u and w are strings in R then (uw) is in R.
  • Similarly, (uw) is in R.
  • Similarly, u* is in R.
  • Nothing else is in R

An example might help. Suppose that S = {a, b}. The regular expression alphabet of S is {a, b, (, ), *, , Ø}. The regular expression language of S is R = {Ø, a, b, (ØØ), (Øa), (Øb), (aØ), (aa), (ab), (bØ), (ba), (bb), (Ø∪Ø), (Ø∪a), (Ø∪b), (a∪Ø), (a∪a), (a∪b), (b∪Ø), (b∪a), (b∪b), Ø*, a*, b*, …}

We’ve defined an alphabet, we’ve defined a language — a language which looks suspiciously like the expressions we’ve been using to talk about languages. Next time we’ll do something insanely clever to bridge the gap between strings in the regular expression language and the languages which these strings would define if they were interpreted as expressions.

Comments (10)

  1. Curtis says:

    Just wanted to let you know how much I am enjoying this series!

  2. Centaur says:

    Norwegians are going to hate you for using the letter O with stroke (U+00D8) instead of the empty set character (U+2205) 🙂

  3. Eric Lippert says:

    The empty set character doesn’t render properly in Lucida Console.

  4. Lance Fisher says:

    I’m really enjoying this series too. Probably my favorite series yet on your blog.

  5. Shawn Cheng says:

    In definition 10, I think the phrase "the regular expression language of S" should be "the regular expression alphabet of S".

  6. Shawn Cheng says:

    Again about definition 10, should we add one more entry to the list as follows:

    If u is a string in R then (u) is in R.

  7. Jon Turner says:

    That extra definition would be redundant, and could be removed. (u) === u so (u)* === u* and so on for all of the operation symbols.

  8. Jon Turner says:

    In response to your other post, Shawn, definition 10 is correct in calling it the "regular expression language" since definition 4 says that a language is a subset of S* (which matches the notation used) and alphabets must be finite according to definition 3 (i think, maybe definition 2) while R is not.

  9. When it comes to validating input regular expression becomes a very important part of your security plan. …