Regular Expressions From Scratch, Part One: Defining Terms

Over the years that I’ve been writing this blog some of the most positive feedback I’ve received has been for those entries where I’ve explored fundamental concepts in computer science. I thought that I might take that to its logical extreme, and do a series on what exactly a “regular expression” is.

Most script developers are familiar with regular expressions – they’re those dense, hard-to-read, patterns that you can use to write self-obfuscating code that does string matching. I’d like to start by throwing out everything that we know about practical, real-world regular expressions and start from set theory, of all things.

This is going to be a very long series I think. But it will build character!

Here’s what I’m going to assume:

  • We have an intuitive understanding of sets. Sets are collections of stuff. They can be empty, finite, infinite, you name it. They can contain other sets. I am not going to start with the axioms of set theory, but trust me, we could justify our intuitive understanding of sets axiomatically if we wanted to.

  • We have an intuitive understanding of the set of natural numbers ℕ = {0, 1, 2, 3, …}

  • We can create arbitrary sets of symbols. For clarity, throughout this series I’ll be displaying symbols in a blue fixed-width typeface. We could have a set of symbols S = {a, b, c, +, , 0, 1, &} for instance.

  • We have an intuitive understanding of finite sequences. Remember, sets are by definition unordered, may be infinite, and contain no duplicates. A finite sequence is like a set but is finite, has an order, and may contain duplicates. A finite sequence may be empty. (Again, we could derive a suitable definition of a finite sequence from axiomatic set theory, but I won’t.) To distinguish finite sequences from sets, I’ll use curly braces for sets and round parentheses for finite sequences, but not for long.

That’s not starting with much. Let’s see what we can do.

Definition 1: An alphabet is a finite set of symbols.

For example, we could have the alphabet of capital Roman letters R = {A, B, C, …, Z} or the alphabet of binary digits B = {0, 1}.

The empty alphabet is a pretty boring alphabet, but it is a legal alphabet.

Definition 2: a string is a finite sequence of symbols drawn from a given alphabet.

It is a pain in the rear to continually say “(b, b, c) is a string over alphabet {a, b, c}” so for the vast majority of cases, I’ll be writing strings all run together like this: bbc and assuming that you can mentally treat this as a sequence.

I’ll also be assuming that you can figure out the alphabet for a given string from context. If the alphabet is ever unclear then I’ll call it out specifically.

I will sometimes use “variables” to refer to strings and languages. You’ll be able to distinguish variables from symbols because symbols are always written in blue fixed-width font. I shall endeavour to also avoid saying something confusing such as “Suppose b = bbc” though as we’ll see later, this will become unavoidable when we talk about regular expressions.

There’s no good way to show the empty string, not without getting into a whole lot of issues with quotation marks. I’m hereby declaring that unless I say otherwise, the variable e represents the empty string in whatever alphabet we’re presently talking about.

Note that the empty string is the only string that can be formed from the empty alphabet.

I’ll almost always use the variable S to represent an alphabet and u, v and w to represent strings.

Definition 3: Consider an alphabet S. The Kleene Closure of S is the set of all strings that can be formed from that alphabet and will be written as S*.

For example, let S = {0, 1}. Then S* = {e, 0, 1, 00, 01, 10, 11, 000, …}. That is, every string of finite length that can be formed from only 0 and 1, including the empty string.

S* is of course an infinite set, though I hasten to emphasize that no member of S* is infinitely long, because strings are by definition finite sequences.

Here comes our most important definition today:

Definition 4: A language over an alphabet S is any subset of S*.

This is a rather broad definition of “language” — any subset whatsoever of the set of all finite strings in any alphabet forms a language.

Next time we’ll look at some examples of languages, both sensible and crazy.

Comments (18)

  1. Hasani says:

    All I know is senior year, I took ‘theory of computation’ where we studied and created dfa and nfa graphs (if I remember correctly). A year later, I picked up an orielly book regex as I needed to use it for some work I was doing… Things quickly fell into place.

    For a day, I was king of the world!

  2. Steve says:

    Automata was a fun class that gave me plenty of headaches.

  3. Lance Fisher says:

    I love the way you address these subjects – very mathematically. I’m looking forward to the series.

  4. Have you thought about coordinating w/ <a href="">Eric Gunnerson</a>?

  5. Many apologies for messing up the link.

  6. Chris Moorhouse says:

    Sweet! Can’t wait for the rest…

  7. Looks like fun sequence of articles. Looking forward to it.

  8. Rich C says:

    I really enjoy reading this stuff, Eric. But just to make life easy, can’t we use Ø for the empty set? That’s the only one I’ve seen.

    It’s just a <ctrl> + /, <shift> + o away.

    Of course, if it doesn’t show up correctly when I post this, I’ll know why, eh :)?

  9. Rich C says:

    Then again, you might be trying to show empty _strings_ not empty _sets_. As that is the case, I’ll just read and enjoy the next in this series and watch myself.

  10. Jonathan says:

    Yeah, I liked Automata in University as well. Not that I remember any of it…

  11. Eric Lippert says:

    Right — e is a string, it is the empty _sequence_. But we’ll be using Ø for something like the empty string in a few episodes, don’t worry.

  12. Mark W says:

    Looking forward to this series… Could you make the blue symbol color a bit brighter, though? I can’t see any color difference between the regular text and the symbols without setting my text size to Largest.

  13. Adam G says:

    I thought that λ was pretty standard-ly the empty string…

    That being said, I remember CS Theory fondly and look forward to the rest of this series.

  14. Eric Lippert says:

    Indeed, traditionally lambda is used for empty string, capital sigma is used for alphabets, etc. My feeling is that Greek letters are somewhat off-putting to non-mathematicians. I’m therefore going to sacrifice both rigor and unimportant factors in traditional notation to hopefully make it easier on an audience that doesn’t necessarily have a background in CS theory. (If you do have a background in CS theory, you probably already know this stuff anyway.)

  15. When it comes to validating input regular expression becomes a very important part of your security plan.&amp;nbsp;…

  16. Tanveer Badar says:

    What about the empty language? The language with no strings in it.

  17. Eric Lippert says:

    Since that is a subset of S*, yes, the empty language is a language.

Skip to main content