# Regular Expressions From Scratch, Part Seven: Listing All Members Of A Language In Order

Regular languages are by definition those languages which can be described by a regular expression. It should be clear from the definition that the union of any finite number of regular languages is a regular language, the concatenation of any finite number of regular languages is a regular language, and the Kleene Closure of any regular language is a regular language.

That's rather a lot of languages. Are all languages regular languages?

We should have a strong intuition that regular languages are a very limited subset of all languages. It seems like it would be hard to come up with a finite regular expression that, say, matches the language {w is in 1* such that there are a prime number of 1s in w}.

But we can do better than intuition. There are many interesting ways to prove that there exists at least one non-regular language. Over the next couple entries we'll find a non-regular language by using a clever trick. The first thing we'll need to do is enumerate every member of a language.

It doesn't really matter how we enumerate the language, as long as we guarantee that we eventually hit every member of the language. Here's one way to do such an enumeration.

Consider an alphabet, say S = {a, b}. We can enumerate all the strings of S* first by length and then by alphabetical order. Of course, we'll need to define an order for our alphabet, but since alphabets are by definition finite sets, we can choose any old order. In general, for alphabets consisting of Roman alphabet characters and numbers we'll use the standard alphabetical ordering we're all used to. We'll start with the one zero-length string, then the two one-symbol strings, then the four two-symbol strings, and so on:

e, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, …

(A commenter correctly pointed out that there are languages which cannot be easily enumerated in this way, but my coming argument does not rely upon an alphabetical ordering. All we need is some schema for enumerating a language which guarantees that we eventually get all of them.)

Consider R, the regular expression language of S. We can enumerate it too, first by length and then by alphabetical order. Let's say that alphabetical order of R is * ( ) a b Ø ∪ just to pick an arbitrary ordering for the symbols. We can then enumerate R in the same way like this:

a, b, Ø, a*, b*, Ø*, a**, b**, Ø**, (aa), (ab), (aØ), (ba), (bb), (bØ), (Øa), (Øb), (ØØ), …

Of course we are leaving out any strings which are not in R, such as * or ))(.

Next time we'll use the fact that we can make both these lists to show that a non-regular language exists.

1. Reinder Verlinde says:

<pedantic>

"Consider an alphabet, say S = {a, b}. We can enumerate all the strings of S* first by length and then by alphabetical order"

No, we can’t. In <http://blogs.msdn.com/ericlippert/archive/2005/11/18/493482.aspx&gt;, you said "Definition 1: An alphabet is a finite set of symbols".

A set does not have an order. Because of this, you can sort on length, but not on alphabet.

The mathematician in me saw this logical error coming in part 1 (to be honest, I expected it to become problematic when you started mentioning things like [a-z]).

</pedantic>

2. Evin says:

Let H be the language containing all pairs of programs and inputs for which the program halts with the given input.

I can enumerate H. But I cannot enumerate H ordered first by length and then alphabetically.

3. Lance Fisher says:

Reinder, I believe the confusion here is the double use of the word alphabet which I’m sure you already understand. When Eric is talking about alphabetic order it’s the normal English alphabetic order, but when he says S = {a,b} he’s defining a Definition 1 type of alphabet.

Certainly you can order any finite set, you just have to define the ordering, just like he did with R.

Is there any reason why we can’t define an alphabet as an ordered set?

4. Eric Lippert says:

Correct — since an alphabet is by definition a finite set, then we can impose any total order relation we choose upon it. We could choose to define an alphabet as a sequence I suppose.

5. Eric Lippert says:

Evin, I would argue that _you_ can’t enumerate H because such an enumeration would require you to prove or disprove Goldbach’s conjecture, amongst other things.

My coming argument doesn’t depend upon the fact that we can make an _alphabetical_ enumeration, just that we can make an enumeration which guarantees that every member of the set is on the enumeration somewhere. A length-then-alpha enumeration just happens to obviously have the property that it hits all of them and misses none of them. But any complete enumeration will do.

6. Evin says:

I’m not sure if the _you_ was a reference to my limited lifetime,

but I can enumerate H as well as I can enumerate any other infinite set. The order of my enumeration is very important. First I will list all programs and inputs of length 1 which halt in 1 operation. Then I will list those no longer than 2 which halt in 2 or fewer operations (skipping those that I listed before). And so on. I will eventually list every member of H (and each will be in a well-defined position). I probably wouldn’t get around to Goldbach’s conjecture before the Sun expands, but if it is false, I would eventually list a program searching for the even number which disproves it.

If I could enumerate H ordered by length or alphabetically, I could solve H by calculating the appropriate number of elements in my list.

If we’re talking about languages in general, and not just the decidable ones, you must be careful with what can be done "clearly."

7. Eric Lippert says:

Very clever, and you are correct, the "clearly" is a big old hand-wave. ðŸ™‚

8. Jonas Grumby says:

I don’t mean to join the pedantry dogpile, but did you skip (a) in your enumeration? (it’s in R, right?)

9. Eric Lippert says:

No, "(a)" is not in R. Remember, the rule is:

* if u and w are strings in R then (uw) is a string in R.

"a" is a string in R, but the empty string is not, so (a) is not a string in R.

10. Lance Fisher says:

Evin, can you enumerate an infinite set? My intuative definition of enumerate is "make a (finite) list." Perhaps this is wrong.

However, I do know that you can’t order any old infinite set. Try ordering the points on a plane.

11. pcooper says:

Ordering the points on a plane doesn’t seem all that hard. Point (x1,y1) < Point (x2,y2) if (x1 < x2) or (x1 = x2 and y1 < y2).

12. Eric Lippert says:

You two are using two different definitions of "order".

pcooper is describing a total order on an arbitrary set. That is, given any two members of the set can we consistently describe whether one is >, = or < the other. Total orders should have nice properties like antisymmetry and transitivity.

Lance is describing enumeration. An enumeration imposes a total order on a set, but there are sets that can be ordered that cannot be enumerated.

By "enumerated" I mean "put into 1-1 correspondance with the positive integers". That is, for every positive integer you can find a unique associated member, and for every member you can find a unique integer.

Indeed, the set of real pairs is not enumerable. The set of reals is not enumerable either. The set of rationals is though.

13. pndmnm says:

pcooper, it’s worth noting that a total order on the pairs of reals isn’t nearly as useful as giving them field order would be (a field is "ordered" if you can seperate the non-zero elements into two sets, P and N, x is in P if and only if -x is in N, and P is closed under the field operations) — the standard field of pairs of reals (the complex plane) cannot be ordered in this fashion (by Artin-Schreier).

(I realize this has little to nothing to do with the topic at hand, but if anyone gets here by googling for orders on pairs of reals…)

Great sequence, I look forward to reading the rest of it.

14. When it comes to validating input regular expression becomes a very important part of your security plan.&amp;nbsp;…