# Every Program There Is, Part Two

[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]

Suppose we want to come up with a CFG for numbers with additions. Consider this very simple grammar with only one nonterminal. We’ll say that an expression is either a number, or a sum of two expressions:

X: 1 | 2 | 3 | X + X

We can make the following series of substitutions, each time substituting the leftmost remaining nonterminal:

X
X + X
X + X + X
1 + X + X
1 + 2 + X
1 + 2 + 3

We can also make this series of substitutions, again, every time substituting for the leftmost remaining nonterminal:

X
X + X
1 + X            <-- different!
1 + X + X
1 + 2 + X
1 + 2 + 3

How interesting; we had two rather similar but nonetheless different “leftmost” derivations for the same sequence of terminals! CFGs for which there exist strings of terminals that have two different derivations are called “ambiguous” grammars. It is often extremely difficult to correctly parse an ambiguous CFG, though of course we are not attempting to parse here. But it is also difficult to use an ambiguous CFG to generate all the strings in a language because what sometimes happens is you end up generating the same string more than once. Ideally for our purposes we would like to guarantee that a given string is only generated once. If the CFG is non-ambiguous then we know that trying all possible combinations of rule applications will not lead to the same string generated twice.

Unfortunately, the general problem of ambiguity turns out to be provably unsolvable. There is no general way to (1) look at a grammar and determine whether it is ambiguous, or (2) take an ambiguous CFG and find an “equivalent” unambiguous CFG. However, not all is lost. There are ways to prove that particular grammars are unambiguous, even though there is no general solution. And there are lots of common sneaky tricks for making the sorts of grammars we typically encounter in programming languages into unambiguous grammars.  In this particular case we could say:

S: N | S + N
N: 1 | 2 | 3

And now the only leftmost derivation for 1 + 2 + 3 is

S
S + N
S + N + N
N + N + N
1 + N + N
1 + 2 + N
1 + 2 + 3

I’m not going to prove that this is an unambiguous grammar; that would take us too far afield. Just take my word for it, this is unambiguous. From now on we’ll be trying to produce only unambiguous grammars.

Next time: saying nothing requires extra work

[This is part of a series on generating every string in a language. The previous part is here. The next part is here.]

Tags

1. Tim Goodman says:

I think a proof could go something like this:

Let’s consider how we can make a given string with P plusses.

(1) Starting from one S, it is only possible to produce strings with one S in the left most position, or strings with no S’s.  (I believe this is provable by induction.)

(2) We can’t increase the number of plusses if our string doesn’t contain S.  So if our current string has fewer than P plusses, we can’t eliminate S.  Thus, working from the possible strings as given by (1) and only doing leftmost replacement the only replacement that will work is S –> S + N

(3) It’s impossible to reduce the number of plusses, so if we currently have more than P plusses we can’t produce the desired string.

(4) If we currently have P plusses and we have an S on the left, then given (3) the only valid leftmost replacement is S –> N (anything else will give more than P plusses).

(5) If we currently have P plusses and we don’t have an S on the left, then the only non-terminals we have are N’s (since from (1) we know we can’t have an S anywhere else), and N’s can only be replaced with terminals, thus the only valid leftmost replacement is to replace the leftmost N with the appropriate terminal.

Thus in all cases there is at most one valid replacement to make in order to proceed toward a given string with P plusses.  So the route to our given string is fully determined, and the grammar is unambiguous.

2. Tim Goodman says:

As for step (1), you can’t produce a string which contains an S from one which doesn’t contain an S, so it suffices to show that without ever eliminating all S’s we can only produce strings of the form S + N + N + N + . . .

So, considering only strings that contain an S,

We know that for the P=0 case we can only have the string S.

We assume that for the P=N case the only string we can get without eliminating S is S + N + N + N + … (P times)  [This is of course how induction works, you assume the P = N case and show that this implies the P = N + 1 case]

Based on that assumption, the only replacement that won’t eliminate S is the one that gives S + N + N + N + … (P+1 times)

It’s impossible to get a string with P+1 plusses from one with less than P plusses or from one with more than P + 1 plusses, and it’s impossible to go from S + N + N + N + … (P+1 times) to another string with P+1 plusses without eliminating S, so we see that S + N + N + N + … (P+1 times) is the *only* string with P+1 plusses that we can produce without eliminating S.

This completes the proof by induction.

3. Joren says:

Your proof seems correct, but I don’t see why you would use induction over P for part (1). Induction through the productions seems simpler.

Looking at the production rules, it is clear that a string that does not contain the symbol S can only produce other strings that also not contain S. (N -> 1, N -> 2, N -> 3 are the only possibilities.) It is also clear that a string starting with S and containing S only once can produce either a string without S (S -> N) or a string that also starts with S and contains it only once (S -> S + N). [These two observations compose the inductive step.]

The starting string is "S", which starts with S and contains it only once. [basis step.]

Therefore, the grammar only produces strings that do not contain S, and strings that contain S exactly once, and indeed start with S.

4. Tim Goodman says:

You have a good point, Joren.  There’s no particular reason I did it the way I did other than that it was the first thing that occurred to me (I was thinking in terms of number of plusses because my overall reasoning for the full proof was basically "We can’t do anything except S –> S+N until we get the desired number of plusses.")

5. Chris B says:

On the notion of ambiguity, I have often wondered how the C# compiler differentiates expressions like

if(X < Y > Z) // illegal, comparison operators are non-associative

and

if(X<Y>.Z) // valid as long as the class X<T> contains a boolean property/field named Z and Y is  a valid type

How do the lexer and the parser interact in this case?  It seems that the lexer can’t be sure of the meaning of the tokens unless it is allowed to make some use of the syntactic analysis, but I think I remember reading something at some point that said that the C# compiler doesn’t do that.

Your questions make numerous invalid assumptions, so let’s clear those up first.

First off, you seem to think that the lexer’s job is to determine the meaning of the tokens. That is not at all the purpose of the lexer. The lexer determines the boundaries of the tokens in the character stream; that is its only job. The parser organizes the resulting token stream into a syntax tree, and then then semantic analyzer determines the meaning of every element in the syntax tree, and then the code generator generates the code.

You correctly note that each of these stages proceeds with no hints traveling backwards in time from subsequent stages. Lexing needs no hints from parsing or semantic analysis to do a correct lex. Parsing needs no hints from semantic analysis.

This is a highly desirable property for a language to have, a property which, astonishingly, ECMAScript does not have. See section 7 of the E5 spec, which states that the goal production of the lexical grammar changes depending on where you are in the syntactic analysis of the tokens! The tokenizer has to know what the parser is going to do in the future.

Second, you have not demonstrated a program fragment that has an ambiguous lex. The token boundaries are unambiguous. There are a couple places where the lexer could get into trouble with generics, for instance, in A>>B there’s a right shift operator but in List<Func<A>>B that’s two close-angle-brackets. The lexer resolves this by always lexing > as an angle bracket; the parser then detects two angle brackets in a row where an operator is expected and treats that as a shift operator. In other words, the right shift operator is two tokens, not one. But everything in your program fragments lexes unambiguously. Whether it parses unambiguously is the question.

Third, you state that A < B > C must be an error because “comparison operators are not associative”. That statement is completely false; comparison operators are documented as being left associative in the specification. Your program fragment is perfectly legal lexically and syntactically. It is thereby a legal program provided that semantic analysis can find meanings for each operator. Here’s a program which uses this pattern and compiles and runs just fine:

class OrderedBool
{
public static OrderedBool True = new OrderedBool();
public static OrderedBool False = new OrderedBool();
public static implicit operator bool(OrderedBool b)
{ return b == True; }
public static implicit operator OrderedBool(bool b)
{ return b ? True : False; }
public static bool operator<(OrderedBool b1, OrderedBool b2)
{ return b1 == False && b2 == True; }
public static bool operator>(OrderedBool b1, OrderedBool b2)
{ return b1 == True && b2 == False; }
static void Main()
{
System.Console.WriteLine(True < False > True);
}

If you’re following along carefully you might now ask why it is that List<Func<A>>B parses unambiguously as a type followed by a variable declaration and not as a sequence of less than and right shift operators; we could probably come up with some contrived example where that was legal semantically, as I did with A<B>C. The answer is that the parser always knows when a type is expected or unexpected; if it expects a type, it parses the token stream as a type. If it expects an expression, it parses it as an expression.

Now that we have eliminated your invalid assumptions, your question remains: how does the parser analyze expressions which might be ambiguously parsed? I would draw your attention to three parsing problems that introduce ambiguities.

First is the cast operator. Is (A)-B a cast of -B to type A, or is it the subtraction of B from A? For an analysis of these ambiguities, see section 7.6.6 of the specification.

Second is generic method calls. Is F(G<A,B>(7)) a call to a method F that takes two bools, G<A and B>(7), or is it a call to a method F that takes one argument which is the result of generic function call G<A,B>(7) ? For an analysis of these ambiguities see section 7.5.4.2. This section answers your question about X<Y>.Z vs X<Y>Z.

Third is query comprehensions. If you have var select = 10; var query = from a in b select a + select; do the “selects” in the query mean the local variable, or the keyword select, or both or what? See section 7.15.1 for the rules there.

I hope that answers your questions. Let me know if any of that is not clear, or if there is an ambiguity that we missed.

— Eric

6. Chris B says:

That helps a LOT.  I should have realized the problem with X<Y>Z was not the associativity of the operators, but rather the semantic meaning based on the message the compiler gave me (Operator cannot be applied to operands of type bool and bool).  OTOH, I do find it surprising that the comparison operators are left associative.  I can only assume that this is due to operator overloading and implicit conversions.  At least, I can’t think of any other means to provide semantic meaning to such an expression in C#.

7. Pavel Minaev says:

@Chris:

I doubt they are deliberately left-associative in a sense that this is intended to be actually used. However, they have been historically left-associative since at least C, and probably older than that (IIRC, wasn’t Boolean in Pascal an enumerated type, and thus its values ordered and comparable with < and >?). Definitely within the curly braces language family, every other language out there has done it that way.

8. Random832 says:

Wouldn’t it have been simpler to allow >> [one token] to close two type argument lists when in such a context? It’d support all of the same scenarios without introducing a brand-new kind of whitespace sensitivity.

I highly doubt it. Try writing a BNF grammar for your way and see if it’s concise.

Alternately, why not just say “ok, weird, but no harm done” when encountering “> >” or “> >=” in an operator context, and treat it as a shift operator anyway?

That would be weird.

So… are these not-quite-ambiguities like A<B>C, List<Func<A>>B, etc, the _reason_ it is illegal to use any expression as a statement [CS0201], or was this rule (which happened to avoid a whole class of ambiguities the spec would otherwise have had to deal with) made for another reason?

Since that rule predates the invention of generics, it cannot be the reason. The rule that an expression statement must be a method call, assignment, increment, decrement or construction is from C# 1.0. Generics were not added until C# 2.0. The reason is because we like the principle that statements have side effects; ideally, a statement logically has exactly one side effect. Look at it this way: what is the compelling *benefit* of allowing a statement which has no side effect and only calculates a value and then discards it? Such a thing is almost certainly an error, so there had better be an extremely good reason why we should make it legal.

How does C deal with the cast thing?

I don’t know. I’m not an expert on the grammar of C. — Eric

9. Random832 says:

“That would be weird.” Much less weird than “These productions are treated specially” without bothering to even put it in the grammar, particularly considering it would only be weird from a historical perspective [in that it was once a single token], whereas the current solution is weird even looking at the C# 2.0 language in isolation [literally nothing else anywhere is sensitive to whitespace occuring between tokens].

I think you’ll find that we did in fact “bother” to put it in the grammar. See section 2.4.5 of the specification. — Eric

What’s the ‘extremely good reason’ for introducing this bizarrely singular case of special whitespace handling? Because not doing so would allow people to write programs that subjectively look ‘weird’?

Because doing so solves the ambiguity problem without introducing a weird inconsistency in the language. — Eric

Why not go further and regulate whitespace appearance in more places?

I think that would have been awesome. I don’t know if I would go as far as Python in making whitespace significant, but I think that having a standardized style for whitespace that could be optionally enforced by the compiler would be a nice feature. Unfortunately it is too late to make whitespace more significant now; that would break too many programs. Next time you design a new language from scratch, keep that in mind. — Eric

Requiring whitespace around binary operators and forbidding it for unary ones would disambiguate that cast, for instance – (A)-B is as different from (A) – B as x >> 1 is from x > > 1. Same for F(G < A, B > (7)) vs F(G<A, B>(7)) [the former is the one with relational operators, the latter with a generic method call]

Sure. But of course that would break lots of existing programs, so we cannot do it now. — Eric

10. Random832 says:

” think you’ll find that we did in fact “bother” to put it in the grammar. See section 2.4.5 of the specification. — Eric”

“The vertical bar in the right-shift and right-shift-assignment productions are used to indicate that, unlike other productions in the syntactic grammar, no characters of any kind (not even whitespace) are allowed between the tokens. These productions are treated specially in order to enable the correct  handling of type-parameter-lists (§10.1.3).”

The fact that it provides a text explanation of what the vertical bar notation means made me think that it’s a notation invented for the vaguely defined “treated specially” statement in the text, rather than being an actual part of a BNF grammar (nevermind that it is unclear as to whether this is meant to be part of the lexical grammar or the syntactic grammar) It is also impossible for a parser to use this rule if the terminals of the syntactic grammar are in fact tokens (implied by 2.2) rather than something different like “tokens plus a way to find out what characters are between it and the adjacent tokens”. The information on whether tokens were separated by whitespace is supposed to have been lost by the time it gets to the parser, and would be if not for vaguely-defined “special” handling. 2.3 outright states that whitespace has no impact on syntactic structure except to separate tokens.

Including it in the grammar would mean defining whitespace (maybe only after a > token) to be an actual token, and defining in the syntactic grammar where it is allowed (i.e. everywhere but inside a shift operator). Made-up >|> notation isn’t a substitute for that.

Yes, it is a substitute for that. That’s precisely what it is. — Eric

“Because doing so solves the ambiguity problem without introducing a weird inconsistency in the language. — Eric”

The inconsistency avoided here is not _in_ the language, it is between this language and other languages. Having two (and only two) pairs of tokens which cannot be separated by whitespace (in some contexts) is itself a weird inconsistency. If anything it would have been slightly more consistent to _never_ allow them to be separated by whitespace (making List<Action<T> > a syntax error) – that at least keeps it so that the lexer is the only stage concerned with whitespace.

So your proposal is to prioritize the logical independence of the lexer and the parser over the usefulness of the language to our customers? Do you by any chance have a background in academia? — Eric

If the design required you to make an operator that’s two tokens, that’s already an inconsistency [though a small one].

The cast operator also has two tokens, a ( and a ). The conditional operator has two, a ? and a :. The checked and unchecked opererators have three. An operator that has a bunch of tokens in it is not unusual. — Eric

Requiring the parser to break encapsulation just to maintain the illusion that it’s still one token in the places it was used before doesn’t “take it back” – it introduces a second, much larger, inconsistency.

Sure, I can see that point of view. Since you’re the first person, to my knowledge, to complain about it, I suspect that this “much larger” inconsistency is still pretty darn small.

C# is certainly not the only professional programming language to have rules where whitespace around tokens affects the syntactic analysis. It is somewhat surprising in fact that there is only this one special case. JScript has *many* places where whitespace affects the grammar. In any event, it is not our goal to make a grammatically or lexically “pure” language. That is desirable of course, but it is not a primary goal. Our goal is to make a useful language that makes it easy for customers to express the desired idiom without weird “gotchas”. I would bet that fewer than one user in ten thousand has any clue that there is something special about the handling of > in the grammar. That’s a sign that we’ve succeeded. — Eric