C++/CLI keywords: Under the hood

C++/CLI specifies several keywords as extensions to ISO C++. The way they are handled
falls into five major categories, where only the first impacts the meaning of existing
ISO C++ programs.

1. Outright reserved words

As of this writing (November 22, 2003, the day after we released the candidate base
document), C++/CLI is down to only three reserved words:

  gcnew generic nullptr

An existing program that uses these words as identifiers and wants to use C++/CLI
would have to rename the identifiers. I'll return to these three again at the end.

All the other keywords, below, are contextual keywords that do not conflict with identifiers.
Any legal ISO C++ program that already uses the names below as identifiers will continue
to work as before; these keywords are not reserved words.

2. Spaced keywords

One implementation technique we are using is to specify some keywords that include
embedded whitespace. These are safe: They can't possibly conflict with any user identifiers
because no C++ program can create an identifier that contains whitespace characters.
[I'll omit the obligatory reference to Bjarne's classic April Fool's joke article
on the whitespace operator. :-) But what I'm saying here is true, not a joke.]

Currently these are:

  for each
  enum class/struct
  interface class/struct
  ref class/struct
  value class/struct

For example, "ref class" is a single token in the lexer, and programs
that have a type or variable or namespace named ref are entirely
unaffected. (Somewhat amazingly, even most macros named ref are
unaffected and don't affect C++/CLI, unless coincidentally the next token in the macro's
definition line happens to be class or struct; more
on this near the end.)
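To make the "single token" idea concrete, here is a toy sketch of a lexer that fuses a spaced keyword into one token. It is an illustration only, not how VC++ or any other real compiler is implemented: the names fuses, read_word, tokenize, and check_spaced_keywords are hypothetical, punctuation is simplified to one character per token, and macros and preprocessing are not modeled at all.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule table: do `first`, whitespace, `second` form one spaced keyword?
static bool fuses(const std::string& first, const std::string& second) {
    if (first == "for") return second == "each";
    const bool prefix = first == "enum" || first == "interface" ||
                        first == "ref" || first == "value";
    return prefix && (second == "class" || second == "struct");
}

// Reads the identifier-like word that starts at position i (empty if there is none).
static std::string read_word(const std::string& s, std::size_t i) {
    std::size_t j = i;
    while (j < s.size() &&
           (std::isalnum(static_cast<unsigned char>(s[j])) || s[j] == '_')) {
        ++j;
    }
    return s.substr(i, j - i);
}

// Splits src into tokens, fusing spaced keywords such as "ref class" into one token.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        if (std::isspace(static_cast<unsigned char>(src[i]))) { ++i; continue; }
        std::string word = read_word(src, i);
        if (word.empty()) {                  // punctuation: one character per token
            tokens.push_back(std::string(1, src[i++]));
            continue;
        }
        i += word.size();
        std::size_t k = i;                   // look ahead past any whitespace
        while (k < src.size() && std::isspace(static_cast<unsigned char>(src[k]))) ++k;
        const std::string next = read_word(src, k);
        if (fuses(word, next)) {             // e.g. "ref" + "class" -> "ref class"
            word += " " + next;
            i = k + next.size();
        }
        tokens.push_back(word);
    }
    return tokens;
}

// Small self-check of the behavior described in the text.
void check_spaced_keywords() {
    using Tokens = std::vector<std::string>;
    assert((tokenize("ref class R;") == Tokens{"ref class", "R", ";"}));
    assert((tokenize("int ref;") == Tokens{"int", "ref", ";"}));
}
```

In this sketch, ref only becomes part of a keyword when class or struct follows it, so a variable, type, or namespace named ref comes out as an ordinary identifier token.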

3. Contextual keywords that can never appear where an identifier could appear

Another technique we used was to define some keywords that can only appear in positions
in the language grammar where today nothing may appear. These too are safe: They can't
conflict with any user identifiers because no identifiers could appear where the keyword
appears, and vice versa. Currently these are:

  abstract finally in
  override sealed where

For example, abstract as a C++/CLI keyword can only appear in a class
definition after the class name and before the base class list, where nothing can
appear today:

  ref class X abstract : B1, B2 {  // ok, can only be the keyword
    int abstract;                  // ok, just another identifier
  };

  class abstract { };              // ok, just another identifier

  namespace abstract { /*...*/ }   // ok, just another identifier

4. Contextual keywords that can appear where an identifier could appear

Some keywords can appear in a grammar position where an identifier could also appear,
and this is the case that needs some extra attention. There are currently five keywords
in this category:

  delegate event initonly
  literal property

In such grammar positions, when the compiler encounters a token that is spelled the
same as one of these keywords, the compiler can't know whether the token means the
keyword or whether it means an identifier until it first does some further lookahead
to consider later tokens. For example, consider the following inside a class scope:

  property int x;   // ok, here property is the contextual keyword

  property x;       // ok, if property is the name of a type

Now imagine you're a compiler: What do you do when you hit the token property as
the first token of the next class member declaration? There's not enough information
to decide for sure whether it's an identifier or a keyword without looking further
ahead, and C++/CLI has to specify the decision procedure -- the rules for deciding
whether it's a keyword or an identifier. As long as the user doesn't make a mistake
(i.e., as long as it's a legal program with or without C++/CLI) the answer is clear,
because there's no ambiguity.

But now the "quality of diagnostics" issue rears its head, in this category of
contextual keywords and this category only: What if the user makes a mistake?
For example:

  property x;       // error, if no type "property" exists

Let's say that we set up a disambiguation rule with the following general structure
(I'll get specific in just a moment):

  1. Assume one case and try to parse what comes next that way.

  2. If that fails, then assume the other case and try again.

  3. If that fails, then issue a diagnostic.

In the case of property x; when there's no type in scope named property,
both #1 and #2 will fail and the question is: When we get to the diagnostic in case
#3, what error message is the user likely to see? The answer almost certainly is,
a message that applies to the second "other" case. Why? Because the compiler already
tried the first case, failed, backed up and tried the second "other" case -- and it's
still in that latter mode with all that context when it finally realizes that didn't
work either and now it has to issue the diagnostic. So by default, absent some (often
prodigious) amount of extra work inside the compiler, the diagnostic that you'll get
is the one that's easiest to give, namely the one for the case the compiler was most
recently pursuing, namely the "other" case mentioned in #2 -- because the compiler
already gave up on the first case, and went down the other path instead.

So let's get specific. Let's say that the rule we picked was:

  1. Assume that it's an identifier and try to parse it that way
     (i.e., by default assume no use of the keyword extension).

  2. If that fails, then assume that it's the keyword and try again.

  3. If that fails, then issue a diagnostic.

Under that rule, what's the diagnostic the user gets on an illegal declaration of
property x;? One that's in the context of #2 (keyword), something like "illegal property
declaration," perhaps with a "the type 'x' was not defined" or a
"you forgot to specify the type for property 'x'" in there somewhere.

On the other hand, let's say that the rule we picked was:

  1. Assume that it's the keyword and try to parse it that way.

  2. If that fails, then assume that it's an identifier and try again.

  3. If that fails, then issue a diagnostic.

Under this rule, the diagnostic that's easy to give is something like "the type 'property'
was not defined."
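The two orderings can be compared side by side with a toy model. This is a sketch under heavy assumptions, not a real parser: a member declaration is just a short list of tokens, a set of strings stands in for the types in scope, and every name here (Outcome, Order, parse_as_identifier, parse_as_keyword, parse_member, check_diagnostics) is hypothetical. The diagnostic strings are modeled on the ones quoted above. The only point is that the message the user sees comes from whichever reading was tried last.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;
using Types = std::set<std::string>;

struct Outcome {
    bool ok;
    std::string message;   // what was parsed, or the diagnostic
};

enum class Order { IdentifierFirst, KeywordFirst };

// Reading A: the first token names a type:  <type> <name> ;
static Outcome parse_as_identifier(const Tokens& t, const Types& types) {
    if (t.size() != 3 || t[2] != ";") return {false, "expected '<type> <name>;'"};
    if (types.count(t[0]) == 0) return {false, "the type '" + t[0] + "' was not defined"};
    return {true, "data member '" + t[1] + "' of type '" + t[0] + "'"};
}

// Reading B: the first token is the keyword:  property <type> <name> ;
static Outcome parse_as_keyword(const Tokens& t, const Types& types) {
    const std::string prefix = "illegal property declaration";
    if (t.size() == 3 && t[0] == "property" && t[2] == ";")
        return {false, prefix + ": you forgot to specify the type for property '" + t[1] + "'"};
    if (t.size() != 4 || t[0] != "property" || t[3] != ";") return {false, prefix};
    if (types.count(t[1]) == 0)
        return {false, prefix + ": the type '" + t[1] + "' was not defined"};
    return {true, "property '" + t[2] + "' of type '" + t[1] + "'"};
}

// Steps 1 to 3 of the rule: try one reading, then the other, then diagnose.
Outcome parse_member(const Tokens& t, const Types& types, Order order) {
    const bool id_first = (order == Order::IdentifierFirst);
    const Outcome first = id_first ? parse_as_identifier(t, types)
                                   : parse_as_keyword(t, types);
    if (first.ok) return first;
    // If this second attempt also fails, its message is the one the user sees.
    return id_first ? parse_as_keyword(t, types) : parse_as_identifier(t, types);
}

// The erroneous declaration from the text, with no type named property in scope.
void check_diagnostics() {
    const Types types{"int"};
    const Tokens bad{"property", "x", ";"};
    assert(parse_member(bad, types, Order::KeywordFirst).message ==
           "the type 'property' was not defined");
    assert(parse_member(bad, types, Order::IdentifierFirst).message ==
           "illegal property declaration: you forgot to specify the type for property 'x'");
}
```

Legal declarations come out the same under either order in this model; only the wording of the error for the illegal one changes.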

Which is better?

This illustrates why it's very important to consider common mistakes and whether the
diagnostic the user will get really applies to what he was probably trying to do.
In this case, it's probably better to emit something like "no type named 'property'
exists" than "you forgot to specify a type for your property named 'x'"
-- the former is more likely to address what the user was trying to do, and it also
happens to preserve the diagnostics for ISO C++ programs.

More broadly, of course, there are other rules you can use than the two "try one way
then try the other" variants shown above. But I hope this helps to give the flavor
for the 'quality of diagnostics' problem.

Aside: There's usually no ambiguity in the case of property (or
the other keywords in this category); the only case I know of where you could write
legal C++/CLI code where one of these five keywords could be legally interpreted
both ways, both as the keyword and as an identifier, is when the type has a global
qualification. Here's an example courtesy of Mark Hall:

  initonly :: T t;

Is this a declaration of an initonly member t of type ::T (i.e., initonly ::T t;), or
a declaration of a member t of type initonly::T (i.e., initonly::T t;, which is legal
ISO C++ if initonly is the name of a namespace or class)? Our current thinking is to
adopt the rule "if it can be an identifier, it is," and so this case would mean the
latter, either always (even if there's no such type) or perhaps only if there is such
a type.
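For reference, here is a small sketch of the ISO C++ reading of that line, which is the reading the "if it can be an identifier, it is" rule would preserve. The namespace, the two T types, and the Holder class are all hypothetical names chosen for illustration; only the ISO C++ side is shown, since the keyword reading needs a C++/CLI compiler.

```cpp
#include <type_traits>

namespace initonly {       // hypothetical namespace that happens to be called initonly
    struct T { };
}

struct T { };              // a different type, ::T, at global scope

struct Holder {
    initonly :: T t;       // ISO C++: a member t of type initonly::T
};

// The member's type is the namespace-qualified T, not the global one.
static_assert(std::is_same<decltype(Holder::t), initonly::T>::value,
              "t has type initonly::T");
static_assert(!std::is_same<decltype(Holder::t), ::T>::value,
              "t does not have type ::T");
```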

I feel compelled to add that the collaboration and input over the past year-plus from
Bjarne Stroustrup and the folks at EDG (Steve Adamczyk, John Spicer, and Daveed
Vandevoorde) has been wonderful and invaluable in this regard
specifically. It has really helped to have input from other experienced compiler writers,
including in Bjarne's case the creator of the first C++ compiler and in EDG's case
the folks who have one of the world's strongest current C++ compilers. On several
occasions all of their input has helped get rid of inadvertent assumptions about "what's
implementable" and "what's diagnosable" based on just VC++'s own compiler implementation
and its source base. What's easy for one compiler implementation is not necessarily
so for another, and it's been extremely useful to draw on the experience of comparing
notes from two current popular ones to make sure that features can be implemented
readily on various compiler architectures and source bases (not just VC++'s) and with
quality user diagnostics.

5. Not keywords, but in a namespace scope

Finally, there are a few "namespaced" keywords. These make the most sense for pseudo-library
features (ones that look and feel like library types/functions but really are special
names known to the compiler because the compiler does special things when handling
them). They appear in the stdcli namespace and are:

  array interior_ptr pin_ptr
  safe_cast

That's it.

Now, for a moment let's go back to case #1, reserved words. Right now we're down to
three reserved words. What would it take to get down to zero? Consider the cases:

- nullptr: This has been proposed in WG21/J16 for C++0x, and at the last meeting three weeks ago the evolution working group (EWG) was favorable to it but wanted a few changes. The proposal paper was written by me and Bjarne, and we will revise the paper for the next meeting to reflect the EWG direction. If C++0x does adopt the proposal and chooses to take the keyword nullptr then the list of C++/CLI reserved words goes down to two and C++/CLI would just directly follow the C++0x design for nullptr, including any changes C++0x makes to it.

- gcnew: One obvious way to avoid taking this as a reserved word would be to put it into bucket #2 as a spaced keyword, "gc new".

- generic: Similarly, a spaced keyword (possibly "generic template") would avoid taking this reserved word. Unfortunately, spelling it "<anything> template" is not only ugly, but seriously misleading because a generic really is not at all a template.

Is it worth it to push all the way down to zero reserved words in C++/CLI? There are
pros and cons to doing so, but I've certainly always been sympathetic to the goal
of zero reserved words; Brandon and
others will surely tell you of my stubborn campaigning to kill off reserved words
(I think I've killed off over a half dozen already since I took the reins of this
effort in January, but I haven't kept an exact body count).

I think the right time to decide whether to push for zero reserved words is probably
near the end of the C++/CLI standards process (summer-ish 2004). At that point, when
all other changes and refinements have been made and everything else is in its final
form, we will have a complete (and I hope still very short) list of places where C++/CLI
could change the meaning of an existing C++ program, and that will be the best time
to consider them as a package and to make a decision whether to eliminate some or
all of them in a drive-it-to-zero cleanup push. I am looking forward to seeing what
the other participants in all C++ standards arenas, and the broader community, think
is the right thing to do as we get there.

Putting it all together, what's the impact on a legal ISO C++ program? Only:

- The (zero to three) reserved words, which we may get down to zero.

- Macros with the same name as a contextual keyword, which ought to be rare because macros with all-lowercase names, never mind names that are common words, are already considered bad form and liable to break way more code than just C++/CLI. (For example, if a macro named event existed it would already be breaking most attempts to use Standard C++ iostreams, because the iostreams library has an enum named event.)

Let me illustrate the macro cases with two main examples that affect the spaced keywords:

  // Example 1: this has a different meaning in ISO C++ and C++/CLI

  #define interface struct

In ISO C++, this means change every instance of interface to struct.
In C++/CLI, because "interface struct" is a single token, the macro
means instead to change every instance of "interface struct" to nothing.

Here's the simplest workaround:

  // Workaround 1: this has the same meaning in both

  #define interface interface__
  #define interface__ struct
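As a quick check of the ISO C++ half of that claim, the sketch below runs the two-step macro through an ordinary C++ preprocessor: interface expands to interface__, which is rescanned and expands to struct. The Shape type and the sides_of helper are hypothetical names for illustration, and the #undef lines are added only to keep the macros from leaking into later code. The C++/CLI half is not exercised here.

```cpp
#include <type_traits>

// Workaround 1 from the text, as an ISO C++ preprocessor sees it.
#define interface interface__
#define interface__ struct

interface Shape {          // expands to: struct Shape {
    int sides;             // public by default, because Shape is a struct
};

// Reads the member from outside the type, which only works if it is public.
constexpr int sides_of(const Shape& s) { return s.sides; }

// Keep the macros from affecting anything that follows.
#undef interface
#undef interface__

static_assert(std::is_class<Shape>::value, "interface expanded to struct");
```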

Here's another example of a macro that can change the meaning of a program in ISO
C++ and C++/CLI:

  // Example 2: this has a different meaning in ISO C++ and C++/CLI

  #define ref const
  ref class C { } c;

In ISO C++, ref goes to const and the last line
defines a class C and simultaneously declares a const object of that
type named c. This is legal code, albeit uncommon. In C++/CLI, the
macro has no effect on the class declaration because "ref class"
is a single token (whereas the macro is looking for the token ref alone,
not "ref class") and so the last line defines a ref class C and
simultaneously declares a (non-const) object of that type named c.
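The ISO C++ half of this can be checked with a standard compiler. The sketch below is a slight variation on Example 2: the explicit initializer = C() is an addition, there so that the const object is initialized no matter which compiler or language mode is used, and the #undef is added only to contain the macro. The non-const C++/CLI outcome described above is not exercised.

```cpp
#include <type_traits>

#define ref const
ref class C { } c = C();   // ISO C++ sees: const class C { } c = C();
#undef ref

// In ISO C++ the macro applied, so c is a const object of type C.
static_assert(std::is_const<decltype(c)>::value, "c is const in ISO C++");
```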

Here's the simplest workaround:

  // Workaround 2: this has the same meaning in both

  #define REF const
  REF class C { } c;

But hey, macro names are supposed to be uppercase anyway. :-)

I hope these cases are somewhere between obscure and pathological. At any rate, macros
with short and common names are generally unusual in the wild because they just break
so much stuff. I would rate example 1 above as fairly obscure (although windows.h
has exactly that line in it, alas) and example 2 as probably outright pathological
(as I would rate all macros with short and common names).

Whew. That's all for tonight.