String Literal Conversion to String: Is It a Disaster?

07/23/2004

A reader asks,

Sender: Jack
re: String Literals are now a Trivial Conversion to String
Won't it break most of existing libraries who will try to port to C++/CLI? One override for String^ will break a lot of user code and make calls for the overriden function with string literals look much uglier. Maybe it is better to make two types of string literals differ by, say literal prefix (i.e. old string literals would look like "this" and new ones like c"this")?

As Adam Merz noted in his response to Jack's question of a literal modifier,

That is exactly what Managed Extensions for C++ did with the S prefix, and what they are trying to avoid at this point (I would think)...

Adam is correct. Here is how I described the change in an internal translation guide between the Managed Extensions for C++ and the revised C++/CLI language – this will give you the historical context for why we made the original change.

In the original language design, a managed string literal was indicated by prefacing the string literal with an S. For example,

String *ps1 = "hello";

String *ps2 = S"goodbye";

The difference in performance overhead between the two initializations turns out to be non-trivial, as the following MSIL representation, seen through ildasm, demonstrates:

// String *ps1 = "hello";

ldsflda valuetype $ArrayType$0xd61117dd

modopt([Microsoft.VisualC]Microsoft.VisualC.IsConstModifier)

'?A0xbdde7aca.unnamed-global-0'

newobj instance void [mscorlib]System.String::.ctor(int8*)

stloc.0

// String *ps2 = S"goodbye";

ldstr "goodbye"

stloc.0

That’s a pretty remarkable savings for just remembering [or learning] to prefix a literal string with an S. In the revised V2 language, the handling of string literals is made transparent, determined by the context of use. The S no longer needs to be specified.

What about cases in which we need to explicitly direct the compiler to one interpretation or another, as in the case of an overloaded pair of functions?

void f(const char*);

void f(String^);

f("ABC"); // by default calls f(const char*)

In the revised language, an explicit cast is used rather than the prefix S. For example,

f(( String^ )"ABC"); // ok: invokes f( String^ )
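The same selection-by-cast idea can be sketched in portable ISO C++. This is only an analogy: ManagedString is a hypothetical stand-in for String^ (it is not a CLI type), and its conversion from a literal is a user-defined conversion, so the unadorned call prefers f(const char*), much like the by-default behavior shown above. The demo function name is likewise made up for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for System::String^; NOT a CLI type.
struct ManagedString {
    std::string text;
    ManagedString(const char* s) : text(s) {}  // converting constructor
};

// Overloaded pair mirroring f(const char*) and f(String^).
// Each returns its own name so the chosen overload can be observed.
std::string f(const char*) { return "f(const char*)"; }
std::string f(const ManagedString&) { return "f(ManagedString)"; }

// The asserts record which overload each call selects.
void demo_explicit_cast() {
    // Array-to-pointer is an exact match; it beats the user-defined conversion.
    assert(f("ABC") == "f(const char*)");

    // An explicit cast names the intended parameter type, as (String^)"ABC" does.
    assert(f(static_cast<ManagedString>("ABC")) == "f(ManagedString)");
}
```

The sketch uses static_cast where the translation guide uses a C-style cast; either spelling states the intended target type at the call site.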

As you can see, the revised language originally sought merely to correct a failing in the original design – the surprising performance penalty for a user misstep in the declaration of a CLI string literal.

This subsequent refinement during the standardization of the language under ECMA represents imo a rebalancing of the CLI and native type systems -- an acknowledgement of the need for the design to be Janus-faced (an image I introduced in one of the three introductory blogs) – that is, to look equally on the needs of the CLI and native programmer.

Which brings us back to Jack's original question: isn't this going to blow existing code out of the water? Or, to put it more crudely, isn't this a disaster?

Well, I don't believe so. I wasn't part of the decision-making process, though, and since the change is, as far as I am aware, currently undocumented, I have not seen any analysis of its effect. So let me give you my take, and my reasoning as to why I don't see it as being quite as dire as Jack does.

The trivial conversion of a string literal to String^ is added as the equal of the conversion to const char*; it does not take precedence over it. The determination of a best viable function therefore introduces an ambiguity rather than silently changing the resolution (and therefore the behavior) of the program. The user would then explicitly cast the argument to the intended type; admittedly, this could be burdensome, but it is not dangerous.

Under C#, a change in the access level of a class member silently changes the resolution of a named reference in an existing program – which may be completely unknown to the person making the change. We are not talking here of that level of semantic rupture, but merely the difficulty of accommodating the mechanical importation of native code into the CLI program space.

So, let us try to address the issue of burden, which is what I suspect this all reduces to. I don't believe it is really all that extensive because the C++ overload mechanism operates by scope rather than signature, and so the set of candidate functions is constrained – unlike a language like C#, for example, where a change like this would be, I suspect, more severe and difficult to manage. (Again, I am not part of the design process currently and so I haven't drilled down on this as deeply as I otherwise might.)

1. Candidate functions are limited by scope. Therefore, the extent of the effect of adding an f(String^) to an existing set of f() is limited to the scope in which it is introduced.

To be exhaustive, we would break this down into the possible scope scenarios – local function, independent class, class hierarchy, namespace, and global scope – and analyze the extent of impact within each. I have not done that, but I suspect that it is really only the global and namespace introductions that are potentially disruptive and burdensome.

Class hierarchies are not burdensome because names do not overload across base/derived class boundaries in C++ since they maintain their own scope. Namespaces are potentially burdensome because of using declarations, and the global namespace is burdensome just because it is the global namespace.

2. The introduction of an f(String^) within a class or class hierarchy is not truly burdensome in its potential introduction of an ambiguity. Introducing f(String^) at global scope, or within a heavily used namespace, may cause a cascade of ambiguities, but the real question then is a design question, imo. What is the purpose of introducing the f(String^) in a space in which const char* is still heavily used? Perhaps in migrating the existing native code base, we should exercise some design refactoring of the interface? What is the benefit of supporting both const char* and String in a CLI environment? And so on.
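The scope rules behind this argument can be sketched in ISO C++. Everything named here is hypothetical: ManagedString stands in for String^, and Base, Derived, legacy, and app are invented for illustration. Because the stand-in's conversion is user-defined rather than trivial, the merged overload set below resolves cleanly instead of reporting an ambiguity; the point is only to show which functions become candidates in each scope.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for String^; NOT a CLI type.
struct ManagedString {
    ManagedString(const char*) {}
};

struct Base {
    std::string f(const char*) { return "Base::f(const char*)"; }
};

struct Derived : Base {
    // Declaring f here hides Base::f; the two do not form one overload set.
    std::string f(const ManagedString&) { return "Derived::f(ManagedString)"; }
};

namespace legacy {
    std::string g(const char*) { return "legacy::g(const char*)"; }
}

namespace app {
    std::string g(const ManagedString&) { return "app::g(ManagedString)"; }
    using legacy::g;  // the using declaration adds legacy::g as a candidate

    // Unqualified call: both g overloads are now considered.
    std::string call_g() { return g("ABC"); }
}

void demo_scopes() {
    Derived d;
    // Lookup stops in Derived, so Base::f(const char*) is never a candidate.
    assert(d.f("ABC") == "Derived::f(ManagedString)");
    // The hidden base function is still reachable by qualification.
    assert(d.Base::f("ABC") == "Base::f(const char*)");
    // In the namespace case the overload sets merge; exact match wins here.
    assert(app::call_g() == "legacy::g(const char*)");
}
```

In the class case the new function cannot collide with the inherited one, while in the namespace case a using declaration puts both in the same candidate set, which is where an equally ranked String^ conversion would surface as an ambiguity.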

So, I think the spirit of the change is in the right direction. The solution isn't perfect, however, since this represents the only potentially truncating trivial conversion. This is what I mean when I say that imo it is not strictly ISO-C++ conforming: no other trivial conversion suffers a loss of precision; the conversions that do are all more costly. Of course, on the other hand, String^ is not ISO-C++, but I am not being that literal here. The problem is that if I place a wide character in the string literal, then while it exactly matches String^ in the abstract, in practice it is first parsed as a const char*, and so the second byte is discarded. (Thank you, Dave Waggoner, for pointing out that problem.)
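The kind of loss at stake can be illustrated, loosely, in ISO C++ by squeezing a 16-bit code unit into a single byte. This is an analogy under stated assumptions, not a description of what the C++/CLI compiler does internally; narrow_to_byte and demo_truncation are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Narrow a 16-bit code unit to one byte; only the low 8 bits survive.
inline std::uint8_t narrow_to_byte(char16_t unit) {
    return static_cast<std::uint8_t>(unit);
}

void demo_truncation() {
    const char16_t zhe = u'\u0416';  // CYRILLIC CAPITAL LETTER ZHE, 0x0416

    // The high byte (0x04) is discarded; only 0x16 remains.
    assert(narrow_to_byte(zhe) == 0x16);

    // A character that fits in one byte passes through unchanged.
    assert(narrow_to_byte(u'A') == 0x41);
}
```

A conversion that can drop information in this way is what makes the literal-to-String^ case unusual among conversions ranked as trivial.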
