Interning Strings and immutability


Managed strings are subject to ‘interning’. This is the process where the system notices that the same string is used in several places, so it can fold all the references to the same unique instance.

Interning happens two ways in the CLR.

 

  1. It happens when you explicitly call System.String.Intern(). Obviously the string returned from this service might be different from the one you pass in, since we might already have an intern’ed instance that has been handed out to the application.
  2. It happens automatically, when you load an assembly. All the string literals in the assembly are intern’ed. This is expensive and – in retrospect – may have been a mistake. In the future we might consider allowing individual assemblies to opt-in or opt-out. Note that it is always a mistake to rely on some other assembly to have implicitly intern’ed the strings it gives you. Through versioning, that other assembly might start composing a string rather than using a literal.
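The two routes above can be seen side by side in a minimal sketch. The class name `InternDemo` is hypothetical; the calls are the standard `String.Intern` and `Object.ReferenceEquals`. The sketch assumes the literal has been intern'ed by the runtime in the usual way.

```csharp
using System;

public static class InternDemo
{
    public static void Main()
    {
        // A literal: intern'ed automatically when the code that uses it is loaded.
        string literal = "hello world";

        // Built at run time from a char array, so it is a separate object
        // even though its contents match the literal.
        string composed = new string(literal.ToCharArray());

        // Explicit interning hands back the instance already in the intern table.
        string interned = String.Intern(composed);

        Console.WriteLine(Object.ReferenceEquals(literal, composed)); // False
        Console.WriteLine(Object.ReferenceEquals(literal, interned)); // True
    }
}
```

If a string arrives from another assembly, calling `String.Intern` on it yourself avoids depending on whether that assembly produced it from a literal.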

 

One thing that might not be immediately obvious is that we intern strings across all AppDomains. That’s because assemblies can be loaded as domain-neutral. When this happens, we execute the same code bytes at the same address in all AppDomains into which that assembly has been loaded. Since we can burn the addresses of string literals into our native code as immediate data, we clearly benefit from intern’ing across all AppDomains rather than using per-AppDomain indirections in the code. However, this approach does add overhead to intern’ing: we are forced to use per-AppDomain reference counts into a shared intern’ing table, so that we can unload intern’ed strings accurately when the last AppDomain using them is itself unloaded.

Normally, strings should be compared with String.Equals and similar mechanisms. Note that the String class defines operator== to be String.Equals. However, if two strings are both known to have been intern’ed, then they can be compared directly with a faster reference check. In other words, you could call Object.operator==() rather than String.operator==(). This is only recommended for highly performance-sensitive scenarios when you really know what you are doing.
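As a rough illustration of the difference between the two comparisons, here is a small sketch. `ReferenceCompareDemo` and `InternedEquals` are hypothetical names; the helper is only valid when both arguments are known to be intern'ed.

```csharp
using System;

public static class ReferenceCompareDemo
{
    // Hypothetical helper: correct only if both strings are known to be intern'ed.
    public static bool InternedEquals(string a, string b)
    {
        return Object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        string a = String.Intern(new string(new[] { 'k', 'e', 'y' }));
        string b = String.Intern(new string(new[] { 'k', 'e', 'y' }));
        string c = new string(new[] { 'k', 'e', 'y' }); // same contents, not intern'ed

        Console.WriteLine(a == b);                 // True: String.operator== compares contents
        Console.WriteLine((object)a == (object)b); // True: reference check, both intern'ed
        Console.WriteLine(a == c);                 // True: contents still match
        Console.WriteLine(InternedEquals(a, c));   // False: c is a different instance
    }
}
```

The last line is the hazard: a reference check silently reports "not equal" for a string that was never intern'ed, which is why this shortcut only suits code that controls every string involved.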

Of course, string intern’ing only works if strings are immutable. If they were mutable, then the sharing of strings that is implicit in intern’ing would corrupt all kinds of application assumptions – as we will see.

The good news is that strings are immutable… mostly. And they are immutable for many good reasons that have nothing to do with intern’ing. For example, immutable strings eliminate a whole host of multi-threaded race conditions where one thread uses a string while another thread mutates it. In some cases, those race conditions could be used to mount security attacks. For example, you could satisfy a FileIOPermission demand with a string pointing to an innocuous section of the file system, and then use another thread to quickly change the string to point to a sensitive file before the underlying CreateFile occurs.

So how can strings be mutated?

Well, you can certainly use C#’s ‘unsafe’ feature or equivalent unverifiable ILASM or Managed C++ code to write into a string’s buffer. In those cases, some highly trusted code is performing some clearly dirty operations. This case isn’t going to happen by accident.
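For illustration only, here is a minimal sketch of the C# ‘unsafe’ route. The names `UnsafeMutationDemo` and `Overwrite` are hypothetical, and the code must be compiled with /unsafe. It deliberately mutates a string allocated at run time rather than a literal, so that it does not corrupt an intern'ed instance shared with other code.

```csharp
using System;

public static class UnsafeMutationDemo
{
    // Overwrites every character of 'target' in place.
    // 'fixed' pins the string and exposes its character buffer as a char*.
    public static unsafe void Overwrite(string target, char c)
    {
        fixed (char* p = target)
        {
            for (int i = 0; i < target.Length; i++)
            {
                p[i] = c;
            }
        }
    }

    public static void Main()
    {
        string s = new string('a', 4); // allocated at run time, not a literal
        string alias = s;              // second reference to the same instance

        Overwrite(s, 't');

        Console.WriteLine(alias);      // tttt: the "immutable" string changed under its alias
    }
}
```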

A more serious concern comes with marshaling. Here’s a program that uses PInvoke to accidentally mutate a string. Since the string happens to have been intern’ed, it has the effect of changing a string literal in an unrelated part of the application. We pass ‘computerName’ to the PInvoke, but ‘otherString’ gets changed too!

 

using System;
using System.Runtime.InteropServices;

public class Class1
{
    static void Main(string[] args)
    {
        // Both literals are intern'ed, so both variables refer to the same instance.
        String computerName = "strings are always immutable";
        String otherString = "strings are always immutable";

        // The string is passed by value, and the unmanaged side writes into it.
        int len = computerName.Length;
        GetComputerName(computerName, ref len);

        // Prints the mutated contents, not the original literal.
        Console.WriteLine(otherString);
    }

    [DllImport("kernel32", CharSet=CharSet.Unicode)]
    static extern bool GetComputerName(
        [MarshalAs(UnmanagedType.LPWStr)] string name,
        ref int len);
}

And here’s the same program written to avoid this problem:

 

using System;
using System.Runtime.InteropServices;

public class Class1
{
    static void Main(string[] args)
    {
        String computerName = "strings are always immutable";
        String otherString = "strings are always immutable";

        // Passed by ref: a different string is propagated back into computerName.
        int len = computerName.Length;
        GetComputerName(ref computerName, ref len);

        // The intern'ed literal is untouched.
        Console.WriteLine(otherString);
    }

    [DllImport("kernel32", CharSet=CharSet.Unicode)]
    static extern bool GetComputerName(
        [MarshalAs(UnmanagedType.VBByRefStr)]
        ref string name,
        ref int len);
}

In this second case, VBByRefStr is used for the marshaling directive. The argument is treated as ‘byref’ on the managed side, but remains ‘byval’ on the unmanaged side. If the unmanaged side scribbles into the buffer, it won’t pollute the managed string, which remains immutable. Instead, a different string is back-propagated to the managed side, thereby preserving managed string immutability.

If you are coding in VB, you can pretend that the VBByRefStr is actually byval on the managed side. The compiler works its magic on your behalf, so you don’t actually realize that you now have a different string. C# works no such magic, so I had to explicitly add the ‘ref’ keyword in all the right places.

If you’re like me, you probably find all the marshaling directives bewildering. I can’t recommend Adam Nathan’s book enough. It is “.NET and COM – The Complete Interoperability Guide”. It truly is the bible for interop.

Nevertheless, even with the book it’s easy to make a lot of mistakes. There’s a feature in the new CLR release called Customer Debug Probes. It makes finding certain kinds of bugs much easier. Fortunately for all of us, it’s particularly geared to finding bugs with marshaling and other Interop issues.

Comments (22)

  1. Oisin says:

    Chris, again you make it appear easy to just babble on coherently about a difficult topic — your words poke a finger in the eye of complexity. If I tried to explain this to someone in as many words, it would be meaningless. Keep it up!

    • Oisin
  2. Chris Brumme says:

    The CLR Interop team agreed with me: I do find the marshaling directives bewildering! They pointed out that I should have mentioned that StringBuilder is the preferred way to call unmanaged methods that mutate strings. (Be sure to set the capacity of the StringBuilder high enough to accept the text that will be placed in it).

    The VBByRefStr technique is really there for VB compatibility reasons.
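A minimal sketch of the StringBuilder approach mentioned in this comment, applied to the same GetComputerName call. It is Windows-only. The names `ComputerNameDemo`, `ReadComputerName` and `MaxComputerNameLength` are hypothetical; the value 15 mirrors MAX_COMPUTERNAME_LENGTH from the Windows headers.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using System.Text;

public static class ComputerNameDemo
{
    // Mirrors MAX_COMPUTERNAME_LENGTH from the Windows headers.
    private const int MaxComputerNameLength = 15;

    // StringBuilder marshals as a writable buffer owned by the builder,
    // so no managed string instance is written to by the unmanaged side.
    [DllImport("kernel32", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool GetComputerName(StringBuilder name, ref int len);

    public static string ReadComputerName()
    {
        // Capacity must be large enough for the name plus its terminator.
        int len = MaxComputerNameLength + 1;
        StringBuilder buffer = new StringBuilder(len);

        if (!GetComputerName(buffer, ref len))
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        string otherString = "strings are always immutable";

        Console.WriteLine(ReadComputerName());
        Console.WriteLine(otherString); // the literal is unaffected
    }
}
```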

  3. Payo says:

    First of all I'd like to congratulate you on your weblog; it's really impressive.
    And now the question: when you talk about mutating strings and say:
    "Well, you can certainly use C#’s ‘unsafe’ feature or equivalent unverifiable ILASM or Managed C++ code to write into a string’s buffer."
    could you give me an example of this with C# and pointers?
    I know how to do it with MSIL, but in C# I don't know how to make a pointer point to the memory address pointed to by an object reference (a string in this case).
    Maybe this is a stupid question, but I would really appreciate your answer.
    Thanks a lot for sharing your knowledge with us.

  4. Chris Brumme says:

    Payo,

    With strings, the trick is the m_firstChar field, which gives you addressability over the string buffer.

    Here’s a lame example, that probably writes the first character of strIn throughout a returned string of the same size. (I haven’t tested it). It shows how you can use C-style dereferencing via ‘*’ or via array syntax.

        unsafe static string SlamFirst(string strIn) {
            // m_firstChar and FastAllocateString are non-public members of System.String.
            int length = strIn.Length;
            String strOut = FastAllocateString(length);
            fixed (char * inBuff = &strIn.m_firstChar, outBuff = &strOut.m_firstChar) {
                char c = *inBuff;
                for(int i = 0; i < length; i++) {
                    outBuff[i] = c;
                }
            }
            return strOut;
        }
    

    With normal structs and classes, use explicit or sequential layout so that the C# compiler can predict what the CLR’s loader will decide with respect to layout decisions.

    When using constructs like this, be sure to mark your code as ‘unsafe’ and compile it with /unsafe.

  5. Payo says:

    Thanks a lot for your answer.
    I'd just found one way to make it work:
    // stupid sample code
    fixed (char* s1p = s1)
    {
        for (int i = 0; i < 4; i++)
        {
            *(s1p + i) = 't';
        }
    }
    At first I was quite confused by all this because I thought the fixed statement was something "optional" (if you were willing to have problems with the GC and object relocation), and when I tried to compile the code above without the fixed statement, csc gave me this not-too-clear error:
    "Cannot implicitly convert type 'string' to 'char*'"
    Using the fixed statement (so it looks like it's more or less compulsory) there's no compiler error.

    There was a small problem with the code you submitted. I guess you are one of the developers of the String class, so you are used to working with private fields and methods like m_firstChar and FastAllocateString, but I'm in the "outer world", so I don't have access to those private fields and methods 🙂

  6. David Goldstein says:

    Chris,

    There are times where if there were a safe place to have all your strings interned, you can write far more efficient code…

    In Object Spaces, for example, they seem to assume that you can compare strings using == many thousands of times per second at the cost of integer compares…

    for my own O/R mapping projects, it’s either use == for strings, find some painful way of avoiding strings altogether (source code generation) or generate dynamic assemblies (yes, this is the ideal).

  7. Chris Brumme says:

    If you control the strings, you can explicitly intern them before proceeding with any comparisons (String.Intern, String.IsInterned). Rather than going through operator== (which performs String comparisons on objects that are statically typed as String), you can use Object.ReferenceEquals (which just compares the two references).

  8. Say you have a domainName hash table, and users send domainName queries to it (e.g. DNS). You intern all ownerNames in all RRs. Now you get a query with ownerName "www.test.com". Should you intern that string and compare it with your hash of ownerNames? That would seem fast, but wouldn't the intern pool grow out of control, since you're adding strings to the pool even if the ownerName did not exist? Do strings ever expire out of the intern pool if there are no references to them? TIA

  9. Chris Brumme says:

    If you Intern a string in an AppDomain, the Intern’ed string will remain until the AppDomain is unloaded. So it’s reasonable for you to Intern all the ownerNames, but it wouldn’t be reasonable for you to Intern random test strings to see if you have a match.

    Instead, use the String.IsInterned method on the random test string. If it isn’t already interned, there’s no point in trying to match it. If it is interned already, then you aren’t growing the set of Interned strings and you can now proceed with an efficient comparison against your Interned ownerNames.
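A rough sketch of that lookup pattern follows. `OwnerNameLookup`, `ownerNames` and `Contains` are hypothetical names, and the linear scan stands in for whatever table the application really uses.

```csharp
using System;

public static class OwnerNameLookup
{
    // Hypothetical setup: the known owner names are intern'ed once, up front.
    private static readonly string[] ownerNames =
    {
        String.Intern("www.example.com"),
        String.Intern("mail.example.com"),
    };

    public static bool Contains(string query)
    {
        // IsInterned never adds to the intern table; it returns null when
        // the string has not been intern'ed by anyone.
        string interned = String.IsInterned(query);
        if (interned == null)
        {
            return false; // cannot match any intern'ed owner name
        }

        // Some other code may have intern'ed this string, so membership
        // still has to be checked, but a reference comparison is enough.
        foreach (string name in ownerNames)
        {
            if (Object.ReferenceEquals(name, interned))
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        // Both queries are built at run time, as they would be when read off the wire.
        string hit = new string("www.example.com".ToCharArray());
        string miss = "unknown-" + DateTime.Now.Ticks + ".example.com";

        Console.WriteLine(Contains(hit));  // True
        Console.WriteLine(Contains(miss)); // False
    }
}
```

Because the miss path calls only `String.IsInterned`, random queries never grow the intern table.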

