Interning Strings and immutability


Managed strings are subject to ‘interning’. This is the process where the system
notices that the same string is used in several places, so it can fold all the
references to the same unique instance.


Interning happens two ways in the CLR.


1) It happens when you explicitly call System.String.Intern(). Obviously the
string returned from this service might be different from the one you pass in,
since we might already have an intern’ed instance that has been handed out to
the application.


2) It happens automatically, when you load an assembly. All the string literals
in the assembly are intern’ed. This is expensive and, in retrospect, may have
been a mistake. In the future we might consider allowing individual assemblies
to opt-in or opt-out. Note that it is always a mistake to rely on some other
assembly to have implicitly intern’ed the strings it gives you. Through
versioning, that other assembly might start composing a string rather than
using a literal.
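
As a minimal sketch of the explicit route in 1), the program below composes a string at run time and hands it to String.Intern. The class name InternDemo and the helper Canonical are made up for this illustration. Whether the last line reports the literal and the intern’ed result as the same instance depends on the runtime having intern’ed that literal, so no output is claimed for the last two lines.

```csharp
using System;

public static class InternDemo
{
    // Hypothetical helper: returns the canonical intern'ed instance for a string.
    public static string Canonical(string s)
    {
        return String.Intern(s);
    }

    public static void Main()
    {
        string literal = "hello world";

        // Composed at run time, so it is a separate instance from any literal.
        string composed = new string("hello world".ToCharArray());

        Console.WriteLine(literal == composed);                       // True: same contents
        Console.WriteLine(Object.ReferenceEquals(literal, composed)); // False: distinct instances

        // Intern may hand back a different instance from the one passed in.
        string interned = Canonical(composed);
        Console.WriteLine(Object.ReferenceEquals(composed, interned));
        Console.WriteLine(Object.ReferenceEquals(literal, interned));
    }
}
```

Two separately composed strings with the same contents always map to one instance once both have gone through String.Intern, which is the property the rest of this post relies on.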


One thing that might not be immediately obvious is that we intern strings across
all AppDomains. That’s because assemblies can be loaded as domain-neutral. When
this happens, we execute the same code bytes at the same address in all
AppDomains into which that assembly has been loaded. Since we can burn the
addresses of string literals into our native code as immediate data, we clearly
benefit from intern’ing across all AppDomains rather than using per-AppDomain
indirections in the code. However, this approach does add overhead to
intern’ing: we are forced to use per-AppDomain reference counts into a shared
intern’ing table, so that we can unload intern’ed strings accurately when the
last AppDomain using them is itself unloaded.


Normally, strings should be compared with String.Equals and similar mechanisms.
Note that the String class defines operator== to be String.Equals. However, if
two strings are both known to have been intern’ed, then they can be compared
directly with a faster reference check. In other words, you could call
Object.operator==() rather than String.operator==(). This is only recommended
for highly performance-sensitive scenarios when you really know what you are
doing.
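
Here is a rough sketch of that reference check, assuming both operands really did come from String.Intern. The names InternedCompare and FastEquals are invented for the example. Casting to object selects the reference comparison instead of String.operator==.

```csharp
using System;

public static class InternedCompare
{
    // Reference check. Only a valid equality test when BOTH arguments
    // are known to be intern'ed instances.
    public static bool FastEquals(string internedA, string internedB)
    {
        return (object)internedA == (object)internedB;
    }

    public static void Main()
    {
        string a = String.Intern(new string("contoso".ToCharArray()));
        string b = String.Intern(new string("contoso".ToCharArray()));
        string c = new string("contoso".ToCharArray()); // same contents, never intern'ed

        Console.WriteLine(a == b);           // True: String.operator== compares contents
        Console.WriteLine(FastEquals(a, b)); // True: both are the one intern'ed instance
        Console.WriteLine(a == c);           // True: contents still match
        Console.WriteLine(FastEquals(a, c)); // False: c is a different instance
    }
}
```

The last line is the hazard: if either side has not been intern’ed, the reference check reports a mismatch for strings whose contents are equal.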


Of course, string intern’ing only works if strings are immutable. If they were
mutable, then the sharing of strings that is implicit in intern’ing would
corrupt all kinds of application assumptions, as we will see.


The good news is that strings are immutable… mostly. And they are immutable for
many good reasons that have nothing to do with intern’ing. For example,
immutable strings eliminate a whole host of multi-threaded race conditions where
one thread uses a string while another thread mutates it. In some cases, those
race conditions could be used to mount security attacks. For example, you could
satisfy a FileIOPermission demand with a string pointing to an innocuous section
of the file system, and then use another thread to quickly change the string to
point to a sensitive file before the underlying CreateFile occurs.


So how can strings be mutated?


Well, you can certainly use C#’s ‘unsafe’ feature or equivalent unverifiable
ILASM or Managed C++ code to write into a string’s buffer. In those cases, some
highly trusted code is performing some clearly dirty operations. This case isn’t
going to happen by accident.


A more serious concern comes with marshaling. Here’s a program that uses PInvoke
to accidentally mutate a string. Since the string happens to have been
intern’ed, it has the effect of changing a string literal in an unrelated part
of the application. We pass ‘computerName’ to the PInvoke, but ‘otherString’
gets changed too!


    using System;
    using System.Runtime.InteropServices;

    public class Class1
    {
        static void Main(string[] args)
        {
            // Two identical literals, so both variables refer to the same
            // intern'ed instance.
            String computerName = "strings are always immutable";
            String otherString = "strings are always immutable";

            int len = computerName.Length;
            // The unmanaged side writes into the buffer of the managed string.
            GetComputerName(computerName, ref len);

            // Shows that otherString was changed too.
            Console.WriteLine(otherString);
        }

        [DllImport("kernel32", CharSet=CharSet.Unicode)]
        static extern bool GetComputerName(
            [MarshalAs(UnmanagedType.LPWStr)] string name,
            ref int len);
    }


And here’s the same program written to avoid this problem:


    using System;
    using System.Runtime.InteropServices;

    public class Class1
    {
        static void Main(string[] args)
        {
            String computerName = "strings are always immutable";
            String otherString = "strings are always immutable";

            int len = computerName.Length;
            // Passed by ref: computerName is replaced with a different string
            // rather than being written into.
            GetComputerName(ref computerName, ref len);

            // otherString still refers to the untouched literal.
            Console.WriteLine(otherString);
        }

        [DllImport("kernel32", CharSet=CharSet.Unicode)]
        static extern bool GetComputerName(
            [MarshalAs(UnmanagedType.VBByRefStr)]
            ref string name,
            ref int len);
    }


In this second case, VBByRefStr is used for the marshaling directive. The
argument is treated as ‘byref’ on the managed side, but remains ‘byval’ on the
unmanaged side. If the unmanaged side scribbles into the buffer, it won’t
pollute the managed string, which remains immutable. Instead, a different string
is back-propagated to the managed side, thereby preserving managed string
immutability.


If you are coding in VB, you can pretend that the VBByRefStr is actually byval
on the managed side. The compiler works its magic on your behalf, so you don’t
actually realize that you now have a different string. C# works no such magic,
so I had to explicitly add the ‘ref’ keyword in all the right places.


If you’re like me, you probably find all the marshaling directives bewildering.
I can’t recommend Adam Nathan’s book enough. It is “.NET and COM: The Complete
Interoperability Guide”. It truly is the bible for interop.


Nevertheless, even with the book it’s easy to make a lot of mistakes. There’s a
feature in the new CLR release called Customer Debug Probes. It makes finding
certain kinds of bugs much easier. Fortunately for all of us, it’s particularly
geared to finding bugs with marshaling and other Interop issues.



Comments (22)

  1. Anonymous says:

    Chris, again you make it appear easy to just babble on coherently about a difficult topic — your words poke a finger in the eye of complexity. If I tried to explain this to someone in as many words, it would be meaningless. Keep it up!

    • Oisin
  2. Anonymous says:

    The CLR Interop team agreed with me: I do find the marshaling directives bewildering! They pointed out that I should have mentioned that StringBuilder is the preferred way to call unmanaged methods that mutate strings. (Be sure to set the capacity of the StringBuilder high enough to accept the text that will be placed in it).

    The VBByRefStr technique is really there for VB compatibility reasons.
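
    A minimal sketch of the StringBuilder approach might look like the following. The class and method names (ComputerNameDemo, ReadComputerName) are invented here, and the capacity of 256 is only an assumed "comfortably large" value, not a documented limit. It is Windows-only, since it calls into kernel32.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using System.Text;

public static class ComputerNameDemo
{
    // StringBuilder marshals as a writable buffer, so the unmanaged side
    // never touches a managed String instance.
    [DllImport("kernel32", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool GetComputerName(StringBuilder name, ref int len);

    // Hypothetical wrapper for this example.
    public static string ReadComputerName()
    {
        // Assumed capacity; it must be large enough for the text written into it.
        StringBuilder buffer = new StringBuilder(256);
        int len = buffer.Capacity;

        if (!GetComputerName(buffer, ref len))
            throw new Win32Exception(Marshal.GetLastWin32Error());

        return buffer.ToString();
    }

    public static void Main()
    {
        string otherString = "strings are always immutable";

        Console.WriteLine(ReadComputerName());
        Console.WriteLine(otherString); // the literal is left alone
    }
}
```

    Compared with the VBByRefStr version, no ‘ref’ is needed on the string argument, and the only thing the unmanaged code can scribble on is the StringBuilder’s own buffer.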

  3. Anonymous says:

    First of all I’d like to congratulate you on your weblog, it’s really impressive.
    And now the question: when you talk about mutating strings and say:
    "Well, you can certainly use C#’s ‘unsafe’ feature or equivalent unverifiable ILASM or Managed C++ code to write into a string’s buffer."
    could you give me an example of this with C# and pointers?
    I know how to do it with MSIL, but in C# I don’t know how to make a pointer point to the memory address pointed to by an object reference (a string in this case).
    Maybe this is a stupid question, but I would really appreciate your answer.
    Thanks a lot for sharing your knowledge with us.

  4. Anonymous says:

    Payo,

    With strings, the trick is the m_firstChar field, which gives you addressability over the string buffer.

    Here’s a lame example, that probably writes the first character of strIn throughout a returned string of the same size. (I haven’t tested it). It shows how you can use C-style dereferencing via ‘*’ or via array syntax.

       unsafe static string SlamFirst(String strIn) {
            int length = strIn.Length;
            // FastAllocateString and m_firstChar are internal to System.String.
            String strOut = FastAllocateString(length);
            fixed (char * inBuff = &strIn.m_firstChar, outBuff = &strOut.m_firstChar) {
                char c = *inBuff;
                for(int i = 0; i < length; i++) {
                    outBuff[i] = c;
                }
            }
            return strOut;
        }
    

    With normal structs and classes, use explicit or sequential layout so that the C# compiler can predict what the CLR’s loader will decide with respect to layout decisions.

    When using constructs like this, be sure to mark your code as ‘unsafe’ and compile it with /unsafe.

  5. Anonymous says:

    Thanks a lot for your answer.
    I’d just found one way to make it work:
    // stupid sample code
    fixed(char* s1p=s1)
    {
        for(int i=0;i<4;i++)
        {
            *(s1p+i)='t';
        }
    }
    At first I was quite confused with all this stuff because I thought the fixed statement was something "optional" (if you were willing to have problems with the GC and object reallocation), and when I tried to compile the code above without the fixed statement the csc gave me this not too clear error:
    "cannot implicitly convert type string to char*"
    Using the fixed statement (so it looks like it’s more or less compulsory) there’s no compiler error.

    There was a small problem with the code you submitted. I guess you are one of the developers of the string class, so you are used to working with private fields and methods like m_firstChar and FastAllocateString, but I’m in the "outer world", so I don’t have access to those private fields and methods 🙂

  6. Anonymous says:

    Chris,

    There are times where if there were a safe place to have all your strings interned, you can write far more efficient code…

    like in Object Spaces, they seem to assume that you can compare strings using == and do this many thousands of times per second as the cost of integer compares…

    for my own O/R mapping projects, it’s either use == for strings, find some painful way of avoiding strings altogether (source code generation) or generate dynamic assemblies (yes, this is the ideal).

  7. Anonymous says:

    If you control the strings, you can explicitly intern them before proceeding with any comparisons (String.Intern, String.IsInterned). Rather than going through operator== (which performs String comparisons on objects that are statically typed as String), you can use Object.ReferenceEquals (which just compares the two references).

  8. Anonymous says:

    Say you have a domainName hash table, and users send domainName queries to it (e.g. DNS). You intern all ownerNames in all RRs. Now you get a query with ownerName "www.test.com". Should you intern that string and compare with your hash of ownerNames? That would seem fast, but would the intern pool not grow out of control, as you’re adding strings to the pool even if the ownerName did not exist? Do strings ever expire out of the intern pool if there are no refs to them? TIA

  9. Anonymous says:

    If you Intern a string in an AppDomain, the Intern’ed string will remain until the AppDomain is unloaded. So it’s reasonable for you to Intern all the ownerNames, but it wouldn’t be reasonable for you to Intern random test strings to see if you have a match.

    Instead, use the String.IsInterned method on the random test string. If it isn’t already interned, there’s no point in trying to match it. If it is interned already, then you aren’t growing the set of Interned strings and you can now proceed with an efficient comparison against your Interned ownerNames.
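
    That lookup could be sketched roughly as below. OwnerNameTable, Add and Contains are hypothetical names, and the linear scan is only there to keep the example short; a real table would presumably hash on the reference.

```csharp
using System;
using System.Collections.Generic;

public static class OwnerNameTable
{
    static readonly List<string> names = new List<string>();

    // Owner names are a bounded set, so it is reasonable to intern them.
    public static void Add(string ownerName)
    {
        names.Add(String.Intern(ownerName));
    }

    public static bool Contains(string query)
    {
        // IsInterned never adds to the intern table; it returns null
        // when the string has not been intern'ed.
        string interned = String.IsInterned(query);
        if (interned == null)
            return false;

        // Other strings in the process may be intern'ed too, so we still
        // check membership, but with cheap reference comparisons.
        foreach (string name in names)
        {
            if (Object.ReferenceEquals(name, interned))
                return true;
        }
        return false;
    }

    public static void Main()
    {
        Add("www.test.com");

        // Queries are composed at run time, as they would be off the wire.
        Console.WriteLine(Contains(new string("www.test.com".ToCharArray())));     // True
        Console.WriteLine(Contains(String.Concat("www.", "missing", ".example"))); // False
    }
}
```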
